OmniData Connector

Overview

The OmniData connector allows querying data stored in a remote Hive data warehouse. It pushes openLooKeng operators down to the storage nodes so that computation happens near the data, which reduces the amount of data transferred over the network and improves query performance.

For more information, see OmniData and OmniData connector.

Supported File Types

The following file types are supported for the OmniData connector:

  • ORC
  • Parquet
  • Text

Configuration

Create etc/catalog/omnidata.properties with the following configurations, replacing example.net:9083 with the correct host and port for your Hive metastore Thrift service:

connector.name=omnidata-openlookeng
hive.metastore.uri=thrift://example.net:9083

HDFS Configuration

For basic setups, openLooKeng configures the HDFS client automatically and does not require any configuration files. In some cases, such as when using federated HDFS or NameNode high availability, it is necessary to specify additional HDFS client options in order to access your HDFS cluster. To do so, add the hive.config.resources property to reference your HDFS config files:

hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml

Only specify additional configuration files if necessary for your setup. We also recommend reducing the configuration files to have the minimum set of required properties, as additional properties may cause problems.

The configuration files must exist on all openLooKeng nodes. If you are referencing existing Hadoop config files, make sure to copy them to any openLooKeng nodes that are not running Hadoop.
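
Because every node needs its own copy of these files, a quick per-node check can catch a missing copy before queries fail. The following is a minimal sketch, not part of openLooKeng: the helper names `parse_properties` and `missing_config_resources` are hypothetical, and the catalog path is assumed to be relative to the openLooKeng installation directory.

```python
from pathlib import Path


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_config_resources(props):
    """Return the hive.config.resources paths that are not files on this node."""
    resources = props.get("hive.config.resources", "")
    paths = [p.strip() for p in resources.split(",") if p.strip()]
    return [p for p in paths if not Path(p).is_file()]


if __name__ == "__main__":
    # Assumed location: run from the openLooKeng installation directory.
    catalog = Path("etc/catalog/omnidata.properties")
    if catalog.is_file():
        for path in missing_config_resources(parse_properties(catalog.read_text())):
            print(f"missing HDFS config file: {path}")
    else:
        print(f"catalog file not found: {catalog}")
```

Running the script on each openLooKeng node lists any referenced HDFS configuration file that has not been copied to that node.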

OmniData Configuration Properties

| Property Name | Description | Default |
|---|---|---|
| hive.metastore | The type of Hive metastore. | thrift |
| hive.config.resources | An optional comma-separated list of HDFS configuration files. These files must exist on the machines running openLooKeng. Only specify this if absolutely necessary to access HDFS. Example: /etc/hdfs-site.xml | |
| hive.omnidata-enabled | Allows operators to be pushed down and executed on the storage side. If disabled, no operators are pushed down. | true |
| hive.min-offload-row-number | If the table has fewer rows than this threshold, none of its operators are pushed down. | 500 |
| hive.filter-offload-enabled | Allows the filter operator to be pushed down to the storage side. If disabled, the filter operator is not pushed down. | true |
| hive.filter-offload-factor | The filter operator is pushed down only when its selection rate is less than this threshold. | 0.25 |
| hive.aggregator-offload-enabled | Allows the aggregator operator to be pushed down to the storage side. If disabled, the aggregator operator is not pushed down. | true |
| hive.aggregator-offload-factor | The aggregator operator is pushed down only when its aggregation rate is less than this threshold. | 0.25 |

For more configuration properties, refer to the [Hive Configuration Properties](./hive.html#Hive Configuration Properties) chapter.
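
To make the interaction between the switches and thresholds above easier to follow, here is a minimal sketch of the decision they describe. It is an illustration only, not the connector's actual implementation: the class and function names are hypothetical, "selection rate" is assumed to mean the fraction of rows that pass the filter, and "aggregation rate" is assumed to mean the ratio of output rows to input rows.

```python
from dataclasses import dataclass


@dataclass
class OffloadConfig:
    # Defaults copied from the property table above.
    omnidata_enabled: bool = True            # hive.omnidata-enabled
    min_offload_row_number: int = 500        # hive.min-offload-row-number
    filter_offload_enabled: bool = True      # hive.filter-offload-enabled
    filter_offload_factor: float = 0.25      # hive.filter-offload-factor
    aggregator_offload_enabled: bool = True  # hive.aggregator-offload-enabled
    aggregator_offload_factor: float = 0.25  # hive.aggregator-offload-factor


def _table_eligible(cfg, table_rows):
    """Global gate: push-down must be enabled and the table large enough."""
    return cfg.omnidata_enabled and table_rows >= cfg.min_offload_row_number


def should_offload_filter(cfg, table_rows, selection_rate):
    """Assumed rule: push down only if the selection rate is below the factor."""
    return (_table_eligible(cfg, table_rows)
            and cfg.filter_offload_enabled
            and selection_rate < cfg.filter_offload_factor)


def should_offload_aggregator(cfg, table_rows, aggregation_rate):
    """Assumed rule: push down only if the aggregation rate is below the factor."""
    return (_table_eligible(cfg, table_rows)
            and cfg.aggregator_offload_enabled
            and aggregation_rate < cfg.aggregator_offload_factor)
```

Under these assumptions, a filter that keeps 10% of a 1000-row table would be pushed down with the default settings, while the same filter on a 100-row table would not, because the table falls below the row threshold.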

Querying OmniData

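The following EXPLAIN output shows a query plan in which the filter has been pushed down to the storage side. The pushed-down predicate appears in the offload clause of the ScanProject node: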

lk:tpch_flat_orc_date_1000> explain select sum(l_extendedprice * l_discount) as revenue
                         -> from
                         -> lineitem
                         -> where
                         -> l_shipdate >= DATE '1993-01-01'
                         -> and l_shipdate < DATE '1994-01-01'
                         -> and l_discount between 0.06 - 0.01 and 0.06 + 0.01
                         -> and l_quantity < 25;
                                             Query Plan
------------------------------------------------------------------------------------------------------
Output[revenue]
    Layout: [sum:double]
    Estimates: {rows: 4859991664 (40.74GB), cpu: 246.43G, memory: 86.00GB, network: 45.26GB}
    revenue := sum
└─ Aggregate(FINAL)
       Layout: [sum:double]
       Estimates: {rows: 4859991664 (40.74GB), cpu: 246.43G, memory: 86.00GB, network: 45.26GB}
       sum := sum(sum_4)
   └─ LocalExchange[SINGLE] ()
          Layout: [sum_4:double]
          Estimates: {rows: 5399990738 (45.26GB), cpu: 201.17G, memory: 45.26GB, network: 45.26GB}
      └─ RemoteExchange[GATHER]
             Layout: [sum_4:double]
             Estimates: {rows: 5399990738 (45.26GB), cpu: 201.17G, memory: 45.26GB, network: 45.26GB}
         └─ Aggregate(PARTIAL)
                Layout: [sum_4:double]
                Estimates: {rows: 5399990738 (45.26GB), cpu: 201.17G, memory: 45.26GB, network: 0B}
                sum_4 := sum(expr)
            └─ ScanProject[table = hive:tpch_flat_orc_date_1000:lineitem offload={ filter=[AND(AND(BETWEEN(l_discount, 0.05, 0.07), LESS_THAN(l_quantity, 25.0)), AND(GREATER_THAN_OR_EQUAL(l_shipdate, 8401), LESS_THAN(l_shipdate, 8766)))]} ]
                   Layout: [expr:double]
                   Estimates: {rows: 5999989709 (50.29GB), cpu: 100.58G, memory: 0B, network: 0B}/{rows: 5999989709 (50.29GB), cpu: 150.87G, memory: 0B, network: 0B}
                   expr := (l_extendedprice) * (l_discount)
                   l_extendedprice := l_extendedprice:double:5:REGULAR
                   l_discount := l_discount:double:6:REGULAR
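
The pushed-down filter compares l_shipdate against the integers 8401 and 8766 rather than the DATE literals from the query. These values are consistent with dates encoded as the number of days since 1970-01-01, which the short sketch below checks. The helper names are hypothetical and only illustrate the arithmetic:

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def epoch_days_to_date(days):
    """Convert a day count since 1970-01-01 into a calendar date."""
    return EPOCH + timedelta(days=days)


def date_to_epoch_days(d):
    """Convert a calendar date into a day count since 1970-01-01."""
    return (d - EPOCH).days


print(epoch_days_to_date(8401))  # 1993-01-01
print(epoch_days_to_date(8766))  # 1994-01-01
```

Likewise, the bounds 0.05 and 0.07 in the BETWEEN predicate are the constant-folded results of 0.06 - 0.01 and 0.06 + 0.01 from the original query.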

OmniData Connector Limitations

  • The OmniData service must be deployed on the storage nodes.
  • Only the Filter, Aggregator, and Limit operators can be pushed down.
