ORC Cache

The ORC Cache feature improves query performance by caching frequently accessed data. It reduces the time spent on the TableScan operation because network IO is reduced, which in turn reduces query latency.

This feature is most beneficial for caching raw data from tables that are frequently accessed and not co-located with the openLooKeng deployment. If enabled, workers automatically cache the file tail, stripe footer, row index, and bloom index of all ORC files because they are small. However, row group data tends to be huge, and caching row group data for all files is not practically feasible because of the limited cache size.

See the ORC specification for details on the ORC file format.

The CACHE TABLE SQL command can be used to configure the table and partitions for which row data should be cached by the workers.

The following sections briefly explain how the row data cache implementation works.

SplitCacheMap

Users can use the CACHE TABLE SQL statement to configure which tables and data must be cached by the Hive connector. The partitions to cache are defined as predicates and are stored in SplitCacheMap. SplitCacheMap is stored in the local memory of the coordinator.

Sample query to cache sales table data for days between 2020-01-04 and 2020-01-11:

CACHE TABLE hive.default.sales WHERE sales_date BETWEEN date '2020-01-04' AND date '2020-01-11'

See the CACHE TABLE, SHOW CACHE, and DROP CACHE commands for more information.

SplitCacheMap stores two kinds of information:

  1. The table name along with the predicates provided via the CACHE TABLE command.
  2. The split-to-worker mapping.
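The two kinds of state listed above can be pictured with a small in-memory model. This is only an illustrative sketch in Python, not the openLooKeng implementation: the class name, method names, and the string split identifiers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SplitCacheMapSketch:
    """Toy, coordinator-local model of the state described above (hypothetical)."""

    # Fully qualified table name -> predicate strings registered via CACHE TABLE.
    predicates: Dict[str, List[str]] = field(default_factory=dict)
    # Split identifier -> worker the split was scheduled on.
    split_to_worker: Dict[str, str] = field(default_factory=dict)

    def add_cache_predicate(self, table: str, predicate: str) -> None:
        # Remember which data of the table the user asked to cache.
        self.predicates.setdefault(table, []).append(predicate)

    def record_split(self, split_id: str, worker: str) -> None:
        # Remember where a split was scheduled.
        self.split_to_worker[split_id] = worker

    def worker_for(self, split_id: str) -> Optional[str]:
        # None means the split has not been scheduled before.
        return self.split_to_worker.get(split_id)


cache_map = SplitCacheMapSketch()
cache_map.add_cache_predicate(
    "hive.default.sales",
    "sales_date BETWEEN date '2020-01-04' AND date '2020-01-11'",
)
cache_map.record_split("sales/part-0.orc:0", "worker-1")
```

Because this state lives only in the coordinator's memory, a model like this would start empty after a coordinator restart.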

Connector

When caching is enabled and a predicate is provided through the CACHE TABLE SQL command, the connector flags a HiveSplit as cacheable if the corresponding partitioned ORC file matches the predicate.
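The flagging step can be sketched as follows. This is a simplified Python model under stated assumptions: `SplitSketch`, `flag_cacheable`, and `sales_predicate` are hypothetical names, and predicates are represented as plain callables over partition values rather than parsed SQL expressions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, List

# Hypothetical stand-in for a predicate registered via CACHE TABLE.
Predicate = Callable[[Dict[str, object]], bool]


@dataclass
class SplitSketch:
    path: str
    partition: Dict[str, object]  # partition column -> value
    cacheable: bool = False


def flag_cacheable(
    splits: List[SplitSketch], predicates: List[Predicate], caching_enabled: bool
) -> List[SplitSketch]:
    """Mark a split cacheable when caching is on and any predicate matches its partition."""
    for split in splits:
        split.cacheable = caching_enabled and any(p(split.partition) for p in predicates)
    return splits


def sales_predicate(partition: Dict[str, object]) -> bool:
    # Mirrors: sales_date BETWEEN date '2020-01-04' AND date '2020-01-11'
    d = partition.get("sales_date")
    return isinstance(d, date) and date(2020, 1, 4) <= d <= date(2020, 1, 11)


splits = flag_cacheable(
    [
        SplitSketch("sales/sales_date=2020-01-05/part-0.orc", {"sales_date": date(2020, 1, 5)}),
        SplitSketch("sales/sales_date=2020-02-01/part-0.orc", {"sales_date": date(2020, 2, 1)}),
    ],
    [sales_predicate],
    caching_enabled=True,
)
```

In this sketch only the first split falls inside the cached date range, so only that split is flagged.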

SplitCacheAwareNodeSelector

SplitCacheAwareNodeSelector is implemented to support cache affinity scheduling. Like any other node selector, it is responsible for assigning splits to workers. When a split is scheduled for the first time, the node selector stores the split and the worker on which the split was scheduled. For subsequent scheduling, this information is used to determine whether the split has already been processed by a worker. If so, the node selector schedules the split on the worker that previously processed it. If not, SplitCacheAwareNodeSelector falls back to the default node selector to schedule the split. Workers that process the splits cache the data mapped by the split in local memory.
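The affinity logic described above can be sketched as a small Python class. All names here are hypothetical, the round-robin fallback merely stands in for the default node selector, and two behaviors are assumptions not stated in this document: non-cacheable splits bypass the affinity map, and a remembered worker that is no longer in the worker list is ignored.

```python
import itertools
from typing import Callable, Dict, List

Fallback = Callable[[str, List[str]], str]


def make_round_robin() -> Fallback:
    """Hypothetical stand-in for the default node selector."""
    counter = itertools.count()

    def pick(split_id: str, workers: List[str]) -> str:
        return workers[next(counter) % len(workers)]

    return pick


class CacheAffinitySelectorSketch:
    def __init__(self, workers: List[str], fallback: Fallback):
        self.workers = workers
        self.fallback = fallback
        self.split_to_worker: Dict[str, str] = {}

    def select(self, split_id: str, cacheable: bool) -> str:
        if not cacheable:
            # Assumption: splits not flagged as cacheable use the default path.
            return self.fallback(split_id, self.workers)
        previous = self.split_to_worker.get(split_id)
        if previous in self.workers:
            # Seen before: send the split back to the worker that cached it.
            return previous
        # First time (or remembered worker is gone): use the default selector
        # and remember the choice for subsequent scheduling.
        chosen = self.fallback(split_id, self.workers)
        self.split_to_worker[split_id] = chosen
        return chosen


selector = CacheAffinitySelectorSketch(["worker-1", "worker-2"], make_round_robin())
first = selector.select("sales/part-0.orc:0", cacheable=True)
selector.select("sales/part-1.orc:0", cacheable=True)
again = selector.select("sales/part-0.orc:0", cacheable=True)  # same worker as `first`
```

Routing a repeated split to the same worker is what lets the worker-side cache pay off: the data for that split is already in that worker's local memory.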

Workers

Workers rely on the ConnectorSplit.isCacheable method to determine whether split data must be cached. If it returns true, the Hive connector tries to retrieve the data from the cache. On a cache miss, the data is read from HDFS and stored in the cache for future use. Workers purge their caches when entries expire or when the cache reaches its size limit, independently of the coordinator.
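The read path and purge behavior can be modeled with a small read-through cache. This Python sketch is illustrative only: the class and parameter names are hypothetical, the size limit is counted in entries rather than bytes for simplicity, and least-recently-used eviction is an assumption, since this document does not name the eviction policy.

```python
import time
from collections import OrderedDict
from typing import Callable, Tuple


class WorkerCacheSketch:
    """Toy read-through cache with expiry and a size limit (hypothetical)."""

    def __init__(
        self,
        max_entries: int,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # key -> (time stored, data); ordered from least to most recently used.
        self._entries: "OrderedDict[str, Tuple[float, str]]" = OrderedDict()

    def get_or_load(self, key: str, cacheable: bool, load: Callable[[str], str]) -> str:
        if not cacheable:
            # Split not flagged as cacheable: always read from the source.
            return load(key)
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            self._entries.move_to_end(key)  # cache hit: mark as recently used
            return entry[1]
        # Cache miss or expired entry: read from the source and store the result.
        data = load(key)
        self._entries[key] = (now, data)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return data


def read_from_hdfs(key: str) -> str:
    # Placeholder for the remote read; a real worker would fetch ORC row data.
    return "rows-of:" + key


cache = WorkerCacheSketch(max_entries=2, ttl_seconds=60.0)
cache.get_or_load("sales/part-0.orc:0", cacheable=True, load=read_from_hdfs)  # miss, loads
cache.get_or_load("sales/part-0.orc:0", cacheable=True, load=read_from_hdfs)  # hit
```

Note that nothing in this model talks to the coordinator: expiry and eviction are decided locally, which matches the statement that workers purge their caches independently.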

See ORC Cache Configuration under the Hive connector for details on the cache configuration.
