Join Query Support

StarTree Cube can help optimize aggregation over join queries as well. The optimizer looks for aggregation subtree pattern in the logical plan that typically looks like following.

  AggregationNode
  |- ProjectNode[Optional]
  .  |- ProjectNode[Optional]
  .  .  |- FilterNode[Optional]
  .  .  .  |- JoinNode
  .  .  .  .  [More Joins]
  .  .  .  .  .
  .  .  .  .  |- ProjectNode[Optional] - Left
  .  .  .  .  .  |- TableScanNode [Fact Table]
  .  .  .  .  |- ProjectNode[Optional] - Right
  .  .  .  .  .  |- TableScanNode [Dim Table]

If the query matches the pattern, the optimizer rewrites the logical plan by replacing the Fact TableScanNode with Cube TableScanNode. This is similar to the single table rewrite.

Star Schema Support

Join Query optimizer supports star schema only. A star schema is a data warehousing architecture model where one fact table references multiple dimension tables, which, when viewed as a diagram, looks like a star with the fact table in the center and the dimension tables radiating from it. All kinds of joins are supported.

star-schema

Cube Management

Create Cube can be still be used to define Cubes to optimize Join queries as well. The difficult part is identifying GROUP construct while building the Cubes. With single table queries, the GROUP BY clause will contain columns only from same the table. But with join queries, especially star schema queries, the GROUP BY contain columns from Dimension tables and not the Fact table. Let’s analyze more with following query

SELECT SUM(lo_revenue) AS lo_revenue, d_year, p_brand
FROM lineorder
LEFT JOIN dates ON lo_orderdate = d_datekey
LEFT JOIN part on lo_partkey = p_partkey
LEFT JOIN supplier on lo_suppkey = s_suppkey
WHERE p_category = 'MFGR#12' AND s_region = 'AMERICA'
GROUP BY d_year, p_brand
ORDER BY d_year, p_brand;

Here lineorder is the Fact table and dates, part, supplier are the Dimension tables. Cubes will be defined on the lineorder table. The group by columns d_year, p_brand are part of the dimension tables dates and part appropriately. They cannot be used directly in CREATE CUBE statement. The proper solution is to use the foreign key columns of lineorder table in the GROUP construct while building Cubes.

CREATE CUBE lineorder_cube ON lineorder WITH(
AGGREGATIONS = (sum(lo_revenue)),
GROUP = (lo_orderdate, lo_partkey, lo_suppkey));

The optimizer parses the join conditions and uses those columns to identify the matching Cubes. The performance gain is realized if Cube size is smaller than fact table.

Limitations

  • Only star schema is supported.
  • Count distinct not supported because Cube does not store actual dimension values.
  • Queries won’t be optimized if Cubes are defined on both Fact and Dimension as the optimizer does not have capability to differentiate between two.
  • If Cubes are defined on more than one table of the Join query - then Optimizer does not work. Assumption is that Cubes are defined only the Fact table.
  • Supports only simple aggregation like SUM, COUNT, AVG, MIN, MAX - defined on Single column. Cube does not support SUM(revenue - supplycost) aggregation. The following query cannot be optimized using Cube.
   SELECT sum(lo_extendedprice * lo_discount) AS revenue
   FROM lineorder
   WHERE toYear(lo_orderdate) = 1993 AND lo_discount BETWEEN 1 AND 3 AND lo_quantity < 25;

Future

  • Support for snowflake schema
  • Building a single cube over multiple tables

有奖捉虫

“有虫”文档片段

0/500

存在的问题

文档存在风险与错误

● 拼写,格式,无效链接等错误;

● 技术原理、功能、规格等描述和软件不一致,存在错误;

● 原理图、架构图等存在错误;

● 版本号不匹配:文档版本或内容描述和实际软件不一致;

● 对重要数据或系统存在风险的操作,缺少安全提示;

● 排版不美观,影响阅读;

内容描述不清晰

● 描述存在歧义;

● 图形、表格、文字等晦涩难懂;

● 逻辑不清晰,该分类、分项、分步骤的没有给出;

内容获取有困难

● 很难通过搜索引擎,openLooKeng官网,相关博客找到所需内容;

示例代码有错误

● 命令、命令参数等错误;

● 命令无法执行或无法完成对应功能;

内容有缺失

● 关键步骤错误或缺失,无法指导用户完成任务,比如安装、配置、部署等;

● 逻辑不清晰,该分类、分项、分步骤的没有给出

● 图形、表格、文字等晦涩难懂

● 缺少必要的前提条件、注意事项等;

● 描述存在歧义

0/500

您对文档的总体满意度

非常不满意
非常满意

请问是什么原因让您参与到这个问题中

您的邮箱

创Issue赢奖品
根据您的反馈,会自动生成issue模板。您只需点击按钮,创建issue即可。
有奖捉虫