Hive Connector GCS Tutorial

Preliminary Steps

Ensure Access to GCS

Access to Cloud Storage data is possible thanks to Hadoop Cloud Storage connector.

If your data is publicly available, you do not need to do anything here. However, in most cases data is not publicly available, and the openLooKeng cluster needs to have access to it. This is typically achieved by creating a service account which has permissions to access your data. You can do this on the service accounts page in GCP. Once you create a service account, create a key for it and download the key in JSON format.

Hive Connector configuration

Another requirement is that you have enabled and configured a Hive connector in openLooKeng. The connector uses Hive metastore for data discovery and is not limited to data residing on HDFS.

Configuring Hive Connector

  • URL to Hive metastore:

    • New Hive metastore on GCP:

      If your openLooKeng nodes are provisioned by GCP, your Hive metastore should also be on GCP to minimize latency and costs. The simplest way to create a new Hive metastore on GCP is to create a small Cloud DataProc cluster (1 master, 0 workers), accessible from your openLooKeng cluster. Follow the steps for existing Hive metastore after finishing this step.

    • Existing Hive metastore:

      To use an existing Hive metastore with a openLooKeng cluster, you need to set the hive.metastore.uri property in your Hive catalog properties file to thrift://${METASTORE_ADDRESS}:${METASTORE_THRIFT_PORT}. If the metastore uses authentication, please refer to Hive Security Configuration.

  • GCS access:

    Here are example values for all GCS configuration properties which can be set in Hive catalog properties file:

    # JSON key file used to access Google Cloud Storage
    hive.gcs.json-key-file-path=/path/to/gcs_keyfile.json
    
    # Use client-provided OAuth token to access Google Cloud Storage
    hive.gcs.use-access-token=false
    

Hive Metastore configuration

If your Hive metastore uses StorageBasedAuthorization it will also need to access GCS to perform POSIX permission checks. Configuring GCS access for Hive is outside the scope of this tutorial, but there are some excellent guides online:

GCS access is typically configured in core-site.xml, to be used by all components using Apache Hadoop.

GCS connector for Hadoop provides an implementation of a Hadoop FileSystem. Unfortunately GCS IAM permissions don't map to POSIX permissions required by Hadoop FileSystem, so the GCS connector presents fake POSIX file permissions.

When Hive metastore accesses GCS, it will by default see fake POSIX permissions equal to 0700. If openLooKeng and Hive metastore are running as different user accounts, this will cause Hive metastore to deny openLooKeng data access. There are two possible solutions to this problem:

  • Run openLooKeng service and Hive service as the same user.
  • Make sure Hive GCS configuration includes afs.gs.reported.permissions property with a value of 777.

Accessing GCS Data From openLooKeng for the First Time

Accessing Data Already Mapped in the Hive metastore

If you migrate to openLooKeng from Hive, chances are that your GCS data is already mapped to SQL tables in the metastore. In that case, you should be able to query it.

Accessing Data Not Yet Mapped in the Hive metastore

To access GCS data that is not yet mapped in the Hive metastore you need to provide the schema of the data, the file format, and the data location. For example, if you have ORC or Parquet files in an GCS bucket my_bucket, you will need to execute a query:

-- select schema in which the table will be defined, must already exist
USE hive.default;

-- create table
CREATE TABLE orders (
     orderkey bigint,
     custkey bigint,
     orderstatus varchar(1),
     totalprice double,
     orderdate date,
     orderpriority varchar(15),
     clerk varchar(15),
     shippriority integer,
     comment varchar(79)
) WITH (
     external_location = 'gs://my_bucket/path/to/folder',
     format = 'ORC' -- or 'PARQUET'
);

Now you should be able to query the newly mapped table:

SELECT * FROM orders;

Writing GCS Data with openLooKeng

Prerequisites

Before you attempt to write data to GCS, make sure you have configured everything necessary to read data from GCS.

Create Export Schema

If Hive metastore contains schema(s) mapped to GCS locations, you can use them to export data to GCS. If you don't want to use existing schemas (or there are no appropriate schemas in the Hive metastore), you need to create a new one:

CREATE SCHEMA hive.gcs_export WITH (location = 'gs://my_bucket/some/path');

Export Data to GCS

Once you have a schema pointing to a location where you want to export the data, you can issue the export using a CREATE TABLE AS statement and select your desired file format. The data will be written to one or more files within the gs://my_bucket/some/path/my_table namespace.

Example:

CREATE TABLE hive.gcs_export.orders_export
WITH (format = 'ORC')
AS SELECT * FROM tpch.sf1.orders;

有奖捉虫

“有虫”文档片段

0/500

存在的问题

文档存在风险与错误

● 拼写,格式,无效链接等错误;

● 技术原理、功能、规格等描述和软件不一致,存在错误;

● 原理图、架构图等存在错误;

● 版本号不匹配:文档版本或内容描述和实际软件不一致;

● 对重要数据或系统存在风险的操作,缺少安全提示;

● 排版不美观,影响阅读;

内容描述不清晰

● 描述存在歧义;

● 图形、表格、文字等晦涩难懂;

● 逻辑不清晰,该分类、分项、分步骤的没有给出;

内容获取有困难

● 很难通过搜索引擎,openLooKeng官网,相关博客找到所需内容;

示例代码有错误

● 命令、命令参数等错误;

● 命令无法执行或无法完成对应功能;

内容有缺失

● 关键步骤错误或缺失,无法指导用户完成任务,比如安装、配置、部署等;

● 逻辑不清晰,该分类、分项、分步骤的没有给出

● 图形、表格、文字等晦涩难懂

● 缺少必要的前提条件、注意事项等;

● 描述存在歧义

0/500

您对文档的总体满意度

非常不满意
非常满意

请问是什么原因让您参与到这个问题中

您的邮箱

创Issue赢奖品
根据您的反馈,会自动生成issue模板。您只需点击按钮,创建issue即可。
有奖捉虫