Build a Data Lake with Trino, Kubernetes, Helm, and Glue

Introduction

This article is part of a series on using Kubernetes to host Big Data infrastructure in the cloud. I recommend you first read Run Trino/Presto on Minikube on AWS.

Each day, we generate 2.5 quintillion (2.5 × 10¹⁸) bytes of data. This data may be structured or unstructured, and the pace of production outstrips what a traditional data warehouse can handle: its storage is insufficient and its compute cannot process the data in a reasonable time.

The result? Unstructured data is discarded before it can be analyzed. Businesses are unable to extract value from the data they generate.

The solution? A Data Lake.

Data Lakes have been trending in Google Trends since 2016, but what’s so special about them?

Data Lakes store data in its raw, unprocessed form. Unlike traditional data warehouses, Data Lakes do not require the data schema to be specified on write. A Data Lake accommodates data from any source, including relational databases, web server logs, mobile apps, and more. The schema is applied only at read time.

Data Lakes enable analytics. Data processing technologies such as Apache Spark and Trino are used to query the data. High performance is achieved with in-memory processing. These technologies power use cases such as interactive analysis, batch processing, and dashboards.

Data Lakes use a metastore to catalog and store the data schema. The metadata is used by analytic tools to discover data and make it readable at the time of analysis. A common solution is the Hive metastore.

In this article, I will demonstrate how to build a Data Lake using the following technologies:

  1. S3 – AWS’s object storage service
  2. Glue Data Catalog – AWS’s metadata repository
  3. Kubernetes – the industry standard for container orchestration
  4. Helm – a package manager for Kubernetes
  5. Trino – a lightning-fast distributed SQL query engine

You may ask, why these five tools? A Data Lake’s enormous size requires scalable, distributed storage. Enter object stores such as AWS S3 and Azure Blob Storage, which offer generous throughput and, in the case of S3, practically no storage limit.

An object storage service with few limits

Glue Data Catalog is AWS’s managed metadata repository. It is compatible with the Hive metastore service and provides a single place to store metadata across multiple AWS services such as EMR, Athena, and Redshift Spectrum.

A cloud managed metadata repository

In addition, they are cheap. In Asia-Pacific (Singapore), S3 charges $0.025 per GB for the first 50TB per month. Glue Data Catalog is free for the first million objects stored and first million requests per month.

Kubernetes is used in combination with Helm to deploy Trino, which was chosen for its speed in data processing and its support for ANSI SQL.

Getting started

I assume you have an AWS account and have set up AWS CLI.
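
If you want to confirm the CLI is configured correctly before proceeding, a quick identity check works well (this is just a sanity check, not required by the rest of the tutorial):

aws sts get-caller-identity
Verify that the AWS CLI is configured with valid credentials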

You should also have an EC2 instance with Minikube, kubectl and helm installed. You should also have a working knowledge of Kubernetes and Trino. If not, read this first.

Setup S3 and IAM

We begin by creating the storage layer and setting up credentials to access it.

We first create an S3 bucket to store our data in. You will need to replace {your-bucket-name} and {your-aws-region}. I used norman-datalake as my S3 bucket name.

aws s3api create-bucket --acl private --bucket {your-bucket-name} --create-bucket-configuration LocationConstraint={your-aws-region}
Create an S3 bucket to store your Data Lake data

Next we create an IAM user named TrinoUser. The Trino cluster will use the TrinoUser credentials to access S3 and the Glue metastore.

aws iam create-user --user-name TrinoUser
TrinoUser will be used by the Trino Cluster

Now clone the git repo build-a-datalake to your working directory. In this section, we will use the JSON documents in aws/iam to create IAM policies.

git clone https://github.com/frenoid/build-a-datalake.git
build-a-datalake contains IAM policies and data for the datalake

First we create the TrinoGluePolicy from aws/iam/TrinoGluePolicy.json. This policy grants permissions for Glue actions such as CreateDatabase, CreateTable, GetDatabase, and GetTable. Overall, it grants permission to perform CRUD actions in the Glue metastore.
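
For orientation, an IAM policy of this kind looks roughly like the sketch below. The file in the repo is the source of truth; the action list here is only an approximation.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:CreateDatabase",
        "glue:GetDatabase",
        "glue:CreateTable",
        "glue:GetTable",
        "glue:UpdateTable",
        "glue:DeleteTable",
        "glue:GetPartitions"
      ],
      "Resource": "*"
    }
  ]
}
A sketch of a Glue CRUD policy; refer to aws/iam/TrinoGluePolicy.json for the real one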

aws iam create-policy --policy-name TrinoGluePolicy --policy-document file://"C:\Users\xing.kang.norman.lim\13-Blog\build-a-datalake\aws\iam\TrinoGluePolicy.json"
Creating the TrinoGluePolicy

Next, we take a look at aws/iam/TrinoAccessPolicy.json. This grants permissions to Get, List, and Delete objects in your data lake S3 bucket. You need to replace “norman-datalake” with your actual bucket name.

Replace norman-datalake with your S3 bucket!
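
As a rough sketch, an S3 access policy of this shape looks like the following. Again, the repo file is authoritative; note that Trino also needs write access (for example s3:PutObject) to insert data later in this tutorial.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::{your-bucket-name}"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::{your-bucket-name}/*"
    }
  ]
}
A sketch of the S3 access policy; refer to aws/iam/TrinoAccessPolicy.json for the real one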

Once you have updated the policy with your bucket name, go ahead and create the policy.

aws iam create-policy --policy-name TrinoAccessPolicy --policy-document file://"C:\Users\xing.kang.norman.lim\13-Blog\build-a-datalake\aws\iam\TrinoAccessPolicy.json"
Creating the TrinoAccessPolicy

Examine the policies you have created. Notice that each JSON policy object contains a key “Arn”. This is an Amazon Resource Name, which uniquely identifies a resource in AWS.

Be sure to take note of the ARNs of TrinoAccessPolicy and TrinoGluePolicy

 aws iam list-policies
Each policy has an ARN that uniquely identifies it in AWS.
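
If you would rather not scan the full output, a JMESPath --query can pull out just the ARN you need (an optional convenience):

aws iam list-policies --scope Local --query "Policies[?PolicyName=='TrinoAccessPolicy'].Arn" --output text
Extract the ARN of a single policy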

We now attach the 2 policies to TrinoUser by using the policy ARN and the attach-user-policy command. Replace {arn-number} with the policy ARN.

 aws iam attach-user-policy --user-name TrinoUser --policy-arn arn:aws:iam::{arn-number}:policy/TrinoAccessPolicy
aws iam attach-user-policy --user-name TrinoUser --policy-arn arn:aws:iam::{arn-number}:policy/TrinoGluePolicy
Attaching 2 policies to TrinoUser

Finally, we generate credentials for TrinoUser.

aws iam create-access-key --user-name TrinoUser
AWS returns the keys in a JSON response. Keep these keys safe.
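
The response has roughly the following shape (values redacted here; yours will contain real key material, so store it securely and do not commit it to git):

{
    "AccessKey": {
        "UserName": "TrinoUser",
        "AccessKeyId": "AKIAXXXXXXXXXXXXXXXX",
        "Status": "Active",
        "SecretAccessKey": "<redacted>",
        "CreateDate": "2022-01-01T00:00:00+00:00"
    }
}
The approximate shape of the create-access-key response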

Once the Trino cluster acquires the TrinoUser credentials, it will be able to read from and write to Glue and the S3 data lake bucket.

Setup Glue and IAM

Despite being a managed AWS service, Glue needs to be granted permissions in IAM to access your AWS resources. This section takes excerpts from Setting up IAM Permissions for AWS Glue.

First, we create the service policy for the Glue role. This grants Glue permissions to read and write from the aws-glue bucket as well as list a user’s buckets, among other things.

aws iam create-policy --policy-name GlueServiceRolePolicy --policy-document file://"C:\Users\xing.kang.norman.lim\13-Blog\build-a-datalake\aws\iam\GlueServiceRolePolicy.json"
Creating the GlueServiceRolePolicy

We will now create the actual service role for Glue, called AWSGlueServiceRoleNorman.

aws iam create-user --user-name AWSGlueServiceRoleNorman

Now attach GlueServiceRolePolicy to AWSGlueServiceRoleNorman

aws iam attach-user-policy --user-name AWSGlueServiceRoleNorman --policy-arn arn:aws:iam::{arn-number}:policy/GlueServiceRolePolicy
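
To double-check the attachment, you can list the policies attached to that identity:

aws iam list-attached-user-policies --user-name AWSGlueServiceRoleNorman
GlueServiceRolePolicy should appear in the output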

Now Glue is able to access AWS resources.

Setup Trino on Minikube

Start up your EC2 instance with Minikube installed and wait for it to enter the Running state. Take note of its public IP; you will need it later.

Use the public IP and your SSH key to SSH into the Minikube EC2 instance.

ssh -i {private-key-path} ubuntu@{ec2-public-ip}

Start minikube

minikube start

Let’s run a few diagnostics to make sure everything is running fine.

kubectl get pods -A
There are only kube-system and kube-dashboard pods running. Looks like no applications are installed.

Let’s first install helm using the following commands from the official Helm installation guide.

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh

Let’s use helm to see if any releases are installed on Minikube

helm list
There are no applications installed on minikube

We are now going to prepare the Trino helm chart. We can’t use the default chart as-is: it’s missing the configuration needed to connect to Glue and S3.

First create a directory to store the Trino chart in

mkdir trino-glue
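
If the Trino chart repository has not already been added (it was covered in the earlier article), add and refresh it first:

helm repo add trino https://trinodb.github.io/charts
helm repo update
Add the Trino community chart repository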

Now download version 0.3.0 of the Trino chart into the trino-glue directory and extract it.

helm pull trino/trino --version 0.3.0 --destination trino-glue --untar

Let’s take a look inside

ls -ltr trino-glue/trino
A typical layout for a helm chart

A helm chart has the following layout:

  1. Chart.yaml has information about the chart itself such as the app version, chart version, and maintainers
  2. values.yaml contains the default configuration values for the chart. These values are often overridden during installation (see the example after this list)
  3. templates is a directory of templates which are combined with values to produce valid Kubernetes manifests
  4. README.md is a human-readable README file

See The Chart File Structure for a complete explanation of helm chart structure.
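
For example, values can be overridden at install time without editing the chart at all. The release, chart, and file names below are placeholders and are not used later in this tutorial:

helm install my-release ./mychart --set image.tag=latest
helm install my-release ./mychart --values my-overrides.yaml
Two ways to override default chart values at install time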

One of the templates contains the configuration for the Trino catalogs: configmap-catalog.yaml.

A Trino catalog contains a list of schemas and references a data source via a connector. A fully-qualified table name in Trino has the format catalog.schema.table. A catalog properties file located on every Trino node describes how to access each catalog’s data source.
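
For example, once the hive catalog below is configured, a table named orders in a schema named order_data would be addressed like this (we create exactly such a table later on):

SELECT * FROM hive.order_data.orders;
A fully-qualified table name in the catalog.schema.table format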

Let’s now edit values.yaml to add the Hive connector configuration.

vim ~/trino-glue/trino/values.yaml

Trino’s Hive connector supports AWS Glue and S3 object storage out of the box. We need to provide Trino with the Hive metastore type, the AWS access keys, and the default warehouse directory.

Add the following section under additionalCatalogs.

additionalCatalogs:
  hive: |
    connector.name=hive
    hive.metastore=glue
    hive.metastore.glue.region=ap-southeast-1
    hive.metastore.glue.default-warehouse-dir=s3://{your-datalake-bucket}
    hive.metastore.glue.aws-access-key={TrinoUserAccessKey}
    hive.metastore.glue.aws-secret-key={TrinoUserSecretKey}
    hive.s3.aws-access-key={TrinoUserAccessKey}
    hive.s3.aws-secret-key={TrinoUserSecretKey}

The configuration properties are mostly self-explanatory.

hive.metastore.glue.default-warehouse-dir is the default data storage location used when a table or schema location is unspecified. Enter the S3 URI of the data lake bucket you created above.

A complete explanation of Hive connector catalog properties is given here.

The | symbol indicates a multi-line string. In this case, additionalCatalogs.hive contains a multi-line string which will be inserted into templates/configmap-catalog.yaml to produce the final ConfigMap manifest.

hive catalog configurations

Let’s take a look at the final manifest. We can render values.yaml against the files in templates/ without actually installing the release by running helm install with the --dry-run flag.

helm install trino-glue ~/trino-glue/trino --dry-run
A valid Kubernetes ConfigMap manifest

The final manifest for configmap-catalog looks good. Observe that the multi-line string has been entered under data.hive.properties.
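
The relevant portion of the rendered output should look roughly like this (abridged; the exact ConfigMap name depends on the chart version):

apiVersion: v1
kind: ConfigMap
metadata:
  name: trino-glue-catalog
data:
  hive.properties: |
    connector.name=hive
    hive.metastore=glue
    # remaining properties from additionalCatalogs.hive follow
A rendered catalog ConfigMap containing the hive.properties entries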

Now let’s install the Trino chart. We will name our release trino-glue

helm install trino-glue ~/trino-glue/trino
Deploying trino-glue

Ignore the port-forward command given above. We will instead access the coordinator via a service.

Let’s check that the release is correctly deployed

helm list
List the releases in Minikube

Let’s check if the Trino pods are all in Running state and the containers are ready to accept requests.

kubectl get pods
List the pods in the Minikube default namespace

Notice that there is 1 coordinator pod and 2 worker pods. The coordinator pod accepts queries from clients, parses them, plans them and sends tasks to worker nodes.

Check that the status of all pods is Running and all containers are in the Ready 1/1 state.

Because Kubernetes pods are non-permanent, we expose them via Services instead. A Service defines a logical set of Pods and a policy by which to access them. While the coordinator pod may be deleted and re-created, the Service maintains a stable IP and DNS name by which the coordinator can be reached.

Let’s take a look at the services created by the trino-glue release

kubectl get svc
List the services in Minikube

We see above that the trino-glue service has a cluster IP of 10.103.204.233 and is available on port 8080. It is a ClusterIP-type service, so this IP is only reachable from within the cluster.

To make the Trino cluster accessible from the Internet, we will use kubectl to port-forward the trino-glue service to all addresses on port 8080.

kubectl port-forward --address 0.0.0.0 svc/trino-glue 8080:8080
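
Before opening a browser, you can confirm the coordinator is responding by calling Trino’s REST info endpoint from your local machine (assuming port 8080 is open in the EC2 security group):

curl http://{ec2-public-ip}:8080/v1/info
The coordinator returns a small JSON document with its version and status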

Trino comes with a Web UI which runs on port 8080 by default. Let’s try accessing it from your local machine using the EC2 instance’s public IP.

http://{ec2-public-ip}:8080/ui

In this case, authentication is not enabled. You can enter any user and no password is required.

Trino login screen

You will land on a dashboard giving you the status of the cluster and details on any queries that are running, completed and failed.

Trino cluster dashboard

Great – you’ve successfully deployed a Trino cluster on Minikube using Helm. But a Data Lake isn’t a Data Lake without data, so let’s insert some data using the Trino cluster.

Insert and query data

In this section we will create a table and insert data by sending queries to our Trino cluster. We will use the BI tool DBeaver. The community edition is available for free and supports Trino out of the box. Go ahead and download and install DBeaver, then open it.

We first need to set up a new database connection: go to Database > New Database Connection.

Pick Trino

Under Host, enter your EC2 instance public IP and make sure the port is 8080. Your Database/Schema should be hive.
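
Under the hood, DBeaver connects through the Trino JDBC driver with a URL of roughly this form (shown for reference only; the connection wizard builds it for you):

jdbc:trino://{ec2-public-ip}:8080/hive
The JDBC URL format used for the Trino connection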

Enter any username you desire and click on Test Connection …

If the test connection succeeds, click on Finish. If not, make sure your host IP and port number are correct. Also make sure that your EC2 security group allows ingress on port 8080 from your local IP.

Right-click on the newly created Database under Database Navigator and click on SQL Editor > New SQL script

Let’s query from information_schema.

SHOW TABLES FROM information_schema;

The list of tables is returned in the bottom pane. information_schema is an ANSI-standard set of read-only views that provide information about all of the tables, views, columns, and procedures in a database. It also gives user and access-control information.
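
For example, you can list the tables visible in the catalog along with their schemas by querying information_schema.tables directly:

SELECT table_schema, table_name
FROM information_schema.tables
LIMIT 10;
Querying information_schema.tables for schema and table names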

Let’s create a schema called order_data and a table called orders inside of it. We will also insert 3 rows.

create schema order_data;
create table order_data.orders (
	order_id int,
	order_date varchar,
	order_status varchar
);
insert into order_data.orders (order_id, order_date, order_status) values (1, '2021-01-01', 'COMPLETED'), (2, '2021-01-02', 'COMPLETED'), (3, '2021-01-03', 'IN PROGRESS');
select * from  order_data.orders;
Your orders table has 3 rows with 3 columns

Bear in mind that Trino and the Hive metastore were not designed for transactional use cases requiring ACID compliance. For OLTP use cases involving single-row writes and reads requiring rapid response times, you should use MySQL or PostgreSQL.

Trino and the Hive metastore were designed for OLAP and are best used to run complex analytical queries on data that is written once but read many times over.
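
For example, the kind of query that plays to Trino’s strengths is an aggregation over many rows rather than a single-row lookup:

SELECT order_status, count(*) AS order_count
FROM order_data.orders
GROUP BY order_status;
An analytical aggregation over the orders table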

We can examine the orders table DDL using the SHOW CREATE TABLE command.

SHOW CREATE TABLE order_data.orders

Since a format was not specified in the CREATE TABLE statement, Trino used the default ORC format for the Hive connector. ORC is a columnar file format optimized for OLAP workloads through features such as run-length encoding and support for complex data types such as struct and list. Another popular columnar file format is Parquet.
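
If you prefer Parquet, the Hive connector lets you specify the format explicitly at table-creation time. The table name below is hypothetical and not used elsewhere in this article:

CREATE TABLE order_data.orders_parquet (
	order_id int,
	order_date varchar,
	order_status varchar
)
WITH (format = 'PARQUET');
Creating a table with an explicit Parquet storage format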

Recall that we set hive.metastore.glue.default-warehouse-dir=s3://{your-datalake-bucket}. Since a table location was not set for orders, the table data is stored as ORC files by default in s3://{your-datalake-bucket}/order_data/orders/

We can view the files using the S3 CLI

aws s3 ls s3://{your-datalake-bucket}/order_data/orders/
ORC files created by Trino in S3

Notice the data is stored in an object store in an open-source file format. Data created by Trino is readable by other tools such as Spark, Greenplum, BigQuery etc.

Use Trino to ingest data

We used Trino to put objects in S3 by creating a table and inserting rows. But can we do the reverse? Can we write files to S3 and then have Trino read them as tables?

Let us try by copying a CSV file into S3, then creating a table called raw.flights on top of it. Once Trino can read the table, we can create new views and aggregates using CTAS statements.

Use the flights_small.csv file from the repo and copy it to s3://{your-s3-bucket}/raw/flights/flights_small.csv

aws s3 cp data/flights_small.csv s3://{your-s3-bucket}/raw/flights/flights_small.csv

Next we will use the AWS Glue WebUI to create a database called raw. Click on Add database

Databases in the Glue catalog.
The previously created order_data Trino schema is listed as a database in Glue

Now enter the database name as raw and the location as s3://{your-s3-bucket}/raw

Adding a new database in Glue

Now we will add the flights table. Click on Add tables > Add table manually.

Add a table in Glue

Enter table name as flights and database as raw

Specify table properties

Pick S3 as the data store and select Specified path in my account

Include path is s3://{your-s3-bucket}/raw/flights

Specify data location

Pick CSV as the format

Specify data format

Now add the 16 columns of the table. Enter each column’s name and data type. The list of column names and data types is shown below.

Adding columns to the flights table in Glue

You should see that the flights table is now available in the Glue metastore.

There is an additional step: the first row of the CSV file contains the column names. To handle this, we add the skip.header.line.count=1 property to the Serde parameters of the flights table. This tells Trino to skip the first row when reading the table.

Click on the flights table > Edit Table

Table properties of raw.flights

Add the key skip.header.line.count and the value 1.

Edit Serde parameters of raw.flights
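
Depending on your Trino version, the same table could also be registered from SQL instead of the Glue console. A rough sketch follows; the table name is hypothetical, the column list is abridged, and with the CSV format every column must be declared as varchar:

CREATE TABLE raw.flights_sql (
	year varchar,
	month varchar,
	carrier varchar
	-- remaining columns omitted for brevity
)
WITH (
	format = 'CSV',
	external_location = 's3://{your-s3-bucket}/raw/flights',
	skip_header_line_count = 1
);
An alternative, SQL-only way to register an external CSV table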

Use DBeaver to query the raw.flights table

SELECT * FROM raw.flights;
Data from the flights_small.csv is readable by Trino

Congratulations – you’ve just ingested data into your Data Lake. We see the advantage of schema-on-read here: we can store our data first (CSV files in S3), then impose a schema only when we want to read it. This speeds up data storage compared to schema-on-write, where less-organized data would probably be thrown out.

Let’s create a new table called processed.flights_united_airlines containing only the United Airlines flights from raw.flights, using a CTAS statement.

CREATE SCHEMA processed;
CREATE TABLE processed.flights_united_airlines AS (
  SELECT * FROM raw.flights where carrier='UA'
);
SELECT * FROM processed.flights_united_airlines;
Creating and querying processed.flights_united_airlines

Creating a table with CTAS is simple since the output table infers its schema from the source table.

We can view the table data in S3 at the path s3://{your-s3-bucket}/processed/flights_united_airlines/

aws s3 ls s3://{your-s3-bucket}/processed/flights_united_airlines/
The S3 location of processed.flights_united_airlines

Once data is ingested by copying it into S3 and creating a table in the Glue metastore, it can be queried interactively using Trino or even integrated into a Spark ETL pipeline to build data marts and generate business reports.

Cleanup

We will now terminate our Trino cluster and shut down the EC2 instance to avoid continuing expenses.

Use helm to delete the trino-glue release

helm delete trino-glue

Now stop minikube

minikube stop

Go to the AWS EC2 Console and stop the EC2 instance.

Conclusion

Congratulations – you’ve successfully created a Data Lake using Trino on Kubernetes as the compute, S3 as the storage, and AWS Glue as the metastore.

This combination can be used in production with a few tweaks: Switch to deploying Trino on EKS and use the Glue Crawler to discover new tables in S3.

In my next article, I will demonstrate how to perform ETL using Spark on Kubernetes. I will also begin moving away from AWS managed products such as S3 and Glue toward a more cloud-agnostic architecture.

References

Github: Build a Datalake https://github.com/frenoid/build-a-datalake

Helm Charts https://helm.sh/docs/topics/charts/

Setting up IAM Permissions for AWS Glue https://docs.aws.amazon.com/glue/latest/dg/getting-started-access.html

Trino Community Kubernetes Helm Charts https://github.com/trinodb/charts
