Introduction
This article is part of a series on using Kubernetes to host Big Data infrastructure in the cloud. I recommend you first read Run Trino/Presto on Minikube on AWS.

Each day, we generate 2.5 quintillion (2.5 x 10^18) bytes of data. This data may be structured or unstructured, but the pace of production outstrips what a traditional data warehouse can handle: its storage is insufficient and its compute cannot process the data in a reasonable time.
The result? Unstructured data is discarded before it can be analyzed. Businesses are unable to extract value from the data they generate.
The solution? A Data Lake.

Data Lakes have been trending in Google Trends since 2016, but what’s so special about them?
Data Lakes store data in its raw, unprocessed form. Unlike traditional data warehouses, Data Lakes do not require the data schema to be specified on write. A Data Lake accommodates data from any source, including relational databases, web server logs, mobile apps, and more. The schema is only applied at read time.
Data Lakes enable analytics. Data processing technologies such as Apache Spark and Trino are used to query the data. High performance is achieved with in-memory processing. These technologies power use cases such as interactive analysis, batch processing, and dashboards.
Data Lakes use a metastore to catalog and store the data schema. The metadata is used by analytic tools to discover data and make it readable at the time of analysis. A common solution is the Hive metastore.
In this article, I will demonstrate how to build a Data Lake using the following technologies:
- S3 – AWS's object storage service
- Glue Data Catalog – AWS's metadata repository
- Kubernetes – the industry standard for container orchestration
- Helm – a package manager for Kubernetes
- Trino – a lightning-fast distributed SQL query engine
You may ask: why these five tools? A Data Lake's enormous size requires scalable, distributed storage. Enter object stores such as AWS S3 or Azure Blob Storage, which offer generous throughput and, in the case of S3, practically no storage limit.

Glue Data Catalog is AWS's managed metadata repository. It is compatible with the Hive metastore service and provides a single place to store metadata across multiple AWS services such as EMR, Athena, and Redshift Spectrum.

In addition, both are cheap. In Asia Pacific (Singapore), S3 charges $0.025 per GB per month for the first 50 TB. Glue Data Catalog is free for the first million objects stored and the first million requests per month.
Kubernetes is used in combination with Helm to deploy Trino. Trino was chosen for its speed in data processing and its support for ANSI SQL.
Getting started
I assume you have an AWS account and have set up AWS CLI.
You should also have an EC2 instance with Minikube, kubectl and helm installed. You should also have a working knowledge of Kubernetes and Trino. If not, read this first.
Setup S3 and IAM
We begin by creating the storage layer and setting up credentials to access it.
We first create an S3 bucket to store our data in. You will need to replace {your-bucket-name} and {your-aws-region}. I used norman-datalake as my S3 bucket name.
aws s3api create-bucket --acl private --bucket {your-bucket-name} --create-bucket-configuration LocationConstraint={your-aws-region}
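If you want to double-check that the bucket was created, a quick sketch using the s3api head-bucket call (it prints nothing and exits successfully when the bucket exists and is accessible to you):
aws s3api head-bucket --bucket {your-bucket-name}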

Next we create an IAM user named TrinoUser. The Trino cluster will use the TrinoUser credentials to access S3 and the Glue metastore.
aws iam create-user --user-name TrinoUser

Now clone the git repo build-a-datalake to your working directory. In this section, we will use the JSON documents in aws/iam to create IAM policies
git clone https://github.com/frenoid/build-a-datalake.git

First, we create the TrinoGluePolicy from aws/iam/TrinoGluePolicy.json. This policy grants permissions for Glue actions such as CreateDatabase, CreateTable, GetDatabase, and GetTable. Overall, it allows CRUD actions in the Glue metastore.
aws iam create-policy --policy-name TrinoGluePolicy --policy-document file://build-a-datalake/aws/iam/TrinoGluePolicy.json

Next, we take a look at aws/iam/TrinoAccessPolicy.json. This grants permissions to Get, List, and Delete objects in your datalake S3 bucket. You need to replace "norman-datalake" with your actual bucket name.
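One way to make that substitution from the shell, assuming GNU sed and that norman-datalake appears literally in the file ({your-bucket-name} is a placeholder for your real bucket name):
sed -i 's/norman-datalake/{your-bucket-name}/g' build-a-datalake/aws/iam/TrinoAccessPolicy.json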

Once you have updated the policy with your bucket name, go ahead and create the policy
aws iam create-policy --policy-name TrinoAccessPolicy --policy-document file://build-a-datalake/aws/iam/TrinoAccessPolicy.json

Examine the policies you have created. Notice that each JSON policy object contains an "Arn" key; this is an Amazon Resource Name, which uniquely identifies a resource in AWS.
Be sure to take note of the ARNs of TrinoAccessPolicy and TrinoGluePolicy
aws iam list-policies
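If the full list is noisy, you can narrow it down with a --query filter; this sketch assumes the policy names contain "Trino" as above and limits the output to customer-managed policies:
aws iam list-policies --scope Local --query "Policies[?contains(PolicyName, 'Trino')].[PolicyName,Arn]" --output table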

We now attach the two policies to TrinoUser using the attach-user-policy command. Replace {arn-number} with your AWS account ID, which appears in the policy ARNs returned above.
aws iam attach-user-policy --user-name TrinoUser --policy-arn arn:aws:iam::{arn-number}:policy/TrinoAccessPolicy
aws iam attach-user-policy --user-name TrinoUser --policy-arn arn:aws:iam::{arn-number}:policy/TrinoGluePolicy
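To confirm that both policies are now attached to the user:
aws iam list-attached-user-policies --user-name TrinoUser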

Finally, we generate credentials for TrinoUser
aws iam create-access-key --user-name TrinoUser
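The command prints a JSON document. If you would rather capture only the two values you will need later in values.yaml, the same call can be filtered with --query (note that every invocation creates a new key pair, so run it once):
aws iam create-access-key --user-name TrinoUser --query 'AccessKey.[AccessKeyId,SecretAccessKey]' --output text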

Once the Trino cluster acquires the TrinoUser credentials, it will be able to read from and write to Glue and the S3 datalake bucket.
Setup Glue and IAM
Despite being a managed AWS service, Glue needs to be granted permissions in IAM to access your AWS resources. This section takes excerpts from Setting up IAM Permissions for AWS Glue.
First, we create the service policy for the Glue role. This grants Glue permissions to read and write the aws-glue buckets and to list a user's buckets, among other things.
aws iam create-policy --policy-name GlueServiceRolePolicy --policy-document file://build-a-datalake/aws/iam/GlueServiceRolePolicy.json

We will now create the actual service role for Glue, named AWSGlueServiceRoleNorman. An IAM role needs a trust policy that allows the Glue service to assume it, so we pass one inline:
aws iam create-role --role-name AWSGlueServiceRoleNorman --assume-role-policy-document '{"Version": "2012-10-17", "Statement": [{"Effect": "Allow", "Principal": {"Service": "glue.amazonaws.com"}, "Action": "sts:AssumeRole"}]}'
Now attach GlueServiceRolePolicy to AWSGlueServiceRoleNorman
aws iam attach-role-policy --role-name AWSGlueServiceRoleNorman --policy-arn arn:aws:iam::{arn-number}:policy/GlueServiceRolePolicy
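You can verify the attachment with:
aws iam list-attached-role-policies --role-name AWSGlueServiceRoleNorman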
Now Glue is able to access AWS resources.
Setup Trino on Minikube
Start up your EC2 instance with Minikube installed and wait for it to enter the Running state. Take note of your public IP


Use the public IP and your SSH key to SSH into the Minikube EC2 instance. Remember the public IP; you will need it later.
ssh -i {private-key-path} ubuntu@{ec2-public-ip}

Start minikube
minikube start

Let's run a few diagnostics to ensure everything is running fine.
kubectl get pods -A
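If anything looks off, a couple of optional extra checks are the Minikube status and the node state:
minikube status
kubectl get nodes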

Let's first install Helm using the following commands from the Installing Helm guide.
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
Let’s use helm to see if any releases are installed on Minikube
helm list

We are now going to prepare the Trino Helm chart. We can't use the default chart as is; it's missing the configuration needed to connect to Glue and S3.
First create a directory to store the Trino chart in
mkdir trino-glue
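If the trino chart repository is not yet registered on this instance (it was set up in the previous article), register it first; the URL below is the Trino community charts repository from the trinodb/charts project:
helm repo add trino https://trinodb.github.io/charts
helm repo update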
Now download version 0.3.0 of the trino chart into the trino-glue directory and extract it
helm pull trino/trino --version 0.3.0 --destination trino-glue --untar
Let’s take a look inside
ls -ltr trino-glue/trino

A helm chart has the following layout
- Chart.yaml has information about the chart itself such as app version, chart version and maintainers
- values.yaml contains the default configuration values for the chart. These values are often overridden during installation
- templates is a directory of templates which are combined with values to produce valid Kubernetes manifests
- README.md is a human readable README file
See The Chart File Structure for a complete explanation of helm chart structure.
One of the templates contains the configuration for Trino catalogs: configmap-catalog.yaml.
A Trino catalog contains a list of schemas and references a data source via a connector. A fully-qualified table name in Trino has the format catalog.schema.table. Each catalog's properties file, located on every Trino node, tells Trino how to access that catalog's data connector.
Let's now edit values.yaml to add the Hive connector configuration
vim ~/trino-glue/trino/values.yaml
Trino's Hive connector supports AWS Glue and S3 object storage out of the box. We will need to provide Trino with the Hive metastore type, the AWS S3 keys, and the default warehouse directory.
Update the additionalCatalogs section so it contains the following hive entry.
additionalCatalogs:
  hive: |
    connector.name=hive
    hive.metastore=glue
    hive.metastore.glue.region=ap-southeast-1
    hive.metastore.glue.default-warehouse-dir=s3://{your-datalake-bucket}
    hive.metastore.glue.aws-access-key={TrinoUserAccessKey}
    hive.metastore.glue.aws-secret-key={TrinoUserSecretKey}
    hive.s3.aws-access-key={TrinoUserAccessKey}
    hive.s3.aws-secret-key={TrinoUserSecretKey}
The configuration properties are mostly self-explanatory.
hive.metastore.glue.default-warehouse-dir is the default data storage location used when no table or schema location is specified. Enter the S3 URI of the datalake bucket you created above.
A complete explanation of Hive connector catalog properties is given here.
The | symbol indicates a multi-line string. In this case, additionalCatalogs.hive contains a multi-line string that will be inserted into templates/configmap-catalog.yaml to produce the final ConfigMap manifest.

Let's take a look at the final manifest. We can combine values.yaml and the files in templates/ without actually installing the release by using helm install together with the --dry-run flag.
helm install trino-glue ~/trino-glue/trino --dry-run
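If you want to see just the rendered catalog ConfigMap rather than the whole manifest, recent Helm 3 versions can render a single template with --show-only; a sketch using the template file mentioned above:
helm template trino-glue ~/trino-glue/trino --show-only templates/configmap-catalog.yaml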

The final manifest for configmap-catalog looks good. Observe that the multi-line string has been entered under data.hive.properties.
Now let’s install the Trino chart. We will name our release trino-glue
helm install trino-glue ~/trino-glue/trino

Ignore the port-forward command shown in the installation notes. We will instead access the coordinator via a Service.
Let’s check that the release is correctly deployed
helm list

Let’s check if the Trino pods are all in Running state and the containers are ready to accept requests.
kubectl get pods

Notice that there is one coordinator pod and two worker pods. The coordinator accepts queries from clients, parses and plans them, and sends tasks to the worker nodes.
Check that the status of all pods is Running and that all containers are in the Ready 1/1 state.
Because Kubernetes pods are ephemeral, we expose them via Services. A Service defines a logical set of Pods and a policy by which to access them. While the coordinator pod may be deleted and re-created, the Service maintains a stable IP and DNS name by which the coordinator can be reached.
Let’s take a look at the services created by the trino-glue release
kubectl get svc

We see above that the trino-glue service has a cluster IP of 10.103.204.233 and is available on port 8080. It is also a ClusterIP-type service, so this IP is only reachable from within the cluster.
To make the Trino cluster accessible from the Internet, we will use kubectl to port-forward the trino-glue service to all addresses on port 8080.
kubectl port-forward --address 0.0.0.0 svc/trino-glue 8080:8080
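Note that port-forward occupies the terminal while it runs. If you prefer to keep the shell free, one option is to push it into the background (port-forward.log is just an arbitrary log file name):
nohup kubectl port-forward --address 0.0.0.0 svc/trino-glue 8080:8080 > port-forward.log 2>&1 &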

Trino comes with a Web UI, which by default runs on port 8080. Let's try accessing it from your local machine using the EC2 instance's public IP.
http://{ec2-public-ip}:8080/ui
In this case, authentication is not enabled. You can enter any user and no password is required.

You will land on a dashboard giving you the status of the cluster and details on any queries that are running, completed and failed.

Great – you've successfully deployed a Trino cluster on Minikube using Helm. But a Data Lake isn't a Data Lake without data, so let's insert some using the Trino cluster.
Insert and query data
In this section, we will create a table and insert data by sending queries to our Trino cluster. We will use the BI tool DBeaver. The Community Edition is available for free and supports Trino out of the box. Go ahead and download and install DBeaver, then open it.
We will first need to set up a new database connection. Go to Database > New Database Connection.

Pick Trino

Under Host, enter your EC2 instance public IP and make sure the port is 8080. Your Database/Schema should be hive.
Enter any username you desire and click on Test Connection.

If the test connection succeeds, click on Finish. If not, make sure your host IP and port number are correct. Also make sure that your EC2 security group allows ingress on port 8080 from your local IP.
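If the security group rule is missing, you can add it from the CLI; {your-security-group-id} and {your-public-ip} are placeholders for your own values:
aws ec2 authorize-security-group-ingress --group-id {your-security-group-id} --protocol tcp --port 8080 --cidr {your-public-ip}/32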
Right-click on the newly created Database under Database Navigator and click on SQL Editor > New SQL script

Let’s query from information_schema.
SHOW TABLES FROM information_schema;

The list of tables is returned in the bottom pane. information_schema is an ANSI-standard set of read-only views that provide information about all of the tables, views, columns, and procedures in a database. It also provides user and access-control information.
Let’s create a schema called order_data and a table called orders inside of it. We will also insert 3 rows.
create schema order_data;

create table order_data.orders (
  order_id int,
  order_date varchar,
  order_status varchar
);

insert into order_data.orders (order_id, order_date, order_status)
values
  (1, '2021-01-01', 'COMPLETED'),
  (2, '2021-01-02', 'COMPLETED'),
  (3, '2021-01-03', 'IN PROGRESS');

select * from order_data.orders;

Bear in mind that Trino and the Hive metastore were not designed for transactional use cases requiring ACID compliance. For OLTP use cases involving single-row writes and reads that require rapid response times, you should use MySQL or PostgreSQL.
Trino and the Hive metastore were designed for OLAP and are best used to run complex analytical queries on data that is written once but read many times.
We can examine the orders table DDL using the show create table command
SHOW CREATE TABLE order_data.orders

Since a format was not specified in the CREATE TABLE statement, Trino used the default ORC format for the Hive connector. ORC is a columnar file format optimized for OLAP workloads through features such as run-length encoding and support for complex data types such as struct and list. Another popular columnar file format is Parquet.
Recall that we set hive.metastore.glue.default-warehouse-dir=s3://{your-datalake-bucket}. Since a location was not set for the orders table, the table data is stored as ORC files under s3://{your-datalake-bucket}/order_data/orders/
We can view the files using the S3 CLI
aws s3 ls s3://{your-datalake-bucket}/order_data/orders/

Notice that the data is stored in an object store in an open-source file format. Data created by Trino is readable by other tools such as Spark, Greenplum, and BigQuery.
Use Trino to ingest data
We used Trino to put objects in S3 by creating a table and inserting rows. But can we do the reverse? Can we write files to S3 and then have Trino read them as tables?
Let us try by copying a CSV file into S3, then creating a table called raw.flights on top of it. Once Trino can read the table, we can create new views and aggregates using CTAS statements.
Use the flights_small.csv file from the repo and copy it to s3://{your-s3-bucket}/raw/flights/flights_small.csv
aws s3 cp data/flights_small.csv s3://norman-datalake/raw/flights/flights_small.csv
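You can confirm the upload with:
aws s3 ls s3://norman-datalake/raw/flights/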
Next we will use the AWS Glue WebUI to create a database called raw. Click on Add database

The previously created order_data Trino schema is listed as a database in Glue
Now enter the database name as raw and the location as s3://{your-s3-bucket}/raw
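If you prefer the CLI over the console, a sketch of the equivalent call is shown below (run it instead of, not in addition to, the console step, and substitute your bucket name):
aws glue create-database --database-input '{"Name": "raw", "LocationUri": "s3://{your-s3-bucket}/raw"}'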

Now we will add the flights table. Click on Add tables > Add table manually.

Enter table name as flights and database as raw

Pick S3 as the data store and select Specified path in my account
The include path is s3://{your-s3-bucket}/raw/flights

Pick CSV as the format

Now add 16 columns to the table. Enter each column's name and data type. The list of column names and data types is given below.


You should now see the flights table available in the Glue metastore.

There is one additional step: the first row of the CSV file contains the column names. To handle this, we add the skip.header.line.count=1 property to the Serde parameters of the flights table. This tells Trino to skip the first row.
Click on the flights table > Edit Table

Add the key skip.header.line.count and the value 1.

Use DBeaver to query the raw.flights table
SELECT * FROM raw.flights;

Congratulations – you've just ingested data into your Data Lake. Here we see the advantage of schema-on-read: we can store our data first (CSV files in S3), then impose a schema only when we want to read it. This makes ingestion faster than schema-on-write, where less organized data would probably be thrown out.
Let's create a new table called processed.flights_united_airlines containing only the United Airlines flights from raw.flights, using a CTAS statement.
CREATE SCHEMA processed;
CREATE TABLE processed.flights_united_airlines AS (
SELECT * FROM raw.flights where carrier='UA'
);
SELECT * FROM processed.flights_united_airlines;

Creating a table with CTAS is simple since the output table can infer its schema from the source tables.
We can view the table's files in S3 at the path s3://{your-s3-bucket}/processed/flights_united_airlines/
aws s3 ls s3://{your-s3-bucket}/processed/flights_united_airlines/

Once data is ingested by copying it into S3 and creating a table in the Glue metastore, it can be queried interactively using Trino or even integrated into a Spark ETL pipeline to build data marts and generate business reports.
Cleanup
We will now terminate our Trino cluster and shut down the EC2 instance to avoid ongoing costs.
Use helm to delete the trino-glue release
helm delete trino-glue

Now stop minikube
minikube stop

Go to the AWS EC2 Console and stop the EC2 instance.
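Alternatively, you can stop it from the CLI ({your-instance-id} is a placeholder for your instance's ID):
aws ec2 stop-instances --instance-ids {your-instance-id}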

Conclusion
Congratulations – you've successfully created a Data Lake using Trino on Kubernetes as the compute, S3 as the storage, and AWS Glue as the metastore.
This combination can be used in production with a few tweaks: switch to deploying Trino on EKS and use the Glue Crawler to discover new tables in S3.
In my next article, I will demonstrate how to perform ETL using Spark on Kubernetes. I will also begin moving away from AWS managed products such as S3 and Glue toward a more cloud-agnostic architecture.
References
Github: Build a Datalake https://github.com/frenoid/build-a-datalake
Helm Charts https://helm.sh/docs/topics/charts/
Setting up IAM Permissions for AWS Glue https://docs.aws.amazon.com/glue/latest/dg/getting-started-access.html
Trino Community Kubernetes Helm Charts https://github.com/trinodb/charts