Introduction

This is the beginning of a new series centering on the use of Kubernetes to host Big Data infrastructure. In this article I will run a single-node Trino cluster in a local Kubernetes cluster called minikube.
We will be combining 3 technologies:
- A state of the art query engine: Trino
- A sophisticated container orchestration tool: Kubernetes
- A local Kubernetes cluster: minikube
Trino was previously known as PrestoSQL until release 350, when a dispute over the Presto trademark led the original creators of Presto to rebrand PrestoSQL as Trino. I have discussed Trino/Presto in detail in previous articles. Read them for detailed insights into the use cases and architecture of Trino.
- Creating a Presto Cluster on EC2
- Create a multi-node Presto Cluster
- Querying a MySQL instance using Presto

Simply put, Trino is a lightning-fast query engine that scales by separating compute from storage and supports a wide range of data sources through its modular connectors. It is used by top-tier tech companies such as Stripe, Shopify, and LinkedIn for both batch and interactive workloads.
What is Kubernetes?
Kubernetes, also known as K8s, is an open-source system for automating deployment, scaling, and management of containerized applications.
https://kubernetes.io/

Kubernetes can be used to deploy any containerized application and scale it up to tens of thousands of machines. It grew out of Google’s internal Borg system, which was developed to manage thousands of jobs and applications across thousands of machines.
Lastly, we have minikube. Minikube is a local Kubernetes distribution designed to be set up easily and run on machines with limited resources such as a laptop. I found it invaluable as a learning and development environment for Kubernetes.

In this article we will accomplish the following:
- Launch an EC2 instance using the Ubuntu Server 18.04 AMI
- Install and run minikube and the Docker engine
- Install the kubectl command line tool
- Use kubectl to deploy a single-node Trino cluster
- Install and run trino-cli to send queries to the Trino cluster
Launch EC2 instance
We will first have to launch an EC2 instance on the AWS console. We will be using the t2.xlarge instance. This will cost you approximately 0.70 USD in the Asia-Pacific (Singapore) region for the 3 hours it takes to complete this guide.
Proceed to the EC2 Instances page and click on the Launch Instances button

Choose the Ubuntu Server 18.04 LTS (HVM) SSD Volume Type AMI. Ubuntu 18.04 is a Long Term Support (LTS) release of Ubuntu, with standard support running until April 2023.

Then choose the t2.xlarge instance. We want a minimum of 3 vCPUs for our single node Minikube cluster since it is serving as both control plane and worker. t2.xlarge has 4 vCPUs and 16 GB of RAM.

In the next screen, Configure Instance Details, there is no need to change anything. We will not need to assign an IAM role to our EC2 instance since it will not be using other AWS resources such as S3.
Click on “Next: Add Storage”

In “Step 4: Add Storage”, make sure there is at least 20 GiB of storage and the Volume Type is General Purpose SSD.

This next step is optional but useful for tracking costs in AWS billing. You can add tags in the form of key-value pairs.
I added two tags: Name=MinikubeCluster and Application=Minikube

Next we will configure a security group so that only traffic originating from your IP is allowed to reach your EC2 instance.
First click on “Create a new security group”, then select All traffic and choose My IP as the source.

Finally, we reach the Review Instance Launch screen. Verify that the settings are correct and click on the Launch button

You will be asked to select an existing key pair or create a new one. This key pair is necessary to create an SSH connection to your EC2 instance. If you do not understand what a key pair is, see this guide.
Click on Launch Instances to start initializing the EC2 instance.

It may take a few minutes for your EC2 instance to change to the Running state.

Attach an Elastic IP (Optional)
These next steps are optional. We will attach an Elastic IP (EIP) to your EC2 instance so that the public IP remains unchanged between instance restarts. If you do not plan to access the minikube API server from outside the EC2 instance, you can skip this step.
If you plan to connect to your minikube cluster using kubectl from outside the EC2 instance (e.g. from your laptop), your minikube cluster should have a stable IP address.
Minikube issues TLS certificates tied to specific hostnames and IPs the first time you start the cluster. If the minikube IP address changes because of a restart, TLS verification can fail.
From the EC2 instance management screen, click on the Elastic IPs link under Network & Security

Click on the orange Allocate Elastic IP address button

Choose to allocate an EIP from Amazon’s pool of IPv4 addresses. Note that AWS does not charge for EIPs as long as the EIP is associated with a running EC2 instance or with a network interface attached to a running instance.
Click on the orange “Allocate” button

Allocation of a new EIP should be near instantaneous. Note that AWS limits an account to 5 EIPs per region by default, but this quota can be increased via a service request to AWS.

Now, select the new EIP and click on Actions > Associate Elastic IP address

Select Instance under Resource type and select the instance you created in the previous step. This is where having meaningful tags on your instance will help.
Click on “Associate“

Install Minikube and Docker
Here comes the fun part! We’re going to install and run minikube as well as the Docker Engine, which will serve as the container runtime for our Kubernetes cluster.
This section references material from the minikube Get Started guide.
Your EC2 instance will now be reachable over the internet using the Elastic IP address you associated with it or the Public IPv4 address shown on the EC2 management console.
Test it by initiating an SSH connection with your private key.
ssh -i <private-key-file> ubuntu@<public-ip-address>

You have successfully started a secure shell connection to your EC2 instance.
The first order of business is to update the package lists of the APT package manager.
sudo apt-get update -y

Now, let’s download the latest minikube package for Linux using curl
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64

Take a look at the downloaded file. It’s a standalone binary that we will put in place using the Linux install command.

We will use the install command to copy files into the correct directory and set permissions
sudo install minikube-linux-amd64 /usr/local/bin/minikube

Great, minikube is now installed, but before we can start it, we need to install the container runtime: Docker.
Use apt install to install the Docker engine
sudo apt install docker.io -y

Before you start using Docker, you’ll have to add your user to the docker group so that Docker commands can run without root privileges.
sudo usermod -aG docker $USER && newgrp docker
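To confirm your user can now talk to the Docker daemon without sudo, you can, for example, list the (currently empty) set of running containers:
docker ps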

Here comes the moment of truth, we will now start the minikube cluster.
We will specify docker as the minikube driver, and if you previously associated an EIP, you can add it with the --apiserver-ips flag to include that IP in the TLS certificate.
minikube start --driver=docker --apiserver-ips "<public-ip>"

Let’s take a look at minikube’s processes and verify the cluster is running as expected.
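One way to do this, for example, is with minikube status, which reports the state of the host, kubelet, apiserver, and kubeconfig:
minikube status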

Upon starting the cluster, minikube also populates the file ~/.kube/config.
It contains information about the TLS certificate issued by minikube and information about how to connect to the cluster. This file will be used by the kubectl tool introduced in the next section.
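If you’re curious, you can peek at its contents now, for example:
cat ~/.kube/config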

Install kubectl
We will use the kubectl command line tool to issue API calls to the minikube cluster. Kubectl can be used to query the cluster status, its nodes, the pods running on it and other information.
kubectl is necessary to deploy the ConfigMaps, Services, and Deployments that constitute the Trino release
This section refers to material from Install and Set Up kubectl on Linux.
We begin by downloading the kubectl binary using curl.
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"

Then we use the install tool to copy the binary to /usr/local/bin/kubectl. We set the owner to root, the group owner to root, and the mode to 0755.
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

Let’s test out kubectl by getting all objects on the minikube cluster in all namespaces.
kubectl get all -A

Notice that there are already objects such as pods and services created in minikube even before we have created any Trino objects.
These are Kubernetes system objects. Explaining their functions is beyond the scope of this article. You can learn more at Kubernetes Components
Deploy Trino on Minikube
This section references material from Trino Community Broadcast 24: Trinetes I: Trino on Kubernetes, which in turn draws on a PR in https://github.com/trinodb/charts containing contributions from Valeriano Manassero.
We are going to create multiple objects in minikube:
- ConfigMap: tcb-trino-catalog
- ConfigMap: tcb-trino-coordinator
- Service: tcb-trino
- Deployment: tcb-trino-coordinator
Collectively, these objects constitute a release of Trino on minikube/Kubernetes. I will explain more about each object as we create them.
I am aware we can use Helm to deploy the Trino chart from https://trinodb.github.io/charts/, but we are using kubectl to create each object individually to maximize learning.
We will first create a ConfigMap named tcb-trino-catalog.
kubectl apply -f - <<EOF
# Source: trino/templates/configmap-catalog.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tcb-trino-catalog
  labels:
    app: trino
    chart: trino-0.2.0
    release: tcb
    heritage: Helm
    role: catalogs
data:
  tpch.properties: |
    connector.name=tpch
    tpch.splits-per-node=4
  tpcds.properties: |
    connector.name=tpcds
    tpcds.splits-per-node=4
EOF
This ConfigMap is used by Trino to configure 2 catalogs, tpch and tpcds, which use the tpch and tpcds connectors respectively. These connectors generate data on the fly using a deterministic algorithm when you query a schema and are commonly used to benchmark databases.
When this ConfigMap is mounted under /etc/trino/catalog, it makes the tpch and tpcds catalogs available for query in Trino.
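If you want to double-check the rendered properties files, you can describe the ConfigMap, for example:
kubectl describe configmap tcb-trino-catalog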

Next we will create the ConfigMap named tcb-trino-coordinator
kubectl apply -f - <<EOF
# Source: trino/templates/configmap-coordinator.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tcb-trino-coordinator
  labels:
    app: trino
    chart: trino-0.2.0
    release: tcb
    heritage: Helm
    component: coordinator
data:
  node.properties: |
    node.environment=production
    node.data-dir=/data/trino
    plugin.dir=/usr/lib/trino/plugin
  jvm.config: |
    -server
    -Xmx8G
    -XX:+UseG1GC
    -XX:G1HeapRegionSize=32M
    -XX:+UseGCOverheadLimit
    -XX:+ExplicitGCInvokesConcurrent
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:+ExitOnOutOfMemoryError
    -Djdk.attach.allowAttachSelf=true
    -XX:-UseBiasedLocking
    -XX:ReservedCodeCacheSize=512M
    -XX:PerMethodRecompilationCutoff=10000
    -XX:PerBytecodeRecompilationCutoff=10000
    -Djdk.nio.maxCachedBufferSize=2000000
  config.properties: |
    coordinator=true
    node-scheduler.include-coordinator=true
    http-server.http.port=8080
    query.max-memory=4GB
    query.max-memory-per-node=1GB
    query.max-total-memory-per-node=2GB
    memory.heap-headroom-per-node=1GB
    discovery-server.enabled=true
    discovery.uri=http://localhost:8080
  log.properties: |
    io.trino=INFO
EOF
This ConfigMap contains the configuration for the coordinator node, including the plugin directory for Trino, the garbage collection parameters for the JVM, and query memory limits. I have chosen to explain a few important configs:
- jvm.config.Xmx8G means a maximum of 8 GB of memory can be allocated to the Java Virtual Machine
- config.properties.coordinator is a boolean that determines whether the node assumes the coordinator role
- config.properties.node-scheduler.include-coordinator is a boolean that determines whether scheduling work on the coordinator is allowed. This is set to false in production environments, but we set it to true since we are running a single-node Trino cluster for testing
- config.properties.http-server.http.port is an integer that sets the port on which the HTTP service is exposed. Since we are running Trino on Kubernetes, users will connect to a different port specified in the Service object later on
- config.properties.query.max-memory is a string that specifies the maximum amount of memory a single query is allowed to use before Trino kills the query. This configuration is frequently changed in production to balance query SLAs with resource utilization
A full explanation of all config parameters is out of scope, but you can learn more in the Trino Properties reference.

You can verify that the 2 ConfigMaps were created correctly using kubectl
kubectl get configmaps

Next we will create a Service to expose the Trino Deployment (created in the next step). Services are used in Kubernetes to make Pods reachable from within and outside the cluster.
kubectl apply -f - <<EOF
# Source: trino/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: tcb-trino
  labels:
    app: trino
    chart: trino-0.2.0
    release: tcb
    heritage: Helm
spec:
  type: NodePort
  ports:
    - port: 8080
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app: trino
    release: tcb
    component: coordinator
EOF
We are using a NodePort-type Service, which makes the service available on a high port (30000-32767) of the EC2 instance. This NodePort is chosen automatically by Kubernetes, but we can discover the port number using kubectl (shown below).
- spec.ports[0].port refers to the Service port
- spec.ports[0].targetPort refers to the port exposed on the Pod
- spec.selector gives the labels that are used to select which Pods are exposed by the Service

You can use kubectl to discover which high port is used to expose the Service
kubectl get svc
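Alternatively, to extract just the node port, you can use a jsonpath query, for example:
kubectl get svc tcb-trino -o jsonpath='{.spec.ports[0].nodePort}'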

Finally, we create the core of the Trino release: the Deployment. The Deployment object creates a ReplicaSet and specifies the deployment strategy. It also contains the Pod template, which specifies the Trino Docker image.
kubectl apply -f - <<EOF
# Source: trino/templates/deployment-coordinator.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tcb-trino-coordinator
  labels:
    app: trino
    chart: trino-0.2.0
    release: tcb
    heritage: Helm
    component: coordinator
spec:
  selector:
    matchLabels:
      app: trino
      release: tcb
      component: coordinator
  template:
    metadata:
      labels:
        app: trino
        release: tcb
        component: coordinator
    spec:
      securityContext:
        runAsUser: 1000
        runAsGroup: 1000
      volumes:
        - name: config-volume
          configMap:
            name: tcb-trino-coordinator
        - name: catalog-volume
          configMap:
            name: tcb-trino-catalog
      imagePullSecrets:
        - name: registry-credentials
      containers:
        - name: trino-coordinator
          image: "trinodb/trino:latest"
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - mountPath: /etc/trino
              name: config-volume
            - mountPath: /etc/trino/catalog
              name: catalog-volume
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /v1/info
              port: http
          readinessProbe:
            httpGet:
              path: /v1/info
              port: http
          resources:
            {}
EOF

I will explain a few key parameters.
- spec.template.spec.containers[0].image specifies the Docker image used in the container. In this case, trinodb/trino:latest refers to release 365 as of 7-Dec-2021.
- spec.template.spec.volumes mentions the 2 ConfigMaps we created: tcb-trino-coordinator and tcb-trino-catalog. These volumes are mounted as specified in spec.template.spec.containers[0].volumeMounts under the paths /etc/trino and /etc/trino/catalog.
A full explanation of a Deployment specification can be found here.
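Before reviewing everything, you can wait for the Deployment to finish rolling out, for example:
kubectl rollout status deployment/tcb-trino-coordinator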
Let’s review all the Kubernetes objects we created using kubectl
kubectl get all

Minikube provides a handy shortcut to provide a URL to the tcb-trino service.
minikube service tcb-trino --url

Note that this URL is only reachable from within the EC2 instance, since minikube (with the docker driver) runs inside a Docker container on the instance. Save the URL because we will be using it in the next section.
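To sanity-check that Trino is up from inside the EC2 instance, you can, for example, curl the coordinator’s /v1/info endpoint using that URL:
curl $(minikube service tcb-trino --url)/v1/info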
(Optional) To make your Trino cluster accessible externally, you have to port-forward traffic from your EC2 instance to the Trino service using kubectl.
kubectl port-forward --address 0.0.0.0 svc/tcb-trino 8080:8080

You can access your cluster at http://{ec2-public-ip}:8080/ui/login.html. Enter a username of your choice since authentication is not enabled.

You will land on the Cluster Overview page of the Trino WebUI. I have seen data analysts at some companies spend countless hours staring at this page.

You can view running, completed, and failed queries on this screen. Notice there is 1 active worker: because node-scheduler.include-coordinator is set to true, the coordinator itself is counted as a worker. To learn more about the Web UI, see the Trino documentation.
Now that Trino is accessible externally, you get bonus marks if you can configure your BI client or use trino-python-client to send queries in. Note that the EC2 security group still blocks incoming traffic not originating from your IP.
Install Trino-CLI
This section makes use of material available at Trino.io: CLI
Great, we’ve installed and started Trino on minikube, but how do we send queries in? Enter the Trino-CLI. What’s that?
The Trino CLI provides a terminal-based, interactive shell for running queries. The CLI is a self-executing JAR file, which means it acts like a normal UNIX executable.
https://trino.io/docs/current/installation/cli.html
Notice that the CLI is a JAR file which will need the Java Runtime Environment (JRE) to run. We will use Java 11.
We can install JRE 11 using apt install. We will be installing a JRE designed for headless environments such as servers
sudo apt install openjdk-11-jre-headless

We can verify the installation by checking its version
java -version

As a Java artifact, the Trino CLI JAR is available on a Maven repo. We will use version 365 to match the server release. We will use wget to download it into a file named trino.
wget https://repo1.maven.org/maven2/io/trino/trino-cli/365/trino-cli-365-executable.jar -O trino

We will need to add the executable flag to the trino file using chmod
chmod +x trino
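You can confirm the CLI is executable and the JRE is wired up correctly by checking its version, for example:
./trino --version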

Now we’re ready to use the Trino CLI to send queries to the Trino cluster.
Send queries to Trino
Remember that connection string you saved? I hope you still have it, because you’ll be using it now.
Use the Trino CLI and specify the IP address, port number, and the tpcds catalog.
./trino --server <ip-address>:<port-no> --catalog tpcds
If your connection is successful, you will be presented with the trino prompt “trino>”. Now, you can send in Trino queries.
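If you prefer a one-off, non-interactive query, the CLI also accepts an --execute flag, for example:
./trino --server <ip-address>:<port-no> --catalog tpcds --execute "SELECT 1"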

Let’s explore the tpcds catalog by first listing schemas
SHOW SCHEMAS;

There are 11 schemas in tpcds, including information_schema. The numeric suffix of each schema (the scale factor) determines the number of rows in that schema’s tables. Knowing the data sizes is useful for those who use the tpcds dataset to benchmark databases.
Next, let’s list the tables in the sf10000 schema.
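SHOW TABLES FROM sf10000;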

We observe 25 tables in the sf10000 schema. The tables contain deterministically generated fact and dimension tables designed to mock the data warehouse of a retail company. A full description of the TPC-DS specification can be found here.
Let’s query the web_sales table.
SELECT * FROM sf10000.web_sales LIMIT 10;

The Trino CLI returns the results in the terminal using ASCII characters to represent a table. For a more convenient way to explore the data, you should use the JDBC driver in BI clients such as DBeaver and SQL Workbench.
Lastly, let’s query the Trino system catalog and view the number of nodes.
SELECT * FROM system.runtime.nodes;

As expected, we find 1 running node whose HTTP service is running on port 8080. We can also see that this node is the coordinator.
Cleanup
Now, that we’ve sent queries to Trino, we are ready to cleanup. We will remove all the Kubernetes objects we created, stop minikube and then shutdown the EC2 instance.
Let’s use kubectl to delete the pods, deployments, services, and configmaps we created. This may take some time since Kubernetes gives the pods time to shut down gracefully.
kubectl delete pod --all
kubectl delete replicaset --all
kubectl delete service tcb-trino
kubectl delete deployment tcb-trino-coordinator
kubectl delete configmap --all
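To confirm the cleanup worked, you can, for example, list what remains; only the default kubernetes Service should be left in the default namespace.
kubectl get all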

We can now stop the minikube process
minikube stop

If you previously allocated and associated an Elastic IP, you must now disassociate and release it from the Elastic IPs screen.


It’s now safe to terminate the EC2 instance. Do so from the EC2 management console using Actions > Terminate.
This step is important to prevent further charges from being incurred.

Conclusion
Congratulations, you’ve successfully deployed a state-of-the-art query engine (Trino) on a sophisticated container orchestration platform (Kubernetes).
Kubernetes can be used to spin up hundreds of Trino clusters, each with different configurations, just by issuing a few commands. You could even achieve perfect resource isolation by spinning up a Trino cluster on demand for each query. I see Kubernetes as the tool by which all Big Data infrastructure will be managed in the future.
In the next few articles, I will demonstrate how to use a multi-node Trino cluster on Kubernetes to query data on S3 and how to run Spark applications on Kubernetes.
References
minikube documentation: minikube start: https://minikube.sigs.k8s.io/docs/start/
minikube documentation: docker driver: https://minikube.sigs.k8s.io/docs/drivers/docker/
Kubernetes.io: Install and Set Up kubectl on Linux: https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/
Trino.io: Trino Community Broadcast 24: Trinetes I: Trino on Kubernetes: https://trino.io/episodes/24.html
Trino.io: Trino 365 Documentation: Command line interface: https://trino.io/docs/current/installation/cli.html
Trino Community Kubernetes Helm Charts: https://github.com/trinodb/charts