Run Trino/Presto on Minikube on AWS

Introduction

This is the beginning of a new series on using Kubernetes to host Big Data infrastructure. In this article, I will run a single-node Trino cluster in a local Kubernetes cluster called minikube.

We will be combining 3 technologies

  • A state of the art query engine: Trino
  • A sophisticated container orchestration tool: Kubernetes
  • A local Kubernetes cluster: minikube

Trino was previously known as Presto. Around release 350, a dispute over the Presto trademark led the original creators of Presto to rebrand PrestoSQL as Trino. I have discussed Trino/Presto in detail in previous articles. Read them for detailed insights into the use cases and architecture of Trino.

  1. Creating a Presto Cluster on EC2
  2. Create a multi-node Presto Cluster
  3. Querying a MySQL instance using Presto

Simply put, Trino is a lightning-fast query engine that scales by separating compute from storage and supports a wide range of data sources through its modular connectors. It is used in top-tier tech companies such as Stripe, Shopify and LinkedIn for both batch and interactive workloads.

What is Kubernetes?

Kubernetes, also known as K8s, is an open-source system for automating deployment, scaling, and management of containerized applications.

https://kubernetes.io/

Kubernetes can be used to deploy any containerized application and scale it up to tens of thousands of machines. It descends from Google’s internal Borg system, which was developed to manage thousands of jobs and applications across thousands of machines.

Lastly, we have minikube. Minikube is a local Kubernetes distribution designed to be set up easily and run on machines with limited resources, such as a laptop. I found it invaluable as a learning and development environment for Kubernetes.

In this article we will accomplish the following:

  1. Launch an EC2 instance using the Ubuntu Server 18.04 AMI
  2. Install and run minikube and the Docker engine
  3. Install the kubectl command line tool
  4. Use kubectl to deploy a single-node Trino cluster
  5. Install and run trino-cli to send queries to the Trino cluster

Launch EC2 instance

We will first have to launch an EC2 instance on the AWS console. We will be using the t2.xlarge instance. This will cost you approximately 0.70 USD in the Asia-Pacific (Singapore) region for the 3 hours it takes to complete this guide.

Proceed to the EC2 Instances page and click on the Launch Instances button

EC2 Management Console

Choose Ubuntu Server 18.04 LTS (HVM) SSD Volume Type AMI. Ubuntu 18.04 is the Long Term Support version of Ubuntu whose support will only cease in April 2023.

Choose Ubuntu Server 18.04

Then choose the t2.xlarge instance. We want a minimum of 3 vCPUs for our single node Minikube cluster since it is serving as both control plane and worker. t2.xlarge has 4 vCPUs and 16 GB of RAM.

Choose t2.xlarge. We need lots of cores and memory for Kubernetes.

In the next screen, Configure Instance Details, there is no need to change anything. We will not need to assign an IAM role to our EC2 instance since it will not be using other AWS resources such as S3.

Click on “Next: Add Storage”

No changes needed here

In “Step 4: Add Storage”, make sure there is at least 20 GiB of storage and the Volume Type is General Purpose SSD.

Add more storage for the EC2 instance

This next step is optional but useful for tracking costs in AWS billing. You can add tags in the form of key-value pairs.

I added two tags: Name=MinikubeCluster and Application=Minikube

Tag your instance to make it searchable and auditable

Next we will configure a security group so that only traffic originating from your IP is allowed to reach your EC2 instance.

First click on “Create a new security group”, then select All traffic and choose My IP as the source.

Secure your instance with security groups

Finally, we reach the Review Instance Launch screen. Verify that the settings are correct and click on the Launch button

Review your EC2 settings

You will be asked to select an existing key pair or create a new one. This key pair is necessary to create an SSH connection to your EC2 instance. If you do not understand what a key pair is, see this guide.

Click on Launch Instances to start initializing the EC2 instance.

Pick a key pair that you have access to. If you lose the keys, you lose access to the instance.

It may take a few minutes for your EC2 instance to change to the Running state.

Attach an Elastic IP (Optional)

These next steps are optional. We will attach an Elastic IP to your EC2 instance so that the public IP remains unchanged between cluster restarts. If you do not plan to access the minikube api-server from outside the EC2 instance, you can skip this step.

If you plan to connect to your minikube cluster using kubectl from outside the EC2 instance (e.g. from your laptop), your minikube cluster should have a stable IP address.

Minikube issues TLS certificates tied to specific hostnames and IPs the first time you start the cluster. If the minikube IP address changes after a restart, this can cause TLS verification to fail.

From the EC2 instance management screen, click on the Elastic IPs link under Network & Security.

Click on the orange Allocate Elastic IP address button

Choose to allocate an EIP from Amazon’s pool of IPv4 addresses. Note that AWS does not charge for EIPs as long as the EIP is associated with a running EC2 instance or with a network interface attached to a running instance.

Click on the orange “Allocate” button

Allocation of a new EIP should be near instantaneous. Note that AWS limits an account to 5 EIPs by default, but this quota can be raised via a service request to AWS.

Now, select the new EIP and click on Actions > Associate Elastic IP address

Associating an EIP to an instance gives it a static IP.

Select Instance under Resource type and select the instance you created in the previous step. This is where having meaningful tags on your instance will help.

Click on “Associate”

Install Minikube and Docker

Here comes the fun part! We’re going to install and run minikube as well as the Docker engine, which will serve as the driver and container runtime for our Kubernetes cluster.

This section references material from the minikube Get Started guide.

Your EC2 instance will now be reachable on the internet using the Elastic IP address you associated with it OR by the Public IPv4 address given on the EC2 management console

Test it by initiating an SSH connection with your private key.

ssh -i <private-key-file> ubuntu@<public-ip-address>

You have successfully started a secure shell connection to your EC2 instance.

The first order of business is to update the package lists of the APT package manager

sudo apt-get update -y
Updating APT package lists

Now, let’s download the latest minikube package for Linux using curl

curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
Download Minikube for Linux

Take a look at the downloaded file. It’s a standalone binary intended for use with the Linux install command.

We will use the install command to copy the binary into the correct directory and set its permissions

sudo install minikube-linux-amd64 /usr/local/bin/minikube

Great, minikube is now installed. But before we can start it, we need to install the container runtime: Docker.

Use apt install to install the Docker engine

sudo apt install docker.io -y
Installing the Docker engine

Before you start Docker, you’ll have to add your user to the docker group so that Docker commands can run without root permissions

sudo usermod -aG docker $USER && newgrp docker

Here comes the moment of truth, we will now start the minikube cluster.

We will specify docker as the minikube driver, and if you previously associated an EIP, you can also pass that IP with the --apiserver-ips flag to include it in the TLS certificate.

minikube start --driver=docker --apiserver-ips "<public-ip>"

Let’s take a look at minikube’s processes and verify the cluster is running as expected.

Looks like all minikube processes are up and running

Upon starting the cluster, minikube also populates the file ~/.kube/config.

It contains information about the TLS certificate issued by minikube and information about how to connect to the cluster. This file will be used by the kubectl tool introduced in the next section.
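
If you’re curious what’s inside, you can inspect that file directly. The sketch below pulls the API server address out of kubeconfig-style text with a naive line scan. The sample content, including the 192.168.49.2 address, is illustrative; real tooling should use kubectl config view or a proper YAML parser.

```python
# Naive sketch: extract the API server address from kubeconfig-style text.
# The sample below is illustrative of what minikube writes to ~/.kube/config.
sample = """\
clusters:
- cluster:
    certificate-authority: /home/ubuntu/.minikube/ca.crt
    server: https://192.168.49.2:8443
  name: minikube
"""

server = None
for line in sample.splitlines():
    stripped = line.strip()
    if stripped.startswith("server:"):
        # Everything after the 'server:' key is the address.
        server = stripped.split(None, 1)[1]

print(server)
```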

Install kubectl

We will use the kubectl command line tool to issue API calls to the minikube cluster. Kubectl can be used to query the cluster status, its nodes, the pods running on it and other information.

kubectl is necessary to deploy the ConfigMaps, Services, and Deployments that constitute the Trino release

This section references material from Install and Set Up kubectl on Linux.

We begin by downloading the kubectl binary using curl

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"

Then we use the install tool to copy the file to /usr/local/bin/kubectl, setting the owner to root, the group owner to root, and the mode to 0755

sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

Let’s test kubectl by getting all objects on the minikube cluster in all namespaces

kubectl get all -A

Notice that there are already objects such as pods and services created in minikube even before we have created any Trino objects.

These are Kubernetes system objects. Explaining their functions is beyond the scope of this article. You can learn more at Kubernetes Components

Deploy Trino on Minikube

This section references material from Trino Community Broadcast 24: Trinetes I: Trino on Kubernetes, which in turn draws on a PR in https://github.com/trinodb/charts containing contributions from Valeriano Manassero.

We are going to create multiple objects in minikube:

  1. ConfigMap: tcb-trino-catalog
  2. ConfigMap: tcb-trino-coordinator
  3. Service: tcb-trino
  4. Deployment: tcb-trino-coordinator

Collectively, these objects constitute a release of Trino on minikube/Kubernetes. I will explain more about each object as we create them.

I am aware we could use Helm to deploy a Trino chart from https://trinodb.github.io/charts/, but we are using kubectl to create each object individually to maximize learning.

We will first create a ConfigMap named tcb-trino-catalog.

kubectl apply -f - <<EOF
# Source: trino/templates/configmap-catalog.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tcb-trino-catalog
  labels:
    app: trino
    chart: trino-0.2.0
    release: tcb
    heritage: Helm
    role: catalogs
data:
  tpch.properties: |
    connector.name=tpch
    tpch.splits-per-node=4
  tpcds.properties: |
    connector.name=tpcds
    tpcds.splits-per-node=4
EOF

This ConfigMap configures 2 catalogs, tpch and tpcds, which use the tpch and tpcds connectors respectively. These connectors generate data on the fly using a deterministic algorithm when you query a schema and are commonly used to benchmark databases.

When this ConfigMap is mounted under /etc/trino/catalog, it makes the tpch and tpcds catalogs available for query in Trino.
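
Each catalog entry is just a Java-style .properties file. The sketch below is my own illustration (not Trino code) that parses one of the entries from the ConfigMap, to show the key=value structure Trino reads from /etc/trino/catalog:

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value properties lines, skipping blanks and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

# The tpch.properties entry from the ConfigMap above.
tpch = parse_properties("""\
connector.name=tpch
tpch.splits-per-node=4
""")
print(tpch["connector.name"])
```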

Next we will create the ConfigMap named tcb-trino-coordinator

kubectl apply -f - <<EOF
# Source: trino/templates/configmap-coordinator.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tcb-trino-coordinator
  labels:
    app: trino
    chart: trino-0.2.0
    release: tcb
    heritage: Helm
    component: coordinator
data:
  node.properties: |
    node.environment=production
    node.data-dir=/data/trino
    plugin.dir=/usr/lib/trino/plugin

  jvm.config: |
    -server
    -Xmx8G
    -XX:+UseG1GC
    -XX:G1HeapRegionSize=32M
    -XX:+UseGCOverheadLimit
    -XX:+ExplicitGCInvokesConcurrent
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:+ExitOnOutOfMemoryError
    -Djdk.attach.allowAttachSelf=true
    -XX:-UseBiasedLocking
    -XX:ReservedCodeCacheSize=512M
    -XX:PerMethodRecompilationCutoff=10000
    -XX:PerBytecodeRecompilationCutoff=10000
    -Djdk.nio.maxCachedBufferSize=2000000

  config.properties: |
    coordinator=true
    node-scheduler.include-coordinator=true
    http-server.http.port=8080
    query.max-memory=4GB
    query.max-memory-per-node=1GB
    query.max-total-memory-per-node=2GB
    memory.heap-headroom-per-node=1GB
    discovery-server.enabled=true
    discovery.uri=http://localhost:8080

  log.properties: |
    io.trino=INFO
EOF

This ConfigMap contains the configuration for the coordinator node, including the plugin directory for Trino, the garbage collection parameters for the JVM, and query memory limits. I will explain a few important configs.

The -Xmx8G flag in jvm.config means a maximum of 8 GB of memory can be allocated to the Java Virtual Machine heap

config.properties.coordinator is a boolean that determines if the node assumes the coordinator role

config.properties.node-scheduler.include-coordinator is a boolean that determines if scheduling work on the coordinator is allowed. This is typically set to false in production environments, but we set it to true since we are running a single-node Trino cluster for testing

config.properties.http-server.http.port is an integer that sets the port on which the HTTP service is exposed. Since we are running Trino on Kubernetes, users will connect to a different port specified in the service object later on.

config.properties.query.max-memory is a string that specifies the maximum amount of memory a single query is allowed to use before Trino kills the query. This configuration is frequently tuned in production to balance query SLAs with resource utilization.

A full explanation of all config parameters is out of scope but you can learn more at Trino Properties reference
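
As a rough illustration of how these limits relate, here is a small sanity-check sketch. It is not an official Trino tool, and the rule of thumb asserted at the end (per-node query memory plus heap headroom should fit within the JVM heap) is my own simplification of Trino’s startup validation:

```python
# Sanity-check sketch for the coordinator's memory settings above.
# (Illustrative only -- Trino performs its own validation at startup.)

def parse_size_gb(value: str) -> float:
    """Parse sizes like '8G', '4GB', '512M' into gigabytes."""
    v = value.upper().rstrip("B")
    if v.endswith("G"):
        return float(v[:-1])
    if v.endswith("M"):
        return float(v[:-1]) / 1024
    raise ValueError(f"unsupported size: {value}")

jvm_heap = parse_size_gb("8G")             # -Xmx8G in jvm.config
max_total_per_node = parse_size_gb("2GB")  # query.max-total-memory-per-node
headroom = parse_size_gb("1GB")            # memory.heap-headroom-per-node

# Rule of thumb: per-node query memory plus headroom must fit in the heap.
assert max_total_per_node + headroom <= jvm_heap, "memory settings exceed JVM heap"
print("memory settings fit within the 8 GB heap")
```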

You can verify that the 2 ConfigMaps were created correctly using kubectl

kubectl get configmaps

Next we will create a service to expose the Trino deployment (created in the next step). Services are used in Kubernetes to make pods reachable from within and outside the Kubernetes cluster

kubectl apply -f - <<EOF
# Source: trino/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: tcb-trino
  labels:
    app: trino
    chart: trino-0.2.0
    release: tcb
    heritage: Helm
spec:
  type: NodePort
  ports:
    - port: 8080
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app: trino
    release: tcb
    component: coordinator
EOF

We are using the NodePort type service which makes the service available on a high port (30000-32767) of the EC2 instance. This NodePort is automatically chosen by minikube but we can discover the port number using kubectl (shown below).

spec.ports[0].port refers to the service port

spec.ports[0].targetPort refers to the port exposed on the Pod

spec.selector gives the labels that are used to select which pods are exposed by the service

You can use kubectl to discover which high port is used to expose the Service

kubectl get svc
In my case, the NodePort is 31181, this will be different for you
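
If you want the port programmatically, you can parse kubectl’s JSON output. The snippet below uses a hard-coded sample document standing in for the output of kubectl get svc tcb-trino -o json; the nodePort value 31181 is just the example from my run:

```python
import json

# Stand-in for the output of: kubectl get svc tcb-trino -o json
sample = json.loads("""
{
  "spec": {
    "type": "NodePort",
    "ports": [
      {"name": "http", "port": 8080, "targetPort": "http", "nodePort": 31181}
    ]
  }
}
""")

# The nodePort field holds the high port minikube assigned to the service.
node_port = sample["spec"]["ports"][0]["nodePort"]
print(node_port)
```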

Finally, we create the core of the Trino release: the Deployment. The Deployment object creates a ReplicaSet and specifies the deployment strategy. It also contains the Pod template, which specifies the Trino Docker image.

kubectl apply -f - <<EOF
# Source: trino/templates/deployment-coordinator.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tcb-trino-coordinator
  labels:
    app: trino
    chart: trino-0.2.0
    release: tcb
    heritage: Helm
    component: coordinator
spec:
  selector:
    matchLabels:
      app: trino
      release: tcb
      component: coordinator
  template:
    metadata:
      labels:
        app: trino
        release: tcb
        component: coordinator
    spec:
      securityContext:
        runAsUser: 1000
        runAsGroup: 1000
      volumes:
        - name: config-volume
          configMap:
            name: tcb-trino-coordinator
        - name: catalog-volume
          configMap:
            name: tcb-trino-catalog
      imagePullSecrets:
        - name: registry-credentials
      containers:
        - name: trino-coordinator
          image: "trinodb/trino:latest"
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - mountPath: /etc/trino
              name: config-volume
            - mountPath: /etc/trino/catalog
              name: catalog-volume
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /v1/info
              port: http
          readinessProbe:
            httpGet:
              path: /v1/info
              port: http
          resources:
            {}
EOF

I will explain a few key parameters.

spec.template.spec.containers[0].image specifies the Docker image used in the container. In this case: trinodb/trino:latest refers to release 365 as of 7-Dec-2021.

spec.template.spec.volumes mentions the 2 ConfigMaps we created: tcb-trino-coordinator and tcb-trino-catalog. These volumes are mounted as specified in spec.template.spec.containers[0].volumeMounts under the paths /etc/trino and /etc/trino/catalog

A full explanation of a Deployment specification can be found here.

Let’s review all the Kubernetes objects we created using kubectl

kubectl get all
We’ve created 5 objects in total. The ReplicaSet and Pod were created by the Deployment

Minikube provides a handy shortcut to get the URL of the tcb-trino service.

minikube service tcb-trino --url

Note that this URL is only accessible from within the EC2 instance, since with the docker driver minikube runs inside a Docker container on the instance. Save the URL because we will be using it in the next section.

(Optional) To make your Trino cluster accessible externally, you have to port-forward traffic from your EC2 instance to the Trino service using kubectl

kubectl port-forward --address 0.0.0.0 svc/tcb-trino 8080:8080

You can access your cluster at http://{ec2-public-ip}:8080/ui/login.html. Enter a username of your choice since authentication is not enabled.

You will land on the Cluster Overview page of the Trino WebUI. I have seen data analysts at some companies spend countless hours staring at this page.

You can view running, completed, and failed queries on this screen. Notice there is 1 active worker: the coordinator itself, since node-scheduler.include-coordinator is set to true. To learn more about the Web UI, see the Trino documentation.

Now that Trino is accessible externally, you get bonus marks if you can configure your BI client or use trino-python-client to send queries in. Note that the EC2 security group still blocks incoming traffic not originating from your IP.
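
If you want to try the Python route, a minimal sketch using trino-python-client’s DB-API interface might look like this. The host placeholder and the demo user/catalog/schema values are assumptions for illustration; install the client with pip install trino first.

```python
def trino_url(host: str, port: int) -> str:
    """Helper (for illustration) that builds the server address string."""
    return f"http://{host}:{port}"

def run_query(host: str, port: int, sql: str):
    """Run a single query and return all rows."""
    # Imported lazily so the helpers above work without the package installed.
    from trino.dbapi import connect  # pip install trino
    conn = connect(host=host, port=port, user="demo",
                   catalog="tpcds", schema="sf1")
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()

# Usage (substitute your EC2 public IP, with the port-forward above running):
#   rows = run_query("<ec2-public-ip>", 8080, "SELECT 1")
```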

Install Trino-CLI

This section makes use of material available at Trino.io: CLI

Great, we’ve installed and started Trino on minikube, but how do we send queries in? Enter the Trino-CLI. What’s that?

The Trino CLI provides a terminal-based, interactive shell for running queries. The CLI is a self-executing JAR file, which means it acts like a normal UNIX executable.

https://trino.io/docs/current/installation/cli.html

Notice that the CLI is a JAR file which will need the Java Runtime Environment (JRE) to run. We will use Java 11.

We can install JRE 11 using apt install. We will be installing a JRE designed for headless environments such as servers

sudo apt install openjdk-11-jre-headless

We can verify the installation by checking its version

java -version

As a Java artifact, the Trino CLI jar is available on a Maven repo. We will use version 365 to match the server release for compatibility. We will use wget to download it into a file named trino.

wget https://repo1.maven.org/maven2/io/trino/trino-cli/365/trino-cli-365-executable.jar -O trino

We will need to add the executable flag to the trino file using chmod

chmod +x trino

Now we’re ready to use the Trino CLI to send queries to the Trino cluster.

Send queries to Trino

Remember that connection string you saved? I hope you still have it because you’ll be using it now

Use the trino CLI and specify the IP address, port number, and the tpcds catalog

./trino --server <ip-address>:<port-no> --catalog tpcds

If your connection is successful, you will be presented with the trino prompt “trino>”. Now, you can send in Trino queries.

Let’s explore the tpcds catalog by first listing schemas

SHOW SCHEMAS;

There are 11 schemas in tpcds, including information_schema. The number suffix in each schema name is the scale factor, which determines the number of rows in that schema’s tables. Knowing the data sizes is useful for those who use the tpcds dataset to benchmark databases.
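
As an aside, the sf prefix denotes the TPC-DS scale factor, which by TPC-DS convention roughly corresponds to the raw dataset size in GB (so sf10000 is about 10 TB). A tiny helper sketch to read it off a schema name (my own illustration, not something Trino provides):

```python
def scale_factor(schema: str) -> int:
    """Extract the TPC-DS scale factor from a schema name like 'sf10000'."""
    if not schema.startswith("sf"):
        raise ValueError(f"not a scale-factor schema: {schema}")
    return int(schema[2:])

for name in ["sf1", "sf100", "sf10000"]:
    # By TPC-DS convention, scale factor ~= raw data volume in GB.
    print(name, "~", scale_factor(name), "GB of source data")
```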

Next, let’s list the tables in the sf10000 schema.

SHOW TABLES FROM sf10000;

We observe 25 tables in the sf10000 schema. The tables contain deterministically generated fact and dimension tables designed to mock the data warehouse of a retail company. A full description of the TPC-DS specification can be found here.

Let’s query the web_sales table

SELECT * FROM sf10000.web_sales LIMIT 10;

The Trino CLI returns the results in the terminal using ASCII characters to represent a table. For a more convenient way to explore the data, use the JDBC driver in BI clients such as DBeaver and SQL Workbench.

Lastly, let’s query the Trino system catalog and view the number of nodes

SELECT * FROM system.runtime.nodes;

As expected we find 1 running node whose HTTP service is running on port 8080. We also see that the node is also the coordinator.

Cleanup

Now that we’ve sent queries to Trino, we are ready to clean up. We will remove all the Kubernetes objects we created, stop minikube, and then shut down the EC2 instance.

Let’s use kubectl to delete the pods, deployments, services, and configmaps we created. This may take some time, since Kubernetes gives pods time to shut down gracefully. (Deleting the Deployment alone would also remove its ReplicaSet and Pods, but we delete each object type explicitly.)

kubectl delete pod --all
kubectl delete replicaset --all
kubectl delete service tcb-trino
kubectl delete deployment tcb-trino-coordinator
kubectl delete configmap --all

We can now stop the minikube process

minikube stop

If you previously allocated and associated an Elastic IP, you must now disassociate and release it from the Elastic IPs screen.

It’s now safe to terminate the EC2 instance. Do so from the EC2 management console using Actions > Terminate.

This step is important to prevent further charges from being incurred.

Conclusion

Congratulations, you’ve successfully deployed a state-of-the-art query engine (Trino) on a sophisticated container orchestration platform (Kubernetes).

Kubernetes can be used to spin up hundreds of Trino clusters, each with different configurations, just by issuing a few commands. You could even achieve perfect resource isolation by spinning up a Trino cluster on demand for each query. I see Kubernetes as the tool by which all Big Data infrastructure will be managed in the future.

In the next few articles, I will demonstrate how to use a multi-node Trino cluster on Kubernetes to query data on S3, and how to run Spark applications on Kubernetes.

References

minikube documentation: minikube start: https://minikube.sigs.k8s.io/docs/start/
minikube documentation: docker driver: https://minikube.sigs.k8s.io/docs/drivers/docker/
Kubernetes.io: Install and Set Up kubectl on Linux: https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/
Trino.io: Trino Community Broadcast 24: Trinetes I: Trino on Kubernetes: https://trino.io/episodes/24.html
Trino.io: Trino 365 Documentation: Command line interface: https://trino.io/docs/current/installation/cli.html
Trino Community Kubernetes Helm Charts: https://github.com/trinodb/charts
