How to create a multi-node Presto cluster on AWS EC2

What’s wrong with a single-node Presto cluster?

In a previous post, I created a single-node Presto cluster where the coordinator and worker processes run on the same node.

That’s a bad idea in large clusters. Running query-processing work on the coordinator can starve the coordinator process of resources and degrade its ability to schedule work and monitor query execution.

A node in a production cluster should serve as either a coordinator or worker but not both.

What’s a multi-node Presto cluster?

Simply put, a multi-node Presto cluster has more than one node. Nodes can specialize as either a coordinator or worker.

In this article, I will demonstrate how to set up a multi-node Presto cluster using EC2 instances from scratch. I am aware of Amazon EMR but I have deliberately avoided managed services in the interest of learning.

We will be using 3 t3a.medium instances in AWS. These instances do not qualify for the free tier.

As of 2021-Apr-08, the cost of each t3a.medium instance in ap-southeast-1 is US$0.0472 per hour.

If you keep all 3 instances running for 3 hours, you will spend about US$0.425 in total (3 instances × 3 hours × US$0.0472/hour).

What determines a Presto node’s role?

Before we dive into the AWS console, let’s take a look at the configuration that differentiates a coordinator node from a worker node.

A node’s role is assigned in the etc/config.properties file.

An example of a coordinator’s config.properties file can be found at https://github.com/frenoid/presto-configs/blob/master/production/coordinator/etc/config.properties

I have reproduced the file below. The properties unique to a coordinator are node-scheduler.include-coordinator and discovery-server.enabled: they bar work from being scheduled on the coordinator and turn on the node discovery service.

coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://{coordinator-ip-or-hostname}:8080

Contrast that with the etc/config.properties of a worker node: coordinator is set to false, and node-scheduler.include-coordinator and discovery-server.enabled are omitted.

coordinator=false
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://{coordinator-ip-or-hostname}:8080

Set up the security group

Let’s begin setting up the infrastructure.

The first step is to create a security group for your EC2 instances to be launched in.

You should open all TCP ports to your IP address. You can discover your IP address at https://whatismyipaddress.com/

You should also open TCP port 8080 to all addresses to enable the service discovery process. This allows worker nodes to register with the coordinator.

Once your security group is created, make note of its name and ID. I named mine PrestoClusters (pretty self-explanatory).
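
If you prefer the command line, the same security group can be set up with the AWS CLI. This is a rough sketch, assuming the default VPC and with {your-ip} as a placeholder for your own address:

aws ec2 create-security-group --group-name PrestoClusters --description "Presto cluster nodes"
# Allow all TCP from your own IP (SSH, Web UI)
aws ec2 authorize-security-group-ingress --group-name PrestoClusters --protocol tcp --port 0-65535 --cidr {your-ip}/32
# Allow TCP 8080 from anywhere so workers can reach the discovery service
aws ec2 authorize-security-group-ingress --group-name PrestoClusters --protocol tcp --port 8080 --cidr 0.0.0.0/0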

Launch the EC2 instances

Go ahead and launch the 3 t3a.medium instances using the AWS console.

You should use Amazon Linux 2 AMI (HVM), SSD Volume Type – ami-03ca998611da0fe12 (64-bit x86) / ami-07c49da39d8691810

You don’t need an IAM role and the default 8 GB of gp2 storage is sufficient.

Take care to launch the instances in the security group you created before. This is vital for SSH access and for the nodes to be able to communicate with each other.

In my case, I created my EC2 instances in the PrestoClusters security group.

Launch your EC2 instances with a key-pair that you have access to. You will need this for SSH access.

You should give your EC2 instances useful names to help identify coordinator and worker nodes later on.

In my case, I named my instances:

  1. PrestoCluster01Master
  2. PrestoCluster01Worker01
  3. PrestoCluster01Worker02

Here’s what your AWS console will look like once you’re done launching all 3 instances.
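
For reference, an equivalent launch from the AWS CLI might look like the sketch below; the key name and security group ID are placeholders, and the Name tag applies to all 3 instances, so you would still rename each one individually afterwards:

aws ec2 run-instances \
  --image-id ami-03ca998611da0fe12 \
  --instance-type t3a.medium \
  --count 3 \
  --key-name {your-key-pair} \
  --security-group-ids {sg-id} \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=PrestoCluster01}]'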

Bootstrapping the coordinator node

Before you start, you will need to know your coordinator’s public IP. You can get this from the EC2 console.

Click on your coordinator instance and look for Public IPv4 address

I will run through the next steps at an accelerated pace since they have already been explained in a previous post.

SSH into your coordinator node

ssh -i {your pem file} ec2-user@{coordinator-public-ip}

Update yum and install Amazon Corretto

sudo yum update -y
sudo yum install java-11-amazon-corretto.x86_64 -y

Download and extract the Presto 330 server tarball. Prepare the etc and etc/catalog directories for config files.

wget https://repo.maven.apache.org/maven2/io/prestosql/presto-server/330/presto-server-330.tar.gz 
tar xvzf presto-server-330.tar.gz
cd presto-server-330
mkdir etc
mkdir etc/catalog

Place the following config files. I use vim; you can use your favourite text editor.

etc/config.properties (take note to replace {coordinator-public-ip}):

coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://{coordinator-public-ip}:8080

etc/jvm.config

-server
-Xmx4G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.nio.maxCachedBufferSize=2000000
-Djdk.attach.allowAttachSelf=true

etc/node.properties

node.environment=production
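
A single node.environment line is enough to get started, but node.properties can also pin a stable node ID and data directory (both standard Presto properties). A sketch with hypothetical values:

node.environment=production
node.id=coordinator-01
node.data-dir=/home/ec2-user/presto-data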

etc/catalog/tpch.properties

connector.name=tpch

Now go ahead and start the Presto daemon on the coordinator node.

./bin/launcher start

If this goes well, you should see output like:

Started as 1337

1337 is the PID. This will be different for you.
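
To double-check, the launcher also has a status subcommand, and the server writes its log under var/log in the installation directory:

./bin/launcher status
tail var/log/server.log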

Just to be sure, navigate to the Web UI at http://{coordinator-public-ip}:8080

Notice that there are 0 active workers.

At this point your coordinator node is up and can accept queries but cannot execute them without any registered worker nodes.
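
You can also confirm the coordinator is responding from the shell. Presto’s REST API exposes basic server metadata at /v1/info, so a quick curl makes an easy health check:

curl http://{coordinator-public-ip}:8080/v1/info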

Note that all the coordinator config files are also available at https://github.com/frenoid/presto-configs/tree/master/production/coordinator

Let’s proceed with setting up some worker nodes.

Bootstrapping the worker nodes

You’re going to do the same things as above, except you’ll be doing them on the 2 worker nodes.

The key difference is with the file etc/config.properties

etc/config.properties (take note to replace {coordinator-public-ip}):

coordinator=false
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://{coordinator-public-ip}:8080

Remember to start the Presto daemon on the worker node!

./bin/launcher start
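
Since both workers get an identical setup, you can also run the whole bootstrap as a single script. Here’s a rough sketch that simply replays the steps above; substitute {coordinator-public-ip} before running:

#!/bin/bash
set -e

# Install Java 11 and fetch the Presto 330 server tarball
sudo yum update -y
sudo yum install java-11-amazon-corretto.x86_64 -y
wget https://repo.maven.apache.org/maven2/io/prestosql/presto-server/330/presto-server-330.tar.gz
tar xvzf presto-server-330.tar.gz
cd presto-server-330
mkdir -p etc/catalog

# Worker config: coordinator role off, no discovery server
cat > etc/config.properties <<'EOF'
coordinator=false
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://{coordinator-public-ip}:8080
EOF

# jvm.config, node.properties and tpch.properties are identical to the coordinator's
cat > etc/jvm.config <<'EOF'
-server
-Xmx4G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.nio.maxCachedBufferSize=2000000
-Djdk.attach.allowAttachSelf=true
EOF

echo "node.environment=production" > etc/node.properties
echo "connector.name=tpch" > etc/catalog/tpch.properties

# Start the worker daemon
./bin/launcher start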

Putting everything together

Navigate to the Presto WebUI available at http://{coordinator-public-ip}:8080/

If you’ve done everything right, you will see something that looks like this. Notice there are 2 active workers. These workers registered themselves with the coordinator using the discovery server.

There is even a running query.

Access the Presto cluster

At this point, you can even run queries on the Presto cluster using the presto-cli tool for Presto 330.

Download and make the CLI tool executable

wget -O presto https://repo.maven.apache.org/maven2/io/prestosql/presto-cli/330/presto-cli-330-executable.jar
chmod +x presto

Connect to the Presto cluster and run queries on the tpch catalog

./presto --server {coordinator-public-ip}:8080 --catalog tpch
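
Once connected, try a couple of queries inside the CLI to confirm the cluster executes work end to end:

SHOW SCHEMAS;
SELECT count(*) FROM tiny.orders;

Since the CLI was started with --catalog tpch, tiny.orders resolves to the tiny schema of the TPC-H connector.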

You can also query the table system.runtime.nodes to see the state of the nodes in the Presto cluster.
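
For example, straight from the shell using the CLI’s --execute flag:

./presto --server {coordinator-public-ip}:8080 --execute 'SELECT * FROM system.runtime.nodes;'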

Congratulations! You’ve successfully set up Presto 330 on 3 nodes, where 1 node is a coordinator and the other 2 are workers.

Cleaning up

You should go to your AWS console and stop or terminate the 3 t3a.medium instances to avoid incurring additional billing hours.
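
If you prefer the CLI, you can terminate all 3 in one command; substitute your actual instance IDs:

aws ec2 terminate-instances --instance-ids {instance-id-1} {instance-id-2} {instance-id-3}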
