Today, I am going to create a Presto cluster on an AWS EC2 instance.

I am aware of Amazon EMR (Elastic MapReduce), Amazon’s managed Hadoop offering, but since this is a technical exercise to learn about Presto internals, we’re going to do things the hard way 🙂
Prerequisites
I assume you have some technical knowledge, namely
- Working in a POSIX compliant OS such as Linux or Mac
- Working in the CLI
- Know how to spin up an EC2 instance in the AWS cloud
- Know how to configure a Security Group in the AWS Cloud
Sources
I rely heavily on O’Reilly’s Presto: The Definitive Guide by Matt Fuller, Manfred Moser, and Martin Traverso. It’s a fantastic resource for anyone looking to get their feet wet with Presto.

I follow their examples closely, except that where they install Presto on a local bare-metal machine, I do the same on the AWS cloud.
What we are going to do
First, we will create a security group to limit access to Presto.
Second, we will spin up an EC2 instance to run Presto.
Third, we will install the JVM and Presto binaries.
Fourth, we will configure Presto and add a data source.
Finally, we will run the Presto daemon.
Create the security groups
Create a security group called MyPrestoSG with two inbound rules so that Presto is accessible only from your IP. Find your IP address at https://whatismyipaddress.com/ and write it down.
Create the first rule to allow inbound traffic on port 8080 (the Presto HTTP port) from your IP only.
Create the second rule to allow inbound traffic on port 22 (SSH) from your IP only.


Allow all outbound traffic.
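If you prefer the AWS CLI to the console, the two inbound rules can be sketched roughly as follows. This is only a sketch: it assumes the AWS CLI is installed and configured, that the group goes into your default VPC, and that https://checkip.amazonaws.com is an acceptable way to look up your public IP. The function names are my own and not part of any tool.

```shell
#!/usr/bin/env bash
# Sketch only: assumes a configured AWS CLI and a default VPC.

# Turn a bare IPv4 address into a single-host CIDR block.
to_cidr() {
  printf '%s/32\n' "$1"
}

# Create MyPrestoSG and open ports 22 and 8080 to one address only.
# Prints the new security group ID on success.
create_presto_sg() {
  local my_ip="$1" group_id port
  group_id=$(aws ec2 create-security-group \
    --group-name MyPrestoSG \
    --description "Presto access from my IP only" \
    --query GroupId --output text) || return 1
  for port in 22 8080; do
    aws ec2 authorize-security-group-ingress \
      --group-id "$group_id" \
      --protocol tcp --port "$port" \
      --cidr "$(to_cidr "$my_ip")" > /dev/null || return 1
  done
  echo "$group_id"
}

# Usage (not run automatically):
#   create_presto_sg "$(curl -s https://checkip.amazonaws.com)"
```

The /32 suffix restricts each rule to exactly one address, which is what "your IP only" means in CIDR notation.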
Spin up the EC2 instance
Launch 1 EC2 instance with the following settings
- AMI: Amazon Linux 2 AMI (HVM), SSD Volume Type – ami-0b1e534a4ff9019e0 (64-bit x86) / ami-0a5c7dec456e07a8d (64-bit Arm)
- Instance type: t3a.medium
- Subnet: Pick a public one and assign a public IP
- IAM role: None
- EBS: 8 GB gp2
- Security group: MyPrestoSG
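Once the instance state changes to “running” and the status checks have passed, try to ssh into your EC2 instance, where {public-ip} is the instance’s public IP address and {location} is the path to your private key file: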
ssh ec2-user@{public-ip} -i {location}
If everything goes well, you will see the shell of your EC2 instance.
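The launch settings above can also be expressed with the AWS CLI. The sketch below assumes a configured AWS CLI, an existing key pair, and that the x86 AMI ID listed above is valid in your region (AMI IDs are region-specific). The key pair name, subnet ID and security group ID are placeholders you must supply, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: key pair name, subnet ID and security group ID are placeholders.

# Launch one t3a.medium instance and print its instance ID.
launch_presto_instance() {
  local key_name="$1" subnet_id="$2" sg_id="$3"
  aws ec2 run-instances \
    --image-id ami-0b1e534a4ff9019e0 \
    --instance-type t3a.medium \
    --count 1 \
    --key-name "$key_name" \
    --subnet-id "$subnet_id" \
    --security-group-ids "$sg_id" \
    --associate-public-ip-address \
    --query 'Instances[0].InstanceId' --output text
}

# Block until the status checks pass, then print the public IP.
wait_and_get_ip() {
  local instance_id="$1"
  aws ec2 wait instance-status-ok --instance-ids "$instance_id" || return 1
  aws ec2 describe-instances --instance-ids "$instance_id" \
    --query 'Reservations[0].Instances[0].PublicIpAddress' --output text
}

# Usage (not run automatically):
#   id=$(launch_presto_instance my-key subnet-xxxx sg-xxxx)
#   ip=$(wait_and_get_ip "$id")
```

I left out an explicit EBS mapping on the assumption that the AMI's default root volume already matches the 8 GB gp2 setting.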

Install the JVM
Now we will install Presto 330 on the EC2 instance.
Presto 330 requires Java 11, a long-term support release, so let’s install it first.
First elevate yourself to root
sudo su
Then update yum
yum update -y
Now install Amazon Corretto 11. It’s Amazon’s no-cost, production-ready distribution of OpenJDK
yum install java-11-amazon-corretto.x86_64

Check that Java 11 is correctly installed
java --version
If everything is working well, the command prints the version of the installed OpenJDK 11 (Corretto) runtime.

At this point, the book asks you to confirm that Python 2.6 or higher is installed. Fortunately, Python 2.7 comes pre-installed with the Amazon Linux 2 AMI.
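If you would rather script the Java check than read the output by eye, a small helper like the one below can pull out the major version. It is a sketch that assumes the first line of the output has the form "openjdk 11.0.x ...". The sample string in the comment is illustrative, and the function names are my own.

```shell
#!/usr/bin/env bash
# Read `java --version` output on stdin and print the major version,
# e.g. a first line of "openjdk 11.0.6 2020-01-14 LTS" yields 11.
java_major() {
  awk 'NR == 1 { split($2, v, "."); print v[1] }'
}

# Succeed only if the JVM on the PATH reports major version 11.
check_java_11() {
  [ "$(java --version 2>/dev/null | java_major)" = "11" ]
}

# Usage (not run automatically):
#   check_java_11 && echo "Java 11 found" || echo "Java 11 missing"
```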
Install the Presto binaries
Now we will download the Presto release binaries into the EC2 instance
You can download the Presto release binaries from the Maven Central Repository with wget
wget https://repo.maven.apache.org/maven2/io/prestosql/presto-server/330/presto-server-330.tar.gz
Then extract the archive to a directory named presto-server-330
tar xvzf presto-server-330.tar.gz
List the content of presto-server-330 to see the top-level Presto directory structure
ls -ltr presto-server-330
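Before trusting the archive, it can be worth checking its integrity. Maven Central normally publishes a .sha1 file next to each artifact. The sketch below assumes that file exists at the same URL with a .sha1 suffix and that sha1sum is available on the instance. The helper name is hypothetical.

```shell
#!/usr/bin/env bash
# Compare a file's SHA-1 digest with an expected hex string.
# Returns 0 on a match and non-zero otherwise.
verify_sha1() {
  local file="$1" expected="$2" actual
  actual=$(sha1sum "$file" | awk '{ print $1 }')
  [ "$actual" = "$expected" ]
}

# Usage (not run automatically); the .sha1 URL is an assumption:
#   base=https://repo.maven.apache.org/maven2/io/prestosql/presto-server/330
#   wget "$base/presto-server-330.tar.gz.sha1"
#   expected=$(awk '{ print $1 }' presto-server-330.tar.gz.sha1)
#   verify_sha1 presto-server-330.tar.gz "$expected" && echo "checksum OK"
```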

Configure Presto and add a data source
Before we start the Presto daemon, we must first provide a set of configuration files in presto-server-330/etc and add a data source:
- Presto server configuration etc/config.properties
- Presto node configuration etc/node.properties
- JVM configuration etc/jvm.config
- Catalog properties file for the TPC-H connector etc/catalog/tpch.properties
Go into presto-server-330 and create the etc directory
cd presto-server-330
mkdir etc
Then create the etc/catalog subdirectory and the four files below using vim or your favourite text editor.
etc/config.properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080
etc/node.properties
node.environment=demo
etc/jvm.config
-server
-Xmx4G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.nio.maxCachedBufferSize=2000000
-Djdk.attach.allowAttachSelf=true
etc/catalog/tpch.properties
connector.name=tpch
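If you would rather not type the files by hand, the listings above can be written in one go with heredocs. This is a sketch using exactly the values shown above. The function name write_presto_config is my own, and it takes the Presto installation directory as its only argument.

```shell
#!/usr/bin/env bash
# Write the configuration files listed above under <presto_home>/etc.
write_presto_config() {
  local etc="$1/etc"
  # The catalog file lives in a subdirectory, so create both levels.
  mkdir -p "$etc/catalog" || return 1

  cat > "$etc/config.properties" <<'EOF'
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080
EOF

  cat > "$etc/node.properties" <<'EOF'
node.environment=demo
EOF

  cat > "$etc/jvm.config" <<'EOF'
-server
-Xmx4G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.nio.maxCachedBufferSize=2000000
-Djdk.attach.allowAttachSelf=true
EOF

  cat > "$etc/catalog/tpch.properties" <<'EOF'
connector.name=tpch
EOF
}

# Usage (not run automatically), from the directory containing presto-server-330:
#   write_presto_config presto-server-330
```

Quoting the heredoc delimiter ('EOF') keeps the shell from expanding anything inside the file bodies.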
Run Presto
Now for the most exciting part: let’s start Presto!
Use the bin/launcher script to start Presto as a foreground process
bin/launcher run
If you’ve set everything up right, Presto will begin printing logs to stdout and stderr. After a while you should see this line
INFO main io.prestosql.server.PrestoServer ======== SERVER STARTED

Congratulations, you have a running instance of Presto!
Since you launched Presto on a public subnet and allowed inbound traffic on port 8080, you can even access the UI at http://{ec2-public-ip}:8080
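Instead of watching the log for the SERVER STARTED line, you can poll the server over HTTP. The sketch below assumes the coordinator exposes a /v1/info endpoint whose JSON includes a "starting" field that turns false once startup completes. I believe that is how this release behaves, but treat it as an assumption. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Read the /v1/info JSON on stdin; succeed once "starting" is false.
presto_ready() {
  grep -q '"starting"[[:space:]]*:[[:space:]]*false'
}

# Poll the coordinator until it reports ready, or give up after N tries.
wait_for_presto() {
  local url="${1:-http://localhost:8080}" tries="${2:-30}" i
  for ((i = 0; i < tries; i++)); do
    if curl -fsS "$url/v1/info" 2>/dev/null | presto_ready; then
      return 0
    fi
    sleep 2
  done
  return 1
}

# Usage (not run automatically), from inside presto-server-330:
#   bin/launcher start      # background daemon instead of `run`
#   wait_for_presto && echo "Presto is up"
#   bin/launcher stop
```

bin/launcher start runs the same server as a background daemon, which is handy once you no longer need the logs in your terminal.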

Shut down the cluster
At this point, I recommend stopping the EC2 instance. Even though it’s just a t3a.medium, AWS will continue to bill you by the second for as long as the instance remains running.
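Stopping can also be done from the AWS CLI. This sketch assumes a configured AWS CLI. The instance ID is a placeholder and the function name is my own.

```shell
#!/usr/bin/env bash
# Stop (not terminate) the instance with the given ID.
stop_presto_instance() {
  aws ec2 stop-instances --instance-ids "$1"
}

# Usage (not run automatically):
#   stop_presto_instance i-xxxxxxxxxxxxxxxxx
```

A stopped instance no longer accrues compute charges, but its EBS volume is still billed until you terminate the instance (aws ec2 terminate-instances) or delete the volume.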