Creating a Presto Cluster on EC2

Today, I am going to create a Presto cluster on an AWS EC2 instance.

I am aware of AWS Elastic MapReduce (EMR), Amazon’s managed Hadoop offering, but since this is a technical exercise to learn about Presto internals, we’re going to do things the hard way 🙂

Prerequisites

I assume you have some technical knowledge, namely:

  • Working in a POSIX-compliant OS such as Linux or macOS
  • Working in the CLI
  • Knowing how to spin up an EC2 instance in the AWS cloud
  • Knowing how to configure a Security Group in the AWS cloud

Sources

I rely heavily on O’Reilly’s Presto: The Definitive Guide by Matt Fuller, Manfred Moser, and Martin Traverso. It’s a fantastic resource for anyone looking to get their feet wet with Presto.

I follow their examples closely. The one difference is that where they install Presto on a local bare-metal machine, I install it on an EC2 instance in the AWS cloud.

What we are going to do

First, we will create a security group to limit access to Presto.

Second, we will spin up an EC2 instance to run Presto.

Third, we will install the JVM and the Presto binaries.

Fourth, we will configure Presto and add a data source.

Finally, we will run the Presto daemon.

Create the security group

Create a security group called MyPrestoSG with two inbound rules so that Presto is only accessible from your IP. Find out your IP address at https://whatismyipaddress.com/ and write it down.

Create the first rule to allow inbound traffic on port 8080 (the Presto HTTP port) from your IP only.

Create the second rule to allow inbound traffic on port 22 (SSH) from your IP only.

Your security group should end up with just these two inbound rules. Remember to allow inbound access from your IP only.

You should allow all outbound traffic.
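If you prefer the command line to the console, the same security group can be sketched with the AWS CLI. This is only a minimal sketch: it assumes the AWS CLI is installed and configured, and that your account has a default VPC. The function name create_presto_sg and the use of https://checkip.amazonaws.com to look up your IP are my own choices, not something taken from the book.

```shell
#!/usr/bin/env bash
# Sketch: create MyPrestoSG and open ports 8080 and 22 to a single IP.
# Assumes a configured AWS CLI and a default VPC. The function name is hypothetical.
create_presto_sg() {
  local my_ip sg_id port

  # Look up the caller's public IP (assumption: checkip.amazonaws.com is reachable).
  my_ip="$(curl -s https://checkip.amazonaws.com)"

  # Create the group and capture its ID.
  sg_id="$(aws ec2 create-security-group \
    --group-name MyPrestoSG \
    --description "Presto access from my IP only" \
    --query GroupId --output text)"

  # One inbound rule per port, each limited to a /32 CIDR (a single address).
  for port in 8080 22; do
    aws ec2 authorize-security-group-ingress \
      --group-id "$sg_id" --protocol tcp --port "$port" \
      --cidr "${my_ip}/32" > /dev/null
  done

  # A newly created security group already allows all outbound traffic,
  # so no egress rule is added here.
  echo "$sg_id"
}
```

Calling create_presto_sg prints the new group ID, which you can pass on when launching the instance.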

Spin up the EC2 instance

Launch one EC2 instance with the following settings:

  • AMI: Amazon Linux 2 AMI (HVM), SSD Volume Type – ami-0b1e534a4ff9019e0 (64-bit x86) / ami-0a5c7dec456e07a8d (64-bit Arm)
  • Instance type: t3a.medium
  • Subnet: Pick a public one and assign a public IP
  • IAM role: None
  • EBS: 8 GB gp2
  • Security group: MyPrestoSG
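The settings above map fairly directly onto a single AWS CLI call. The sketch below assumes a configured AWS CLI, an existing EC2 key pair, and a public subnet. The function name launch_presto_instance and its three arguments are hypothetical. Note that AMI IDs are region-specific, so the x86 AMI ID from the list will only resolve in the region it was published in.

```shell
#!/usr/bin/env bash
# Sketch: launch the instance described above with the AWS CLI.
# Hypothetical arguments:
#   $1 = security group ID, $2 = public subnet ID, $3 = name of an existing key pair
launch_presto_instance() {
  aws ec2 run-instances \
    --image-id ami-0b1e534a4ff9019e0 \
    --instance-type t3a.medium \
    --count 1 \
    --key-name "$3" \
    --subnet-id "$2" \
    --associate-public-ip-address \
    --security-group-ids "$1" \
    --block-device-mappings 'DeviceName=/dev/xvda,Ebs={VolumeSize=8,VolumeType=gp2}' \
    --query 'Instances[0].InstanceId' --output text
}
```

The function prints the ID of the new instance. No IAM role is attached, matching the settings in the list.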

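Once the instance state changes to “running” and the status checks have passed, try to SSH into your EC2 instance with the command below, where {public-ip} is the instance’s public IP address and {location} is the path to your private key file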

ssh ec2-user@{public-ip} -i {location}

If everything goes well, you will see the shell of your EC2 instance.

Install the JVM

Now we will install Presto 330 on the EC2 instance.

Presto 330 requires Java 11, the long-term support release, so let’s install that first.

First elevate yourself to root

sudo su

Then update yum

yum update -y

Now install Amazon Corretto 11. It’s Amazon’s no-cost, production-ready distribution of OpenJDK

yum install java-11-amazon-corretto.x86_64

Yum will prompt you to install the package and its dependencies. Type “y”

Check that Java 11 is correctly installed

java --version

If everything is working well, the command prints the version details of the Corretto 11 runtime you just installed.

At this point, the book asks you to confirm that Python 2.6 or higher is installed. Fortunately, Python 2.7 comes pre-installed with the Amazon Linux 2 AMI
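If you want to script the Java check rather than eyeball it, here is a minimal sketch. The helper java_major is hypothetical, and it assumes the first line of the version banner has the usual form, such as openjdk version "11.0.6" 2020-01-14 LTS.

```shell
#!/usr/bin/env bash
# Sketch: extract the Java major version from a version banner line.
# Handles both the modern scheme ("11.0.6") and the legacy one ("1.8.0_242").
java_major() {
  # $1 is the first line printed by `java -version`.
  local version
  version="$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')"
  case "$version" in
    1.*) printf '%s\n' "$version" | cut -d. -f2 ;;  # legacy: second field is the major
    *)   printf '%s\n' "$version" | cut -d. -f1 ;;  # modern: first field is the major
  esac
}

# Usage on the instance (`java -version` writes its banner to stderr):
#   major="$(java_major "$(java -version 2>&1 | head -n 1)")"
#   [ "$major" -ge 11 ] || echo "Presto 330 needs Java 11" >&2
```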

Install the Presto binaries

Now we will download the Presto release binaries onto the EC2 instance. They are published in the Maven Central Repository, so we can fetch them with wget

wget https://repo.maven.apache.org/maven2/io/prestosql/presto-server/330/presto-server-330.tar.gz 

Then extract the archive to a directory named presto-server-330

tar xvzf presto-server-330.tar.gz

List the contents of presto-server-330 to see the top-level Presto directory structure

ls -ltr presto-server-330

Configure Presto and add a data source

Before we start the Presto daemon, we must provide a set of configuration files in presto-server-330/etc and add a data source:

  • Presto server configuration etc/config.properties
  • Presto node configuration etc/node.properties
  • JVM configuration etc/jvm.config
  • Catalog properties file for the TPC-H connector etc/catalog/tpch.properties

Go into presto-server-330 and create the etc directory, along with the etc/catalog subdirectory that will hold the catalog file

cd presto-server-330
mkdir -p etc/catalog

Then create the four files using vim or your favourite text editor.

etc/config.properties

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080

etc/node.properties

node.environment=demo

etc/jvm.config

-server
-Xmx4G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.nio.maxCachedBufferSize=2000000
-Djdk.attach.allowAttachSelf=true

etc/catalog/tpch.properties

connector.name=tpch
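If you would rather not type the files by hand, the same configuration can be written in one go with shell heredocs. This is a sketch that assumes you run it from inside presto-server-330. It overwrites any existing files with the same names, and the values are exactly the ones listed above.

```shell
#!/usr/bin/env bash
# Sketch: write the four configuration files from this section.
# Run from inside presto-server-330; existing files are overwritten.
mkdir -p etc/catalog

# Server configuration: a single node acting as coordinator and worker.
cat > etc/config.properties <<'EOF'
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080
EOF

# Node configuration.
cat > etc/node.properties <<'EOF'
node.environment=demo
EOF

# JVM options, one per line.
cat > etc/jvm.config <<'EOF'
-server
-Xmx4G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.nio.maxCachedBufferSize=2000000
-Djdk.attach.allowAttachSelf=true
EOF

# Catalog file that registers the TPC-H connector as a data source.
cat > etc/catalog/tpch.properties <<'EOF'
connector.name=tpch
EOF
```

The quoted 'EOF' delimiter keeps the shell from expanding anything inside the heredocs, so the files are written verbatim.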

Run Presto

Now for the most exciting part: let’s start Presto!

Use the bin/launcher script to start Presto as a foreground process

bin/launcher run

If you’ve set everything up right, Presto will begin printing logs to stdout and stderr. After a while you should see this line

INFO        main io.prestosql.server.PrestoServer ======== SERVER STARTED  

Congratulations, you have a running instance of Presto!

Since you launched Presto on a public subnet and allowed inbound traffic on port 8080, you can even access the UI at http://{ec2-public-ip}:8080
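Instead of watching the log for the SERVER STARTED line, you can also poll the server over HTTP. The sketch below assumes that the coordinator exposes a /v1/info endpoint whose JSON response contains "starting":false once startup has finished. Treat that response shape as an assumption to verify against your own server. The function name wait_for_presto and its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: poll the coordinator until it reports that startup has finished.
# Assumption: GET /v1/info returns JSON containing "starting":false when ready.
#   $1 = base URL (default http://localhost:8080), $2 = max attempts (default 30)
wait_for_presto() {
  local url tries i
  url="${1:-http://localhost:8080}"
  tries="${2:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -s "$url/v1/info" | grep -q '"starting":false'; then
      echo "Presto is up at $url"
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "Presto did not come up after $tries attempts" >&2
  return 1
}
```

Because bin/launcher run keeps the terminal busy, run wait_for_presto from a second SSH session on the instance, or from your own machine with the instance’s public IP in the URL.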

Shut down the cluster

At this point, I recommend stopping the EC2 instance. Even though it’s just a t3a.medium, AWS will continue to bill you by the second for as long as the instance remains running.
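Stopping the instance can also be done from the AWS CLI. This is a minimal sketch that assumes a configured AWS CLI. The function name stop_presto_instance is hypothetical, and you need to supply your own instance ID.

```shell
#!/usr/bin/env bash
# Sketch: stop (not terminate) the instance so it no longer accrues compute charges.
#   $1 = instance ID of the Presto instance
stop_presto_instance() {
  aws ec2 stop-instances --instance-ids "$1" \
    --query 'StoppingInstances[0].CurrentState.Name' --output text
}
```

A stopped instance keeps its EBS volume, so the 8 GB of storage is still billed until you terminate the instance.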
