Create a single node Hadoop cluster

Starting out in Data Engineering

Hadoop on EC2

When I cut my teeth in Data Engineering in 2018, Apache Spark was all the rage. Spark’s in-memory processing made it lightning-fast and made older frameworks such as Apache Pig obsolete.

You couldn’t call yourself a Data Engineer without knowing Spark. I was a fledgling Data Engineer and it was imperative that I learn Spark but I faced a major obstacle: I could not find a good Spark development environment.

Sure, I knew about managed solutions such as EMR and Dataproc, but they had hidden drawbacks. AWS and GCP handle the complexity of Big Data infrastructure so you can focus on writing code, but at the cost of not knowing what’s happening under the hood.

Having worked in Big Data and maintained data pipelines for 4 years, I believe being a good Data Engineer means knowing how to build and maintain Data Infrastructure. Servers and virtual machines crash all the time and cause pipelines to fail.

As a Data Engineer, you must be capable of being a first responder. You must be able to assess the situation and form an initial diagnosis of the underlying cause.

As I’ve mentioned in my guide to creating a single-node Presto cluster, the best way to form an intuition about a technology is to set it up and use it. There’s no substitute for diving into the nitty-gritty of installing things.

Once it’s running, you’ll have an overview of the system and feel more accomplished than if you simply clicked <Start Notebook> on the AWS console.

Create your own development environment

The year is now 2021 and Spark remains the gold standard in Big Data processing. YARN is gradually being replaced by Kubernetes but remains entrenched in many Big Data stacks. Kafka is a popular framework for stream processing. For this reason I’ve chosen this stack for a basic Big Data development environment.

Spark, YARN, HDFS, and Kafka: what you need to know as a Big Data engineer

This is the first of a four-part series intended to provide Big Data infrastructure in a cost-effective manner. Completing all four parts means you’ll have hands-on experience in multiple Big Data technologies.

The first article will instruct you how to set up a Hadoop cluster including HDFS and YARN.

The second article will teach you to set up the Hive metastore service and Spark 3.

The third article will show you how to load and manipulate a sample dataset using Hive and Spark and run your own JupyterLab environment.

The fourth article will demonstrate how to set up a single-node Kafka broker with Zookeeper and stream logs in.

Ready? Let’s begin creating our single-node Hadoop cluster.

Create Cloud Infrastructure

Our first step is to stand up an EC2 instance to run the Hadoop services HDFS and YARN in. We will be using a t2.xlarge instance with 4 vCPUs and 16GB of memory. Hadoop thrives when there is an abundance of memory.

Assuming your t2.xlarge instance runs for 3 hours, you will incur a total charge of about USD 0.70 in the Asia Pacific (Singapore) region.
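The arithmetic behind that estimate is simply hours multiplied by the hourly on-demand rate. Here is a minimal sketch, assuming a rate of USD 0.2336 per hour. That rate is my assumption, chosen to be consistent with the total above, so check the current EC2 pricing for your region before relying on it.

```shell
# Assumed on-demand hourly rate in USD for a t2.xlarge. This figure is an
# assumption, not a quoted price: verify it against current EC2 pricing.
hourly_rate=0.2336
hours=3

# awk does the floating-point multiplication that plain shell arithmetic cannot.
cost=$(awk -v r="$hourly_rate" -v h="$hours" 'BEGIN { printf "%.2f", r * h }')
echo "Estimated cost for ${hours} hours: USD ${cost}"
```

Change `hours` to see what leaving the instance running overnight would cost.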

Start by creating a security group that allows ALL inbound traffic from your IP only. All outbound traffic is allowed. I have named my group my-spark-cluster-demo but you can use anything you like.

You should pick Ubuntu Server 18.04 LTS (HVM) SSD Volume Type as the AMI. Standard support for 18.04 LTS ends in 2023, and Extended Security Maintenance runs until April 2028.

Pick t2.xlarge as the instance type

Be sure to set Auto-assign Public IP to Enable (You’ll need this to SSH into your instance from the Internet)

Note that the IAM role is None. You won’t need IAM permissions yet.

Set the Size to 40 GiB and choose gp2 as the volume type. We want ample storage to store large datasets

Assign the security group you created to the EC2 instance.

Pick an existing keypair or create a new one then click on Review and Launch!

Your EC2 instance will be in Pending state. Wait till it changes to Running state, then be sure to note down the Public IPv4 address! You’ll need this to access your Spark cluster.

Copy the Public IPv4 address. Note that the address changes every time you stop and start the instance.

Now use SSH to connect to your EC2 instance

ssh ubuntu@{public-ip-address}
The Ubuntu shell greets you once you successfully SSH in

If you see the above, then congratulations! You’ve successfully launched an EC2 instance. Let’s proceed to install the Hadoop packages

Get Prerequisites

The first thing you’ll need to do is update the package lists in APT using

sudo apt-get update
Updating the APT package lists

Then since Java is required for Hadoop, install Java 8 using APT

sudo apt install openjdk-8-jre-headless -y
Installing openjdk-8
Still installing openjdk-8

Verify Java is properly installed using

java -version
Java 8 is properly installed in your EC2 instance

Setup login without passwords

Now we will set up passwordless login on our host. This is necessary to run the scripts which start services such as HDFS and YARN.

Let’s generate a key pair using ssh-keygen. This will produce 2 files in /home/ubuntu/.ssh

The first is id_rsa. This is your private key and should not be shared with anyone else.

The second is your public key, which can be shared with another machine to authenticate your identity.

Generate a keypair

Now let’s append your newly-generated public key to /home/ubuntu/.ssh/authorized_keys. This allows the host to authenticate you without a password

cat ~/.ssh/ >> ~/.ssh/authorized_keys
Your public key has been appended to the list of authorized keys
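The two steps above can be sketched end to end as follows. This assumes the default RSA file names `id_rsa` and `id_rsa.pub` and an empty passphrase. The `SSH_DIR` variable is a hypothetical helper of mine so the target directory can be overridden.

```shell
# Assumption: default RSA key names (id_rsa / id_rsa.pub) in ~/.ssh.
# SSH_DIR is a hypothetical override, not something ssh itself reads.
SSH_DIR="${SSH_DIR:-$HOME/.ssh}"
mkdir -p "$SSH_DIR" && chmod 700 "$SSH_DIR"

# Generate a key pair with an empty passphrase unless one already exists.
if [ ! -f "$SSH_DIR/id_rsa" ]; then
  ssh-keygen -q -t rsa -b 4096 -N "" -f "$SSH_DIR/id_rsa"
fi

# Authorize the public key for logins to this host, skipping duplicates.
touch "$SSH_DIR/authorized_keys"
if ! grep -qxF "$(cat "$SSH_DIR/id_rsa.pub")" "$SSH_DIR/authorized_keys"; then
  cat "$SSH_DIR/id_rsa.pub" >> "$SSH_DIR/authorized_keys"
fi

# sshd ignores an authorized_keys file that other users can write to.
chmod 600 "$SSH_DIR/authorized_keys"
```

Appending with `>>` rather than overwriting matters on EC2, because authorized_keys already holds the key pair you chose at launch.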

Let’s test the passwordless login.

ssh localhost
If your screen looks like the picture above, then you’ve successfully set up passwordless login on your single-node Hadoop cluster.

Setup Hadoop

Now, let’s get to the fun part of setting up Hadoop. We’re going to use wget to download Hadoop 3.3.0 which includes the HDFS and YARN packages.
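A minimal sketch of the download step, assuming the release is fetched from the Apache archive. The URL and the `download_hadoop` helper are my assumptions. Any Apache mirror that still hosts 3.3.0 should work equally well.

```shell
HADOOP_VERSION=3.3.0
HADOOP_TARBALL="hadoop-${HADOOP_VERSION}.tar.gz"

# Assumed download location: the Apache archive keeps old Hadoop releases.
HADOOP_URL="https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/${HADOOP_TARBALL}"

# Hypothetical helper: fetch the tarball into the current directory.
download_hadoop() {
  wget "$HADOOP_URL"
}

echo "Run download_hadoop to fetch $HADOOP_URL"
```

Call `download_hadoop` from your home directory. The tarball is several hundred megabytes, so the download takes a little while.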


Once the download is complete, we can untar the package. This will uncompress the archive into a new hadoop-3.3.0 directory

tar xzf hadoop-3.3.0.tar.gz

Now we will do 3 things. First, move the hadoop-3.3.0 directory to /opt. Then change the ownership of the hadoop-3.3.0 directory and all child objects to the user you logged in as. Finally, create a softlink /opt/hadoop which points to /opt/hadoop-3.3.0

sudo mv -f hadoop-3.3.0 /opt
sudo chown ${USER}:${USER} -R /opt/hadoop-3.3.0
sudo ln -s /opt/hadoop-3.3.0 /opt/hadoop
The Hadoop-3.3.0 directory is where the Hadoop binaries and configurations sit

Let’s take a look at the contents of /opt/hadoop/etc/hadoop.

ls -ltr /opt/hadoop/etc/hadoop
Configuration files for Hadoop, HDFS, and YARN

Here are a few notable files

  1. : controls the logging level for services such as HDFS and YARN
  2. core-site.xml : tells the Hadoop daemons where the NameNode runs in the cluster
  3. hdfs-site.xml : contains the configuration settings for the HDFS daemons and also specifies the default block replication
  4. yarn-site.xml : contains configurations for YARN

Setup Hadoop HDFS

Now let’s configure and start the HDFS daemons.

 /opt/hadoop/etc/hadoop/core-site.xml should look like this


The above sets the default filesystem in Hadoop to be HDFS and reachable at localhost at port 9000
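A minimal core-site.xml consistent with that description can be generated like this. `fs.defaultFS` is the standard Hadoop property for the default filesystem. The scratch directory `CONF_OUT` is a hypothetical helper of mine, so you can review the file before copying it to /opt/hadoop/etc/hadoop/core-site.xml.

```shell
# Assumption: write to a scratch directory first, then copy the reviewed
# file into /opt/hadoop/etc/hadoop/. CONF_OUT is a hypothetical variable.
CONF_OUT="${CONF_OUT:-$(mktemp -d)}"

cat > "$CONF_OUT/core-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <!-- Default filesystem: HDFS served by the NameNode on localhost:9000 -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

echo "Wrote $CONF_OUT/core-site.xml"
```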

/opt/hadoop/etc/hadoop/hdfs-site.xml should look like this


This tells HDFS where in the local filesystem to store the NameNode’s name table, the checkpoint data, and the actual block data, and sets the replication factor to 1. (The replication factor should never be less than 3 in production)
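Here is a sketch of an hdfs-site.xml matching that description. The four property names are standard HDFS settings. The subdirectory names `name`, `namesecondary`, and `data` under /opt/hadoop/dfs are my assumptions, as is the scratch directory `CONF_OUT`.

```shell
# Assumption: write to a scratch directory first, then copy the reviewed
# file into /opt/hadoop/etc/hadoop/. CONF_OUT is a hypothetical variable.
CONF_OUT="${CONF_OUT:-$(mktemp -d)}"

cat > "$CONF_OUT/hdfs-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <!-- NameNode metadata (the name table); subdirectory name is assumed -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///opt/hadoop/dfs/name</value>
  </property>
  <!-- Secondary NameNode checkpoints; subdirectory name is assumed -->
  <property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>file:///opt/hadoop/dfs/namesecondary</value>
  </property>
  <!-- DataNode block storage; subdirectory name is assumed -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///opt/hadoop/dfs/data</value>
  </property>
  <!-- A single node can only hold one replica of each block -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

echo "Wrote $CONF_OUT/hdfs-site.xml"
```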

Now let’s take a detour in setting up the environment variables: HADOOP_HOME, PATH and JAVA_HOME

We first need to find out the location of your JDK

find /usr/lib/jvm -name java
Finding the location of the Java binaries

In the example above, it’s located at /usr/lib/jvm/java-8-openjdk-amd64

Now add the following lines to ~/.profile. This ensures the DFS and YARN start scripts are in the PATH.

export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Your .profile is executed whenever you start an interactive login shell. We will set environment variables here.

Export the variables by running

source ~/.profile

Also, add JAVA_HOME into /opt/hadoop/etc/hadoop/ . Once you remove all commented lines, the file will look like this

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
export HADOOP_OS_TYPE=${HADOOP_OS_TYPE:-$(uname -s)}

We’re almost there, but first we have to format the local filesystem to create the directories for the NameNode, Secondary NameNode, and DataNode

hdfs namenode -format
ls -ltr /opt/hadoop/dfs/
Formatting the local FS to be a NameNode

Here’s what /opt/hadoop/dfs looks like after formatting

/opt/hadoop/dfs post-formatting

Make sure you’re logged in as the user set up for passwordless login

ssh ${USER}@localhost

It’s time for the moment of truth: let’s start DFS!
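Hadoop ships a start script for the HDFS daemons in its sbin directory. Here is a minimal sketch, assuming `$HADOOP_HOME/sbin` is on your PATH as set up in ~/.profile. The `dfs_started` variable is a hypothetical flag of mine for checking the outcome.

```shell
# Assumes $HADOOP_HOME/sbin is on the PATH (set in ~/.profile earlier).
if command -v start-dfs.sh >/dev/null 2>&1; then
  start-dfs.sh             # starts NameNode, DataNode and Secondary NameNode
  hdfs dfsadmin -report    # prints capacity and the list of live DataNodes
  dfs_started=yes
else
  echo "start-dfs.sh not on PATH; run: source ~/.profile" >&2
  dfs_started=no
fi
```

The report should list exactly one live DataNode on a single-node cluster.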


If your screen looks like the picture above, then you’ve hit a major milestone! You’ve started the DFS daemons and can now access HDFS to put and delete files.

Now let’s create a user folder and write some data for a test

hdfs dfs -mkdir -p /user/${USER}
hdfs dfs -put /opt/hadoop/etc/hadoop /user/${USER}
hdfs dfs -cat /user/${USER}/hadoop/core-site.xml

Now let’s remove the data we wrote.

hdfs dfs -rm -R -skipTrash /user/${USER}/hadoop
hdfs dfs -ls /user/${USER}
Testing out HDFS file storage and retrieval capabilities
The user directory in HDFS

We have a fully functioning HDFS to read, write and store data in. Let’s proceed to configuring and running YARN.

Setup Hadoop YARN

We will now configure and run Hadoop YARN. YARN stands for Yet Another Resource Negotiator and acts as the resource manager and scheduler in the Hadoop cluster. If you run your Spark application on a YARN cluster, Spark will request resources from YARN before executing the job logic.

First let’s configure YARN. You will have to modify two files to contain the following



Note the mapreduce_shuffle value.



Note that under mapreduce.application.classpath, we provide the classes for the MapReduce application in the listed location
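Here is a sketch of the two files, assuming they are yarn-site.xml and mapred-site.xml and using the values from Hadoop’s own single-node setup guide. The scratch directory `CONF_OUT` is a hypothetical helper of mine. Copy the results into /opt/hadoop/etc/hadoop/ once reviewed.

```shell
# Assumption: write to a scratch directory first. CONF_OUT is hypothetical.
CONF_OUT="${CONF_OUT:-$(mktemp -d)}"

# yarn-site.xml: enable the shuffle service MapReduce jobs rely on.
cat > "$CONF_OUT/yarn-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF

# mapred-site.xml: run MapReduce on YARN and tell it where its jars live.
# The quoted 'EOF' keeps $HADOOP_MAPRED_HOME literal; Hadoop expands it later.
cat > "$CONF_OUT/mapred-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>
EOF

echo "Wrote yarn-site.xml and mapred-site.xml to $CONF_OUT"
```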

Let’s now start the YARN service with
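The YARN start script lives alongside the HDFS one in Hadoop’s sbin directory. Here is a minimal sketch, assuming `$HADOOP_HOME/sbin` is on your PATH. The `yarn_started` variable is a hypothetical flag of mine.

```shell
# Assumes $HADOOP_HOME/sbin is on the PATH (set in ~/.profile earlier).
if command -v start-yarn.sh >/dev/null 2>&1; then
  start-yarn.sh        # starts the ResourceManager and the NodeManager
  yarn node -list      # the single NodeManager should be listed as RUNNING
  yarn_started=yes
else
  echo "start-yarn.sh not on PATH; run: source ~/.profile" >&2
  yarn_started=no
fi
```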


If everything goes smoothly, your screen should look something like above

Let’s test YARN’s functionality by listing all applications under its management

yarn application -list
Listing YARN applications in the command line

As you can see above, we connected to the ResourceManager daemon at port 8032 and found there were 0 applications running.

That’s not very exciting :/ but we’ll run some applications in the next article once we set up Spark.

YARN also provides a webapp for you to view this information in your browser. Simply navigate to this address

You can see there are 0 apps completed and 8192 MB and 4 vCores are available to the scheduler

You can even view the node health of nodes in the cluster at the following URL

We have one running node with 8 GB of memory and 8 vCores available.
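Both pages are served by the ResourceManager web UI. The sketch below builds the two addresses, assuming the default web port 8088 has not been changed. The IP address is a placeholder from the documentation range, so substitute your instance’s public IPv4.

```shell
# Placeholder address (documentation range); use your instance's public IPv4.
PUBLIC_IP="${PUBLIC_IP:-203.0.113.10}"

# Assumption: yarn.resourcemanager.webapp.address is left at its default port.
RM_PORT=8088

RM_APPS_URL="http://${PUBLIC_IP}:${RM_PORT}/cluster"
RM_NODES_URL="http://${PUBLIC_IP}:${RM_PORT}/cluster/nodes"

echo "Applications: $RM_APPS_URL"
echo "Nodes:        $RM_NODES_URL"
```

Because the security group only admits traffic from your own IP, these pages are reachable from your browser but not from the wider Internet.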

Management of Hadoop Node

Now that you’ve successfully set up HDFS and YARN, let’s try shutting down the EC2 instance. But we can’t just stop the EC2 instance. We have to ensure YARN and HDFS shut down gracefully to avoid data corruption and loss.

First, shut down YARN


Now, stop the HDFS daemons
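Taken together, the shutdown is the startup in reverse: YARN first, then HDFS. Here is a minimal sketch, assuming Hadoop’s stop scripts are on your PATH. The `SHUTDOWN_ORDER` variable is a hypothetical helper of mine.

```shell
# Stop YARN before HDFS so no running job is left writing to a
# filesystem that is going away.
SHUTDOWN_ORDER="stop-yarn.sh stop-dfs.sh"

for script in $SHUTDOWN_ORDER; do
  if command -v "$script" >/dev/null 2>&1; then
    "$script"
  else
    echo "skipping $script: not on PATH (run: source ~/.profile)" >&2
  fi
done
```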


Now you can go to your AWS console and stop your EC2 instance. Be sure to do this or AWS will continue to bill you for the running instance


In review, you’ve successfully laid the groundwork for a Hadoop cluster by doing the following

  1. Starting an EC2 instance
  2. Setting up and running HDFS
  3. Setting up and running YARN
  4. Stopping the HDFS and YARN services
  5. Stopping the EC2 instance

Right now, all you can do is store and retrieve data in HDFS. That’s not very interesting :/ .

In the next post, I demonstrate how to set up Hive and Spark on the Hadoop cluster. Once that’s done, your cluster will be ready to process multi-gigabyte datasets.
