Starting out in Data Engineering

When I cut my teeth in Data Engineering in 2018, Apache Spark was all the rage. Spark's in-memory processing made it lightning-fast and made older frameworks such as Apache Pig obsolete.
You couldn't call yourself a Data Engineer without knowing Spark. I was a fledgling Data Engineer and it was imperative that I learn Spark, but I faced a major obstacle: I could not find a good Spark development environment.
Sure, I knew about managed solutions such as EMR and Dataproc, but they had hidden drawbacks. AWS and GCP handle the complexity of Big Data infrastructure so you can focus on writing code, but that convenience comes at the cost of knowing what's happening under the hood.

Having worked in Big Data and maintained data pipelines for 4 years, I believe being a good Data Engineer means knowing how to build and maintain Data Infrastructure. Servers and virtual machines crash all the time and cause pipelines to fail.
As a Data Engineer, you must be capable of being a first responder. You must be able to assess the situation and form an initial diagnosis of the underlying cause.
As I've mentioned in my guide to creating a single-node Presto cluster, the best way to form an intuition about a technology is to set it up and use it. There's no substitute for diving into the nitty-gritty of installing things.
Once it's running, you'll have an overview of the system and feel more accomplished than if you had simply clicked <Start Notebook> on the AWS console.
Create your own development environment
The year is now 2021 and Spark remains the gold standard in Big Data processing. YARN is gradually being replaced by Kubernetes but remains entrenched in many Big Data stacks. Kafka is a popular framework for stream processing. For these reasons, I've chosen this stack for a basic Big Data development environment.

This is the first of a four-part series intended to provide Big Data infrastructure in a cost-effective manner. Completing all four parts means you'll have hands-on experience with multiple Big Data technologies.
The first article will instruct you how to set up a Hadoop cluster including HDFS and YARN.
The second article will teach you to set up the Hive metastore service and Spark 3.
The third article will instruct you how to load and manipulate a sample dataset using Hive and Spark and run your own JupyterLab environment.
The fourth article will demonstrate how to set up a single-node Kafka broker with Zookeeper and stream logs into it.
Ready? Let’s begin creating our single-node Hadoop cluster.
Create Cloud Infrastructure
Our first step is to stand up an EC2 instance to run the Hadoop services HDFS and YARN. We will be using a t2.xlarge instance with 4 vCPUs and 16GB of memory. Hadoop thrives when there is an abundance of memory.
Assuming your t2.xlarge instance runs for 3 hours, you will incur a total charge of about USD 0.70 in the Asia Pacific (Singapore) region.
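(As a rough check on that figure: the on-demand Linux rate for a t2.xlarge in ap-southeast-1 was roughly USD 0.23 per hour at the time of writing, so 3 hours × USD 0.23/hr ≈ USD 0.70. Check the current EC2 pricing page before launching, as rates change.)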
The first step is to create a security group that allows ALL inbound traffic from your IP only. All outbound traffic is allowed. I have named my group my-spark-cluster-demo but you can use anything you like.
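If you prefer the command line over the console, a minimal sketch with the AWS CLI looks something like the following (it assumes a default VPC; the group name matches the one above and 203.0.113.10/32 is a placeholder for your own IP):
# create the security group in the default VPC
aws ec2 create-security-group --group-name my-spark-cluster-demo --description "Single-node Hadoop demo"
# allow all inbound traffic from your IP only (outbound is open by default)
aws ec2 authorize-security-group-ingress --group-name my-spark-cluster-demo --protocol all --cidr 203.0.113.10/32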

You should pick Ubuntu Server 18.04 LTS (HVM) SSD Volume Type as the AMI. Note that standard support for 18.04 LTS ends in April 2023, with extended security maintenance available until April 2028.

Pick t2.xlarge as the instance type

Be sure to set Auto-assign Public IP to Enable (You’ll need this to SSH into your instance from the Internet)
Note that the IAM role is None. You won’t need IAM permissions yet.

Set the Size to 40 GiB and choose gp2 as the volume type. We want ample storage for large datasets.

Assign the security group you created to the EC2 instance.
Pick an existing keypair or create a new one then click on Review and Launch!

Your EC2 instance will be in the Pending state. Wait till it changes to the Running state, then be sure to note down the Public IPv4 address! You'll need it to access your Spark cluster.

Now use SSH to connect to your EC2 instance (pass your keypair file with -i if it isn't already loaded in your SSH agent)
ssh ubuntu@{public-ip-address}

If you see the above, then congratulations! You’ve successfully launched an EC2 instance. Let’s proceed to install the Hadoop packages
Get Prerequisites
The first thing you’ll need to do is update the package lists in APT using
sudo apt-get update

Then since Java is required for Hadoop, install Java 8 using APT
sudo apt install openjdk-8-jre-headless -y


Verify Java is properly installed using
java -version

Setup login without passwords
Now we will set up passwordless login on our host. This is necessary to run the scripts that start services such as HDFS and YARN.
Let's generate a key pair using ssh-keygen. This will produce 2 files in /home/ubuntu/.ssh
The first is id_rsa. This is your private key and should not be shared with anyone else.
The second is id_rsa.pub. This is your public key and can be shared with another machine to authenticate your identity.
ssh-keygen


Now let’s append your newly-generated public key to /home/ubuntu/.ssh/authorized_keys. This allows the host to authenticate you without a password
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Let’s test the passwordless login.
ssh localhost
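If the test above still prompts for a password, the most common cause is overly permissive file permissions on the SSH directory; tightening them (a standard SSH fix, not specific to Hadoop) usually resolves it:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys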

Setup Hadoop
Now, let’s get to the fun part of setting up Hadoop. We’re going to use wget to download Hadoop 3.3.0 which includes the HDFS and YARN packages.
wget https://mirrors.gigenet.com/apache/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz

Once the download is complete, we can untar the package. This will uncompress the archive into a new hadoop-3.3.0 directory
tar xzf hadoop-3.3.0.tar.gz

Now we will do 3 things: first, move the hadoop-3.3.0 directory to /opt; then change the ownership of the hadoop-3.3.0 directory and all child objects to the username used to log in; finally, create a softlink /opt/hadoop which points to /opt/hadoop-3.3.0
sudo mv -f hadoop-3.3.0 /opt
sudo chown ${USER}:${USER} -R /opt/hadoop-3.3.0
sudo ln -s /opt/hadoop-3.3.0 /opt/hadoop

Let’s take a look at the contents of /opt/hadoop/etc/hadoop.
ls -ltr /opt/hadoop/etc/hadoop

Here are a few notable files
- log4j.properties : controls the logging level for services such as HDFS and YARN
- core-site.xml : informs Hadoop daemon where NameNode runs in the cluster
- hdfs-site.xml : contains the configuration settings for the HDFS daemons and also specifies the default block replication
- yarn-site.xml : contains configurations for YARN
Setup Hadoop HDFS
Now let's configure and start the HDFS daemons.
/opt/hadoop/etc/hadoop/core-site.xml should look like this
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
The above sets the default filesystem in Hadoop to HDFS, reachable on localhost at port 9000
/opt/hadoop/etc/hadoop/hdfs-site.xml should look like this
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>/opt/hadoop/dfs/namesecondary</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/opt/hadoop/dfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
We have specified where in the local filesystem the NameNode stores its name table, where the Secondary NameNode stores checkpoint data, where the DataNode stores block data, and finally set the replication factor to 1. (The replication factor should never be less than 3 in production.)
Now let's take a detour to set up the environment variables HADOOP_HOME, PATH and JAVA_HOME.
We first need to find out the location of your Java installation
find /usr/lib/jvm -name java

In the example above, it’s located at /usr/lib/jvm/java-8-openjdk-amd64
Now add the following lines to ~/.profile. This ensures the DFS and YARN start scripts are on the PATH.
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Export the variables by running
source ~/.profile
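A quick way to confirm the variables took effect (hadoop version simply prints the release and exits, so it's a harmless check):
echo $HADOOP_HOME
hadoop version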
Also, add JAVA_HOME into /opt/hadoop/etc/hadoop/hadoop-env.sh. Once you remove all the commented lines, the file will look like this
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
export HADOOP_OS_TYPE=${HADOOP_OS_TYPE:-$(uname -s)}
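At this point you can also sanity-check that Hadoop is picking up the HDFS configuration from earlier; hdfs getconf reads the config files, so it works before any daemons are running:
hdfs getconf -confKey fs.defaultFS
hdfs getconf -confKey dfs.replication
The first command should print hdfs://localhost:9000 and the second should print 1.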
We're almost there, but first we have to format the NameNode. This initializes the metadata directories under /opt/hadoop/dfs for the NameNode; the Secondary NameNode and DataNode directories are created once their daemons start.
hdfs namenode -format
ls -ltr /opt/hadoop/dfs/

Here’s what /opt/hadoop/dfs looks like after formatting

Make sure you’re logged in as the user set up for passwordless login
ssh ${USER}@localhost
It's time for the moment of truth: let's start DFS!
/opt/hadoop/sbin/start-dfs.sh

If your screen looks like the picture above then you've hit a major milestone! You've started the DFS daemons and can now access HDFS to put and delete files.
Now let’s create a user folder and write some data for a test
hdfs dfs -mkdir -p /user/${USER}
hdfs dfs -put /opt/hadoop/etc/hadoop /user/${USER}
hdfs dfs -cat /user/${USER}/hadoop/core-site.xml
Now let’s remove the data we wrote.
hdfs dfs -rm -R -skipTrash /user/${USER}/hadoop
hdfs dfs -ls /user/${USER}


We have a fully functioning HDFS to read, write and store data in. Let’s proceed to configuring and running YARN.
Setup Hadoop YARN
We will now configure and run Hadoop YARN. YARN stands for Yet Another Resource Negotiator and acts as the resource manager and scheduler in the Hadoop cluster. If you run your application on a YARN cluster, Spark will request resources from YARN before executing the job logic.
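As a preview of what that looks like (Spark itself is only installed in part two of this series, and your_job.py is just a placeholder name), a Spark job is pointed at YARN roughly like this:
spark-submit --master yarn --deploy-mode client your_job.py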
First let’s configure YARN. You will have to modify two files to contain the following
/opt/hadoop/etc/hadoop/yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>
Note the mapreduce_shuffle value: it enables the auxiliary shuffle service that MapReduce jobs use to move data between the map and reduce phases.
/opt/hadoop/etc/hadoop/mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>
Note that mapreduce.framework.name tells MapReduce to run on YARN, while mapreduce.application.classpath tells it where to find the MapReduce framework classes.
Let’s now start the YARN service with
/opt/hadoop/sbin/start-yarn.sh

If everything goes smoothly, your screen should look something like the picture above
Let’s test YARN’s functionality by listing all applications under its management
yarn application -list

As you can see above, we connected to the ResourceManager daemon at port 8032 and found there were 0 applications running.
That's not very exciting, but we'll run some applications in the next article once we set up Spark.
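If you'd like to see YARN schedule something right away, the Hadoop tarball ships with an examples jar; with the configuration above, submitting its small pi-estimation job should work (treat this as an optional sanity check rather than part of the setup):
yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar pi 2 10
Re-running yarn application -list while it executes should now show one application.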
YARN also provides a web UI where you can view this information in your browser. Simply navigate to this address
http://{ec2-public-ip}:8088/cluster

You can also view the health of the nodes in the cluster at the following URL
http://{public-ip}:8088/cluster/nodes
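The same information is exposed through the ResourceManager's REST API, which is handy for scripting; for example, run from the instance itself (assuming the default port 8088):
curl http://localhost:8088/ws/v1/cluster/info
This returns a JSON document with the cluster ID, state and Hadoop version.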

Managing the Hadoop Node
Now that you've successfully set up HDFS and YARN, let's try shutting down the EC2 instance. We can't just stop the EC2 instance, though; we have to ensure YARN and HDFS shut down gracefully to avoid data corruption and loss.
First, shut down YARN
/opt/hadoop/sbin/stop-yarn.sh

Now, stop the HDFS daemons
/opt/hadoop/sbin/stop-dfs.sh
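If you want to confirm nothing is still running before stopping the instance, jps lists the Java processes for your user. Note that jps ships with the full JDK (for example the openjdk-8-jdk-headless package), not the headless JRE we installed earlier, so install that first if you want to use it:
jps
After a clean shutdown it should list no NameNode, DataNode, ResourceManager or NodeManager processes.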

Now you can go to your AWS console and stop your EC2 instance. Be sure to do this or AWS will continue to bill you for the running instance
Conclusion
In review, you’ve successfully laid the groundwork for a Hadoop cluster by doing the following
- Started an EC2 instance
- Set up and ran HDFS
- Set up and ran YARN
- Stopped the HDFS and YARN services
- Stopped the EC2 instance
Right now, all you can do is store and retrieve data in HDFS. That's not very interesting.
In the next post, I'll demonstrate how to set up Hive and Spark on the Hadoop cluster. Once that's done, your cluster will be ready to process multi-gigabyte datasets.