This is the second article in a series to build a Big Data development environment in AWS.
If you’ve not read the first article, you’ll likely be confused. Please go read Create a single node Hadoop cluster

We’ve set up the storage service HDFS and the resource manager YARN. But that’s not enough for a data warehouse.
We need technologies to store and retrieve data as tables and we need a performant data-processing framework that processes both structured and unstructured data.
That’s where Hive and Spark comes in. What are they? Let’s hear it from the creators themselves
The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.
https://hive.apache.org
Hive is first and foremost a data warehousing software. It supports read, writing and managing large datasets using the SQL language and even supports external tables in HDFS.
You may have noticed you can use HiveSQL to read data, transform data, aggregate and project columns and write them back to HDFS.
Then why do we need Spark? Well Hive only works on structured data, the data must already be in a format such as CSV, TSV, Parquet or Avro.
What if we want to process unstructured data? That’s when Spark comes in
Apache Spark™ is a unified analytics engine for large-scale data processing.
https://spark.apache.org/
That’s a general description but Spark aims to be all-encompassing. It has Spark SQL for processing structured data, Spark Streaming for processing streaming data, MLib for machine learning and GraphX for working with graph datasets.

It’s also blazing fast thanks to it’s in-memory data processing, state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
That’s why I’ve chosen these 2 technologies as the data warehousing layer in our Big Data development environment.
Let’s begin setting it up!
Download and install Hive
We begin by booting and creating a shell in the EC2 instance created in the previous article where we installed YARN and HDFS
Go the AWS console and start your EC2 instance. Be sure to note down the public IP
You can enter using the SSH command and your key-pair. Go the AWS console to
ssh ubuntu@{ec2-public-ip}

Our first step will be to download the Apache Hive packages from downloads.apache.org
wget https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz

Unpack the tar.gz file
tar xzf apache-hive-3.1.2-bin.tar.gz

You can then remove the tar.gz package and move the hive directory under /opt
You will also create a soft-link /opt/hive pointing to the hive directory
rm apache-hive-3.1.2-bin.tar.gz
sudo mv -f apache-hive-3.1.2-bin /opt
sudo ln -s /opt/apache-hive-3.1.2-bin /opt/hive

Add the HIVE_HOME to your .profile file and also add the hive binaries location to PATH so that hive executables run from any working directory
Add the following lines to /home/ubuntu/.profile
export HIVE_HOME=/opt/hive
export PATH=$PATH:${HIVE_HOME}/bin

Setup a database for Hive metastore
Our next step is to create a PostgreSQL DB which will persist the metadata of the Hive data warehouse.
To understand this, you must distinguish the table data which is stored in HDFS from the table metadata which is stored in the Hive metastore DB.
The former contains table rows such as (1,”James”,1000), (2,”Mary”,2500),(3,”Paul”,3000) while latter contains information about how a table structure and its datatypes e.g such as its schema “emp_id int, name string, salary int” and its table location “hdfs://user/ubuntu/spark-datawarehouse/employees/“
PostgreSQL DB will store table schema, table and partition location and other metadata used to in the Hive and Spark data warehouse. We will be running the DB in Docker for ease of installation and dependency management
Because PostgreSQL will run in a container, we must install the Docker engine on the EC2 instance.
sudo apt install docker.io

Next, we will add ubuntu user to the docker user group to give it permissions to access the Docker engine

Log out of your SSH session using the exit command re-enter using SSH
Test that you access the Docker engine using the CLI
docker ps

Now we shall pull the postgres image from dockerhub.

Now that we have the postgres image stored locally, we will instantiate a postgres container named hivemetastore with the root password “super_secure_password” (you can change yours!) and bind the container port 5432 to the host port of 6432
docker create \
--name hivemetastore \
-p 6432:5432 \
-e POSTGRES_PASSWORD=super_secure_password \
postgres

We will use the friendlyname hivemetastore
Now let’s start the hivemetastore container

To make sure the container initiation went off without a hitch, let’s look at its logs
docker logs -f hivemetastore

We can enter the Docker container and run commands. Let’s try running the psql client.
If you’re prompted for the password, type super_secure_password (or whatever you chose)
docker exec -it hivemetastore psql -U postgres

psql is awaiting input
Our first step is to create a database called metastore in which metadata tables will be created, create the user hive whose password is “super_secure_password” and grant user hive full permissions on the metastore database
CREATE DATABASE metastore;
CREATE USER hive WITH ENCRYPTED PASSWORD 'super_secure_password';
GRANT ALL ON DATABASE metastore TO hive;

PostgreSQL let’s you list databases using the “\l” command

the user postgres has full access to the other databases
We verified that PostgreSQL DB is accessible within the container but now we must check connectivity from outside the container
Type exit to “\q” to leave psql and “exit” to leave the container shell
We install the psql client locally to verify connectivity from the Spark cluster
sudo apt install postgresql-client -y

First let’s validate port 6432 is bound to the Docker container and accessible using telnet
telnet localhost 6432

Now we will use the psql client to connect to localhost:6432 to login into the metastore database using the user hive
Provide the password “super_secure_password” when prompted
psql -h localhost \
-p 6432 \
-d metastore \
-U hive \
-W

Configure and setup Hive metastore
We’re still quite far from being done. To launch the Hive metastore service, we must configure the hive metastore and initialize the Hive schema in the postgreSQL DB
First we will configure /opt/hive/conf/hive-site.xml which contains instructions for the metastore service to connect to the PostgreSQL DB. Take note of the “super_secure_password” value under javax.jdo.option.ConnectionPassword and change it if you’ve customized it
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:postgresql://localhost:6432/metastore</value>
<description>PostgreSQL JDBC driver connection URL</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.postgresql.Driver</value>
<description>PostgreSQL metastore driver class name</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>the username for the DB instance</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>super_secure_password</value>
<description>the password for the DB instance</description>
</property>
</configuration>
We’re also going to overwrite the guava library under /opt/hive/lib with a more recent version
rm /opt/hive/lib/guava-19.0.jar
cp /opt/hadoop/share/hadoop/hdfs/lib/guava-27.0-jre.jar /opt/hive/lib/

That’s sufficient configuration for the us to initialize the Hive Metastore schema in Postgres
schematool -dbType postgres -initSchema

Let’s verify by logging into to postgreSQL DB to view the new tables
docker exec \
-it hivemetastoer \
psql -U postgres \
-d metastore
Once you see the psql command line, go ahead and list the tables under the metastore schema
\d

If you’re interested in the details of each table, check out this link. There is an E-R diagram provided in the blog post.
Launch and validate Hive
We’re now ready to test the hive service. Simply launch it from the command line
hive

Let’s test a command
SHOW DATABASES

There aren’t any databases and tables at this moment. But don’t worry, we’ll be loading the airlines dataset in our next article.
Streamline cluster management
There are number of services that need to be started for our Spark cluster to work: HDFS, YARN and the PostgreSQL DB in Docker. We could run the commands individually but that would be too much hassle.
Instead we will create 2 scripts start_all.sh and stop_all.sh to launch and shutdown all services at once
/home/ubuntu/start_all.sh
/opt/hadoop/sbin/start-dfs.sh /opt/hadoop/sbin/start-yarn.sh docker start hivemetastore docker ps
/home/ubuntu/stop_all.sh
/opt/hadoop/sbin/stop-yarn.sh /opt/hadoop/sbin/stop-dfs.sh
To start the services run
source ~/start_all.sh

To stop the services run
source ~/stop_all.sh

Download and install Spark 3
Great, we’re one step closer to having a functional Spark cluster. We have HDFS to store the data, YARN to manage resources, Hive to handle the table definitions and metadata
We’re ready to install the crowning jewel: Spark!
We begin with downloading Spark 3.1.1 from archive.apache.org
wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz

Next we unpack the tgz archive and remove it. We move the new Spark directory to /opt and create a soft-link /opt/spark3 that points to it
tar xzf spark-3.1.1-bin-hadoop3.2.tgz
rm spark-3.1.1-bin-hadoop3.2.tgz
sudo mv -f spark-3.1.1-bin-hadoop3.2 /opt
sudo ln -s spark-3.1.1-bin-hadoop3.2 /opt/spark3

Configure Spark 3
We will need to add the $SPARK_HOME env to ~/.profile and add the Spark binaries directory to $PATH so pyspark and spark-shell are immediately available from the command line. Finally, we make Python 3 the default interpreter for Pyspark
Add these lines to ~/.profile
export SPARK_HOME=/opt/spark3
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_PYTHON=python3

Ensure that /opt/spark3/conf/spark-env.sh has these environment values. They are used for Spark to find Hadoop libraries and configuration and creating the Spark Context

Next we edit values in /opt/spark3/conf/spark-defaults.conf. These values are used as defaults when building the SparkContext in an application.
These configs tell Spark to use the YARN cluster manager by default, where to store Spark logs in HDFS and where Spark jars can be found in HDFS etc. A full list of settings can be found here
/opt/spark3/conf/spark-defaults.conf
spark.driver.extraJavaOptions -Dderby.system.home=/tmp/derby/
spark.sql.repl.eagerEval.enabled true
spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///spark3-logs
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory hdfs:///spark3-logs
spark.history.fs.update.interval 10s
spark.history.ui.port 18080
spark.yarn.historyServer.address localhost:18080
spark.yarn.jars hdfs:///spark3-jars/*.jar
We will also add the following lines to /opt/hive/conf/hive-site.xml
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>

For Spark to access the Hive Metastore tables, it must read hive-site.xml configurations. We create a soft-link in /opt/spark3/conf to allow this
sudo ln -s /opt/hive/conf/hive-site.xml /opt/spark3/conf/

We also create a directory to store Spark logs and jars in HDFS, then copy Spark jars into the HDFS. This is references in the Spark property spark.yarn.jars which allows Spark to use jars when running cluster-mode
dfs dfs -mkdir /spark3-jars
hdfs dfs -mkdir /spark3-logs
hdfs dfs -put /opt/spark3/jars/* /spark3-jars
Finally we download the PostgreSQL jar into the /opt/spark3/jars folder. This allows Spark to access the postgreSQL DB to read metastore tables.

Validate Spark 3
Here comes the moment of truth, we’ve set up YARN and HDFS and configured Hive and the metastore, then we downloaded configured and installed Spark.
Now it’s time to run Spark! First check that $SPARK_HOME/bin is part of your $PATH

Ensure that HDFS and YARN are up by running start_all.sh
source ~/start_all.sh
Let’s first try launching pyspark. For the uninitiated, Pyspark is the Python interpreter with a SparkSession and SparkContext prepared at launched. You can use spark and sc to access the Spark Dataframe and RDD APIs while still being able to use Python 3 REPL.
I find it useful for testing a code snippet before integrating it into a larger script.
Let’s launch Pyspark with YARN as the cluster manager. The Resource Manager YARN allocates resources in the form of containers which the application uses to execute its tasks.
pyspark --master yarn

For completeness, let’s also launch spark-shell with YARN as the cluster manager. spark-shell is simply a Scala REPL combined with a SparkSession and SparkContext and offers an alternative to Pyspark

When would you choose to use Scala over Python? Well, Spark is written in Scala which means applications written in Scala can be more performant under some circumstances. In my opinion, Scala’s greatest advantage in production is resolving and compiling dependencies into a JAR file at build time vs Python where you have to pass a zip file runtime.
Python’s approach is more fragile and makes the code less portable. You are likely encounter issues when moving Python code from one Spark cluster to another where some Python libraries are not available. Scala side-steps this by having the classes built into JAR file.
Congratulations! You have a fully-fledged Spark cluster. Welcome to the big leagues – your cluster is capable of processing gigabytes of data.
Now to avoid spending unnecessary money, we’re going to need shut our server down.
First shutdown the YARN and DFS services using the stop_all script
source ~/stop_all.sh

Then shutdown your EC2 instance in the AWS console to avoid spending unnecessary money
Closing statements
We’ve come a long way. In the previous article, we set up storage in the form of HDFS and a resource manager called YARN. In this article we set up a Hive metastore service in order to support Spark’s data warehouse function.
We’re close to having a complete Big Data development environment but we’re missing a few pieces
In the next article, we will setup load data into our Hive data warehouse and use Jupyterlab to explore this data.
Stay tuned!
Hello Norman, thanks a lot for your great work. I have some doubts about the docker usage.
1. In practice, should we also use docker containers for other components like HDFS, Spark and Hive?
2. If it is, what should be the proposed docker container layers normally in practice? HDFS, Spark and Hive should have their own container or we can put them together?
LikeLike
Hello Steve,
I assume you’re assembling a dev (non-prod) environment. Yes, using containerisation tools such as Docker will greatly simplify deployment of HDFS, Spark and Hive.
The tools should be deployed in their own containers and orchestrated using docker-compose. In addition, docker-compose helps with inter-container networking and mounting of local volumes for persisting data. See https://github.com/panovvv/bigdata-docker-compose/blob/master/docker-compose.yml
If you are considering a containerised deployment for Spark and Hive, I recommend you look into Kubernetes. The latest version of Spark supports using Kubernetes as a cluster manager (https://spark.apache.org/docs/latest/running-on-kubernetes.html) . Starburst offers a helm chart for the hive metastore (https://docs.starburst.io/starburst-enterprise/k8s/deploy-hms.html)
LikeLike