Set up Spark and Hive for data warehousing and processing

This is the second article in a series to build a Big Data development environment in AWS.

If you’ve not read the first article, you’ll likely be confused. Please go read Create a single node Hadoop cluster

Set up Spark and Hive in the Hadoop cluster

We’ve set up the storage service HDFS and the resource manager YARN. But that’s not enough for a data warehouse.

We need technologies to store and retrieve data as tables and we need a performant data-processing framework that processes both structured and unstructured data.

That’s where Hive and Spark come in. What are they? Let’s hear it from the creators themselves

The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.

https://hive.apache.org

Hive is first and foremost data warehousing software. It supports reading, writing and managing large datasets using SQL and even supports external tables in HDFS.

You may have noticed you can use HiveQL to read data, transform it, aggregate and project columns, and write the results back to HDFS.

Then why do we need Spark? Well, Hive only works on structured data; the data must already be in a format such as CSV, TSV, Parquet or Avro.

What if we want to process unstructured data? That’s when Spark comes in

Apache Spark™ is a unified analytics engine for large-scale data processing.

https://spark.apache.org/

That’s a general description but Spark aims to be all-encompassing. It has Spark SQL for processing structured data, Spark Streaming for processing streaming data, MLlib for machine learning and GraphX for working with graph datasets.

Spark covers a wide range of data-processing use cases

It’s also blazing fast thanks to its in-memory data processing, state-of-the-art DAG scheduler, query optimizer, and physical execution engine.

That’s why I’ve chosen these 2 technologies as the data warehousing layer in our Big Data development environment.

Let’s begin setting it up!

Download and install Hive

We begin by booting up the EC2 instance created in the previous article, where we installed YARN and HDFS, and opening a shell on it

Go to the AWS console and start your EC2 instance. Be sure to note down its public IP

You can log in using the SSH command and your key pair

ssh ubuntu@{ec2-public-ip}
The Ubuntu system information screen greets you

Our first step will be to download the Apache Hive packages from downloads.apache.org

wget https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
wget downloads the file to your /home/ubuntu directory

Unpack the tar.gz file

tar xzf apache-hive-3.1.2-bin.tar.gz

You can then remove the tar.gz package and move the hive directory under /opt

You will also create a soft-link /opt/hive pointing to the hive directory

rm apache-hive-3.1.2-bin.tar.gz
sudo mv -f apache-hive-3.1.2-bin /opt
sudo ln -s /opt/apache-hive-3.1.2-bin /opt/hive
The contents of /opt after moving the hive directory in and creating a softlink

Add HIVE_HOME to your .profile file and add the Hive binaries location to PATH so that Hive executables can be run from any working directory

Add the following lines to /home/ubuntu/.profile

export HIVE_HOME=/opt/hive
export PATH=$PATH:${HIVE_HOME}/bin
Contents of ~/.profile

Set up a database for the Hive metastore

Our next step is to create a PostgreSQL DB which will persist the metadata of the Hive data warehouse.

To understand this, you must distinguish the table data which is stored in HDFS from the table metadata which is stored in the Hive metastore DB.

The former contains the table rows, such as (1,”James”,1000), (2,”Mary”,2500), (3,”Paul”,3000), while the latter contains information about the table’s structure and datatypes, such as its schema “emp_id int, name string, salary int” and its location “hdfs://user/ubuntu/spark-datawarehouse/employees/”

The PostgreSQL DB will store table schemas, table and partition locations, and other metadata used by the Hive and Spark data warehouse. We will be running the DB in Docker for ease of installation and dependency management

Because PostgreSQL will run in a container, we must install the Docker engine on the EC2 instance.

sudo apt install docker.io

Next, we will add the ubuntu user to the docker group to give it permission to access the Docker engine
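
A typical way to do this is:

sudo usermod -aG docker ubuntu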

Log out of your SSH session using the exit command and re-connect using SSH so the group change takes effect

Test that you can access the Docker engine using the CLI

docker ps
This output shows you have 0 containers running. We’ll soon change that

Now we shall pull the postgres image from dockerhub.
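
docker pull postgres
This pulls the latest official postgres image from Docker Hub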

Now that we have the postgres image stored locally, we will instantiate a postgres container named hivemetastore with the superuser password “super_secure_password” (you can change yours!) and bind the container port 5432 to host port 6432

docker create \
  --name hivemetastore \
  -p 6432:5432 \
  -e POSTGRES_PASSWORD=super_secure_password \
  postgres
The cryptic string at the end is the container ID; Docker randomly generates one per container.
We will use the friendly name hivemetastore instead

Now let’s start the hivemetastore container
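
docker start hivemetastore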

To make sure the container initiation went off without a hitch, let’s look at its logs

docker logs -f hivemetastore
“LOG: database system is ready to accept connections” means the PostgreSQL DB started successfully

We can enter the Docker container and run commands. Let’s try running the psql client.

If you’re prompted for the password, type super_secure_password (or whatever you chose)

docker exec -it hivemetastore psql -U postgres
A successful login to PostgreSQL.
psql is awaiting input

Inside psql, our first step is to create a database called metastore in which the metadata tables will be created, create the user hive with the password “super_secure_password”, and grant user hive full permissions on the metastore database

CREATE DATABASE metastore;
CREATE USER hive WITH ENCRYPTED PASSWORD 'super_secure_password';
GRANT ALL ON DATABASE metastore TO hive;

PostgreSQL lets you list databases using the “\l” command

Notice that user hive is granted full access privileges to the metastore database, just as the user postgres has full access to the other databases

We verified that the PostgreSQL DB is accessible within the container, but now we must check connectivity from outside the container

Type “\q” to leave psql and “exit” to leave the container shell

We install the psql client locally to verify connectivity from the Spark cluster

sudo apt install postgresql-client -y
Install the psql client on the EC2 instance (not inside the Docker container)

First let’s validate port 6432 is bound to the Docker container and accessible using telnet

telnet localhost 6432
We are able to connect to port 6432. Use Ctrl + ] to exit

Now we will use the psql client to connect to localhost:6432 and log in to the metastore database as the user hive

Provide the password “super_secure_password” when prompted

psql -h localhost \
  -p 6432 \
  -d metastore \
  -U hive \
  -W
We have successfully logged into the PostgreSQL container from the EC2 instance using the psql client

Configure and set up the Hive metastore

We’re still quite far from being done. To launch the Hive metastore service, we must configure the Hive metastore and initialize the Hive schema in the PostgreSQL DB

First we will configure /opt/hive/conf/hive-site.xml which contains instructions for the metastore service to connect to the PostgreSQL DB. Take note of the “super_secure_password” value under javax.jdo.option.ConnectionPassword and change it if you’ve customized it

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://localhost:6432/metastore</value>
    <description>PostgreSQL JDBC driver connection URL</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
    <description>PostgreSQL metastore driver class name</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>the username for the DB instance</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>super_secure_password</value>
    <description>the password for the DB instance</description>
  </property>
</configuration>

We’re also going to replace the Guava library under /opt/hive/lib with the more recent version that ships with Hadoop. Hive 3.1.2 bundles Guava 19, which is incompatible with Hadoop 3 and causes a NoSuchMethodError when running schematool

rm /opt/hive/lib/guava-19.0.jar
cp /opt/hadoop/share/hadoop/hdfs/lib/guava-27.0-jre.jar /opt/hive/lib/

That’s sufficient configuration for us to initialize the Hive metastore schema in Postgres

schematool -dbType postgres -initSchema
schemaTool completion screen

Let’s verify by logging into the PostgreSQL DB to view the new tables

docker exec \
    -it hivemetastore \
    psql -U postgres \
    -d metastore

Once you see the psql command line, go ahead and list the tables in the metastore database

\d
Hive creates a large number of tables in the relational database. Use \q to quit

If you’re interested in the details of each table, check out this link. There is an E-R diagram provided in the blog post.

Launch and validate Hive

We’re now ready to test Hive. Simply launch the Hive CLI from the command line

hive
Hive will greet you with a hive> prompt if initialization is successful

Let’s test a command

SHOW DATABASES;
Looks like there aren’t any databases

There aren’t any databases and tables at this moment. But don’t worry, we’ll be loading the airlines dataset in our next article.

Streamline cluster management

There are a number of services that need to be started for our Spark cluster to work: HDFS, YARN and the PostgreSQL DB in Docker. We could run the commands individually but that would be too much hassle.

Instead, we will create two scripts, start_all.sh and stop_all.sh, to launch and shut down all services at once

/home/ubuntu/start_all.sh

/opt/hadoop/sbin/start-dfs.sh
/opt/hadoop/sbin/start-yarn.sh
docker start hivemetastore
docker ps

/home/ubuntu/stop_all.sh

/opt/hadoop/sbin/stop-yarn.sh
/opt/hadoop/sbin/stop-dfs.sh

To start the services run

source ~/start_all.sh

To stop the services run

source ~/stop_all.sh

Download and install Spark 3

Great, we’re one step closer to having a functional Spark cluster. We have HDFS to store the data, YARN to manage resources, and Hive to handle table definitions and metadata

We’re ready to install the crowning jewel: Spark!

We begin with downloading Spark 3.1.1 from archive.apache.org

wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz

Next we unpack the tgz archive and remove it. We move the new Spark directory to /opt and create a soft-link /opt/spark3 that points to it

tar xzf spark-3.1.1-bin-hadoop3.2.tgz
rm spark-3.1.1-bin-hadoop3.2.tgz
sudo mv -f spark-3.1.1-bin-hadoop3.2 /opt
sudo ln -s spark-3.1.1-bin-hadoop3.2 /opt/spark3
Contents of /opt and /opt/spark3 after downloading Spark

Configure Spark 3

We will need to add the $SPARK_HOME environment variable to ~/.profile and add the Spark binaries directory to $PATH so that pyspark and spark-shell are immediately available from the command line. Finally, we make Python 3 the default interpreter for PySpark

Add these lines to ~/.profile

export SPARK_HOME=/opt/spark3
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_PYTHON=python3
The contents of ~/.profile

Ensure that /opt/spark3/conf/spark-env.sh has these environment values. Spark uses them to find the Hadoop libraries and configuration when creating the SparkContext
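
At a minimum, spark-env.sh needs to point Spark at the Hadoop configuration. A minimal sketch, assuming Hadoop is installed under /opt/hadoop as in the previous article:

# location of core-site.xml, hdfs-site.xml, yarn-site.xml, etc.
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export YARN_CONF_DIR=/opt/hadoop/etc/hadoop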

Next we edit values in /opt/spark3/conf/spark-defaults.conf. These values are used as defaults when building the SparkContext in an application.

These configs tell Spark to use the YARN cluster manager by default, where to store Spark logs in HDFS, where Spark jars can be found in HDFS, and so on. A full list of settings can be found here

/opt/spark3/conf/spark-defaults.conf

spark.driver.extraJavaOptions     -Dderby.system.home=/tmp/derby/
spark.sql.repl.eagerEval.enabled  true
spark.master                      yarn
spark.eventLog.enabled            true
spark.eventLog.dir                hdfs:///spark3-logs
spark.history.provider            org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory     hdfs:///spark3-logs
spark.history.fs.update.interval  10s
spark.history.ui.port             18080
spark.yarn.historyServer.address  localhost:18080
spark.yarn.jars                   hdfs:///spark3-jars/*.jar

We will also add the following property to /opt/hive/conf/hive-site.xml. It disables the metastore schema version check, which would otherwise stop Spark’s bundled (older) Hive client from talking to our Hive 3 metastore

  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
The final version of hive-site.xml looks like this

For Spark to access the Hive Metastore tables, it must read hive-site.xml configurations. We create a soft-link in /opt/spark3/conf to allow this

sudo ln -s /opt/hive/conf/hive-site.xml /opt/spark3/conf/
The contents of /opt/spark3/conf after creating a soft link to hive-site.xml

We also create directories in HDFS to store Spark logs and jars, then copy the Spark jars into HDFS. The jars directory is referenced in the Spark property spark.yarn.jars, which allows Spark to find its jars when running in cluster mode

hdfs dfs -mkdir /spark3-jars
hdfs dfs -mkdir /spark3-logs
hdfs dfs -put /opt/spark3/jars/* /spark3-jars

Finally, we download the PostgreSQL JDBC driver jar into the /opt/spark3/jars folder. This allows Spark to access the PostgreSQL DB to read the metastore tables.
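
One way to fetch it (the driver version below is just an example; any recent PostgreSQL JDBC release should work):

wget https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.20/postgresql-42.2.20.jar
sudo mv postgresql-42.2.20.jar /opt/spark3/jars/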

Validate Spark 3

Here comes the moment of truth: we’ve set up YARN and HDFS, configured Hive and the metastore, and downloaded, configured and installed Spark.

Now it’s time to run Spark! First, check that $SPARK_HOME/bin is part of your $PATH
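
A quick way to check (reloading ~/.profile in case you only just edited it):

source ~/.profile
which pyspark
which pyspark should print /opt/spark3/bin/pyspark if the PATH is set up correctly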

Ensure that HDFS and YARN are up by running start_all.sh

source ~/start_all.sh

Let’s first try launching pyspark. For the uninitiated, PySpark is the Python interpreter with a SparkSession and SparkContext prepared at launch. You can use spark and sc to access the Spark DataFrame and RDD APIs while still having the full Python 3 REPL available.

I find it useful for testing a code snippet before integrating it into a larger script.

Let’s launch PySpark with YARN as the cluster manager. The YARN ResourceManager allocates resources in the form of containers, which the application uses to execute its tasks.

pyspark --master yarn
Python REPL with the SparkSession created as “spark” and the SparkContext created as “sc”
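
Once the >>> prompt appears, you can run a quick sanity check. A minimal example using the pre-created spark and sc objects:

spark.range(5).show()            # DataFrame API via the SparkSession
sc.parallelize([1, 2, 3]).sum()  # RDD API via the SparkContext; returns 6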

For completeness, let’s also launch spark-shell with YARN as the cluster manager. spark-shell is simply a Scala REPL combined with a SparkSession and SparkContext and offers an alternative to PySpark
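
spark-shell --master yarn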

Scala REPL with SparkSession and SparkContext pre-created

When would you choose Scala over Python? Well, Spark is written in Scala, which means applications written in Scala can be more performant under some circumstances. In my opinion, Scala’s greatest advantage in production is resolving and compiling dependencies into a JAR file at build time, versus Python where you have to ship a zip file of dependencies at runtime.

Python’s approach is more fragile and makes the code less portable. You are likely to encounter issues when moving Python code from one Spark cluster to another where some Python libraries are not available. Scala side-steps this by having the classes built into the JAR file.

Congratulations! You have a fully-fledged Spark cluster. Welcome to the big leagues – your cluster is capable of processing gigabytes of data.

Now, to avoid spending unnecessary money, we’re going to need to shut our server down.

First, shut down the YARN and HDFS services using the stop_all script

source ~/stop_all.sh

Then shut down your EC2 instance in the AWS console

Closing statements

We’ve come a long way. In the previous article, we set up storage in the form of HDFS and a resource manager called YARN. In this article we set up a Hive metastore service in order to support Spark’s data warehouse function.

We’re close to having a complete Big Data development environment but we’re missing a few pieces

In the next article, we will load data into our Hive data warehouse and use JupyterLab to explore it.

Stay tuned!

6 thoughts on “Set up Spark and Hive for data warehousing and processing”

  1. Hello Norman, thanks a lot for your great work. I have some doubts about the docker usage.
    1. In practice, should we also use docker containers for other components like HDFS, Spark and Hive?
    2. If so, how should the containers be organized in practice? Should HDFS, Spark and Hive each have their own container, or can we put them together?


    1. Hello Steve,

      I assume you’re assembling a dev (non-prod) environment. Yes, using containerisation tools such as Docker will greatly simplify deployment of HDFS, Spark and Hive.

      The tools should be deployed in their own containers and orchestrated using docker-compose. In addition, docker-compose helps with inter-container networking and mounting of local volumes for persisting data. See https://github.com/panovvv/bigdata-docker-compose/blob/master/docker-compose.yml

      If you are considering a containerised deployment for Spark and Hive, I recommend you look into Kubernetes. The latest version of Spark supports using Kubernetes as a cluster manager (https://spark.apache.org/docs/latest/running-on-kubernetes.html) . Starburst offers a helm chart for the hive metastore (https://docs.starburst.io/starburst-enterprise/k8s/deploy-hms.html)

