Create a JupyterLab notebook for Spark

This is the third article in a series to build a Big Data development environment in AWS. Before you proceed further, ensure you’ve completed the first and second articles.

What is Jupyterlab?

Imagine this – you’ve created a pipeline to clean your company’s raw data and enrich it according to business requirements. You’ve documented each table and column in excruciating detail. Finally you built a dashboard brimming with charts and insights which tell a compelling narrative of the business’ health and direction.

How do you share and present your work?

If you’ve worked as a Data Engineer or Data Scientist before, this probably sounds familiar. Your work exists as a bundle of Python/Scala scripts on your laptop. Its logic is scattered across multiple files. The visualizations are exported as PNG images in another directory.

Sure, you could commit the code to version control, but that doesn’t help you present the data. Git repos are designed to display text, not graphs and charts.

Here’s where a notebook comes in handy. To quote Project Jupyter

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

https://jupyter.org/

You tell a story when you write a notebook. Code shows how you accomplished something but the narrative text says why. Embedded visualizations transform an inscrutable 12-column table into an easily digestible bar chart. Further, notebooks play well with version control since they are saved as JSON documents.

Code, images and narrative in a single notebook. Image credit: https://jupyter.org/

Jupyter supports 3 languages at its core: Julia, Python, and R – all languages heavily used in numerical and scientific computing. Its interactive web interface permits rapid code prototyping, saving of user sessions, and display of graphics. At its heart, it uses a kernel to interact with a programming language.

In this article, I will demonstrate how to:

  • Prepare data for use with Spark and Jupyter notebooks
  • Install Jupyterlab
  • Install the Spark 3 kernel in Jupyterlab
  • Use Jupyterlab to explore data

Download and load Airlines dataset

Start your EC2 instance in the AWS console and note down the public IP address.

SSH in

ssh ubuntu@{public-ip-address}
Landing screen for the Ubuntu 18.04.5 LTS AMI

Start the DFS, YARN, and Hive metastore DB using the start_all.sh script you created previously

source /home/ubuntu/start_all.sh

The first thing we will do is download the Airline Dataset. I got it from https://github.com/h2oai/h2o-2/wiki/Hacking-Airline-DataSet-with-H2O but its origin is RITA. The full dataset lists US flights from 1987 to 2008 along with information such as departure times, arrival times, and delays.

The full dataset is 12 GB, but we will use a truncated version that contains 2k rows per year. Download it to your home directory using wget

wget https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv
The 2k airlines dataset is a paltry 4.37 MB

Now that you have the data saved at /home/ubuntu/allyears2k.csv, we will send the data to HDFS using the put command. A multi-node Spark cluster can read the data files from HDFS in parallel and speed up data loading.

hdfs dfs -mkdir -p /data/airlines
hdfs dfs -put allyears2k.csv /data/airlines
hdfs dfs -ls /data/airlines
Load the dataset into HDFS

Now let’s use the Hive CLI to create a new database called airlines_db. Notice that we are using Hive and not Spark to run the table DDL. Because we configured Spark to use the Hive data catalog, we can also access the airlines_db database and its child tables in Spark.

Launch the Hive CLI.

hive

Create the new database and use it. This sets airlines_db as the session database, meaning that queries that don’t specify a database assume the use of airlines_db.

CREATE DATABASE airlines_db;
USE airlines_db;
Create a new Hive database

Now that you set airlines_db as the session database, create the airlines table inside of it

CREATE TABLE `airlines`(
  `year` int,
  `month` int,
  `dayofmonth` int,
  `dayofweek` int,
  `deptime` int,
  `crsdeptime` int,
  `arrtime` int,
  `crsarrtime` int,
  `uniquecarrier` string,
  `flightnum` int,
  `tailnum` string,
  `actualelapsedtime` int,
  `crselapsedtime` int,
  `airtime` int,
  `arrdelay` int,
  `depdelay` int,
  `origin` string,
  `dest` string,
  `distance` int,
  `taxiin` int,
  `taxiout` int,
  `cancelled` int,
  `cancellationcode` int,
  `diverted` int,
  `carrierdelay` int,
  `weatherdelay` int,
  `nasdelay` int,
  `securitydelay` int,
  `lateaircraftdelay` int,
  `isarrdelayed` string,
  `isdepdelayed` string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ('skip.header.line.count' = '1');
Create a new table in Hive called airlines

The table DDL consists of 3 parts.

First, we have a list of column names and column types e.g. DayofMonth INT, IsArrDelayed STRING etc.

Then, we specify the storage format. In this case, it’s ROW FORMAT DELIMITED FIELDS TERMINATED BY ','. By default, Hive uses the TEXTFILE format, and in this text file we expect the data to be laid out in rows where each field is terminated by the ',' character and each row is terminated by the newline '\n' character.

Finally, TBLPROPERTIES ('skip.header.line.count' = '1') tells Hive to skip the first line of the file, since it contains the column headers rather than data.

At this point, the airlines table is empty. We want to load the data from allyears2k.csv, which we placed in HDFS earlier.
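
One way to do this is with a LOAD DATA statement from the Hive CLI, pointing at the HDFS path we used with the put command above

LOAD DATA INPATH '/data/airlines/allyears2k.csv' INTO TABLE airlines;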

Note that Hive removes /data/airlines/allyears2k.csv from its original HDFS location upon completing the load operation (LOAD DATA INPATH moves the file into the table’s warehouse directory). If you want to keep the input file in place, use an external table instead.

We can now explore the data in the airlines table

SELECT * FROM airlines LIMIT 10;
10 rows of the airlines table

We can even count the rows in airlines

SELECT COUNT(1) FROM airlines;
There are 43k+ rows in the airlines table

I mentioned previously that we are using Hive as a data catalog for Spark. This means Spark can access tables defined in the Hive metastore. Let’s verify this.

Exit the Hive CLI

exit;

Then launch the Pyspark shell.

pyspark --master yarn

We can view the databases accessible to Spark using SHOW DATABASES. The .show() call is added so the results are printed to the console.

spark.sql("SHOW DATABASES").show()
SHOW DATABASES works in Spark SQL too

Many (but not all) Hive commands also work in Spark SQL. For example, we will use the USE command to set airlines_db as the current database

spark.sql("USE airlines_db")
So does the USE command

We then use SHOW TABLES to list the tables in airlines_db. If we did not specify airlines_db as the current database, we would use SHOW TABLES FROM airlines_db. As above, we add .show() to print the results to the console.

spark.sql("SHOW TABLES").show()
Even SHOW TABLES works the same way
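
For reference, the fully qualified form mentioned above would be

spark.sql("SHOW TABLES FROM airlines_db").show()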

We can perform the same operations we did in Hive in Spark. Let’s select 10 rows and count the rows in airlines.

spark.sql("SELECT * FROM airlines LIMIT 10").show
spark.sql("SELECT COUNT(1) FROM airlines").show()

You may have noticed the commands run faster in Spark compared to Hive. This is because Spark uses an in-memory execution model. For this reason, we prefer doing data processing in Spark over Hive.

Install and Run Jupyterlab

In this section we’re going to install the Jupyterlab server. Jupyterlab is a web-based development environment for Jupyter notebooks that also supports plugins to add functionality.

Fortunately, Jupyterlab is available through pip. However, installing Jupyterlab entails installing a long list of Python dependencies. To avoid dependency conflicts, we will install Jupyterlab in a Python virtual environment created with venv.

First, we use apt-get to install venv for Python 3

sudo apt-get install python3-venv
Installing the venv tool for Python 3

Then we create a virtual environment called dl-venv. This is done using the venv module

python3 -m venv dl-venv
Create a virtual environment for Python 3

The venv module creates a directory called dl-venv in /home/ubuntu

Contents of the user’s home directory after creating a virtual env

This directory contains a Python interpreter, libraries and scripts which are isolated from the rest of the system. When you activate the virtual environment and install a new Python library, the library is installed here instead of in the system’s Python directory. When you deactivate the virtual environment, these libraries are no longer accessible to the system’s Python interpreter.

Contents of the virtual environment directory

Let’s activate the dl-venv virtual environment. venv provides a script to do so

source dl-venv/bin/activate
Activate the virtual environment
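
If you want to double-check that the environment is active, see which interpreter is now on your PATH; with dl-venv activated it should resolve to a path inside the dl-venv directory rather than the system Python

which python3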

Now that dl-venv is active, we can proceed to install Jupyterlab in an isolated Python environment

pip3 install jupyterlab
Jupyterlab requires a long list of dependencies to be installed
We’re finally done installing Jupyterlab

Once Jupyterlab and Jupyter notebook are installed, we must generate a config file. Many important settings such as the hashed user password, Jupyterlab’s IP bindings and remote access control are located in the config file.

jupyter notebook --generate-config

Now set a password (and don’t forget it!). But even if you do, you can always set a new one using this command

jupyter notebook password

Now open up /home/ubuntu/.jupyter/jupyter_notebook_config.py and uncomment and change 2 settings.

The first is to make the notebook server listen on all IPs

c.NotebookApp.ip = '0.0.0.0'

The second is to allow access from remote hosts

c.NotebookApp.allow_remote_access = True

Here’s what /home/ubuntu/.jupyter contains after configuration

Contents of .jupyter

Now for the moment of truth! Let’s start JupyterLab

jupyter lab
Jupyterlab webserver is listening on port 8888

Now I hope you remember your EC2 instance’s public IP; if not, you can get it from the AWS Console. Open your web browser and navigate to the following URL

http://{ec2-public-ip}:8888/lab

You may be prompted for your password. Enter the password you set.

Landing screen for Jupyterlab. We are missing Pyspark.

Set up the Pyspark kernel

Great stuff. You’ve successfully launched Jupyterlab and accessed the web development environment. You may have noticed the option to launch Python 3 notebooks and consoles. But we’re missing the option to launch the same for Pyspark.

Don’t worry, we’re going to install the kernel for Pyspark 3 in this section. What’s a kernel? Well,

Kernels are programming language specific processes that run independently and interact with the Jupyter Applications and their user interfaces. IPython is the reference Jupyter kernel, providing a powerful environment for interactive computing in Python.

https://jupyter.readthedocs.io/en/latest/projects/kernels.html

Aside from the core languages of Julia, Python and R, Jupyter supports other languages by adding more kernels. In our case, we’ll create a custom one since ours is a simple one-node Spark cluster. For a more complex and production-ready kernel, check out sparkmagic.

Ok, first let’s shut down the Jupyter server using Control + C and type y at the prompt

We will need to add a kernel in the dl-venv directory. For a complete treatment of creating kernels, see the Jupyter documentation on kernel specs.

Otherwise, create a new file /home/ubuntu/dl-venv/share/jupyter/kernels/pyspark3/kernel.json with the following text

{
      "argv": [
        "python",
        "-m",
        "ipykernel_launcher",
        "-f",
        "{connection_file}"
      ],
      "display_name": "Pyspark 3",
      "language": "python",
      "env": {
        "PYSPARK_PYTHON": "/usr/bin/python3",
        "SPARK_HOME": "/opt/spark3/",
        "SPARK_OPTS": "--master yarn --conf spark.ui.port=0",
        "PYTHONPATH": "/opt/spark3/python/lib/py4j-0.10.9-src.zip:/opt/spark3/python/"
      }
}

The argv section contains the command line arguments used to launch the kernel. The display_name is what will be shown in the UI as the kernel name. The language is the programming language used in the kernel. env is a dictionary of environment variables set when the kernel is launched. In our case, we set environment variables relevant for Spark.

Once we have placed kernel.json, we are ready to install the kernel

jupyter kernelspec install /home/ubuntu/dl-venv/share/jupyter/kernels/pyspark3 --user

Verify the kernel is installed by listing all installed kernels
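
The kernelspec subcommand can do this; Pyspark 3 should appear in the output alongside the default Python 3 kernel

jupyter kernelspec list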

Go ahead and restart Jupyterlab and log in at http://{ec2-public-ip}:8888/lab

jupyter lab

You will now see Pyspark 3 listed as a kernel option under Notebook and Console. Let’s create a new Pyspark 3 notebook. Click on Pyspark 3

Pyspark is now available as an option

Let’s try running the Spark SQL code we tested earlier. Previously, the Pyspark shell created the SparkSession automatically for us. We must create it manually when using Jupyterlab.

from pyspark.sql import SparkSession

spark = SparkSession. \
    builder. \
    enableHiveSupport(). \
    appName('my-demo-spark-job'). \
    master('yarn'). \
    getOrCreate()

spark.sql('SHOW DATABASES').show()

spark.sql('SELECT count(1) FROM airlines_db.airlines').show()

The first 8 lines of the notebook are dedicated to creating the SparkSession. You can learn more about the SparkSession.builder here. Notice the use of enableHiveSupport() so we can access tables defined in the Hive metastore, and the use of appName(), which sets the application’s name in YARN.

Paste the code above into the notebook’s first cell and click the play button to run the code. Wait a while as it takes time to start the Pyspark kernel.

Exploring the airlines table using Jupyterlab

Observe that the results of the query are printed directly below the cell containing the code. This clear separation between input and output makes the notebook easier to read and present.

If you watch the logs of the Jupyterlab application, you can see it start an instance of the Pyspark kernel. This instance executes the Spark code, and Jupyterlab prints the results in the notebook.

Jupyterlab runs Pyspark in the background

Nice work. We have a fully functional Jupyterlab deployment on which we can use a notebook to develop Pyspark code. Jupyterlab also formats results and saves your notebooks so you can return to work in progress. If you’re interested in providing Jupyter notebooks in a multi-user environment, check out Jupyterhub.

Now, let’s begin shutting down Jupyterlab and the rest of the stack. First, shut down Jupyterlab by entering Control + C in the command line and enter y when prompted.

Next, stop the HDFS, YARN and Hive metastore DB using the stop_all.sh script
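
Assuming you created stop_all.sh alongside start_all.sh in the earlier article, that looks like

source /home/ubuntu/stop_all.sh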

Finally, shut down the EC2 instance in the AWS Console. This step is critical to avoid being charged for hours you aren’t using. A t2.xlarge costs 0.2336 USD per hour, which is expensive over extended periods of time.

Closing remarks

Notebooks are an indispensable part of the modern Data Engineer’s and Data Scientist’s toolkit. Notebooks like Jupyterlab weave the data narrative and data plots into the code in way that a traditional IDE does not. The web development interface opens the possibility for remote development and even sharing of content in a richer way than a traditional version control system.

We’ve successfully set up HDFS, YARN, Hive, Spark, and Jupyterlab. This stack is sufficient for a Data Engineer developing a batch ETL job.

What if you’re working with a streaming application? Stay tuned for the next installment, where I will set up a single-node Kafka cluster and ship logs to it.
