Stratified K-fold Cross Validation for imbalanced classification tasks
2022-07-11
The default Random Forest feature importance is not reliable: Understanding Permutation Feature Importance
2022-07-14
Show all

Setup Apache Spark on a multi-node cluster

12 mins read

This article covers basic steps to install and configure Apache Spark Apache Spark 3.1.1 on a multi-node cluster which includes installing spark master and workers. I will provide step-by-step instructions to set up spark on Ubuntu. So, if you are you are looking to get your hands dirty with the Apache Spark cluster, this article can be a stepping stone for you.

So, let’s begin!

What is Apache Spark?

It is an open-source and distributed processing system used for big data workloads. Spark is a fast, general engine and powerful engine for big data processing. Apache Spark follows a master/worker architecture (two main daemons) and a cluster manager.

· Master Daemon (Master/Driver Process)

· Worker Daemon (Slave Process)

· Cluster Manager

A spark cluster has a single Master and any number of Slaves/Workers. The driver and the executors run their individual Java processes and users can run them on the same horizontal spark cluster or on separate machines i.e. in a vertical spark cluster or in a mixed machine configuration.

Few basics

Before we jump into installing Spark, let us define the terminologies that we will use in this. This will not cover advanced concepts of tuning Spark to suit the needs of a given job at hand. If you already know these, you can go ahead and skip this section.

Apache Spark is a distributed computing framework that utilizes the framework of Map-Reduce to allow parallel processing of different things. As a cluster, Spark is defined as a centralized architecture.

Centralized systems are systems that use client/server architecture where one or more client nodes are directly connected to a central server. More info here.

Our setup will work on One Master node (an EC2 Instance) and Three Worker nodes. We will use our Master to run the Driver Program and deploy it in Standalone mode using the default Cluster Manager.

Master: A master node that handles resource allocation for multiple jobs to the spark cluster. A master in Spark is defined for two reasons.

  • Identify the resource (CPU time, memory) needed to run when a job is submitted and requests the cluster manager.
  • Provide the resources (CPU time, memory) to the Driver Program that initiated the job as Executors.
  • Facilitates communication between the different Workers

Driver Program: The driver program does most of the grunt work for ONE job. It breaks apart the job or program (generally written in Python, Scala, or Java) into a Directed Acyclic Graph (DAG), downloads dependencies, assigns tasks to executors, and allows communication between the workers to share data during shuffling. Each Job has a separate Driver Program with its own SparkContext and SparkSession.

A DAG is a mathematical construct that is commonly used in computer science to represent flow of control in a program and its dependencies. DAGs contain vertices, and edges. In Spark, each Task is taken as a vertex of the graph. The edges are added based on the dependency of a task on other tasks. If there are cyclic dependencies, the DAG creation fails.

Workers: Each worker is an EC2 instance. They run the Executors. A worker can run executors of multiple jobs.

Executors: Executors are assigned tasks for a particular job. They run on a particular worker to complete the task. Job has a separate set of executors. They are run on top of a Virtual machine and identified only by the job.

Given we have a quick understanding of what Spark is, let’s spend time putting it into practice. A common image that you will see when people explain Spark is given below. This doesn’t however capture the physical components fully.

Here is a more detailed picture of what our setup will look like at the EC2 level and how you will interact with Spark and run your jobs on it.

Prerequisites

A cluster of machines connected over a network. Here is a good tutorial on creating a cluster using Virtual Box.

Ubuntu 16.04 or higher installed on virtual machines.

Note: Create a user of the same name in master and all slaves to make your tasks easier during ssh and also switch to that user in master.

Steps for installation of Apache Spark Cluster

Step 1.

Create two (or more) clones of the Oracle VM VirtualBox Machine that has been earlier created. Select the option “Generate new MAC addresses for all network adapters” in MAC Address Policy. And also choose the option “Full Clone” in the clone type.

Step 2.

Go to the settings option of virtual machines and make the following network configuration on Adapter 2.

Step 3.

You need to set the hostname of each virtual machine. Open the /etc/hostname file and type the name of the machine in it and save. Run the following command on each virtual machine:

$ sudo nano /etc/hostname

Step 4.

To figure out the IP address of the virtual machines run the following command:

$ ip addr

I run the above-mentioned command on the master and workers (slaves). On my system, I found the following IP addresses:

sp-master 192.168.56.101
sp-slave1 192.168.56.102
sp-slave2 192.168.56.103

For each machine, you will find a different IP address.

Step 5. Configure Hosts

In this step, edit the hosts file and added the IP address and hostname information, saved it, and reboot the machine. Run the following command on all the machines.

$ sudo nano /etc/hosts
$ sudo reboot

Step 6. Install Java (On Master and workers)

Run the following commands on all the Machines (master and workers/slaves).

$ sudo apt-get update
$ sudo apt install openjdk-8-jdk -y
$ java -version

Step 7. Install Scala (On Master and workers)

Install Scala on all the machines (master and the worker/slaves). Run the following command:

$ sudo apt-get install scala

To check the version of Scala, run the following command:

$ scala -version

Step 8. Install and Configure ssh (On Master only)

Now configure Open SSH server-client on master. To configure the Open SSH server-client, run the following command:

$ sudo apt-get install openssh-server openssh-client

The next step is to generate key pairs. For this purpose, run the following command:

$ ssh-keygen -t rsa -P ""

Run the following command to authorize the key:

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Now copy the content of .ssh/id_rsa.pub from master to .ssh/authorized_keys (all the workers/slaves as well as master). Run the following commands:

$ ssh-copy-id user@192.168.56.101
$ ssh-copy-id user@192.168.56.102
$ ssh-copy-id user@192.168.56.103

Note: user name and IP will be different from your machines. So, use it accordingly.

Now it’s time to check if everything is installed properly. Run the following command on the master to connect to the slaves/workers:

$ ssh sp-slave1 
$ ssh sp-slave2 

You can exit from the slave machine by typing the command:

$ exit

Step 9. Install Spark (On Master and Workers)

Download the stable version of Apache Spark. I will install spark-3.1.1 with Hadoop-3.2. To download spark-3.1.1 with Hadoop-3.2, run the following command:

$ wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz

Now run the following command to untar the spark tar file:

$ tar xvf spark-3.1.1-bin-hadoop3.2.tgz

Run the following command to move the spark files to the spark directory (/usr/local/bin):

$ sudo mv spark-3.1.1-bin-hadoop3.2 /usr/local/spark

To set up the environment for Apache Spark, we need to edit the .bashrc file. Run the following command to edit .bashrc file:

$ sudo nano ~/.bashrc

Add the following line to the file and save.

export PATH=$PATH:/usr/local/spark/bin

The above line sets the location (Path) where the spark software file is located to the PATH variable.

Run the following command to make effective changes in the .bashrc file:

$ source ~/.bashrc

Add permissions for users to the spark folder.

sudo chmod -R 777 /usr/local/spark

Step 10. (On Master only)

Now that we have Spark installed in all our machines we need to let the Master instance know about the different workers that are available. We will be using a standalone cluster manager for demonstration purposes. You can also use something like YARN or Mesos to handle the cluster. Spark has detailed notes on the different cluster managers that you can use.

Cluster Manager is an agent that works in allocating the resource requested by the master on all the workers. The cluster manager then shares the resource back to the master, which the master assigns to a particular driver program.

We need to modify /usr/local/spark/conf/spark-env.sh file by providing information about Java location and the master node’s IP. This will be added to the environment when running Spark jobs.

To set up the Apache Spark Master configuration, edit spark-env.sh file. Traverse to the spark/conf folder and make a copy of the spark-env.sh.template file as a spark-env.sh

$ cd /usr/local/spark/conf
$ cp spark-env.sh.template spark-env.sh

Now, to edit the spark-env.sh configuration file, run the following command:

$ sudo nano spark-env.sh

Add the following parameters (line of code) at the end of the file, save and exit.

# contents of conf/spark-env.sh
export SPARK_MASTER_HOST="<Master-IP>"
export SPARK_LOCAL_HOST="<Master-IP>"
export JAVA_HOME="<Path_of_JAVA_installation>"

Note: In my system, MASTER IP is “192.168.5.101” and JAVA_HOME=”/usr/lib/jvm/java-8-openjdk-amd64″. You need to use these parameters as per your system.

Add Workers or Slaves

We will also add all the IPs where the worker will be started. Open the /usr/local/spark/conf/slaves file and paste the following. Now, to edit the configuration file usr/local/spark/conf/slaves, run the following command on master:

$ sudo nano /usr/local/spark/conf/slaves

And add the master and workers/slaves’ names (given below) in the above-mentioned file, save and exit.

# contents of conf/slaves
sp-master
sp-slave1
sp-slave2

Oversubscription: It is the idea of exaggerating the number of resources that is available for use so cluster manager is in the illusion of having more resources. This forces CPU to work on a task and not sit idle during any clock cycles. You can set this by setting the SPARK_WORKER_CORES flag in spark-env.sh to a value higher than the actual number of cores.

Example: If each of our three workers have 2 cores, we have total of 6 cores available per worker. We can set the WORKER_CORES to be 3 times that to allow for oversubscription. i.e.,

export SPARK_WORKER_CORES=6

Check out other resources and options to tune here.

Step 11.

Now, our Apache Spark Cluster is readyTo start the Apache Spark cluster, run the following command on the master:

$ cd /usr/local/spark
$ ./sbin/start-all.sh

Now, to check the services started by spark, run the following command:

$ jps

Step 12. Web UI

To see the Apache Spark cluster status in the browser (UI), go to your browser and type: https://<mater-IP>:8080

In the case of my system, my master IP is 192.168.56.101.

You can see in the above snippet; we have two alive Workers are running.

Spark Master UI

http://<MASTER-IP>:8080/

Spark Application UI

http://<MASTER_IP>:4040/

Step 13.

To stop services of the Apache Spark cluster, run the following command:

$ ./sbin/stop-all.sh

Connect to Spark cluster using PySpark

In order to demonstrate the glory of distributed computation, we’re going to solve some mathematical problems. You might be familiar with the Pi number — this is the sixteenth letter of the Greek alphabet, but more importantly, it is one of the fundamental constants in our world. It is roughly equal to 3.14, it represents the ratio of the circumference of a circle to its diameter. It is an irrational number which means that it cannot be expressed by a simple fraction. It is also known as ‘infinite decimal’ — after the decimal point, digits will go on forever. The difficulty of computing Pi is directly linked to precision, the number of digits after the decimal point. We will write a program where precision will be configurable and will compare run times between standalone and cluster execution of spark depending on precision.

This explicitly tells pyspark to connect to the existing Spark master node. Another option is to stop the standalone Spark context and create another with a connection to Master. This is how it’s done in Jupyter Notebook:

from pyspark import SparkConf, SparkContext
import random as rnd

conf = SparkConf().setAppName("pi")
sc = SparkContext('spark://<master_ip>:7077', conf = conf)

SAMPLES = 100

def inside(p):     
    x, y = rnd.random(), rnd.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, SAMPLES)).filter(inside).count()
pi = 4 * count / SAMPLES

print(pi)

If you put this snippet in Jupyter Notebook and run it, you will see your application running in Spark UI.

Conclusion

In this article, I have presented step by step process to set up Apache Spark 3.1.1 cluster on Ubuntu 16.04. It is super simple to install spark. Hope you would successfully set up the Apache Spark cluster on your system.

References:

https://medium.com/ymedialabs-innovation/apache-spark-on-a-multi-node-cluster-b75967c8cb2b

https://aws.plainenglish.io/how-to-setup-install-an-apache-spark-3-1-1-cluster-on-ubuntu-817598b8e198

https://medium.com/codex/setup-a-spark-cluster-step-by-step-in-10-minutes-922c06f8e2b1

https://medium.com/@pavelpokrovsky/apache-spark-archlinux-on-laptop-part-1-73a82527ea7

https://medium.com/@pavelpokrovsky/apache-spark-archlinux-on-laptop-part-2-fde8ec7367ac

https://blog.insightdatascience.com/simply-install-spark-cluster-mode-341843a52b88

https://becominghuman.ai/real-world-python-workloads-on-spark-standalone-clusters-2246346c7040

Amir Masoud Sefidian
Amir Masoud Sefidian
Machine Learning Engineer

Comments are closed.