Below is a docker-compose file to set up a Spark cluster with 1 master and 2 worker nodes.
```yaml
version: '3.7'
services:
  spark-master:
    image: bitnami/spark:latest
    container_name: spark-master
    command: bin/spark-class org.apache.spark.deploy.master.Master
    ports:
      - "9090:8080"
      - "7077:7077"
  spark-worker-1:
    image: bitnami/spark:latest
    container_name: spark-worker-1
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    depends_on:
      - "spark-master"
    environment:
      SPARK_MODE: worker
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_MASTER_URL: spark://spark-master:7077
  spark-worker-2:
    image: bitnami/spark:latest
    container_name: spark-worker-2
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    depends_on:
      - "spark-master"
    environment:
      SPARK_MODE: worker
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_MASTER_URL: spark://spark-master:7077
```
This setup uses the bitnami/spark image. You can add more workers, and adjust SPARK_WORKER_CORES and SPARK_WORKER_MEMORY in the environment section to match your system's resources.
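To add a third worker, for example, you could append another service block that follows the same pattern as the two workers above (a sketch; the service and container name spark-worker-3 are placeholders):

```yaml
  spark-worker-3:
    image: bitnami/spark:latest
    container_name: spark-worker-3
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    depends_on:
      - "spark-master"
    environment:
      SPARK_MODE: worker
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_MASTER_URL: spark://spark-master:7077
```

Each additional worker registers itself with the master at spark://spark-master:7077 and shows up on the master web UI.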
After creating the docker-compose.yml file, navigate to its directory and run:
docker-compose up -d
This will start your Apache Spark containers in detached mode. Run the following command to ensure all containers are running:

docker-compose ps
Now that the Spark cluster is up and running, we need to test that it is working. Create a simple PySpark script and save it as pi.py; we will run it with the spark-submit command.
```python
# pi.py
import sys
from random import random
from operator import add

from pyspark.sql import SparkSession

if __name__ == "__main__":
    """
    Usage: pi [partitions]
    """
    spark = SparkSession \
        .builder \
        .appName("PythonPi") \
        .getOrCreate()

    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_: int) -> float:
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()
```
This is a simple PySpark application that estimates the value of Pi with a Monte Carlo simulation: it samples random points in a square and counts how many fall inside the inscribed circle.
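The same Monte Carlo idea can be sketched in plain Python, without Spark, to see why it works: a uniform point in the square [-1, 1] x [-1, 1] lands inside the unit circle with probability (circle area) / (square area) = pi / 4, so 4 times the hit fraction converges to Pi. (The function name estimate_pi is just for illustration here.)

```python
from random import Random

def estimate_pi(samples: int, seed: int = 42) -> float:
    """Estimate Pi by sampling random points in [-1, 1] x [-1, 1]
    and counting how many fall inside the unit circle."""
    rng = Random(seed)  # seeded for reproducibility
    hits = 0
    for _ in range(samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            hits += 1
    return 4.0 * hits / samples

print("Pi is roughly %f" % estimate_pi(200_000))
```

Spark's version does exactly this, but splits the sampling across the workers with parallelize and combines the per-partition counts with reduce.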
After saving the script, we need to copy it into the Spark master container. To do this, we can use the docker cp command.
docker cp -L pi.py spark-master:/opt/bitnami/spark/pi.py
Next, we need to find the address where our Spark master is running. To do that, inspect the container logs with the docker logs command.
docker logs spark-master
In the output, look for the line containing the master URL, which will look something like this:
Here, the Spark Master is running at spark://172.20.0.2:7077.
Now we can execute the PySpark script with the following command:
docker exec spark-master spark-submit --master spark://172.20.0.2:7077 pi.py
If the program runs successfully, it will print a line of the form "Pi is roughly ..." followed by the estimate.
To stop and remove the containers, type the following command:

docker-compose down