Apache Spark is an analytics engine that allows users to easily process large volumes of data. It has a diverse set of high-level APIs and tools like Spark SQL, Structured Streaming, and MLLib that help with effectively processing both batch and real-time data feeds.
It has become one of the best known tools currently available to both data engineers and data scientists. Data engineers use Apache Spark for building data pipelines, churning through terabytes sometimes even petabytes of data. Data scientists have taken interest in using Apache Spark for their exploratory data analysis while also making use of it for feature engineering and its machine learning capabilities
Spark applications have a driver program/pod that interacts with a cluster manager for resources during the course of its run. Technically there are four different types of cluster managers that are available to SparkContext,
- Apache Mesos
- Hadoop YARN
Kubernetes is an open-source platform for managing and orchestrating the execution of containerized workloads and services across a cluster of multiple machines. Kubernetes is available in many different flavors such as GKE, Amazon’s Elastic Kubernetes Service (EKS), and Red Hat’s Openshift. It can also be deployed in a local terminal using minikube, kINd (Kubernetes in Docker).
Let’s take a closer look into the steps and issues that are usually seen when we set up and deploy spark applications via spark-submit. There is not a lot of documentation available on the public forums about setting up Spark applications on Kubernetes, except for official spark documentation and a couple of articles on running them in EKS.
Creating a service account and namespace for your pods in Kubernetes is a prerequisite for running Spark applications in Kubernetes. Since this is an extensive topic that really needs its own blog post, and there are already several blogs in the community that help with defining the manifest files, it will not be discussed in detail on this blog.
Building your spark docker image
The first step in running your Spark application is to build Spark docker image. A Spark docker image needs to have at least the Apache Spark, Hadoop, AWS Java SDK bundles as part of the image. I used python slim as my base image since, my use cases warranted a specific version of python. You could also use other images like Debian here. The most common issue at this step is to identify the right dependencies between Spark, Hadoop, and AWS Java SDK versions in the docker image. I recommend relying on dependencies listed out for Hadoop and AWS in the maven repository to identify other package dependencies. I used the following versions in my docker image:
Since python slim is used as a base, Java 11 needs to be separately installed in the image.
Apache Spark official distributions are available in Apache archives and AWS SDK are available in the maven repository for use. Here it needs to be mentioned that AWS SDK jars need to be copied to the jars folder under spark-home for use in Spark applications.
Creating your own spark-submit script
By using the Spark docker image we built-in the first step. Official Spark documentation for Kubernetes suggests using the below command to run spark apps via spark-submit:
I wish it was easy to trigger a spark-submit job in the production EKS cluster. Since AWS eks clusters use IRSA (IAM roles for service accounts) for fine-grained access control, we’ll need to include a set of additional configs to the above spark-submit step.
The most important ones that need to be included are the configs for the AWS credentials provider, S3 file system, event logging onto S3 buckets, Kubernetes authentication drivers, and the ca certificate that needs to be used in EKS. Below is a sample command that can be used for creating a spark session for spark applications in Kubernetes.
Monitoring Spark applications in Kubernetes.
Spark applications can run either from data processing hosts such as airflow workers or run directly from their own driver pods. This is controlled by deploy mode config on the Spark session command. In the above example, you can see that client (i.e. the processing host) is used as a driver pod.
When Spark applications run in a Kubernetes pod, it’s possible to stream the logs from the pod using the below command.
kubectl -n=<namespace> logs -f <driver-pod-name>
Logs can also be seen via the Kubernetes dashboard for the cluster.
Spark application driver UI is made available at port 4040 of the driver pod. I personally prefer to forward the logs for a running application in the driver pod to my local for debugging.
kubectl port-forward <driver-pod-name> 4040:4040
In the case of more than one Spark app running from the driver pod, note that driver UI for subsequent applications will be available at the next set of port numbers such as 4041,4042,4043, etc.
Setting up Spark history server for Kubernetes.
Since spark application logs in Kubernetes dashboard are available only during the runtime of an app, logs need to be stored in a central location like as s3, or hdfs for later reference. In the above example, the Spark session writes its logs onto an s3 location and has the event logs config enabled. A Spark docker image initially built for a Spark session can also be used for spark history server.
Spark history server needs to be created as deployment in EKS cluster with access to s3 location that has event logs from Spark applications.
Deploying the above config YAML results in a pod created for running the Spark history server in EKS cluster. One will also need ingress and service defined in the Kubernetes cluster for the Spark history UI to be accessed — Spark history server uses 18080 port by default.
Running spark on Kubernetes offers several advantages like better isolation of workloads, and flexible/elastic deployment in Kubernetes allowing for easy scaling and administration. There’s no need to maintain standing Hadoop clusters or permanent EMR clusters to run Spark apps. Setting up Spark on Kubernetes improves the productivity of developers and data scientists as they are no longer resource-constrained.
I Hope this article gives you details on steps needed for setting up your own spark on Kubernetes cluster apps in an AWS EKS cluster.