Jul 22, 2015

Running Apache Spark in Kubernetes



Yesterday, the Kubernetes community officially launched version 1.0 at OSCON. This is a pretty big milestone for us and version 1.0 offers a lot of very useful features for those looking for a container orchestration solution. I thought a good way to demonstrate some of the cool features that 1.0 provides is to show how easily and simply Apache Spark can be deployed in Kubernetes and connected to a variety of network storage providers to build analytical applications.

Thanks to Matt Farrellee and Tim St. Clair, Red Hat Emerging Technologies have already contributed a set of Pods, ReplicationControllers and Services to run Apache Spark in Kubernetes. This can be found in the github repo under examples/spark. To deploy Spark, one just needs a Spark Master Pod, a Spark Master Service and  a Spark Worker Replication Controller. 

However, the current solution is not configured to mount storage of any kind into the Spark Master or Spark Workers so it can be a little difficult to use to analyze data. The good news is that Red Hat Emerging Technologies are also actively contributing towards Kubernetes Volume Plugins which allow Pods to declaratively mount various kinds of network storage directly into a given container. This means that you can connect your Spark Containers (or any containers for that matter) to a variety of network storage using one of the Kubernetes Volume Plugins. To date, we have presently contributed Volume Plugins for Ceph, GlusterFS, ISCSI, NFS  (incl. NFS with NetApp and NFS with GlusterFS) and validated GCE Persistent Disks with SELinux and Amazon EBS Disks with SELinux, all in Kubernetes version 1.0. We also have FibreChannel, Cinder and Manila Kubernetes Volume Plugins in the works.

I've provided a demo below that shows how to run Apache Spark against data mounted using a Kubernetes Volume Plugin. Given that Apache Spark is typically used in conjunction with a Distributed File System, I've used the GlusterFS Volume Plugin as the exemplar.  I have a Pull Request submitted to Merge this example into Kubernetes but we’ve temporarily frozen Kubernetes in anticipation of our version 1.0 launch. In the interim, you can follow the guide off of my personal branch. The video below provides a walkthrough of the solution. 


No comments: