Spark on Kubernetes: How to handle missing Config folder

10/15/2019

I'm trying to run Spark in a Kubernetes cluster as described here: https://spark.apache.org/docs/latest/running-on-kubernetes.html

It works fine for some basic scripts like the provided examples.

I noticed that the config folder, despite being added to the image built by "docker-image-tool.sh", is overwritten by a mount of a ConfigMap volume.

I have two questions:

  1. What sources does Spark use to generate that ConfigMap, and how do you edit it? As far as I understand, the volume gets deleted when the last pod is deleted and regenerated when a new pod is created.
  2. How are you supposed to handle the spark-env.sh script, which can't be added to a simple ConfigMap?
-- Itsmedenise
apache-spark
docker
kubernetes

1 Answer

10/15/2019

One initially non-obvious thing about Kubernetes is that changing a ConfigMap (a set of configuration values) is not detected as a change to the Deployments (how a Pod, or set of Pods, should be deployed onto the cluster) or Pods that reference it. The configuration a running Pod sees therefore stays stale until something changes its Pod spec. Meanwhile, freshly created Pods, for example ones spun up by an autoscaling event or restarted after a crash, may pick up different values, resulting in misconfiguration and unexpected behaviour across the cluster.

Note: This doesn’t impact ConfigMaps mounted as volumes, which are periodically synced by the kubelet running on each node.

To update the ConfigMap, execute:

$ kubectl replace -f file.yaml

You must create a ConfigMap before you can use it, so I recommend first modifying the ConfigMap and then redeploying the Pods that consume it.
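For reference, file.yaml here would simply be the full ConfigMap manifest with the updated values. A minimal sketch, assuming a hypothetical ConfigMap named spark-conf that carries spark-defaults.conf (the name and the contents are placeholders, not what Spark generates for you):

apiVersion: v1
kind: ConfigMap
metadata:
  name: spark-conf                   # placeholder name, match whatever your pods actually reference
  namespace: default
data:
  spark-defaults.conf: |
    spark.executor.instances    2
    spark.executor.memory       1g

After kubectl replace -f file.yaml, redeploy (or delete and recreate) the Pods that consume it so they pick up the new values.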

Note that a container using a ConfigMap as a subPath volume mount will not receive ConfigMap updates.
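To illustrate, this is roughly what such a subPath mount looks like; the Pod name, image and ConfigMap name are placeholders used only for this sketch:

apiVersion: v1
kind: Pod
metadata:
  name: spark-conf-subpath-demo      # hypothetical demo Pod
spec:
  containers:
  - name: main
    image: busybox                   # placeholder image, only here to show the mount
    command: ["sh", "-c", "cat /opt/spark/conf/spark-defaults.conf && sleep 3600"]
    volumeMounts:
    - name: config
      mountPath: /opt/spark/conf/spark-defaults.conf
      subPath: spark-defaults.conf   # mounted as a subPath: this file is NOT refreshed when the ConfigMap changes
  volumes:
  - name: config
    configMap:
      name: spark-conf               # assumes the ConfigMap sketched above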

The configMap resource provides a way to inject configuration data into Pods. The data stored in a ConfigMap object can be referenced in a volume of type configMap and then consumed by containerized applications running in a Pod.

When referencing a ConfigMap object, you can simply provide its name in the volume definition. You can also customize the path used for a specific entry in the ConfigMap.
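A minimal sketch of both variants, reusing the hypothetical spark-conf ConfigMap from above: the volume references the ConfigMap by name and is mounted as a whole directory, while the optional items list maps a single key to a custom path:

apiVersion: v1
kind: Pod
metadata:
  name: spark-conf-volume-demo       # hypothetical demo Pod
spec:
  containers:
  - name: main
    image: busybox                   # placeholder image
    command: ["sh", "-c", "ls /opt/spark/conf && sleep 3600"]
    volumeMounts:
    - name: config
      mountPath: /opt/spark/conf     # every projected key appears as a file in this directory
  volumes:
  - name: config
    configMap:
      name: spark-conf               # reference the ConfigMap simply by name
      items:                         # optional: pick specific keys and customize their paths
      - key: spark-defaults.conf
        path: spark-defaults.conf

Unlike the subPath variant above, files mounted this way are eventually refreshed when the ConfigMap changes, subject to the sync delay described next.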

When a ConfigMap already being consumed in a volume is updated, projected keys are eventually updated as well. The kubelet checks whether the mounted ConfigMap is fresh on every periodic sync. However, it uses its local TTL-based cache to get the current value of the ConfigMap. As a result, the total delay from the moment the ConfigMap is updated to the moment new keys are projected to the Pod can be as long as the kubelet sync period (1 minute by default) plus the TTL of the ConfigMap cache in the kubelet (1 minute by default).

What I strongly recommend, though, is to use the Kubernetes Operator for Spark. It supports mounting volumes and ConfigMaps into Spark pods to customize them, a feature that is not available in Apache Spark as of version 2.4.

A SparkApplication can specify a Kubernetes ConfigMap storing Spark configuration files such as spark-env.sh or spark-defaults.conf using the optional field .spec.sparkConfigMap, whose value is the name of the ConfigMap. The ConfigMap is assumed to be in the same namespace as the SparkApplication.

Spark on Kubernetes also provides configuration options for mounting certain volume types into the driver and executor pods. Volumes are "delivered" from the Kubernetes side, but they can serve as local storage in Spark. If no volume is set as local storage, Spark uses temporary scratch space to spill data to disk during shuffles and other operations. When using Kubernetes as the resource manager, the pods are created with an emptyDir volume mounted for each directory listed in spark.local.dir or the environment variable SPARK_LOCAL_DIRS. If no directories are explicitly specified, a default directory is created and configured appropriately.
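A minimal sketch of such a SparkApplication, assuming the hypothetical spark-conf ConfigMap from above holds spark-env.sh and spark-defaults.conf; the image, version, service account and application jar path are placeholders in the style of the operator's examples, not something prescribed by your setup:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.4"     # placeholder image
  sparkVersion: "2.4.4"
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar"
  sparkConfigMap: spark-conf                      # ConfigMap with spark-env.sh / spark-defaults.conf, same namespace
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: spark                         # placeholder service account with pod-creation rights
  executor:
    cores: 1
    instances: 1
    memory: "512m"

The operator mounts the named ConfigMap into both the driver and executor pods, which addresses the spark-env.sh question without having to modify the generated ConfigMap by hand.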

Useful blog: spark-kubernetes-operator.

-- MaggieO
Source: StackOverflow