Do I need to have Spark on all worker machines to run spark-submit and run a Spark job on a k8s cluster within the same worker machines?

1/28/2020

I have a k8s cluster. Now I want to deploy a Spark job on it, and I'm wondering whether I need to install and configure Spark on all the worker machines.

-- semenchukou
apache-spark
kubernetes
spark-submit

1 Answer

1/29/2020

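As I understand it, that depends on which mode you use. In both modes the executors run in pods started from the container image set in spark.kubernetes.container.image, so Spark has to be in that image rather than installed on the worker machines themselves.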

You can use cluster mode to launch Spark Pi:

$ ./bin/spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=5 \
    --conf spark.kubernetes.container.image=<spark-image> \
    local:///path/to/examples.jar

The Spark master, specified either by passing the --master command line argument to spark-submit or by setting spark.master in the application's configuration, must be a URL with the format k8s://<k8s-apiserver-host>:<k8s-apiserver-port>. The port must always be specified, even if it's the HTTPS port 443. Prefixing the master string with k8s:// causes the Spark application to launch on the Kubernetes cluster, with the API server being contacted at the given URL. If no HTTP protocol is specified in the URL, it defaults to https. For example, setting the master to k8s://example.com:443 is equivalent to setting it to k8s://https://example.com:443, but to connect without TLS on a different port, the master would be set to k8s://http://example.com:8080.
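To make the defaulting rule concrete, here is a minimal sketch in POSIX shell. The function name normalize_k8s_master is hypothetical and not part of Spark; it only mirrors the rule described above (no protocol after k8s:// means https) so you can check what a given master string expands to.

```shell
# Hypothetical helper, not part of Spark: mirrors the rule that a master
# URL without an explicit protocol after k8s:// defaults to https.
normalize_k8s_master() {
  master=$1
  case "$master" in
    # Protocol already given: leave the URL unchanged.
    k8s://http://*|k8s://https://*) printf '%s\n' "$master" ;;
    # No protocol: insert https after the k8s:// prefix.
    k8s://*) printf 'k8s://https://%s\n' "${master#k8s://}" ;;
    # Anything else is not a Kubernetes master URL.
    *) echo "not a k8s master URL: $master" >&2; return 1 ;;
  esac
}

normalize_k8s_master "k8s://example.com:443"
normalize_k8s_master "k8s://http://example.com:8080"
```

The first call prints k8s://https://example.com:443 and the second prints its argument unchanged, matching the two examples in the paragraph above.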

In Kubernetes mode, the Spark application name specified by spark.app.name or the --name argument to spark-submit is used by default to name the Kubernetes resources that are created, such as drivers and executors. Application names must therefore consist of lowercase alphanumeric characters, -, and ., and must start and end with an alphanumeric character.
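If you want to check a name before submitting, the naming rule above can be sketched as a small POSIX shell function. The name is_valid_k8s_app_name is hypothetical, and the check covers only the character rules stated here; it does not check any length limit Kubernetes may impose.

```shell
# Hypothetical helper: returns 0 if the name uses only lowercase
# alphanumerics, '-' and '.', and starts and ends with an alphanumeric.
is_valid_k8s_app_name() {
  # LC_ALL=C keeps the a-z range limited to ASCII lowercase letters.
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9]([a-z0-9.-]*[a-z0-9])?$'
}

for name in spark-pi Spark_Pi; do
  if is_valid_k8s_app_name "$name"; then
    echo "$name: ok"
  else
    echo "$name: invalid"
  fi
done
```

Here spark-pi passes, while Spark_Pi fails because of the uppercase letters and the underscore.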

You can also set up client mode:

Starting with Spark 2.4.0, it is possible to run Spark applications on Kubernetes in client mode. When your application runs in client mode, the driver can run inside a pod or on a physical host. When running an application in client mode, it is recommended to account for the following factors:

Client Mode Networking

Spark executors must be able to connect to the Spark driver over a hostname and a port that is routable from the Spark executors. The specific network configuration required for Spark to work in client mode will vary per setup. If you run your driver inside a Kubernetes pod, you can use a headless service to make your driver pod routable from the executors by a stable hostname. When deploying your headless service, ensure that the service's label selector matches only the driver pod and no other pods; it is recommended to assign your driver pod a sufficiently unique label and to use that label in the label selector of the headless service. Specify the driver's hostname via spark.driver.host and the driver's port via spark.driver.port. ...
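As a rough sketch, a client-mode submission could look like the cluster-mode command above with the deploy mode changed and the two driver settings added. The service name, namespace, and port below are assumptions for illustration only; use the DNS name of your own headless service and a port that is reachable on your driver pod. The jar path must also exist wherever the driver actually runs.

```shell
# Sketch only: the service name, namespace, and port are hypothetical.
# DRIVER_HOST is the assumed DNS name of a headless service for the driver pod.
DRIVER_HOST=spark-driver-svc.default.svc.cluster.local
# DRIVER_PORT is an assumed port that executors can reach on the driver.
DRIVER_PORT=7078

./bin/spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    --deploy-mode client \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=5 \
    --conf spark.kubernetes.container.image=<spark-image> \
    --conf spark.driver.host="$DRIVER_HOST" \
    --conf spark.driver.port="$DRIVER_PORT" \
    local:///path/to/examples.jar
```

Only the executors are created as pods in this mode; the driver runs wherever you invoke spark-submit, which is why the executors need a routable host and port for it.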

The full documentation on running Spark on Kubernetes is available in the official Apache Spark docs ("Running Spark on Kubernetes").

There is also a nice explanation by Gigaspaces, "Running a Spark Job in Kubernetes".

-- Crou
Source: StackOverflow