I have a k8s cluster. Now I want to deploy a Spark job on the cluster, and I'm wondering whether I need to install and configure Spark on all the worker machines or not.
As I understand it, it depends on which mode you will use.
You can use Cluster Mode, for example to launch Spark Pi:
$ ./bin/spark-submit \
--master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=<spark-image> \
local:///path/to/examples.jar
The Spark master, specified either via passing the `--master` command line argument to `spark-submit` or by setting `spark.master` in the application's configuration, must be a URL with the format `k8s://<api_server_host>:<k8s-apiserver-port>`. The port must always be specified, even if it's the HTTPS port 443. Prefixing the master string with `k8s://` will cause the Spark application to launch on the Kubernetes cluster, with the API server being contacted at `api_server_url`. If no HTTP protocol is specified in the URL, it defaults to `https`. For example, setting the master to `k8s://example.com:443` is equivalent to setting it to `k8s://https://example.com:443`, but to connect without TLS on a different port, the master would be set to `k8s://http://example.com:8080`.

In Kubernetes mode, the Spark application name that is specified by `spark.app.name` or the `--name` argument to `spark-submit` is used by default to name the Kubernetes resources created, like drivers and executors. So, application names must consist of lower case alphanumeric characters, `-`, and `.`, and must start and end with an alphanumeric character.
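That naming rule can be sanity-checked with a quick shell sketch; note the regex below is my own approximation of the rule as quoted, not something taken from the Spark docs:

```shell
# Approximate the quoted rule: lower-case alphanumerics, '-' and '.',
# starting and ending with an alphanumeric character.
valid='^[a-z0-9]([a-z0-9.-]*[a-z0-9])?$'

echo "spark-pi" | grep -Eq "$valid" && echo "spark-pi: ok"
echo "Spark_Pi" | grep -Eq "$valid" || echo "Spark_Pi: rejected"
```

So a name like `spark-pi` passes, while `Spark_Pi` (upper case, underscore) would be rejected when Kubernetes resources are created from it.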
You can also set up Client Mode:
Starting with Spark 2.4.0, it is possible to run Spark applications on Kubernetes in client mode. When your application runs in client mode, the driver can run inside a pod or on a physical host. When running an application in client mode, it is recommended to account for the following factors:
Client Mode Networking
Spark executors must be able to connect to the Spark driver over a hostname and a port that is routable from the Spark executors. The specific network configuration that will be required for Spark to work in client mode will vary per setup. If you run your driver inside a Kubernetes pod, you can use a headless service to allow your driver pod to be routable from the executors by a stable hostname. When deploying your headless service, ensure that the service's label selector will only match the driver pod and no other pods; it is recommended to assign your driver pod a sufficiently unique label and to use that label in the label selector of the headless service. Specify the driver's hostname via `spark.driver.host` and your Spark driver's port via `spark.driver.port`. ...
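As a sketch of that headless-service approach (the service name, label, and port below are assumptions for illustration, not values from the Spark docs):

```yaml
# Hypothetical headless Service that matches only the driver pod
# via a unique label, giving it a stable DNS name for executors.
apiVersion: v1
kind: Service
metadata:
  name: spark-driver-svc
spec:
  clusterIP: None          # headless: DNS resolves directly to the pod IP
  selector:
    spark-driver: my-app   # unique label assigned to the driver pod only
  ports:
    - name: driver-rpc
      port: 7078
```

The driver pod would then carry the `spark-driver: my-app` label, and the application would be submitted with something like `--conf spark.driver.host=spark-driver-svc --conf spark.driver.port=7078` so executors can reach it.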
The full documentation on running Spark on Kubernetes is available here.
There is also a nice explanation by Gigaspaces about Running a Spark Job in Kubernetes.