Spark needs a lot of resources to do its job, and Kubernetes is a great environment for resource management. How many Spark pods should you run per node to get the best resource utilization?
I am trying to run a Spark cluster on a Kubernetes cluster.
It depends on many factors. First you need to know how many resources you have and how much of them your pods consume. To find out, set up Metrics Server.
Metrics Server is a cluster-wide aggregator of resource usage data.
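Once you know a node's allocatable resources and what one pod requests, a first estimate of pods per node is plain arithmetic. The sketch below is only an illustration: the function name pods_per_node and all the numbers are hypothetical, and it assumes identical pods and ignores system pods and other overhead.

```shell
# Estimate how many identical pods fit on one node.
# $1 node allocatable CPU (millicores), $2 node allocatable memory (MiB)
# $3 pod CPU request (millicores),      $4 pod memory request (MiB)
pods_per_node() {
  by_cpu=$(( $1 / $3 ))   # integer division rounds down
  by_mem=$(( $2 / $4 ))
  # The scarcer resource sets the limit.
  if [ "$by_cpu" -lt "$by_mem" ]; then
    echo "$by_cpu"
  else
    echo "$by_mem"
  fi
}

# Hypothetical node: 15500m CPU, 60000Mi memory allocatable.
# Hypothetical Spark executor pod: 2000m CPU, 9000Mi memory requested.
pods_per_node 15500 60000 2000 9000   # prints 6 (memory is the bottleneck)
```

Compare the requests you plug in here with the usage that Metrics Server actually reports, and adjust them if the pods use much less or much more than they ask for.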
The next step is to set up the Horizontal Pod Autoscaler (HPA).
The Horizontal Pod Autoscaler automatically scales the number of pods in a replication controller, deployment, or replica set based on observed CPU utilization or other custom metrics. HPA normally fetches metrics from a series of aggregated APIs (metrics.k8s.io, custom.metrics.k8s.io, and external.metrics.k8s.io).
How to make it work?
kubectl supports HPA by default:
kubectl create - creates a new autoscaler
kubectl get hpa - lists your autoscalers
kubectl describe hpa - gets a detailed description of autoscalers
kubectl delete - deletes an autoscaler
Example: kubectl autoscale rs foo --min=2 --max=5 --cpu-percent=80
This creates an autoscaler for the replica set foo, with target CPU utilization set to 80% and the number of replicas between 2 and 5. You can and should adjust all values to your needs.
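To get a feel for what a target of 80% with 2 to 5 replicas means, here is a rough sketch of the scaling rule from the Kubernetes documentation: desired replicas = ceil(current replicas * current utilization / target utilization), clamped to the min and max. The function name hpa_desired_replicas and the sample numbers are hypothetical, and the real controller also applies a tolerance and other safeguards that this sketch leaves out.

```shell
# Simplified HPA replica calculation.
# $1 current replicas, $2 current average CPU utilization (%), $3 target utilization (%)
# $4 minimum replicas, $5 maximum replicas
hpa_desired_replicas() {
  # Integer ceiling of (current replicas * current utilization / target utilization).
  desired=$(( ($1 * $2 + $3 - 1) / $3 ))
  # Clamp the result to the configured bounds.
  if [ "$desired" -lt "$4" ]; then desired=$4; fi
  if [ "$desired" -gt "$5" ]; then desired=$5; fi
  echo "$desired"
}

# 2 replicas running at 120% of their CPU request, target 80%, bounds 2..5.
hpa_desired_replicas 2 120 80 2 5   # prints 3
```

So if the pods run hotter than the target, replicas are added until the average drops back toward 80%, but never beyond the --max value.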
The kubectl reference documentation describes the kubectl autoscale command in detail.
Please let me know if you find that useful.