I am running Spark 2.3 jobs on a Kubernetes cluster.
kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-09T21:51:06Z", GoVersion:"go1.9.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.3", GitCommit:"f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd", GitTreeState:"clean", BuildDate:"2017-11-08T18:27:48Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
When I run spark-submit against the Kubernetes master, the driver pod gets stuck in the Waiting: PodInitializing state.
This happens when I submit jobs almost in parallel, i.e. 5 jobs one right after the other.
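For reference, each job is submitted roughly like this (reconstructed from the driver pod's SPARK_JAVA_OPT_* environment shown further down; the image name, jar locations and main class are redacted placeholders, so treat this as a sketch rather than the exact command):

bin/spark-submit \
  --master k8s://https://kubernetes.default \
  --deploy-mode cluster \
  --name accelerate-testing-2 \
  --class com.myclass \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.container.image=****:v2.3.0 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.driver.memory=2g \
  --conf spark.executor.instances=10 \
  --conf spark.executor.cores=2 \
  --conf spark.executor.memory=10g \
  --conf spark.executor.memoryOverhead=2g \
  --jars s3a://my/my1.jar \
  s3a://my/my.jar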
I ran kubectl describe node on the node where the driver pod is running; the output is below. I do see some over-commit on resources there, but I expected the Kubernetes scheduler not to schedule a pod if the node's resources are over-committed or the node is in the NotReady state. In this case the node is Ready, but I see the same behavior when the node is NotReady.
Name: **********
Roles: worker
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/hostname=****
node-role.kubernetes.io/worker=true
Annotations: node.alpha.kubernetes.io/ttl=0
volumes.kubernetes.io/controller-managed-attach-detach=true
Taints: <none>
CreationTimestamp: Tue, 31 Jul 2018 09:59:24 -0400
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Tue, 14 Aug 2018 09:31:20 -0400 Tue, 31 Jul 2018 09:59:24 -0400 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Tue, 14 Aug 2018 09:31:20 -0400 Tue, 31 Jul 2018 09:59:24 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 14 Aug 2018 09:31:20 -0400 Tue, 31 Jul 2018 09:59:24 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
Ready True Tue, 14 Aug 2018 09:31:20 -0400 Sat, 11 Aug 2018 00:41:27 -0400 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: *****
Hostname: ******
Capacity:
cpu: 16
memory: 125827288Ki
pods: 110
Allocatable:
cpu: 16
memory: 125724888Ki
pods: 110
System Info:
Machine ID: *************
System UUID: **************
Boot ID: 1493028d-0a80-4f2f-b0f1-48d9b8910e9f
Kernel Version: 4.4.0-1062-aws
OS Image: Ubuntu 16.04.4 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://Unknown
Kubelet Version: v1.8.3
Kube-Proxy Version: v1.8.3
PodCIDR: ******
ExternalID: **************
Non-terminated Pods: (11 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
kube-system calico-node-gj5mb 250m (1%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-proxy-**************************************** 100m (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system prometheus-prometheus-node-exporter-9cntq 100m (0%) 200m (1%) 30Mi (0%) 50Mi (0%)
logging elasticsearch-elasticsearch-data-69df997486-gqcwg 400m (2%) 1 (6%) 8Gi (6%) 16Gi (13%)
logging fluentd-fluentd-elasticsearch-tj7nd 200m (1%) 0 (0%) 612Mi (0%) 0 (0%)
rook rook-agent-6jtzm 0 (0%) 0 (0%) 0 (0%) 0 (0%)
rook rook-ceph-osd-*****-gwb8j 0 (0%) 0 (0%) 0 (0%) 0 (0%)
spark accelerate-test-5-a3bfb8a597e83d459193a183e17f13b5-exec-1 2 (12%) 0 (0%) 10Gi (8%) 12Gi (10%)
spark accelerate-testing-1-8ed0482f3bfb3c0a83da30bb7d433dff-exec-5 2 (12%) 0 (0%) 10Gi (8%) 12Gi (10%)
spark accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver 1 (6%) 0 (0%) 2Gi (1%) 2432Mi (1%)
spark accelerate-testing-2-e8bd0607cc693bc8ae25cc6dc300b2c7-driver 1 (6%) 0 (0%) 2Gi (1%) 2432Mi (1%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
7050m (44%) 1200m (7%) 33410Mi (27%) 45874Mi (37%)
Events: <none>
kubectl describe pod on the stuck driver gives the message below:
Name: accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver
Namespace: spark
Node: ****
Start Time: Mon, 13 Aug 2018 16:18:34 -0400
Labels: launch-id=k8s-submit-service-cddf45ff-0d88-4681-af85-d8ed0359ce73
spark-app-selector=spark-63f536fd87f8457796802767922ef7d9
spark-role=driver
Annotations: spark-app-name=accelerate-testing-2
Status: Pending
IP:
Init Containers:
spark-init:
Container ID:
Image: ****:v2.3.0
Image ID:
Port: <none>
Args:
init
/etc/spark-init/spark-init.properties
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/etc/spark-init from spark-init-properties (rw)
/var/run/secrets/kubernetes.io/serviceaccount from spark-token-mj86g (ro)
/var/spark-data/spark-files from download-files-volume (rw)
/var/spark-data/spark-jars from download-jars-volume (rw)
Containers:
spark-kubernetes-driver:
Container ID:
Image: ******:v2.3.0
Image ID:
Port: <none>
Args:
driver
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Limits:
memory: 2432Mi
Requests:
cpu: 1
memory: 2Gi
Environment:
SPARK_DRIVER_MEMORY: 2g
SPARK_DRIVER_CLASS: com.myclass
SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP)
SPARK_MOUNTED_CLASSPATH: /var/spark-data/spark-jars/quantum-workflow-2.2.24.0-SNAPSHOT-assembly.jar:/var/spark-data/spark-jars/my.jar
SPARK_MOUNTED_FILES_DIR: /var/spark-data/spark-files
SPARK_JAVA_OPT_0: -Dspark.kubernetes.container.image=***
SPARK_JAVA_OPT_1: -Dspark.jars=s3a://my/my.jar,s3a://my/my1.jar
SPARK_JAVA_OPT_2: -Dspark.submit.deployMode=cluster
SPARK_JAVA_OPT_3: -Dspark.driver.blockManager.port=7079
SPARK_JAVA_OPT_4: -Dspark.executor.memory=10g
SPARK_JAVA_OPT_5: -Dspark.app.id=spark-63f536fd87f8457796802767922ef7d9
SPARK_JAVA_OPT_6: -Dspark.kubernetes.authenticate.driver.serviceAccountName=spark
SPARK_JAVA_OPT_7: -Dspark.master=k8s://https://kubernetes.default
SPARK_JAVA_OPT_8: -Dspark.driver.host=spark-1534191513364-driver-svc.spark.svc
SPARK_JAVA_OPT_9: -Dspark.executor.cores=2
SPARK_JAVA_OPT_10: -Dspark.kubernetes.executor.podNamePrefix=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba
SPARK_JAVA_OPT_11: -Dspark.driver.port=7078
SPARK_JAVA_OPT_12: -Dspark.kubernetes.namespace=spark
SPARK_JAVA_OPT_13: -Dspark.executor.memoryOverhead=2g
SPARK_JAVA_OPT_14: -Dspark.kubernetes.initContainer.configMapName=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-init-config
SPARK_JAVA_OPT_15: -Dspark.kubernetes.initContainer.configMapKey=spark-init.properties
SPARK_JAVA_OPT_16: -Dspark.executor.instances=10
SPARK_JAVA_OPT_17: -Dspark.memory.fraction=0.6
SPARK_JAVA_OPT_18: -Dspark.driver.memory=2g
SPARK_JAVA_OPT_19: -Dspark.kubernetes.driver.pod.name=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver
SPARK_JAVA_OPT_20: -Dspark.app.name=accelerate-testing-2
SPARK_JAVA_OPT_21: -Dspark.kubernetes.driver.label.launch-id=********
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from spark-token-mj86g (ro)
/var/spark-data/spark-files from download-files-volume (rw)
/var/spark-data/spark-jars from download-jars-volume (rw)
Conditions:
Type Status
Initialized False
Ready False
PodScheduled True
Volumes:
spark-init-properties:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-init-config
Optional: false
download-jars-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
download-files-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
spark-token-mj86g:
Type: Secret (a volume populated by a Secret)
SecretName: spark-token-mj86g
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SandboxChanged 44m (x518 over 18h) kubelet, **************************** Pod sandbox changed, it will be killed and re-created.
Warning FailedSync 19s (x540 over 18h) kubelet, **************************** Error syncing pod
I have also tried kubectl top nodes; none of the nodes are over-committed on actual resource usage.
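Is there anything else worth checking here? The only other things I can think of are the init container's logs and the events in the spark namespace, along the lines of (driver pod name as in the describe output above):

kubectl logs accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver -c spark-init -n spark
kubectl get events -n spark --sort-by=.metadata.creationTimestamp

but since the init container never leaves the Waiting state, I don't expect much from its logs.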