I'm trying to run H2O Sparkling Water (SW) on Kubernetes following the steps in the documentation.
I launch a test SW app:
$ bin/spark-submit \
--master k8s://$KUBERNETES_ENDPOINT \
--deploy-mode cluster \
--class ai.h2o.sparkling.InitTest \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.32.0.2-1-2.4 \
--conf spark.executor.instances=3 \
local:///opt/sparkling-water/tests/initTest.jar
It seems that the H2O Flow UI is running correctly, as I can access it after doing:
$ kubectl port-forward ai-h2o-sparkling-inittest-1606331533023-driver 54322:54322
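The cloud status reported by H2O can also be checked over the same forwarded port (this assumes the Flow proxy on 54322 serves the H2O REST API as well; the GET /3/Cloud endpoint shows up in the driver logs below):
# Query the H2O REST API for the cloud status; the JSON response reports the cloud size
$ curl -s http://localhost:54322/3/Cloud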
Looking at the logs of the Sparkling Water driver pod, I see the following:
$ kubectl logs ai-h2o-sparkling-inittest-1606331533023-driver
20/11/25 19:14:14 INFO SignalUtils: Registered signal handler for INT
20/11/25 19:14:22 INFO Server: jetty-9.4.z-SNAPSHOT; built: 2018-06-05T18:24:03.829Z; git: d5fc0523cfa96bfebfbda19606cad384d772f04c; jvm 1.8.0_275-b01
20/11/25 19:14:23 INFO ContextHandler: Started a.h.o.e.j.s.ServletContextHandler@5af7a7{/,null,AVAILABLE}
20/11/25 19:14:23 INFO AbstractConnector: Started ServerConnector@63f4e498{HTTP/1.1,[http/1.1]}{0.0.0.0:54321}
20/11/25 19:14:23 INFO Server: Started @90939ms
20/11/25 19:14:23 INFO RestApiUtils: H2O node http://10.244.1.4:54321/3/Cloud successfully responded for the GET.
20/11/25 19:14:23 INFO H2OContext: Sparkling Water 3.32.0.2-1-2.4 started, status of context:
Sparkling Water Context:
* Sparkling Water Version: 3.32.0.2-1-2.4
* H2O name: root
* cluster size: 2
* list of used nodes:
(executorId, host, port)
------------------------
(0,10.244.1.4,54321)
(1,10.244.0.10,54321)
------------------------
Open H2O Flow in browser: http://ai-h2o-sparkling-inittest-1606331533023-driver-svc.default.svc:54321 (CMD + click in Mac OSX)
Exception in thread "main" java.lang.RuntimeException: H2O cluster should be of size 3 but is 2
at ai.h2o.sparkling.InitTest$.main(InitTest.scala:34)
at ai.h2o.sparkling.InitTest.main(InitTest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Looking at the pods created by SW, I see one stuck in Pending (it never gets to Running):
$ kubectl get pods
NAME                                             READY   STATUS    RESTARTS   AGE
ai-h2o-sparkling-inittest-1606331533023-driver   1/1     Running   0          13m
app-name-1606331575519-exec-1                    1/1     Running   0          12m
app-name-1606331575797-exec-2                    1/1     Running   0          12m
app-name-1606331575816-exec-3                    0/1     Pending   0          12m
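Describing the pending executor should show why it cannot be scheduled; the Events section at the bottom of the output lists the scheduler's reason (e.g. a FailedScheduling event):
# Inspect the pending executor; the Events section explains why scheduling fails
$ kubectl describe pod app-name-1606331575816-exec-3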
Any ideas on how to fix this issue?
It seems this is caused by the k8s cluster not having enough CPUs (it is a small cluster): the third executor pod can never be scheduled, so only 2 executors register and H2O forms a 2-node cluster, while the test expects a cluster of size 3.
Reducing the number of executors from 3 to 2 when launching SW fixed the problem:
$ bin/spark-submit \
--master k8s://$KUBERNETES_ENDPOINT \
--deploy-mode cluster \
--class ai.h2o.sparkling.InitTest \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:3.32.0.2-1-2.4 \
--conf spark.executor.instances=2 \
local:///opt/sparkling-water/tests/initTest.jar
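To confirm that CPU capacity is really the bottleneck, the allocatable CPU on each node can be compared against what the running pods already request (a generic Kubernetes check, not specific to Sparkling Water):
# Look at the "Allocatable" and "Allocated resources" sections of each node for CPU headroom
$ kubectl describe nodes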