Spark 2.4 connecting to dataproc from container: java.net.UnknownHostException

2/4/2019

I'm having issues connecting Spark 2.4, running in a Docker container on Kubernetes, to a Dataproc cluster (also Spark 2.4). I'm getting "java.net.UnknownHostException" for the local Kubernetes hostname. The same network config works with Spark 2.2, so it seems like something has changed in how Spark resolves hostnames.

If I modify /etc/hosts so that the hostname and the Kubernetes FQDN point to 127.0.0.1, it works, albeit with a warning. However, that's not a workaround I'm comfortable with, as it overrides the Kubernetes-managed default entry and could potentially impact other things.
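For reference, the modified entry looks roughly like this (it shadows the Kubernetes-managed entry shown in the hosts file further down):

# workaround entry (not comfortable with this): force the pod FQDN to loopback
127.0.0.1       jupyter-21001-0.jupyter-21001.75dev-a123456.svc.cluster.local   jupyter-21001-0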

Error Stack:

(base) appuser@jupyter-21001-0:/opt/notebooks/Test$ pyspark
Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
2019-02-01 18:47:25 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2019-02-01 18:47:27 WARN  Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
2019-02-01 18:47:59 ERROR YarnClientSchedulerBackend:70 - YARN application has exited unexpectedly with state FAILED! Check the YARN application logs for more details.
2019-02-01 18:47:59 ERROR YarnClientSchedulerBackend:70 - Diagnostics message: Uncaught exception: org.apache.spark.SparkException: Exception thrown in awaitResult:
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
        at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
        at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)
        at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:514)
        at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:307)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:773)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
        at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:772)
        at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
        at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:797)
        at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:827)
        at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
Caused by: java.io.IOException: Failed to connect to jupyter-21001-0.jupyter-21001.75dev-a123456.svc.cluster.local:38831
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
        at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: jupyter-21001-0.jupyter-21001.75dev-a123456.svc.cluster.local
        at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
        at java.net.InetAddress.getAllByName(InetAddress.java:1193)
        at java.net.InetAddress.getAllByName(InetAddress.java:1127)
        at java.net.InetAddress.getByName(InetAddress.java:1077)
        at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
        at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
        at java.security.AccessController.doPrivileged(Native Method)
        at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
        at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
        at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
        at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
        at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
        at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
        at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
        at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:208)
        at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:49)
        at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:188)
        at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:174)
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
        at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
        at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:978)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:512)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:423)
        at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:482)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
        ... 1 more
 
2019-02-01 18:47:59 WARN  YarnSchedulerBackend$YarnSchedulerEndpoint:66 - Attempted to request executors before the AM has registered!
2019-02-01 18:47:59 ERROR SparkContext:91 - Error initializing SparkContext.
java.lang.IllegalStateException: Spark context stopped while waiting for backend
        at org.apache.spark.scheduler.TaskSchedulerImpl.waitBackendReady(TaskSchedulerImpl.scala:745)
        at org.apache.spark.scheduler.TaskSchedulerImpl.postStartHook(TaskSchedulerImpl.scala:191)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:560)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:238)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:745)
/opt/spark/python/pyspark/shell.py:45: UserWarning: Failed to initialize Spark session.
  warnings.warn("Failed to initialize Spark session.")
Traceback (most recent call last):
  File "/opt/spark/python/pyspark/shell.py", line 41, in <module>
    spark = SparkSession._create_shell_session()
  File "/opt/spark/python/pyspark/sql/session.py", line 583, in _create_shell_session
    return SparkSession.builder.getOrCreate()
  File "/opt/spark/python/pyspark/sql/session.py", line 173, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/opt/spark/python/pyspark/context.py", line 349, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/opt/spark/python/pyspark/context.py", line 118, in __init__
    conf, jsc, profiler_cls)
  File "/opt/spark/python/pyspark/context.py", line 180, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/opt/spark/python/pyspark/context.py", line 288, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1525, in __call__
    answer, self._gateway_client, None, self._fqn)
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalStateException: Spark context stopped while waiting for backend
        at org.apache.spark.scheduler.TaskSchedulerImpl.waitBackendReady(TaskSchedulerImpl.scala:745)
        at org.apache.spark.scheduler.TaskSchedulerImpl.postStartHook(TaskSchedulerImpl.scala:191)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:560)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:238)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:745)

Spark Version:

(base) appuser@jupyter-21001-0:/opt/notebooks/Test$ pyspark --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/
 
Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_152-release
Branch
Compiled by user  on 2018-10-29T06:22:05Z
Revision
Url
Type --help for more information.

Hosts File:

# Kubernetes-managed hosts file.
127.0.0.1       localhost
::1     localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
172.18.168.47   jupyter-21001-0.jupyter-21001.75dev-a123456.svc.cluster.local   jupyter-21001-0

spark-defaults.conf

spark.master yarn
spark.driver.extraClassPath /opt/spark/lib/gcs-connector-latest-hadoop2.jar
spark.executor.extraClassPath /usr/lib/hadoop/lib/gcs-connector.jar:/usr/lib/hadoop/lib/bigquery-connector.jar

core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://jupyter-21001-m</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://jupyter-21001-m</value>
  </property>
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
    <description>The FileSystem for gs: (GCS) uris.</description>
  </property>
  <property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
    <description>The AbstractFileSystem for gs: (GCS) uris. Only necessary for use with Hadoop 2.</description>
  </property>
  <property>
    <name>fs.gs.project.id</name>
    <value>NOT_RUNNING_INSIDE_GCE</value>
  </property>
  <property>
    <name>google.cloud.auth.service.account.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>google.cloud.auth.service.account.json.keyfile</name>
    <value>/oauth/auth.json</value>
  </property>
  <property>
    <name>fs.gs.implicit.dir.repair.enable</name>
    <value>false</value>
    <description>
      Whether or not to create objects for the parent directories of objects
      with / in their path e.g. creating gs://bucket/foo/ upon finding
      gs://bucket/foo/bar.
    </description>
  </property>
</configuration>
-- kstorer
apache-spark
google-cloud-dataproc
google-kubernetes-engine

1 Answer

2/5/2019

This sounds hard. At a high level, I would recommend looking into running Jupyter notebooks inside of Dataproc first, but I will try to answer the question.

In YARN client mode (which the pyspark shell uses), the AppMaster needs to dial back into the driver, so you need to set spark.driver.host and spark.driver.port to an address and port that the AppMaster can reach. Assuming this is on GKE in the same GCE network as your Dataproc cluster, a NodePort Service is probably your best bet for exposing the driver to the GCE network.
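As a minimal sketch (the IP and ports are placeholders, not values from your setup; you would substitute the GKE node's network-reachable IP and the ports your NodePort Service exposes):

pyspark \
  --conf spark.driver.bindAddress=0.0.0.0 \
  --conf spark.driver.host=NODE_IP \
  --conf spark.driver.port=30001 \
  --conf spark.blockManager.port=30002

spark.driver.bindAddress lets the driver listen on all interfaces inside the pod while advertising the externally reachable spark.driver.host to YARN.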

The part I don't know is how to pass the node IP and port to your pyspark pod so that you can set spark.driver.host and spark.driver.port correctly (GKE experts, please chime in on that).
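One plausible piece (untested here, and the env var name is arbitrary) is injecting the node IP into the pod via the Kubernetes Downward API:

env:
- name: SPARK_DRIVER_HOST
  valueFrom:
    fieldRef:
      fieldPath: status.hostIP

You could then feed $SPARK_DRIVER_HOST to --conf spark.driver.host. The port is trickier: you would need the NodePort values from the Service, e.g. by pinning them explicitly in the Service spec.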

I don't understand the details of your setup, but I would not edit the GKE VMs' /etc/hosts myself. I also find it pretty surprising that this broke between 2.2 and 2.4 (is it possible you weren't actually running on YARN on Dataproc in your 2.2 setup?).

Lastly, you don't mention it, but you will need to set yarn.resourcemanager.hostname=jupyter-21001-m in core-site.xml or yarn-site.xml.
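In yarn-site.xml that would look like:

<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>jupyter-21001-m</value>
</property>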

As a partial invocation, the following should successfully create a YARN AppMaster in Dataproc (which will then fail to dial back into GKE):

kubectl run spark -i -t --image gcr.io/spark-operator/spark:v2.4.0 \
  --generator=run-pod/v1 --rm --restart=Never --env HADOOP_CONF_DIR=/tmp \
  --command -- /opt/spark/bin/spark-submit \
    --master yarn \
    --conf spark.hadoop.fs.defaultFS=hdfs://jupyter-21001-m \
    --conf spark.hadoop.yarn.resourcemanager.hostname=jupyter-21001-m \
    --class org.apache.spark.examples.SparkPi \
    /opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
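
To verify the AppMaster side, one option (assuming SSH access to the master node) is to list YARN applications from the Dataproc master:

gcloud compute ssh jupyter-21001-m --command='yarn application -list -appStates ALL'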

Sorry that's an incomplete answer; I will update it if I fill in the missing pieces.

-- Patrick Clay
Source: StackOverflow