How to connect to a Docker-hosted PostgreSQL database via JDBC in Spark?

5/7/2019

I am trying to retrieve data from a PostgreSQL database hosted in a Docker container, using JDBC and a Spark DataFrame. The Postgres port is exposed as a NodePort in my Kubernetes cluster.

The connection is set up with:

import java.util.Properties

val postgres_url = s"$databaseHost:32020"
val postgres_username = "xxxx"
val postgres_db_name = "yyyy"

// Connect to Postgres and retrieve the table's DataFrame
val jdbc_url = s"jdbc:postgresql://$postgres_url/$postgres_db_name"

val connectionProperties = new Properties
connectionProperties.put("user", postgres_username)
connectionProperties.put("driver", "org.postgresql.Driver")

The connection seems to work, since the DataFrame schema is correctly inferred when I use spark.read.jdbc. But when I try to access the actual data, I get a connection refused error on a different port than the one I provided (the error mentions 31816 instead of 32020).

val df_table = spark.read.jdbc(jdbc_url, "type_mime", connectionProperties)
df_table.count()

This gives:

df_table: org.apache.spark.sql.DataFrame = [id: bigint, mime_type: string ... 1 more field] 
// Schema is correctly loaded

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 26.0 failed 1 times, most recent failure: Lost task 0.0 in stage 26.0 (TID 211, localhost, executor driver): java.io.IOException: Failed to connect to /192.168.97.1:31816
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
        at org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:366)
        at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:332)
        at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:654)
        at org.apache.spark.util.Utils$.fetchFile(Utils.scala:480)
        at org.apache.spark.executor.Executor$anonfun$org$apache$spark$executor$Executor$updateDependencies$5.apply(Executor.scala:696)
        at org.apache.spark.executor.Executor$anonfun$org$apache$spark$executor$Executor$updateDependencies$5.apply(Executor.scala:688)
        at scala.collection.TraversableLike$WithFilter$anonfun$foreach$1.apply(TraversableLike.scala:733)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
        at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$updateDependencies(Executor.scala:688)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:308)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /192.168.97.1:31816
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        ... 1 more
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$failJobAndIndependentStages(DAGScheduler.scala:1499)
        at org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
        at org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
        at org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
        at org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
        at org.apache.spark.util.EventLoop$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
        at org.apache.spark.rdd.RDD$anonfun$collect$1.apply(RDD.scala:936)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
        at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278)
        at org.apache.spark.sql.Dataset$anonfun$count$1.apply(Dataset.scala:2430)
        at org.apache.spark.sql.Dataset$anonfun$count$1.apply(Dataset.scala:2429)
        at org.apache.spark.sql.Dataset$anonfun$55.apply(Dataset.scala:2837)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2836)
        at org.apache.spark.sql.Dataset.count(Dataset.scala:2429)
        ... 68 elided
Caused by: java.io.IOException: Failed to connect to /192.168.97.1:31816
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
        at org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:366)
        at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:332)
        at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:654)
        at org.apache.spark.util.Utils$.fetchFile(Utils.scala:480)
        at org.apache.spark.executor.Executor$anonfun$org$apache$spark$executor$Executor$updateDependencies$5.apply(Executor.scala:696)
        at org.apache.spark.executor.Executor$anonfun$org$apache$spark$executor$Executor$updateDependencies$5.apply(Executor.scala:688)
        at scala.collection.TraversableLike$WithFilter$anonfun$foreach$1.apply(TraversableLike.scala:733)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
        at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$updateDependencies(Executor.scala:688)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:308)
        ... 3 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /192.168.97.1:31816
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        ... 1 more

The database can be accessed correctly with psql.
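
A plain JDBC connection from the driver JVM can be used for the same sanity check, independent of Spark's executor machinery (a minimal sketch reusing jdbc_url and postgres_username from above; the password value "zzzz" is a placeholder, not from the original post):

import java.sql.DriverManager

// Open a direct JDBC connection from the driver and run a trivial query.
val conn = DriverManager.getConnection(jdbc_url, postgres_username, "zzzz")
try {
  val rs = conn.createStatement().executeQuery("SELECT 1")
  rs.next()
  println(s"Postgres reachable: ${rs.getInt(1) == 1}")
} finally {
  conn.close()
}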

Does JDBC use a different port than Postgres's main one? Should I open it up in Docker?

-- Double Sept
apache-spark
apache-zeppelin
jdbc
kubernetes
postgresql

1 Answer

5/10/2019

I managed to solve the problem. It had nothing to do with JDBC or Postgres.

The stack trace shows that the problem occurs when Spark starts to distribute work across the executors: the refused port (31816) is one Spark picked for driver/executor communication (the executor fetching dependencies from the driver, per Utils.fetchFile in the trace), not the Postgres NodePort.

In fact, I was running my code in a Zeppelin notebook hosted on Kubernetes, and the notebook pod was running out of available ports for new connections.
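
In setups like this, a common mitigation is to pin the driver-side ports Spark would otherwise choose at random, so executors reach the driver on ports known to be open on the pod (a sketch of that approach, not necessarily the exact change I made; the app name and port numbers are placeholders):

import org.apache.spark.sql.SparkSession

// Pin Spark's normally random driver-side ports to fixed, reachable values.
val spark = SparkSession.builder()
  .appName("postgres-jdbc-read")                // placeholder name
  .config("spark.driver.port", "40000")         // driver RPC endpoint
  .config("spark.blockManager.port", "40001")   // block manager traffic
  .getOrCreate()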

Hope it helps.

-- Double Sept
Source: StackOverflow