How to pass --packages to spark-submit in a Kubernetes-managed cluster?

10/11/2019

I'm trying to use the Snowflake Spark connector packages in spark-submit via --packages.

When I run it locally, it works fine. I'm able to connect to the Snowflake table and get back a Spark DataFrame.

spark-submit --packages net.snowflake:snowflake-jdbc:2.8.1,net.snowflake:spark-snowflake_2.10:2.0.0 test_sf.py

But when I pass the --master argument, it fails, stating that the Snowflake class is not available.

spark-submit --packages net.snowflake:snowflake-jdbc:2.8.1,net.snowflake:spark-snowflake_2.10:2.0.0 --master spark://spark-master.cluster.local:7077 test_sf.py

Update:

I have tried all the options: --jars, extraClassPath on the driver and executor, and --packages, but nothing seems to be working. Is it because of some problem in the Spark standalone cluster?
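
For context, the kind of submit command I tried looked roughly like this (the jar paths are just placeholders for where the jars sat on disk):

spark-submit \
  --master spark://spark-master.cluster.local:7077 \
  --jars /path/to/snowflake-jdbc-2.8.1.jar,/path/to/spark-snowflake_2.10-2.0.0.jar \
  --conf spark.driver.extraClassPath=/path/to/snowflake-jdbc-2.8.1.jar:/path/to/spark-snowflake_2.10-2.0.0.jar \
  --conf spark.executor.extraClassPath=/path/to/snowflake-jdbc-2.8.1.jar:/path/to/spark-snowflake_2.10-2.0.0.jar \
  test_sf.py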

Latest update:

It works when I specify the repository URL in --jars instead of a file path. So basically I have to upload the jars to some repository and point to that.
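
Roughly, what worked looks like this (the repository URL below is just a placeholder for wherever the jars are hosted):

spark-submit \
  --master spark://spark-master.cluster.local:7077 \
  --jars https://my-repo.example.com/jars/snowflake-jdbc-2.8.1.jar,https://my-repo.example.com/jars/spark-snowflake_2.10-2.0.0.jar \
  test_sf.py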

error log:

Caused by: java.lang.ClassNotFoundException: net.snowflake.spark.snowflake.io.SnowflakePartition
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.spark.serializer.JavaDeserializationStream$anon$1.resolveClass(JavaSerializer.scala:67)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1868)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1751)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:313)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
-- Shankar
kubernetes
pyspark
snowflake-cloud-data-platform

1 Answer

10/17/2019

I am posting on behalf of a colleague who had some insights on this:

When you run spark-submit from your laptop to run a workload on Kubernetes (managed or otherwise), it requires you to provide the Kubernetes master URL, not the Spark master URL. Whatever "spark://spark-master.cluster.local:7077" points to has no line of sight from your machine; it may not even exist in your original setup. When you use spark-submit against Kubernetes, it creates the driver and executor pods inside Kubernetes, and at that point a Spark master URL does become available, but even then it is reachable only from inside the cluster unless line of sight is explicitly made available.
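
For example, a submission against a Kubernetes cluster looks roughly like the following sketch. The API server address, namespace, and container image name are placeholders, and in cluster mode the application script itself also has to be reachable from inside the cluster (for example, baked into the image at the path given below):

spark-submit \
  --master k8s://https://<k8s-apiserver-host>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.namespace=default \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  --packages net.snowflake:snowflake-jdbc:2.8.1,net.snowflake:spark-snowflake_2.10:2.0.0 \
  local:///opt/spark/work-dir/test_sf.py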

Per your Update section: --packages searches for packages in the local Maven repository, or in a remote repository if a path to one is provided. Alternatively, you can use the --jars option: bake the jars into the container image that runs the Spark job and then provide their local paths in --jars, as sketched below.
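
A sketch of that --jars approach, assuming the jars have been copied into the image under /opt/spark/jars (the paths and image name are illustrative, and the local:// scheme tells Spark the files already exist inside each pod):

spark-submit \
  --master k8s://https://<k8s-apiserver-host>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<image-with-snowflake-jars> \
  --jars local:///opt/spark/jars/snowflake-jdbc-2.8.1.jar,local:///opt/spark/jars/spark-snowflake_2.10-2.0.0.jar \
  local:///opt/spark/work-dir/test_sf.py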

Does any of this resonate with the updates and conclusions you reached in your updated question?

-- Rachel McGuigan
Source: StackOverflow