I'm trying to use the Snowflake Spark connector packages in spark-submit via --packages.
When I run locally, it works fine: I'm able to connect to a Snowflake table and get back a Spark DataFrame.
spark-submit --packages net.snowflake:snowflake-jdbc:2.8.1,net.snowflake:spark-snowflake_2.10:2.0.0 test_sf.py
But when I try to pass the --master argument, it fails, stating that a Snowflake class is not available.
spark-submit --packages net.snowflake:snowflake-jdbc:2.8.1,net.snowflake:spark-snowflake_2.10:2.0.0 --master spark://spark-master.cluster.local:7077 test_sf.py
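For reference, test_sf.py is essentially just a read through the connector. A minimal sketch of what it does (connection options and the table name below are placeholders, not the real values):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test_sf").getOrCreate()

# Placeholder Snowflake connection options -- replace with real account details
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

# Read a Snowflake table into a Spark DataFrame via the spark-snowflake connector
df = (spark.read
      .format("net.snowflake.spark.snowflake")
      .options(**sf_options)
      .option("dbtable", "<table>")
      .load())

df.show()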
Update:
I have tried all the options, like --jars, extraClassPath on the driver and executor, and --packages, but nothing seems to be working. Is it because of some problem with the Spark standalone cluster?
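For context, the variants I tried look roughly like the following (the jar paths are placeholders):

spark-submit --jars /path/to/snowflake-jdbc-2.8.1.jar,/path/to/spark-snowflake_2.10-2.0.0.jar --master spark://spark-master.cluster.local:7077 test_sf.py

spark-submit --conf spark.driver.extraClassPath=/path/to/snowflake-jdbc-2.8.1.jar:/path/to/spark-snowflake_2.10-2.0.0.jar --conf spark.executor.extraClassPath=/path/to/snowflake-jdbc-2.8.1.jar:/path/to/spark-snowflake_2.10-2.0.0.jar --master spark://spark-master.cluster.local:7077 test_sf.py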
Latest update:
It works when I specify a repository URL in --jars instead of a file path. So basically I have to upload the jars to some repository and point to that.
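That is, passing direct download URLs instead of local file paths, e.g. something along these lines (the exact URLs depend on the versions; these just follow the standard Maven Central layout of groupId/artifactId/version/artifactId-version.jar):

spark-submit --jars https://repo1.maven.org/maven2/net/snowflake/snowflake-jdbc/2.8.1/snowflake-jdbc-2.8.1.jar,https://repo1.maven.org/maven2/net/snowflake/spark-snowflake_2.10/2.0.0/spark-snowflake_2.10-2.0.0.jar --master spark://spark-master.cluster.local:7077 test_sf.py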
Error log:
Caused by: java.lang.ClassNotFoundException: net.snowflake.spark.snowflake.io.SnowflakePartition
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1868)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1751)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:313)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
I am posting on behalf of a colleague who had some insights on this:
When you run spark-submit from your laptop to run a workload on Kubernetes (managed or otherwise), you need to provide the k8s master URL, not the Spark master URL. Whatever "spark://spark-master.cluster.local:7077" is pointing to has no line of sight from your machine; it may not even exist in your original issue. When you use spark-submit, it creates the executor and driver nodes inside k8s, and only at that point does a Spark master URL become available; even then, that Spark master URL is reachable only from inside k8s unless line of sight is made available.
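For illustration, a Kubernetes-targeted submit points at the API server rather than a spark:// address, roughly like this (host, port, and image name are placeholders):

spark-submit --master k8s://https://<k8s-apiserver-host>:<port> --deploy-mode cluster --conf spark.kubernetes.container.image=<spark-image> test_sf.py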
Per your Update section: --packages searches for packages in the local Maven repo, or in a remote repo if a path to one is provided. Alternatively, you can use the --jars option: bake the jars into the container that runs the Spark job and then provide the local path in --jars, as sketched below.
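For example, if the jars are baked into the image (say under /opt/spark/jars, a placeholder location), the submit can reference them with the local:// scheme, which tells Spark the files already exist inside the container:

spark-submit --master k8s://https://<k8s-apiserver-host>:<port> --deploy-mode cluster --conf spark.kubernetes.container.image=<image-with-snowflake-jars> --jars local:///opt/spark/jars/snowflake-jdbc-2.8.1.jar,local:///opt/spark/jars/spark-snowflake_2.10-2.0.0.jar test_sf.py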
Does any of this resonate with the updates and conclusions you reached in your updated question?