Using Snappy compression in Spark in Google Kubernetes Engine

2/28/2019

I have a service running in a Docker container in Google Kubernetes Engine. It writes data to Google Cloud Storage, storing it as an .avro file with Snappy compression:

conf.setBoolean("mapreduce.output.fileoutputformat.compress", true)
conf.set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec")

It works great. Then I set up a new project and deployed the same service to a new container; the service and the Dockerfile are unchanged, but now it fails with this error:

    at org.apache.spark.internal.io.SparkHadoopWriter$anonfun$3.apply(SparkHadoopWriter.scala:83)
    at org.apache.spark.internal.io.SparkHadoopWriter$anonfun$3.apply(SparkHadoopWriter.scala:78)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    ... 3 more
Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
    at org.apache.avro.file.SnappyCodec.compress(SnappyCodec.java:43)
    at org.apache.avro.file.DataFileStream$DataBlock.compressUsing(DataFileStream.java:358)
    at org.apache.avro.file.DataFileWriter.writeBlock(DataFileWriter.java:382)
    at org.apache.avro.file.DataFileWriter.sync(DataFileWriter.java:401)
    at org.apache.avro.file.DataFileWriter.flush(DataFileWriter.java:410)
    at org.apache.avro.file.DataFileWriter.close(DataFileWriter.java:433)
    at org.apache.avro.mapreduce.AvroKeyRecordWriter.close(AvroKeyRecordWriter.java:83)
    at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.closeWriter(SparkHadoopWriter.scala:361)
    at org.apache.spark.internal.io.SparkHadoopWriter$anonfun$4.apply(SparkHadoopWriter.scala:137)
    at org.apache.spark.internal.io.SparkHadoopWriter$anonfun$4.apply(SparkHadoopWriter.scala:127)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1415)
    at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$executeTask(SparkHadoopWriter.scala:139)
    ... 8 more
    Suppressed: java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
        at org.apache.avro.file.SnappyCodec.compress(SnappyCodec.java:43)
        at org.apache.avro.file.DataFileStream$DataBlock.compressUsing(DataFileStream.java:358)
        at org.apache.avro.file.DataFileWriter.writeBlock(DataFileWriter.java:382)
        at org.apache.avro.file.DataFileWriter.sync(DataFileWriter.java:401)
        at org.apache.avro.file.DataFileWriter.flush(DataFileWriter.java:410)
        at org.apache.avro.file.DataFileWriter.close(DataFileWriter.java:433)
        at org.apache.avro.mapreduce.AvroKeyRecordWriter.close(AvroKeyRecordWriter.java:83)
        at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.closeWriter(SparkHadoopWriter.scala:361)
        at org.apache.spark.internal.io.SparkHadoopWriter$anonfun$1.apply$mcV$sp(SparkHadoopWriter.scala:142)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1424)

Odd. The Snappy JAR (snappy-java) is still present in the Docker image, so I don't see why the service can't load the class.
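
Since "Could not initialize class" means a static initializer has already failed (for snappy-java that is usually the native libsnappyjava failing to load inside the container), one thing I can do is force Snappy to load at service startup so the underlying error shows up directly. Something like:

import org.xerial.snappy.Snappy

// Force snappy-java's static initializer to run at startup so the real
// failure (e.g. the native libsnappyjava not loading in this base image)
// is reported directly instead of the later "Could not initialize class".
try {
  Snappy.uncompress(Snappy.compress("snappy self-test".getBytes("UTF-8")))
  println("snappy-java loaded OK")
} catch {
  case t: Throwable =>
    println("snappy-java failed to initialize:")
    t.printStackTrace()
}

If that fails only in the new image, it would point at the base image (a missing native dependency) rather than at Google Cloud Storage.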

It works if I disable compression, which is not ideal. Have the compression libraries supported by Google Cloud Storage changed? Any suggestions? (I'm open to other compression libraries.)
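
For example, one thing I plan to try is switching the Avro codec to deflate, which is implemented in plain Java (java.util.zip) and doesn't need a native library. If I understand the Avro MapReduce output format correctly, that would be:

// Keep output compression on, but use Avro's deflate codec instead of Snappy.
// "avro.output.codec" is the key I believe the Avro output format reads;
// I haven't verified this on the new image yet.
conf.setBoolean("mapreduce.output.fileoutputformat.compress", true)
conf.set("avro.output.codec", "deflate")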

-- s d
docker
google-kubernetes-engine
java
snappy

0 Answers