Pyspark job stuck in sleep and retry loop while reading from GCS

7/12/2019

I tried running my spark job on GKE using spark-operator and dataproc but on both instances the hadoop adaptor is able to list the files but gets stuck in a sleep-retry loop while trying to read them from GCS.

The service account has full access and I was able to fetch the file using gsutil on the same executor container using the same service account. This seems to rule out network or permission issues. Using spark-operator version v2.4.0-v1beta1-latest

Logs:

2019-07-12 11:33:12 INFO  HadoopRDD:54 - Input split: gs://app-logs/2019/07/04/08/ip-10-1-34-63-app-json.log-2019-07-04-08-20.gz:0+295144331
2019-07-12 11:33:12 INFO  HadoopRDD:54 - Input split: gs://app-logs/2019/07/04/08/ip-10-1-33-94-app-json.log-2019-07-04-08-20.gz:0+305812437
2019-07-12 11:33:12 INFO  HadoopRDD:54 - Input split: gs://app-logs/2019/07/04/08/ip-10-1-34-61-app-json.log-2019-07-04-08-20.gz:0+297933921
2019-07-12 11:33:12 INFO  HadoopRDD:54 - Input split: gs://app-logs/2019/07/04/08/ip-10-1-33-112-app-json.log-2019-07-04-08-20.gz:0+309553279
2019-07-12 11:33:12 INFO  TorrentBroadcast:54 - Started reading broadcast variable 0
2019-07-12 11:33:12 INFO  MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.1 KB, free 3.3 GB)
2019-07-12 11:33:12 INFO  TorrentBroadcast:54 - Reading broadcast variable 0 took 13 ms
2019-07-12 11:33:12 INFO  MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 323.8 KB, free 3.3 GB)
2019-07-12 11:33:14 INFO  CodecPool:181 - Got brand-new decompressor [.gz]
2019-07-12 11:33:14 INFO  CodecPool:181 - Got brand-new decompressor [.gz]
2019-07-12 11:33:14 INFO  CodecPool:181 - Got brand-new decompressor [.gz]
2019-07-12 11:33:14 INFO  CodecPool:181 - Got brand-new decompressor [.gz]
2019-07-12 11:42:00 WARN  GoogleCloudStorageReadChannel:76 - Failed read retry #1/10 for 'gs://app-logs/2019/07/04/08/ip-10-1-33-112-app-json.log-2019-07-04-08-20.gz'. Sleeping...
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:593)
at sun.security.ssl.InputRecord.read(InputRecord.java:532)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:940)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.MeteredStream.read(MeteredStream.java:134)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3393)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpResponse$SizeValidatingInputStream.read(NetHttpResponse.java:169)
at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.read(GoogleCloudStorageReadChannel.java:370)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.read(GoogleHadoopFSInputStream.java:130)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:248)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
at org.apache.spark.rdd.HadoopRDD$anon$1.getNext(HadoopRDD.scala:293)
at org.apache.spark.rdd.HadoopRDD$anon$1.getNext(HadoopRDD.scala:224)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$anon$2.writeIteratorToStream(PythonRunner.scala:557)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
2019-07-12 11:42:00 INFO  GoogleCloudStorageReadChannel:76 - Done sleeping before retry #1/10 for 'gs://app-logs/2019/07/04/08/ip-10-1-33-112-app-json.log-2019-07-04-08-20.gz'
2019-07-12 11:42:00 INFO  GoogleCloudStorageReadChannel:76 - Success after 1 retries on reading 'gs://app-logs/2019/07/04/08/ip-10-1-33-112-app-json.log-2019-07-04-08-20.gz'
2019-07-12 11:42:00 WARN  GoogleCloudStorageReadChannel:76 - Failed read retry #1/10 for 'gs://app-logs/2019/07/04/08/ip-10-1-34-61-app-json.log-2019-07-04-08-20.gz'. Sleeping...
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:593)
at sun.security.ssl.InputRecord.read(InputRecord.java:532)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:940)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.MeteredStream.read(MeteredStream.java:134)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3393)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpResponse$SizeValidatingInputStream.read(NetHttpResponse.java:169)
at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.read(GoogleCloudStorageReadChannel.java:370)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.read(GoogleHadoopFSInputStream.java:130)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:248)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
at org.apache.spark.rdd.HadoopRDD$anon$1.getNext(HadoopRDD.scala:293)
at org.apache.spark.rdd.HadoopRDD$anon$1.getNext(HadoopRDD.scala:224)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$anon$2.writeIteratorToStream(PythonRunner.scala:557)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
2019-07-12 11:42:00 INFO  GoogleCloudStorageReadChannel:76 - Done sleeping before retry #1/10 for 'gs://app-logs/2019/07/04/08/ip-10-1-34-61-app-json.log-2019-07-04-08-20.gz'
2019-07-12 11:42:00 INFO  GoogleCloudStorageReadChannel:76 - Success after 1 retries on reading 'gs://app-logs/2019/07/04/08/ip-10-1-34-61-app-json.log-2019-07-04-08-20.gz'
2019-07-12 11:50:44 WARN  GoogleCloudStorageReadChannel:76 - Failed read retry #1/10 for 'gs://app-logs/2019/07/04/08/ip-10-1-33-112-app-json.log-2019-07-04-08-20.gz'. Sleeping...
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:593)
at sun.security.ssl.InputRecord.read(InputRecord.java:532)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:940)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.ChunkedInputStream.fastRead(ChunkedInputStream.java:244)
at sun.net.www.http.ChunkedInputStream.read(ChunkedInputStream.java:689)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3393)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpResponse$SizeValidatingInputStream.read(NetHttpResponse.java:169)
at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.read(GoogleCloudStorageReadChannel.java:370)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.read(GoogleHadoopFSInputStream.java:130)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:248)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
at org.apache.spark.rdd.HadoopRDD$anon$1.getNext(HadoopRDD.scala:293)
at org.apache.spark.rdd.HadoopRDD$anon$1.getNext(HadoopRDD.scala:224)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$anon$2.writeIteratorToStream(PythonRunner.scala:557)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
2019-07-12 11:50:44 INFO  GoogleCloudStorageReadChannel:76 - Done sleeping before retry #1/10 for 'gs://app-logs/2019/07/04/08/ip-10-1-33-112-app-json.log-2019-07-04-08-20.gz'
2019-07-12 11:50:44 INFO  GoogleCloudStorageReadChannel:76 - Success after 1 retries on reading 'gs://app-logs/2019/07/04/08/ip-10-1-33-112-app-json.log-2019-07-04-08-20.gz'
2019-07-12 11:55:06 WARN  GoogleCloudStorageReadChannel:76 - Failed read retry #1/10 for 'gs://app-logs/2019/07/04/08/ip-10-1-33-94-app-json.log-2019-07-04-08-20.gz'. Sleeping...
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:593)
at sun.security.ssl.InputRecord.read(InputRecord.java:532)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:940)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.MeteredStream.read(MeteredStream.java:134)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3393)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpResponse$SizeValidatingInputStream.read(NetHttpResponse.java:169)
at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.read(GoogleCloudStorageReadChannel.java:370)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.read(GoogleHadoopFSInputStream.java:130)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:248)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
at org.apache.spark.rdd.HadoopRDD$anon$1.getNext(HadoopRDD.scala:293)
at org.apache.spark.rdd.HadoopRDD$anon$1.getNext(HadoopRDD.scala:224)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$anon$2.writeIteratorToStream(PythonRunner.scala:557)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
2019-07-12 11:55:06 INFO  GoogleCloudStorageReadChannel:76 - Done sleeping before retry #1/10 for 'gs://app-logs/2019/07/04/08/ip-10-1-33-94-app-json.log-2019-07-04-08-20.gz'
2019-07-12 11:55:06 INFO  GoogleCloudStorageReadChannel:76 - Success after 1 retries on reading 'gs://app-logs/2019/07/04/08/ip-10-1-33-94-app-json.log-2019-07-04-08-20.gz'
2019-07-12 11:55:10 WARN  GoogleCloudStorageReadChannel:76 - Failed read retry #1/10 for 'gs://app-logs/2019/07/04/08/ip-10-1-34-63-app-json.log-2019-07-04-08-20.gz'. Sleeping...
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:593)
at sun.security.ssl.InputRecord.read(InputRecord.java:532)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:940)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.MeteredStream.read(MeteredStream.java:134)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3393)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpResponse$SizeValidatingInputStream.read(NetHttpResponse.java:169)
at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.read(GoogleCloudStorageReadChannel.java:370)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.read(GoogleHadoopFSInputStream.java:130)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:248)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
at org.apache.spark.rdd.HadoopRDD$anon$1.getNext(HadoopRDD.scala:293)
at org.apache.spark.rdd.HadoopRDD$anon$1.getNext(HadoopRDD.scala:224)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$anon$2.writeIteratorToStream(PythonRunner.scala:557)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
2019-07-12 11:55:10 INFO  GoogleCloudStorageReadChannel:76 - Done sleeping before retry #1/10 for 'gs://app-logs/2019/07/04/08/ip-10-1-34-63-app-json.log-2019-07-04-08-20.gz'
2019-07-12 11:55:10 INFO  GoogleCloudStorageReadChannel:76 - Success after 1 retries on reading 'gs://app-logs/2019/07/04/08/ip-10-1-34-63-app-json.log-2019-07-04-08-20.gz'

What could be causing this? Have relaxed the firewall rules as well.

-- mightwork
apache-spark
google-cloud-dataproc
google-cloud-storage
google-kubernetes-engine
pyspark

1 Answer

8/19/2019

You need to check if your cluster and GCS bucket that you are reading from are in the same GCP region - it could be slow if reads are cross regional.

Also, it seems that you are processing gzipped log files that are can not be split and need to be re-read from the beginning on each failure, which could lead to long read times if there are some network flakiness (because of cross regional reads, for example).

-- Igor Dvorzhak
Source: StackOverflow