I am experimenting with Spark 2.3 on a K8s cluster. I'm wondering how checkpointing works. Where is the checkpoint stored? If the driver dies, what happens to the in-flight processing?
When consuming from Kafka, how are the offsets maintained? I tried to look this up online but could not find an answer to these questions. Our application consumes a lot of Kafka data, so it is essential that it can restart and pick up from where it stopped.
Any gotchas when running Spark Streaming on K8s?
The Kubernetes Spark Controller doesn't know anything about checkpointing, AFAIK. It's just a way for Kubernetes to schedule your Spark driver and the Workers that it needs to run a job.
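Checkpointing itself is configured in your application code, and on K8s the checkpoint directory has to live on storage that survives pod restarts (HDFS, S3, etc.), not on a pod's local disk. A minimal DStream sketch, assuming an illustrative HDFS path and batch interval:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative path: it must be shared storage, since driver/executor pods are ephemeral.
val checkpointDir = "hdfs:///apps/my-streaming-app/checkpoint"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("my-streaming-app")
  val ssc = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint(checkpointDir)
  // ... define the Kafka input stream and processing here ...
  ssc
}

// On restart, getOrCreate rebuilds the context from the checkpoint if one exists,
// so a rescheduled driver pod can carry on with the pending batches.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()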
Storing the Kafka offsets is really up to your application: you decide where to persist them, so that on restart it can read them back and resume consuming from there. This is an example of how to store them in ZooKeeper.
You could, for example, write ZK offset manager functions in Scala. The bodies below are a rough sketch using the Curator API; the ZK path is illustrative:
import com.metamx.common.scala.Logging
import org.apache.curator.framework.CuratorFramework
import org.apache.kafka.common.TopicPartition
import scala.collection.JavaConverters._

// Sketch only: the ZK path is illustrative; adjust it to your application.
object OffsetManager extends Logging {
  private val offsetsPath = "/consumers/my-app/offsets"
  // Read the last committed offset per partition of `topic` from ZooKeeper.
  def getOffsets(client: CuratorFramework, topic: String): Map[TopicPartition, Long] =
    client.getChildren.forPath(offsetsPath).asScala.map { partition =>
      val offset = new String(client.getData.forPath(s"$offsetsPath/$partition"))
      new TopicPartition(topic, partition.toInt) -> offset.toLong
    }.toMap
  // Write the latest processed offsets back after each batch.
  def setOffsets(client: CuratorFramework, offsets: Map[TopicPartition, Long]): Unit =
    offsets.foreach { case (tp, offset) =>
      val path = s"$offsetsPath/${tp.partition}"
      if (client.checkExists().forPath(path) == null) client.create().creatingParentsIfNeeded().forPath(path)
      client.setData().forPath(path, offset.toString.getBytes)
    }
}
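You would then wire those helpers into a direct stream: read the stored offsets when building the stream, and commit the end offsets after each batch completes. Roughly like this, assuming ssc, kafkaParams, and zkClient (a started CuratorFramework) already exist and that the topic name is illustrative:

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.HasOffsetRanges
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// Resume from the offsets previously stored in ZooKeeper ("my-topic" is illustrative).
val startOffsets = OffsetManager.getOffsets(zkClient, "my-topic")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Assign[String, String](startOffsets.keys.toList, kafkaParams, startOffsets))

stream.foreachRDD { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch here ...
  // Persist the end offsets only after the batch has been processed successfully.
  OffsetManager.setOffsets(zkClient,
    ranges.map(r => new TopicPartition(r.topic, r.partition) -> r.untilOffset).toMap)
}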
Another way would be storing your Kafka offsets in something reliable like HDFS.
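A minimal sketch of that idea, using a small hypothetical helper that overwrites a text file of "topic partition offset" lines at a path of your choosing:

import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.kafka.common.TopicPartition

// Hypothetical helper: write one "topic partition offset" line per partition.
def saveOffsetsToHdfs(offsets: Map[TopicPartition, Long], path: String): Unit = {
  val fs = FileSystem.get(new Configuration())
  val out = fs.create(new Path(path), true) // true = overwrite any previous file
  try {
    offsets.foreach { case (tp, off) =>
      out.write(s"${tp.topic} ${tp.partition} $off\n".getBytes(StandardCharsets.UTF_8))
    }
  } finally {
    out.close()
  }
}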