Can Spark theoretically lose data from a failed job?

9/25/2018

So we are using an RDD: we do a flatMap over a set of data, and then transform each element with a map operation.

val elementsRDD: RDD[Element] = ...

val result = elementsRDD.map(processData)
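
For reference, a minimal self-contained sketch of the pipeline described above (the Element type, the parsing logic and the processing step are hypothetical stand-ins, not our actual code):

import org.apache.spark.sql.SparkSession

object ElementPipeline {
  // Hypothetical stand-in for the question's Element type
  case class Element(id: Long, payload: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("element-pipeline").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical raw input; the real source is not shown in the question
    val raw = sc.parallelize(Seq("1:a,2:b", "3:c"))

    // flatMap each record into individual elements
    val elementsRDD = raw.flatMap { line =>
      line.split(",").map { field =>
        val Array(id, payload) = field.split(":")
        Element(id.toLong, payload)
      }
    }

    // map each element through the per-element processing step
    val result = elementsRDD.map(e => e.copy(payload = e.payload.toUpperCase))

    result.collect().foreach(println)
    spark.stop()
  }
}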

On a fixed set of elements we see on each run that if some executors die during the map operation, Spark spins up new executors, but it does not feed them the data for the operation, and as a result we are losing data. Our expectation is that Spark should provide the data, or at least re-run the stage from scratch.

We use the new Kubernetes feature of Spark 2.4 (which is still in development).

UPDATE: The documentation says this situation is impossible, but our logging from the executors shows that we lose different pieces of data while processing a fixed set of data. Moreover, if we do not kill any executors during the run, we do not lose any data.
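
One simple way to make such loss visible on a fixed input is to compare input and output counts; a minimal sketch, assuming the elementsRDD and processData from the snippet above (this is illustrative, not our exact logging code):

// Minimal completeness check on a fixed input set
val inputCount  = elementsRDD.count()
val result      = elementsRDD.map(processData)
val outputCount = result.count()
// With a stable, persistent source these counts should always match,
// even if executors are killed and their tasks are retried.
assert(inputCount == outputCount, s"lost ${inputCount - outputCount} elements")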

-- GregTk
apache-spark
kubernetes

1 Answer

9/25/2018

No. The data being processed by the dead executor will be lost, but when the driver notices the failure of an executor, it redistributes that executor's tasks among the executors that are still alive. Spark will not mark the application as successful until every job has completed successfully.
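The relevant knobs for this retry behaviour are the task- and stage-level failure limits; a sketch of how they can be set when building the session (the values shown are Spark's documented defaults, not settings taken from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("resilient-job")
  .config("spark.task.maxFailures", "4")              // task retries before the whole job is failed
  .config("spark.stage.maxConsecutiveAttempts", "4")  // stage retries on fetch failures / executor loss
  .getOrCreate()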

You can read some notes on Spark's high availability here.

Update:

As @user6910411 pointed out, there is a case in which you could lose data: if the data source used by your Spark application is not persistent, or only provides temporary data. In those cases, modification of the underlying data while Spark re-reads it for retried tasks may lead to data loss.
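
If that is the situation, one mitigation is to materialize the input before processing, so retried tasks read a stable copy instead of going back to a source that may have changed. A sketch, assuming the application's SparkContext sc and the elementsRDD from the question (the checkpoint path is hypothetical):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def pinInput[T](sc: SparkContext, elementsRDD: RDD[T]): RDD[T] = {
  // A reliable checkpoint cuts the lineage: retried tasks re-read the
  // checkpointed copy instead of recomputing from the original source.
  sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // hypothetical path
  elementsRDD.persist(StorageLevel.MEMORY_AND_DISK)      // avoid computing the input twice
  elementsRDD.checkpoint()                               // materialized on the first action
  elementsRDD
}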

-- Álvaro Valencia
Source: StackOverflow