Running Flink job when a TaskManager is killed / lost

10/1/2017

What I want to achieve is a Flink cluster that automatically re-allocates resources to keep the job running when there is a resource interruption, e.g. a Kubernetes pod scaling down or the loss of an existing TaskManager.

I tested with a Flink cluster consisting of:

  • one JobManager and 2 TaskManagers (2 task slots each),
  • restart strategy RestartStrategies.fixedDelayRestart(2, 2000) (see the sketch after this list),
  • checkpointing and state configured to go to HDFS,
  • the job started with parallelism 4, which utilized all the available slots,
  • the cluster will later run on top of Kubernetes and be managed by autoscaling.
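For reference, a minimal sketch of how the job environment is set up, assuming the Java DataStream API; the checkpoint interval, HDFS path, and class name are placeholders, not my actual values:

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class JobSetup {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Fixed-delay restart: at most 2 restart attempts, 2000 ms between attempts
            env.setRestartStrategy(RestartStrategies.fixedDelayRestart(2, 2000));

            // Checkpoints and state go to HDFS (interval and path are placeholders)
            env.enableCheckpointing(10000);
            env.setStateBackend(new FsStateBackend("hdfs://namenode:8020/flink/checkpoints"));

            // Uses all 4 available task slots (2 TaskManagers x 2 slots each)
            env.setParallelism(4);

            // ... job topology and env.execute(...) go here
        }
    }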

Scenario: When I kill one of the TaskManagers, the Flink cluster keeps running with 1 JobManager and 1 TaskManager. The job then restarts and eventually fails, because it tries to resume with its previous state (parallelism 4) and complains about insufficient resources in the Flink cluster.

Is there a way for me to restart the job so that it dynamically adapts to the resources that are actually available, instead of restarting with the previous state?

I would appreciate it if someone could shed some light on this.

-- coffee_latte1020
apache-flink
kubernetes
scaling

0 Answers