What I want to achieve is a Flink cluster that automatically re-allocates resources to keep the job running when there is a resource interruption, e.g. a Kubernetes pod scale-down or the loss of an existing TaskManager.
I tested with a Flink cluster of:
Scenario: When I kill one of the TaskManagers, the Flink cluster keeps running with 1 JM and 1 TM. The job then restarts and eventually fails, because it restarts from its previous state (parallelism 4) and complains about unavailable resources in the Flink cluster.
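For context, the job pins its parallelism in code, roughly like the sketch below. The parallelism of 4 is from my actual setup; the restart strategy values, source, and class name are placeholders I'm using here just to show the shape of the job.

```java
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ResilienceTestJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Parallelism 4 matches the slots available while both TaskManagers are up.
        env.setParallelism(4);

        // Fixed-delay restarts (attempt count and delay are placeholders): after a
        // TaskManager is killed, the job keeps retrying at parallelism 4 and
        // eventually fails complaining that no resources are available.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));

        // Placeholder source standing in for the real pipeline.
        env.fromSequence(0, Long.MAX_VALUE).print();

        env.execute("resilience-test-job");
    }
}
```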
Is there a way to restart the job so that it dynamically re-allocates to the currently available resources instead of restoring with the previous parallelism?
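To illustrate the intent: the closest workaround I can think of is passing the parallelism in as a job argument at submission time instead of hard-coding it, as in the hypothetical sketch below, but that still means re-submitting manually rather than the cluster adapting on its own.

```java
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ManuallyResizedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical workaround: read the parallelism from the submit command
        // (e.g. --parallelism 2 after a TaskManager is lost) instead of pinning 4.
        ParameterTool params = ParameterTool.fromArgs(args);
        env.setParallelism(params.getInt("parallelism", 1));

        env.fromSequence(0, Long.MAX_VALUE).print();
        env.execute("manually-resized-job");
    }
}
```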
I'd appreciate it if someone could shed some light on this.