I use Kubernetes (OpenShift) to deploy many microservices. I wish to use the same to deploy some of my Flink jobs. The Flink jobs are critical - some are stateless and must process every record exactly once, others are stateful and look for patterns in the stream or react to time. No job can tolerate long downtime or frequent shutdowns (e.g. due to programming errors, or the way Flink quits).
I find the docs mostly lean towards deploying Flink jobs in k8s as a Job Cluster. But how should one take a practical approach to doing it?
Though k8s can restart a failed Flink pod, how can Flink restore its state to recover?
Should the pod be replicated more than once? How do the JobManager & TaskManager work when two or more pods exist? If not, why? Are there other approaches?
From the Flink documentation we have:
Checkpoints allow Flink to recover state and positions in the streams to give the application the same semantics as a failure-free execution.
It means that you need to have checkpoint storage mounted in your pods to be able to recover the state.
In Kubernetes you could use Persistent Volumes to share the data across your pods.
Actually there are a lot of supported plugins, see here.
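As a minimal sketch of that idea (the claim name flink-checkpoints, the mount path /checkpoints and the storage size are assumptions, and your cluster needs a storage class that supports ReadWriteMany), you could back the checkpoint directory with a PersistentVolumeClaim and point Flink's state.checkpoints.dir at the mounted path:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: flink-checkpoints
spec:
  accessModes:
    - ReadWriteMany          # must be readable/writable by the JobManager and all TaskManager pods
  resources:
    requests:
      storage: 10Gi

In the Flink pod templates you would then mount the claim and configure the checkpoint directory in flink-conf.yaml:

      containers:
      - name: taskmanager
        image: flink             # assumption: official Flink image; pin a version in practice
        volumeMounts:
        - name: checkpoints
          mountPath: /checkpoints
      volumes:
      - name: checkpoints
        persistentVolumeClaim:
          claimName: flink-checkpoints

# flink-conf.yaml
state.checkpoints.dir: file:///checkpoints

An object store such as S3 or HDFS, configured through one of the supported plugins mentioned above, avoids the need for a shared volume.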
You can have multiple replicas of the TaskManager, but in Kubernetes you don't need to take care of HA for the JobManager yourself, since you can use a Kubernetes self-healing deployment.
To use a self-healing deployment in Kubernetes you just need to create a Deployment and set replicas to 1, like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - name: http
          containerPort: 80
        imagePullPolicy: IfNotPresent
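For Flink itself, the same pattern could look roughly like this (a sketch only, assuming the official flink image and a Service named flink-jobmanager in front of the JobManager; adjust image tags, ports and resources to your setup): one Deployment for the JobManager with replicas: 1, and one for the TaskManagers with as many replicas as you need.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-jobmanager
spec:
  replicas: 1                    # single JobManager, restarted by Kubernetes on failure
  selector:
    matchLabels:
      app: flink
      component: jobmanager
  template:
    metadata:
      labels:
        app: flink
        component: jobmanager
    spec:
      containers:
      - name: jobmanager
        image: flink             # assumption: official Flink image; pin a version in practice
        args: ["jobmanager"]
        ports:
        - containerPort: 6123    # RPC
        - containerPort: 8081    # web UI
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-taskmanager
spec:
  replicas: 3                    # scale TaskManagers as needed
  selector:
    matchLabels:
      app: flink
      component: taskmanager
  template:
    metadata:
      labels:
        app: flink
        component: taskmanager
    spec:
      containers:
      - name: taskmanager
        image: flink
        args: ["taskmanager"]
        env:
        - name: JOB_MANAGER_RPC_ADDRESS
          value: flink-jobmanager   # assumption: the image honours this variable; otherwise set jobmanager.rpc.address in flink-conf.yaml

The JobManager stays at a single replica and Kubernetes restarts it if it crashes, while the TaskManager Deployment can be scaled by changing replicas.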
Finally, you can check this link to help you set up Flink in Kubernetes:
running-apache-flink-on-kubernetes