Kubernetes Rolling updates for JSR 352 WebSphere Liberty batch

1/27/2020

I have a WebSphere Liberty batch application running on a Kubernetes cluster, scaled up to 10 pods. Each pod has the same code base, and each runs multiple JSR 352 batch jobs. I want to do rolling updates with zero downtime.

As per the docs (https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/ and https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-terminating-with-grace), we can create a preStop hook and trap the SIGTERM to do final cleanup before terminationGracePeriodSeconds expires, after which Kubernetes force-kills the container.
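For reference, here is a minimal Java sketch of what that in-container cleanup could look like: a JVM shutdown hook that runs when SIGTERM reaches the JVM and asks the JSR 352 runtime to stop whatever is still executing. The class name JobDrainHook and the registration point are mine, purely for illustration; whether a hook like this gets enough time is governed by terminationGracePeriodSeconds, and stop() is only a request that the job has to cooperate with.

    import java.util.List;
    import javax.batch.operations.JobExecutionNotRunningException;
    import javax.batch.operations.JobOperator;
    import javax.batch.operations.NoSuchJobException;
    import javax.batch.runtime.BatchRuntime;

    // Hypothetical cleanup hook: registered once at application startup, it
    // runs when the JVM begins shutting down (e.g. after Kubernetes sends
    // SIGTERM) and requests a stop for every execution still running.
    public class JobDrainHook {

        public static void register() {
            Runtime.getRuntime().addShutdownHook(
                    new Thread(JobDrainHook::requestStops, "job-drain-hook"));
        }

        private static void requestStops() {
            JobOperator ops = BatchRuntime.getJobOperator();
            for (String jobName : ops.getJobNames()) {
                try {
                    List<Long> running = ops.getRunningExecutions(jobName);
                    for (Long executionId : running) {
                        ops.stop(executionId); // only a request; the job must cooperate
                    }
                } catch (NoSuchJobException | JobExecutionNotRunningException e) {
                    // Job finished between the lookup and the stop request - nothing to do.
                }
            }
        }
    }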

What I'm wondering is: suppose a batch job is already running when I do an update/redeployment. Is it possible to wait until the running job has completed, or to update only the pods that aren't running jobs?

-- Keshore Durairaj
batch-processing
kubernetes
websphere
websphere-liberty

1 Answer

1/27/2020

I don't know of a way to defer shutdown until the job finishes naturally. Usually people try to stop the running jobs and then restart them later. If you can determine which jobs are running in this server, you could issue a stop for those jobs and wait for them to complete (by checking job status) before allowing the server to terminate. If you stop the server nicely, it will automatically try to stop the jobs, but it only waits a little while (30 seconds?) before just bringing the server down.
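Here is a rough sketch of that stop-and-wait flow using the standard JSR 352 JobOperator API. The class name, the timeout value, and the idea of calling this from your shutdown path (preStop hook or otherwise) are assumptions for illustration, not something Liberty gives you out of the box.

    import java.util.ArrayList;
    import java.util.List;
    import javax.batch.operations.JobExecutionNotRunningException;
    import javax.batch.operations.JobOperator;
    import javax.batch.operations.NoSuchJobException;
    import javax.batch.runtime.BatchRuntime;
    import javax.batch.runtime.BatchStatus;

    // Sketch of "stop the running jobs and wait for them to finish" before
    // letting the server terminate. Returns true if everything reached a
    // terminal state within the timeout.
    public class JobStopAndWait {

        public static boolean stopAndWait(long timeoutMillis) throws InterruptedException {
            JobOperator ops = BatchRuntime.getJobOperator();

            // 1. Find every execution currently running in this server.
            List<Long> running = new ArrayList<>();
            for (String jobName : ops.getJobNames()) {
                try {
                    running.addAll(ops.getRunningExecutions(jobName));
                } catch (NoSuchJobException e) {
                    // Nothing running under this job name.
                }
            }

            // 2. Ask each one to stop (the job still has to cooperate).
            for (Long executionId : running) {
                try {
                    ops.stop(executionId);
                } catch (JobExecutionNotRunningException e) {
                    // Already finished between the lookup and the stop - fine.
                }
            }

            // 3. Poll job status until everything is terminal or we time out.
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (System.currentTimeMillis() < deadline) {
                boolean allDone = true;
                for (Long executionId : running) {
                    BatchStatus status = ops.getJobExecution(executionId).getBatchStatus();
                    if (status != BatchStatus.STOPPED
                            && status != BatchStatus.COMPLETED
                            && status != BatchStatus.FAILED
                            && status != BatchStatus.ABANDONED) {
                        allDone = false;
                    }
                }
                if (allDone) {
                    return true;
                }
                Thread.sleep(1000);
            }
            return false; // timed out; some jobs may still be running
        }
    }

Whatever poll interval and timeout you pick has to fit inside terminationGracePeriodSeconds, or the pod will be killed mid-wait.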

Stopping running jobs is tricky because they don't necessarily stop. If the job is executing a Batchlet, for example, then the Batchlet's stop() method will be driven, but the batch application is free to just ignore that and do nothing (or may not be able to do anything to stop, depending on what the step does).
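For example, a Batchlet that does cooperate with a stop request might look roughly like this (purely illustrative class; a real application would put its own work in place of the loop):

    import javax.batch.api.AbstractBatchlet;
    import javax.inject.Named;

    // Illustrative only: the container drives stop() on a separate thread
    // when the job is asked to stop, but nothing forces the application to
    // react. This Batchlet checks a flag between units of work so it can
    // exit early.
    @Named
    public class CooperativeBatchlet extends AbstractBatchlet {

        private volatile boolean stopRequested = false;

        @Override
        public String process() throws Exception {
            for (int i = 0; i < 1_000; i++) {
                if (stopRequested) {
                    return "STOPPED_EARLY"; // exit status when stopped
                }
                doOneChunkOfWork(i);
            }
            return "COMPLETED";
        }

        @Override
        public void stop() {
            stopRequested = true; // just record the request; process() reacts to it
        }

        private void doOneChunkOfWork(int i) throws Exception {
            Thread.sleep(100); // placeholder for real work
        }
    }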

As far as stopping new work from entering the server, that depends on how the work is getting there (there are several different possibilities).

-- DFollis
Source: StackOverflow