I'm using Azure Kubernetes Service to run a Go application that pulls jobs from RabbitMQ, processes them, and returns the results. The pods scale to handle an increase in jobs. Pretty run-of-the-mill stuff.
The HPA is set up like this:
NAME         REFERENCE               TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
production   Deployment/production   79%/80%   2         10        10         4d11h
staging      Deployment/staging      2%/80%    1         2         1          4d11h
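For context, a minimal HPA manifest that would produce output like the production line above might look like the sketch below. This is an assumption reconstructed from the kubectl output, not the actual manifest; on newer clusters the API version is autoscaling/v2, while older ones use autoscaling/v2beta2.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: production
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # matches the 80% target shown above
```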
As the HPA scales up and down, there are always two pods that stay running. We've found that after running long enough, the Go app on those pods will time out. Sometimes that's days, sometimes it's weeks. Yes, we could probably dig into the code and figure this out, but it's a low priority for that team.
Another solution I've thought of is to have the HPA remove the oldest pods first. That way the oldest pod would never be more than a few hours old: a first-in, first-out model.
However, I don't see any clear way to do that. It's entirely possible that it isn't supported, but it seems like something that could work.
Am I missing something? Is there a way to make this work?
In my opinion (as I also mentioned in a comment), the simplest (if not the most elegant) way is to have a CronJob that periodically cleans up timed-out pods.
One CronJob object is like one line of a crontab (cron table) file. It runs a job periodically on a given schedule, written in Cron format. CronJobs are useful for creating periodic and recurring tasks, like running backups or sending emails. CronJobs can also schedule individual tasks for a specific time, such as scheduling a Job for when your cluster is likely to be idle.
CronJob examples and how-tos:
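As a concrete sketch, a CronJob like the one below deletes the oldest pod of the deployment on a schedule; the ReplicaSet immediately replaces it, so pods get recycled roughly first-in, first-out and never grow old enough to hit the timeout. The namespace, the app=production label selector, the pod-recycler ServiceAccount, and the bitnami/kubectl image are all assumptions you'd adapt to your setup, and the ServiceAccount needs RBAC permissions to get and delete pods.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: recycle-production-pods
spec:
  schedule: "0 */6 * * *"              # every 6 hours
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-recycler   # assumed SA with get/list/delete on pods
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - |
                  # sort pods by creation time (oldest first) and delete the oldest one;
                  # the Deployment's ReplicaSet recreates it right away
                  kubectl delete pod "$(kubectl get pods -l app=production \
                    --sort-by=.metadata.creationTimestamp \
                    -o jsonpath='{.items[0].metadata.name}')"
```

Adjust the schedule and the number of pods deleted per run to how quickly the app degrades; deleting one pod at a time keeps the disruption small while the HPA continues to manage the replica count.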