I have created a mutating webhook that works fine when the resulting pods reach a healthy Running state. But when it is used with pods that ultimately fail (e.g. a bad image name), the scheduler keeps creating more and more, up to 4000 pods that all error out and retry. If I disable the webhook and the pod still fails for the same reason, then only 2 are attempted and they fail in the normal way.
It's as if my webhook is creating "new" pods and not just mutating the ones passed to it. This ONLY happens when the resulting pods fail to run.
So what is it about having the webhook in place that causes so many additional pods to be created when pods fail?
Turns out I had a mistake in the webhook: instead of just adding one extra label to indicate that the mutation was done, it was removing the existing labels, including the ones Kubernetes uses to manage the pod. So when a pod got mutated, its control labels were erased, and consequently the controller that manages the pods (rather than the scheduler, which only assigns pods to nodes) thought no pods had been created and kept creating new ones. Once that was fixed, everything worked normally.
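
For anyone hitting the same thing, here is a minimal sketch of the kind of patch that avoids the problem. It is not my actual webhook code. It assumes the webhook answers with a JSONPatch, and the helper names and the `example.com/mutated` label key are made up for illustration. The point is to add a single key under `/metadata/labels` and only create the whole map when the pod has no labels at all, so existing labels are never overwritten.

```python
import base64
import json


def escape_json_pointer(token: str) -> str:
    # RFC 6901 escaping: "~" becomes "~0" and "/" becomes "~1".
    # Label keys such as "example.com/mutated" contain "/", so this matters.
    return token.replace("~", "~0").replace("/", "~1")


def build_label_patch(pod: dict, key: str, value: str) -> list:
    """Hypothetical helper: JSONPatch ops that add one label, keeping the rest."""
    labels = pod.get("metadata", {}).get("labels")
    if not labels:
        # No labels yet: the map itself has to be created first.
        return [{"op": "add", "path": "/metadata/labels", "value": {key: value}}]
    # Labels exist: add only our key. Replacing the whole map here is
    # the kind of mistake that wipes the labels the controller relies on.
    path = "/metadata/labels/" + escape_json_pointer(key)
    return [{"op": "add", "path": path, "value": value}]


def build_admission_response(uid: str, patch: list) -> dict:
    """Hypothetical helper: wrap the patch in an AdmissionReview response."""
    encoded = base64.b64encode(json.dumps(patch).encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": uid,  # must echo the uid from the incoming request
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": encoded,
        },
    }


if __name__ == "__main__":
    pod = {"metadata": {"labels": {"app": "web", "pod-template-hash": "abc123"}}}
    patch = build_label_patch(pod, "example.com/mutated", "true")
    print(json.dumps(build_admission_response("request-uid", patch), indent=2))
```

A quick way to check a webhook for this bug is to compare the labels on a pod created with and without the webhook enabled (for example with `kubectl get pods --show-labels`); the only difference should be the label the webhook adds.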