Multiple pods of a 600-pod deployment are stuck in ContainerCreating
after a rolling update, with the message:
Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod network: add cmd: failed to assign an IP address to container
What I have tried: checking the CNI metrics, which report:
maxIPAddresses, value: 759.000000
ipamdActionInProgress, value: 1.000000
addReqCount, value: 16093.000000
awsAPILatency, value: 564.000000
delReqCount, value: 32337.000000
eniMaxAvailable, value: 69.000000
assignIPAddresses, value: 558.000000
totalIPAddresses, value: 682.000000
eniAllocated, value: 69.000000
Does the CNI metrics output suggest there's an issue? It seems like there are enough IPs.
What else can I try to debug this?
It seems that you have reached the maximum number of IP addresses in your subnet. The documentation hints at this possibility:
maxIPAddresses: the maximum number of IP addresses that can be used for Pods in the cluster. (assumes there is enough IPs in the subnet).
Please also take a look at the maxUnavailable and maxSurge parameters, which control how many pods exist during a rolling update. Your configuration may allow more than 600 pods at once during the rollout (for example 130%), and that could hit the limits of your AWS network.
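To make the surge arithmetic concrete, here is a rough Python sketch that compares the peak pod count during a rollout with the maxIPAddresses value from the question (759). The helper names are made up for illustration, and the maxSurge values are assumptions, since the actual Deployment spec was not posted. It also assumes this deployment is the only consumer of pod IPs, which is rarely true (DaemonSets and other workloads take addresses too), so the real headroom is probably smaller.

```python
# Hypothetical helpers, not part of any Kubernetes or AWS library.

def resolve_surge(max_surge, replicas):
    """Turn maxSurge (an int or a string like "25%") into a pod count.

    Kubernetes rounds a percentage maxSurge up, so integer ceiling
    division is used here.
    """
    if isinstance(max_surge, str) and max_surge.endswith("%"):
        percent = int(max_surge[:-1])
        return -(-replicas * percent // 100)  # ceiling division
    return int(max_surge)


def peak_pods(replicas, max_surge):
    """Upper bound on pods of this deployment that exist at once during a rollout."""
    return replicas + resolve_surge(max_surge, replicas)


MAX_IP_ADDRESSES = 759  # maxIPAddresses from the metrics in the question
REPLICAS = 600          # deployment size from the question

# "25%" is the Kubernetes default; "30%" matches the 130% example above.
for surge in ("25%", "30%"):
    peak = peak_pods(REPLICAS, surge)
    verdict = "fits" if peak <= MAX_IP_ADDRESSES else "exceeds the limit"
    print(f"maxSurge={surge}: up to {peak} pods, {verdict}")
```

With the default 25% the deployment alone can need up to 750 addresses, which leaves very little room under 759. At 30% it needs 780, which cannot fit. Pods that are still terminating also hold their IPs until they are released, so lowering maxSurge or adding node/subnet capacity is worth trying.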