I have created a Kubernetes cluster using kops on AWS. The cluster comes up without any issues and runs fine for 10-15 hours. I have deployed SAP Vora 2.1 on this cluster. However, generally after 12-15 hours the kops cluster runs into problems with kube-proxy and kube-dns: these pods either go down or show up in a Completed state, and they restart a lot. This eventually causes my application pods to fail, and the application goes down as well. The application uses Consul for service discovery, but since the Kubernetes foundation services are not working properly, the application does not come back to a steady state even if I try to restore the kube-proxy/kube-dns pods.
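For reference, the cluster was created roughly as follows (a sketch; the cluster name, state store, and zone shown here are placeholders, not my actual values):

kops create cluster \
  --name=vora-cluster.example.com \
  --state=s3://example-kops-state-store \
  --zones=ap-southeast-1a \
  --master-count=1 \
  --node-count=2 \
  --networking=kubenet \
  --yes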
This is a 3-node cluster (1 master and 2 nodes) set up in fully autoscaling mode. The overlay network uses the default kubenet. Below is a snapshot of the pod statuses once the system runs into this state:
[root@ip-172-31-18-162 ~]# kubectl get pods --all-namespaces
NAMESPACE    NAME                                                                      READY  STATUS            RESTARTS  AGE
infyvora     vora-catalog-1549734119-cfnhz                                             0/2    CrashLoopBackOff  188       20h
infyvora     vora-consul-0                                                             0/1    CrashLoopBackOff  101       20h
infyvora     vora-consul-1                                                             1/1    Running           34        20h
infyvora     vora-consul-2                                                             0/1    CrashLoopBackOff  95        20h
infyvora     vora-deployment-operator-293895365-4b3t6                                  0/1    Completed         104       20h
infyvora     vora-disk-0                                                               1/2    CrashLoopBackOff  187       20h
infyvora     vora-dlog-0                                                               0/2    CrashLoopBackOff  226       20h
infyvora     vora-dlog-1                                                               1/2    CrashLoopBackOff  155       20h
infyvora     vora-doc-store-2451237348-dkrm6                                           0/2    CrashLoopBackOff  229       20h
infyvora     vora-elasticsearch-logging-v1-444540252-mwfrz                             0/1    CrashLoopBackOff  100       20h
infyvora     vora-elasticsearch-logging-v1-444540252-vrr63                             1/1    Running           14        20h
infyvora     vora-elasticsearch-retention-policy-137762458-ns5pc                       1/1    Running           13        20h
infyvora     vora-fluentd-kubernetes-v1.21-9f4pt                                       1/1    Running           12        20h
infyvora     vora-fluentd-kubernetes-v1.21-s2t1j                                       0/1    CrashLoopBackOff  99        20h
infyvora     vora-grafana-2929546178-vrf5h                                             1/1    Running           13        20h
infyvora     vora-graph-435594712-47lcg                                                0/2    CrashLoopBackOff  157       20h
infyvora     vora-kibana-logging-3693794794-2qn86                                      0/1    CrashLoopBackOff  99        20h
infyvora     vora-landscape-2532068267-w1f5n                                           0/2    CrashLoopBackOff  232       20h
infyvora     vora-nats-streaming-1569990702-kcl1v                                      1/1    Running           13        20h
infyvora     vora-prometheus-node-exporter-k4c3g                                       0/1    CrashLoopBackOff  102       20h
infyvora     vora-prometheus-node-exporter-xp511                                       1/1    Running           13        20h
infyvora     vora-prometheus-pushgateway-399610745-tcfk7                               0/1    CrashLoopBackOff  103       20h
infyvora     vora-prometheus-server-3955170982-xpct0                                   2/2    Running           24        20h
infyvora     vora-relational-376953862-w39tc                                           0/2    CrashLoopBackOff  237       20h
infyvora     vora-security-operator-2514524099-7ld0k                                   0/1    CrashLoopBackOff  103       20h
infyvora     vora-thriftserver-409431919-8c1x9                                         2/2    Running           28        20h
infyvora     vora-time-series-1188816986-f2fbq                                         1/2    CrashLoopBackOff  184       20h
infyvora     vora-tools5tlpt-100252330-mrr9k                                           0/1    rpc error: code = 4 desc = context deadline exceeded  272  17h
infyvora     vora-tools6zr3m-3592177467-n7sxd                                          0/1    Completed         1         20h
infyvora     vora-tx-broker-4168728922-hf8jz                                           0/2    CrashLoopBackOff  151       20h
infyvora     vora-tx-coordinator-3910571185-l0r4n                                      0/2    CrashLoopBackOff  184       20h
infyvora     vora-tx-lock-manager-2734670982-bn7kk                                     0/2    Completed         228       20h
infyvora     vsystem-1230763370-5ckr0                                                  0/1    CrashLoopBackOff  115       20h
infyvora     vsystem-auth-1068224543-0g59w                                             0/1    CrashLoopBackOff  102       20h
infyvora     vsystem-vrep-1427606801-zprlr                                             0/1    CrashLoopBackOff  121       20h
kube-system  dns-controller-3110272648-chwrs                                           1/1    Running           0         22h
kube-system  etcd-server-events-ip-172-31-64-102.ap-southeast-1.compute.internal      1/1    Running           0         22h
kube-system  etcd-server-ip-172-31-64-102.ap-southeast-1.compute.internal             1/1    Running           0         22h
kube-system  kube-apiserver-ip-172-31-64-102.ap-southeast-1.compute.internal          1/1    Running           0         22h
kube-system  kube-controller-manager-ip-172-31-64-102.ap-southeast-1.compute.internal 1/1    Running           0         22h
kube-system  kube-dns-1311260920-cm1fs                                                 0/3    Completed         309       22h
kube-system  kube-dns-1311260920-hm5zd                                                 3/3    Running           39        22h
kube-system  kube-dns-autoscaler-1818915203-wmztj                                      1/1    Running           12        22h
kube-system  kube-proxy-ip-172-31-64-102.ap-southeast-1.compute.internal              1/1    Running           0         22h
kube-system  kube-proxy-ip-172-31-64-110.ap-southeast-1.compute.internal              0/1    CrashLoopBackOff  98        22h
kube-system  kube-proxy-ip-172-31-64-15.ap-southeast-1.compute.internal               1/1    Running           13        22h
kube-system  kube-scheduler-ip-172-31-64-102.ap-southeast-1.compute.internal          1/1    Running           0         22h
kube-system  tiller-deploy-352283156-97hhb                                             1/1    Running           34        22h
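When it gets into this state, I try to recover the foundation pods roughly like this (a sketch; the pod names are the ones from the listing above):

# check why kube-proxy on the affected node keeps crashing
kubectl -n kube-system describe pod kube-proxy-ip-172-31-64-110.ap-southeast-1.compute.internal
kubectl -n kube-system logs --previous kube-proxy-ip-172-31-64-110.ap-southeast-1.compute.internal

# delete the failed kube-dns pod so its Deployment recreates it
kubectl -n kube-system delete pod kube-dns-1311260920-cm1fs

# check recent events for scheduling/restart errors
kubectl -n kube-system get events --sort-by=.metadata.creationTimestamp

Even after the kube-dns pod is recreated, the application pods do not come back to a steady state.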
Has anyone come across a similar issue with a kops Kubernetes cluster on AWS? Any pointers to solving this would be appreciated.
Regards, Deepak