I'm running Fargate on EKS with about 20-30 pods. After a few days (5-7 days; this has happened twice), they begin refusing readiness probe HTTP requests. I captured the pod's description at that time, and I want to point out the first event: connection reset by peer.
I've come across a similar issue reported for Istio, and the root cause may be the same; however, I don't use Istio, so I'm not sure where to go from here. I'm attaching partial data from my ingress, service, and deployment below.
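For what it's worth, hitting the probe endpoint directly reproduces the refusal. A minimal sketch of that check from a throwaway curl pod (the image choice is arbitrary, and 10.104.4.xxx is the masked pod IP from the events below):

# One-off pod that curls the same readiness endpoint the kubelet probes
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -v http://10.104.4.xxx:20001/health_readiness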
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 56m kubelet Readiness probe failed: Get "http://10.104.4.xxx:20001/health_readiness": read tcp 169.254.175.xxx:36978->10.104.4.xxx:20001: read: connection reset by peer
Warning Unhealthy 55m (x3 over 56m) kubelet Liveness probe failed: dial tcp 10.104.4.xxx:20001: connect: connection refused
Normal Killing 55m kubelet Container hybrid-server-logic failed liveness probe, will be restarted
Warning FailedPreStopHook 55m kubelet Exec lifecycle hook ([/bin/bash -c kill -SIGTERM $(ps -ef | grep node | grep -v grep | awk '{print $1}')]) for Container "hybrid-server-logic" in Pod "hybrid-server-logic-745bf8ffc4-479x6_jpj-prod(c4acfaef-a8a6-41e8-9d89-3c03336388b3)" failed - error: rpc error: code = Unknown desc = failed to exec in container: failed to create exec "e92f0b6c6f1dcfa680a03ed3d2dc9b5176980d7b6dce371a8bcbb2c5eb2368fe": mkdir /run/containerd/io.containerd.grpc.v1.cri/containers/hybrid-server-logic/io/168763600: no space left on device, message: ""
Warning Unhealthy 72s (x331 over 56m) kubelet Readiness probe failed: Get "http://10.104.4.xxx:20001/health_readiness": dial tcp 10.104.4.xxx:20001: connect: connection refused
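The FailedPreStopHook message is the real clue: "no space left on device" means the node's filesystem is full, which would also explain the probe failures. This is roughly how one can confirm it from a shell on the underlying instance (a sketch; the paths are the usual containerd/EKS locations):

# Overall filesystem usage; in my case the root volume was full
df -h
# Largest directories under /var; container logs live in /var/log/pods
# and /var/log/containers, containerd state under /var/lib/containerd
sudo du -xh --max-depth=2 /var | sort -rh | head -20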
// ingress
http {
  path {
    path = "/*"
    backend {
      service_name = "my-app-service"
      service_port = 20001
    }
  }
}
// service
name = "my-app-service"
spec {
  port {
    port        = 20001
    protocol    = "TCP"
    target_port = "my-app-port"
  }
  selector = {
    "app" = "my-app"
  }
  type = "NodePort"
}
// deployment
...
ports:
- containerPort: 20001
  name: my-app-port
  protocol: TCP
...
readinessProbe: # on failure, k8s will not forward traffic.
  httpGet:
    path: /health_readiness
    port: my-app-port
  initialDelaySeconds: 20
  periodSeconds: 10
  timeoutSeconds: 5
livenessProbe: # on failure, k8s will restart the container.
  tcpSocket:
    port: my-app-port
  initialDelaySeconds: 10
  periodSeconds: 20
  timeoutSeconds: 5
I looked into the underlying instance and found that the disk was full of log files.
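For anyone hitting the same thing, this is the kind of follow-up I'd run to find which logs are responsible (a sketch; the kubelet config path is the EKS-optimized AMI default, as far as I know):

# Which pods' stdout logs are biggest under the kubelet log tree
sudo du -sh /var/log/pods/* | sort -rh | head -10
# Kubelet rotates stdout logs via containerLogMaxSize / containerLogMaxFiles
# in its KubeletConfiguration (/etc/kubernetes/kubelet/kubelet-config.json
# on the EKS-optimized AMI), so if the disk still filled up, look for files
# the app writes to disk itself:
sudo find / -xdev -name '*.log' -size +100M 2>/dev/null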