k8s pod readiness probe failed: read tcp xxx -> yyy: read: connection reset by peer

4/22/2021

I'm running Fargate on EKS with about 20-30 pods. After a few days (5-7 days; this has happened twice), they begin to refuse Readiness probe HTTP requests. I captured the pod's description at the time, shown below. I want to point out the first event: connection reset by peer.

I've come across this issue in Istio, and the root cause may be the same. However, I don't use Istio, so I'm not sure where to look next. I'm attaching partial data from my ingress, service, and deployment below.

Events:
  Type     Reason             Age                  From     Message
  ----     ------             ----                 ----     -------
  Warning  Unhealthy          56m                  kubelet  Readiness probe failed: Get "http://10.104.4.xxx:20001/health_readiness": read tcp 169.254.175.xxx:36978->10.104.4.xxx:20001: read: connection reset by peer
  Warning  Unhealthy          55m (x3 over 56m)    kubelet  Liveness probe failed: dial tcp 10.104.4.xxx:20001: connect: connection refused
  Normal   Killing            55m                  kubelet  Container hybrid-server-logic failed liveness probe, will be restarted
  Warning  FailedPreStopHook  55m                  kubelet  Exec lifecycle hook ([/bin/bash -c kill -SIGTERM $(ps -ef | grep node | grep -v grep | awk '{print $1}')]) for Container "hybrid-server-logic" in Pod "hybrid-server-logic-745bf8ffc4-479x6_jpj-prod(c4acfaef-a8a6-41e8-9d89-3c03336388b3)" failed - error: rpc error: code = Unknown desc = failed to exec in container: failed to create exec "e92f0b6c6f1dcfa680a03ed3d2dc9b5176980d7b6dce371a8bcbb2c5eb2368fe": mkdir /run/containerd/io.containerd.grpc.v1.cri/containers/hybrid-server-logic/io/168763600: no space left on device, message: ""
  Warning  Unhealthy          72s (x331 over 56m)  kubelet  Readiness probe failed: Get "http://10.104.4.xxx:20001/health_readiness": dial tcp 10.104.4.xxx:20001: connect: connection refused
//ingress
http {
        path {
          path = "/*"
          backend {
            service_name = "my-app-service"
            service_port = 20001
          }
        }
}
// service
name = my-app-service
spec {
    port {
      port        = 20001
      protocol    = "TCP"
      target_port = "my-app-port"
    }
    selector = {
      "app" = "my-app"
    }
    type = "NodePort"
}
// deployment
...
ports:
        - containerPort: 20001
          name: my-app-port
          protocol: TCP
...
readinessProbe: # on failure, k8s will not forward traffic.
          httpGet:
            path: /health_readiness
            port: my-app-port
          initialDelaySeconds: 20
          periodSeconds: 10
          timeoutSeconds: 5
livenessProbe: # on failure, k8s restarts the container.
          tcpSocket:
            port: my-app-port
          initialDelaySeconds: 10
          periodSeconds: 20
          timeoutSeconds: 5
-- sunsets
kubernetes

1 Answer

5/28/2021

I looked into the instance and found that the disk was full because of log files on the machine. That matches the "no space left on device" error in the FailedPreStopHook event above: once the node's disk filled up, containerd could no longer create files, the container stopped responding, and the probes began to fail.
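For anyone hitting the same symptom, a quick way to confirm this is to check disk usage on the node and find the largest log files. A minimal sketch (the log paths below are common defaults, not taken from the question; adjust for your node setup):

```shell
# Overall filesystem usage; a full root volume explains the
# "no space left on device" error in the containerd exec failure.
df -h /

# Find the biggest consumers under the usual log locations.
# /var/log/containers and /var/log/pods are common culprits on
# Kubernetes nodes (assumption; paths vary by distribution).
du -sh /var/log/* 2>/dev/null | sort -rh | head -10
```

Once the offending files are identified, rotating or truncating them (or configuring container log rotation) frees the disk and lets the pods recover without a node replacement.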

-- sunsets
Source: StackOverflow