Intra-host UDP traffic missing after destination-pod re-created

9/25/2019

I'm sending UDP packets (statsd) from pods on a host to <hostIP>:8125. On the other end, a collector (datadog-agent using hostPort; one per host via a DaemonSet) picks up the packets and does its thing.

Generally this works fine, but if I ever delete + re-create the collector (kubectl delete pod datadog-agent-xxxx; a new pod is started on the same IP/port a few seconds later), traffic from existing client sockets stops arriving at the collector (UDP sockets created after the pod rescheduling work fine).

If I instead restart just the agent inside the collector pod (kubectl exec -it datadog-agent-xxxxx agent stop; it auto-restarts after ~30s), the same old traffic does show up again. So the container lifecycle somehow must have an impact.

While UDP is (supposedly) stateless, something, somewhere is obviously keeping state around!? Any ideas/pointers?

Each "client" pod has something like this in the deployment/pod:

kind: Deployment
...
spec:
  template:
    spec:
      containers:
        - name: webservice
          env:
            # Statsd defaults to localhost:8125, but that's this pod. Use `hostPort` on collector + hostIP here to get around that.
            - name: DD_AGENT_HOST
              valueFrom:
                fieldRef:
                  fieldPath: 'status.hostIP'
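
For what it's worth, the symptom is easy to reproduce by hand from inside a client pod. A rough sketch with netcat, assuming nc is available in the client image (the metric name is just a throwaway):

# A fresh socket per datagram always gets through, even right after the
# collector pod has been replaced:
echo -n "repro.fresh:1|c" | nc -u -w1 "$DD_AGENT_HOST" 8125

# A long-lived socket opened before the collector pod is deleted is the one
# that goes silent afterwards (keep this running across the delete/re-create
# and type a few metrics into it):
nc -u "$DD_AGENT_HOST" 8125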

On the collector (following datadog's k8s docs):

kind: DaemonSet
...
spec:
  template:
    spec:
      containers:
        - image: datadog/agent:6.14.0
          ports:
            - containerPort: 8125
              hostPort: 8125
              protocol: UDP
          env:
            - name: DD_DOGSTATSD_NON_LOCAL_TRAFFIC
              value: "true"
            - ...

This happens on Kubernetes 1.12 on Google Kubernetes Engine.

-- Morten Siebuhr
datadog
kubernetes

1 Answer

9/27/2019

This is likely related to this issue in the portmap plugin. The current working theory is that a conntrack entry is created when a client pod first reaches out to the UDP host port, and that entry becomes stale when the server pod is deleted but is never removed, so existing clients keep hitting it and their traffic is essentially blackholed.

You can try removing the conntrack entry with something like conntrack -D -p udp --dport 8125 on one of the impacted hosts. If that solves the issue, then that was the root cause of your problem.
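
For reference, a minimal check-then-flush sequence, run as root on the affected node (assuming conntrack-tools is installed there; otherwise run it from a container that uses the host network namespace and has NET_ADMIN):

# List UDP conntrack entries for the statsd port; stale ones still point at
# the IP of the deleted collector pod (UDP entries that never saw a reply
# are flagged [UNREPLIED]):
conntrack -L -p udp --dport 8125

# Flush them so the next datagram from each client creates a fresh entry
# against the new pod:
conntrack -D -p udp --dport 8125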

The workaround described in the GitHub issue should mitigate the problem until a fix is merged:

You can add an initContainer to the server's pod that runs the conntrack command when the pod starts:

initContainers:
  - image: <conntrack-image>
    imagePullPolicy: IfNotPresent
    name: conntrack
    securityContext:
      allowPrivilegeEscalation: true
      capabilities:
        add: ["NET_ADMIN"]
    command: ['sh', '-c', 'conntrack -D -p udp']
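
In the DaemonSet from the question, that block sits next to containers under spec.template.spec, roughly like this (<conntrack-image> is a placeholder for any image that ships the conntrack binary):

kind: DaemonSet
...
spec:
  template:
    spec:
      initContainers:
        - name: conntrack
          # ... image, securityContext and command as in the snippet above ...
      containers:
        - image: datadog/agent:6.14.0
          # ... ports, env, etc. unchanged from the question ...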
-- Haïssam Kaj
Source: StackOverflow