Fixing DataDog agent congestion issues in Amazon EKS cluster

6/23/2021

A few months ago I integrated DataDog into my Kubernetes cluster by using a DaemonSet configuration. Since then I've been getting congestion alerts with the following message:

Please tune the hot-shots settings https://github.com/brightcove/hot-shots#errors

From attempting to follow the docs with my limited orchestration/DevOps knowledge, what I could gather is that I need to add the following to my DaemonSet config:

spec:
  ...
  securityContext:
    sysctls:
      - name: net.unix.max_dgram_qlen
        value: "1024"
      - name: net.core.wmem_max
        value: "4194304"
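For orientation, here is a minimal sketch of where that pod-level securityContext would sit in a full DaemonSet manifest (the metadata names and image tag are illustrative, not taken from the actual DataDog manifest):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: datadog-agent        # illustrative name
spec:
  selector:
    matchLabels:
      app: datadog-agent
  template:
    metadata:
      labels:
        app: datadog-agent
    spec:
      # Pod-level securityContext: sysctls apply to the pod's namespaces.
      securityContext:
        sysctls:
          - name: net.unix.max_dgram_qlen
            value: "1024"
          - name: net.core.wmem_max
            value: "4194304"
      containers:
        - name: agent
          image: gcr.io/datadoghq/agent:7   # illustrative tag
```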

To try it out, I attempted to add that configuration directly to one of the auto-deployed DataDog pods (rather than editing the DaemonSet and risking bringing all agents down), but the edit hangs indefinitely and the configuration never saves.

The hot-shots documentation also mentions that the above sysctl configuration requires unsafe sysctls to be enabled on the nodes that run the pods:

kubelet --allowed-unsafe-sysctls \
  'net.unix.max_dgram_qlen,net.core.wmem_max'
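As a quick sanity check, the current limits can be read directly from /proc on a node before deciding to raise them (the typical defaults mentioned here are an assumption about common kernel builds, not something from the hot-shots docs):

```shell
# Read the current kernel limits directly from /proc.
# Typical defaults are 512 for max_dgram_qlen and ~212992 for wmem_max.
cat /proc/sys/net/unix/max_dgram_qlen
cat /proc/sys/net/core/wmem_max
```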

The cluster I am working with is fully deployed with Amazon EKS via the AWS console (I have little knowledge of how it is configured), whereas the above seems to be intended for manually deployed and managed clusters.

  • Why is the configuration I am attempting to apply to a single DataDog agent pod not saving/applying? Is it because the pod is managed by a DaemonSet, because the node doesn't have the proper unsafe sysctls allowed, or something else?
  • If I do need to enable the suggested unsafe sysctls on all nodes of my cluster, how do I go about it, given that the cluster is fully deployed and managed by Amazon EKS?
-- ALostBegginer
amazon-eks
amazon-web-services
datadog
kubernetes

1 Answer

12/14/2021

We managed to achieve this by using a custom launch template with our managed node group and passing in a custom bootstrap script. This does mean, however, that you need to supply the AMI ID yourself and you lose the console alerts when it is outdated. In Terraform this looks like:

resource "aws_eks_node_group" "group" {
  ...
  launch_template {
    id      = aws_launch_template.nodes.id
    version = aws_launch_template.nodes.latest_version
  }
  ...
}

data "template_file" "bootstrap" {
  template = file("${path.module}/files/bootstrap.tpl")
  vars = {
    cluster_name        = aws_eks_cluster.cluster.name
    cluster_auth_base64 = aws_eks_cluster.cluster.certificate_authority.0.data
    endpoint            = aws_eks_cluster.cluster.endpoint
  }
}

data "aws_ami" "eks_node" {
  owners      = ["602401143452"]
  most_recent = true

  filter {
    name   = "name"
    values = ["amazon-eks-node-1.21-v20211008"]
  }
}

resource "aws_launch_template" "nodes" {
  ...
  image_id = data.aws_ami.eks_node.id
  user_data = base64encode(data.template_file.bootstrap.rendered)
  ...
}
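On recent Terraform versions the template_file data source is deprecated in favour of the built-in templatefile() function; a sketch of the same launch template under that assumption, reusing the same bootstrap.tpl and variables:

```hcl
resource "aws_launch_template" "nodes" {
  ...
  image_id = data.aws_ami.eks_node.id

  # Render bootstrap.tpl inline instead of via a template_file data source.
  user_data = base64encode(templatefile("${path.module}/files/bootstrap.tpl", {
    cluster_name        = aws_eks_cluster.cluster.name
    cluster_auth_base64 = aws_eks_cluster.cluster.certificate_authority.0.data
    endpoint            = aws_eks_cluster.cluster.endpoint
  }))
  ...
}
```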

Then the bootstrap.tpl template looks like this:

#!/bin/bash

set -o xtrace

systemctl stop kubelet
/etc/eks/bootstrap.sh '${cluster_name}' \
  --b64-cluster-ca '${cluster_auth_base64}' \
  --apiserver-endpoint '${endpoint}' \
  --kubelet-extra-args '--allowed-unsafe-sysctls=net.unix.max_dgram_qlen'

The next step is to set up a PodSecurityPolicy, ClusterRole, and RoleBinding in your cluster so you can use the securityContext as described above; pods in that namespace will then be able to run without a SysctlForbidden message.

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: sysctl
spec:
  allowPrivilegeEscalation: false
  allowedUnsafeSysctls:
  - net.unix.max_dgram_qlen
  defaultAllowPrivilegeEscalation: false
  fsGroup:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  volumes:
- '*'
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: allow-sysctl
rules:
- apiGroups:
  - policy
  resourceNames:
  - sysctl
  resources:
  - podsecuritypolicies
  verbs:
- '*'
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: allow-sysctl
  namespace: app-namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: allow-sysctl
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:serviceaccounts:app-namespace

If you are using the DataDog Helm chart, you can set the following values to update the agent's securityContext, but you will have to update the chart's PSP manually to set allowedUnsafeSysctls:

datadog:
  securityContext:
    sysctls:
      - name: net.unix.max_dgram_qlen
        value: "512"
-- b3n
Source: StackOverflow