I deployed the Datadog agent using the Datadog Helm chart, which deploys a DaemonSet
in Kubernetes. However, when checking the state of the DaemonSet, I saw it was not creating all pods:
NAME                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
datadog-agent-datadog   5         2         2       2            2           <none>          1h
When describing the DaemonSet
to figure out what was going wrong, I saw that the nodes did not have enough resources:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedPlacement 42s (x6 over 42s) daemonset-controller failed to place pod on "ip-10-0-1-124.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1810, capacity: 2000
Warning FailedPlacement 42s (x6 over 42s) daemonset-controller failed to place pod on "<ip>": Node didn't have enough resource: cpu, requested: 200, used: 1810, capacity: 2000
Warning FailedPlacement 42s (x5 over 42s) daemonset-controller failed to place pod on "<ip>": Node didn't have enough resource: cpu, requested: 200, used: 1860, capacity: 2000
Warning FailedPlacement 42s (x7 over 42s) daemonset-controller failed to place pod on "<ip>": Node didn't have enough resource: cpu, requested: 200, used: 1860, capacity: 2000
Normal SuccessfulCreate 42s daemonset-controller Created pod: datadog-agent-7b2kp
However, I have the Cluster-autoscaler installed in the cluster and configured properly (it does trigger on regular Pod
deployments that do not have enough resources to schedule), but it does not seem to trigger on the DaemonSet:
I0424 14:14:48.545689 1 static_autoscaler.go:273] No schedulable pods
I0424 14:14:48.545700 1 static_autoscaler.go:280] No unschedulable pods
The AutoScalingGroup has enough nodes left.
Did I miss something in the configuration of the Cluster-autoscaler? What can I do to make sure it triggers on DaemonSet
resources as well?
Edit: Output of describing the DaemonSet:
Name:           datadog-agent
Selector:       app=datadog-agent
Node-Selector:  <none>
Labels:         app=datadog-agent
                chart=datadog-1.27.2
                heritage=Tiller
                release=datadog-agent
Annotations:    deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 5
Current Number of Nodes Scheduled: 2
Number of Nodes Scheduled with Up-to-date Pods: 2
Number of Nodes Scheduled with Available Pods: 2
Number of Nodes Misscheduled: 0
Pods Status:  2 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app=datadog-agent
  Annotations:      checksum/autoconf-config: 38e0b9de817f645c4bec37c0d4a3e58baecccb040f5718dc069a72c7385a0bed
                    checksum/checksd-config: 38e0b9de817f645c4bec37c0d4a3e58baecccb040f5718dc069a72c7385a0bed
                    checksum/confd-config: 38e0b9de817f645c4bec37c0d4a3e58baecccb040f5718dc069a72c7385a0bed
  Service Account:  datadog-agent
  Containers:
   datadog:
    Image:      datadog/agent:6.10.1
    Port:       8125/UDP
    Host Port:  0/UDP
    Limits:
      cpu:     200m
      memory:  256Mi
    Requests:
      cpu:     200m
      memory:  256Mi
    Liveness:  http-get http://:5555/health delay=15s timeout=5s period=15s #success=1 #failure=6
    Environment:
      DD_API_KEY:                  <set to the key 'api-key' in secret 'datadog-secret'>  Optional: false
      DD_LOG_LEVEL:                INFO
      KUBERNETES:                  yes
      DD_KUBERNETES_KUBELET_HOST:  (v1:status.hostIP)
      DD_HEALTH_PORT:              5555
    Mounts:
      /host/proc from procdir (ro)
      /host/sys/fs/cgroup from cgroups (ro)
      /var/run/docker.sock from runtimesocket (ro)
      /var/run/s6 from s6-run (rw)
  Volumes:
   runtimesocket:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/docker.sock
    HostPathType:
   procdir:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:
   cgroups:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/fs/cgroup
    HostPathType:
   s6-run:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedPlacement 33m (x6 over 33m) daemonset-controller failed to place pod on "ip-10-0-2-144.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1810, capacity: 2000
Normal SuccessfulCreate 33m daemonset-controller Created pod: datadog-agent-7b2kp
Warning FailedPlacement 16m (x25 over 33m) daemonset-controller failed to place pod on "ip-10-0-1-124.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1810, capacity: 2000
Warning FailedPlacement 16m (x25 over 33m) daemonset-controller failed to place pod on "ip-10-0-2-174.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1860, capacity: 2000
Warning FailedPlacement 16m (x25 over 33m) daemonset-controller failed to place pod on "ip-10-0-3-250.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1860, capacity: 2000
You can add a priorityClassName to your DaemonSet that points to a high-priority PriorityClass. Kubernetes will then preempt (evict) lower-priority pods in order to run the DaemonSet's pods. If that leaves the evicted pods unschedulable, cluster-autoscaler should add a node to schedule them on.
See the Kubernetes docs on pod priority and preemption (most of the examples here are based on them). For some pre-1.14 versions, the apiVersion is likely a beta (1.11-1.13) or alpha (1.8-1.10) version instead.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "High priority class for essential pods"
Apply it to your workload:
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: datadog-agent
spec:
  template:
    metadata:
      labels:
        app: datadog-agent
      name: datadog-agent
    spec:
      priorityClassName: high-priority
      serviceAccountName: datadog-agent
      containers:
      - image: datadog/agent:latest
        ############ Rest of template goes here
You should understand how the cluster autoscaler works. It is responsible only for adding or removing nodes. It is not responsible for creating or destroying pods. In your case the cluster autoscaler does nothing because adding a node would not help: even with one more node, the DaemonSet pods would still have to run on the existing nodes that do not have enough CPU. That's why it is not adding nodes.
What you should do is manually remove some pods from the full nodes. The DaemonSet pods can then be scheduled there.
Alternatively, you can reduce the CPU requests of the Datadog agent to, for example, 100m or 50m. This should be enough to start those pods.
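As a rough sketch of the reduced-request option, the container's resources in the DaemonSet pod template could look like the fragment below. The 100m value is the one suggested above; leaving the memory values and the CPU limit as they appear in the question's describe output is an assumption, not a tested recommendation. Going by the FailedPlacement events, the fullest nodes report 1860 of 2000 CPU in use, which leaves 140m free, so a 100m request would fit where the current 200m does not.

```yaml
# Fragment of the DaemonSet pod template (spec.template.spec), not a complete manifest.
containers:
- name: datadog
  image: datadog/agent:6.10.1
  resources:
    requests:
      cpu: 100m        # lowered from 200m, as suggested above
      memory: 256Mi    # unchanged from the question's describe output
    limits:
      cpu: 200m        # assumption: limit left as in the question
      memory: 256Mi
```

Since the DaemonSet is managed by the Helm chart, a direct edit would likely be overwritten on the next upgrade; the same change would normally be made through the chart's values, whose exact keys depend on the chart version.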