We have Prometheus running on Kubernetes, but it won't start anymore because the node can no longer satisfy its RAM requirements (CPU is close to the limit as well). Since this is all new to me, I'm not sure which approach to take. I tried deploying the container with a slightly increased RAM limit (the node has 16Gi; I increased the limit from 145xxMi to 15Gi; see the spec excerpt after the events below). The pod status is constantly Pending.
Normal NotTriggerScaleUp 81s (x16 over 5m2s) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 3 node(s) didn't match node selector, 2 Insufficient memory
Warning FailedScheduling 80s (x6 over 5m23s) default-scheduler 0/10 nodes are available: 10 Insufficient memory, 6 node(s) didn't match node selector, 9 Insufficient cpu.
Normal NotTriggerScaleUp 10s (x14 over 5m12s) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 Insufficient memory, 3 node(s) didn't match node selector
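For context, this is roughly the resources section I changed; only the memory figures match my setup, the CPU values are placeholders:

```yaml
# Excerpt from the Prometheus container spec (approximate).
resources:
  requests:
    memory: 15Gi   # raised from 145xxMi; node capacity is 16Gi
    cpu: "1"       # placeholder, actual value differs
  limits:
    memory: 15Gi
    cpu: "1"
```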
These are the logs from when Prometheus crashed and would no longer start; `kubectl describe pod` also reported memory usage at 99%:
level=info ts=2020-10-09T09:39:34.745Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=53476 maxSegment=53650
level=info ts=2020-10-09T09:39:38.518Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=53477 maxSegment=53650
level=info ts=2020-10-09T09:39:41.244Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=53478 maxSegment=53650
What can I do to solve this issue? Note there is no autoscaling in place.
Do I scale up the EC2 worker nodes manually? Do I do something else?
The message from the cluster autoscaler reveals the problem:

pod didn't trigger scale-up (it wouldn't fit if a new node is added)
Even if the cluster autoscaler added a new node to the cluster, the Prometheus pod still would not fit on it.
This is likely because EKS nodes reserve part of the 16Gi for system components (the kubelet's kube-reserved and system-reserved allowances), so the allocatable capacity is less than 15Gi. That is why the Prometheus pod no longer fits after its memory request was increased.
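You can confirm this by comparing a node's total capacity with its allocatable memory, which is what the scheduler actually uses (a quick check; `<node-name>` is a placeholder for one of your nodes):

```sh
# Print total memory capacity vs. scheduler-allocatable memory for one node.
kubectl get node <node-name> \
  -o jsonpath='{.status.capacity.memory}{"\n"}{.status.allocatable.memory}{"\n"}'
```

If the allocatable figure comes back below 15Gi, the scheduler can never place the pod, no matter how many identical nodes are added.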
To solve this, you can either decrease the memory request on the Prometheus pod or add larger nodes with more memory available.
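For the first option, a minimal sketch of the change, assuming you can edit the Prometheus pod spec (or the equivalent Helm values) directly; the exact figure should sit safely below the node's allocatable memory:

```yaml
# Sketch: size the request below the node's allocatable memory.
# 14Gi is illustrative; derive the real value from the check above.
resources:
  requests:
    memory: 14Gi
  limits:
    memory: 14Gi
```

As a side note, if requests equal limits for both CPU and memory across all containers, the pod gets Guaranteed QoS, which makes it less likely to be evicted under node memory pressure.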