K8s: CPU of node dropping to 1 after roughly 1h of computation

3/22/2020

Architecture

I use Kubernetes to process jobs with a crawler/scraper: 4 nodes (8 GB and 4 CPUs each), no cluster autoscaler, on DigitalOcean.

There are 3 different containers:

  1. A Redis queue (1 replica), to store the IDs and their associated service.
  2. Cron.js running in a container (1 replica, limits: 0.5 CPU / 1000 Mi), adding 37,000 IDs to the Redis queue once a day.
  3. Crawler.js, which crawls items with Puppeteer. It takes an ID from the Redis queue, scrapes the data and then removes the ID from the queue. It uses Node.js child processes to speed up the crawling, with 2 workers.

\=> The crawler uses 1 replica, and an HPA scales it up to 15 pods when CPU goes above 75% (i.e. when it starts processing jobs in the queue). Crawler resources: requests: 300 Mi memory / 0.3 CPU; limits: 450 Mi / 0.4 CPU.
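For reference, the crawler setup above corresponds roughly to the following manifests (a minimal sketch: the name `crawler`, the labels and the image are placeholders chosen for illustration; the replica count, resources and HPA numbers are the ones stated above):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: crawler                          # placeholder name
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: crawler
      template:
        metadata:
          labels:
            app: crawler
        spec:
          containers:
            - name: crawler
              image: example.com/crawler:latest   # placeholder image
              resources:
                requests:
                  cpu: 300m                  # 0.3 CPU
                  memory: 300Mi
                limits:
                  cpu: 400m                  # 0.4 CPU
                  memory: 450Mi
    ---
    apiVersion: autoscaling/v1
    kind: HorizontalPodAutoscaler
    metadata:
      name: crawler
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: crawler
      minReplicas: 1
      maxReplicas: 15
      targetCPUUtilizationPercentage: 75     # scale up above 75% of the CPU request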

When it crawls, a crawler container always consumes 100% of its CPU request and gets close to its limits.


Expected behavior:

The crawler should start crawling items once a day, use a lot of CPU, and trigger the HPA so that 15 pods are running to crawl items. It should crawl for a few hours (consuming every ID in the queue), then go back to a normal state (1 replica) once the crawling is over.

With a queue of 1,000 items, it takes around 10 minutes to crawl and everything works.
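As a rough sanity check (assuming the 1,000-item run happened at the same scale): 1,000 items in ~10 minutes is about 100 items per minute, so 37,000 items should take roughly 37,000 / 100 ≈ 370 minutes, i.e. a bit over 6 hours, which is consistent with the "few hours" expectation above.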

Problem

Using 37,000 items in the queue: after ~50 minutes, the CPU of every crawler pod on a node drops to 1m (0%) while memory stays around the request. It happens first on one node, then 10 minutes later on another one as well (image from Grafana/Prometheus: graphical view of the CPU going down; you can clearly see some pods losing CPU, and they all belong to the same node). Memory stays almost normal. The crawler hangs indefinitely at starting the Puppeteer browser and does not crawl any more.

Healthy state, everything running fine

Pods on one node have their CPU going down; you can see that the memory stays stable

Graph of a Pod with CPU 1/0%

2nd node failing

Node usage (kubectl top nodes):

    NAME        CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
    ...e-zwl2   2261m        56%    2456Mi          36%
    ...e-zwll   2240m        56%    2809Mi          41%
    ...e-zwpb   123m         3%     2562Mi          38%
    ...e-zwss   159m         3%     2878Mi          43%

After ~1h of crawling:

    NAME        CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
    ...e-zwl2   1996m        49%    2395Mi          35%
    ...e-zwll   2106m        52%    2925Mi          43%
    ...e-zwpb   204m         5%     2554Mi          38%
    ...e-zwss   129m         3%     2875Mi          42%

Any ideas how to debug this?

What I tried:

  • Increased the requests and limits of the pods.
  • Reduced the number of workers to 1.
  • Added more nodes (8 GB / 4 CPU).
  • Ran it at different times of the day.
  • Removed the HPA.
  • Killed a non-working pod: it restarts and crawls normally.
  • Checked Redis CPU/memory and logs: everything is normal.
  • Added more CPU and memory to the crawler pods.

Side Notes:

Something I noticed: the Cilium pods in kube-system restart sometimes. I don't know whether it is related to my problem or worth mentioning.

The crawler is not new; it was written to run in a Docker container with pm2 and works fine in production. We are moving it to Kubernetes.

Normally I would use 2 other microservices and the cluster autoscaler, but to simplify I stick to the 3 microservices related to the bug. I also disabled storing the scraped data in an external DB, to simplify the problem and highlight the bug.

-- François
kubernetes
node.js
redis
web-crawler
web-scraping

0 Answers