I use Kubernetes to process crawler/scraper jobs. The cluster runs on DigitalOcean with 4 nodes (8 GB RAM and 4 CPUs each) and no Cluster Autoscaler.
There are 3 different containers:
=> Crawler: 1 replica, with an HPA that scales up to 15 pods when CPU goes above 75% (i.e. when it starts processing jobs from the queue). Crawler resources: requests: 300Mi memory, 0.3 CPU; limits: 450Mi memory, 0.4 CPU.
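For reference, this is roughly what the crawler's HPA and resources look like as manifests (a sketch only; the names and the exact Deployment layout are placeholders, not the real ones):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: crawler                    # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: crawler                  # placeholder Deployment name
  minReplicas: 1
  maxReplicas: 15
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75   # scale up when average CPU passes 75% of requests

And the resources block on the crawler container in its Deployment:

resources:
  requests:
    memory: 300Mi
    cpu: 300m
  limits:
    memory: 450Mi
    cpu: 400m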
While crawling, a crawler container consistently uses 100% of its CPU request and gets close to its CPU limit.
Expected behaviour:
The crawler starts crawling items once a day, uses a lot of CPU and triggers the HPA, so 15 pods end up running to crawl items. It should crawl for a few hours (consuming every ID in the queue), then go back to a normal state (1 replica) when the crawling is over.
With a queue of 1,000 items, crawling takes around 10 minutes and everything works.
With 37,000 items in the queue: after ~50 minutes, the CPU of every crawler on one node drops to almost zero (0%) while memory stays around the request. It happens first on one node, then about 10 minutes later on another one as well (image from Grafana/Prometheus: a graphical view of the CPU dropping; you can clearly see some pods losing CPU, and they all belong to the same node). Memory stays almost normal. The crawler hangs indefinitely on starting the Puppeteer browser and does not crawl any more.
Healthy state, everything running fine
Pods on one node have their CPU going down; you can see that memory stays stable
Nodes (kubectl top nodes):

NAME        CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
...e-zwl2   2261m        56%    2456Mi          36%
...e-zwll   2240m        56%    2809Mi          41%
...e-zwpb   123m         3%     2562Mi          38%
...e-zwss   159m         3%     2878Mi          43%

NAME        CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
...e-zwl2   1996m        49%    2395Mi          35%
...e-zwll   2106m        52%    2925Mi          43%
...e-zwpb   204m         5%     2554Mi          38%
...e-zwss   129m         3%     2875Mi          42%
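To get the pod-level equivalent of the Grafana view from the command line (assuming the crawler pods carry a label like app=crawler, which is a placeholder for whatever label the Deployment actually uses):

kubectl top pods -l app=crawler --sort-by=cpu     # per-pod CPU/memory, highest first
kubectl describe node <node-name>                 # conditions and allocated resources on an affected node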
Any ideas how to debug this?
What I tried:
Side Notes:
Something I noticed: cilium pods in kube-system are restarting sometimes; I don't know if it is related to my problem or worth mentioning (see the check commands sketched after these notes).
The crawler is not new; it was written to run in a Docker container with pm2 and works fine in production. We are moving it to Kubernetes.
Normally I would also use 2 other microservices and the Cluster Autoscaler, but to simplify I keep only the 3 microservices related to the bug. I also disabled storing the scraped data in an external DB, to simplify the problem and highlight the bug.
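To look at the cilium restarts mentioned above (the k8s-app=cilium label is an assumption about how the DigitalOcean-managed cilium DaemonSet is labelled):

kubectl -n kube-system get pods -l k8s-app=cilium          # restart counts
kubectl -n kube-system describe pod <cilium-pod-name>      # last state / reason of the restarted container
kubectl -n kube-system logs <cilium-pod-name> --previous   # logs from the previous (crashed) container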