We have a service that consumes messages from SQS. Each message contains about 2K image links on average, and every image should be downloaded and then uploaded to S3. The service is written in Node.js, containerized with Docker, and deployed to a k8s cluster with 10 replicas. We use Grafana to monitor the k8s pods. After a random amount of time, the graph for some of the pods looks like the following:
At the same time, the number of messages in the queue we consume from keeps increasing, which means the process is in a state where it does no work but also throws no error. If I restart the process, it works for a while and then becomes idle again (if I can call this state idle). I don't know what to do, and I don't know what this phenomenon is called. How can I investigate it? Any help would be appreciated.
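One way to narrow down where a pod gets stuck is to record progress inside the consumer itself. The sketch below is a hypothetical helper (`ProgressWatchdog` and its methods are illustrative names, not part of any library or of the service): it counts how many messages are in flight and when one last finished, and reports a stall when work is pending but nothing has completed for a configurable time. Timestamps can be passed in explicitly so the logic can be checked without waiting.

```typescript
// Hypothetical progress watchdog (illustrative, not an existing library):
// tracks in-flight messages and the time of the last completion, and reports
// "stalled" only when work is pending but nothing has finished recently.
class ProgressWatchdog {
  private readonly stallAfterMs: number;
  private inFlight = 0;
  private lastProgressAt: number;

  constructor(stallAfterMs: number, now: number = Date.now()) {
    this.stallAfterMs = stallAfterMs;
    this.lastProgressAt = now;
  }

  /** Call when a message is received from the queue. */
  started(now: number = Date.now()): void {
    // Coming out of an idle period (empty queue) is not a stall, so restart the clock.
    if (this.inFlight === 0) this.lastProgressAt = now;
    this.inFlight++;
  }

  /** Call when processing of a message ends, whether it succeeded or failed. */
  finished(now: number = Date.now()): void {
    this.inFlight = Math.max(0, this.inFlight - 1);
    this.lastProgressAt = now;
  }

  /** True when messages are in flight but none has finished for `stallAfterMs`. */
  isStalled(now: number = Date.now()): boolean {
    return this.inFlight > 0 && now - this.lastProgressAt > this.stallAfterMs;
  }

  /** Numbers worth logging periodically (e.g. next to the Grafana metrics). */
  snapshot(now: number = Date.now()): { inFlight: number; msSinceProgress: number } {
    return { inFlight: this.inFlight, msSinceProgress: now - this.lastProgressAt };
  }
}
```

In the consumer, `started()` would be called when a message is received and `finished()` when its processing ends, with a `setInterval` that logs `snapshot()` whenever `isStalled()` is true. A stalled log with messages in flight points toward an awaited download or upload that never settles; a log showing nothing in flight while the queue grows points toward the SQS polling loop itself having stopped. Wiring `isStalled()` into a k8s liveness probe would restart stuck pods automatically, but that only hides the symptom rather than explaining it.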
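A common way for a Node.js worker to go quiet without throwing is a network request that never settles; whether that is what happens here is an assumption to verify, not a diagnosis. The following sketch uses hypothetical helpers (`withTimeout` and `mapWithConcurrency` are illustrative names, not the service's actual code) to process the image links of one message with a fixed concurrency limit and a hard timeout per task, so a hung download or upload becomes a visible error instead of a silent stall.

```typescript
// Hypothetical helpers (not the service's actual code): process the image links
// of one SQS message with bounded concurrency and a hard per-task timeout.

type Outcome<R> = { ok: true; value: R } | { ok: false; error: unknown };

/** Rejects with an Error if `promise` has not settled within `ms` milliseconds. */
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
    promise.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

/**
 * Runs `worker` over `items` with at most `limit` tasks in flight and records
 * every success or failure, so one bad link cannot reject the whole batch.
 */
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<Outcome<R>[]> {
  const results: Outcome<R>[] = new Array(items.length);
  let next = 0;

  // Each lane keeps claiming the next unprocessed index until none remain.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      try {
        results[i] = { ok: true, value: await worker(items[i]) };
      } catch (error) {
        results[i] = { ok: false, error };
      }
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}
```

For one message this could look like `mapWithConcurrency(links, 20, (url) => withTimeout(downloadAndUpload(url), 30000, url))`, where `downloadAndUpload`, the limit of 20, and the 30-second timeout are placeholders. Note that `withTimeout` only stops waiting; it does not abort the underlying request, so where the HTTP or S3 client offers its own timeout or abort option, that is preferable.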