I have deployed PagerBot (https://github.com/stripe-contrib/pagerbot) to our internal k8s cluster as a learning opportunity, and I had fun writing a Helm chart for it!
The bot appears to disconnect from Slack at some unpredictable time and never reconnects. When I kill the pod, the Deployment recreates it and the bot connects again (we are using the Slack RTM option).
The pod logs the following entry when it disconnects:
2018-02-24 02:31:14.382590 I [9:34765020] PagerBot::SlackRTMAdapter -- Closed connection to chat. --
I want to learn a way to monitor for this log entry and take action. Initially I thought a liveness probe would be the way to go, using a command that returns non-zero when this entry is logged, but the logs aren't stored inside the container (as far as I can see).
How do you monitor and take action based on logs that can be seen using kubectl logs pod-name?
Can I achieve this in our Prometheus test deployment? Should I be using a known k8s feature?
I would argue the best course of action is to extend pagerbot to surface more than just the string literal pong in its /ping endpoint, then use that endpoint as the container's livenessProbe. A close second would be to teach the bot to simply reconnect, as that's almost certainly cheaper than tearing down the Pod.
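For illustration, a minimal sketch of what that could look like, assuming the /ping endpoint were changed to return a non-200 status while the RTM connection is down, and assuming the bot's web server listens on port 4567 inside the container (both are assumptions, not something pagerbot does today):

livenessProbe:
  httpGet:
    path: /ping
    port: 4567            # assumed container port for pagerbot's web server
  initialDelaySeconds: 60
  periodSeconds: 30
  failureThreshold: 3     # restart only after three consecutive failed checks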
Having said that, one approach you may consider is a sidecar container that uses the Pod's service account credentials to monitor the sibling container's logs (akin to an if kubectl logs -f -c pagerbot $my_pod_name | grep -q "Closed connection to chat"; then kill -9 $pagerbot_pid; fi type of deal). That is a little awkward, but I can't immediately think of a reason it wouldn't work.
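A rough sketch of such a sidecar, to make the idea concrete: the container name, image, and RBAC are assumptions, and this variant deletes the whole Pod (which the Deployment then recreates, mirroring the manual workaround) rather than signalling the sibling process, since the latter would additionally need shareProcessNamespace: true.

# assumed extra entry under the Deployment's spec.template.spec.containers;
# the Pod's service account needs RBAC to get pods/log and delete pods
- name: log-watcher
  image: bitnami/kubectl:latest    # assumed image providing kubectl and bash
  command:
    - bash
    - -c
    - |
      # block until the disconnect message shows up in the sibling's logs,
      # then delete this Pod so the Deployment replaces it
      kubectl logs -f -c pagerbot "$POD_NAME" \
        | grep -q "Closed connection to chat" \
        && kubectl delete pod "$POD_NAME"
  env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name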
I ended up landing on a liveness probe to solve my problem. I've added the following to the pageyBot Deployment:
livenessProbe:
  exec:
    command:
      - bash
      - -c
      - "ss -an | grep -q 'EST.*:443 *'"
  initialDelaySeconds: 120
  periodSeconds: 60
This basically tests whether there is an established connection on port 443, which we noticed goes away when the bot disconnects.
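To sanity-check the probe command before relying on it, you can run the same check by hand against the running pod (the pod name below is a placeholder):

kubectl exec <pagerbot-pod-name> -- bash -c "ss -an | grep 'EST.*:443 *'"

While the bot is connected this prints the established connection and exits 0; once the connection drops it exits non-zero, and after enough consecutive probe failures the kubelet restarts the container.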