Monitor and take action based on pod log event

2/27/2018

I have deployed PagerBot (https://github.com/stripe-contrib/pagerbot) to our internal k8s cluster as a learning opportunity. I had fun writing a helm chart for it!

The bot appears to disconnect from Slack at an unknown time and never reconnects. When I kill the pod, the deployment recreates it and it connects again (we are using the Slack RTM option).

The pod logs the following entry when it disconnects:

2018-02-24 02:31:14.382590 I [9:34765020] PagerBot::SlackRTMAdapter -- Closed connection to chat. --

I want to learn a method of monitoring for this log entry and taking action. Initially I thought a liveness probe would be the way to go, using a command that returns non-zero once this entry has been logged. But the logs aren't stored inside the container (as far as I can see).

How do you monitor and take action based on logs that can be seen using kubectl logs pod-name?

Can I achieve this in our Prometheus test deployment? Should I be using a known k8s feature?

-- SinFulNard
kubernetes
prometheus

2 Answers

2/27/2018

I would argue the best course of action is to extend pagerbot to surface more than just the string literal pong in its /ping endpoint, then use that as its livenessProbe. A close second would be to teach the thing to just reconnect, as that's almost certainly cheaper than tearing down the Pod.

Having said that, one approach you may consider is a sidecar container that uses the Pod's service account credentials to monitor its sibling container (akin to an if kubectl logs -f -c pagerbot $my_pod_name | grep "Closed connection to chat"; then kill -9 $pagerbot_pid; fi type deal). That is a little awkward, but I can't immediately think of why it wouldn't work.
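The sidecar idea could look roughly like the sketch below. It is not taken from PagerBot or from any existing tool: the function names and POD_NAME are hypothetical, and it assumes the sidecar image ships kubectl and that the Pod's service account is allowed to read pod logs and delete pods. It also deletes the Pod rather than sending kill -9, because containers in a Pod do not see each other's PIDs unless the Pod shares a process namespace.

```shell
#!/usr/bin/env bash
# Sketch of a sidecar-style log watcher. Function names and POD_NAME are
# hypothetical; nothing here is part of PagerBot.

# The disconnect marker, copied from the log line quoted in the question.
MARKER='Closed connection to chat'

# Reads log lines on stdin; returns 0 as soon as the marker is seen,
# non-zero if the stream ends without it.
saw_disconnect() {
  grep -q -F -- "$MARKER"
}

# Follows the pagerbot container's log and deletes the Pod on the first match,
# so the Deployment recreates it. Assumes kubectl is available in the sidecar
# and that RBAC permits reading logs and deleting pods.
watch_and_restart() {
  local pod="$1"
  if kubectl logs -f -c pagerbot "$pod" | saw_disconnect; then
    kubectl delete pod "$pod"
  fi
}

# Assumed usage inside the sidecar: watch_and_restart "$POD_NAME"
```

Keeping the match in its own function means it can be tried against a saved log file before anything is wired up to kubectl.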

-- mdaniel
Source: StackOverflow

3/7/2018

I ended up landing on a "liveness probe" to solve my problem. I've added the following to my PagerBot deployment:

    livenessProbe:
      exec:
        command:
        - bash
        - -c
        - "ss -an | grep -q 'EST.*:443 *$'"
      initialDelaySeconds: 120
      periodSeconds: 60

Basically, this tests whether the pod still has an established connection to port 443, which we noticed goes away when the bot disconnects.
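To see what the probe's grep actually matches, here is a small sketch that runs the same pattern, EST.*:443 *$, against two made-up rows in the shape of ss -an output. The addresses and the has_established_443 helper are illustrative assumptions, not real output from the pod.

```shell
#!/usr/bin/env bash
# The pattern as used by the liveness probe: an ESTAB row whose last column
# (the peer address) ends in :443, optionally followed by trailing spaces.
PATTERN='EST.*:443 *$'

# Reads `ss -an`-style output on stdin; returns 0 if a matching row is present.
has_established_443() {
  grep -q -- "$PATTERN"
}

# Made-up sample rows (addresses are illustrative only).
connected='tcp   ESTAB   0   0   10.0.0.5:41234   203.0.113.7:443   '
listening='tcp   LISTEN  0   128   0.0.0.0:443   0.0.0.0:*'

printf '%s\n' "$connected" | has_established_443 && echo "connected row: probe passes"
printf '%s\n' "$listening" | has_established_443 || echo "listening row: probe fails"
```

Because the pattern is anchored to the end of the line, it only matches when the remote side is on port 443, so a local listener on 443 would not keep the probe passing.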

-- SinFulNard
Source: StackOverflow