Kubernetes Python Client Watch stops returning results, doesn't fail

5/5/2020

Using the Python client for Kubernetes, I've written a small service that watches for new Pods and sends the data to an external service for metrics gathering. It works correctly at first, but after a few days the Watch seems to stop receiving new changes. It doesn't report any errors or raise any exceptions; it simply behaves as if there are no more changes. New changes do come through if I start a fresh watch, and processing resumes if I restart the container, but it seems I can't keep one process running continuously.

I'm running on GKE, and I wonder if the Kubernetes API endpoint becomes temporarily unavailable. All I want is for the watch to resume once it's available again. I'd even be happy with the pod crashing and restarting in this case, but the Watch reports nothing at all, so there's no error condition I can catch and handle.

Here are the relevant parts of my code:

def main():
    log = app.logger.get()
    kube_api = get_kubernetes_config()

    resource_version = get_resource_version(kube_api)

    watch_params = {
        'resource_version': resource_version
    }

    log.debug(f'Watching from resource version {resource_version}')
    w = watch.Watch()

    stream = w.stream(kube_api.list_pod_for_all_namespaces, **watch_params)
    log.info('Started watching for new pods')

    for message in stream:
        process_pod_change(message['object'], log)

def process_pod_change(pod, log):
    if (pod.metadata.deletion_timestamp is not None
            or pod.status.container_statuses is None
            or not all(status.ready for status in pod.status.container_statuses)):
        return
    pod_name = f'{pod.metadata.namespace}/{pod.metadata.name}'
    for status in pod.status.container_statuses:
        docker_image_sha = status.image_id.split('@')[-1]
        report_deployment(docker_image_sha, pod_name, status.name, log)
    with open(RESOURCE_VERSION_FILE, 'w') as f:
        f.write(str(pod.metadata.resource_version))

def report_deployment(sha, pod_name, container_name, log):
    log.info(f'Seen new deployment of {pod_name} container {container_name}: {sha}')
    authorised_session = app.auth.get_authorised_session()
    jsonbody = {
        'artefact_type': 'docker',
        'artefact_id': sha,
        'client': os.environ['CLIENT'],
        'environment': os.environ['ENVIRONMENT'],
        'product': os.environ['PRODUCT']
    }
    r = authorised_session.post(os.environ['NOTIFICATION_URL'], json=jsonbody)
    r.raise_for_status()

The resulting logs show a continuous stream of processed messages until at some point they stop coming in. There's no indication of anything odd happening in the logs. I also believe this is related to the Kubernetes Watch and not any downstream processing I'm doing because this is the second application I've written that has exhibited this behaviour of a Watch seemingly falling asleep and doing nothing.

Am I using this correctly? I can't find many examples online, and no one else seems to have this problem, so I don't see any workarounds.
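The closest thing to a workaround I can imagine is bounding each watch and restarting it in a loop, so a stalled stream can only lose a limited window of events. A sketch of that restart wrapper (the `stream_factory` and `handle` names are mine; with the real client, `stream_factory` would return a fresh `w.stream(kube_api.list_pod_for_all_namespaces, timeout_seconds=300, _request_timeout=330)` so the server ends the watch periodically and a dead socket raises instead of hanging forever):

```python
import time

def run_watch(stream_factory, handle, max_restarts=None, backoff=1.0):
    """Re-create the watch stream whenever it ends or dies.

    stream_factory: returns a fresh event iterator (e.g. a new w.stream(...))
    handle: called once per event
    max_restarts: stop after this many streams (None = run forever)
    """
    restarts = 0
    while max_restarts is None or restarts < max_restarts:
        restarts += 1
        try:
            for event in stream_factory():
                handle(event)
        except Exception:
            # e.g. ApiException(410 Gone): the stored resource_version is too
            # old, so the next stream_factory() should re-list before watching.
            time.sleep(backoff)
```

I haven't verified this resolves the stall, since nothing in my logs tells me what actually goes wrong.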

My cluster version is 1.14.10-gke.27, I'm using the Python 3.6-alpine container, and my Python dependencies were all installed within the past couple of weeks. I also saw the same problem over six months ago in an earlier attempt to use Watch.

-- Goldstein
google-kubernetes-engine
kubernetes
python

0 Answers