Why shouldn't you run Kubernetes pods for longer than an hour from Composer?

10/24/2018

The Cloud Composer documentation explicitly states that:

Due to an issue with the Kubernetes Python client library, your Kubernetes pods should be designed to take no more than an hour to run.

However, it doesn't provide any more context than that, and I can't find a clearly relevant issue in the Kubernetes Python client's issue tracker.

To test it, I ran a pod for two hours and saw no problems. What issue creates this restriction, and how does it manifest?
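
For context, a long-running pod launched from Composer looks roughly like the following minimal sketch (Airflow 1.x contrib KubernetesPodOperator; the DAG id, image, and sleep command are illustrative placeholders, not the exact test setup):

    # Minimal sketch of a Composer DAG that keeps a pod running for two hours.
    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

    with DAG(
        dag_id="long_pod_test",
        start_date=datetime(2018, 10, 1),
        schedule_interval=None,
    ) as dag:
        KubernetesPodOperator(
            task_id="sleep_two_hours",
            name="sleep-two-hours",
            namespace="default",
            image="ubuntu:18.04",
            cmds=["bash", "-c", "sleep 7200"],  # two hours, matching the test above
        )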

-- Mike Precup
google-cloud-composer
kubernetes
kubernetes-python-client

3 Answers

10/24/2018

I'm not deeply familiar with either Cloud Composer or the Kubernetes Python client library, but sorting the client's GitHub issue tracker by most comments shows this open issue near the top of the list: https://github.com/kubernetes-client/python/issues/492

It sounds like there is a token expiration issue:

@yliaog this is an issue for us, as we are running kubernetes pods as batch processes and tracking the state of the pods with a static client. Once the client object is initialized, it does no refresh, and therefore any job that takes longer than 60 minutes will fail. Looking through python-base, it seems like we could make a wrapper class that generates a new client (or refreshes the config) every n minutes, or checks status prior to every call (as @mvle suggested). The best fix would be in swagger-codegen, but a temporary solution would probably be very useful for a lot of people.

- @flylo, https://github.com/kubernetes-client/python/issues/492#issuecomment-376581140
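
Based on that comment, a wrapper along these lines would work around the expiry by rebuilding the client before the token goes stale (a rough sketch; the 45-minute TTL and the pod-polling loop are my own illustrative choices, not something from the issue):

    import time

    from kubernetes import client, config

    TOKEN_TTL_SECONDS = 45 * 60  # refresh well before the ~60-minute token expiry

    class RefreshingCoreV1Api(object):
        """Rebuilds CoreV1Api once the loaded credentials are likely stale."""

        def __init__(self):
            self._api = None
            self._created_at = 0.0

        def _get_api(self):
            if self._api is None or time.time() - self._created_at > TOKEN_TTL_SECONDS:
                # Re-loading the kube config re-runs the configured auth
                # provider, which picks up a fresh access token.
                config.load_kube_config()
                self._api = client.CoreV1Api()
                self._created_at = time.time()
            return self._api

        def read_namespaced_pod(self, name, namespace):
            return self._get_api().read_namespaced_pod(name, namespace)

    # Track a long-running batch pod without hitting the one-hour expiry.
    api = RefreshingCoreV1Api()
    while True:
        pod = api.read_namespaced_pod("my-batch-pod", "default")  # placeholder pod name
        if pod.status.phase in ("Succeeded", "Failed"):
            break
        time.sleep(60)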

-- Mike Hill
Source: StackOverflow

11/2/2018

https://issues.apache.org/jira/browse/AIRFLOW-3253 is the reason (and hopefully, my fix will be merged soon). As the others suggested, this affects anyone using the Kubernetes Python client with GCP auth. If you are authenticating with a Kubernetes service account, you should see no problem.

If you are authenticating via a GCP service account with gcloud (e.g. using the GKEPodOperator), you will generally see this problem with jobs that take longer than an hour because the auth token expires after an hour.
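
For comparison, the service-account path looks like this when the tracking code runs inside the cluster (a minimal sketch; the pod name and namespace are placeholders). The in-cluster token is read from the pod's filesystem rather than being a gcloud-issued access token, so it is not subject to the one-hour expiry described above:

    from kubernetes import client, config

    # Authenticate with the pod's Kubernetes service account token mounted on
    # the filesystem instead of a gcloud access token.
    config.load_incluster_config()

    v1 = client.CoreV1Api()
    pod = v1.read_namespaced_pod("my-batch-pod", "default")  # placeholder pod name
    print(pod.status.phase)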

-- Trevor Edwards
Source: StackOverflow

10/24/2018

The Snakemake project has run into the same problem, and its issue tracker offers some more insight.

Currently, long-running jobs on GKE always eventually fail with a 404 error (https://bitbucket.org/snakemake/snakemake/issues/932/long-running-jobs-on-kubernetes-fail). We believe the problem is in the Kubernetes client: we determined that although _refresh_gcp_token is being called when the token is expired, the next API call still fails with a 404 error.

Snakemake uses the Kubernetes Python client under the hood, which is why it hits the same issue.
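
A rough illustration of where the stale token can end up, which would explain why the refresh does not help (my reading of the client internals at the time, not the Snakemake code; the printed token prefix is illustrative):

    from kubernetes import client, config

    # The loader copies the bearer token into this Configuration object once,
    # at load time; a later refresh inside the loader does not update it.
    configuration = client.Configuration()
    config.load_kube_config(client_configuration=configuration)

    print(configuration.api_key.get("authorization"))  # e.g. "Bearer ya29..." (expires in ~1h)

    api = client.CoreV1Api(client.ApiClient(configuration))
    # Calls made through `api` more than an hour later still send the token
    # captured above, which matches the failures described in the issue.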

-- Rico
Source: StackOverflow