When a container encounters a device error, what is the best way for it to tell Kubernetes?

7/22/2019

Given a running container that has been assigned one or more SRIOV devices by the scheduler on the cluster master at launch, how should the application in the container report an error to Kubernetes if it encounters, say, a device timeout?

This is almost like an HA event sort of thing... So maybe there's a best way to do this from an application perspective?

-- user3109016
kubernetes

2 Answers

7/23/2019

Kubernetes liveness and readiness probes can be used to do this:

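    livenessProbe:
      exec:
        command:
        # exec probes run a command; for an HTTP check, use httpGet instead of exec
        - <command to check SRIOV device timeout>
      initialDelaySeconds: 5
      periodSeconds: 5

    readinessProbe:
      exec:
        command:
        # exec probes run a command; for an HTTP check, use httpGet instead of exec
        - <command to check SRIOV device timeout>
      initialDelaySeconds: 5
      periodSeconds: 5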
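
As a rough idea of what the probe command could be, here is a minimal Python sketch. It assumes (this is not from the question) that the application touches a heartbeat file after every successful device operation; the file path, the `DEVICE_HEARTBEAT_FILE` variable, and the 15-second threshold are all made up for illustration.

```python
import os
import time

# Hypothetical path and threshold; the application is assumed to touch this
# file after every successful device operation.
HEARTBEAT_FILE = os.environ.get("DEVICE_HEARTBEAT_FILE", "/tmp/sriov-heartbeat")
MAX_AGE_SECONDS = 15.0


def heartbeat_is_fresh(path, max_age, now=None):
    """Return True if `path` exists and was modified within `max_age` seconds."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        return False  # a missing or unreadable file counts as unhealthy
    return age <= max_age


def main():
    """Return the exit code for an exec probe: 0 is success, non-zero is failure."""
    return 0 if heartbeat_is_fresh(HEARTBEAT_FILE, MAX_AGE_SECONDS) else 1

# As a probe script, finish with:  import sys; sys.exit(main())
```

Saved under some path in the image (say `/opt/probe.py`, again hypothetical), the probe's command list would then be `python3` followed by that path. With `periodSeconds: 5`, a 15-second threshold tolerates a couple of missed heartbeats before the probe fails.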

-- Vikram Hosakote
Source: StackOverflow

7/23/2019

The question is a bit ambiguous as it is not clear what "report to Kubernetes" implies exactly.

If your main concern is to surface the error information inside Kubernetes, you could generate a custom Kubernetes event, an approach implemented, e.g., by Xing in their oom-event-generator. A custom operator watching for these events could then trigger custom logic.
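
To make the custom-event idea a bit more concrete, here is a minimal Python sketch that only builds a core/v1 `Event` manifest for the affected Pod. It is not how the oom-event-generator works; the reason `DeviceTimeout`, the component name `sriov-app`, and the function name are assumptions chosen for illustration.

```python
import json
from datetime import datetime, timezone


def build_device_error_event(pod_name, namespace, message, now=None):
    """Build a core/v1 Event manifest (as a dict) attached to the given Pod."""
    if now is None:
        now = datetime.now(timezone.utc)
    timestamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")  # RFC 3339, UTC
    return {
        "apiVersion": "v1",
        "kind": "Event",
        "metadata": {
            # generateName lets the API server pick a unique event name.
            "generateName": pod_name + "-device-error-",
            "namespace": namespace,
        },
        "involvedObject": {
            "apiVersion": "v1",
            "kind": "Pod",
            "name": pod_name,
            "namespace": namespace,
        },
        "reason": "DeviceTimeout",            # hypothetical reason string
        "message": message,
        "type": "Warning",                    # event types are Normal or Warning
        "source": {"component": "sriov-app"}, # hypothetical component name
        "firstTimestamp": timestamp,
        "lastTimestamp": timestamp,
        "count": 1,
    }
```

Printing `json.dumps(build_device_error_event("my-pod", "default", "VF timed out"))` and piping the result to `kubectl create -f -` is one way to submit it; the Pod's service account needs permission to create events for this to work from inside the container.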

If you want native Kubernetes to act on this information, liveness and readiness probes are what you are looking for. A failed liveness probe tells Kubernetes to restart the container according to the Pod's restart policy, while a failed readiness probe tells Kubernetes not to route any traffic from Services (load balancers) to the Pod.
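
If the application would rather answer HTTP probes itself, the two behaviours can be mapped onto two endpoints. The following is only a sketch using Python's standard library; the paths `/healthz` and `/readyz`, the port, and the `DeviceState` class are assumptions, and the application is expected to flip the flags when it sees a device error.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class DeviceState:
    """Thread-safe flags the application updates as it uses the device."""

    def __init__(self):
        self._lock = threading.Lock()
        self.ready = True  # False: temporary trouble, stop sending traffic
        self.alive = True  # False: unrecoverable, ask for a restart

    def set(self, ready=None, alive=None):
        with self._lock:
            if ready is not None:
                self.ready = ready
            if alive is not None:
                self.alive = alive

    def snapshot(self):
        with self._lock:
            return self.ready, self.alive


def make_handler(state):
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            ready, alive = state.snapshot()
            if self.path == "/healthz":    # target for the liveness probe
                ok = alive
            elif self.path == "/readyz":   # target for the readiness probe
                ok = ready and alive
            else:
                self.send_error(404)
                return
            body = b"ok\n" if ok else b"device error\n"
            # HTTP probes treat 2xx/3xx as success; 503 counts as a failure.
            self.send_response(200 if ok else 503)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep probe requests out of the application's log

    return HealthHandler


def start_health_server(state, port=8080):
    """Serve the health endpoints from a background thread."""
    server = HTTPServer(("0.0.0.0", port), make_handler(state))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

On a device timeout that may clear by itself, the application would call `state.set(ready=False)` so that only the readiness probe fails; if the device is wedged and a restart is the only remedy, `state.set(alive=False)` makes the liveness probe fail as well.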

-- B M
Source: StackOverflow