Kubernetes Intermittent Delay in response data from REST application

10/4/2019

I'm looking for some guidance on how to debug what appears to be a networking issue within a bare-metal k8s cluster.

The cluster hosts multiple applications apart from the one discussed below. I'm not the owner of everything running in the cluster, but I do know istio was introduced somewhat recently (previously using nginx-ingress).

I have a REST API application deployed on the cluster with a particular route that returns about 7MB worth of historical data (as a json structure). The API application caches the data using a python module to avoid any overhead from the DB (helm-mysql) queries or data processing. A web application (also deployed in the cluster) fetches the data from the API for display, but by using curl directly I've narrowed the problem to either the API app or networking. Further narrowing the problem scope, I am able to consistently get the data successfully when running curl in the container (docker exec bash and executing curl on localhost).

Example of running curl within the container:

# time curl -o /dev/null localhost/$ROUTE?numDays=30
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6380k  100 6380k    0     0  2716k      0  0:00:02  0:00:02 --:--:-- 2716k

real    0m2.359s
user    0m0.007s
sys     0m0.012s

The problem comes when accessing the API application through the ingress or the ClusterIP: the request intermittently takes more than 2 minutes to complete. This failure occurs more than 50% of the time.

$ time curl -o /dev/null $CLUSTER_IP/$ROUTE?numDays=30
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6380k  100 6380k    0     0  51523      0  0:02:06  0:02:06 --:--:--  331k

real    2m6.810s
user    0m0.011s
sys     0m0.038s

In the cases where the response takes 2 minutes, a partial amount of the data arrives quickly, the transfer stalls, and only after about 2 minutes does it reach 100%. (The output below shows 81% of the data received after a few seconds have elapsed.)

$ time curl -o /dev/null $CLUSTER_IP/$ROUTE?numDays=30
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 81 6380k   81 5178k    0     0   988k      0  0:00:06  0:00:05  0:00:01 1222k
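
A hedged next step (my suggestion, not part of the original post): curl's -w timing variables can show whether the delay happens during connection setup or mid-transfer, using the same $CLUSTER_IP and $ROUTE placeholders as above:

$ curl -o /dev/null -s -w 'connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s size=%{size_download}\n' $CLUSTER_IP/$ROUTE?numDays=30

If connect and ttfb stay small while total jumps to roughly 2 minutes, the stall is happening mid-stream rather than during connection setup or while the application builds the response, which would point toward the network path (or a proxy in it) rather than the API itself.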

I'm looking for any help or suggestions on how to proceed with debugging the issue.

Cluster information:

Kubernetes version:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", GitCommit:"5e53fd6bc17c0dec8434817e69b04a25d8ae0ff0", GitTreeState:"clean", BuildDate:"2019-06-06T01:44:30Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", GitCommit:"5e53fd6bc17c0dec8434817e69b04a25d8ae0ff0", GitTreeState:"clean", BuildDate:"2019-06-06T01:36:19Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud being used: bare-metal
  • Installation method: unknown (a metallb-system namespace is present, which may be relevant)
  • Host OS: RHEL7.5
  • CNI and version: calico/cni:v3.8.0
  • CRI and version: Docker version 18.06.1-ce, build e68fc7a

Additional detail to answer comments:

A work-around was put in place to reduce the amount of data for 30 days from ~7MB to ~3MB. Additional changes reduced the data further by changing the default from 30 days to 14 days (bringing it down to ~1.5MB). This worked for a while until, strangely, one of the other independent applications in the cluster was upgraded; let's call it APP2. I do not own APP2, but I know from its release notes that it was upgraded and can see the AGE in the get pods output.

Please correct me if I'm wrong (lookup commands are sketched after this list):

  • POD_IP is a 192.x address (found from the pod describe output)
  • CLUSTER_IP is a 10.x address (found via the Service)
  • INGRESS is a 10.x IP that Istio binds to
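
For reference, a minimal sketch of how each address can be looked up; MYAPP_POD, MYAPP_SVC and the istio-ingressgateway service name are placeholders/assumptions on my part, not names taken from this cluster:

$ kubectl get pod $MYAPP_POD -o wide                    # POD_IP (pod network, 192.x)
$ kubectl get svc $MYAPP_SVC                            # CLUSTER_IP (virtual IP, 10.x)
$ kubectl get svc -n istio-system istio-ingressgateway  # INGRESS (the IP Istio binds to)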

Running curl from a node within the cluster (the pod runs on a separate node), the CLUSTER_IP and POD_IP both respond consistently, but the INGRESS hits the issue. Running from a VM external to the cluster, the CLUSTER_IP and INGRESS both hit the issue (the POD_IP is not reachable externally).

Furthermore, while a curl to the CLUSTER_IP from an external host is stuck, I can still successfully run internal curls against both the POD_IP and the CLUSTER_IP.
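
One way to narrow that further (my own suggestion, not something done in the original post) is to capture traffic on the node hosting the API pod while reproducing a stalled external curl, using $POD_IP as above:

$ sudo tcpdump -ni any host $POD_IP and tcp port 80 -w /tmp/stall.pcap

Comparing a stalled capture against a fast one should indicate whether packets stop flowing during the pause (suggesting a network or conntrack problem) or keep flowing while the client sees nothing (suggesting a proxy buffering the response).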

The get pods output is pretty long. In summary: all pods related to MYAPP are Running, a few istio-telemetry pods are Evicted, and one APP2 pod appears to be stuck in Terminating with over 2k restarts.
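
For completeness, a generic way to list only the unhealthy pods described above (not the exact command that was used):

$ kubectl get pods --all-namespaces | grep -Ev 'Running|Completed'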

There is one endpoint registered to the Service (a single IP address under subsets->addresses->ip), and there is one pod serving the data.
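
A quick sketch of how that can be verified, with MYAPP_SVC standing in for the actual Service name:

$ kubectl get endpoints $MYAPP_SVC -o yaml    # expect a single IP under subsets -> addresses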

-- Rob
kubernetes

1 Answer

10/28/2019

Thanks all for the feedback.

We upgraded Istio to 1.3.3, noticed the sidecar was missing, and re-added it. The intermittency appears to be resolved.
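
For anyone hitting something similar, a rough sketch of how to confirm the sidecar is present and re-enable injection. The pod and namespace names are placeholders, and this assumes namespace-level automatic injection (rather than manual istioctl kube-inject) and that the pod is managed by a Deployment so it comes back after deletion:

$ kubectl get pod $MYAPP_POD -o jsonpath='{.spec.containers[*].name}'   # istio-proxy should be listed
$ kubectl get namespace -L istio-injection                              # check the injection label
$ kubectl label namespace $MYAPP_NS istio-injection=enabled --overwrite
$ kubectl delete pod $MYAPP_POD   # recreated by its Deployment with the sidecar injected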

-- Rob
Source: StackOverflow