I'm looking for some guidance on how to debug what appears to be a networking issue within a bare-metal k8s cluster.
The cluster hosts multiple applications apart from the one discussed below. I'm not the owner of everything running in the cluster, but I do know Istio was introduced somewhat recently (the cluster previously used nginx-ingress).
I have a REST API application deployed on the cluster with a particular route that returns about 7 MB of historical data as a JSON structure. The API application caches the data using a Python module to avoid any overhead from the DB (helm-mysql) queries or data processing. A web application (also deployed in the cluster) fetches the data from the API for display, but by using curl directly I've narrowed the problem down to either the API app or the networking. Narrowing further, I can consistently retrieve the data when running curl inside the container itself (docker exec into the container and curl against localhost).
Example of running curl within the container:
# time curl -o /dev/null localhost/$ROUTE?numDays=30
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6380k  100 6380k    0     0  2716k      0  0:00:02  0:00:02 --:--:-- 2716k
real 0m2.359s
user 0m0.007s
sys 0m0.012s
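For completeness, the same in-container check can also be run through kubectl rather than docker on the node; a rough equivalent (the app=myapp label and api container name are placeholders for my deployment):
# Placeholder label and container name; adjust to the actual deployment.
POD=$(kubectl get pods -l app=myapp -o jsonpath='{.items[0].metadata.name}')
# Same localhost request as above, executed through the Kubernetes API instead of docker exec.
kubectl exec "$POD" -c api -- curl -s -o /dev/null \
  -w 'total: %{time_total}s  size: %{size_download}\n' "localhost/$ROUTE?numDays=30"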
The problem appears when accessing the API application through the ingress or the service ClusterIP: the request intermittently takes more than 2 minutes to complete. This happens more than 50% of the time.
$ time curl -o /dev/null $CLUSTER_IP/$ROUTE?numDays=30
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6380k  100 6380k    0     0  51523      0  0:02:06  0:02:06 --:--:--  331k
real 2m6.810s
user 0m0.011s
sys 0m0.038s
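To see where the time goes on the slow runs, curl's write-out variables split the request into phases; a minimal sketch against the same URL (all of the -w fields are standard curl variables):
curl -s -o /dev/null "$CLUSTER_IP/$ROUTE?numDays=30" \
  -w 'connect: %{time_connect}s  starttransfer: %{time_starttransfer}s  total: %{time_total}s  size: %{size_download}\n'
If connect and starttransfer come back fast while total is large, the stall is happening mid-transfer rather than during connection setup or in the application handler.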
In the runs that take 2 minutes, a partial amount of the data arrives quickly; then, after roughly 2 minutes, the remaining bytes come through and the transfer reaches 100%. (The output below shows 81% of the data received after only a few seconds have elapsed.)
$ time curl -o /dev/null $CLUSTER_IP/$ROUTE?numDays=30
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 81 6380k   81 5178k    0     0   988k      0  0:00:06  0:00:05  0:00:01 1222k
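Since the transfer starts quickly and then hangs mid-stream, a packet capture on the node hosting the pod seems like a reasonable way to distinguish a network-level stall (retransmissions, zero-window frames, MTU/MSS trouble) from the application simply pausing. A rough sketch, assuming tcpdump is available on the node and the API is served over plain HTTP on port 80:
# Capture traffic to/from the API pod on the node that hosts it.
sudo tcpdump -ni any host "$POD_IP" and port 80 -w /tmp/slow-transfer.pcap
The capture can then be inspected (e.g. in Wireshark) for retransmissions or zero-window frames during the ~2 minute gap.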
I'm looking for any help or suggestions on how to proceed with debugging the issue.
Kubernetes version:
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", GitCommit:"5e53fd6bc17c0dec8434817e69b04a25d8ae0ff0", GitTreeState:"clean", BuildDate:"2019-06-06T01:44:30Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", GitCommit:"5e53fd6bc17c0dec8434817e69b04a25d8ae0ff0", GitTreeState:"clean", BuildDate:"2019-06-06T01:36:19Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Additional details in response to comments:
A workaround was already in place that reduced the amount of data for 30 days from ~7 MB to ~3 MB. Additional changes later reduced it further by changing the default from 30 days to 14 days (bringing the payload down to ~1.5 MB). This worked for a while until, strangely, one of the other independent applications in the cluster was upgraded; let's call it APP2. I do not own APP2; I only know it was upgraded from its release notes and from the AGE column in the kubectl get pods output.
Please correct me if I’m wrong:
Running from a node within the cluster (the pod is running on a separate node), the CLUSTER_IP and POD_IP both respond consistently, but the INGRESS hits the issue. Running from a VM external to the cluster, the CLUSTER_IP and INGRESS both hit the issue (the POD_IP is not reachable externally).
Furthermore, while a curl to the CLUSTER_IP from an external host is stuck, I can still successfully run internal curls against both the POD_IP and the CLUSTER_IP.
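One way to quantify the failure rate per target is to loop the same request from each vantage point and record only the total time; a rough sketch (INGRESS_HOST is a placeholder for the ingress hostname):
# INGRESS_HOST is a placeholder; only test the targets reachable from this host.
for target in "$POD_IP" "$CLUSTER_IP" "$INGRESS_HOST"; do
  for i in $(seq 1 10); do
    t=$(curl -s -o /dev/null -m 300 -w '%{time_total}' "$target/$ROUTE?numDays=30")
    echo "$target run $i: ${t}s"
  done
done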
The kubectl get pods output is pretty long; in summary: all pods related to MYAPP are Running, a few istio-telemetry pods are Evicted, and one APP2 pod appears to be stuck in Terminating with over 2k restarts.
There is one endpoint registered to the service (a single IP address under subsets->addresses->ip) and one pod serving the data.
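For reference, that observation comes from the Service and Endpoints objects; roughly (myapp-api is a placeholder for the actual service name):
# myapp-api is a placeholder for the actual Service name.
kubectl get svc myapp-api -o wide
kubectl get endpoints myapp-api -o yaml   # subsets -> addresses -> ip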
Thanks all for the feedback.
We upgraded Istio to 1.3.3 and noticed the sidecar was missing from the API pod, so we re-added it. The intermittency appears to be resolved.
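For anyone hitting something similar, this is roughly how the sidecar's presence can be verified (pod and namespace names below are placeholders):
# An injected pod should list an istio-proxy container alongside the application container.
kubectl get pod "$POD" -o jsonpath='{.spec.containers[*].name}'; echo
# With automatic injection, the namespace should carry the istio-injection=enabled label.
kubectl get namespace myapp-ns --show-labels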