Recovering/retrying in case of failed or stuck HTTP requests

3/11/2019

I have a Java-based server managed by a Kubernetes cluster. It's a distributed environment where the number of instances is set to 4 to handle millions of requests per minute.

The issue I am facing is that Kubernetes tries to balance the cluster and, in the process, kills the pod and moves it to another node, but pending HTTP GET and POST requests get lost.

Is there a Kubernetes feature or an architectural solution that would let me retry a request that is stuck or has failed?

UPDATE:

I have two configurations for the Kubernetes service:

  1. LoadBalancer (with AWS ELB): external facing
  2. ClusterIP: for the internal microservice-based architecture
-- Vishrant
grizzly
jakarta-ee
java
kubernetes
server

2 Answers

3/11/2019

Kubernetes gives you the means to handle pod terminations gracefully via SIGTERM and preStop hooks. There are several articles on this, e.g. "Graceful shutdown of pods with Kubernetes". In your Java app, you should listen for SIGTERM and shut the server down gracefully (most HTTP frameworks have this "shutdown" functionality built in).
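
As a rough illustration of the SIGTERM handling described above, here is a minimal sketch that uses the JDK's built-in com.sun.net.httpserver.HttpServer rather than the framework from the question. The class name GracefulServer, the port, the pool size and the 20-second drain window are all assumptions made for the example. The JVM runs shutdown hooks when it receives SIGTERM, so the hook is where the server stops accepting new connections and waits for in-flight requests.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class GracefulServer {
    // Assumed drain window; keep it below the pod's terminationGracePeriodSeconds
    // (30 seconds by default), after which Kubernetes sends SIGKILL.
    static final int DRAIN_SECONDS = 20;

    private final HttpServer server;
    private final ExecutorService workers = Executors.newFixedThreadPool(8);

    GracefulServer(int port) throws IOException {
        server = HttpServer.create(new InetSocketAddress(port), 0);
        // Placeholder handler standing in for the real application endpoints.
        server.createContext("/", exchange -> {
            byte[] body = "ok".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.setExecutor(workers);
    }

    void start() {
        server.start();
    }

    int port() {
        return server.getAddress().getPort();
    }

    void shutdown(int drainSeconds) {
        // stop() closes the listening socket, then blocks until the current
        // exchanges have completed or the delay has elapsed, whichever is sooner.
        server.stop(drainSeconds);
        workers.shutdown();
    }

    public static void main(String[] args) throws IOException {
        GracefulServer app = new GracefulServer(8080);
        // The JVM runs shutdown hooks on SIGTERM, which is the signal
        // Kubernetes sends to the container before killing the pod.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> app.shutdown(DRAIN_SECONDS)));
        app.start();
    }
}
```

Other HTTP frameworks expose their own equivalent of stop(); the point is only that the drain has to finish before the grace period runs out.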

The issue that I am facing is kubernetes tries to balance the cluster and in the process kills the pod and take it to another node

Now this sounds a little suspicious: in general, K8s only evicts and reschedules pods onto different nodes under specific circumstances, for example when a node is running out of resources to serve the pod. If your pods are frequently getting rescheduled, this is generally a sign that something else is happening, so you should determine the root cause. If you have resource limits set in your deployment spec, make sure your service container isn't exceeding them; this is a common problem with JVM containers.

Finally, HTTP retries are inherently unsafe for non-idempotent requests (POST and PATCH; PUT is idempotent by specification, but only if the server actually implements it that way), so you can't just retry every failed request without knowing the logical implications. In any case, retries generally happen on the client side, not the server, so there is no flag you can set in K8s to enable them.
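
To make the client-side point concrete, here is a minimal sketch of a retry wrapper around java.net.http.HttpClient (Java 11+). It only retries methods that the HTTP specification (RFC 9110) defines as idempotent, so PUT and DELETE are retried while POST and PATCH get a single attempt. The class name IdempotentRetry, the helper names and the 100 ms backoff base are assumptions for illustration, not part of any library.

```java
import java.io.IOException;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Locale;
import java.util.Set;

final class IdempotentRetry {
    // Methods RFC 9110 defines as idempotent; POST and PATCH are deliberately absent.
    static final Set<String> IDEMPOTENT =
            Set.of("GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE");

    static boolean isRetryable(String method) {
        return IDEMPOTENT.contains(method.toUpperCase(Locale.ROOT));
    }

    // Exponential backoff: base, 2*base, 4*base, ... for attempt = 1, 2, 3, ...
    static long backoffMillis(int attempt, long baseMillis) {
        return baseMillis << (attempt - 1);
    }

    // Sends the request, retrying on I/O failures (connection reset, timeout)
    // only when the method is idempotent. Non-idempotent requests get one attempt.
    static HttpResponse<String> send(HttpClient client, HttpRequest request, int maxAttempts)
            throws IOException, InterruptedException {
        int attempts = isRetryable(request.method()) ? Math.max(1, maxAttempts) : 1;
        IOException last = null;
        for (int attempt = 1; attempt <= attempts; attempt++) {
            try {
                return client.send(request, HttpResponse.BodyHandlers.ofString());
            } catch (IOException e) {
                last = e;
                if (attempt < attempts) {
                    Thread.sleep(backoffMillis(attempt, 100));
                }
            }
        }
        throw last;
    }
}
```

A caller would build an HttpRequest as usual and call IdempotentRetry.send(client, request, 3). Making a POST safe to retry needs cooperation from the server, for example deduplicating on a client-supplied request ID, which is an application-level design decision rather than something this wrapper can provide.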

-- PoweredByOrange
Source: StackOverflow

3/12/2019

A service mesh addresses the particular issue that you are facing.

Several service meshes are available. General features of a service mesh are:

  • Load balancing
  • Fine-grained traffic policies
  • Service discovery
  • Service monitoring
  • Tracing
  • Routing

Service mesh options:

  • Istio
  • Envoy (a proxy, used as the data plane by Istio)
  • Linkerd

Linkerd's retries and timeouts: https://linkerd.io/2/features/retries-and-timeouts/

-- Shashank Sinha
Source: StackOverflow