Timing discrepancy for http calls between microservices while under load

10/13/2018

Asp.Net Core microservices running on Docker/Kubernetes disagree about the duration of inter-service HTTP calls: the caller and the callee report different elapsed times for the same request.

The caller's logs can show anywhere from a few milliseconds up to 10 full seconds more than the callee's. The problem worsens under heavy load but is still present under light load. Many calls do agree between caller and callee, but the discrepancy happens frequently enough to make a real dent in overall performance.

The timestamps indicate that the unaccounted time can fall either before or after the point at which the callee reports its response complete.

Example logs (numbers from a real time discrepancy)

ServiceB: [2018-10-11T22:41:41.374Z] S2S request complete to ServiceA, Duration: 11644
ServiceA: [2018-10-11T22:41:29.732Z] Request complete, Duration: 5

Caller Timing (common class for all S2S calls)

var timer = Stopwatch.StartNew();
var response = await _httpClientFactory.CreateClient().SendAsync(request);
timer.Stop();
// Elapsed time as seen by the caller: connection setup, network, and server processing.
Logger.Info($"S2S request complete to {service}, Duration: {timer.ElapsedMilliseconds}");

Callee Timing (custom Asp.Net middleware)

var timer = Stopwatch.StartNew();
await _next(context);
timer.Stop();
// Elapsed time as seen by the callee: from middleware entry until the rest of the pipeline completes.
Logger.Info($"Request complete, Duration: {timer.ElapsedMilliseconds}");

This middleware is registered almost first in the pipeline (second only to the ActivityId / TraceId middleware used for log correlation).
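
For context, a minimal sketch of what that middleware class and its registration might look like. RequestTimingMiddleware and CorrelationIdMiddleware are illustrative names, not taken from the original code, and the sketch uses the framework's ILogger rather than the Logger facade in the snippets above:

using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Logging;

// Hypothetical class wrapping the callee timing snippet above.
public class RequestTimingMiddleware
{
    private readonly RequestDelegate _next;
    private readonly ILogger<RequestTimingMiddleware> _logger;

    public RequestTimingMiddleware(RequestDelegate next, ILogger<RequestTimingMiddleware> logger)
    {
        _next = next;
        _logger = logger;
    }

    public async Task Invoke(HttpContext context)
    {
        var timer = Stopwatch.StartNew();
        await _next(context);
        timer.Stop();
        _logger.LogInformation("Request complete, Duration: {Duration}", timer.ElapsedMilliseconds);
    }
}

// Registered near the top of the pipeline in Startup.Configure;
// CorrelationIdMiddleware stands in for the ActivityId / TraceId middleware.
public class Startup
{
    public void Configure(IApplicationBuilder app)
    {
        app.UseMiddleware<CorrelationIdMiddleware>();
        app.UseMiddleware<RequestTimingMiddleware>();
        app.UseMvc();
    }
}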

Troubleshooting Steps

  • Unable to reproduce the issue on a Windows development machine
  • Monitored CPU, Memory, Thread Count, GC Collects, Open Handles (all at reasonable levels)
  • Adjusted the k8s spec CPU and Memory request / limit (various levels had some effect, but none alleviated the problem)
  • Turned on Server GC with Environment Variable: COMPlus_gcServer=1
  • Issue occurs on services that are within resource limits and have not needed to autoscale
  • Changed to the new Kestrel socket transport (instead of libuv)
  • Changed to the new .Net Core 2.1 SocketsHttpHandler (see the sketch after this list)
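
For reference, a minimal sketch of how the last two switches can be made explicit in a .Net Core 2.1 Program.cs. In 2.1 the managed socket transport and SocketsHttpHandler are already the defaults, so this only pins the choice; Startup is the usual startup class:

using System;
using Microsoft.AspNetCore;
using Microsoft.AspNetCore.Hosting;

public class Program
{
    public static void Main(string[] args)
    {
        // SocketsHttpHandler is the default HttpClient handler in .Net Core 2.1;
        // this switch just makes the choice explicit (false reverts to the old handler).
        AppContext.SetSwitch("System.Net.Http.UseSocketsHttpHandler", true);

        WebHost.CreateDefaultBuilder(args)  // uses the managed socket transport by default in 2.1
            //.UseLibuv()                   // the older libuv transport, kept here for comparison
            .UseStartup<Startup>()
            .Build()
            .Run();
    }
}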

System Topology

Asp.Net Core 2.1 self-hosted Kestrel
.Net Core 2.1.5 runtime
Docker / Kubernetes 1.10.5
K8s Addons: kube-proxy, weave, etcd, SkyDNS
AWS c5.4xlarge

Updates

  1. Found that the time gap can sometimes fall before or after the callee starts/completes
-- TylerOhlsen
.net-core
asp.net-core
docker
kubernetes

1 Answer

10/18/2018

In this case, the issue was fixed by removing the CPU limit from the k8s spec.

Monitoring the container_cpu_cfs_throttled_seconds_total metric showed that one of the service containers was being throttled (paused) very frequently. These pauses fell mostly on the caller side of the S2S calls, which inflated the elapsed time reported by the caller.

Removing the CPU limit from the k8s spec prevents Kubernetes from passing the --cpu-quota and --cpu-period Docker parameters, which are what drive the container pauses.
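
A minimal sketch of what that change looks like in the container's resources block; the request and memory values are placeholders, not the actual configuration:

# Hypothetical container resources block. Keeping the CPU request preserves scheduling
# behavior, while omitting limits.cpu means the kubelet no longer passes
# --cpu-quota / --cpu-period to Docker, so CFS throttling stops pausing the container.
resources:
  requests:
    cpu: "1"
    memory: 2Gi
  limits:
    memory: 2Gi
    # cpu: "2"   # removing this line is what removes the CFS quota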

-- TylerOhlsen
Source: StackOverflow