Asp.Net Core microservices on Docker/Kubernetes are disagreeing on the duration of inter-service calls between the caller and callee.
The caller logs can show anywhere from a few milliseconds up to 10 full seconds more than the callee. The problem worsens under heavy load, but is still present under light load. Many calls do agree between the caller and callee, but this discrepancy does happen frequent enough to make a real dent in performance overall.
The timestamps indicate that the time gap can either be before or after the callee has reported that its response is complete.
Example logs (numbers from a real time discrepancy)
ServiceB: [2018-10-11T22:41:41.374Z] S2S request complete to ServiceA, Duration: 11644
ServiceA: [2018-10-11T22:41:29.732Z] Request complete, Duration: 5
Caller Timing (common class for all S2S calls)
var timer = Stopwatch.StartNew();
var response = await _httpClientFactory.CreateClient().SendAsync(request);
timer.Stop();
Logger.Info(quot;S2S request complete to {service}, Duration: {timer.EllapsedMilliseconds}");
Callee Timing (custom Asp.Net middleware)
var timer = Stopwatch.StartNew();
await _next(context);
timer.Stop();
Logger.Info($"Request complete, Duration: {timer.EllapsedMilliseconds}");
This middleware is registered as almost the first in the pipeline (second to only the ActivityId / TraceId middleware for log correlation).
Troubleshooting Steps
System Topology
Asp.Net Core 2.1 self-hosted Kestrel
.Net Core 2.1.5 runtime
Docker / Kubernetes 1.10.5
K8s Addons: kube-proxy, weave, etcd, SkyDNS
AWS c5.4xlarge
Updates
In this case, this issue was fixed by removing the k8s spec CPU limit.
Monitoring the container_cpu_cfs_throttled_seconds_total
metric found that one of the service containers was getting paused very frequently. These pauses were mostly on the caller side of the S2S calls. Which increased the elapsed time reported by the caller.
Removing the CPU limit in the k8s spec prevents k8s from passing the --cpu-quota
and --cpu-period
docker parameters. Which is what controls the container pauses.