Traefik: how to investigate hundreds of retried 500 errors

12/11/2018

Traefik v1.7.5, Kubernetes 1.10 (with kubenet networking on AWS)

I'm using Traefik as a Kubernetes ingress controller. It has been working well for my Elixir apps in production, but I'm now migrating a Ruby service that uses the Puma web server. Most requests (~200/s) are handled correctly, but some appear to cause Traefik to retry 120-200 times. The logs just show hundreds of lines like this:

172.58.x.x - - [11/Dec/2018:01:34:48 +0000] "PUT /users/123/game_results/234 HTTP/2.0" 500 21 "-" "okhttp/3.5.0" 610758 "www.example.com/" "http://100.96.13.37:5000" 329108ms

There are no corresponding errors in the Rails logs.

How can I troubleshoot this?
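
One possible first step is to summarize the access log, to see whether the repeated 500s cluster on a single path or backend pod. The following is a minimal sketch in Python. The field layout in the regular expression is an assumption based on the sample line above (client, timestamp, request line, status, size, referer, user agent, request count, frontend, backend URL, duration), and the helper names `parse_line` and `count_errors` are hypothetical.

```python
import re
from collections import Counter

# Assumed field layout, inferred from the sample access log line above.
LINE_RE = re.compile(
    r'(?P<client>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d+) (?P<size>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)" '
    r'(?P<request_count>\d+) "(?P<frontend>[^"]*)" "(?P<backend>[^"]*)" '
    r'(?P<duration_ms>\d+)ms'
)


def parse_line(line):
    """Return a dict of fields for one access log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    entry = m.groupdict()
    for key in ("status", "size", "request_count", "duration_ms"):
        entry[key] = int(entry[key])
    return entry


def count_errors(lines, status=500):
    """Count lines with the given status, grouped by (method, path, backend)."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry["status"] == status:
            counts[(entry["method"], entry["path"], entry["backend"])] += 1
    return counts


sample_line = (
    '172.58.x.x - - [11/Dec/2018:01:34:48 +0000] '
    '"PUT /users/123/game_results/234 HTTP/2.0" 500 21 "-" "okhttp/3.5.0" '
    '610758 "www.example.com/" "http://100.96.13.37:5000" 329108ms'
)
entry = parse_line(sample_line)
print(entry["status"], entry["backend"], entry["duration_ms"])
```

Feeding the full log through `count_errors` (for example, `count_errors(open("access.log"))`, where the file name is a placeholder) would show whether the errors come from one backend address or are spread across pods. The duration field may also be worth comparing against any timeouts configured in front of Traefik or in Puma.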

Traefik config template (using Helm):

defaultEntryPoints = ["http","https"]
debug = false
logLevel = "INFO"

# Do not verify backend certificates (use https backends)
InsecureSkipVerify = true

[entryPoints]
  [entryPoints.traefik]
    address = ":8080"
  [entryPoints.http]
    address = ":80"
    compress = true
    [entryPoints.http.redirect]
      entryPoint = "https"
  [entryPoints.https]
    address = ":443"
    compress = true
    [entryPoints.https.proxyProtocol]
      trustedIPs = ["0.0.0.0/0"]
    [entryPoints.https.tls]
      sniStrict = true
      minVersion = "VersionTLS12"

[accessLog]

[api]

[kubernetes]

[metrics]
  [metrics.prometheus]
    buckets = [0.1, 0.3, 1.2, 5.0]
    entryPoint = "traefik"

[ping]
  entryPoint = "http"

[acme]
  email = "{{ .Values.acme.email }}"
  storage = "{{ .Values.acme.storage }}"
  acmeLogging = true
  entryPoint = "https"
  OnHostRule = true
  caServer = "https://acme-v02.api.letsencrypt.org/directory"
  [acme.dnsChallenge]
    provider = "route53"
    delayBeforeCheck = 5
  {{- range .Values.acme.domains }}
  [[acme.domains]]
    main = "{{ .main }}"
  {{- end }}

[consul]
  endpoint = "traefik-consul.ingress:8500"
  watch = true
  prefix = "traefik"

[retry]
  attempts = 1

Edit:

I ran with debug-level logging for a while to see whether it would provide any more information. Unfortunately, I just got a bunch of these:

msg=vulcand/oxy/forward: completed ServeHttp on request

Request={"Method":"PUT","URL":{"Scheme":"http","Opaque":"","User":null,"Host":"100.96.13.37:5000","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":""},"Proto":"HTTP/2.0","ProtoMajor":2,"ProtoMinor":0,"Header":{"Accept":["application/json version=1"],"Accept-Encoding":["gzip"],"Authorization":["Bearer abc"],"Content-Length":["138"],"Content-Type":["application/json; charset=utf-8"],"User-Agent":["okhttp/3.5.0"]},"ContentLength":138,"TransferEncoding":null,"Host":"www.example.com","Form":null,"PostForm":null,"MultipartForm":null,"Trailer":null,"RemoteAddr":"1.2.3.4:55773","RequestURI":"/users/123/game_results/123","TLS":null}

So that didn't help me understand the problem. I also upgraded to Traefik 1.7.5, but that didn't help either.
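
Although each debug entry says little on its own, counting them per request and backend can at least show whether the repeated forwards all target the same pod. The sketch below assumes the JSON after `Request=` has already been extracted from each log line as plain JSON, as in the paste above. The function names are hypothetical, and the sample payload is trimmed from that paste.

```python
import json
from collections import Counter


def summarize_request(payload):
    """Return (method, request URI, backend host) from one Request= JSON payload."""
    req = json.loads(payload)
    return (req["Method"], req["RequestURI"], req["URL"]["Host"])


def count_forwards(payloads):
    """Count how often each (method, URI, backend host) combination was forwarded."""
    return Counter(summarize_request(p) for p in payloads)


# Trimmed version of the debug payload shown above.
sample_payload = (
    '{"Method":"PUT","URL":{"Scheme":"http","Host":"100.96.13.37:5000"},'
    '"Host":"www.example.com","RemoteAddr":"1.2.3.4:55773",'
    '"RequestURI":"/users/123/game_results/123"}'
)

for key, n in count_forwards([sample_payload, sample_payload]).most_common():
    print(n, *key)
```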

-- Donald Plummer
kubernetes
traefik
