Envoy: "upstream connect error or disconnect/reset before headers. reset reason: connection failure"

11/9/2021

I'm new to Envoy and have been struggling with the error below for a week. My downstream (the server that requests data or updates) receives this response:

Status code: 503

Headers:
...
Server:"envoy"
X-Envoy-Response-Code-Details:"upstream_reset_before_response_started{connection_failure}"
X-Envoy-Response-Flags: "UF,URX"

Body: upstream connect error or disconnect/reset before headers. reset reason: connection failure

On the other side, my upstream sees the connection dropped (context cancelled). The upstream service itself never returns a 503.

All traffic goes over HTTP/1.

My envoy.yaml:

  admin:
    access_log_path: /tmp/admin_access.log
    address:
      socket_address: { address: 0.0.0.0, port_value: 9901 }

  static_resources:
    listeners:
      - name: listener_0
        address:
          socket_address: { address: 0.0.0.0, port_value: 80 }
        filter_chains:
          - filters:
              - name: envoy.filters.network.http_connection_manager
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                  stat_prefix: ingress_http
                  codec_type: http1
                  route_config:
                    name: local_route
                    virtual_hosts:
                      - name: local_service
                        domains: [ "*" ]
                        response_headers_to_add: # added for debugging
                          - header:
                              key: x-envoy-response-code-details
                              value: "%RESPONSE_CODE_DETAILS%"
                          - header:
                              key: x-envoy-response-flags
                              value: "%RESPONSE_FLAGS%"
                        routes:
                          - match: # consistent routing
                              safe_regex:
                                google_re2: { }
                                regex: SOME_STRANGE_REGEX_FOR_CONSISTENT_ROUTING
                            route:
                              cluster: consistent_cluster
                              hash_policy:
                                header:
                                  header_name: ":path"
                                  regex_rewrite:
                                    pattern:
                                      google_re2: { }
                                      regex: SOME_STRANGE_REGEX_FOR_CONSISTENT_ROUTING
                                    substitution: "\\1"
                              retry_policy: # attempt to avoid 503 errors by retries
                                retry_on: "connect-failure,refused-stream,unavailable,cancelled,resource-exhausted,retriable-status-codes"
                                retriable_status_codes: [ 503 ]
                                num_retries: 3
                                retriable_request_headers:
                                  - name: ":method"
                                    exact_match: "GET"

                          - match: { prefix: "/" } # default routing (all routes except consistent)
                            route:
                              cluster: default_cluster
                              retry_policy: # attempt to avoid 503 errors by retries
                                retry_on: "connect-failure,refused-stream,unavailable,cancelled,resource-exhausted,retriable-status-codes"
                                retriable_status_codes: [ 503 ]
                                retry_host_predicate:
                                  - name: envoy.retry_host_predicates.previous_hosts
                                host_selection_retry_max_attempts: 3
                  http_filters:
                    - name: envoy.filters.http.router

    clusters:
      - name: consistent_cluster
        connect_timeout: 0.05s
        type: STRICT_DNS
        dns_refresh_rate: 1s
        dns_lookup_family: V4_ONLY
        lb_policy: MAGLEV
        health_checks:
          - timeout: 1s
            interval: 1s
            unhealthy_threshold: 1
            healthy_threshold: 1
            http_health_check:
              path: "/health"
        load_assignment:
          cluster_name: consistent_cluster
          endpoints:
            - lb_endpoints:
                - endpoint:
                    address:
                      socket_address:
                        address: consistent-host
                        port_value: 80
                        
      - name: default_cluster
        connect_timeout: 0.05s
        type: STRICT_DNS
        dns_refresh_rate: 1s
        dns_lookup_family: V4_ONLY
        lb_policy: ROUND_ROBIN
        health_checks:
          - timeout: 1s
            interval: 1s
            unhealthy_threshold: 1
            healthy_threshold: 1
            http_health_check:
              path: "/health"
        outlier_detection: # attempt to avoid 503 errors by ejecting unhealthy pods
          consecutive_gateway_failure: 1
        load_assignment:
          cluster_name: default_cluster
          endpoints:
            - lb_endpoints:
                - endpoint:
                    address:
                      socket_address:
                        address: default-host
                        port_value: 80

I also tried adding an access log:

access_log:
  - name: accesslog
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
      path: /tmp/http_access.log
      log_format:
        text_format: "[%START_TIME%] \"%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%\" %RESPONSE_CODE% %CONNECTION_TERMINATION_DETAILS% %RESPONSE_FLAGS% %BYTES_RECEIVED% %BYTES_SENT% %DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% \"%REQ(X-FORWARDED-FOR)%\" \"%REQ(USER-AGENT)%\" \"%REQ(X-REQUEST-ID)%\" \"%REQ(:AUTHORITY)%\" \"%UPSTREAM_HOST%\"\n"
    filter:
      status_code_filter:
        comparison:
          op: GE
          value:
            default_value: 500
            runtime_key: access_log.access_error.status

That told me nothing new: %CONNECTION_TERMINATION_DETAILS% is always empty ("-"), and I had already seen the response flags in the headers of the downstream responses.
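
A sketch of a format string that focuses on upstream-side fields, in case it helps narrow down the UF flag (upstream connection failure). %RESPONSE_CODE_DETAILS%, %UPSTREAM_CLUSTER%, %UPSTREAM_HOST% and %UPSTREAM_TRANSPORT_FAILURE_REASON% are standard Envoy access log operators; the selection of fields and the key=value labels are assumptions for illustration, and the failure reason may still be empty for a plain connect timeout:

```yaml
access_log:
  - name: accesslog
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
      path: /tmp/http_access.log
      log_format:
        # Field selection and labels are illustrative, not from the original config.
        # The aim is to record which cluster and host each failed request was sent to.
        text_format: "[%START_TIME%] %RESPONSE_CODE% flags=%RESPONSE_FLAGS% details=%RESPONSE_CODE_DETAILS% cluster=%UPSTREAM_CLUSTER% host=%UPSTREAM_HOST% failure=%UPSTREAM_TRANSPORT_FAILURE_REASON% duration=%DURATION%\n"
```

With the upstream host in each line, it should at least be possible to see whether the failures cluster on particular endpoints.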

I increased connect_timeout twice (0.01s -> 0.02s -> 0.05s), which didn't help at all. Other services (routed directly) work fine with a 10 ms connect timeout. BTW, after a redeploy everything reliably works fine for approximately 20 minutes.

I'd appreciate any ideas about what this could be and where I should dig.

P.S. I also sometimes see health check errors in the logs, but I have no idea why. Without Envoy everything worked well (no errors, no timeouts): health checking, direct requests, etc.

-- sirsova
devops
envoyproxy
http
kubernetes
load-balancing

1 Answer

11/10/2021

I experienced a similar problem when starting Envoy as a Docker container. In the end, the cause was a missing --network host option in the docker run command, which left the clusters unreachable from within Envoy's Docker container. Maybe this helps you, too?
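
For anyone who starts the container through Docker Compose rather than a plain docker run command, the same setting can be sketched roughly as follows. The service name, image tag and mounted config path are assumptions for illustration; network_mode: host is the Compose counterpart of --network host and has traditionally only been supported on Linux hosts:

```yaml
# docker-compose.yml (hypothetical file; names, tag and paths are assumptions)
services:
  envoy:
    image: envoyproxy/envoy:v1.20.0   # assumed tag, use whichever version you run
    network_mode: host                # share the host's network stack, like `docker run --network host`
    volumes:
      # mount the local config at the image's default config location
      - ./envoy.yaml:/etc/envoy/envoy.yaml:ro
```

With host networking, port mappings are not needed, since the listener and admin ports bind directly on the host.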

-- OLF
Source: StackOverflow