I'm newbie in envoy. I have been struggling during a week with error below. So my downstream(server which requests for some data/update) receives response:
Status code: 503
Headers:
...
Server:"envoy"
X-Envoy-Response-Code-Details:"upstream_reset_before_response_started{connection_failure}"
X-Envoy-Response-Flags: "UF,URX"
Body: upstream connect error or disconnect/reset before headers. reset reason: connection failure
On the other side, my upstream gets disconnection(context cancelled). And upstream service doesn't return 503 codes at all.
All network is going by http1.
My envoy.yaml:
admin:
access_log_path: /tmp/admin_access.log
address:
socket_address: { address: 0.0.0.0, port_value: 9901 }
static_resources:
listeners:
- name: listener_0
address:
socket_address: { address: 0.0.0.0, port_value: 80 }
filter_chains:
- filters:
- name: envoy.filters.network.http_connection_manager
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
stat_prefix: ingress_http
codec_type: http1
route_config:
name: local_route
virtual_hosts:
- name: local_service
domains: [ "*" ]
response_headers_to_add: # added for debugging
- header:
key: x-envoy-response-code-details
value: "%RESPONSE_CODE_DETAILS%"
- header:
key: x-envoy-response-flags
value: "%RESPONSE_FLAGS%"
routes:
- match: # consistent routing
safe_regex:
google_re2: { }
regex: SOME_STRANGE_REGEX_FOR_CONSISTENT_ROUTING
route:
cluster: consistent_cluster
hash_policy:
header:
header_name: ":path"
regex_rewrite:
pattern:
google_re2: { }
regex: SOME_STRANGE_REGEX_FOR_CONSISTENT_ROUTING
substitution: "\\1"
retry_policy: # attempt to avoid 503 errors by retries
retry_on: "connect-failure,refused-stream,unavailable,cancelled,resource-exhausted,retriable-status-codes"
retriable_status_codes: [ 503 ]
num_retries: 3
retriable_request_headers:
- name: ":method"
exact_match: "GET"
- match: { prefix: "/" } # default routing (all routes except consistent)
route:
cluster: default_cluster
retry_policy: # attempt to avoid 503 errors by retries
retry_on: "connect-failure,refused-stream,unavailable,cancelled,resource-exhausted,retriable-status-codes"
retriable_status_codes: [ 503 ]
retry_host_predicate:
- name: envoy.retry_host_predicates.previous_hosts
host_selection_retry_max_attempts: 3
http_filters:
- name: envoy.filters.http.router
clusters:
- name: consistent_cluster
connect_timeout: 0.05s
type: STRICT_DNS
dns_refresh_rate: 1s
dns_lookup_family: V4_ONLY
lb_policy: MAGLEV
health_checks:
- timeout: 1s
interval: 1s
unhealthy_threshold: 1
healthy_threshold: 1
http_health_check:
path: "/health"
load_assignment:
cluster_name: consistent_cluster
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: consistent-host
port_value: 80
- name: default_cluster
connect_timeout: 0.05s
type: STRICT_DNS
dns_refresh_rate: 1s
dns_lookup_family: V4_ONLY
lb_policy: ROUND_ROBIN
health_checks:
- timeout: 1s
interval: 1s
unhealthy_threshold: 1
healthy_threshold: 1
http_health_check:
path: "/health"
outlier_detection: # attempt to avoid 503 errors by ejecting unhealth pods
consecutive_gateway_failure: 1
load_assignment:
cluster_name: default_cluster
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: default-host
port_value: 80
I also tried to add logs:
access_log:
- name: accesslog
typed_config:
"@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
path: /tmp/http_access.log
log_format:
text_format: "[%START_TIME%] \"%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%\" %RESPONSE_CODE% %CONNECTION_TERMINATION_DETAILS% %RESPONSE_FLAGS% %BYTES_RECEIVED% %BYTES_SENT% %DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% \"%REQ(X-FORWARDED-FOR)%\" \"%REQ(USER-AGENT)%\" \"%REQ(X-REQUEST-ID)%\" \"%REQ(:AUTHORITY)%\" \"%UPSTREAM_HOST%\"\n"
filter:
status_code_filter:
comparison:
op: GE
value:
default_value: 500
runtime_key: access_log.access_error.status
It gave me nothing, because %CONNECTION_TERMINATION_DETAILS%
is always empty("-") and response flags I have seen already from headers in downstream responses.
I increased connect_timeout
twice (0.01s -> 0.02s -> 0.05s). It didn't help at all. And other services(by direct routing) work okay with connect timeout 10ms.
BTW everything works nice after redeploy during approximately 20 minutes for sure.
Hope to hear your ideas what it can be and where i should dig into)
P.S: I also receive health check errors sometimes(in logs), but i have no idea why. And everything without envoy worked well(no errors, no timeouts): health checking, direct requests, etc.
I experienced a similar problem when starting envoy as a docker container. In the end, the reason was a missing --network host
option in the docker run
command which lead to the clusters not being visible from within envoy's docker container. Maybe this helps you, too?