We are trying to build an HA Kubernetes cluster with 3 core nodes, each running the full set of vital components: etcd + APIServer + Scheduler + ControllerManager, plus an external load balancer. Since etcd can form a cluster on its own, we are stuck with making the APIServers HA. What seemed an obvious task a couple of weeks ago has now turned into a "no way" disaster...
We decided to use nginx as the balancer in front of 3 independent APIServers. All other parts of our cluster that talk to the APIServer (Kubelets, Kube-Proxies, Schedulers, ControllerManagers...) are supposed to go through the balancer to reach it. Everything went well until we started the "destructive" tests (as I call them) with some pods running. Here is the part of the APIServer config that deals with HA:
.. --apiserver-count=3 --endpoint-reconciler-type=lease ..
Here is our nginx.conf:
user nginx;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;
worker_processes auto;

events {
    multi_accept on;
    use epoll;
    worker_connections 4096;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';

    access_log /var/log/nginx/access.log main;

    sendfile on;
    #tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;
    gzip on;
    underscores_in_headers on;

    include /etc/nginx/conf.d/*.conf;
}
And apiservers.conf:
upstream apiserver_https {
    least_conn;
    server core1.sbcloud:6443; # max_fails=3 fail_timeout=3s;
    server core2.sbcloud:6443; # max_fails=3 fail_timeout=3s;
    server core3.sbcloud:6443; # max_fails=3 fail_timeout=3s;
}

map $http_upgrade $connection_upgrade {
    default upgrade;
    '' close;
}

server {
    listen 6443 ssl so_keepalive=1m:10s:3; # http2;
    ssl_certificate "/etc/nginx/certs/server.crt";
    ssl_certificate_key "/etc/nginx/certs/server.key";

    expires -1;
    proxy_cache off;
    proxy_buffering off;
    proxy_http_version 1.1;
    proxy_connect_timeout 3s;
    proxy_next_upstream error timeout invalid_header http_502; # non_idempotent # http_500 http_503 http_504;
    #proxy_next_upstream_tries 3;
    #proxy_next_upstream_timeout 3s;
    proxy_send_timeout 30m;
    proxy_read_timeout 30m;
    reset_timedout_connection on;

    location / {
        proxy_pass https://apiserver_https;
        add_header Cache-Control "no-cache";
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $http_host;
        proxy_set_header Authorization $http_authorization;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-SSL-CLIENT-CERT $ssl_client_cert;
    }
}
What came out of the first tests is that Kubernetes seems to use single long-living connections instead of traditional open-close sessions. This is probably due to SSL. So we had to increase proxy_send_timeout and proxy_read_timeout to a ridiculous 30m (the corresponding APIServer default is 1800s). If these settings are under 10m, then all clients (like the Scheduler and ControllerManager) generate tons of INTERNAL_ERROR messages because of broken streams.
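To be explicit about why we landed on those numbers, here are the two directives again with the reasoning as comments; the 1800s figure is kube-apiserver's default --min-request-timeout, and if I read its help text correctly the watch handler picks a randomized timeout above that value, so the proxy timeouts probably need to be at least that long:

    # watches are long-lived, mostly idle HTTP streams; if nginx times them out
    # before the APIServer would, clients see broken streams / INTERNAL_ERROR
    proxy_send_timeout 30m;  # >= --min-request-timeout (1800s by default)
    proxy_read_timeout 30m;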
So, for the crash test I simply take one of the APIServers down by gently switching it off. Then I restart another one, so nginx sees that the upstream went down and switches all current connections to the last one. A couple of seconds later the restarted APIServer comes back, and we have 2 APIServers working. Then I take the network down on the third APIServer by running 'systemctl stop network' on that server, so it has no chance to tell Kubernetes or nginx that it is going away.
Now the cluster is totally broken! nginx seems to recognize that the upstream went down, but it does not reset the already existing connections to the dead upstream. I can still see them with 'ss -tnp'. If I restart the Kubernetes services, they reconnect and continue to work; the same goes for restarting nginx - new sockets show up in the ss output.
This happens only if I make the APIServer unavailable by taking the network down (preventing it from closing its existing connections to nginx and from informing Kubernetes that it is switching off). If I just stop it, everything works like a charm. But this is not a realistic case: a server can go down without any warning, just instantly.
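One thing I found while digging but have not tested yet (so treat it as an assumption, not something I know works): nginx 1.15.6+ has a proxy_socket_keepalive directive that enables SO_KEEPALIVE on the nginx-to-upstream sockets, the same way so_keepalive on the listen does for client connections. Something like:

    # untested idea: let the kernel eventually notice the dead upstream peer;
    # needs nginx >= 1.15.6, and the host's net.ipv4.tcp_keepalive_* sysctls
    # must be lowered from their defaults (~2 hours) for it to react quickly
    proxy_socket_keepalive on;

Would that actually make nginx drop the stale upstream connections, or does it only help with idle connections being silently dropped?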
What are we doing wrong? Is there a way to force nginx to drop all connections to an upstream that went down? Anything else to try before we move to HAProxy or LVS and write off a week spent kicking nginx in our attempts to make it balance instead of breaking our not-so-HA cluster?
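The only other idea left on our list is to stop terminating TLS in nginx and switch to a plain TCP pass-through with the stream module, roughly like this (a sketch, not yet tested against the failure scenario above; it would replace the http server block, let the APIServers terminate TLS and see client certificates directly, but we would lose the HTTP-level retries from proxy_next_upstream):

    stream {
        upstream apiserver_tcp {
            least_conn;
            server core1.sbcloud:6443 max_fails=3 fail_timeout=3s;
            server core2.sbcloud:6443 max_fails=3 fail_timeout=3s;
            server core3.sbcloud:6443 max_fails=3 fail_timeout=3s;
        }
        server {
            listen 6443 so_keepalive=1m:10s:3;
            proxy_connect_timeout 3s;
            # idle tunnels must outlive idle watches
            proxy_timeout 3600s;
            proxy_pass apiserver_tcp;
        }
    }

Would that even behave differently here, or would the stale client-side tunnels still point at the dead node until proxy_timeout fires?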