nginx vs kubernetes (as an external balancer) - fails to balance API servers

5/3/2018

We are trying to build an HA Kubernetes cluster with 3 core nodes, each having a full set of vital components: ETCD + APIServer + Scheduler + ControllerManager, plus an external balancer. Since ETCD can form a cluster by itself, we are stuck with making the APIServers HA. What seemed like an obvious task a couple of weeks ago has now become a "no way" disaster...

We decided to use nginx as a balancer for 3 independent APIServers. All the other parts of our cluster that communicate with the APIServer (Kubelets, Kube-Proxies, Schedulers, ControllerManagers...) are supposed to go through the balancer to reach it. Everything went well until we started the "destructive" tests (as I call them) with some pods running. Here is the part of the APIServer config that deals with HA:

.. --apiserver-count=3 --endpoint-reconciler-type=lease ..

Here is our nginx.conf:

user                    nginx;

error_log               /var/log/nginx/error.log warn;
pid                     /var/run/nginx.pid;

worker_processes        auto;

events {
    multi_accept        on;
    use                 epoll;
    worker_connections  4096;
}

http {
    include             /etc/nginx/mime.types;
    default_type        application/octet-stream;

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                  '$status $body_bytes_sent "$http_referer" '
                  '"$http_user_agent" "$http_x_forwarded_for"';

    access_log          /var/log/nginx/access.log  main;

    sendfile            on;
    #tcp_nopush          on;
    tcp_nodelay         on;
    keepalive_timeout   65;
    types_hash_max_size 2048;

    gzip                on;

    underscores_in_headers on;

    include /etc/nginx/conf.d/*.conf;
}

And apiservers.conf:

upstream apiserver_https {
    least_conn;
    server core1.sbcloud:6443; # max_fails=3 fail_timeout=3s;
    server core2.sbcloud:6443; # max_fails=3 fail_timeout=3s;
    server core3.sbcloud:6443; # max_fails=3 fail_timeout=3s;
}

map $http_upgrade $connection_upgrade {
    default upgrade;
    '' close;
}

server {
    listen                      6443 ssl so_keepalive=1m:10s:3; # http2; 

    ssl_certificate             "/etc/nginx/certs/server.crt";
    ssl_certificate_key         "/etc/nginx/certs/server.key";

    expires                     -1;
    proxy_cache                 off;
    proxy_buffering             off;
    proxy_http_version          1.1;

    proxy_connect_timeout       3s;

    proxy_next_upstream         error timeout invalid_header http_502; # non_idempotent # http_500 http_503 http_504;
    #proxy_next_upstream_tries   3;
    #proxy_next_upstream_timeout 3s;
    proxy_send_timeout          30m;
    proxy_read_timeout          30m;
    reset_timedout_connection   on;


    location / {
        proxy_pass              https://apiserver_https;
        add_header              Cache-Control "no-cache";
        proxy_set_header        Upgrade $http_upgrade;
        proxy_set_header        Connection $connection_upgrade;
        proxy_set_header        Host $http_host;
        proxy_set_header        Authorization $http_authorization;
        proxy_set_header        X-Real-IP $remote_addr;
        proxy_set_header        X-SSL-CLIENT-CERT $ssl_client_cert;
     }
}

What came out after some tests is that Kubernetes seems to use a single long-living connection instead of traditional open-close sessions. This is probably due to SSL. So we had to increase proxy_send_timeout and proxy_read_timeout to a ridiculous 30m (the default value for the APIServer is 1800s). If these settings are under 10m, then all clients (like the Scheduler and ControllerManager) generate tons of INTERNAL_ERROR messages because of broken streams.
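To make the numbers concrete: 30m is exactly 1800s, which we believe corresponds to the kube-apiserver --min-request-timeout default that governs how long watch requests are held open. So the relevant pair in apiservers.conf is effectively this (the comment reflects our assumption about why it has to be that high, not anything documented for nginx + Kubernetes specifically):

    # assumption: keep proxied watch streams open at least as long as the
    # apiserver's default --min-request-timeout (1800s = 30m)
    proxy_send_timeout          30m;
    proxy_read_timeout          30m;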

So, for the crash test I simply put one of the APIServers down by gently switching it off. Then I restart another one, so nginx sees that the upstream went down and switches all current connections to the last one. A couple of seconds later the restarted APIServer comes back and we have 2 APIServers working. Then I take the network down on the third APIServer by running 'systemctl stop network' on that server, so it has no chance to inform Kubernetes or nginx that it is going down.

Now the cluster is totally broken! nginx seems to recognize that the upstream went down, but it will not reset the already existing connections to the upstream that is dead. I can still see them with 'ss -tnp'. If I restart the Kubernetes services, they reconnect and continue to work; same if I restart nginx - new sockets show up in the ss output.
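Our understanding (which may well be wrong) is that the passive health-check and retry knobs in apiservers.conf only influence where new requests are sent; none of them appear to close connections that are already established, which would explain what we see with 'ss -tnp'. For reference, this is the variant we would enable - the same hosts, just with the commented-out parameters switched on:

upstream apiserver_https {
    least_conn;
    # passive health checks: after 3 failures within 3s nginx stops sending
    # *new* requests to that server; already-open connections are not touched
    server core1.sbcloud:6443 max_fails=3 fail_timeout=3s;
    server core2.sbcloud:6443 max_fails=3 fail_timeout=3s;
    server core3.sbcloud:6443 max_fails=3 fail_timeout=3s;
}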

This happens only if I make the APIServer unavailable by taking the network down (preventing it from closing its existing connections to nginx and from informing Kubernetes that it is switching off). If I just stop it, everything works like a charm. But this is not a realistic case: a server can go down without any warning - just instantly.

What are we doing wrong? Is there a way to force nginx to drop all connections to an upstream that went down? Anything to try before we move to HAProxy or LVS and throw away a week of kicking nginx in our attempts to make it balance instead of breaking our not-so-HA cluster?
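One direction we are considering (but have not tried yet) is to stop terminating TLS on nginx altogether and balance at the TCP level with the stream module, roughly like the sketch below instead of the https server above; the stream block has to sit at the top level of nginx.conf, next to http, not inside it. Whether this would actually make nginx give up on connections to a dead upstream any faster is exactly what we are not sure about:

stream {
    upstream apiserver_tcp {
        least_conn;
        server core1.sbcloud:6443 max_fails=3 fail_timeout=3s;
        server core2.sbcloud:6443 max_fails=3 fail_timeout=3s;
        server core3.sbcloud:6443 max_fails=3 fail_timeout=3s;
    }

    server {
        listen                6443;
        proxy_pass            apiserver_tcp;
        proxy_connect_timeout 3s;
        # closes the tunnelled connection if neither side sends anything
        # for this long; we would probably start with the same 30m
        proxy_timeout         30m;
    }
}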

-- e-pirate
kubernetes
nginx

0 Answers