K3s cluster not starting (anymore)

5/16/2020

My local k3s playground decided to suddenly stop working. I have a gut feeling that something is wrong with the HTTPS certs. I start the cluster from Docker Compose using:

version: '3.2'

services:
  server:
    image: rancher/k3s:latest
    command: server  --disable-agent --tls-san 192.168.2.110
    environment:
    - K3S_CLUSTER_SECRET=somethingtotallyrandom
    - K3S_KUBECONFIG_OUTPUT=/output/kubeconfig.yaml
    - K3S_KUBECONFIG_MODE=666
    volumes:
    - k3s-server:/var/lib/rancher/k3s
    # get the kubeconfig file
    - .:/output
    - ./registries.yaml:/etc/rancher/k3s/registries.yaml
    ports:
     - 192.168.2.110:6443:6443

  node:
    image: rancher/k3s:latest
    volumes:
    - ./registries.yaml:/etc/rancher/k3s/registries.yaml

    tmpfs:
    - /run
    - /var/run
    privileged: true
    environment:
    - K3S_URL=https://server:6443
    - K3S_CLUSTER_SECRET=somethingtotallyrandom
    ports:
      - 31000-32000:31000-32000

volumes:
  k3s-server: {}

Nothing special. The registries.yaml mount can be commented in or out without making a difference; its contents are:

mirrors:
  "192.168.2.110:5055":
    endpoint:
      - "http://192.168.2.110:5055"

However, I now get a bunch of weird failures:

server_1  | E0516 22:58:03.264451       1 available_controller.go:420] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: Get https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
server_1  | E0516 22:58:08.265272       1 available_controller.go:420] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: Get https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
node_1    | I0516 22:58:12.695365       1 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: bb7ee4b14724692f4497e99716b68c4dc4fe77333b03801909092d42c00ef5a2
node_1    | I0516 22:58:15.006306       1 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: bb7ee4b14724692f4497e99716b68c4dc4fe77333b03801909092d42c00ef5a2
node_1    | I0516 22:58:15.006537       1 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: fc2e51300f2ec06949abf5242690cb36077adc409f0d7f131a9d4f911063b63c
node_1    | E0516 22:58:15.006757       1 pod_workers.go:191] Error syncing pod e127dc88-e252-4e2e-bbd5-2e93ce5e32ff ("helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"), skipping: failed to "StartContainer" for "helm" with CrashLoopBackOff: "back-off 1m20s restarting failed container=helm pod=helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"
server_1  | E0516 22:58:22.345501       1 resource_quota_controller.go:408] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
node_1    | I0516 22:58:27.695296       1 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: fc2e51300f2ec06949abf5242690cb36077adc409f0d7f131a9d4f911063b63c
node_1    | E0516 22:58:27.695989       1 pod_workers.go:191] Error syncing pod e127dc88-e252-4e2e-bbd5-2e93ce5e32ff ("helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"), skipping: failed to "StartContainer" for "helm" with CrashLoopBackOff: "back-off 1m20s restarting failed container=helm pod=helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"
server_1  | I0516 22:58:30.328999       1 request.go:621] Throttling request took 1.047650754s, request: GET:https://127.0.0.1:6444/apis/admissionregistration.k8s.io/v1beta1?timeout=32s
server_1  | W0516 22:58:31.081020       1 garbagecollector.go:644] failed to discover some groups: map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]
server_1  | E0516 22:58:36.442904       1 available_controller.go:420] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: Get https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
node_1    | I0516 22:58:40.695404       1 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: fc2e51300f2ec06949abf5242690cb36077adc409f0d7f131a9d4f911063b63c
node_1    | E0516 22:58:40.696176       1 pod_workers.go:191] Error syncing pod e127dc88-e252-4e2e-bbd5-2e93ce5e32ff ("helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"), skipping: failed to "StartContainer" for "helm" with CrashLoopBackOff: "back-off 1m20s restarting failed container=helm pod=helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"
server_1  | E0516 22:58:41.443295       1 available_controller.go:420] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: Get https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
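For what it's worth, this is roughly how I check whether 10.43.6.218 really is the metrics-server service and whether its pod is alive, run inside the server container like the other commands below (the service name metrics-server is an assumption based on the bundled manifest):

kubectl -n kube-system get svc metrics-server
kubectl -n kube-system get pods -o wide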

Also, it seems my node is not really connecting to the server anymore:

user@ipc:~/dev/test_mk3s_docker$ docker exec -it  $(docker ps |grep "k3s server"|awk -F\  '{print $1}') kubectl cluster-info
Kubernetes master is running at https://127.0.0.1:6443
CoreDNS is running at https://127.0.0.1:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Metrics-server is running at https://127.0.0.1:6443/api/v1/namespaces/kube-system/services/https:metrics-server:/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
user@ipc:~/dev/test_mk3s_docker$ docker exec -it  $(docker ps |grep "k3s agent"|awk -F\  '{print $1}') kubectl cluster-info
error: Missing or incomplete configuration info.  Please point to an existing, complete config file:

  1. Via the command-line flag --kubeconfig
  2. Via the KUBECONFIG environment variable
  3. In your home directory as ~/.kube/config

To view or setup config directly use the 'config' command.
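As far as I know the k3s agent container does not get its own kubeconfig, so that error by itself may be expected; the more telling check is probably whether the node ever registered on the server side:

docker exec -it  $(docker ps |grep "k3s server"|awk -F\  '{print $1}') kubectl get nodes -o wide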

If I run `kubectl get apiservice` I get the following output:

v1beta1.storage.k8s.io                 Local                        True                           20m
v1beta1.scheduling.k8s.io              Local                        True                           20m
v1.storage.k8s.io                      Local                        True                           20m
v1.k3s.cattle.io                       Local                        True                           20m
v1.helm.cattle.io                      Local                        True                           20m
v1beta1.metrics.k8s.io                 kube-system/metrics-server   False (FailedDiscoveryCheck)   20m
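To see why the discovery check fails I would dig into that apiservice and the endpoints behind it, again from inside the server container (the endpoints name matching the service name is an assumption based on the bundled metrics-server manifest):

kubectl get apiservice v1beta1.metrics.k8s.io -o yaml
kubectl -n kube-system get endpoints metrics-server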

Also, downgrading k3s to k3s:v1.0.1 only changes the error message:

server_1  | E0516 23:46:02.951073       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.CSINode: no kind "CSINode" is registered for version "storage.k8s.io/v1" in scheme "k8s.io/kubernetes/pkg/api/legacyscheme/scheme.go:30"
server_1  | E0516 23:46:03.444519       1 status.go:71] apiserver received an error that is not an metav1.Status: &runtime.notRegisteredErr{schemeName:"k8s.io/kubernetes/pkg/api/legacyscheme/scheme.go:30", gvk:schema.GroupVersionKind{Group:"storage.k8s.io", Version:"v1", Kind:"CSINode"}, target:runtime.GroupVersioner(nil), t:reflect.Type(nil)}

After executing

 docker exec -it  $(docker ps |grep "k3s server"|awk -F\  '{print $1}') kubectl --namespace kube-system delete apiservice v1beta1.metrics.k8s.io

I only get:

node_1    | W0517 07:03:06.346944       1 info.go:51] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
node_1    | I0517 07:03:21.504932       1 log.go:172] http: TLS handshake error from 10.42.1.15:53888: remote error: tls: bad certificate
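Because of the bad certificate error I also want to look at what the server is actually serving on 6443; the generated certs themselves should live under /var/lib/rancher/k3s/server/tls in the k3s-server volume, as far as I remember. A quick check from the host (assuming openssl is installed there):

echo | openssl s_client -connect 192.168.2.110:6443 2>/dev/null | openssl x509 -noout -dates -subject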