Failed to get job complete status for job rke-network-plugin-deploy-job

1/12/2020

I deployed RKE in an air-gapped environment with the specification below:


Nodes:

3 controlplane/etcd nodes and 2 worker nodes


RKE version: v1.0.0


Docker version (docker info):

Client:
 Debug Mode: false

Server:
 Containers: 24
  Running: 7
  Paused: 0
  Stopped: 17
 Images: 4
 Server Version: 19.03.1-ol
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: **************
 runc version: ******
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.14.35-1902.8.4.el7uek.x86_64
 Operating System: Oracle Linux Server 7.7
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 1.409GiB
 Name: rke01.kuberlocal.co
 ID: *******************************
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  registry.console:5000
  127.0.0.0/8
 Live Restore Enabled: false
 Registries:

Operating system and kernel (Oracle Linux 7):

Red Hat Enterprise Linux Server release 7.7
4.14.35-1902.8.4.el7uek.x86_64

Type/provider of hosts: VirtualBox (test environment)


cluster.yml file:

# If you intend to deploy Kubernetes in an air-gapped environment,
# please consult the documentation on how to configure custom RKE images.
nodes:
- address: rke01
  port: "22"
  internal_address: 192.168.40.11
  role:
  - controlplane
  - etcd
  hostname_override: ""
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: rke02
  port: "22"
  internal_address: 192.168.40.17
  role:
  - controlplane
  - etcd
  hostname_override: ""
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: rke03
  port: "22"
  internal_address: 192.168.40.13
  role:
  - controlplane
  - etcd
  hostname_override: ""
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: rke04
  port: "22"
  internal_address: 192.168.40.14
  role:
  - worker
  hostname_override: ""
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: rke05
  port: "22"
  internal_address: 192.168.40.15
  role:
  - worker
  hostname_override: ""
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
services:
  etcd:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: ""
    uid: 0
    gid: 0
    snapshot: null
    retention: ""
    creation: ""
    backup_config: null
  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    service_cluster_ip_range: 10.43.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
    always_pull_images: false
    secrets_encryption_config: null
    audit_log: null
    admission_configuration: null
    event_rate_limit: null
  kube-controller:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    cluster_cidr: 10.42.0.0/16
    service_cluster_ip_range: 10.43.0.0/16
  scheduler:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
  kubelet:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    cluster_domain: bmi.rke.cluster.local
    infra_container_image: ""
    cluster_dns_server: 10.43.0.10
    fail_swap_on: false
    generate_serving_certificate: false
  kubeproxy:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
network:
  plugin: weave
  weave_network_provider:
    password: "********"
  options: {}
  node_selector: {}
authentication:
  strategy: x509
  sans: []
  webhook: null
addons: ""
addons_include: []
system_images:
  etcd: registry.console:5000/rancher/coreos-etcd:v3.3.15-rancher1
  alpine: registry.console:5000/rancher/rke-tools:v0.1.51
  nginx_proxy: registry.console:5000/rancher/rke-tools:v0.1.51
  cert_downloader: registry.console:5000/rancher/rke-tools:v0.1.51
  kubernetes_services_sidecar: registry.console:5000/rancher/rke-tools:v0.1.51
  kubedns: registry.console:5000/rancher/k8s-dns-kube-dns:1.15.0
  dnsmasq: registry.console:5000/rancher/k8s-dns-dnsmasq-nanny:1.15.0
  kubedns_sidecar: registry.console:5000/rancher/k8s-dns-sidecar:1.15.0
  kubedns_autoscaler: registry.console:5000/rancher/cluster-proportional-autoscaler:1.7.1
  coredns: registry.console:5000/rancher/coredns-coredns:1.6.2
  coredns_autoscaler: registry.console:5000/rancher/cluster-proportional-autoscaler:1.7.1
  kubernetes: registry.console:5000/rancher/hyperkube:v1.16.3-rancher1
  flannel: registry.console:5000/rancher/coreos-flannel:v0.11.0-rancher1
  flannel_cni: registry.console:5000/rancher/flannel-cni:v0.3.0-rancher5
  calico_node: registry.console:5000/rancher/calico-node:v3.8.1
  calico_cni: registry.console:5000/rancher/calico-cni:v3.8.1
  calico_controllers: registry.console:5000/rancher/calico-kube-controllers:v3.8.1
  calico_ctl: ""
  calico_flexvol: registry.console:5000/rancher/calico-pod2daemon-flexvol:v3.8.1
  canal_node: registry.console:5000/rancher/calico-node:v3.8.1
  canal_cni: registry.console:5000/rancher/calico-cni:v3.8.1
  canal_flannel: registry.console:5000/rancher/coreos-flannel:v0.11.0
  canal_flexvol: registry.console:5000/rancher/calico-pod2daemon-flexvol:v3.8.1
  weave_node: registry.console:5000/weaveworks/weave-kube:2.5.2
  weave_cni: registry.console:5000/weaveworks/weave-npc:2.5.2
  pod_infra_container: registry.console:5000/rancher/pause:3.1
  ingress: registry.console:5000/rancher/nginx-ingress-controller:nginx-0.25.1-rancher1
  ingress_backend: registry.console:5000/rancher/nginx-ingress-controller-defaultbackend:1.5-rancher1
  metrics_server: registry.console:5000/rancher/metrics-server:v0.3.4
  windows_pod_infra_container: rancher/kubelet-pause:v0.1.3
ssh_key_path: ~/.ssh/id_rsa
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
#ignore_docker_version: false
ignore_docker_version: true
kubernetes_version: ""
private_registries:
- url: registry.console:5000
  user: registry_user
  password: ***********
  is_default: true
ingress:
  provider: ""
  options: {}
  node_selector: {}
  extra_args: {}
  dns_policy: ""
  extra_envs: []
  extra_volumes: []
  extra_volume_mounts: []
cluster_name: ""
cloud_provider:
  name: ""
prefix_path: "/opt/rke/"
addon_job_timeout: 30
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: ""
  ssh_cert: ""
  ssh_cert_path: ""
monitoring:
  provider: ""
  options: {}
  node_selector: {}
restore:
  restore: false
  snapshot_name: ""
dns:
  provider: coredns

Steps to Reproduce:

rke -d up --config cluster.yml

Results:

INFO[0129] [sync] Successfully synced nodes Labels and Taints
DEBU[0129] Host: rke01 has role: controlplane
DEBU[0129] Host: rke01 has role: etcd
DEBU[0129] Host: rke03 has role: controlplane
DEBU[0129] Host: rke03 has role: etcd
DEBU[0129] Host: rke04 has role: worker
DEBU[0129] Host: rke05 has role: worker
INFO[0129] [network] Setting up network plugin: weave
INFO[0129] [addons] Saving ConfigMap for addon rke-network-plugin to Kubernetes
INFO[0129] [addons] Successfully saved ConfigMap for addon rke-network-plugin to Kubernetes
INFO[0129] [addons] Executing deploy job rke-network-plugin
DEBU[0129] [k8s] waiting for job rke-network-plugin-deploy-job to complete..
FATA[0159] Failed to get job complete status for job rke-network-plugin-deploy-job in namespace kube-system

kubectl get pods --all-namespaces

NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system rke-network-plugin-deploy-job-4jgcq 0/1 Error 0 4m6s
kube-system rke-network-plugin-deploy-job-57jr8 0/1 Error 0 3m50s
kube-system rke-network-plugin-deploy-job-h2gr8 0/1 Error 0 90s
kube-system rke-network-plugin-deploy-job-p92br 0/1 Error 0 2m50s
kube-system rke-network-plugin-deploy-job-xrgpl 0/1 Error 0 4m1s
kube-system rke-network-plugin-deploy-job-zqhmk 0/1 Error 0 3m30s

kubectl describe pod rke-network-plugin-deploy-job-zqhmk -n kube-system

Name:           rke-network-plugin-deploy-job-zqhmk
Namespace:      kube-system
Priority:       0
Node:           rke01/192.168.40.11
Start Time:     Sun, 12 Jan 2020 09:40:00 +0330
Labels:         controller-uid=*******************
                job-name=rke-network-plugin-deploy-job
Annotations:
Status:         Failed
IP:             192.168.40.11
IPs:
  IP:           192.168.40.11
Controlled By:  Job/rke-network-plugin-deploy-job
Containers:
  rke-network-plugin-pod:
    Container ID:  docker://7658aecff174e4ac53caaf088782dab50654911065371cd0d8dcdd50b8fbef3b
    Image:         registry.console:5000/rancher/hyperkube:v1.16.3-rancher1
    Image ID:      docker-pullable://registry.console:5000/rancher/hyperkube@sha256:0a55590eb8453bcc46a4bdb8217a48cf56a7c7f7c52d72a267632ffa35b3b8c8
    Port:
    Host Port:
    Command:
      kubectl
      apply
      -f
      /etc/config/rke-network-plugin.yaml
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sun, 12 Jan 2020 09:40:00 +0330
      Finished:     Sun, 12 Jan 2020 09:40:01 +0330
    Ready:          False
    Restart Count:  0
    Environment:
    Mounts:
      /etc/config from config-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from rke-job-deployer-token-9dt6n (ro)
Conditions:
  Type             Status
  Initialized      True
  Ready            False
  ContainersReady  False
  PodScheduled     True
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      rke-network-plugin
    Optional:  false
  rke-job-deployer-token-9dt6n:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  rke-job-deployer-token-9dt6n
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:
Tolerations:
Events:
  Type    Reason   Age    From            Message
  Normal  Pulled   4m10s  kubelet, rke01  Container image "registry.console:5000/rancher/hyperkube:v1.16.3-rancher1" already present on machine
  Normal  Created  4m10s  kubelet, rke01  Created container rke-network-plugin-pod
  Normal  Started  4m10s  kubelet, rke01  Started container rke-network-plugin-pod

Container logs (docker logs -f 267a894bb999):

unable to recognize "/etc/config/rke-network-plugin.yaml": Get https://10.43.0.1:443/api?timeout=32s: dial tcp 10.43.0.1:443: connect: network is unreachable
(the same line is repeated several more times)
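
So the deploy-job pod, which appears to use the host's network (its IP equals the node IP), cannot even open a connection to the kube-apiserver service IP 10.43.0.1: "network is unreachable" is a routing error, not a timeout. For reference, a minimal way to check this from the node, assuming the standard iproute2 and iptables tools are installed and kube-proxy runs in its default iptables mode:

# Show the routing table and ask the kernel how it would reach the service IP.
# "Network is unreachable" here matches the error seen in the deploy-job pod.
ip route show
ip route get 10.43.0.1

# kube-proxy should have installed DNAT rules for 10.43.0.1:443 in the KUBE-SERVICES chain.
sudo iptables -t nat -L KUBE-SERVICES -n | grep 10.43.0.1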

Network interfaces (ip addr):

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether *********** brd ff:ff:ff:ff:ff:ff
    inet 192.168.40.11/24 brd 192.168.40.255 scope global dynamic enp0s8
       valid_lft 847sec preferred_lft 847sec
    inet6 ************* scope link
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether *************** brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 ************* scope link
       valid_lft forever preferred_lft forever

Docker networks:

docker network ls

NETWORK ID NAME DRIVER SCOPE
c6063ba5a4d0 bridge bridge local
822441eae3cf host host local
314798c82599 none null local

Is the issue related to the network interfaces? If so, how can I create the missing one?

-- mehdi.aghayari
kubernetes
rancher
rke

2 Answers

5/16/2020

I had the same issue, and these two steps solved my problem (a sketch of both is below):

  1. Increase addon_job_timeout
  2. Check the nodes' free disk space (keep at least 15% free)

In my case, one of the nodes was in a DiskPressure state.
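
A minimal sketch of both steps, assuming the default cluster.yml field names and that kubectl already works against the cluster (the 120-second value is only an example):

# 1. In cluster.yml, raise the addon job timeout (seconds), then re-run RKE:
#      addon_job_timeout: 120
rke up --config cluster.yml

# 2. Check free disk space and look for DiskPressure on the nodes:
df -h /var/lib/docker
kubectl describe nodes | grep -i diskpressure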

-- Ali.MD
Source: StackOverflow

1/13/2020

I resolved this by creating a Docker bridge network with the command below:

docker network create --driver=bridge --subnet=10.43.0.0/16 br0_rke
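
Presumably this works because the new bridge adds a host route covering the 10.43.0.0/16 service CIDR, so connections to 10.43.0.1 are no longer rejected with "network is unreachable" and kube-proxy's iptables rules can take over. A quick check after creating it (the bridge interface name br-<id> is assigned by Docker):

docker network ls | grep br0_rke
ip route | grep "10.43.0.0/16"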
-- mehdi.aghayari
Source: StackOverflow