failed to set up pod network: Unhandled Exception killed plugin

10/8/2016

I'm trying out a Kubernetes 1.4 install with rkt containers on CoreOS beta (1185.1.0).

I have two CoreOS PCs at home, both configured with etcd2 TLS certificates.

I patched the coreos-kubernetes automated generic install script to support etcd2 TLS certificates. The latest versions of the worker and controller install scripts are posted at https://github.com/kfirufk/coreos-kubernetes-multi-node-generic-install-script

I used the following environment variables for the controller CoreOS installation script (IP: 10.79.218.2, domain: coreos-2.tux-in.com):

ADVERTISE_IP=10.79.218.2
ETCD_ENDPOINTS="https://coreos-2.tux-in.com:2379,https://coreos-3.tux-in.com:2379"
K8S_VER=v1.4.1_coreos.0
HYPERKUBE_IMAGE_REPO=quay.io/coreos/hyperkube
POD_NETWORK=10.2.0.0/16
SERVICE_IP_RANGE=10.3.0.0/24
K8S_SERVICE_IP=10.3.0.1
DNS_SERVICE_IP=10.3.0.10
USE_CALICO=true
CONTAINER_RUNTIME=rkt
ETCD_CERT_FILE="/etc/ssl/etcd/etcd1.pem"
ETCD_KEY_FILE="/etc/ssl/etcd/etcd1-key.pem"
ETCD_TRUSTED_CA_FILE="/etc/ssl/etcd/ca.pem"
ETCD_CLIENT_CERT_AUTH=true
OVERWRITE_ALL_FILES=true
CONTROLLER_HOSTNAME="coreos-2.tux-in.com"
ETCD_CERT_ROOT_DIR="/etc/ssl/etcd"
ETCD_SCHEME="https"
ETCD_AUTHORITY="coreos-2.tux-in.com:2379"
IS_MASK_UPDATE_ENGINE=false
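
A quick way to rule out an addressing mistake in variables like these is to check them against each other. The sketch below is only an illustration and not part of the install script: `range_problems` is a hypothetical helper, and it assumes the usual constraints that the service IPs lie inside the service range, that the pod and service ranges do not overlap, and that node addresses sit outside both.

```python
import ipaddress


def range_problems(pod_network, service_range, service_ips, node_ips):
    """Return a list of human-readable inconsistencies between the ranges."""
    pods = ipaddress.ip_network(pod_network)
    services = ipaddress.ip_network(service_range)
    problems = []
    # The pod network and the service range should be disjoint.
    if pods.overlaps(services):
        problems.append("%s overlaps %s" % (pods, services))
    # Every service IP should fall inside the service range.
    for name, value in sorted(service_ips.items()):
        if ipaddress.ip_address(value) not in services:
            problems.append("%s=%s is outside %s" % (name, value, services))
    # Node addresses should not fall inside either cluster range.
    for name, value in sorted(node_ips.items()):
        address = ipaddress.ip_address(value)
        if address in pods or address in services:
            problems.append("%s=%s falls inside a cluster range" % (name, value))
    return problems


# Values copied from the controller variables listed above.
print(range_problems(
    "10.2.0.0/16",
    "10.3.0.0/24",
    {"K8S_SERVICE_IP": "10.3.0.1", "DNS_SERVICE_IP": "10.3.0.10"},
    {"ADVERTISE_IP": "10.79.218.2"},
))
```

An empty list means the ranges are at least mutually consistent, so the cause of a pod network failure would have to be elsewhere.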

These are the environment variables I used for the worker CoreOS installation script (IP: 10.79.218.3, domain: coreos-3.tux-in.com):

ETCD_AUTHORITY=coreos-3.tux-in.com:2379
ETCD_ENDPOINTS="https://coreos-2.tux-in.com:2379,https://coreos-3.tux-in.com:2379"
CONTROLLER_ENDPOINT=https://coreos-2.tux-in.com
K8S_VER=v1.4.1_coreos.0
HYPERKUBE_IMAGE_REPO=quay.io/coreos/hyperkube
DNS_SERVICE_IP=10.3.0.10
USE_CALICO=true
CONTAINER_RUNTIME=rkt
OVERWRITE_ALL_FILES=true
ADVERTISE_IP=10.79.218.3
ETCD_CERT_FILE="/etc/ssl/etcd/etcd2.pem"
ETCD_KEY_FILE="/etc/ssl/etcd/etcd2-key.pem"
ETCD_TRUSTED_CA_FILE="/etc/ssl/etcd/ca.pem"
ETCD_SCHEME="https"
IS_MASK_UPDATE_ENGINE=false

After installing Kubernetes on both machines and configuring kubectl properly, kubectl get nodes returns:

NAME          STATUS                     AGE
10.79.218.2   Ready,SchedulingDisabled   1h
10.79.218.3   Ready                      1h

kubectl get pods --namespace=kube-system returns:

NAME                                  READY     STATUS              RESTARTS   AGE
heapster-v1.2.0-3646253287-j951o      0/2       ContainerCreating   0          1d
kube-apiserver-10.79.218.2            1/1       Running             0          1d
kube-controller-manager-10.79.218.2   1/1       Running             0          1d
kube-dns-v20-u3pd0                    0/3       ContainerCreating   0          1d
kube-proxy-10.79.218.2                1/1       Running             0          1d
kube-proxy-10.79.218.3                1/1       Running             0          1d
kube-scheduler-10.79.218.2            1/1       Running             0          1d
kubernetes-dashboard-v1.4.1-ehiez     0/1       ContainerCreating   0          1d

So heapster-v1.2.0-3646253287-j951o, kube-dns-v20-u3pd0 and kubernetes-dashboard-v1.4.1-ehiez are stuck in the ContainerCreating status.

When I run kubectl describe on any of them, I get essentially the same error: Error syncing pod, skipping: failed to SyncPod: failed to set up pod network: Unhandled Exception killed plugin.

For example, kubectl describe pods kubernetes-dashboard-v1.4.1-ehiez --namespace kube-system returns:

Name:       kubernetes-dashboard-v1.4.1-ehiez
Namespace:  kube-system
Node:       10.79.218.3/10.79.218.3
Start Time: Mon, 17 Oct 2016 23:31:43 +0300
Labels:     k8s-app=kubernetes-dashboard
        kubernetes.io/cluster-service=true
        version=v1.4.1
Status:     Pending
IP:
Controllers:    ReplicationController/kubernetes-dashboard-v1.4.1
Containers:
  kubernetes-dashboard:
    Container ID:
    Image:      gcr.io/google_containers/kubernetes-dashboard-amd64:v1.4.1
    Image ID:
    Port:       9090/TCP
    Limits:
      cpu:  100m
      memory:   50Mi
    Requests:
      cpu:      100m
      memory:       50Mi
    State:      Waiting
      Reason:       ContainerCreating
    Ready:      False
    Restart Count:  0
    Liveness:       http-get http://:9090/ delay=30s timeout=30s period=10s #success=1 #failure=3
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-svbiv (ro)
    Environment Variables:  <none>
Conditions:
  Type      Status
  Initialized   True
  Ready     False
  PodScheduled  True
Volumes:
  default-token-svbiv:
    Type:   Secret (a volume populated by a Secret)
    SecretName: default-token-svbiv
QoS Class:  Guaranteed
Tolerations:    CriticalAddonsOnly=:Exists
Events:
  FirstSeen LastSeen    Count   From            SubobjectPath   Type        Reason      Message
  --------- --------    -----   ----            -------------   --------    ------      -------
  1d        25s     9350    {kubelet 10.79.218.3}           Warning     FailedSync  Error syncing pod, skipping: failed to SyncPod: failed to set up pod network: Unhandled Exception killed plugin

I'm guessing that pod networking isn't working because of a faulty Calico configuration.

So I tried to install the calicoctl rkt container, but had problems with that too. That is a separate Stack Overflow question ("starting calicoctl container on coreos").

As a result, I can't really check whether Calico works properly.

This is the calico-network systemd service file for the controller node:

[Unit]
Description=Calico per-host agent
Requires=network-online.target
After=network-online.target

[Service]
Slice=machine.slice
Environment=CALICO_DISABLE_FILE_LOGGING=true
Environment=HOSTNAME=10.79.218.3
Environment=IP=10.79.218.3
Environment=FELIX_FELIXHOSTNAME=10.79.218.3
Environment=CALICO_NETWORKING=true
Environment=NO_DEFAULT_POOLS=true
Environment=ETCD_ENDPOINTS=https://coreos-2.tux-in.com:2379,https://coreos-3.tux-in.com:2379
Environment=ETCD_AUTHORITY=coreos-3.tux-in.com:2379
Environment=ETCD_SCHEME=https
Environment=ETCD_CA_CERT_FILE=/etc/ssl/etcd/ca.pem
Environment=ETCD_CERT_FILE=/etc/ssl/etcd/etcd2.pem
Environment=ETCD_KEY_FILE=/etc/ssl/etcd/etcd2-key.pem

ExecStart=/usr/bin/rkt run --inherit-env --stage1-from-dir=stage1-fly.aci --volume=var-run-calico,kind=host,source=/var/run/calico --volume=modules,kind=host,source=/lib/modules,readOnly=false --mount=volume=modules,target=/lib/modules --volume=dns,kind=host,source=/etc/resolv.conf,readOnly=true --volume=etcd-tls-certs,kind=host,source=/etc/ssl/etcd,readOnly=true --mount=volume=dns,target=/etc/resolv.conf --mount=volume=etcd-tls-certs,target=/etc/ssl/etcd --mount=volume=var-run-calico,target=/var/run/calico --trust-keys-from-https quay.io/calico/node:v0.22.0
KillMode=mixed
Restart=always
TimeoutStartSec=0

[Install]
WantedBy=multi-user.target

And this is the calico-node service file for the worker node:

[Unit]
Description=Calico per-host agent
Requires=network-online.target
After=network-online.target

[Service]
Slice=machine.slice
Environment=CALICO_DISABLE_FILE_LOGGING=true
Environment=HOSTNAME=10.79.218.2
Environment=IP=10.79.218.2
Environment=FELIX_FELIXHOSTNAME=10.79.218.2
Environment=CALICO_NETWORKING=true
Environment=NO_DEFAULT_POOLS=false
Environment=ETCD_ENDPOINTS=https://coreos-2.tux-in.com:2379,https://coreos-3.tux-in.com:2379
ExecStart=/usr/bin/rkt run --inherit-env --stage1-from-dir=stage1-fly.aci --volume=var-run-calico,kind=host,source=/var/run/calico --volume=modules,kind=host,source=/lib/modules,readOnly=false --mount=volume=modules,target=/lib/modules --volume=dns,kind=host,source=/etc/resolv.conf,readOnly=true --volume=etcd-tls-certs,kind=host,source=/etc/ssl/etcd,readOnly=true --mount=volume=dns,target=/etc/resolv.conf --mount=volume=etcd-tls-certs,target=/etc/ssl/etcd --mount=volume=var-run-calico,target=/var/run/calico --trust-keys-from-https quay.io/calico/node:v0.22.0
KillMode=mixed
Environment=ETCD_CA_CERT_FILE=/etc/ssl/etcd/ca.pem
Environment=ETCD_CERT_FILE=/etc/ssl/etcd/etcd1.pem
Environment=ETCD_KEY_FILE=/etc/ssl/etcd/etcd1-key.pem
Restart=always
TimeoutStartSec=0

[Install]
WantedBy=multi-user.target
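
Because each unit carries its node identity in three separate variables (HOSTNAME, IP and FELIX_FELIXHOSTNAME), it can help to check them mechanically against the address the node is supposed to have. The following is a minimal sketch with hypothetical helper names; it assumes one assignment per `Environment=` line, as in the units above, and does not handle systemd's quoted or multi-assignment forms.

```python
def unit_environment(unit_text):
    """Return the Environment=KEY=VALUE assignments of a unit file as a dict."""
    env = {}
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line.startswith("Environment="):
            continue
        assignment = line[len("Environment="):]
        # Split at the first "=" only, so values such as URLs stay intact.
        key, sep, value = assignment.partition("=")
        if sep:
            env[key] = value
    return env


def identity_mismatches(env, expected_ip):
    """List the identity variables whose value differs from expected_ip."""
    names = ("HOSTNAME", "IP", "FELIX_FELIXHOSTNAME")
    return [name for name in names if env.get(name) != expected_ip]
```

Calling `identity_mismatches(unit_environment(text), "10.79.218.2")` on the unit installed on the controller, and the same with `"10.79.218.3"` on the worker, should return an empty list on both nodes. A non-empty list means the unit carries another node's identity.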

This is the content of /etc/kubernetes/cni/net.d/10-calico.conf on the controller node:

{
    "name": "calico",
    "type": "flannel",
    "delegate": {
        "type": "calico",
        "etcd_endpoints": "https://coreos-2.tux-in.com:2379,https://coreos-3.tux-in.com:2379",
        "etcd_key_file": "/etc/ssl/etcd/etcd1-key.pem",
        "etcd_cert_file": "/etc/ssl/etcd/etcd1.pem",
        "etcd_ca_cert_file": "/etc/ssl/etcd/ca.pem",
        "log_level": "none",
        "log_level_stderr": "info",
        "hostname": "10.79.218.2",
        "policy": {
            "type": "k8s",
            "k8s_api_root": "http://127.0.0.1:8080/api/v1/"
        }
    }
}

And this is /etc/kubernetes/cni/net.d/10-calico.conf on the worker node:

{
    "name": "calico",
    "type": "flannel",
    "delegate": {
        "type": "calico",
        "etcd_endpoints": "https://coreos-2.tux-in.com:2379,https://coreos-3.tux-in.com:2379",
        "etcd_key_file": "/etc/ssl/etcd/etcd2-key.pem",
        "etcd_cert_file": "/etc/ssl/etcd/etcd2.pem",
        "etcd_ca_cert_file": "/etc/ssl/etcd/ca.pem",
        "log_level": "debug",
        "log_level_stderr": "info",
        "hostname": "10.79.218.3",
        "policy": {
            "type": "k8s",
            "k8s_api_root": "https://coreos-2.tux-in.com:443/api/v1/",
            "k8s_client_key": "/etc/kubernetes/ssl/worker-key.pem",
            "k8s_client_certificate": "/etc/kubernetes/ssl/worker.pem"
        }
    }
}
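
Since a CNI plugin reads this file on every pod setup, one cheap check is to confirm that the file parses as JSON, see which plugin types it names, and verify that every certificate and key path it references is readable on that node. The sketch below is an assumption-laden illustration, not a Calico tool: `check_cni_config` and `referenced_files` are hypothetical names, and the list of file-valued keys is simply taken from the two configs above.

```python
import json
import os

# Keys from the 10-calico.conf files above whose values are paths on disk.
FILE_KEYS = ("etcd_key_file", "etcd_cert_file", "etcd_ca_cert_file",
             "k8s_client_key", "k8s_client_certificate")


def referenced_files(node):
    """Recursively collect (key, path) pairs for the file-valued keys."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key in FILE_KEYS and isinstance(value, str):
                found.append((key, value))
            else:
                found.extend(referenced_files(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(referenced_files(item))
    return found


def check_cni_config(text):
    """Parse a CNI network config and return a list of findings as strings."""
    conf = json.loads(text)  # raises ValueError if the JSON is malformed
    findings = ["top-level type: %s" % conf.get("type")]
    delegate = conf.get("delegate")
    if isinstance(delegate, dict):
        findings.append("delegate type: %s" % delegate.get("type"))
    for key, path in referenced_files(conf):
        readable = os.access(path, os.R_OK)
        state = "readable" if readable else "MISSING or unreadable"
        findings.append("%s -> %s (%s)" % (key, path, state))
    return findings
```

Running `check_cni_config(open("/etc/kubernetes/cni/net.d/10-calico.conf").read())` on each node and printing the result shows at a glance whether a path is wrong for that node and which plugin the top-level `type` hands control to.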

I have no idea how to investigate the issue further. I understand that since the new calico-cni was moved to Go, it no longer stores log information in a log file, so I'm lost from here. Any information regarding the issue would be greatly appreciated.

Thanks!

-- ufk
coreos
kubernetes
rkt

1 Answer

10/8/2016

The "Unhandled Exception Killed plugin" error message is being generated by the Calico CNI plugin. From my experience that means it is unlikely to be something wrong with the calico-node.service causing that error.

As such, it is probably something subtly wrong with your CNI network configuration. Could you share that file?

The CNI plugin should also emit more detailed logging information, either to stderr or to /var/log/calico/cni/calico.log, depending on how it's configured in your CNI network config. I suspect that file will give you more clues as to exactly what is going wrong.
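
As a minimal sketch of that check, the snippet below returns the last lines of the plugin log if the file exists. The default path is the one named above; `tail_log` is a hypothetical helper, and whether the file is written at all depends on the plugin version and the `log_level` settings in the network config.

```python
import os

DEFAULT_LOG = "/var/log/calico/cni/calico.log"  # path named in the answer


def tail_log(path=DEFAULT_LOG, lines=20):
    """Return the last `lines` lines of the log, or None if the file is absent."""
    if not os.path.isfile(path):
        return None
    # Replace undecodable bytes rather than failing on a partly written log.
    with open(path, errors="replace") as handle:
        return handle.read().splitlines()[-lines:]
```

If `tail_log()` returns `None` on the node where the pod was scheduled, the plugin is probably not logging to a file, and the kubelet's own log is the next place to look for its stderr output.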

All that said, the "Unhandled Exception" error is coming from the Python version of the CNI plugin, which is rather old at this point. I'd recommend upgrading to the latest stable release from here: https://github.com/projectcalico/calico-cni/releases

-- Casey Davenport
Source: StackOverflow