I'm trying out a Kubernetes 1.4 install with rkt containers on CoreOS beta (1185.1.0).
I have two CoreOS PCs at home, both configured with etcd2 TLS certificates.
I patched the coreos-kubernetes automated generic install script to support etcd2 TLS certificates. The latest versions of the worker and controller install scripts are posted at https://github.com/kfirufk/coreos-kubernetes-multi-node-generic-install-script
I used the following environment variables for the controller CoreOS installation script (IP: 10.79.218.2, domain: coreos-2.tux-in.com):
ADVERTISE_IP=10.79.218.2
ETCD_ENDPOINTS="https://coreos-2.tux-in.com:2379,https://coreos-3.tux-in.com:2379"
K8S_VER=v1.4.1_coreos.0
HYPERKUBE_IMAGE_REPO=quay.io/coreos/hyperkube
POD_NETWORK=10.2.0.0/16
SERVICE_IP_RANGE=10.3.0.0/24
K8S_SERVICE_IP=10.3.0.1
DNS_SERVICE_IP=10.3.0.10
USE_CALICO=true
CONTAINER_RUNTIME=rkt
ETCD_CERT_FILE="/etc/ssl/etcd/etcd1.pem"
ETCD_KEY_FILE="/etc/ssl/etcd/etcd1-key.pem"
ETCD_TRUSTED_CA_FILE="/etc/ssl/etcd/ca.pem"
ETCD_CLIENT_CERT_AUTH=true
OVERWRITE_ALL_FILES=true
CONTROLLER_HOSTNAME="coreos-2.tux-in.com"
ETCD_CERT_ROOT_DIR="/etc/ssl/etcd"
ETCD_SCHEME="https"
ETCD_AUTHORITY="coreos-2.tux-in.com:2379"
IS_MASK_UPDATE_ENGINE=false
And these are the environment variables I used for the worker CoreOS installation script (IP: 10.79.218.3, domain: coreos-3.tux-in.com):
ETCD_AUTHORITY=coreos-3.tux-in.com:2379
ETCD_ENDPOINTS="https://coreos-2.tux-in.com:2379,https://coreos-3.tux-in.com:2379"
CONTROLLER_ENDPOINT=https://coreos-2.tux-in.com
K8S_VER=v1.4.1_coreos.0
HYPERKUBE_IMAGE_REPO=quay.io/coreos/hyperkube
DNS_SERVICE_IP=10.3.0.10
USE_CALICO=true
CONTAINER_RUNTIME=rkt
OVERWRITE_ALL_FILES=true
ADVERTISE_IP=10.79.218.3
ETCD_CERT_FILE="/etc/ssl/etcd/etcd2.pem"
ETCD_KEY_FILE="/etc/ssl/etcd/etcd2-key.pem"
ETCD_TRUSTED_CA_FILE="/etc/ssl/etcd/ca.pem"
ETCD_SCHEME="https"
IS_MASK_UPDATE_ENGINE=false
After installing Kubernetes on both machines and configuring kubectl properly, when I type kubectl get nodes I get:
NAME STATUS AGE
10.79.218.2 Ready,SchedulingDisabled 1h
10.79.218.3 Ready 1h
kubectl get pods --namespace=kube-system returns:
NAME READY STATUS RESTARTS AGE
heapster-v1.2.0-3646253287-j951o 0/2 ContainerCreating 0 1d
kube-apiserver-10.79.218.2 1/1 Running 0 1d
kube-controller-manager-10.79.218.2 1/1 Running 0 1d
kube-dns-v20-u3pd0 0/3 ContainerCreating 0 1d
kube-proxy-10.79.218.2 1/1 Running 0 1d
kube-proxy-10.79.218.3 1/1 Running 0 1d
kube-scheduler-10.79.218.2 1/1 Running 0 1d
kubernetes-dashboard-v1.4.1-ehiez 0/1 ContainerCreating 0 1d
So heapster-v1.2.0-3646253287-j951o, kube-dns-v20-u3pd0 and kubernetes-dashboard-v1.4.1-ehiez are stuck in ContainerCreating status.
When I run kubectl describe on any of them, I get essentially the same error: Error syncing pod, skipping: failed to SyncPod: failed to set up pod network: Unhandled Exception killed plugin.
For example, kubectl describe pods kubernetes-dashboard-v1.4.1-ehiez --namespace kube-system returns:
Name: kubernetes-dashboard-v1.4.1-ehiez
Namespace: kube-system
Node: 10.79.218.3/10.79.218.3
Start Time: Mon, 17 Oct 2016 23:31:43 +0300
Labels: k8s-app=kubernetes-dashboard
kubernetes.io/cluster-service=true
version=v1.4.1
Status: Pending
IP:
Controllers: ReplicationController/kubernetes-dashboard-v1.4.1
Containers:
kubernetes-dashboard:
Container ID:
Image: gcr.io/google_containers/kubernetes-dashboard-amd64:v1.4.1
Image ID:
Port: 9090/TCP
Limits:
cpu: 100m
memory: 50Mi
Requests:
cpu: 100m
memory: 50Mi
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Liveness: http-get http://:9090/ delay=30s timeout=30s period=10s #success=1 #failure=3
Volume Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-svbiv (ro)
Environment Variables: <none>
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
default-token-svbiv:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-svbiv
QoS Class: Guaranteed
Tolerations: CriticalAddonsOnly=:Exists
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
1d 25s 9350 {kubelet 10.79.218.3} Warning FailedSync Error syncing pod, skipping: failed to SyncPod: failed to set up pod network: Unhandled Exception killed plugin
I'm guessing that pod networking isn't working because of a faulty Calico configuration.
So I tried to install the calicoctl rkt container, but had problems with that. That's a different Stack Overflow question, though :) ("starting calicoctl container on coreos")
As a result, I can't really check whether Calico works properly.
This is the calico-network systemd service file for the controller node:
[Unit]
Description=Calico per-host agent
Requires=network-online.target
After=network-online.target
[Service]
Slice=machine.slice
Environment=CALICO_DISABLE_FILE_LOGGING=true
Environment=HOSTNAME=10.79.218.3
Environment=IP=10.79.218.3
Environment=FELIX_FELIXHOSTNAME=10.79.218.3
Environment=CALICO_NETWORKING=true
Environment=NO_DEFAULT_POOLS=true
Environment=ETCD_ENDPOINTS=https://coreos-2.tux-in.com:2379,https://coreos-3.tux-in.com:2379
Environment=ETCD_AUTHORITY=coreos-3.tux-in.com:2379
Environment=ETCD_SCHEME=https
Environment=ETCD_CA_CERT_FILE=/etc/ssl/etcd/ca.pem
Environment=ETCD_CERT_FILE=/etc/ssl/etcd/etcd2.pem
Environment=ETCD_KEY_FILE=/etc/ssl/etcd/etcd2-key.pem
ExecStart=/usr/bin/rkt run --inherit-env --stage1-from-dir=stage1-fly.aci --volume=var-run-calico,kind=host,source=/var/run/calico --volume=modules,kind=host,source=/lib/modules,readOnly=false --mount=volume=modules,target=/lib/modules --volume=dns,kind=host,source=/etc/resolv.conf,readOnly=true --volume=etcd-tls-certs,kind=host,source=/etc/ssl/etcd,readOnly=true --mount=volume=dns,target=/etc/resolv.conf --mount=volume=etcd-tls-certs,target=/etc/ssl/etcd --mount=volume=var-run-calico,target=/var/run/calico --trust-keys-from-https quay.io/calico/node:v0.22.0
KillMode=mixed
Restart=always
TimeoutStartSec=0
[Install]
WantedBy=multi-user.target
And this is the calico-node service file for the worker node:
[Unit]
Description=Calico per-host agent
Requires=network-online.target
After=network-online.target
[Service]
Slice=machine.slice
Environment=CALICO_DISABLE_FILE_LOGGING=true
Environment=HOSTNAME=10.79.218.2
Environment=IP=10.79.218.2
Environment=FELIX_FELIXHOSTNAME=10.79.218.2
Environment=CALICO_NETWORKING=true
Environment=NO_DEFAULT_POOLS=false
Environment=ETCD_ENDPOINTS=https://coreos-2.tux-in.com:2379,https://coreos-3.tux-in.com:2379
ExecStart=/usr/bin/rkt run --inherit-env --stage1-from-dir=stage1-fly.aci --volume=var-run-calico,kind=host,source=/var/run/calico --volume=modules,kind=host,source=/lib/modules,readOnly=false --mount=volume=modules,target=/lib/modules --volume=dns,kind=host,source=/etc/resolv.conf,readOnly=true --volume=etcd-tls-certs,kind=host,source=/etc/ssl/etcd,readOnly=true --mount=volume=dns,target=/etc/resolv.conf --mount=volume=etcd-tls-certs,target=/etc/ssl/etcd --mount=volume=var-run-calico,target=/var/run/calico --trust-keys-from-https quay.io/calico/node:v0.22.0
KillMode=mixed
Environment=ETCD_CA_CERT_FILE=/etc/ssl/etcd/ca.pem
Environment=ETCD_CERT_FILE=/etc/ssl/etcd/etcd1.pem
Environment=ETCD_KEY_FILE=/etc/ssl/etcd/etcd1-key.pem
Restart=always
TimeoutStartSec=0
[Install]
WantedBy=multi-user.target
And this is the content of /etc/kubernetes/cni/net.d/10-calico.conf on the controller node:
{
"name": "calico",
"type": "flannel",
"delegate": {
"type": "calico",
"etcd_endpoints": "https://coreos-2.tux-in.com:2379,https://coreos-3.tux-in.com:2379",
"etcd_key_file": "/etc/ssl/etcd/etcd1-key.pem",
"etcd_cert_file": "/etc/ssl/etcd/etcd1.pem",
"etcd_ca_cert_file": "/etc/ssl/etcd/ca.pem",
"log_level": "none",
"log_level_stderr": "info",
"hostname": "10.79.218.2",
"policy": {
"type": "k8s",
"k8s_api_root": "http://127.0.0.1:8080/api/v1/"
}
}
}
And this is the /etc/kubernetes/cni/net.d/10-calico.conf on the worker node:
{
"name": "calico",
"type": "flannel",
"delegate": {
"type": "calico",
"etcd_endpoints": "https://coreos-2.tux-in.com:2379,https://coreos-3.tux-in.com:2379",
"etcd_key_file": "/etc/ssl/etcd/etcd2-key.pem",
"etcd_cert_file": "/etc/ssl/etcd/etcd2.pem",
"etcd_ca_cert_file": "/etc/ssl/etcd/ca.pem",
"log_level": "debug",
"log_level_stderr": "info",
"hostname": "10.79.218.3",
"policy": {
"type": "k8s",
"k8s_api_root": "https://coreos-2.tux-in.com:443/api/v1/",
"k8s_client_key": "/etc/kubernetes/ssl/worker-key.pem",
"k8s_client_certificate": "/etc/kubernetes/ssl/worker.pem"
}
}
}
I have no idea how to investigate the issue further. I understand that since the new calico-cni was moved to Go, it no longer stores log information in a log file, so I'm lost from here. Any information regarding the issue would be greatly appreciated.
Thanks!
The "Unhandled Exception Killed plugin" error message is being generated by the Calico CNI plugin. From my experience that means it is unlikely to be something wrong with the calico-node.service causing that error.
As such, it is probably something subtly wrong with your CNI network configuration. Could you share that file?
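One quick sanity check before digging deeper is to confirm that the config parses as JSON and that every certificate or key path it names actually exists. Here is a minimal Python sketch of that idea. The function names are made up for illustration, and the list of path-valued keys is taken only from the configs posted in the question, so treat it as an assumption rather than a complete list of what the plugin reads. Also note it checks paths as seen by whatever runs the script, which may differ from what the plugin sees if the kubelet runs inside a container.

```python
import json
import os

# Keys that hold file paths in the configs posted in the question.
# This list is an assumption for illustration, not an official schema.
PATH_KEYS = (
    "etcd_key_file",
    "etcd_cert_file",
    "etcd_ca_cert_file",
    "k8s_client_key",
    "k8s_client_certificate",
)


def find_path_settings(node, prefix=""):
    """Yield (dotted_key, path) for each known path-valued key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            dotted = prefix + "." + key if prefix else key
            if key in PATH_KEYS and isinstance(value, str):
                yield dotted, value
            else:
                # Recurse into nested sections such as "delegate" and "policy".
                yield from find_path_settings(value, dotted)


def check_cni_config(text, exists=os.path.exists):
    """Return a list of readable problems; an empty list means none were found."""
    try:
        config = json.loads(text)
    except ValueError as err:
        return ["not valid JSON: %s" % err]
    return [
        "%s points at missing file %s" % (key, path)
        for key, path in find_path_settings(config)
        if not exists(path)
    ]
```

To use it, read /etc/kubernetes/cni/net.d/10-calico.conf on each node, pass the contents to check_cni_config, and print whatever it returns. An empty result only rules out the obvious mistakes; it does not prove the certificates themselves are valid.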
The CNI plugin should also emit more detailed logging information, either to stderr or to /var/log/calico/cni/calico.log, depending on how it's configured in your CNI network config. I suspect that output will give you more clues about exactly what is going wrong.
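To see where the plugin has been told to log, you can pull the two logging keys out of the config and check whether the log file is there. The sketch below is only a rough illustration: the helper names are hypothetical, and it assumes the Calico settings live under "delegate" as in the configs posted in the question, falling back to the top level otherwise.

```python
import json
import os

# Log path mentioned above; adjust if your setup writes elsewhere.
CNI_LOG_PATH = "/var/log/calico/cni/calico.log"


def logging_settings(text):
    """Return the log_level and log_level_stderr values from a CNI config."""
    config = json.loads(text)
    # Assumption: Calico's settings sit under "delegate" when one is present.
    section = config.get("delegate", config)
    return {key: section.get(key) for key in ("log_level", "log_level_stderr")}


def describe_logging(text, log_path=CNI_LOG_PATH, exists=os.path.exists):
    """Return short lines summarising the logging setup for one config."""
    settings = logging_settings(text)
    lines = ["%s = %s" % (key, value) for key, value in settings.items()]
    state = "exists" if exists(log_path) else "not found"
    lines.append("%s %s" % (log_path, state))
    return lines
```

If a node reports a log_level of "none" and no log file, raising that value to "debug" on that node is a reasonable next step before reproducing the failure.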
All that said, the "Unhandled Exception" error is coming from the Python version of the CNI plugin, which is rather old at this point. I'd recommend upgrading to the latest stable release from here: https://github.com/projectcalico/calico-cni/releases