Issue upgrading calico-node in kubeadm cluster

1/20/2019

I am going to upgrade calico/node and the Calico CNI plugin as per this link, under "Upgrading Components Individually".

The directions are very clear (I will cordon each node and then do the steps for calico/cni and calico/node), but I am not sure what is meant by

Update the image in your process management to reference the new version

with respect to upgrading the calico/node container.

Otherwise, I see no issues with the directions. Our environment is a kubeadm-built Kubernetes cluster.

I suppose the real question is: where do I tell k8s to use the newer version of the calico/node image?
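My guess is that, for a manifest-based install like ours, the "process management" for calico/node is just the calico-node DaemonSet, so bumping the image in place would look something like the sketch below (the DaemonSet/container names are the ones from the stock calico.yaml, and the tag is whatever release is being targeted):

# assumes the stock manifest naming: DaemonSet "calico-node" in kube-system with a
# container also named "calico-node"; adjust the image tag to the target release
kubectl -n kube-system set image daemonset/calico-node calico-node=calico/node:v3.3.2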

EDIT

To answer the above:

I just did a kubectl delete -f on both calico.yaml and rbac-kdd.yaml and then did a kubectl create -f on the newest version of these files.
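Concretely, that was along these lines (same file names as in the original install; kubectl apply -f on the new manifests would have updated things in place as well):

# delete the old Calico objects, then recreate them from the newer manifests
kubectl delete -f rbac-kdd.yaml
kubectl delete -f calico.yaml
kubectl create -f rbac-kdd.yaml
kubectl create -f calico.yaml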

Everything now appears to be at version 3.3.2, but I am getting this error on all the calico-node pods:

Warning Unhealthy 84s (x181 over 31m) kubelet, thalia4 Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with <node IP addresses here>

I ran calicoctl node status and got

Calico process is running.

IPv4 BGP status
+---------------+-------------------+-------+----------+--------------------------------+
| PEER ADDRESS  |     PEER TYPE     | STATE |  SINCE   |              INFO              |
+---------------+-------------------+-------+----------+--------------------------------+
| 134.x.x.163   | node-to-node mesh | start | 02:36:29 | Connect                        |
| 134.x.x.164   | node-to-node mesh | start | 02:36:29 | Connect                        |
| 134.x.x.165   | node-to-node mesh | start | 02:36:29 | Connect                        |
| 134.x.x.168   | node-to-node mesh | start | 02:36:29 | Active Socket: Host is         |
|               |                   |       |          | unreachable                    |
+---------------+-------------------+-------+----------+--------------------------------+

IPv6 BGP status
No IPv6 peers found.

I would assume 134.x.x.168 being unreachable is why I am getting the above health check warning.
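As a quick sanity check, something like this run from one of the other nodes should show whether the BGP port on that host is reachable at all:

# 179/tcp is the port BIRD uses for BGP peering; the IP is the unreachable node
nc -zv 134.x.x.168 179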

Not exactly sure what to do though. This node is available in the k8s cluster (this is node thalia4):

[gms@thalia0 calico]$ kubectl get nodes
NAME                  STATUS   ROLES    AGE   VERSION
thalia0               Ready    master   87d   v1.13.1
thalia1               Ready    <none>   48d   v1.13.1
thalia2               Ready    <none>   30d   v1.13.1
thalia3               Ready    <none>   87d   v1.13.1
thalia4               Ready    <none>   48d   v1.13.1

EDIT 2

calicoctl node status on thalia4 gave

[sudo] password for gms:
Calico process is running.

IPv4 BGP status
+---------------+-------------------+-------+----------+---------+
| PEER ADDRESS  |     PEER TYPE     | STATE |  SINCE   |  INFO   |
+---------------+-------------------+-------+----------+---------+
| 134.xx.xx.162 | node-to-node mesh | start | 02:36:29 | Connect |
| 134.xx.xx.163 | node-to-node mesh | start | 02:36:29 | Connect |
| 134.xx.xx.164 | node-to-node mesh | start | 02:36:29 | Connect |
| 134.xx.xx.165 | node-to-node mesh | start | 02:36:29 | Connect |
+---------------+-------------------+-------+----------+---------+

while kubectl describe node thalia4 gave

Name:               thalia4.domain
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    dns=dns4
                    kubernetes.io/hostname=thalia4
                    node_name=thalia4
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 134.xx.xx.168/26
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 03 Dec 2018 14:17:07 -0600
Taints:             <none>
Unschedulable:      false
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------    -----------------                 ------------------                ------                       -------
  OutOfDisk        Unknown   Fri, 21 Dec 2018 11:58:38 -0600   Sat, 12 Jan 2019 16:44:10 -0600   NodeStatusUnknown            Kubelet stopped posting node status.
  MemoryPressure   False     Mon, 21 Jan 2019 20:54:38 -0600   Sat, 12 Jan 2019 16:50:18 -0600   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False     Mon, 21 Jan 2019 20:54:38 -0600   Sat, 12 Jan 2019 16:50:18 -0600   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False     Mon, 21 Jan 2019 20:54:38 -0600   Sat, 12 Jan 2019 16:50:18 -0600   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True      Mon, 21 Jan 2019 20:54:38 -0600   Sun, 20 Jan 2019 20:27:10 -0600   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  134.xx.xx.168
  Hostname:    thalia4
Capacity:
 cpu:                4
 ephemeral-storage:  6878Mi
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             8009268Ki
 pods:               110
Allocatable:
 cpu:                4
 ephemeral-storage:  6490895145
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             7906868Ki
 pods:               110
System Info:
 Machine ID:                 c011569a40b740a88a672a5cc526b3ba
 System UUID:                42093037-F27E-CA90-01E1-3B253813B904
 Boot ID:                    ffa5170e-da2b-4c09-bd8a-032ce9fca2ee
 Kernel Version:             3.10.0-957.1.3.el7.x86_64
 OS Image:                   Red Hat Enterprise Linux
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://1.13.1
 Kubelet Version:            v1.13.1
 Kube-Proxy Version:         v1.13.1
PodCIDR:                     192.168.4.0/24
Non-terminated Pods:         (3 in total)
  Namespace                  Name                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                        ------------  ----------  ---------------  -------------  ---
  kube-system                calico-node-8xqbs           250m (6%)     0 (0%)      0 (0%)           0 (0%)         24h
  kube-system                coredns-786f4c87c8-sbks2    100m (2%)     0 (0%)      70Mi (0%)        170Mi (2%)     47h
  kube-system                kube-proxy-zp4fk            0 (0%)        0 (0%)      0 (0%)           0 (0%)         31d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                350m (8%)  0 (0%)
  memory             70Mi (0%)  170Mi (2%)
  ephemeral-storage  0 (0%)     0 (0%)
Events:              <none>

I'm thinking this is a firewall problem, but I was told on the Slack channel that "If you're not using host endpoints then we don't mess with your host's connectivity. It sounds like you've got something blocking port 179 on that host."

Not sure where that would be, though; the iptables rules look the same across all nodes.
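For reference, this is roughly how I have been comparing the rules across nodes (cali-failsafe-in is the chain Calico itself programs for its failsafe ports):

# dump any rules mentioning BGP port 179 or Calico's failsafe chain
sudo iptables-save | grep -E '179|cali-failsafe'
# or look at the failsafe chain directly, with packet counters
sudo iptables -L cali-failsafe-in -n -v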

-- horcle_buzz
docker
kubeadm
kubernetes
project-calico

2 Answers

1/21/2019

--network-plugin=cni tells the kubelet to use the CNI network plugin, with the actual CNI plugin binaries located in --cni-bin-dir (default /opt/cni/bin) and the CNI plugin configuration located in --cni-conf-dir (default /etc/cni/net.d).

For example:

--network-plugin=cni

--cni-bin-dir=/opt/cni/bin # there may be several CNI binaries here (calico, weave, ...); you can run '/opt/cni/bin/calico -v' to show the Calico version

--cni-conf-dir=/etc/cni/net.d # holds the detailed CNI plugin configuration, for example:

{
  "name": "calico-network",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "calico",
      "mtu": 8950,
      "policy": {
        "type": "k8s"
      },
      "ipam": {
        "type": "calico-ipam",
        "assign_ipv6": "false",
        "assign_ipv4": "true"
      },
      "etcd_endpoints": "https://172.16.1.5:2379,https://172.16.1.9:2379,https://172.16.1.15:2379",
      "etcd_key_file": "/etc/etcd/ssl/etcd-client-key.pem",
      "etcd_cert_file": "/etc/etcd/ssl/etcd-client.pem",
      "etcd_ca_cert_file": "/etc/etcd/ssl/ca.pem",
      "kubernetes": {
        "kubeconfig": "/etc/kubernetes/cluster-admin.kubeconfig"
      }
    }
  ]
}
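On a kubeadm node these flags normally come from the kubelet systemd unit and its drop-ins rather than being set by hand, so you can check what the running kubelet was actually started with, e.g.:

# show the kubelet unit plus any drop-ins (kubeadm keeps its extra args in an env file referenced there)
systemctl cat kubelet
# or inspect the live command line
ps -ef | grep [k]ubelet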
-- baozhenli
Source: StackOverflow

1/22/2019

I figured out the issue. I had to add an explicit rule to iptables for the cali-failsafe-in chain on all nodes:

sudo iptables -A cali-failsafe-in -p tcp --match multiport --dport 179 -j ACCEPT
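To confirm the rule actually landed in the chain on a given node:

# the new ACCEPT rule for port 179 should show up in the failsafe chain
sudo iptables -L cali-failsafe-in -n --line-numbers | grep 179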

Now, everything appears to be functional across all nodes:

IPv4 BGP status
+---------------+-------------------+-------+----------+-------------+
| PEER ADDRESS  |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+---------------+-------------------+-------+----------+-------------+
| 134.xx.xx.163 | node-to-node mesh | up    | 19:33:58 | Established |
| 134.xx.xx.164 | node-to-node mesh | up    | 19:33:40 | Established |
| 134.xx.xx.165 | node-to-node mesh | up    | 19:35:07 | Established |
| 134.xx.xx.168 | node-to-node mesh | up    | 19:35:01 | Established |
+---------------+-------------------+-------+----------+-------------+
-- horcle_buzz
Source: StackOverflow