Updating a Pod's containers in a running system without disrupting the application

9/3/2021

I have set up a 5-node AWS cluster using Kubernetes and kops. A FaaS application runs in the cluster with a KVS (key-value store) as the backend. For testing purposes, I updated a container image in the function-nodes-5p6fs pod (listed on the first line of the `kubectl get all` output below), which is managed by the DaemonSet function-nodes.

The scheduler pod uses this function-node pod to schedule function execution on the function-nodes DaemonSet.

Details about the function-node pod:

ubuntu@ip-172-31-22-220:~/hydro-project/cluster$ kubectl get pod/function-nodes-5swwv -o yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2021-09-07T21:20:45Z"
  generateName: function-nodes-
  labels:
    controller-revision-hash: 859745cbc
    pod-template-generation: "1"
    role: function
  name: function-nodes-5swwv
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: function-nodes
    uid: 1e1e3ebd-c4ce-41f2-9dec-d268ca2cc693
  resourceVersion: "3492"
  uid: 93181869-6e8a-4071-97dd-b10f5c66130e
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - ip-172-20-60-197.ec2.internal
  containers:
  - env:
    - name: ROUTE_ADDR
      value: a1710e6d6c58c4eae861335cae02dc66-1996401780.us-east-1.elb.amazonaws.com
    - name: MGMT_IP
      value: 100.96.1.5
    - name: SCHED_IPS
      value: 172.20.32.73
    - name: THREAD_ID
      value: "0"
    - name: ROLE
      value: executor
    - name: REPO_ORG
      value: hydro-project
    - name: REPO_BRANCH
      value: master
    - name: ANNA_REPO_ORG
      value: hydro-project
    - name: ANNA_REPO_BRANCH
      value: master
    image: akazad1/srlcloudburst:v3
    imagePullPolicy: Always
    name: function-1
    resources:
      limits:
        cpu: "2"
        memory: 2G
      requests:
        cpu: "2"
        memory: 2G
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /requests
      name: ipc
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-kfzth
      readOnly: true
  - env:
    - name: ROUTE_ADDR
      value: a1710e6d6c58c4eae861335cae02dc66-1996401780.us-east-1.elb.amazonaws.com
    - name: MGMT_IP
      value: 100.96.1.5
    - name: SCHED_IPS
      value: 172.20.32.73
    - name: THREAD_ID
      value: "1"
    - name: ROLE
      value: executor
    - name: REPO_ORG
      value: hydro-project
    - name: REPO_BRANCH
      value: master
    - name: ANNA_REPO_ORG
      value: hydro-project
    - name: ANNA_REPO_BRANCH
      value: master
    image: akazad1/srlcloudburst:v3
    imagePullPolicy: Always
    name: function-2
    resources:
      limits:
        cpu: "2"
        memory: 2G
      requests:
        cpu: "2"
        memory: 2G
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /requests
      name: ipc
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-kfzth
      readOnly: true
  - env:
    - name: ROUTE_ADDR
      value: a1710e6d6c58c4eae861335cae02dc66-1996401780.us-east-1.elb.amazonaws.com
    - name: MGMT_IP
      value: 100.96.1.5
    - name: SCHED_IPS
      value: 172.20.32.73
    - name: THREAD_ID
      value: "2"
    - name: ROLE
      value: executor
    - name: REPO_ORG
      value: hydro-project
    - name: REPO_BRANCH
      value: master
    - name: ANNA_REPO_ORG
      value: hydro-project
    - name: ANNA_REPO_BRANCH
      value: master
    image: akazad1/srlcloudburst:v3
    imagePullPolicy: Always
    name: function-3
    resources:
      limits:
        cpu: "2"
        memory: 2G
      requests:
        cpu: "2"
        memory: 2G
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /requests
      name: ipc
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-kfzth
      readOnly: true
  - env:
    - name: ROUTE_ADDR
      value: a1710e6d6c58c4eae861335cae02dc66-1996401780.us-east-1.elb.amazonaws.com
    - name: MGMT_IP
      value: 100.96.1.5
    - name: REPO_ORG
      value: hydro-project
    - name: REPO_BRANCH
      value: master
    image: hydroproject/anna-cache
    imagePullPolicy: Always
    name: cache-container
    resources:
      limits:
        cpu: "1"
        memory: 8G
      requests:
        cpu: "1"
        memory: 8G
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /requests
      name: ipc
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-kfzth
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostIPC: true
  hostNetwork: true
  nodeName: ip-172-20-60-197.ec2.internal
  nodeSelector:
    role: function
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/network-unavailable
    operator: Exists
  volumes:
  - hostPath:
      path: /tmp
      type: ""
    name: ipc
  - name: kube-api-access-kfzth
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-09-07T21:20:45Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-09-07T21:21:53Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-09-07T21:21:53Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-09-07T21:20:45Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://742b944e744fcd951c21f6e47a4bdaafacc90d2c0ce0d8e172b62429172bceaf
    image: docker.io/hydroproject/anna-cache:latest
    imageID: docker.io/hydroproject/anna-cache@sha256:50a5aac7fd6b742bdeeedef855f48c6307aae688987d86f680d1bbdb57050d8b
    lastState: {}
    name: cache-container
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2021-09-07T21:21:53Z"
  - containerID: containerd://62279440c50bad86386acecdd8a0d406282cfe25646c46eb3f2b2004a662ee3b
    image: docker.io/akazad1/srlcloudburst:v3
    imageID: docker.io/akazad1/srlcloudburst@sha256:4ef979d9202e519203cca186354f60a5c0ee3d47ed873fca5f1602549bf14bfa
    lastState: {}
    name: function-1
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2021-09-07T21:21:48Z"
  - containerID: containerd://56a7263acac5a3ed291aaf2d77cce4a9490c87710afed76857aedcc15d5b2dc5
    image: docker.io/akazad1/srlcloudburst:v3
    imageID: docker.io/akazad1/srlcloudburst@sha256:4ef979d9202e519203cca186354f60a5c0ee3d47ed873fca5f1602549bf14bfa
    lastState: {}
    name: function-2
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2021-09-07T21:21:49Z"
  - containerID: containerd://49e50972d5cb059b29e1130e78613b89827239450f14b46bad633a897b7d3e6f
    image: docker.io/akazad1/srlcloudburst:v3
    imageID: docker.io/akazad1/srlcloudburst@sha256:4ef979d9202e519203cca186354f60a5c0ee3d47ed873fca5f1602549bf14bfa
    lastState: {}
    name: function-3
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2021-09-07T21:21:49Z"
  hostIP: 172.20.60.197
  phase: Running
  podIP: 172.20.60.197
  podIPs:
  - ip: 172.20.60.197
  qosClass: Guaranteed
  startTime: "2021-09-07T21:20:45Z"

The cluster

ubuntu@ip-172-31-22-220:/$ kubectl get all -o wide
NAME                        READY   STATUS    RESTARTS   AGE     IP              NODE                            NOMINATED NODE   READINESS GATES
pod/function-nodes-5p6fs    4/4     Running   0          79m     172.20.56.188   ip-172-20-56-188.ec2.internal   <none>           <none>
pod/management-pod          1/1     Running   0          4h21m   100.96.1.4      ip-172-20-46-4.ec2.internal     <none>           <none>
pod/memory-nodes-mtlxh      1/1     Running   0          4h14m   172.20.61.87    ip-172-20-61-87.ec2.internal    <none>           <none>
pod/monitoring-pod          1/1     Running   0          4h20m   100.96.1.6      ip-172-20-46-4.ec2.internal     <none>           <none>
pod/routing-nodes-kl8wb     1/1     Running   0          4h18m   172.20.46.83    ip-172-20-46-83.ec2.internal    <none>           <none>
pod/scheduler-nodes-q8std   1/1     Running   0          4h11m   172.20.59.122   ip-172-20-59-122.ec2.internal   <none>           <none>

NAME                       TYPE           CLUSTER-IP       EXTERNAL-IP                                                               PORT(S)                                                                                                    AGE     SELECTOR
service/function-service   LoadBalancer   100.65.229.234   ab696981b80b84918a30ed81201726b6-371417546.us-east-1.elb.amazonaws.com    5000:32189/TCP,5001:30872/TCP,5002:31765/TCP,5003:30711/TCP,5004:32544/TCP,5005:31007/TCP,5006:32097/TCP   4h7m    role=scheduler
service/kubernetes         ClusterIP      100.64.0.1       <none>                                                                    443/TCP                                                                                                    4h25m   <none>
service/routing-service    LoadBalancer   100.68.27.23     af4491484277a42388857d471f4bb220-1539998664.us-east-1.elb.amazonaws.com   6450:32127/TCP,6451:31251/TCP,6452:32116/TCP,6453:31126/TCP                                                4h12m   role=routing

I used the following command to update the container image:

kubectl set image ds/function-nodes container-name=image-name
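For reference, this is the form I would expect the command to take with the actual container names from the pod spec above (the :latest tag is illustrative, not what I necessarily used): pairs are separated by spaces, and a stray trailing comma would make the image reference invalid.

```shell
# Update all three executor containers in one command; the
# container=image pairs are space-separated, not comma-separated.
kubectl set image ds/function-nodes \
  function-1=hydroproject/cloudburst:latest \
  function-2=hydroproject/cloudburst:latest \
  function-3=hydroproject/cloudburst:latest

# Watch the DaemonSet roll the change out.
kubectl rollout status ds/function-nodes
```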

After updating the three container images in the function-nodes pod, I get the following errors:

$ kubectl describe pod function-nodes-vg5pr
......
.....
Events:
  Type     Reason         Age                  From     Message
  ----     ------         ----                 ----     -------
  Normal   Killing        39m                  kubelet  Container function-3 definition changed, will be restarted
  Normal   Killing        39m                  kubelet  Container function-1 definition changed, will be restarted
  Normal   Killing        38m                  kubelet  Container function-2 definition changed, will be restarted
  Normal   Pulling        38m                  kubelet  Pulling image "hydroproject/cloudburst"
  Normal   Created        38m (x2 over 3h18m)  kubelet  Created container function-3
  Normal   Started        38m (x2 over 3h18m)  kubelet  Started container function-3
  Normal   Pulled         38m                  kubelet  Successfully pulled image "hydroproject/cloudburst" in 20.006668839s
  Warning  InspectFailed  37m (x4 over 38m)    kubelet  Failed to apply default image tag "hydroproject/cloudburst,": couldn't parse image reference "hydroproject/cloudburst,": invalid reference format
  Warning  Failed         37m (x4 over 38m)    kubelet  Error: InvalidImageName
  Warning  InspectFailed  37m (x5 over 38m)    kubelet  Failed to apply default image tag "hydroproject/cloudburst,": couldn't parse image reference "hydroproject/cloudburst,": invalid reference format
  Warning  Failed         37m (x5 over 38m)    kubelet  Error: InvalidImageName
  Warning  BackOff        34m (x12 over 37m)   kubelet  Back-off restarting failed container
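The image reference quoted in the events ends with a comma ("hydroproject/cloudburst,"), which is what the kubelet refuses to parse. One way to confirm what image strings the DaemonSet template actually ended up with (a sketch; the jsonpath just lists each container's name and image):

```shell
# Print name<TAB>image for every container in the DaemonSet's pod
# template; a trailing comma here would explain the InvalidImageName
# events above.
kubectl get ds function-nodes \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'
```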

As a result, the scheduler pod cannot find the function pods to schedule function execution:

root@ip-172-20-38-118:/hydro/cloudburst# python3 run_benchmark.py 
Traceback (most recent call last):
  File "run_benchmark.py", line 22, in <module>
    from cloudburst.server.benchmarks import (
ImportError: cannot import name 'retwis_benchmark'
root@ip-172-20-38-118:/hydro/cloudburst# vi run_benchmark.py 
root@ip-172-20-38-118:/hydro/cloudburst# python3 run_benchmark.py 
Usage: ./run_benchmark.py benchmark_name function_elb num_requests {ip}
root@ip-172-20-38-118:/hydro/cloudburst# python3 run_benchmark.py  locality a94718831527b4048b43c7817b5d1212-1314702864.us-east-1.elb.amazonaws.com 1 172.20.38.118
INFO:root:Successfully registered the dot function.
INFO:root:Successfully tested function!
ERROR:root:Scheduler returned unexpected error: 
error: NO_RESOURCES

Traceback (most recent call last):
  File "run_benchmark.py", line 59, in <module>
    False, None)
  File "/hydro/cloudburst/cloudburst/server/benchmarks/locality.py", line 134, in run
    cloudburst_client.call_dag(dag_name, arg_map, True)
  File "/hydro/cloudburst/cloudburst/client/client.py", line 283, in call_dag
    raise RuntimeError(str(r.error))
RuntimeError: 5

Can anyone give me some pointers on resolving this issue, i.e., how to update a pod's container images without disrupting the whole running system?
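To frame what I am after: ideally something like the standard rollout workflow, where a bad update can be rolled back in place. A sketch, assuming the DaemonSet uses the RollingUpdate strategy and keeps revision history:

```shell
# List the revisions recorded for the DaemonSet.
kubectl rollout history ds/function-nodes

# Roll the DaemonSet back to its previous template if an update
# breaks the pods.
kubectl rollout undo ds/function-nodes
```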

Thanks in advance!

-- Azad Md Abul Kalam
amazon-web-services
kubernetes
