I have set up an AWS cluster with 5 nodes using Kubernetes and kops. A FaaS application runs in the cluster with a KVS (key-value store) backend. For testing purposes, I updated a container image on the function-nodes-5p6fs pod (the first pod in the kubectl get all output below), which belongs to the function-nodes DaemonSet.
The scheduler pod uses this function-node pod to schedule function execution on the DaemonSet.
ubuntu@ip-172-31-22-220:~/hydro-project/cluster$ kubectl get pod/function-nodes-5swwv -o yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2021-09-07T21:20:45Z"
  generateName: function-nodes-
  labels:
    controller-revision-hash: 859745cbc
    pod-template-generation: "1"
    role: function
  name: function-nodes-5swwv
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: function-nodes
    uid: 1e1e3ebd-c4ce-41f2-9dec-d268ca2cc693
  resourceVersion: "3492"
  uid: 93181869-6e8a-4071-97dd-b10f5c66130e
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - ip-172-20-60-197.ec2.internal
  containers:
  - env:
    - name: ROUTE_ADDR
      value: a1710e6d6c58c4eae861335cae02dc66-1996401780.us-east-1.elb.amazonaws.com
    - name: MGMT_IP
      value: 100.96.1.5
    - name: SCHED_IPS
      value: 172.20.32.73
    - name: THREAD_ID
      value: "0"
    - name: ROLE
      value: executor
    - name: REPO_ORG
      value: hydro-project
    - name: REPO_BRANCH
      value: master
    - name: ANNA_REPO_ORG
      value: hydro-project
    - name: ANNA_REPO_BRANCH
      value: master
    image: akazad1/srlcloudburst:v3
    imagePullPolicy: Always
    name: function-1
    resources:
      limits:
        cpu: "2"
        memory: 2G
      requests:
        cpu: "2"
        memory: 2G
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /requests
      name: ipc
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-kfzth
      readOnly: true
  - env:
    - name: ROUTE_ADDR
      value: a1710e6d6c58c4eae861335cae02dc66-1996401780.us-east-1.elb.amazonaws.com
    - name: MGMT_IP
      value: 100.96.1.5
    - name: SCHED_IPS
      value: 172.20.32.73
    - name: THREAD_ID
      value: "1"
    - name: ROLE
      value: executor
    - name: REPO_ORG
      value: hydro-project
    - name: REPO_BRANCH
      value: master
    - name: ANNA_REPO_ORG
      value: hydro-project
    - name: ANNA_REPO_BRANCH
      value: master
    image: akazad1/srlcloudburst:v3
    imagePullPolicy: Always
    name: function-2
    resources:
      limits:
        cpu: "2"
        memory: 2G
      requests:
        cpu: "2"
        memory: 2G
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /requests
      name: ipc
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-kfzth
      readOnly: true
  - env:
    - name: ROUTE_ADDR
      value: a1710e6d6c58c4eae861335cae02dc66-1996401780.us-east-1.elb.amazonaws.com
    - name: MGMT_IP
      value: 100.96.1.5
    - name: SCHED_IPS
      value: 172.20.32.73
    - name: THREAD_ID
      value: "2"
    - name: ROLE
      value: executor
    - name: REPO_ORG
      value: hydro-project
    - name: REPO_BRANCH
      value: master
    - name: ANNA_REPO_ORG
      value: hydro-project
    - name: ANNA_REPO_BRANCH
      value: master
    image: akazad1/srlcloudburst:v3
    imagePullPolicy: Always
    name: function-3
    resources:
      limits:
        cpu: "2"
        memory: 2G
      requests:
        cpu: "2"
        memory: 2G
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /requests
      name: ipc
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-kfzth
      readOnly: true
  - env:
    - name: ROUTE_ADDR
      value: a1710e6d6c58c4eae861335cae02dc66-1996401780.us-east-1.elb.amazonaws.com
    - name: MGMT_IP
      value: 100.96.1.5
    - name: REPO_ORG
      value: hydro-project
    - name: REPO_BRANCH
      value: master
    image: hydroproject/anna-cache
    imagePullPolicy: Always
    name: cache-container
    resources:
      limits:
        cpu: "1"
        memory: 8G
      requests:
        cpu: "1"
        memory: 8G
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /requests
      name: ipc
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-kfzth
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostIPC: true
  hostNetwork: true
  nodeName: ip-172-20-60-197.ec2.internal
  nodeSelector:
    role: function
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/network-unavailable
    operator: Exists
  volumes:
  - hostPath:
      path: /tmp
      type: ""
    name: ipc
  - name: kube-api-access-kfzth
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-09-07T21:20:45Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-09-07T21:21:53Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-09-07T21:21:53Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-09-07T21:20:45Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://742b944e744fcd951c21f6e47a4bdaafacc90d2c0ce0d8e172b62429172bceaf
    image: docker.io/hydroproject/anna-cache:latest
    imageID: docker.io/hydroproject/anna-cache@sha256:50a5aac7fd6b742bdeeedef855f48c6307aae688987d86f680d1bbdb57050d8b
    lastState: {}
    name: cache-container
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2021-09-07T21:21:53Z"
  - containerID: containerd://62279440c50bad86386acecdd8a0d406282cfe25646c46eb3f2b2004a662ee3b
    image: docker.io/akazad1/srlcloudburst:v3
    imageID: docker.io/akazad1/srlcloudburst@sha256:4ef979d9202e519203cca186354f60a5c0ee3d47ed873fca5f1602549bf14bfa
    lastState: {}
    name: function-1
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2021-09-07T21:21:48Z"
  - containerID: containerd://56a7263acac5a3ed291aaf2d77cce4a9490c87710afed76857aedcc15d5b2dc5
    image: docker.io/akazad1/srlcloudburst:v3
    imageID: docker.io/akazad1/srlcloudburst@sha256:4ef979d9202e519203cca186354f60a5c0ee3d47ed873fca5f1602549bf14bfa
    lastState: {}
    name: function-2
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2021-09-07T21:21:49Z"
  - containerID: containerd://49e50972d5cb059b29e1130e78613b89827239450f14b46bad633a897b7d3e6f
    image: docker.io/akazad1/srlcloudburst:v3
    imageID: docker.io/akazad1/srlcloudburst@sha256:4ef979d9202e519203cca186354f60a5c0ee3d47ed873fca5f1602549bf14bfa
    lastState: {}
    name: function-3
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2021-09-07T21:21:49Z"
  hostIP: 172.20.60.197
  phase: Running
  podIP: 172.20.60.197
  podIPs:
  - ip: 172.20.60.197
  qosClass: Guaranteed
  startTime: "2021-09-07T21:20:45Z"
ubuntu@ip-172-31-22-220:/$ kubectl get all -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/function-nodes-5p6fs 4/4 Running 0 79m 172.20.56.188 ip-172-20-56-188.ec2.internal <none> <none>
pod/management-pod 1/1 Running 0 4h21m 100.96.1.4 ip-172-20-46-4.ec2.internal <none> <none>
pod/memory-nodes-mtlxh 1/1 Running 0 4h14m 172.20.61.87 ip-172-20-61-87.ec2.internal <none> <none>
pod/monitoring-pod 1/1 Running 0 4h20m 100.96.1.6 ip-172-20-46-4.ec2.internal <none> <none>
pod/routing-nodes-kl8wb 1/1 Running 0 4h18m 172.20.46.83 ip-172-20-46-83.ec2.internal <none> <none>
pod/scheduler-nodes-q8std 1/1 Running 0 4h11m 172.20.59.122 ip-172-20-59-122.ec2.internal <none> <none>
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
service/function-service LoadBalancer 100.65.229.234 ab696981b80b84918a30ed81201726b6-371417546.us-east-1.elb.amazonaws.com 5000:32189/TCP,5001:30872/TCP,5002:31765/TCP,5003:30711/TCP,5004:32544/TCP,5005:31007/TCP,5006:32097/TCP 4h7m role=scheduler
service/kubernetes ClusterIP 100.64.0.1 <none> 443/TCP 4h25m <none>
service/routing-service LoadBalancer 100.68.27.23 af4491484277a42388857d471f4bb220-1539998664.us-east-1.elb.amazonaws.com 6450:32127/TCP,6451:31251/TCP,6452:32116/TCP,6453:31126/TCP 4h12m role=routing
I used the following command to update the container image:
kubectl set image ds/function-nodes container-name=image-name
After updating the three container images in the function-node pod, I get the following error:
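For reference, `kubectl set image` takes multiple container=image pairs separated by spaces, not commas. A sketch using the container names from the pod spec above and the `hydroproject/cloudburst` image that appears in the events below (the exact tag is an assumption; adjust as needed):

```shell
# Update all three function containers of the DaemonSet in one call.
# Pairs are space-separated; a stray comma becomes part of the image
# reference and fails to parse ("invalid reference format").
kubectl set image ds/function-nodes \
  function-1=hydroproject/cloudburst \
  function-2=hydroproject/cloudburst \
  function-3=hydroproject/cloudburst

# Watch the DaemonSet roll the new image out across the nodes.
kubectl rollout status ds/function-nodes
```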
$ kubectl describe pod function-nodes-vg5pr
......
.....
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Killing 39m kubelet Container function-3 definition changed, will be restarted
Normal Killing 39m kubelet Container function-1 definition changed, will be restarted
Normal Killing 38m kubelet Container function-2 definition changed, will be restarted
Normal Pulling 38m kubelet Pulling image "hydroproject/cloudburst"
Normal Created 38m (x2 over 3h18m) kubelet Created container function-3
Normal Started 38m (x2 over 3h18m) kubelet Started container function-3
Normal Pulled 38m kubelet Successfully pulled image "hydroproject/cloudburst" in 20.006668839s
Warning InspectFailed 37m (x4 over 38m) kubelet Failed to apply default image tag "hydroproject/cloudburst,": couldn't parse image reference "hydroproject/cloudburst,": invalid reference format
Warning Failed 37m (x4 over 38m) kubelet Error: InvalidImageName
Warning InspectFailed 37m (x5 over 38m) kubelet Failed to apply default image tag "hydroproject/cloudburst,": couldn't parse image reference "hydroproject/cloudburst,": invalid reference format
Warning Failed 37m (x5 over 38m) kubelet Error: InvalidImageName
Warning BackOff 34m (x12 over 37m) kubelet Back-off restarting failed container
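Note that the InspectFailed message shows the kubelet received an image reference with a trailing comma ("hydroproject/cloudburst,"), which suggests a comma slipped into the `kubectl set image` arguments. One way to inspect what image each container in the DaemonSet spec is currently set to (a sketch; jsonpath output formatting can vary slightly by kubectl version):

```shell
# Print each container name and its configured image from the DaemonSet spec.
kubectl get ds/function-nodes \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'
```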
As a result, the scheduler pod can no longer find the function pod to schedule function execution on:
root@ip-172-20-38-118:/hydro/cloudburst# python3 run_benchmark.py
Traceback (most recent call last):
File "run_benchmark.py", line 22, in <module>
from cloudburst.server.benchmarks import (
ImportError: cannot import name 'retwis_benchmark'
root@ip-172-20-38-118:/hydro/cloudburst# vi run_benchmark.py
root@ip-172-20-38-118:/hydro/cloudburst# python3 run_benchmark.py
Usage: ./run_benchmark.py benchmark_name function_elb num_requests {ip}
root@ip-172-20-38-118:/hydro/cloudburst# python3 run_benchmark.py locality a94718831527b4048b43c7817b5d1212-1314702864.us-east-1.elb.amazonaws.com 1 172.20.38.118
INFO:root:Successfully registered the dot function.
INFO:root:Successfully tested function!
ERROR:root:Scheduler returned unexpected error:
error: NO_RESOURCES
Traceback (most recent call last):
File "run_benchmark.py", line 59, in <module>
False, None)
File "/hydro/cloudburst/cloudburst/server/benchmarks/locality.py", line 134, in run
cloudburst_client.call_dag(dag_name, arg_map, True)
File "/hydro/cloudburst/cloudburst/client/client.py", line 283, in call_dag
raise RuntimeError(str(r.error))
RuntimeError: 5
Can anyone please give some pointers on resolving this issue, i.e., how to update pod container images without breaking the whole running system?
Thanks in advance!