I am trying to set up a Hyperledger Fabric network on Kubernetes using this. I am at the step where I am trying to create channels. I run the command
argo submit output.yaml -v
where output.yaml is the output of the command
helm template channel-flow/ -f samples/simple/network.yaml -f samples/simple/crypto-config.yaml
but with spec.securityContext added as follows:
...
spec:
  securityContext:
    runAsNonRoot: true
    # runAsUser: 8737  # commented out because I don't know my user ID (see the note after this snippet); not sure if this could cause a problem
  entrypoint: channels
...
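On the runAsUser point: if it does turn out to be needed, I assume I could look up a usable UID like this (the pod and container names below are only placeholders, not something from my setup):
# my own UID on the VM, if runAsUser should simply be a non-root UID I control
id -u
# or the UID an existing container already runs as
kubectl exec <some-pod> -c <some-container> -- id -u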
My Argo workflow ends up stuck in the Pending state. I say this because I checked my orderer and peer logs and see no activity in them.
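For reference, the workflow phase itself can also be checked directly like this (I submitted without -n, so I assume the default namespace; <workflow-name> is whatever argo submit printed):
argo list
kubectl get workflows
argo get <workflow-name>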
I referenced https://stackoverflow.com/questions/61799013/argo-sample-workflows-stuck-in-the-pending-state and started by getting the Argo workflow-controller logs:
[user@vmmock3 fabric-kube]$ kubectl logs -n argo -l app=workflow-controller
time="2021-05-31T05:02:41.145Z" level=info msg="Get leases 200"
time="2021-05-31T05:02:41.150Z" level=info msg="Update leases 200"
time="2021-05-31T05:02:46.162Z" level=info msg="Get leases 200"
time="2021-05-31T05:02:46.168Z" level=info msg="Update leases 200"
time="2021-05-31T05:02:51.179Z" level=info msg="Get leases 200"
time="2021-05-31T05:02:51.185Z" level=info msg="Update leases 200"
time="2021-05-31T05:02:56.193Z" level=info msg="Get leases 200"
time="2021-05-31T05:02:56.199Z" level=info msg="Update leases 200"
time="2021-05-31T05:03:01.213Z" level=info msg="Get leases 200"
time="2021-05-31T05:03:01.219Z" level=info msg="Update leases 200"
I try to describe the workflow controller pod:
[user@vmmock3 fabric-kube]$ kubectl -n argo describe pod workflow-controller-57fcfb5df8-qvn74
Name: workflow-controller-57fcfb5df8-qvn74
Namespace: argo
Priority: 0
Node: hlf-pool1-8rnem/10.104.0.8
Start Time: Tue, 25 May 2021 13:44:56 +0800
Labels: app=workflow-controller
pod-template-hash=57fcfb5df8
Annotations: <none>
Status: Running
IP: 10.244.0.158
IPs:
IP: 10.244.0.158
Controlled By: ReplicaSet/workflow-controller-57fcfb5df8
Containers:
workflow-controller:
Container ID: containerd://78c7f8dcb0f3a3b861293559ae0a11b92ce6843065e6f9459556a6b7099c8961
Image: argoproj/workflow-controller:v3.0.5
Image ID: docker.io/argoproj/workflow-controller@sha256:740dca63b11168490d9cc7b2d1b08c1364f4a4064e1d9b7a778ca2ab12a63158
Ports: 9090/TCP, 6060/TCP
Host Ports: 0/TCP, 0/TCP
Command:
workflow-controller
Args:
--configmap
workflow-controller-configmap
--executor-image
argoproj/argoexec:v3.0.5
--namespaced
State: Running
Started: Mon, 31 May 2021 13:08:11 +0800
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Mon, 31 May 2021 12:59:05 +0800
Finished: Mon, 31 May 2021 13:03:04 +0800
Ready: True
Restart Count: 1333
Liveness: http-get http://:6060/healthz delay=90s timeout=1s period=60s #success=1 #failure=3
Environment:
LEADER_ELECTION_IDENTITY: workflow-controller-57fcfb5df8-qvn74 (v1:metadata.name)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from argo-token-hflpb (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
argo-token-hflpb:
Type: Secret (a volume populated by a Secret)
SecretName: argo-token-hflpb
Optional: false
QoS Class: BestEffort
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 7m44s (x3994 over 5d23h) kubelet Liveness probe failed: Get "http://10.244.0.158:6060/healthz": dial tcp 10.244.0.158:6060: connect: connection refused
Warning BackOff 3m46s (x16075 over 5d22h) kubelet Back-off restarting failed container
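Since the Last State above shows the container terminating with exit code 2, the logs of that previous container instance might say why; they can be pulled with --previous:
kubectl logs -n argo workflow-controller-57fcfb5df8-qvn74 --previous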
Could this failure be why my argo workflow is stuck in the pending state? How should I go about troubleshooting this?
EDIT: Output of kubectl get pods --all-namespaces (FYI these are running on DigitalOcean):
[user@vmmock3 fabric-kube]$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
argo argo-server-5695555c55-867bx 1/1 Running 1 6d19h
argo minio-58977b4b48-r2m2h 1/1 Running 0 6d19h
argo postgres-6b5c55f477-7swpp 1/1 Running 0 6d19h
argo workflow-controller-57fcfb5df8-qvn74 0/1 CrashLoopBackOff 1522 6d19h
default hlf-ca--atlantis-58bbd79d9d-x4mz4 1/1 Running 0 21h
default hlf-ca--karga-547dbfddc8-7w6b5 1/1 Running 0 21h
default hlf-ca--nevergreen-7ffb98484c-nlg4j 1/1 Running 0 21h
default hlf-orderer--groeifabriek--orderer0-0 1/1 Running 0 21h
default hlf-peer--atlantis--peer0-0 2/2 Running 0 21h
default hlf-peer--karga--peer0-0 2/2 Running 0 21h
default hlf-peer--nevergreen--peer0-0 2/2 Running 0 21h
kube-system cilium-2kjfz 1/1 Running 3 26d
kube-system cilium-operator-84bdd6f7b6-kp9vb 1/1 Running 1 6d20h
kube-system cilium-operator-84bdd6f7b6-pkkf9 1/1 Running 1 6d20h
kube-system coredns-55ff57f948-jb5jc 1/1 Running 0 6d20h
kube-system coredns-55ff57f948-r2q4g 1/1 Running 0 6d20h
kube-system csi-do-node-4r9gj 2/2 Running 0 26d
kube-system do-node-agent-sbc8b 1/1 Running 0 26d
kube-system kube-proxy-hpsc7 1/1 Running 0 26d
I will answer your question partially, as I cannot promise that everything else will work fine; however, I know how to fix the issue with the argo workflow-controller pod.
Answer
In short, you need to update Argo Workflows to a newer version (at least 3.0.6, ideally 3.0.7, which is available), because this looks like a bug in version 3.0.5.
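For example, if Argo was installed from one of the release manifests (the minio and postgres pods in your listing look like an Argo quick-start install), reapplying the corresponding manifest for v3.0.7 should be enough. This is only a sketch; adjust it to however you originally installed Argo:
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo-workflows/v3.0.7/manifests/quick-start-postgres.yaml
# or, for a plain namespaced install:
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo-workflows/v3.0.7/manifests/namespace-install.yaml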
How I got there
First I installed Argo version 3.0.5 (which is not production ready) and ended up with workflow-controller pod restarts:
kubectl get pods -n argo
NAME READY STATUS RESTARTS AGE
argo-server-645cf8bc47-sbnqv 1/1 Running 0 9m7s
workflow-controller-768565d958-9lftf 1/1 Running 2 9m7s
curl-pod 1/1 Running 0 6m47s
And the same liveness probe failure:
kubectl describe pod workflow-controller-768565d958-9lftf -n argo
Name: workflow-controller-768565d958-9lftf
Namespace: argo
Priority: 0
Node: worker1/10.186.0.3
Start Time: Tue, 01 Jun 2021 14:25:00 +0000
Labels: app=workflow-controller
pod-template-hash=768565d958
Annotations: <none>
Status: Running
IP: 10.244.1.151
IPs:
IP: 10.244.1.151
Controlled By: ReplicaSet/workflow-controller-768565d958
Containers:
workflow-controller:
Container ID: docker://4b797b57ae762f9fc3f7acdd890d25434a8d9f6f165bbb7a7bda35745b5f4092
Image: argoproj/workflow-controller:v3.0.5
Image ID: docker-pullable://argoproj/workflow-controller@sha256:740dca63b11168490d9cc7b2d1b08c1364f4a4064e1d9b7a778ca2ab12a63158
Ports: 9090/TCP, 6060/TCP
Host Ports: 0/TCP, 0/TCP
Command:
workflow-controller
Args:
--configmap
workflow-controller-configmap
--executor-image
argoproj/argoexec:v3.0.5
State: Running
Started: Tue, 01 Jun 2021 14:33:00 +0000
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Tue, 01 Jun 2021 14:29:00 +0000
Finished: Tue, 01 Jun 2021 14:33:00 +0000
Ready: True
Restart Count: 2
Liveness: http-get http://:6060/healthz delay=90s timeout=1s period=60s #success=1 #failure=3
Environment:
LEADER_ELECTION_IDENTITY: workflow-controller-768565d958-9lftf (v1:metadata.name)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ts9zf (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-ts9zf:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m57s default-scheduler Successfully assigned argo/workflow-controller-768565d958-9lftf to worker1
Normal Pulled 57s (x3 over 8m56s) kubelet Container image "argoproj/workflow-controller:v3.0.5" already present on machine
Normal Created 57s (x3 over 8m56s) kubelet Created container workflow-controller
Normal Started 57s (x3 over 8m56s) kubelet Started container workflow-controller
Warning Unhealthy 57s (x6 over 6m57s) kubelet Liveness probe failed: Get "http://10.244.1.151:6060/healthz": dial tcp 10.244.1.151:6060: connect: connection refused
Normal Killing 57s (x2 over 4m57s) kubelet Container workflow-controller failed liveness probe, will be restarted
I also tested this endpoint with a pod in the same namespace based on the curlimages/curl image, which has curl built in. Here is the pod.yaml:
apiVersion: v1
kind: Pod
metadata:
  namespace: argo
  labels:
    app: curl
  name: curl-pod
spec:
  containers:
  - image: curlimages/curl
    name: curl-pod
    # sleep 30 just keeps the container alive so it can be exec'd into
    command: ['sh', '-c', 'while true; do sleep 30; done']
  dnsPolicy: ClusterFirst
  restartPolicy: Always
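I create it with a plain apply and wait for it to become ready:
kubectl apply -f pod.yaml
kubectl wait --for=condition=Ready pod/curl-pod -n argo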
kubectl exec -it curl-pod -n argo -- curl http://10.244.1.151:6060/healthz
Which resulted in the same error:
curl: (7) Failed to connect to 10.244.1.151 port 6060: Connection refused
The next step was trying a newer version (3.1.0-rc and then 3.0.7), and it succeeded!
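For the version swap, one way is simply to point the existing deployment at the newer controller image (just an example; reapplying the matching v3.0.7 manifest, as above, is the cleaner route):
kubectl -n argo set image deployment/workflow-controller workflow-controller=argoproj/workflow-controller:v3.0.7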
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 27m default-scheduler Successfully assigned argo/workflow-controller-74b4b5455d-skb2f to worker1
Normal Pulling 27m kubelet Pulling image "argoproj/workflow-controller:v3.0.7"
Normal Pulled 27m kubelet Successfully pulled image "argoproj/workflow-controller:v3.0.7" in 15.728042003s
Normal Created 27m kubelet Created container workflow-controller
Normal Started 27m kubelet Started container workflow-controller
And checking it with curl:
kubectl exec -it curl-pod -n argo -- curl 10.244.1.169:6060/healthz
ok