Argo workflow stuck in pending due to liveness probe fail?

5/31/2021

I am trying to set up a Hyperledger Fabric network on Kubernetes by using this.

I am at the step where I create channels. I run argo submit output.yaml -v, where output.yaml is the output of helm template channel-flow/ -f samples/simple/network.yaml -f samples/simple/crypto-config.yaml, but with spec.securityContext added as follows:

...
spec:
  securityContext:
    runAsNonRoot: true
    #runAsUser: 8737 (commented out because I don't know my user ID; not sure if this could cause a problem)

  entrypoint: channels
...
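
As an aside, my understanding is that runAsUser sets the UID the workflow containers run as inside the pod (not my host user ID), and that runAsNonRoot: true without runAsUser only works if the container image already declares a numeric non-root USER; otherwise the pod fails with CreateContainerConfigError. Restored, the block would look like this (8737 being the value I commented out, not something I derived):

spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 8737  # UID for the container processes; any non-root UID the images tolerate satisfies runAsNonRoot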

My Argo workflow ends up stuck in the Pending state. I say this because I checked my orderer and peer logs and see no activity in them.
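
To confirm the workflows are actually Pending rather than just quiet, I believe the Workflow objects can also be checked directly, for example:

kubectl get workflows --all-namespaces       # the Workflow CRD prints a STATUS column (Pending/Running/Succeeded/...)
kubectl describe workflow <workflow-name>    # <workflow-name> is a placeholder for whichever workflow argo submit created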

I referenced https://stackoverflow.com/questions/61799013/argo-sample-workflows-stuck-in-the-pending-state and started by getting the workflow-controller logs:

[user@vmmock3 fabric-kube]$ kubectl logs -n argo -l app=workflow-controller
time="2021-05-31T05:02:41.145Z" level=info msg="Get leases 200"
time="2021-05-31T05:02:41.150Z" level=info msg="Update leases 200"
time="2021-05-31T05:02:46.162Z" level=info msg="Get leases 200"
time="2021-05-31T05:02:46.168Z" level=info msg="Update leases 200"
time="2021-05-31T05:02:51.179Z" level=info msg="Get leases 200"
time="2021-05-31T05:02:51.185Z" level=info msg="Update leases 200"
time="2021-05-31T05:02:56.193Z" level=info msg="Get leases 200"
time="2021-05-31T05:02:56.199Z" level=info msg="Update leases 200"
time="2021-05-31T05:03:01.213Z" level=info msg="Get leases 200"
time="2021-05-31T05:03:01.219Z" level=info msg="Update leases 200"

I describe the workflow-controller pod:

[user@vmmock3 fabric-kube]$ kubectl -n argo describe pod workflow-controller-57fcfb5df8-qvn74
Name:         workflow-controller-57fcfb5df8-qvn74
Namespace:    argo
Priority:     0
Node:         hlf-pool1-8rnem/10.104.0.8
Start Time:   Tue, 25 May 2021 13:44:56 +0800
Labels:       app=workflow-controller
              pod-template-hash=57fcfb5df8
Annotations:  <none>
Status:       Running
IP:           10.244.0.158
IPs:
  IP:           10.244.0.158
Controlled By:  ReplicaSet/workflow-controller-57fcfb5df8
Containers:
  workflow-controller:
    Container ID:  containerd://78c7f8dcb0f3a3b861293559ae0a11b92ce6843065e6f9459556a6b7099c8961
    Image:         argoproj/workflow-controller:v3.0.5
    Image ID:      docker.io/argoproj/workflow-controller@sha256:740dca63b11168490d9cc7b2d1b08c1364f4a4064e1d9b7a778ca2ab12a63158
    Ports:         9090/TCP, 6060/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      workflow-controller
    Args:
      --configmap
      workflow-controller-configmap
      --executor-image
      argoproj/argoexec:v3.0.5
      --namespaced
    State:          Running
      Started:      Mon, 31 May 2021 13:08:11 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Mon, 31 May 2021 12:59:05 +0800
      Finished:     Mon, 31 May 2021 13:03:04 +0800
    Ready:          True
    Restart Count:  1333
    Liveness:       http-get http://:6060/healthz delay=90s timeout=1s period=60s #success=1 #failure=3
    Environment:
      LEADER_ELECTION_IDENTITY:  workflow-controller-57fcfb5df8-qvn74 (v1:metadata.name)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from argo-token-hflpb (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  argo-token-hflpb:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  argo-token-hflpb
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                        From     Message
  ----     ------     ----                       ----     -------
  Warning  Unhealthy  7m44s (x3994 over 5d23h)   kubelet  Liveness probe failed: Get "http://10.244.0.158:6060/healthz": dial tcp 10.244.0.158:6060: connect: connection refused
  Warning  BackOff    3m46s (x16075 over 5d22h)  kubelet  Back-off restarting failed container

Could this failure be why my Argo workflow is stuck in the Pending state? How should I go about troubleshooting this?


EDIT: Output of kubectl get pods --all-namespaces (FYI these are running on DigitalOcean):

[user@vmmock3 fabric-kube]$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                    READY   STATUS             RESTARTS   AGE
argo          argo-server-5695555c55-867bx            1/1     Running            1          6d19h
argo          minio-58977b4b48-r2m2h                  1/1     Running            0          6d19h
argo          postgres-6b5c55f477-7swpp               1/1     Running            0          6d19h
argo          workflow-controller-57fcfb5df8-qvn74    0/1     CrashLoopBackOff   1522       6d19h
default       hlf-ca--atlantis-58bbd79d9d-x4mz4       1/1     Running            0          21h
default       hlf-ca--karga-547dbfddc8-7w6b5          1/1     Running            0          21h
default       hlf-ca--nevergreen-7ffb98484c-nlg4j     1/1     Running            0          21h
default       hlf-orderer--groeifabriek--orderer0-0   1/1     Running            0          21h
default       hlf-peer--atlantis--peer0-0             2/2     Running            0          21h
default       hlf-peer--karga--peer0-0                2/2     Running            0          21h
default       hlf-peer--nevergreen--peer0-0           2/2     Running            0          21h
kube-system   cilium-2kjfz                            1/1     Running            3          26d
kube-system   cilium-operator-84bdd6f7b6-kp9vb        1/1     Running            1          6d20h
kube-system   cilium-operator-84bdd6f7b6-pkkf9        1/1     Running            1          6d20h
kube-system   coredns-55ff57f948-jb5jc                1/1     Running            0          6d20h
kube-system   coredns-55ff57f948-r2q4g                1/1     Running            0          6d20h
kube-system   csi-do-node-4r9gj                       2/2     Running            0          26d
kube-system   do-node-agent-sbc8b                     1/1     Running            0          26d
kube-system   kube-proxy-hpsc7                        1/1     Running            0          26d
-- user10931326
argo-workflows
hyperledger-fabric
kubernetes

1 Answer

6/1/2021

I will answer your question partially, as I can't promise everything else will work fine; however, I know how to fix the issue with the Argo workflow-controller pod.

Answer

In short, you need to update Argo Workflows to a newer version (at least 3.0.6, ideally 3.0.7, which is available) because this looks like a bug in version 3.0.5.
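
A minimal sketch of the upgrade, assuming Argo was installed from the stock release manifests into the argo namespace (if you used one of the quick-start manifests, which the minio/postgres pods suggest, re-apply the matching v3.0.7 quick-start file instead):

kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.0.7/install.yaml

Re-applying the manifest should also bump the --executor-image argument, so new workflow pods use the matching argoexec version.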

How I got there

First I installed Argo version 3.0.5 (which is not production ready) and ended up with workflow-controller pod restarts:

kubectl get pods -n argo

NAME                                   READY   STATUS    RESTARTS   AGE
argo-server-645cf8bc47-sbnqv           1/1     Running   0          9m7s
workflow-controller-768565d958-9lftf   1/1     Running   2          9m7s
curl-pod                               1/1     Running   0          6m47s

And I saw the same liveness probe failure:

kubectl describe pod workflow-controller-768565d958-9lftf -n argo
Name:         workflow-controller-768565d958-9lftf
Namespace:    argo
Priority:     0
Node:         worker1/10.186.0.3
Start Time:   Tue, 01 Jun 2021 14:25:00 +0000
Labels:       app=workflow-controller
              pod-template-hash=768565d958
Annotations:  <none>
Status:       Running
IP:           10.244.1.151
IPs:
  IP:           10.244.1.151
Controlled By:  ReplicaSet/workflow-controller-768565d958
Containers:
  workflow-controller:
    Container ID:  docker://4b797b57ae762f9fc3f7acdd890d25434a8d9f6f165bbb7a7bda35745b5f4092
    Image:         argoproj/workflow-controller:v3.0.5
    Image ID:      docker-pullable://argoproj/workflow-controller@sha256:740dca63b11168490d9cc7b2d1b08c1364f4a4064e1d9b7a778ca2ab12a63158
    Ports:         9090/TCP, 6060/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      workflow-controller
    Args:
      --configmap
      workflow-controller-configmap
      --executor-image
      argoproj/argoexec:v3.0.5
    State:          Running
      Started:      Tue, 01 Jun 2021 14:33:00 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Tue, 01 Jun 2021 14:29:00 +0000
      Finished:     Tue, 01 Jun 2021 14:33:00 +0000
    Ready:          True
    Restart Count:  2
    Liveness:       http-get http://:6060/healthz delay=90s timeout=1s period=60s #success=1 #failure=3
    Environment:
      LEADER_ELECTION_IDENTITY:  workflow-controller-768565d958-9lftf (v1:metadata.name)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ts9zf (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  kube-api-access-ts9zf:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  8m57s                default-scheduler  Successfully assigned argo/workflow-controller-768565d958-9lftf to worker1
  Normal   Pulled     57s (x3 over 8m56s)  kubelet            Container image "argoproj/workflow-controller:v3.0.5" already present on machine
  Normal   Created    57s (x3 over 8m56s)  kubelet            Created container workflow-controller
  Normal   Started    57s (x3 over 8m56s)  kubelet            Started container workflow-controller
  Warning  Unhealthy  57s (x6 over 6m57s)  kubelet            Liveness probe failed: Get "http://10.244.1.151:6060/healthz": dial tcp 10.244.1.151:6060: connect: connection refused
  Normal   Killing    57s (x2 over 4m57s)  kubelet            Container workflow-controller failed liveness probe, will be restarted

I also tested this endpoint from a pod in the same namespace, based on the curlimages/curl image, which has curl built in.

Here's the pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  namespace: argo
  labels:
    app: curl
  name: curl-pod
spec:
  containers:
  - image: curlimages/curl
    name: curl-pod
    command: ['sh', '-c', 'while true; do sleep 30; done']
  dnsPolicy: ClusterFirst
  restartPolicy: Always
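
Create the pod, then curl the controller's liveness endpoint from inside the cluster (10.244.1.151 is the controller pod IP from the describe output above):

kubectl apply -f pod.yaml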

kubectl exec -it curl-pod -n argo -- curl http://10.244.1.151:6060/healthz

Which resulted in the same error:

curl: (7) Failed to connect to 10.244.1.151 port 6060: Connection refused

The next step was trying a newer version (3.1.0-rc and then 3.0.7), and it succeeded!

Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  27m   default-scheduler  Successfully assigned argo/workflow-controller-74b4b5455d-skb2f to worker1
  Normal  Pulling    27m   kubelet            Pulling image "argoproj/workflow-controller:v3.0.7"
  Normal  Pulled     27m   kubelet            Successfully pulled image "argoproj/workflow-controller:v3.0.7" in 15.728042003s
  Normal  Created    27m   kubelet            Created container workflow-controller
  Normal  Started    27m   kubelet            Started container workflow-controller

And checked it with curl:

kubectl exec -it curl-pod -n argo -- curl 10.244.1.169:6060/healthz
ok
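
After the upgrade the controller should stop restarting and start picking up queued workflows; as a rough check, the restart count should stay flat and stuck workflows should eventually leave Pending (or can simply be resubmitted):

kubectl get pods -n argo
kubectl get workflows --all-namespaces
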
-- moonkotte
Source: StackOverflow