Volume node affinity conflicts in Argo workflows

11/23/2020

I have an Argo workflow with two steps: the first runs on Linux and the second runs on Windows.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: my-workflow-v1.13
spec:
  entrypoint: process
  volumeClaimTemplates:
    - metadata:
        name: workdir
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 1Gi
  arguments:
    parameters:
      - name: jobId
        value: 0
  templates:
    - name: process
      steps:
        - - name: prepare
            template: prepare
        - - name: win-step
            template: win-step

    - name: win-step
      nodeSelector:
        kubernetes.io/os: windows
      container:
        image: mcr.microsoft.com/windows/nanoserver:1809
        command: ["cmd", "/c"]
        args: ["dir", "C:\\workdir\\source"]
        volumeMounts:
          - name: workdir
            mountPath: /workdir

    - name: prepare
      nodeSelector:
        kubernetes.io/os: linux
      inputs:
        artifacts:
          - name: src
            path: /opt/workdir/source.zip
            s3:
              endpoint: minio:9000
              insecure: true
              bucket: "{{workflow.parameters.jobId}}"
              key: "source.zip"
              accessKeySecret:
                name: my-minio-cred
                key: accesskey
              secretKeySecret:
                name: my-minio-cred
                key: secretkey
      script:
        image: garthk/unzip:latest
        imagePullPolicy: IfNotPresent
        command: [sh]
        source: |
          unzip /opt/workdir/source.zip -d /opt/workdir/source
        volumeMounts:
          - name: workdir
            mountPath: /opt/workdir

Both steps share a volume.

To achieve that in Azure Kubernetes Service, I had to create two node pools, one for Linux nodes and another for Windows nodes.

The problem is that when I queue the workflow, it sometimes completes, and sometimes win-step (the step that runs in the Windows container) hangs or fails and shows this message:

1 node(s) had volume node affinity conflict

I've read that this can happen because the volume is provisioned in a specific zone, and the Windows container (since it's in a different node pool) gets scheduled in a different zone that has no access to that volume. I couldn't find a solution for that, though.
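To illustrate what I mean by the volume being tied to a zone: a disk-backed PersistentVolume usually carries a nodeAffinity section along the lines of the fragment below. This is only a sketch of the general shape, not output from my cluster. The name pvc-example and the zone value eastus2-1 are placeholders, and the exact label key may differ depending on the Kubernetes version and storage driver.

```yaml
# Illustrative fragment only: name and zone value are placeholders,
# and the volume source (the actual disk reference) is omitted.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-example
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            # Only nodes carrying this zone label may mount the volume.
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - eastus2-1
```

If the Windows node that win-step lands on does not carry a matching zone label, the scheduler would report the volume node affinity conflict shown above.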

Please help.

-- areller
argo-workflows
azure
azure-aks
kubernetes
persistent-volumes

1 Answer

11/23/2020

the first runs on Linux and the second runs on Windows

I doubt that you can mount the same volume on both a Linux node, which typically uses the ext4 file system, and a Windows node; Azure Windows containers use the NTFS file system.

So the volume that you try to mount in the second step is located in a node pool that does not match your nodeSelector.
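If the zone mismatch described in the question is the cause, one thing that might be worth trying is pinning both templates to the same zone, so the volume is created where the second step can also be scheduled. The fragment below is a minimal sketch, not a verified fix: the zone value eastus2-1 is a placeholder, it assumes both node pools have at least one node in that zone, and it does not address the file system concern above.

```yaml
# Sketch only: the zone value is a placeholder; both node pools are
# assumed to have a node in that zone. Other template fields are omitted.
templates:
  - name: prepare
    nodeSelector:
      kubernetes.io/os: linux
      topology.kubernetes.io/zone: eastus2-1
  - name: win-step
    nodeSelector:
      kubernetes.io/os: windows
      topology.kubernetes.io/zone: eastus2-1
```

Multiple entries in a nodeSelector are combined with AND, so each step would still land on a node of the right operating system, just restricted to the one zone.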

-- Jonas
Source: StackOverflow