Docker multi-stage builds stuck between layers when using Docker-in-Docker with Jenkins and Kubernetes

1/19/2022

Big title, I know, but it is a very specific issue.

I'm setting up a new Jenkins cluster and trying to use Docker-in-Docker containers to build images, unlike the current Jenkins cluster, which mounts that ugly-as-hell /var/run/docker.sock. What gets built is a monorepo with several Dockerfiles, with the builds running in parallel.
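
For clarity, the difference from the old setup is that instead of mounting the host's Docker socket into the agent, the build talks to a DinD sidecar over TCP. Roughly, each build step ends up doing something like this (the image name here is just an illustration; the real command is further down):

    # Old cluster: the agent mounts the host socket and builds on the node's daemon
    #   -v /var/run/docker.sock:/var/run/docker.sock
    # New cluster: the jnlp container points the Docker CLI at the DinD sidecar in the same pod
    export DOCKER_HOST=tcp://localhost:2375    # same value as in the pod template below
    docker build --target dev -t name-of-my-image:dev .    # illustrative only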

The problem is that, when building huge layers (for example, after a yarn install that downloads half of the internet), the step hangs at that final Done in XX.XXs and never moves on to the next step, whatever it is.

Sometimes the build passes successfully (generally right after I change something in the cluster), but the following ones hang forever. When it passes, I can build 8 Node.js images in ~28 min, but the next runs time out after 60 min.
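
For what it's worth, while a build is stuck like that, the DinD side can be checked from outside the pod; roughly along these lines (the pod name is a placeholder):

    # Is the daemon inside the DinD sidecar still responding?
    kubectl exec -it <agent-pod> -c docker -- docker info
    # Is dockerd still doing work, or just sitting idle?
    kubectl exec -it <agent-pod> -c docker -- ps aux
    # dockerd runs in the foreground in the dind image, so its logs are the container's logs
    kubectl logs <agent-pod> -c docker --tail=100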

Here is some code showing how I'm doing this. All the other images follow the same template as the one provided.

  • Jenkins pod template:

    apiVersion: "v1"
    kind: "Pod"
    metadata:
      labels:
        name: "jnlp"
        jenkins/jenkins-jenkins-agent: "true"
    spec:
      containers:
      - env:
        - name: "DOCKER_HOST"
          value: "tcp://localhost:2375"
        image: "12345678910.dkr.ecr.us-east-1.amazonaws.com/kubernetes-agent:2.0" # internal image
        imagePullPolicy: "IfNotPresent"
        name: "jnlp"
        resources:
          limits:
            cpu: "1000m"
            memory: "1Gi"
          requests:
            cpu: "500m"
            memory: "500Mi"
        tty: true
        volumeMounts:
        - mountPath: "/home/jenkins"
          name: "workspace-volume"
          readOnly: false
        workingDir: "/home/jenkins"
      - args:
        - "--tls=false"
        env:
        - name: "DOCKER_BUILDKIT"
          value: "1"
        - name: "DOCKER_TLS_CERTDIR"
          value: ""
        - name: "DOCKER_DRIVER"
          value: "overlay2"
        image: "docker:20.10.12-dind-alpine3.15"
        imagePullPolicy: "IfNotPresent"
        name: "docker"
        resources:
          limits:
            memory: "4Gi"
            cpu: "2"
          requests:
            memory: "1Gi"
            cpu: "500m"
        securityContext:
          privileged: true
        tty: true
        volumeMounts:
        - mountPath: "/var/lib/docker"
          name: "docker"
          readOnly: false
        - mountPath: "/home/jenkins"
          name: "workspace-volume"
          readOnly: false
        workingDir: "/home/jenkins"
      nodeSelector:
        spot: "true"
      restartPolicy: "Never"
      volumes:
      - emptyDir:
          medium: ""
        name: "docker"
      - emptyDir:
          medium: ""
        name: "workspace-volume"
  • Dockerfile:

    # We don't use the alpine image due to dependency issues
    FROM node:12.14.1-stretch-slim as base
    
    RUN apt-get update \
      && DEBIAN_FRONTEND=noninteractive apt-get -y install --no-install-recommends \
        apt-utils build-essential bzip2 ca-certificates cron curl g++ git libfontconfig make python \
      && update-ca-certificates \
      && apt-get autoremove -y \
      && apt-get clean \
      && rm -rf /tmp/* /var/tmp/* \
      && rm -f /var/log/alternatives.log /var/log/apt/* \
      && rm -rf /var/lib/apt/lists/* \
      && rm /var/cache/debconf/*-old
    
    ENV NODE_ENV development
    
    # Placed here to optimize layer caching
    EXPOSE 8043
    
    WORKDIR /opt/app
    RUN chown -R node:node /opt/app
    
    USER node
    
    COPY --chown=node:node package.json yarn.lock .yarnclean /opt/app/
    COPY 100-wkhtmltoimage-special.conf /etc/fonts/conf.d/
    
    RUN yarn config set network-timeout 600000 -g && \
        yarn --frozen-lockfile && \
        yarn autoclean --force && \
        yarn cache clean
    
    FROM base as dev
    
    # --debug and inspect port
    EXPOSE 5858 9229
    COPY --chown=node:node . /opt/app
    RUN npx gulp build && sh ./app-ssl
    
    FROM base as prod
    
    COPY --from=dev /opt/app /opt/app
    
    # Like `npm prune --production`
    RUN yarn --production --ignore-scripts --prefer-offline
    
    CMD ["yarn", "start"]
  • The command:

    docker build \
      --network host --force-rm \
      --build-arg BUILDKIT_INLINE_CACHE=1 \
      --cache-from 12345678910.dkr.ecr.us-east-1.amazonaws.com/name-of-my-image:latest \
      --cache-from 12345678910.dkr.ecr.us-east-1.amazonaws.com/name-of-my-image:latest-dev \
      --cache-from 12345678910.dkr.ecr.us-east-1.amazonaws.com/name-of-my-image:${VERSION} \
      --cache-from 12345678910.dkr.ecr.us-east-1.amazonaws.com/name-of-my-image:${VERSION}-dev \
      --tag 12345678910.dkr.ecr.us-east-1.amazonaws.com/name-of-my-image:${VERSION}-dev \
      --tag 12345678910.dkr.ecr.us-east-1.amazonaws.com/name-of-my-image:latest-dev \
      --target dev .
    
  • The end of the log:

    ...
    [2022-01-18T19:37:19.928Z] [4/5] Building fresh packages...
    [2022-01-18T19:37:19.928Z] [5/5] Cleaning modules...
    [2022-01-18T19:37:34.774Z] Done in 486.04s.
    [2022-01-18T19:37:34.774Z] yarn autoclean v1.21.1
    [2022-01-18T19:37:34.774Z] [1/1] Cleaning modules...
    [2022-01-18T19:37:46.952Z] info Removed 0 files
    [2022-01-18T19:37:46.952Z] info Saved 0 MB.
    [2022-01-18T19:37:46.952Z] Done in 12.85s.
    [2022-01-18T19:37:46.952Z] yarn cache v1.21.1
    [2022-01-18T19:38:13.453Z] success Cleared cache.
    [2022-01-18T19:38:13.453Z] Done in 24.21s.
    [2022-01-18T20:28:51.170Z] make: *** [Makefile:21: build-dev] Terminated <=== The pipeline hits its timeout here! Note the gap since the previous line.
    script returned exit code 2
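
For reference, the same build can be reproduced by hand from the jnlp container of a live agent pod, which takes Jenkins itself out of the equation. Roughly (pod name and paths are placeholders):

    # Shell into the agent container and run the docker build manually
    kubectl exec -it <agent-pod> -c jnlp -- sh
    cd /home/jenkins/<workspace>/<service-dir>   # placeholder path inside the shared workspace volume
    docker build --network host --force-rm \
      --build-arg BUILDKIT_INLINE_CACHE=1 \
      --target dev .                             # DOCKER_HOST already points at tcp://localhost:2375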
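
It may also be worth correlating the hang with resource usage, since the DinD container is capped at 2 CPUs / 4Gi and that yarn install is heavy. Assuming metrics-server is installed, something like:

    # Per-container CPU/memory of the agent pod while a build is hanging (pod name is a placeholder)
    kubectl top pod <agent-pod> --containers
    # Disk usage of the emptyDir backing /var/lib/docker, from inside the DinD sidecar
    kubectl exec -it <agent-pod> -c docker -- df -h /var/lib/docker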

If anyone needs any more information, please let me know. Thanks!

-- Igor Brites
docker
docker-in-docker
jenkins
kubernetes
node.js

0 Answers