GitLab Auto DevOps on Kubernetes hangs, network timeouts, cannot execute yj

12/19/2021

When using GitLab Auto DevOps to build and deploy applications from my repository to microk8s, the build jobs often take a very long time and eventually time out. The issue happens 99% of the time, but the occasional build runs through. The build often stops at a different point in the build script each time.

The projects do not contain a .gitlab-ci.yml file and fully rely on the Auto DevOps feature to do its magic.

For Spring Boot/Java projects, the build often fails while downloading Gradle via the Gradle wrapper; other times it fails while downloading the dependencies themselves. The error message is very vague and not helpful at all:

Step 5/11 : RUN /bin/herokuish buildpack build
 ---> Running in e9ec110c0dfe
       -----> Gradle app detected
-----> Spring Boot detected
The command '/bin/sh -c /bin/herokuish buildpack build' returned a non-zero code: 35

Sometimes, if you get lucky, the error is different:

Step 5/11 : RUN /bin/herokuish buildpack build
 ---> Running in fe284971a79c
       -----> Gradle app detected
-----> Spring Boot detected
-----> Installing JDK 11... done
-----> Building Gradle app...
-----> executing ./gradlew build -x check
       Downloading https://services.gradle.org/distributions/gradle-7.0-bin.zip
       ..........10%...........20%...........30%..........40%...........50%...........60%...........70%..........80%...........90%...........100%
       To honour the JVM settings for this build a single-use Daemon process will be forked. See https://docs.gradle.org/7.0/userguide/gradle_daemon.html#sec:disabling_the_daemon.
       Daemon will be stopped at the end of the build
       > Task :compileJava
       > Task :compileJava FAILED
       
       FAILURE: Build failed with an exception.
       
       * What went wrong:
       Execution failed for task ':compileJava'.
       > Could not download netty-resolver-dns-native-macos-4.1.65.Final-osx-x86_64.jar (io.netty:netty-resolver-dns-native-macos:4.1.65.Final)
       > Could not get resource 'https://repo.maven.apache.org/maven2/io/netty/netty-resolver-dns-native-macos/4.1.65.Final/netty-resolver-dns-native-macos-4.1.65.Final-osx-x86_64.jar'.
       > Could not GET 'https://repo.maven.apache.org/maven2/io/netty/netty-resolver-dns-native-macos/4.1.65.Final/netty-resolver-dns-native-macos-4.1.65.Final-osx-x86_64.jar'.
       > Read timed out

For React/TypeScript projects, the symptoms are similar, but the error itself manifests in a different way:

[INFO] Using npm v8.1.0 from package.json
/cnb/buildpacks/heroku_nodejs-npm/0.4.4/lib/build.sh: line 179: /layers/heroku_nodejs-engine/toolbox/bin/yj: Permission denied
ERROR: failed to build: exit status 126
ERROR: failed to build: executing lifecycle: failed with status code: 145

The problem seems to occur mostly when the GitLab runners themselves are deployed in Kubernetes. microk8s uses Project Calico to implement its virtual networks.

What gives? Why are the error messages so unhelpful? Is there a way to turn on verbose build logs or debug the build steps?

-- knittl
continuous-integration
docker
gitlab-autodevops
kubernetes
microk8s

1 Answer

12/19/2021

This seems to be a networking problem caused by incompatible MTU settings between the Calico network layer and Docker's network configuration (and an inability to autoconfigure the MTU correctly?). When the MTU values don't match, network packets get fragmented and the Docker runners fail to complete TLS handshakes. As far as I understand, this only affects DIND (Docker-in-Docker) runners.

Even finding this out requires jumping through a few hoops (a condensed shell sketch follows the list). You have to:

  1. Start a CI pipeline and wait for the job to "hang"
  2. kubectl exec into the current/active GitLab runner pod
  3. Find out the correct value for the DOCKER_HOST environment variable (e.g. by grepping through /proc/$pid/environ). Very likely, this will be tcp://localhost:2375.
  4. Export the value to be used by the docker client: export DOCKER_HOST=tcp://localhost:2375
  5. docker ps and then docker exec into the actual CI job container
  6. Use ping and other tools to find proper MTU values (but MTU for what? Docker, Calico, OS, router, …?). Use curl/openssl to verify that (certain) https sites cause problems from inside the DIND container.
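
Condensed into commands, the dance looks roughly like this (a sketch: the pod name, $pid, and the test host are placeholders, and BusyBox ping inside the DIND container may not support the -M flag):

# 1+2: shell into the active runner pod (name is a placeholder)
kubectl exec -it gitlab-runner-xxxxx -- sh

# 3: find DOCKER_HOST in the environment of the build process
tr '\0' '\n' < /proc/$pid/environ | grep DOCKER_HOST

# 4: point the docker CLI at the DIND daemon
export DOCKER_HOST=tcp://localhost:2375

# 5: find and enter the actual CI job container
docker ps
docker exec -it <container-id> sh

# 6: probe the path MTU with the don't-fragment bit set
#    (1412 = 1440 bytes minus 28 bytes of IP/ICMP headers),
#    then check whether a TLS handshake to an affected site hangs
ping -c 3 -M do -s 1412 repo.maven.apache.org
curl -v https://repo.maven.apache.org/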

Execute

microk8s kubectl get -n kube-system cm calico-config -o yaml

and look for the veth_mtu value, which will very likely be set to 1440. DIND uses the same MTU and thus fails to send or receive certain network packets (each virtual network layer adds its own header to the packet, which costs a few bytes at every layer).
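
To extract just that value instead of reading through the whole ConfigMap, a jsonpath query does the trick (the veth_mtu key name is taken from the output above):

microk8s kubectl get -n kube-system cm calico-config -o jsonpath='{.data.veth_mtu}'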

The naïve fix would be to change the Calico setting to a higher or lower value, but somehow this did not really work, even after restarting the Calico deployment. Furthermore, the value seems to get reset to its original value from time to time, probably by automatic updates to microk8s (which ships as a snap).
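
For completeness, that naïve approach would look roughly like this (a sketch; the daemonset name calico-node is an assumption based on the default Calico manifest, and as said, the change did not stick for me):

# lower the MTU in the Calico ConfigMap
microk8s kubectl patch -n kube-system cm calico-config \
  --type merge -p '{"data":{"veth_mtu":"1240"}}'
# Calico pods read the ConfigMap only at startup, so restart them:
microk8s kubectl rollout restart -n kube-system ds/calico-node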

So what is a solution that actually works and is permanent? It is possible to override the DIND settings for Auto DevOps by writing a custom .gitlab-ci.yml file that simply includes the Auto DevOps template:

build:
  services:
    - name: docker:20.10.6-dind # make sure to update version
      command: ['--tls=false', '--host=tcp://0.0.0.0:2375', '--mtu=1240']

include:
  - template: Auto-DevOps.gitlab-ci.yml

The build.services definition is copied from the Jobs/Build.gitlab-ci.yml template and extended with an additional --mtu option.

I've had good experience so far by setting the DIND MTU to 1240, which is 200 bytes lower than Calico's MTU. As an added bonus, it doesn't affect any other pods' network settings. And for CI builds I can live with non-optimal network settings.
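
To check that the override actually took effect, exec into the CI job container of a running build (as described above) and inspect the interface MTU (a sketch; the container id is a placeholder):

docker exec -it <container-id> sh -c 'ip link show eth0'
# the interface should now report "mtu 1240"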


-- knittl
Source: StackOverflow