Calico `Socket error: bind: Address not available` after GitLab Kubernetes runner job

5/24/2021

I have a Kubernetes cluster deployed with kops on AWS, not using EKS (too expensive for now). The CNI is Calico. I have 4 nodes:

  • 1 master t3a.medium
  • 2 small workers t2.micro
  • 1 bigger worker t3a.medium

They are labelled.

I have a GitLab Kubernetes runner installed with this Helm chart. It is configured to run on the big worker, and to spawn job runners on the big worker as well.

It was working well, but for the past few days I have noticed that sometimes (if not every time) the first job of a pipeline causes an error in the calico-node pod running on the big node. The error is:

bird: Mesh_10_1_1_136: Socket error: bind: Address not available
bird: Mesh_10_1_1_212: Socket error: bind: Address not available
bird: Mesh_10_1_1_14: Socket error: bind: Address not available

10.1.1.136, 10.1.1.212 and 10.1.1.14 are the IPs of the master and the 2 smaller nodes. The IP of the big node never appears.
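For anyone wanting to check which peers are affected in a longer log, here is a minimal Python sketch that collects the peer IPs from the BIRD error lines. It assumes that the protocol name `Mesh_a_b_c_d` encodes the peer IP `a.b.c.d`, which matches what I see above. The function name `failing_peers` and the regex are only illustrative, not part of Calico or BIRD.

```python
import re

# Assumption: BIRD names each mesh peer "Mesh_" + the peer IP with dots
# replaced by underscores, as in the error lines quoted above.
MESH_ERROR = re.compile(r"bird: Mesh_(\d+)_(\d+)_(\d+)_(\d+): Socket error: (.+)")

def failing_peers(log_text):
    """Map each peer IP to the list of socket errors reported for it."""
    peers = {}
    for m in MESH_ERROR.finditer(log_text):
        ip = ".".join(m.group(1, 2, 3, 4))
        peers.setdefault(ip, []).append(m.group(5))
    return peers

sample = (
    "bird: Mesh_10_1_1_136: Socket error: bind: Address not available\n"
    "bird: Mesh_10_1_1_212: Socket error: bind: Address not available\n"
    "bird: Mesh_10_1_1_14: Socket error: bind: Address not available\n"
)

for ip, errors in failing_peers(sample).items():
    print(ip, errors)
```

In practice the input would be the saved output of the calico-node pod's logs rather than the inline sample.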

So my questions are:

  • what's happening?
  • what can I do to prevent this calico pod from erroring?

Thank you very much in advance. Cheers.

EDIT I've found these lines in the calico-node logs:

2021-05-24 22:43:26.881 [INFO][49] monitor-addresses/startup.go 576: Node IPv4 changed, will check for conflicts
2021-05-24 22:43:26.907 [WARNING][49] monitor-addresses/startup.go 1107: IPv4 address has changed. This could happen if there are multiple nodes with the same name. node="ip-10-1-1-198.eu-west-1.compute.internal" original="10.1.1.198" updated="192.168.0.1"
2021-05-24 22:43:26.936 [INFO][45] confd/client.go 877: Recompute BGP peerings: HostBGPConfig(node=ip-10-1-1-198.eu-west-1.compute.internal; name=ip_addr_v4) updated; HostBGPConfig(node=ip-10-1-1-198.eu-west-1.compute.internal; name=network_v4) updated
2021-05-24 22:43:26.937 [INFO][53] felix/int_dataplane.go 1325: Received *proto.HostMetadataUpdate update from calculation graph msg=hostname:"ip-10-1-1-198.eu-west-1.compute.internal" ipv4_addr:"192.168.0.1" 
2021-05-24 22:43:26.937 [INFO][53] felix/int_dataplane.go 1453: Applying dataplane updates
2021-05-24 22:43:26.937 [INFO][53] felix/ipip_mgr.go 222: All-hosts IP set out-of sync, refreshing it.
2021-05-24 22:43:26.937 [INFO][53] felix/ipsets.go 119: Queueing IP set for creation family="inet" setID="all-hosts-net" setType="hash:net"
2021-05-24 22:43:26.941 [INFO][49] monitor-addresses/startup.go 308: Updated node IP addresses
2021-05-24 22:43:26.951 [INFO][53] felix/ipsets.go 749: Doing full IP set rewrite family="inet" numMembersInPendingReplace=4 setID="all-hosts-net"
bird: Mesh_10_1_1_212: Received: Peer de-configured
bird: Mesh_10_1_1_212: State changed to stop
bird: Mesh_10_1_1_212: State changed to down
bird: Mesh_10_1_1_212: Starting
bird: Mesh_10_1_1_212: State changed to start
2021-05-24 22:43:26.971 [INFO][53] felix/int_dataplane.go 1467: Finished applying updates to dataplane. msecToApply=33.685021
bird: Reconfiguration requested by SIGHUP
bird: Reconfiguring
bird: device1: Reconfigured
bird: direct1: Reconfigured
bird: Reconfigured
2021-05-24 22:43:26.984 [INFO][45] confd/resource.go 277: Target config /etc/calico/confd/config/bird6.cfg has been updated due to change in key: /calico/bgp/v1/host
bird: Reconfiguration requested by SIGHUP
bird: Reconfiguring
bird: device1: Reconfigured
bird: direct1: Reconfigured
bird: Restarting protocol Mesh_10_1_1_136
bird: Mesh_10_1_1_136: Shutting down
bird: Mesh_10_1_1_136: State changed to stop
bird: Restarting protocol Mesh_10_1_1_14
bird: Mesh_10_1_1_14: Shutting down
bird: Mesh_10_1_1_14: State changed to stop
bird: Restarting protocol Mesh_10_1_1_212
bird: Mesh_10_1_1_212: Shutting down
bird: Mesh_10_1_1_212: State changed to stop
bird: Mesh_10_1_1_212: State changed to down
bird: Mesh_10_1_1_212: Initializing
bird: Mesh_10_1_1_212: Starting
bird: Mesh_10_1_1_212: State changed to start
bird: Mesh_10_1_1_136: State changed to down
bird: Mesh_10_1_1_136: Initializing
bird: Mesh_10_1_1_136: Starting
bird: Mesh_10_1_1_136: State changed to start
bird: Mesh_10_1_1_14: State changed to down
bird: Mesh_10_1_1_14: Initializing
bird: Mesh_10_1_1_14: Starting
bird: Mesh_10_1_1_14: State changed to start
bird: Reconfigured
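The line that stands out to me is the WARNING where the node address goes from 10.1.1.198 to 192.168.0.1. To find every such change in a longer log, a small Python sketch like the following could be used. The regex is based only on the warning line quoted above, so other Calico versions may word the message differently, and `find_address_changes` is just a name I picked.

```python
import re

# Pattern derived from the single WARNING line quoted above (an assumption,
# not an official log format specification).
ADDR_CHANGE = re.compile(
    r'IPv4 address has changed.*?node="(?P<node>[^"]+)"\s+'
    r'original="(?P<original>[^"]+)"\s+updated="(?P<updated>[^"]+)"'
)

def find_address_changes(log_text):
    """Return (node, original, updated) tuples, one per address-change warning."""
    return [
        (m.group("node"), m.group("original"), m.group("updated"))
        for m in ADDR_CHANGE.finditer(log_text)
    ]

sample = (
    '2021-05-24 22:43:26.907 [WARNING][49] monitor-addresses/startup.go 1107: '
    'IPv4 address has changed. This could happen if there are multiple nodes '
    'with the same name. node="ip-10-1-1-198.eu-west-1.compute.internal" '
    'original="10.1.1.198" updated="192.168.0.1"'
)

for node, original, updated in find_address_changes(sample):
    print(f"{node}: {original} -> {updated}")
```

Comparing the timestamps of these changes with the pipeline start times would show whether the address change always coincides with the job.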

EDIT 2 The GitLab step that triggers this error is the following:

integrationtesting:
  tags:
    - kubernetes
  image: docker/compose:alpine-1.29.2
  stage: tests
  before_script:
    - echo "NPM_TOKEN=$NPM_TOKEN" > test_integ/dependencies/.env
    - docker-compose -f test_integ/dependencies/docker-compose.yaml up --build -d
  script:
    - docker-compose -f test_integ/dependencies/tester-compose.yaml up --build --abort-on-container-exit --exit-code-from tester
  after_script:
    - docker-compose -f test_integ/dependencies/docker-compose.yaml -f test_integ/dependencies/tester-compose.yaml down

With: docker-compose.yaml

version: "3.9"
networks:
  testinteg:
    name: testinteg
services:
  mongosrv:
    container_name: "mongosrv"
    image: mongo
    networks:
      - testinteg
  users:
    container_name: "users"
    build:
      context: "../.."
      dockerfile: "Dockerfile"
      target: run
      args:
        NPM_TOKEN: "${NPM_TOKEN}"
      network: host
    environment:
      NODE_ENV: "dev"
      PORT: 80
      LOG_LEVEL: "debug"
      LOG_FORMAT: "splat,simple"
      PASSWORD_JWT_SECRET: "anothersecurestring"
      PASSWORD_JWT_TTL: "30s"
      SSL_ENABLED: "false"
      MOCK_DB: "false"
      MONGO_DB: "users"
      MONGO_HOST: "mongosrv"
    depends_on:
      - mongosrv
    networks:
      - testinteg

tester-compose.yaml

version: "3.9"
networks:
  testinteg:
    name: testinteg
services:
  tester:
    container_name: "tester"
    build:
      context: "../.."
      dockerfile: "Dockerfile"
      target: testinteg
      args:
        NPM_TOKEN: "${NPM_TOKEN}"
      network: host
    environment:
      MSHOST: users
      MSPORT: 80
    volumes:
      - ../tests:/app/test_integ/tests
    networks:
      - testinteg

Final information: the Dockerfile runs npm ci against a private npm registry on JFrog Artifactory. Without the network: host option in the build section, it fails to resolve the registry's domain (a Docker-in-Docker issue).

-- Nicolas Espiau
amazon-ec2
calico
gitlab
gitlab-ci-runner
kubernetes