Running a Kubernetes kubeadm cluster behind a corporate firewall/proxy server

4/5/2019

We have a 5-node cluster that was moved behind our corporate firewall/proxy server.

As per the directions here: setting-up-standalone-kubernetes-cluster-behind-corporate-proxy

I set the proxy server environment variables using:

export http_proxy=http://proxy-host:proxy-port/
export HTTP_PROXY=$http_proxy
export https_proxy=$http_proxy
export HTTPS_PROXY=$http_proxy
printf -v lan '%s,' localip_of_machine
printf -v pool '%s,' 192.168.0.{1..253}
printf -v service '%s,' 10.96.0.{1..253}
export no_proxy="${lan%,},${service%,},${pool%,},127.0.0.1";
export NO_PROXY=$no_proxy
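One thing worth noting: shell exports like these only affect the current session, and the container runtime generally needs the same values in its own environment for image pulls. A minimal sketch, assuming Docker running under systemd (the drop-in path is the conventional one; paste in the same no_proxy list built above):

# /etc/systemd/system/docker.service.d/http-proxy.conf
[Service]
Environment="HTTP_PROXY=http://proxy-host:proxy-port/"
Environment="HTTPS_PROXY=http://proxy-host:proxy-port/"
Environment="NO_PROXY=<same list as $no_proxy above>"

followed by a daemon reload and a restart of Docker:

sudo systemctl daemon-reload
sudo systemctl restart docker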

Now everything in our cluster works internally. However, when I try to create a pod that pulls down an image from the outside, the pod is stuck on ContainerCreating, e.g.,

[gms@thalia0 ~]$ kubectl apply -f https://k8s.io/examples/admin/dns/busybox.yaml
pod/busybox created

is stuck here:

[gms@thalia0 ~]$ kubectl get pods
NAME                            READY   STATUS              RESTARTS   AGE
busybox                         0/1     ContainerCreating   0          17m

I assume this is due to the host/domain that the image is being pulled from not being in our corporate proxy rules. We do have rules for

k8s.io
kubernetes.io
docker.io
docker.com

so, I'm not sure what other hosts/domains need to be added.
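A quick way to check which registry an example manifest actually pulls from (a sketch; the Docker Hub hostnames in the comment are the usual ones, not something I have verified against our proxy):

# See which image the manifest references
curl -sL https://k8s.io/examples/admin/dns/busybox.yaml | grep 'image:'

# An unqualified image name resolves to Docker Hub, so the pull itself
# typically talks to registry-1.docker.io and auth.docker.io, not just
# docker.io -- those hostnames may need proxy rules of their own.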

I ran a describe pods on busybox and see references to node.kubernetes.io (I am putting in a domain-wide exception for *.kubernetes.io, which will hopefully suffice).

This is what I get from kubectl describe pods busybox:

Volumes:
  default-token-2kfbw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-2kfbw
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age   From                          Message
  ----     ------                  ----  ----                          -------
  Normal   Scheduled               73s   default-scheduler             Successfully assigned default/busybox to thalia3.ahc.umn.edu
  Warning  FailedCreatePodSandBox  10s   kubelet, thalia3.ahc.umn.edu  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "6af48c5dadf6937f9747943603a3951bfaf25fe1e714cb0b0cbd4ff2d59aa918" network for pod "busybox": NetworkPlugin cni failed to set up pod "busybox_default" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout, failed to clean up sandbox container "6af48c5dadf6937f9747943603a3951bfaf25fe1e714cb0b0cbd4ff2d59aa918" network for pod "busybox": NetworkPlugin cni failed to teardown pod "busybox_default" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout]
  Normal   SandboxChanged          10s   kubelet, thalia3.ahc.umn.edu  Pod sandbox changed, it will be killed and re-created.

I would assume the calico error is due to this:

Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s

The calico and coredns pods seem to have similar errors reaching node.kubernetes.io, so I would assume this is due to our server not being able to pull down the new images on a restart.
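To separate an image-pull problem from a network problem, the events on those kube-system pods should say which it is; something like this (substitute a real pod name from the first command's output):

kubectl -n kube-system get pods -o wide
kubectl -n kube-system describe pod <calico-node-pod-name> | tail -20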

-- horcle_buzz
corporate-policy
docker
http-proxy
kubernetes

3 Answers

4/5/2019

It doesn't seem like you have any problem pulling the image, as you would otherwise see an ImagePullBackOff status. (Although that may come later, after the error message you are seeing.)

The error you are seeing from your pods is related to them not being able to connect to the kube-apiserver internally. It looks like a timeout, so most likely there's something wrong with the kubernetes service in your default namespace. You can check it like this, for example:

$ kubectl -n default get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   2d20h

It could be that it's missing(?). You can always re-create it:

$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  labels:
    component: apiserver
    provider: kubernetes
  name: kubernetes
  namespace: default
spec:
  clusterIP: 10.96.0.1
  type: ClusterIP
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: 443
EOF
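If the service does exist, it's also worth confirming that it has endpoints pointing at the apiserver (a quick check; the exact address will depend on your cluster):

$ kubectl -n default get endpoints kubernetes

which should list the apiserver's host IP and port (typically port 6443).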

The toleration is basically saying that the pod can tolerate being scheduled on a node that has the node.kubernetes.io/not-ready:NoExecute and node.kubernetes.io/unreachable:NoExecute taints, but your error doesn't look like it's related to that.
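For reference, those default tolerations are added automatically and look roughly like this in the pod spec (the key is just an identifier; nothing is contacted at that domain):

tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300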

-- Rico
Source: StackOverflow

4/6/2019

This issue normally means the Docker daemon is unable to respond.

If any other service is consuming a lot of CPU or I/O, this issue can occur.
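A quick way to check whether the daemon is healthy or starved for resources (a rough sketch, assuming a systemd-managed Docker):

systemctl status docker
docker info
top    # look for processes hogging CPU or for high iowait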

-- Akash Sharma
Source: StackOverflow

4/6/2019

It looks like you are misunderstanding a few Kubernetes concepts that I'd like to help clarify here. References to node.kubernetes.io are not an attempt to make any network calls to that domain. It is simply the convention Kubernetes uses to specify string keys. So if you ever have to apply labels, annotations, or tolerations, you would define your own keys like subdomain.domain.tld/some-key.
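For example, a label with a domain-prefixed key (a made-up key here) is just a namespaced string; nothing is ever resolved or contacted at that domain:

metadata:
  labels:
    example.com/owner: platform-team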

As for the Calico issue that you are experiencing, it looks like the error:

network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout]

is our culprit here. 10.96.0.1 is the IP address used to refer to the Kubernetes API server within pods. It seems like the calico/node pod running on your node is failing to reach the API server. Could you give more context around how you set up Calico? Do you know what version of Calico you are running?
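If it helps, the running version can usually be read off the calico/node image tag (assuming the standard manifest, where the DaemonSet is named calico-node):

kubectl -n kube-system get daemonset calico-node \
  -o jsonpath='{.spec.template.spec.containers[0].image}'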

The fact that your calico/node instance is trying to access the crd.projectcalico.org/v1/clusterinformations resource tells me that it is using the Kubernetes datastore for its backend. Are you sure you're not trying to run Calico in etcd mode?
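One way to confirm which datastore the node agent is configured for (again assuming the standard calico-node DaemonSet) is to look at its DATASTORE_TYPE environment variable:

kubectl -n kube-system get daemonset calico-node -o yaml | grep -A1 DATASTORE_TYPE

A value of kubernetes means the Kubernetes API datastore; etcdv3 means etcd mode.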

-- brianSan
Source: StackOverflow