We have a 5-node cluster that was moved behind our corporate firewall/proxy server.
As per the directions here: setting-up-standalone-kubernetes-cluster-behind-corporate-proxy
I set the proxy server environment variables using:
export http_proxy=http://proxy-host:proxy-port/
export HTTP_PROXY=$http_proxy
export https_proxy=$http_proxy
export HTTPS_PROXY=$http_proxy
printf -v lan '%s,' localip_of_machine
printf -v pool '%s,' 192.168.0.{1..253}
printf -v service '%s,' 10.96.0.{1..253}
export no_proxy="${lan%,},${service%,},${pool%,},127.0.0.1";
export NO_PROXY=$no_proxy
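For reference, the expanded values can be sanity-checked in the current shell with something like:
# Print the proxy-related variables to confirm the no_proxy list came out as intended.
$ env | grep -i proxy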
Now everything in our cluster works internally. However, when I try to create a pod that pulls down an image from the outside, the pod is stuck on ContainerCreating, e.g.,
[gms@thalia0 ~]$ kubectl apply -f https://k8s.io/examples/admin/dns/busybox.yaml
pod/busybox created
is stuck here:
[gms@thalia0 ~]$ kubectl get pods
NAME READY STATUS RESTARTS AGE
busybox 0/1 ContainerCreating 0 17m
I assume this is due to the host/domain that the image is being pulled from not being in our corporate proxy rules. We do have rules for
k8s.io
kubernetes.io
docker.io
docker.com
so, I'm not sure what other hosts/domains need to be added.
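As a quick sanity check (assuming Docker is our container runtime), pulling the image manually on one of the nodes should show whether the proxy rules cover the registry the image actually comes from:
# Note the Docker daemon uses its own proxy configuration (e.g. a systemd drop-in),
# not the shell exports above.
$ docker pull busybox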
I did a describe pods for busybox and see a reference to node.kubernetes.io (I am putting in a domain-wide exception for *.kubernetes.io, which will hopefully suffice).
This is what I get from kubectl describe pods busybox:
Volumes:
default-token-2kfbw:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-2kfbw
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 73s default-scheduler Successfully assigned default/busybox to thalia3.ahc.umn.edu
Warning FailedCreatePodSandBox 10s kubelet, thalia3.ahc.umn.edu Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "6af48c5dadf6937f9747943603a3951bfaf25fe1e714cb0b0cbd4ff2d59aa918" network for pod "busybox": NetworkPlugin cni failed to set up pod "busybox_default" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout, failed to clean up sandbox container "6af48c5dadf6937f9747943603a3951bfaf25fe1e714cb0b0cbd4ff2d59aa918" network for pod "busybox": NetworkPlugin cni failed to teardown pod "busybox_default" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout]
Normal SandboxChanged 10s kubelet, thalia3.ahc.umn.edu Pod sandbox changed, it will be killed and re-created.
I would assume the calico error is due to this:
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
The calico and coredns pods seem to have similar errors reaching node.kubernetes.io, so I would assume this is due to our server not being able to pull down the new images on a restart.
It doesn't seem like you have a problem pulling the image, since in that case you would see an ImagePullBackOff status (although that may come later, after the error message you are seeing).
The error you are seeing from your pods is related to them not being able to connect to the kube-apiserver internally. It looks like a timeout, so most likely there's something wrong with the kubernetes service in your default namespace. You can check it like this, for example:
$ kubectl -n default get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 2d20h
It could be that it's missing. You can always re-create it:
$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  labels:
    component: apiserver
    provider: kubernetes
  name: kubernetes
  namespace: default
spec:
  clusterIP: 10.96.0.1
  type: ClusterIP
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: 443
EOF
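Independently of that, it's worth checking whether 10.96.0.1 is reachable from the node at all, since the CNI error is a plain TCP timeout to that address. A quick check (assuming curl is available on the node; --noproxy '*' keeps the request from being routed through the corporate proxy, which is also why the service CIDR needs to be in no_proxy):
# Any HTTP response (even a 403) proves basic connectivity; a hang or timeout
# reproduces the same failure the CNI plugin is hitting.
$ curl -k --noproxy '*' https://10.96.0.1:443/version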
The toleration is basically saying that the pod can tolerate being scheduled on a node that has the node.kubernetes.io/not-ready:NoExecute and node.kubernetes.io/unreachable:NoExecute taints, but your error doesn't look like it is related to that.
This issue normally means the Docker daemon is unable to respond. If any other service is consuming a lot of CPU or I/O, this issue might occur.
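For example, you can quickly check whether the daemon is responsive and whether something is hogging the node (assuming Docker is the runtime and is managed by systemd):
# Is the Docker daemon up and answering?
$ systemctl status docker
$ docker info
# Is anything consuming unusual amounts of CPU or I/O?
$ top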
It looks like you are misunderstanding a few Kubernetes concepts that I'd like to help clarify here. References to node.kubernetes.io are not an attempt to make any network calls to that domain. It is simply the convention that Kubernetes uses to specify string keys. So if you ever have to apply labels, annotations, or tolerations, you would define your own keys like subdomain.domain.tld/some-key.
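For example (my-company.example/rack is a purely hypothetical key, just to illustrate the naming convention), labeling a node with such a key never triggers a DNS lookup or network call to that domain:
# The domain part only namespaces the key; nothing is resolved or contacted.
$ kubectl label node thalia3.ahc.umn.edu my-company.example/rack=r1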
As for the Calico issue that you are experiencing, the error:
network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout
is our culprit here. 10.96.0.1 is the IP address used to refer to the Kubernetes API server within pods. It seems like the calico/node pod running on your node is failing to reach the API server. Could you provide more context around how you set up Calico? Do you know what version of Calico you are running?
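If you're not sure about the version, the image tag on the calico-node DaemonSet usually tells you (assuming the default DaemonSet name and namespace from the Calico manifests):
$ kubectl -n kube-system get daemonset calico-node \
    -o jsonpath='{.spec.template.spec.containers[0].image}'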
The fact that your calico/node instance is trying to access the crd.projectcalico.org/v1/clusterinformations resource tells me that it is using the Kubernetes datastore for its backend. Are you sure you're not trying to run Calico in etcd mode?
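One way to confirm the datastore mode is to look at the DATASTORE_TYPE environment variable on the calico-node containers (again assuming the default DaemonSet name and namespace):
# A value of 'kubernetes' means the Kubernetes API datastore; 'etcdv3' means etcd mode.
$ kubectl -n kube-system get daemonset calico-node -o yaml | grep -A1 DATASTORE_TYPE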