I am getting the error "No nodes are available that match all of the predicates: MatchNodeSelector (7), PodToleratesNodeTaints (1)" for kube-state-metrics. Please guide me on how to troubleshoot this issue.
admin@ip-172-20-58-79:~/kubernetes-prometheus$ kubectl describe po -n kube-system kube-state-metrics-747bcc4d7d-kfn7t
Events:
Type     Reason            Age               From               Message
----     ------            ----              ----               -------
Warning  FailedScheduling  3s (x20 over 4m)  default-scheduler  No nodes are available that match all of the predicates: MatchNodeSelector (7), PodToleratesNodeTaints (1).
Is this issue related to memory on a node? If so, how do I confirm it? I checked all the nodes; only one seems to be above 80% memory usage, and the rest are between 45% and 70%.
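For what it is worth, the two predicate names point away from memory: MatchNodeSelector (7) means seven nodes were rejected because their labels do not satisfy the pod's nodeSelector, and PodToleratesNodeTaints (1) means one node has a taint the pod does not tolerate; genuine memory pressure would instead surface as an "Insufficient memory" scheduling event. A sketch of commands to check each of these, assuming normal kubectl access:

$ kubectl get nodes --show-labels                             # compare node labels against the pod's nodeSelector
$ kubectl describe nodes | grep -E "^Name:|Taints:"           # list taints per node
$ kubectl describe nodes | grep -A 5 "Allocated resources"    # actual memory/CPU allocation per node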
The following screenshot shows kube-state-metrics (0/1 up):
Furthermore, Prometheus shows kubernetes-pods (0/0 up). Is that due to kube-state-metrics not working, or is there another reason? And why is kubernetes-apiservers (0/1 up), also seen in the screenshot above, not up? How do I troubleshoot it?
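(A side note on kubernetes-pods (0/0 up): with the widely used community example scrape config, that job only discovers pods that opt in through annotations, so 0/0 usually means no pod carries those annotations rather than a kube-state-metrics failure. This assumes that scrape config is the one in use. A minimal sketch of the opt-in annotations on a pod template:

  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"   # opt the pod in to scraping
        prometheus.io/port: "8080"     # port that serves /metrics
)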
admin@ip-172-20-58-79:~/kubernetes-prometheus$ sudo tail -f /var/log/kube-apiserver.log | grep error
I0110 10:15:37.153827 7 logs.go:41] http: TLS handshake error from 172.20.44.75:60828: remote error: tls: bad certificate
I0110 10:15:42.153543 7 logs.go:41] http: TLS handshake error from 172.20.44.75:60854: remote error: tls: bad certificate
I0110 10:15:47.153699 7 logs.go:41] http: TLS handshake error from 172.20.44.75:60898: remote error: tls: bad certificate
I0110 10:15:52.153788 7 logs.go:41] http: TLS handshake error from 172.20.44.75:60936: remote error: tls: bad certificate
I0110 10:15:57.154014 7 logs.go:41] http: TLS handshake error from 172.20.44.75:60992: remote error: tls: bad certificate
E0110 10:15:58.929167 7 status.go:62] apiserver received an error that is not an metav1.Status: write tcp 172.20.58.79:443->172.20.42.187:58104: write: connection reset by peer
E0110 10:15:58.931574 7 status.go:62] apiserver received an error that is not an metav1.Status: write tcp 172.20.58.79:443->172.20.42.187:58098: write: connection reset by peer
E0110 10:15:58.933864 7 status.go:62] apiserver received an error that is not an metav1.Status: write tcp 172.20.58.79:443->172.20.42.187:58088: write: connection reset by peer
E0110 10:16:00.842018 7 status.go:62] apiserver received an error that is not an metav1.Status: write tcp 172.20.58.79:443->172.20.42.187:58064: write: connection reset by peer
E0110 10:16:00.844301 7 status.go:62] apiserver received an error that is not an metav1.Status: write tcp 172.20.58.79:443->172.20.42.187:58058: write: connection reset by peer
E0110 10:18:17.275590 7 status.go:62] apiserver received an error that is not an metav1.Status: write tcp 172.20.58.79:443->172.20.44.75:37402: write: connection reset by peer
E0110 10:18:17.275705 7 runtime.go:66] Observed a panic: &errors.errorString{s:"kill connection/stream"} (kill connection/stream)
E0110 10:18:17.276401 7 runtime.go:66] Observed a panic: &errors.errorString{s:"kill connection/stream"} (kill connection/stream)
E0110 10:18:17.277808 7 status.go:62] apiserver received an error that is not an metav1.Status: write tcp 172.20.58.79:443->172.20.44.75:37392: write: connection reset by peer
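The repeated "bad certificate" handshake failures above suggest that clients are rejecting the apiserver's serving certificate. One way to inspect which addresses the certificate actually covers, assuming openssl is available on the host (a sketch):

$ echo | openssl s_client -connect 172.20.58.79:443 2>/dev/null \
    | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"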
Update after MaggieO's reply:
admin@ip-172-20-58-79:~/kubernetes-prometheus/kube-state-metrics-configs$ cat deployment.yaml
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v1.8.0
  name: kube-state-metrics
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: v1.8.0
    spec:
      containers:
      - image: quay.io/coreos/kube-state-metrics:v1.8.0
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          name: http-metrics
        - containerPort: 8081
          name: telemetry
        readinessProbe:
          httpGet:
            path: /
            port: 8081
          initialDelaySeconds: 5
          timeoutSeconds: 5
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: kube-state-metrics
Furthermore, I want to add this command to the above deployment.yaml, but I am getting an indentation error, so please help me with where exactly I should add it:
command:
- /metrics-server
- --kubelet-insecure-tls
- --kubelet-preferred-address-types=InternalIP
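For placement: command is a per-container field, so it has to sit at the same indentation level as image and name inside the containers list. Also note that /metrics-server and these flags belong to the metrics-server binary, not to kube-state-metrics, so they go in the metrics-server deployment (as the answer below also suggests). A sketch, where the image line is an assumed example to adjust for your cluster:

    spec:
      containers:
      - name: metrics-server
        image: k8s.gcr.io/metrics-server-amd64:v0.3.6   # example image, adjust as needed
        command:
        - /metrics-server
        - --kubelet-insecure-tls
        - --kubelet-preferred-address-types=InternalIP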
Update 2: @MaggieO even after adding the command/args, it is showing the same error and the pod is in the Pending state:
Updated deployment.yaml:
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"labels":{"app.kubernetes.io/name":"kube-state-metrics","app.kubernetes.io/version":"v1.8.0"},"name":"kube-state-metrics","namespace":"kube-system"},"spec":{"replicas":1,"selector":{"matchLabels":{"app.kubernetes.io/name":"kube-state-metrics"}},"template":{"metadata":{"labels":{"app.kubernetes.io/name":"kube-state-metrics","app.kubernetes.io/version":"v1.8.0"}},"spec":{"containers":[{"args":["--kubelet-insecure-tls","--kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname"],"image":"quay.io/coreos/kube-state-metrics:v1.8.0","imagePullPolicy":"Always","livenessProbe":{"httpGet":{"path":"/healthz","port":8080},"initialDelaySeconds":5,"timeoutSeconds":5},"name":"kube-state-metrics","ports":[{"containerPort":8080,"name":"http-metrics"},{"containerPort":8081,"name":"telemetry"}],"readinessProbe":{"httpGet":{"path":"/","port":8081},"initialDelaySeconds":5,"timeoutSeconds":5}}],"nodeSelector":{"kubernetes.io/os":"linux"},"serviceAccountName":"kube-state-metrics"}}}}
  creationTimestamp: 2020-01-10T05:33:13Z
  generation: 4
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v1.8.0
  name: kube-state-metrics
  namespace: kube-system
  resourceVersion: "178851301"
  selfLink: /apis/extensions/v1beta1/namespaces/kube-system/deployments/kube-state-metrics
  uid: b20aa645-336a-11ea-9618-0607d7cb72ed
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: v1.8.0
    spec:
      containers:
      - args:
        - --kubelet-insecure-tls
        - --kubelet-preferred-address-types=InternalIP
        image: quay.io/coreos/kube-state-metrics:v1.8.0
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          name: http-metrics
          protocol: TCP
        - containerPort: 8081
          name: telemetry
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: kube-state-metrics
      serviceAccountName: kube-state-metrics
      terminationGracePeriodSeconds: 30
status:
  conditions:
  - lastTransitionTime: 2020-01-10T05:33:13Z
    lastUpdateTime: 2020-01-10T05:33:13Z
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: 2020-01-15T07:24:27Z
    lastUpdateTime: 2020-01-15T07:29:12Z
    message: ReplicaSet "kube-state-metrics-7f8c9c6c8d" is progressing.
    reason: ReplicaSetUpdated
    status: "True"
    type: Progressing
  observedGeneration: 4
  replicas: 2
  unavailableReplicas: 2
  updatedReplicas: 1
Update 3: The pod is not able to get a node, as shown in the following screenshot. Let me know how to troubleshoot this issue.
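One check worth making for the MatchNodeSelector (7) part of the error: the deployment's nodeSelector is kubernetes.io/os: linux, but on clusters older than roughly v1.14 nodes only carry the beta.kubernetes.io/os label, so that selector matches no node (an assumption about the cluster version, worth verifying). The two labels can be compared side by side:

$ kubectl get nodes -L kubernetes.io/os -L beta.kubernetes.io/os

If the column for kubernetes.io/os comes back empty, either change the nodeSelector to beta.kubernetes.io/os or add the label to the nodes.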
The error on kubernetes-apiservers, Get https:// ...: x509: certificate is valid for 100.64.0.1, 127.0.0.1, not 172.20.58.79,
means that control-plane nodes are targeted randomly and the apiEndpoint only changes when a node is deleted from the cluster, so the problem is not immediately noticeable: it only surfaces when the cluster's nodes change.
Workaround fix: manually synchronize kube-apiserver.pem between the master nodes and restart the kube-apiserver container.
You can also remove the apiserver.* and apiserver-kubelet-client.* certificates and recreate them with the following commands:
$ kubeadm init phase certs apiserver --config=/etc/kubernetes/kubeadm-config.yaml
$ kubeadm init phase certs apiserver-kubelet-client --config=/etc/kubernetes/kubeadm-config.yaml
$ systemctl stop kubelet
# delete the Docker containers that the kubelet was running
$ systemctl restart kubelet
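To verify that the recreated certificate now covers the node's address, the SANs can be inspected (assuming the default kubeadm certificate path):

$ openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text \
    | grep -A1 "Subject Alternative Name"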
Similar problems: x509 certificate, kubelet-x509.
Then solve the problem with the metrics server.
Change the metrics-server-deployment.yaml file and set the following command and args:
command:
- /metrics-server
- --kubelet-insecure-tls
- --kubelet-preferred-address-types=InternalIP
The metrics-server is now able to talk to the node (it was failing before because it could not resolve the node's hostname).
More information can be found here: metrics-server-issue.
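A quick way to confirm the fix took effect (a sketch): check the metrics-server logs for remaining TLS or resolution errors, then see whether node metrics come back.

$ kubectl -n kube-system logs deployment/metrics-server
$ kubectl top nodes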