We are on Kubernetes v1.13.10 with ~500 nodes in the cluster. Recently I've started getting DiskPressure alerts from the masters. After some checks we found that the cause was the kube-scheduler logs: they grew to ~20GB each, and there can be 5 of them, while the master instance had only 80GB of disk space.
Logrotate is configured to run every hour with delayed compression (default kops settings). The logs are mostly filled with messages like this:
E0929 00:34:27.778731 1 predicates.go:1277] Node not found, ip-10-0-0-1.ec2.internal
E0929 00:34:27.778734 1 predicates.go:1277] Node not found, ip-10-0-0-1.ec2.internal
E0929 00:34:27.778738 1 predicates.go:1277] Node not found, ip-10-0-0-1.ec2.internal
E0929 00:34:27.778742 1 predicates.go:1277] Node not found, ip-10-0-0-1.ec2.internal
E0929 00:34:27.782052 1 predicates.go:1277] Node not found, ip-10-0-0-1.ec2.internal
E0929 00:34:27.782068 1 predicates.go:1277] Node not found, ip-10-0-0-1.ec2.internal
E0929 00:34:27.782073 1 predicates.go:1277] Node not found, ip-10-0-0-1.ec2.internal
E0929 00:34:27.782079 1 predicates.go:1277] Node not found, ip-10-0-0-1.ec2.internal
E0929 00:34:27.782083 1 predicates.go:1277] Node not found, ip-10-0-0-1.ec2.internal
I've increased the disk size on the master, but why are there so many error messages? Generating 20GB of logs in one hour seems a bit extreme. How can I avoid it?
The message you are seeing was recently changed by the developers from "Node not found, %v" to "Pod %s has NodeName %q but node is not found". The new wording makes the cause clearer: a pod is still bound to a node that no longer exists.
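To confirm this, you can check whether the API server still has the node object and which pods are still bound to it. A minimal sketch, using the node name ip-10-0-0-1.ec2.internal from your log excerpt:

# does the API server still know about the node?
kubectl get node ip-10-0-0-1.ec2.internal

# which pods are still bound to that node?
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=ip-10-0-0-1.ec2.internal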
The best way to fix this would be to delete the stale node object using kubectl delete node <node_name>, and if that doesn't work, to remove its entry from etcd using etcdctl. That lets the scheduler move the pod to another node, which should stop the flood of error messages and keep the log size down.
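For example, a sketch of both steps. The etcd part is an assumption about the default Kubernetes storage layout (nodes are stored under /registry/minions/), and the endpoint/certificate flags for etcdctl are omitted since they depend on how your kops cluster runs etcd:

# delete the stale node object via the API server
kubectl delete node ip-10-0-0-1.ec2.internal

# if the object keeps coming back, inspect the node keys stored in etcd
ETCDCTL_API=3 etcdctl get /registry/minions/ --prefix --keys-only

# and remove the entry for the missing node
ETCDCTL_API=3 etcdctl del /registry/minions/ip-10-0-0-1.ec2.internal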
Please let me know if that helped.