I'm running a Kubernetes 1.2.3 cluster on AWS (set up with kube-up) on two m4.large nodes, and I'm using the automatically installed influxdb-grafana pod for cluster monitoring.
My problem is that after a week or two, the InfluxDB container dies and will not come back up. I'm not sure which logs to check for relevant error messages, but the syslog on the minion running the container contained the following:
Jun 16 05:57:41 ip-172-22-29-244 kubelet[4434]: E0616 05:57:41.382751 4434 event.go:193] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"monitoring-influxdb-grafana-v3-dlx9o.145604121bcf8ade", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"407635", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"monitoring-influxdb-grafana-v3-dlx9o", UID:"07c2a623-2b57-11e6-b7a9-068c6a09a769", APIVersion:"v1", ResourceVersion:"850776", FieldPath:""}, Reason:"FailedSync", Message:"Error syncing pod, skipping: failed to \"StartContainer\" for \"influxdb\" with CrashLoopBackOff: \"Back-off 5m0s restarting failed container=influxdb pod=monitoring-influxdb-grafana-v3-dlx9o_kube-system(07c2a623-2b57-11e6-b7a9-068c6a09a769)\"\n", Source:api.EventSource{Component:"kubelet", Host:"ip-172-22-29-244.eu-west-1.compute.internal"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63600960004, nsec:0, loc:(*time.Location)(0x2e38da0)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63601653461, nsec:379098581, loc:(*time.Location)(0x2e38da0)}}, Count:11023, Type:"Warning"}': 'events "monitoring-influxdb-grafana-v3-dlx9o.145604121bcf8ade" not found' (will not retry!)
Jun 16 05:57:54 ip-172-22-29-244 kubelet[4434]: I0616 05:57:54.378491 4434 manager.go:2050] Back-off 5m0s restarting failed container=influxdb pod=monitoring-influxdb-grafana-v3-dlx9o_kube-system(07c2a623-2b57-11e6-b7a9-068c6a09a769)
Jun 16 05:57:54 ip-172-22-29-244 kubelet[4434]: E0616 05:57:54.378545 4434 pod_workers.go:138] Error syncing pod 07c2a623-2b57-11e6-b7a9-068c6a09a769, skipping: failed to "StartContainer" for "influxdb" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=influxdb pod=monitoring-influxdb-grafana-v3-dlx9o_kube-system(07c2a623-2b57-11e6-b7a9-068c6a09a769)"
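For reference, the obvious checks I know of (using the pod and container names from the events above, so adjust to your own pod) would be something like:

    # Restart count, last termination state and recent events for the pod
    kubectl describe pod monitoring-influxdb-grafana-v3-dlx9o --namespace=kube-system

    # Logs from the current and the previously crashed influxdb container
    kubectl logs monitoring-influxdb-grafana-v3-dlx9o -c influxdb --namespace=kube-system
    kubectl logs monitoring-influxdb-grafana-v3-dlx9o -c influxdb --namespace=kube-system --previous

    # Cluster events, which is where the CrashLoopBackOff / FailedSync messages show up
    kubectl get events --namespace=kube-system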
I've also seen indications that the container was originally OOM-killed. My assumption is that the InfluxDB index simply grows too large over time since there is no automatic cleanup, the container is killed by Kubernetes once it breaches the 500 MB memory limit from the manifest, and it then fails to restart for the same reason, or because it times out while reading the index back in.
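To double-check the OOM part of that theory, I believe the last termination reason should show up in the "Last State" section of kubectl describe pod, or it can be confirmed directly on the node against whatever container ID Docker currently shows for influxdb (the ID below is a placeholder):

    # On the minion: did Docker record an OOM kill for the influxdb container?
    docker ps -a | grep influxdb
    docker inspect --format '{{.State.OOMKilled}} exit={{.State.ExitCode}}' <influxdb-container-id>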
Once this happens, the only way I've been able to get it running again is to delete the pod entirely and have Kubernetes re-create it from scratch, which means losing all existing monitoring data.
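Concretely, the recovery step is just deleting the pod and letting the monitoring-influxdb-grafana-v3 replication controller spin up a fresh copy:

    # Delete the pod; the replication controller re-creates it, but the InfluxDB data is gone
    kubectl delete pod monitoring-influxdb-grafana-v3-dlx9o --namespace=kube-system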
But what can I do about it? Changing the memory limits on kube-system pods seems to be non-trivial, and would probably only buy me a few more days anyway. I could set up my own watchdog to clean up old data, but only being able to keep one or two weeks of monitoring data rather limits its value.
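For what it's worth, the watchdog I have in mind would essentially just shorten the retention, assuming the bundled InfluxDB is a 0.9+ release with retention policy support and that Heapster writes into its default k8s database via the monitoring-influxdb service (both of which are guesses on my part that I'd still have to verify):

    # Keep only two weeks of samples; service/database names are assumptions, adjust to
    # wherever the InfluxDB HTTP API (port 8086) is actually reachable in your cluster
    curl -G 'http://monitoring-influxdb.kube-system:8086/query' \
         --data-urlencode 'q=ALTER RETENTION POLICY "default" ON "k8s" DURATION 14d REPLICATION 1 DEFAULT'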