A few days ago we updated our GKE cluster to version 1.4.5. Everything seems to work as expected, but I'm observing strange behavior on one of the cluster nodes.
If I run top I can see a high wa (I/O wait) percentage:
top - 21:32:09 up 6 days, 9:48, 1 user, load average: 1.67, 1.66, 1.64
Tasks: 124 total, 1 running, 123 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.4 us, 4.0 sy, 0.0 ni, 0.0 id, 93.6 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 3801000 total, 3599260 used, 201740 free, 167304 buffers
KiB Swap: 0 total, 0 used, 0 free, 2056748 cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20296 root 20 0 723m 184m 0 S 2.0 5.0 28:20.27 google-fluentd
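To keep an eye on whether the iowait stays pinned this high, I've been parsing saved top snapshots with a small helper (a sketch; the snapshot filename is just a placeholder for wherever you save `top -bn1` output):

```shell
# Hedged sketch: extract the "wa" (iowait) percentage from a saved
# `top -bn1` snapshot, to track whether it stays above ~90%.
iowait_pct() {
  # $1 = file containing a `top -bn1` snapshot
  grep '^%Cpu(s):' "$1" | sed 's/.* \([0-9.]*\) wa.*/\1/'
}
```

For the snapshot above, `iowait_pct` would report 93.6.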
If I run iotop I can see that google-fluentd is reading heavily (note there are no disk writes at all, only reads):
Total DISK READ: 20.96 M/s | Total DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
20298 be/4 root 2.89 M/s 0.00 B/s 0.00 % 49.48 % ruby /usr/sbin/google-fluentd -q
20329 be/4 root 6.76 M/s 0.00 B/s 0.00 % 33.28 % ruby /usr/sbin/google-fluentd -q
20331 be/4 root 3.60 M/s 0.00 B/s 0.00 % 21.26 % ruby /usr/sbin/google-fluentd -q
20334 be/4 root 2.95 M/s 0.00 B/s 0.00 % 16.17 % ruby /usr/sbin/google-fluentd -q
20350 be/4 root 1455.94 K/s 0.00 B/s 0.00 % 14.73 % ruby /usr/sbin/google-fluentd -q
20335 be/4 root 908.98 K/s 0.00 B/s 0.00 % 7.88 % ruby /usr/sbin/google-fluentd -q
20336 be/4 root 1794.35 K/s 0.00 B/s 0.00 % 7.23 % ruby /usr/sbin/google-fluentd -q
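To see which files those threads are actually reading, I checked the process's open file descriptors via /proc (a hedged sketch, not definitive; PID 20296 is the google-fluentd PID from the top output, and I'd expect mostly container log files under /var/log/containers that fluentd tails):

```shell
# Hedged sketch: list the real filesystem paths a process has open,
# by resolving each fd symlink under /proc/<pid>/fd and dropping
# sockets/pipes (whose readlink targets don't start with "/").
open_files() {
  # $1 = PID of the process to inspect
  for fd in /proc/"$1"/fd/*; do
    readlink "$fd"
  done | grep '^/' | sort -u
}
```

On the node I run this as `open_files 20296` (as root, since google-fluentd runs as root).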
These are the fluentd pod logs:
2016-11-22 16:44:50 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-11-22 16:13:48 +0000 error_class="Faraday::ConnectionFailed" error="end of file reached" plugin_id="object:20fcccc"
2016-11-22 16:46:46 +0000 [warn]: suppressed same stacktrace
2016-11-22 17:15:21 +0000 [warn]: retry succeeded. plugin_id="object:20fcccc"
2016-11-22 17:15:25 +0000 [warn]: retry succeeded. plugin_id="object:20fcccc"
2016-11-22 17:14:57 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-11-22 16:43:52 +0000 error_class="Faraday::ConnectionFailed" error="end of file reached" plugin_id="object:20fcccc"
2016-11-22 17:17:15 +0000 [warn]: suppressed same stacktrace
2016-11-22 17:44:11 +0000 [warn]: retry succeeded. plugin_id="object:20fcccc"
2016-11-22 17:44:20 +0000 [warn]: retry succeeded. plugin_id="object:20fcccc"
2016-11-22 17:44:33 +0000 [warn]: retry succeeded. plugin_id="object:20fcccc"
2016-11-22 17:43:37 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-11-22 17:14:00 +0000 error_class="Faraday::ConnectionFailed" error="end of file reached" plugin_id="object:20fcccc"
2016-11-22 17:44:32 +0000 [warn]: retry succeeded. plugin_id="object:20fcccc"
2016-11-22 17:45:30 +0000 [warn]: suppressed same stacktrace
2016-11-22 18:12:34 +0000 [warn]: retry succeeded. plugin_id="object:20fcccc"
2016-11-22 18:12:58 +0000 [warn]: retry succeeded. plugin_id="object:20fcccc"
2016-11-22 18:12:20 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-11-22 17:42:49 +0000 error_class="Faraday::ConnectionFailed" error="end of file reached" plugin_id="object:20fcccc"
2016-11-22 18:14:43 +0000 [warn]: suppressed same stacktrace
2016-11-22 18:43:27 +0000 [warn]: retry succeeded. plugin_id="object:20fcccc"
2016-11-22 18:43:23 +0000 [warn]: retry succeeded. plugin_id="object:20fcccc"
2016-11-22 18:43:36 +0000 [warn]: retry succeeded. plugin_id="object:20fcccc"
2016-11-22 18:42:39 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-11-22 18:11:12 +0000 error_class="Faraday::ConnectionFailed" error="end of file reached" plugin_id="object:20fcccc"
2016-11-22 18:43:45 +0000 [warn]: retry succeeded. plugin_id="object:20fcccc"
2016-11-22 18:44:45 +0000 [warn]: suppressed same stacktrace
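The pattern of repeated `Faraday::ConnectionFailed` / "end of file reached" warnings followed by retries made me want to quantify how often flushes fail versus succeed. This is a hedged helper I use for that (the log filename is a placeholder for a log saved with something like `kubectl logs <fluentd-pod> --namespace=kube-system > fluentd.log`; the pod name is elided):

```shell
# Hedged helper: count buffer-flush failures vs. successful retries
# in a saved fluentd log file.
flush_summary() {
  # $1 = path to the saved fluentd log
  printf 'failed flushes:     %s\n' "$(grep -c 'temporarily failed to flush' "$1")"
  printf 'successful retries: %s\n' "$(grep -c 'retry succeeded' "$1")"
}
```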
I have already restarted the fluentd pod several times.
On our Kubernetes cluster, Stackdriver Logging is enabled, as you can see here:
Stackdriver Logging Enabled
Stackdriver Monitoring Disabled
Any idea what's happening?