google-fluentd stresses our cluster node

11/22/2016

A few days ago we updated our GKE cluster to version 1.4.5. Everything seems to work as expected, but I'm observing strange behavior on one of the cluster nodes.

If I run top, I can see a high I/O wait (wa) percentage:

top - 21:32:09 up 6 days,  9:48,  1 user,  load average: 1.67, 1.66, 1.64
Tasks: 124 total,   1 running, 123 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.4 us,  4.0 sy,  0.0 ni,  0.0 id, 93.6 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   3801000 total,  3599260 used,   201740 free,   167304 buffers
KiB Swap:        0 total,        0 used,        0 free,  2056748 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
20296 root      20   0  723m 184m    0 S   2.0  5.0  28:20.27 google-fluentd

If I run iotop, I can see that google-fluentd is doing a lot of disk reads:

Total DISK READ:      20.96 M/s | Total DISK WRITE:       0.00 B/s

TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
20298 be/4 root        2.89 M/s    0.00 B/s  0.00 % 49.48 % ruby /usr/sbin/google-fluentd -q
20329 be/4 root        6.76 M/s    0.00 B/s  0.00 % 33.28 % ruby /usr/sbin/google-fluentd -q
20331 be/4 root        3.60 M/s    0.00 B/s  0.00 % 21.26 % ruby /usr/sbin/google-fluentd -q
20334 be/4 root        2.95 M/s    0.00 B/s  0.00 % 16.17 % ruby /usr/sbin/google-fluentd -q
20350 be/4 root     1455.94 K/s    0.00 B/s  0.00 % 14.73 % ruby /usr/sbin/google-fluentd -q
20335 be/4 root      908.98 K/s    0.00 B/s  0.00 %  7.88 % ruby /usr/sbin/google-fluentd -q
20336 be/4 root     1794.35 K/s    0.00 B/s  0.00 %  7.23 % ruby /usr/sbin/google-fluentd -q
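As a quick sanity check (values copied from the iotop output above), the per-process read rates of the google-fluentd workers sum to almost the entire 20.96 M/s total for the node:

```python
# Sum the per-process DISK READ rates reported by iotop above and compare
# with the ~20.96 M/s node total. Values are copied from the output.
reads_mb = [2.89, 6.76, 3.60, 2.95]      # rows reported in M/s
reads_kb = [1455.94, 908.98, 1794.35]    # rows reported in K/s
total_mb = sum(reads_mb) + sum(reads_kb) / 1024
print(round(total_mb, 2))  # → 20.26 M/s, i.e. fluentd accounts for nearly all reads
```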

These are the fluentd pod's logs:

2016-11-22 16:44:50 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-11-22 16:13:48 +0000 error_class="Faraday::ConnectionFailed" error="end of file reached" plugin_id="object:20fcccc"
2016-11-22 16:46:46 +0000 [warn]: suppressed same stacktrace
2016-11-22 17:15:21 +0000 [warn]: retry succeeded. plugin_id="object:20fcccc"
2016-11-22 17:15:25 +0000 [warn]: retry succeeded. plugin_id="object:20fcccc"
2016-11-22 17:14:57 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-11-22 16:43:52 +0000 error_class="Faraday::ConnectionFailed" error="end of file reached" plugin_id="object:20fcccc"
2016-11-22 17:17:15 +0000 [warn]: suppressed same stacktrace
2016-11-22 17:44:11 +0000 [warn]: retry succeeded. plugin_id="object:20fcccc"
2016-11-22 17:44:20 +0000 [warn]: retry succeeded. plugin_id="object:20fcccc"
2016-11-22 17:44:33 +0000 [warn]: retry succeeded. plugin_id="object:20fcccc"
2016-11-22 17:43:37 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-11-22 17:14:00 +0000 error_class="Faraday::ConnectionFailed" error="end of file reached" plugin_id="object:20fcccc"
2016-11-22 17:44:32 +0000 [warn]: retry succeeded. plugin_id="object:20fcccc"
2016-11-22 17:45:30 +0000 [warn]: suppressed same stacktrace
2016-11-22 18:12:34 +0000 [warn]: retry succeeded. plugin_id="object:20fcccc"
2016-11-22 18:12:58 +0000 [warn]: retry succeeded. plugin_id="object:20fcccc"
2016-11-22 18:12:20 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-11-22 17:42:49 +0000 error_class="Faraday::ConnectionFailed" error="end of file reached" plugin_id="object:20fcccc"
2016-11-22 18:14:43 +0000 [warn]: suppressed same stacktrace
2016-11-22 18:43:27 +0000 [warn]: retry succeeded. plugin_id="object:20fcccc"
2016-11-22 18:43:23 +0000 [warn]: retry succeeded. plugin_id="object:20fcccc"
2016-11-22 18:43:36 +0000 [warn]: retry succeeded. plugin_id="object:20fcccc"
2016-11-22 18:42:39 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-11-22 18:11:12 +0000 error_class="Faraday::ConnectionFailed" error="end of file reached" plugin_id="object:20fcccc"
2016-11-22 18:43:45 +0000 [warn]: retry succeeded. plugin_id="object:20fcccc"
2016-11-22 18:44:45 +0000 [warn]: suppressed same stacktrace
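For context on the roughly half-hour gaps between the "temporarily failed to flush the buffer" warnings: fluentd retries failed flushes with exponential backoff, doubling the delay after each failure. The sketch below assumes the stock default retry_wait of 1 s; the actual google-fluentd configuration on the node may override this.

```python
# Sketch of fluentd-style exponential backoff between flush retries.
# Assumes default retry_wait = 1 s with plain doubling per failed attempt;
# the google-fluentd config on the node may use different settings.
retry_wait = 1.0
delays = [retry_wait * 2 ** n for n in range(12)]  # delay before attempts 1..12
print(delays[-1])       # 2048.0 seconds
print(delays[-1] / 60)  # roughly half an hour, close to the gaps in the logs
```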

I already restarted the fluentd pod several times.

On our Kubernetes cluster, Stackdriver Logging is enabled, as you can see here:

Stackdriver Logging Enabled

Stackdriver Monitoring Disabled

Any idea what's happening?

-- Salvador González González
fluentd
google-kubernetes-engine

0 Answers