I have a WebSocket server that hoards memory over several days, to the point that Kubernetes eventually kills it. We monitor it using prometheus-net.
# dotnet --info
Host (useful for support):
Version: 2.1.6
Commit: 3f4f8eebd8
.NET Core SDKs installed:
No SDKs were found.
.NET Core runtimes installed:
Microsoft.AspNetCore.All 2.1.6 [/usr/share/dotnet/shared/Microsoft.AspNetCore.All]
Microsoft.AspNetCore.App 2.1.6 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.NETCore.App 2.1.6 [/usr/share/dotnet/shared/Microsoft.NETCore.App]
But when I connect remotely and take a memory dump (using createdump), the memory suddenly drops... without the service stopping, restarting or losing any connected user. See the green line in the picture.
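For reference, the dump was taken with the createdump tool that ships next to the shared runtime; a minimal sketch of the invocation, where the path matches this container's layout and the PID is a placeholder:
# /usr/share/dotnet/shared/Microsoft.NETCore.App/2.1.6/createdump <pid>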
I can see in the graphs that the GC is collecting regularly in all generations.
Server GC is disabled using:
<PropertyGroup>
<ServerGarbageCollection>false</ServerGarbageCollection>
</PropertyGroup>
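For reference, the same switch ends up in the generated *.runtimeconfig.json (this is the standard mapping of the ServerGarbageCollection property), so it can be verified on a deployed build:
{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.Server": false
    }
  }
}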
Before disabling server GC, the service grew memory much faster; now it takes about two weeks to reach 512 MB.
Other ASP.NET Core services that work in a request/response fashion do not show this problem. This one uses WebSockets, where each connection usually lasts around 10 minutes... so I guess everything related to a connection easily survives into Gen 2.
Note that there are two pods showing the same behaviour, and one of them (the green one) suddenly drops in memory usage when the memory dump is taken.
The pods did not restart while the memory dump was being taken:
No connections were lost or re-established.
Heap:
(lldb) eeheap -gc
Number of GC Heaps: 1
generation 0 starts at 0x00007F8481C8D0B0
generation 1 starts at 0x00007F8481C7E820
generation 2 starts at 0x00007F852A1D7000
ephemeral segment allocation context: none
segment begin allocated size
00007F852A1D6000 00007F852A1D7000 00007F853A1D5E90 0xfffee90(268430992)
00007F84807D0000 00007F84807D1000 00007F8482278000 0x1aa7000(27947008)
Large object heap starts at 0x00007F853A1D7000
segment begin allocated size
00007F853A1D6000 00007F853A1D7000 00007F853A7C60F8 0x5ef0f8(6222072)
Total Size: Size: 0x12094f88 (302600072) bytes.
------------------------------
GC Heap Size: Size: 0x12094f88 (302600072) bytes.
(lldb)
Free objects:
(lldb) dumpheap -type Free -stat
Statistics:
MT Count TotalSize Class Name
00000000010c52b0 219774 10740482 Free
Total 219774 objects
Is there any explanation for this behaviour?
The problem was the connection to RabbitMQ. Because we were using short-lived channels, the "auto-reconnect" feature of RabbitMQ.Client was keeping a lot of state about dead channels. We switched this feature off, since we do not need the "perks" of automatic recovery, and everything started working normally. It was a pain to track down: we basically had to set up a Windows deployment and run the usual memory analysis process with Windows tools (JetBrains dotMemory in this case). Using lldb is not productive at all.
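For completeness, this is roughly the shape of the change; the connection settings below are placeholders, but AutomaticRecoveryEnabled and TopologyRecoveryEnabled are the RabbitMQ.Client switches involved:
using RabbitMQ.Client;

var factory = new ConnectionFactory
{
    HostName = "rabbitmq",            // placeholder host
    // Turn off the client-side recovery features that, in our case,
    // kept state for every dead channel.
    AutomaticRecoveryEnabled = false,
    TopologyRecoveryEnabled = false
};

using (var connection = factory.CreateConnection())
using (var channel = connection.CreateModel())
{
    // short-lived channel work goes here
}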
Disclaimer: I am no .NET Wizard.
But you should do two things to follow Kubernetes best practices:
Define sensible resource limits for your app. If the app does not need more than 200 MB of memory, define a resource limit to prevent it from consuming all available host memory. Be aware, though, that the classic Unix APIs for querying available memory are not cgroup-aware: they report the host's memory regardless of the limits set on the process's cgroup. (See the sketch after the next point.)
Tell your app what this resource limit is. It seems like your app does not "feel the need" to free memory because, from its point of view, there is plenty. Almost all applications and frameworks have a switch to define the maximum memory to be consumed. Tell your app this limit and it will "see" memory pressure and perform a full GC (which I guess could be the problem here).
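As a sketch of the first point, this is roughly what the container section of the pod spec would look like; the name, image, and values are placeholders to adapt to your workload:
containers:
- name: websocket-server            # placeholder name
  image: example/websocket-server   # placeholder image
  resources:
    requests:
      memory: "256Mi"
    limits:
      memory: "512Mi"               # this becomes the cgroup memory limit
Regarding the second point: newer .NET Core runtimes (3.0 and later) derive a GC heap limit from this cgroup value by default; as far as I know, on 2.1 that support is much more limited, so you may need to configure the limit in the app or framework explicitly.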