I've deployed metrics-server in my K8s cluster (v1.15). I gather this is the standard way to perform simple memory utilization checks.

I have a pod that contains multiple processes (wrapped with dumb-init for process-reaping purposes), and I want to know the exact current memory usage of that pod. The output of kube-capacity --util --pods:
NODE NAMESPACE POD CPU REQUESTS CPU LIMITS CPU UTIL MEMORY REQUESTS MEMORY LIMITS MEMORY UTIL
sj-k8s1 kube-system kube-apiserver-sj-k8s1 250m (6%) 0m (0%) 77m (1%) 0Mi (0%) 0Mi (0%) 207Mi (2%)
...
sj-k8s3 salt-provisioning salt-master-7dcf7cfb6c-l8tth 0m (0%) 0m (0%) 220m (5%) 1536Mi (19%) 3072Mi (39%) 1580Mi (20%)
This shows that the salt-master pod currently uses ~1.6Gi and kube-apiserver uses ~200Mi.
However, running the following on sj-k8s3 (summing the RSS column of the ps output):
ps aux | awk '$12 ~ /salt-master/ {sum += $6} END {print sum}'
prints:
2051208
which is ~2Gi. The output of /sys/fs/cgroup/memory/memory.stat:
cache 173740032
rss 1523937280
rss_huge 0
shmem 331776
mapped_file 53248
dirty 4096
writeback 0
pgpgin 34692690
pgpgout 34278218
pgfault 212566509
pgmajfault 6
inactive_anon 331776
active_anon 1523916800
inactive_file 155201536
active_file 18206720
unevictable 0
hierarchical_memory_limit 2147483648
total_cache 173740032
total_rss 1523937280
total_rss_huge 0
total_shmem 331776
total_mapped_file 53248
total_dirty 4096
total_writeback 0
total_pgpgin 34692690
total_pgpgout 34278218
total_pgfault 212566509
total_pgmajfault 6
total_inactive_anon 331776
total_active_anon 1523916800
total_inactive_file 155201536
total_active_file 18206720
total_unevictable 0
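For a rough cross-check against the 1580Mi that kube-capacity reports, rss plus cache from this dump comes out near 1.6Gi. A minimal sketch, assuming it is run inside the container (where this path refers to the pod's own cgroup; on the node the path would include the kubepods/QoS hierarchy):

```
# Sum rss + cache from the cgroup stats and print the result in MiB
awk '/^total_rss /{r=$2} /^total_cache /{c=$2} END {printf "%.0f MiB\n", (r+c)/1024/1024}' \
  /sys/fs/cgroup/memory/memory.stat
# -> ~1619 MiB, in the same ballpark as the reported 1580Mi
```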
This pod actually contains two Docker containers, so the actual sum of RSS is 2296688, which is even bigger: ~2.2Gi.
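For reference, the conversion (ps reports RSS in KiB):

```
# KiB -> GiB
echo "scale=2; 2296688 / 1024 / 1024" | bc   # 2.19
```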
On the apiserver node, just running ps aux reveals that the process RSS is 447948. The output of /sys/fs/cgroup/memory/memory.stat:
cache 78499840
rss 391188480
rss_huge 12582912
shmem 0
mapped_file 69423104
dirty 0
writeback 0
pgpgin 111883
pgpgout 1812
pgfault 100603
pgmajfault 624
inactive_anon 0
active_anon 215531520
inactive_file 253870080
active_file 270336
unevictable 0
hierarchical_memory_limit 8361357312
total_cache 78499840
total_rss 391188480
total_rss_huge 12582912
total_shmem 0
total_mapped_file 69423104
total_dirty 0
total_writeback 0
total_pgpgin 111883
total_pgpgout 1812
total_pgfault 100603
total_pgmajfault 624
total_inactive_anon 0
total_active_anon 215531520
total_inactive_file 253870080
total_active_file 270336
total_unevictable 0
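Interestingly, for this pod, (cache + rss) - inactive_file lands almost exactly on the reported utilization:

```
# (cache + rss) - inactive_file from the dump above, in MiB (integer arithmetic)
echo $(( (78499840 + 391188480 - 253870080) / 1024 / 1024 ))   # 205 -- vs the 207Mi reported
```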
Could someone explain why the reported pod memory utilization differs from a simple ps sum by almost 40% (and for the apiserver process by over 100%)?

EDIT: I've updated the reported memory values to include the output of /sys/fs/cgroup/memory/memory.stat, which seems to correspond (give or take) to the pod utilization reported by kube-capacity.

As suggested in the first comment: does that mean the difference is shared memory only (reported by ps, but not by the pod metrics/cgroup)? The difference is pretty big.
ps does not reflect the actual amount of memory used by the application, only the memory reserved for it. It can be very misleading if pages are shared by several processes or if dynamically linked libraries are used.

Understanding memory usage on Linux is a very good article describing how memory usage in Linux works and what ps is actually reporting.
Why ps is "wrong"

Depending on how you look at it, ps is not reporting the real memory usage of processes. What it is really doing is showing how much real memory each process would take up if it were the only process running. Of course, a typical Linux machine has several dozen processes running at any given time, which means that the VSZ and RSS numbers reported by ps are almost definitely wrong.
That is why ps should not be relied on for detailed memory-consumption data.
An alternative to ps is smem. It reports physical memory usage, taking shared memory pages into account. Unshared memory is reported as the USS (Unique Set Size), so you can use USS when you want to ignore shared memory.
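A minimal sketch, assuming smem is installed on the node (the salt-master pattern is just this question's example; -P filters by command-name regex, -t appends a totals row, -k prints human-readable units):

```
smem -t -k -P salt-master
```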
The unshared memory (USS) plus the process's proportion of shared memory is reported as the PSS (Proportional Set Size). In other words, PSS adds to USS a share of the process's shared memory: the shared amount divided by the number of processes sharing it.
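For example, suppose a process has 40 MiB of private memory and maps a 30 MiB library that is shared with two other processes: its USS is 40 MiB, its RSS is 70 MiB, and its PSS is 40 + 30/3 = 50 MiB.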
On the other hand, RSS (Resident Set Size) is the amount of shared memory plus unshared memory used by each process. If any processes share memory, summing their RSS over-reports the amount of memory actually used, because the shared pages are counted once per process.
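You can see the gap directly from /proc. A quick sketch, assuming a kernel new enough (4.14+) to expose smaps_rollup, again using this question's salt-master processes as the example:

```
# Print per-process RSS vs PSS; PSS will be lower wherever pages are shared
for pid in $(pgrep -f salt-master); do
  printf '== PID %s ==\n' "$pid"
  grep -E '^(Rss|Pss):' "/proc/$pid/smaps_rollup"
done
```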
Linux uses copy-on-write, a resource-management technique for implementing a duplicate or copy operation efficiently. So when you have a parent and a child process, both will show the same RSS; with copy-on-write, Linux ensures that both processes really use the same physical memory until one of them writes to it.
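A quick way to see this from a shell (a sketch; exact RSS values will vary): fork the current shell and compare the RSS of parent and child. Right after the fork, the child's pages are all shared copy-on-write with the parent, so ps reports nearly identical RSS for both even though only one copy exists in physical memory.

```
# Fork a subshell (a copy-on-write copy of this bash process) and compare RSS
( while true; do sleep 1; done ) &
ps -o pid,ppid,rss,comm -p "$$,$!"
# (clean up the background loop afterwards with: kill $!)
```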