Discrepancy between POD memory utilization and RSS from the node's ps

3/23/2020

I've deployed metrics-server in my K8s cluster (ver. 1.15).
I gather this is the standard way to perform simple memory utilization checks.

I have a POD that contains multiple processes (wrapped with dumb-init for process-reaping purposes).

I want to know the exact current memory usage of my POD.

The output of kube-capacity --util --pods:

NODE      NAMESPACE           POD                               CPU REQUESTS   CPU LIMITS   CPU UTIL      MEMORY REQUESTS   MEMORY LIMITS   MEMORY UTIL
sj-k8s1   kube-system         kube-apiserver-sj-k8s1            250m (6%)      0m (0%)      77m (1%)      0Mi (0%)          0Mi (0%)        207Mi (2%)

...
sj-k8s3   salt-provisioning   salt-master-7dcf7cfb6c-l8tth      0m (0%)        0m (0%)      220m (5%)     1536Mi (19%)      3072Mi (39%)    1580Mi (20%)

This shows that the salt-master POD currently uses ~1.6Gi and kube-apiserver uses ~200Mi.
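(For a quick cross-check of the same metrics-server data that kube-capacity reads, kubectl top should show matching numbers; the namespace and pod name are taken from the listing above:)

kubectl top pod salt-master-7dcf7cfb6c-l8tth -n salt-provisioning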

However, running ps aux | awk '$12 ~ /salt-master/ {sum += $6} END {print sum}' on sj-k8s3 (summing the RSS column from the ps output) gives:

2051208

That is ~2Gi. The output of /sys/fs/cgroup/memory/memory.stat for this POD's cgroup:

cache 173740032
rss 1523937280
rss_huge 0
shmem 331776
mapped_file 53248
dirty 4096
writeback 0
pgpgin 34692690
pgpgout 34278218
pgfault 212566509
pgmajfault 6
inactive_anon 331776
active_anon 1523916800
inactive_file 155201536
active_file 18206720
unevictable 0
hierarchical_memory_limit 2147483648
total_cache 173740032
total_rss 1523937280
total_rss_huge 0
total_shmem 331776
total_mapped_file 53248
total_dirty 4096
total_writeback 0
total_pgpgin 34692690
total_pgpgout 34278218
total_pgfault 212566509
total_pgmajfault 6
total_inactive_anon 331776
total_active_anon 1523916800
total_inactive_file 155201536
total_active_file 18206720
total_unevictable 0

This POD actually contains two Docker containers, so the actual sum of RSS is:

2296688

which is even bigger: ~2.3Gi.

On the apiserver node, running ps aux shows that the process RSS is 447948. The output of /sys/fs/cgroup/memory/memory.stat:

cache 78499840
rss 391188480
rss_huge 12582912
shmem 0
mapped_file 69423104
dirty 0
writeback 0
pgpgin 111883
pgpgout 1812
pgfault 100603
pgmajfault 624
inactive_anon 0
active_anon 215531520
inactive_file 253870080
active_file 270336
unevictable 0
hierarchical_memory_limit 8361357312
total_cache 78499840
total_rss 391188480
total_rss_huge 12582912
total_shmem 0
total_mapped_file 69423104
total_dirty 0
total_writeback 0
total_pgpgin 111883
total_pgpgout 1812
total_pgfault 100603
total_pgmajfault 624
total_inactive_anon 0
total_active_anon 215531520
total_inactive_file 253870080
total_active_file 270336
total_unevictable 0

Could someone explain why the reported POD memory utilization differs from a simple ps by almost 40% (and for the apiserver process by more than 100%)?

EDIT: I've updated the reported memory values to include the output of /sys/fs/cgroup/memory/memory.stat, which seems to roughly correspond to the POD utilization reported by kube-capacity.
As suggested in the first comment: does that mean the difference is only shared memory (counted by ps, but not by the POD metrics/cgroup)?
The difference is pretty big.
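(Side note on how the MEMORY UTIL column is computed: metrics-server reports the cgroup "working set", which on cgroup v1 is roughly memory.usage_in_bytes minus total_inactive_file, not a sum of per-process RSS. A rough sketch to reproduce it on the node; the kubepods path below is a placeholder and depends on the pod's QoS class and cgroup driver:)

POD_CG=/sys/fs/cgroup/memory/kubepods/burstable/pod<uid>   # placeholder path, varies per pod
usage=$(cat "$POD_CG/memory.usage_in_bytes")               # total charged memory incl. page cache
inactive=$(awk '/^total_inactive_file / {print $2}' "$POD_CG/memory.stat")
echo "working set: $(( (usage - inactive) / 1024 / 1024 )) MiB"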

-- lakier
kubernetes
memory
metrics

1 Answer

3/25/2020

ps does not reflect the actual amount of memory used by the application, only the memory reserved for it. It can be very misleading if pages are shared by several processes or by dynamically linked libraries.

Understanding memory usage on Linux is a very good article describing how memory usage in Linux works and what ps is actually reporting.

Why ps is "wrong"

Depending on how you look at it, ps is not reporting the real memory usage of processes. What it is really doing is showing how much real memory each process would take up if it were the only process running. Of course, a typical Linux machine has several dozen processes running at any given time, which means that the VSZ and RSS numbers reported by ps are almost definitely wrong.

That is why ps should not be relied on for detailed memory-consumption data.

An alternative to ps is smem. It reports physical memory usage while taking shared memory pages into account. Unshared memory is reported as the USS (Unique Set Size), so you can look at USS when you want to ignore shared memory.
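For example, something like this on the node (assuming smem is installed; the -P regex is just the process name from the question):

# -P filters by command line, -t adds a totals row, -k prints human-readable units
smem -t -k -P salt-master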

The unshared memory (USS) plus the process's proportion of shared memory is reported as the PSS (Proportional Set Size). Basically, it adds the USS to the process's share of the shared memory, i.e. the shared memory divided by the number of processes sharing it.

On the other hand, RSS (Resident Set Size) is the amount of shared memory plus unshared memory used by each process. If any processes share memory, summing their RSS will over-report the memory that is actually used, because the same shared pages are counted once per process.
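To check how much of the ps RSS sum is double-counted shared memory, you can sum PSS instead. A rough sketch, assuming the kernel exposes /proc/<pid>/smaps (values are in kB) and that it is run as root on the node:

# Pss apportions shared pages among the processes that share them
for pid in $(pgrep -f salt-master); do
  awk '/^Pss:/ {s += $2} END {print s}' "/proc/$pid/smaps"
done | awk '{t += $1} END {print t " kB total PSS"}'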

Linux also uses a resource-management technique called copy-on-write to implement duplicate or copy operations efficiently. When you have a parent and a child process, both will show the same RSS, yet with copy-on-write Linux ensures that they are really using the same physical memory.
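You can see the shared-pages effect (the same thing that makes summed RSS misleading) with two identical processes: each reports the shared library pages fully in its RSS, while PSS splits them between the sharers. A small sketch:

# Start two identical processes; each RSS counts the shared libc pages in full,
# while PSS apportions those pages among all processes mapping them.
sleep 300 & P1=$!
sleep 300 & P2=$!
for p in $P1 $P2; do
  rss=$(awk '/^VmRSS:/ {print $2}' /proc/$p/status)
  pss=$(awk '/^Pss:/ {s += $2} END {print s}' /proc/$p/smaps)
  echo "pid $p: RSS=${rss} kB  PSS=${pss} kB"
done
kill $P1 $P2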

-- acid_fuji
Source: StackOverflow