Would same file from various docker images be page-cached in k8s node just once?

1/14/2020

Excerpt from https://docs.docker.com/storage/storagedriver/overlayfs-driver/

Page Caching. OverlayFS supports page cache sharing. Multiple containers accessing the same file share a single page cache entry for that file.

This is in the context of layered docker images, where multiple containers access the same image. In my deployment, however, I see a stable (over time) difference in page cache utilization by the same image running on different but similarly configured nodes of a Kubernetes cluster. Neither node is under cache pressure that could lead to different reclaim rates.

So I was wondering whether "the same file" in the excerpt above could mean sameness verified by a hash, so that the file could actually be part of various docker images?

The difference in question is on the order of 30-60MB, which is consistent with the python/libgc libraries my container uses. It would all make sense if common shared libraries were "deduplicated" in the page cache node-wide at the overlayfs level. The page cache would then be charged to a cgroup on a first-touch basis, as per para 2.3 of https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt

Therefore, the image running on a node where the same python libraries were already used by other docker image(s) would show lower page cache utilization than on a node where those libraries were used by my container only.
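The first-touch accounting can be checked per container by reading the "cache" counter from the cgroup-v1 memory.stat file mentioned above. A minimal sketch, parsing a sample string (on a real node you would read something like /sys/fs/cgroup/memory/kubepods/.../memory.stat; the exact path is an assumption and varies by kubelet cgroup driver):

```python
# Sample contents of a cgroup-v1 memory.stat file (values are made up).
SAMPLE_MEMORY_STAT = """\
cache 52428800
rss 104857600
mapped_file 31457280
"""

def cgroup_page_cache_bytes(stat_text: str) -> int:
    """Return the page-cache bytes charged to this cgroup (the 'cache' line)."""
    for line in stat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "cache":
            return int(value)
    raise KeyError("no 'cache' line in memory.stat")

print(cgroup_page_cache_bytes(SAMPLE_MEMORY_STAT) // (1024 * 1024), "MB")
```

Comparing this counter for the same pod on two nodes is how the 30-60MB gap above shows up: the pod that touched the shared libraries first gets charged; later readers on the same node do not.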

I am aware of the abundant discussion around deduplication, such as https://lwn.net/Articles/636943/ (2015):

Chinner spoke up to describe the problem, which is that there might be a hundred containers running on a system all based on a snapshot of a single root filesystem. That means there will be a hundred copies of glibc in the page cache because they come from different namespaces with different inodes, so there is no sharing of the data.

And no, I am not using KSM, so no need to mention that. I would appreciate having some references to the source code shedding light on this behavior.

-- wick
docker
kernel
kubernetes
linux
memory-management

1 Answer

1/24/2020

Upon further investigation it became clear that content from unrelated containers/pods is shared on a node, which may pose a reasonable security risk.

Each line in a Dockerfile can produce 0, 1, or many layers, as per https://docs.docker.com/storage/storagedriver/. For example, my current build FROM ubuntu gives three base layers in my image:

# docker inspect ubuntu :

"GraphDriver": {
            "Data": {
                "LowerDir": "/var/lib/docker/overlay2/22abb0d6b77061cc1e3a04de4d3c83be15e60b87adebf9b7b2fa9adc0fbb0f2d/diff:/var/lib/docker/overlay2/7ab02c0180d53cfa2f444a10650a688c8cebd0368ddb2cea1dba7f01b2008d37/diff:/var/lib/docker/overlay2/3ee0e4ab0518c76376a4023f7c438cc6a8d28121eba9cdbed9440cfc7474204e/diff",

If I further say RUN apt-get -y install python, docker will create a layer containing all the folders, files, and timestamps produced by that command. The layer is then tar'd, and a sha256 is taken of the tar file. In the ubuntu example above you can see the mezzanine layer has the sha256sum: 3ee0e4ab0518c76376a4023f7c438cc6a8d28121eba9cdbed9440cfc7474204e
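The tar-then-hash step can be sketched as follows. This is an illustration only: docker's actual tar serialization records ownership, whiteouts, and other metadata not reproduced here, so the digest below will not match a real layer ID.

```python
import hashlib
import io
import os
import tarfile
import tempfile

def layer_digest(layer_dir: str) -> str:
    """Tar the layer directory deterministically and return the sha256 of
    the tar bytes (a sketch of the scheme, not docker's exact code)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for root, dirs, files in os.walk(layer_dir):
            dirs.sort()  # walk in a stable order so the digest is stable
            for name in sorted(files):
                path = os.path.join(root, name)
                info = tar.gettarinfo(path, arcname=os.path.relpath(path, layer_dir))
                info.mtime = 0  # normalize timestamps
                with open(path, "rb") as f:
                    tar.addfile(info, f)
    return hashlib.sha256(buf.getvalue()).hexdigest()

with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "hello.txt"), "w") as f:
        f.write("hello\n")
    digest = layer_digest(d)
    print(digest)  # 64 hex characters, same shape as the layer IDs above
```

The point is that the layer's identity is content-derived: two images built from identical layer content end up with the same sha256, which is what makes node-wide sharing possible.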

Once my image is orchestrated by the Kubernetes cluster, the layer is inflated to a standard location on the node where the image runs. A link is created to the layer folder; the only reason for the link is to make the mount paths shorter, as explained here: https://docs.docker.com/storage/storagedriver/overlayfs-driver/. So a node running an image built FROM ubuntu will have something similar to:

# ls -l /var/lib/docker/overlay2/l |grep 3ee0e4ab0518c76
lrwxrwxrwx    1 root     root            72 Dec 13 15:40 VGN2ARTYLKI6LQWXSZSMKUQOQL -> ../3ee0e4ab0518c76376a4023f7c438cc6a8d28121eba9cdbed9440cfc7474204e/diff
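The shape of that link can be reproduced in a few lines. The 26-character name looks like base32 of 16 random bytes with padding stripped; that encoding is my assumption about docker's scheme, not its source code:

```python
import base64
import os
import tempfile

def short_link_id() -> str:
    """A 26-character name like those under /var/lib/docker/overlay2/l
    (base32 of 16 random bytes, '=' padding stripped -- an assumption)."""
    return base64.b32encode(os.urandom(16)).decode("ascii").rstrip("=")

# Simulate overlay2's layout: l/<short-id> -> ../<layer-sha256>/diff
with tempfile.TemporaryDirectory() as root:
    layer = "3ee0e4ab0518c76376a4023f7c438cc6a8d28121eba9cdbed9440cfc7474204e"
    os.makedirs(os.path.join(root, layer, "diff"))
    os.makedirs(os.path.join(root, "l"))
    name = short_link_id()
    os.symlink(os.path.join("..", layer, "diff"), os.path.join(root, "l", name))
    print(name, "->", os.readlink(os.path.join(root, "l", name)))
```

The short name keeps the lowerdir= option of the overlay mount (shown below) under the kernel's mount-option length limit even when a container stacks many layers.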

Note that VGN2ARTYLKI6LQWXSZSMKUQOQL here is (a) unique on the node and (b) specific to the node. This identifier appears in the mounts for containers. The root cgroup sees all mounts on a node, and pid 1 normally belongs to the root cgroup, so the layer in question is shared like so:

# grep `ls -l /var/lib/docker/overlay2/l |grep 3ee0e4ab0518c76 |awk '{print$9}'` /proc/1/mounts

overlay /var/lib/docker/overlay2/84ec5295eb902ab01b37451f9063987f5803a0ff4bc53ee27c1838f783f61f48/merged overlay rw,relatime,lowerdir=
/var/lib/docker/overlay2/l/7RBRYLLCPECAY5IXIQWNNFMT4L:
/var/lib/docker/overlay2/l/LK4X5JGJE327XH6STN6DHMQZUI:
/var/lib/docker/overlay2/l/2RODCFKARIMWO2NUPHVP7HREVF:
/var/lib/docker/overlay2/l/DH43WT4W2DPJTMMKHJL46IPIXM:
/var/lib/docker/overlay2/l/DQBSRPR7QCKCXNT4QQHHC6L2TO:
/var/lib/docker/overlay2/l/N3NL6BAOEKFZYIAXCCFEHMRJC2:
/var/lib/docker/overlay2/l/VGN2ARTYLKI6LQWXSZSMKUQOQL,upperdir=/var/lib/docker/overlay2/84ec5295eb902ab01b37451f9063987f5803a0ff4bc53ee27c1838f783f61f48/diff,workdir=/var/lib/docker/overlay2/84ec5295eb902ab01b37451f9063987f5803a0ff4bc53ee27c1838f783f61f48/work 0 0

overlay /var/lib/docker/overlay2/89ce211716bd81100b99ecacc3c9da7af602029b2724d01db41d5efad37f43e6/merged overlay rw,relatime,lowerdir=
/var/lib/docker/overlay2/l/SQEWZDFCQQX6EKH7IZHSFXKLBN:
/var/lib/docker/overlay2/l/TJFM5IIGAQIKCMA5LDT6X4NUJK:
/var/lib/docker/overlay2/l/DQBSRPR7QCKCXNT4QQHHC6L2TO:
/var/lib/docker/overlay2/l/N3NL6BAOEKFZYIAXCCFEHMRJC2:
/var/lib/docker/overlay2/l/VGN2ARTYLKI6LQWXSZSMKUQOQL,upperdir=/var/lib/docker/overlay2/89ce211716bd81100b99ecacc3c9da7af602029b2724d01db41d5efad37f43e6/diff,workdir=/var/lib/docker/overlay2/89ce211716bd81100b99ecacc3c9da7af602029b2724d01db41d5efad37f43e6/work 0 0

Two overlay mounts mean that the layer is shared between two running containers built from this version of the ubuntu image. Or, more concisely:

# grep `ls -l /var/lib/docker/overlay2/l |grep 3ee0e4ab0518c76 |awk '{print$9}'` /proc/1/mounts |wc -l
2
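The same count can be taken by parsing the mounts table directly: look for the layer's short link name in the lowerdir= option of each overlay mount. A sketch, run here against an abbreviated sample of the two mount lines above (on a real node you would read /proc/1/mounts):

```python
# Abbreviated sample of the two overlay mount lines shown above.
SAMPLE_MOUNTS = """\
overlay /var/lib/docker/overlay2/84ec/merged overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/7RBRYLLCPECAY5IXIQWNNFMT4L:/var/lib/docker/overlay2/l/VGN2ARTYLKI6LQWXSZSMKUQOQL,upperdir=/var/lib/docker/overlay2/84ec/diff,workdir=/var/lib/docker/overlay2/84ec/work 0 0
overlay /var/lib/docker/overlay2/89ce/merged overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/SQEWZDFCQQX6EKH7IZHSFXKLBN:/var/lib/docker/overlay2/l/VGN2ARTYLKI6LQWXSZSMKUQOQL,upperdir=/var/lib/docker/overlay2/89ce/diff,workdir=/var/lib/docker/overlay2/89ce/work 0 0
"""

def containers_sharing_layer(mounts_text: str, link_id: str) -> int:
    """Count overlay mounts whose lowerdir chain includes link_id."""
    count = 0
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) > 3 and fields[2] == "overlay":
            # fields[3] is the comma-separated mount options string
            lower = [o for o in fields[3].split(",") if o.startswith("lowerdir=")]
            if lower and f"/l/{link_id}" in lower[0]:
                count += 1
    return count

print(containers_sharing_layer(SAMPLE_MOUNTS, "VGN2ARTYLKI6LQWXSZSMKUQOQL"))
```

Each hit is one running container whose overlay stack includes the shared layer, so the count tracks how many unrelated containers can hit the same page cache entries.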

This confirms that content is shared between unrelated containers and explains the difference in page cache utilization I saw. Is it a security risk? In theory, an adversary could plant malicious code in ubuntu wrappers and devise a nonce yielding the same sha256 (i.e. a hash collision). Is it a risk in practice? Probably not so much...

-- wick
Source: StackOverflow