We have mounted an SMB share from a Windows server as a PV on a Kubernetes CentOS cluster using this plugin, https://github.com/juliohm1978/kubernetes-cifs-volumedriver, which is basically just a script around mount.cifs. We have also tried both the Microsoft plugin and https://github.com/fstab/cifs, but they do the same thing with mount.cifs.
The mount has been working fine for small projects, but once we started using it for write-intensive workloads we ran into a very strange pattern of just-written files going missing.
We created a simple Java program that writes a random number of files (between 900 and 1000) in a foreach loop. All of the files contain the same random bytes, created with this line:
byte[] data = org.apache.commons.lang3.RandomUtils.nextBytes(1024 * 50);
It then lists the directory where the files were written, and the file count from the listing always comes up short.
The missing files are always at the same positions in the sequence, even though each name contains a random part.
Each file name includes the foreach index plus a random part, so every retry can be checked without interference from the previous one.
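To make the repro concrete, here is a minimal sketch of what the test program does (the class name, paths, and the run-id helper are hypothetical stand-ins, not our exact code; only the RandomUtils.nextBytes line above is verbatim):

import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import org.apache.commons.lang3.RandomStringUtils;
import org.apache.commons.lang3.RandomUtils;

public class SmbCacheTest {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get("/fileserver/_smb-cache-test");     // CIFS share mounted inside the pod (hypothetical path)
        int total = 900 + RandomUtils.nextInt(0, 101);            // random total between 900 and 1000
        String runId = RandomStringUtils.randomAlphanumeric(32);  // random part shared by all files of this run
        byte[] data = RandomUtils.nextBytes(1024 * 50);           // same 50 KB of random bytes for every file

        // Write the files, remembering every name we wrote.
        List<String> written = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            String name = String.format("smb-cache-test-%04d-%s.tmp", i, runId);
            Files.write(dir.resolve(name), data);
            written.add(name);
        }

        // List the directory and diff it against what we just wrote.
        Set<String> listed = new HashSet<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                listed.add(p.getFileName().toString());
            }
        }
        System.out.println("count files=" + listed.size());
        written.stream()
               .filter(name -> !listed.contains(name))
               .forEach(name -> System.out.println("missing file=" + name));
    }
}

Running this against the CIFS-backed PV produces output like the runs below.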
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:27:24.331 start write file total=971
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:27:39.771 end write file total=971
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:27:40.041 count files=937
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:27:40.043 missing files=34
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:27:40.044 missing file=smb-cache-test-0025-yUaWG4aYTIrqFPBE93WZXzgnmBBy5Wl4.tmp
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:27:40.044 missing file=smb-cache-test-0051-yUaWG4aYTIrqFPBE93WZXzgnmBBy5Wl4.tmp
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:27:40.044 missing file=smb-cache-test-0077-yUaWG4aYTIrqFPBE93WZXzgnmBBy5Wl4.tmp
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:27:40.044 missing file=smb-cache-test-0109-yUaWG4aYTIrqFPBE93WZXzgnmBBy5Wl4.tmp
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:27:40.044 missing file=smb-cache-test-0135-yUaWG4aYTIrqFPBE93WZXzgnmBBy5Wl4.tmp
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:27:40.044 missing file=smb-cache-test-0161-yUaWG4aYTIrqFPBE93WZXzgnmBBy5Wl4.tmp
[...]
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:30:16.113 start write file total=995
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:30:30.808 end write file total=995
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:30:31.065 count files=960
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:30:31.066 missing files=35
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:30:31.066 missing file=smb-cache-test-0025-hjvVQG6JdnC0KBI5xfsBldZkCHWZQ0Fr.tmp
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:30:31.066 missing file=smb-cache-test-0051-hjvVQG6JdnC0KBI5xfsBldZkCHWZQ0Fr.tmp
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:30:31.066 missing file=smb-cache-test-0077-hjvVQG6JdnC0KBI5xfsBldZkCHWZQ0Fr.tmp
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:30:31.066 missing file=smb-cache-test-0109-hjvVQG6JdnC0KBI5xfsBldZkCHWZQ0Fr.tmp
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:30:31.067 missing file=smb-cache-test-0135-hjvVQG6JdnC0KBI5xfsBldZkCHWZQ0Fr.tmp
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:30:31.067 missing file=smb-cache-test-0161-hjvVQG6JdnC0KBI5xfsBldZkCHWZQ0Fr.tmp
[...]
We have inspected the mounted path inside the pod and it shows the same problem, while the same mount accessed on the node inside the flexVolume mount directories is fine, so it is not just Java; it happens at the pod OS level. Here is an example of the file count on both mounts for the same run; the first is from inside the pod, the second from the node hosting the pod:
/fileserver # pwd
/fileserver
/fileserver # ls -l _smb-cache-test/ | wc -l
938
[root@k8s-node-03 wind3-speech-file-extractor-pre-pv]# pwd
/var/lib/kubelet/pods/b4dd4252-9492-11e8-8796-000c299d5d24/volumes/juliohm~cifs/wind3-speech-file-extractor-pre-pv
[root@k8s-node-03 wind3-speech-file-extractor-pre-pv]# ls -l _smb-cache-test/ | wc -l
972
This looks like either Docker or Kubernetes doing something VERY fishy in the "hidden" layers between the mount created by flexVolume on the node and the very same mount in the pod.
Additional info: we have verified it is not an SMB cache issue. We also mounted the share manually on the node and added it to the pod as a "local" PersistentVolume instead of a flexVolume, and the issue was NOT there anymore.
Thanks
It turns out the problem was caused by using an Alpine image as the base for Java on a CentOS node. Changing to a CentOS-based image solved the issue.
We tried this because we found a hint about the same kind of disappearing files with Ubuntu and SMB (with no Docker involved) and thought Alpine might be affected by the same issue, being somewhat closer to Ubuntu than to RHEL/CentOS:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1572132
By the way, the solutions proposed in that thread, using noserverino and 0777 access rights, did not solve our issue.