Kubernetes CIFS/SMB flexvolume is missing files

7/31/2018

We have mounted an SMB share from a Windows server on a Kubernetes CentOS cluster as a PV using this plugin: https://github.com/juliohm1978/kubernetes-cifs-volumedriver, which is basically just a script that calls mount.cifs. We have also tried both the Microsoft plugin and https://github.com/fstab/cifs, but they do the same thing using mount.cifs.

The mount has been working fine for small projects, but once we started using it for write-intensive activities we ran into a very strange pattern of just-written files going missing.

We created a simple Java program that writes a random number of files (between 900 and 1000) in a foreach loop. The files all contain the same random bytes, created with this line:

byte[] data = org.apache.commons.lang3.RandomUtils.nextBytes(1024 * 50);

It then lists the directory where the files were written, and the file count from the listing always comes up short.

The missing files are always at the same positions in the sequence, even though each name contains a random part.

We named the files with the foreach index plus a random part, so that each retry can be checked without interference from the previous one.
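For anyone who wants to reproduce this, the test described above can be sketched roughly as follows. This is a minimal reconstruction and not the original program: the class and method names, the zero-based index, the 32-character suffix and the default path are assumptions taken from the logs below, and java.util.Random stands in for commons-lang3 RandomUtils so the sketch has no dependencies.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class SmbCacheTest {

    private static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

    // Writes `total` files into `dir`, lists the directory, and returns the
    // names that were written but do not show up in the listing.
    static List<String> writeAndFindMissing(Path dir, int total, Random random) throws IOException {
        Files.createDirectories(dir);

        // Same 50 KiB random payload for every file, as in the question.
        byte[] data = new byte[1024 * 50];
        random.nextBytes(data);

        // One random suffix per run, so earlier runs cannot interfere (assumed length: 32).
        StringBuilder suffix = new StringBuilder();
        for (int i = 0; i < 32; i++) {
            suffix.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
        }

        List<String> written = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            // Assumption: zero-based index; the original program may start at 1.
            String name = String.format("smb-cache-test-%04d-%s.tmp", i, suffix);
            Files.write(dir.resolve(name), data);
            written.add(name);
        }

        // Collect the names the directory listing actually returns.
        Set<String> listed;
        try (Stream<Path> entries = Files.list(dir)) {
            listed = entries.map(p -> p.getFileName().toString()).collect(Collectors.toSet());
        }

        List<String> missing = new ArrayList<>();
        for (String name : written) {
            if (!listed.contains(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default path; pass the mounted directory as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "/fileserver/_smb-cache-test");
        Random random = new Random();
        int total = 900 + random.nextInt(101); // 900..1000 inclusive

        System.out.println("start write file total=" + total);
        List<String> missing = writeAndFindMissing(dir, total, random);
        System.out.println("missing files=" + missing.size());
        for (String name : missing) {
            System.out.println("missing file=" + name);
        }
    }
}
```

On a healthy mount the returned list should be empty; on the affected pods it would presumably contain the same indexes on every run.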

[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:27:24.331  start write file total=971
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:27:39.771  end write file total=971
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:27:40.041  count files=937
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:27:40.043  missing files=34
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:27:40.044  missing file=smb-cache-test-0025-yUaWG4aYTIrqFPBE93WZXzgnmBBy5Wl4.tmp
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:27:40.044  missing file=smb-cache-test-0051-yUaWG4aYTIrqFPBE93WZXzgnmBBy5Wl4.tmp
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:27:40.044  missing file=smb-cache-test-0077-yUaWG4aYTIrqFPBE93WZXzgnmBBy5Wl4.tmp
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:27:40.044  missing file=smb-cache-test-0109-yUaWG4aYTIrqFPBE93WZXzgnmBBy5Wl4.tmp
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:27:40.044  missing file=smb-cache-test-0135-yUaWG4aYTIrqFPBE93WZXzgnmBBy5Wl4.tmp
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:27:40.044  missing file=smb-cache-test-0161-yUaWG4aYTIrqFPBE93WZXzgnmBBy5Wl4.tmp
[...]

[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:30:16.113  start write file total=995
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:30:30.808  end write file total=995
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:30:31.065  count files=960
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:30:31.066  missing files=35
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:30:31.066  missing file=smb-cache-test-0025-hjvVQG6JdnC0KBI5xfsBldZkCHWZQ0Fr.tmp
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:30:31.066  missing file=smb-cache-test-0051-hjvVQG6JdnC0KBI5xfsBldZkCHWZQ0Fr.tmp
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:30:31.066  missing file=smb-cache-test-0077-hjvVQG6JdnC0KBI5xfsBldZkCHWZQ0Fr.tmp
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:30:31.066  missing file=smb-cache-test-0109-hjvVQG6JdnC0KBI5xfsBldZkCHWZQ0Fr.tmp
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:30:31.067  missing file=smb-cache-test-0135-hjvVQG6JdnC0KBI5xfsBldZkCHWZQ0Fr.tmp
[smb-cache-test-6db94b688-jw8cx] 2018-07-31 07:30:31.067  missing file=smb-cache-test-0161-hjvVQG6JdnC0KBI5xfsBldZkCHWZQ0Fr.tmp
[...]

We inspected the mounted path inside the pod and it has the same problem, while the same mount on the node, accessed through the flexvolume mount directories, is fine. So it's not just Java; it's at the pod OS level. Here is the file count on both mounts for the same run: the first is inside the pod, the second is on the node hosting the pod. (ls -l prints a leading "total" line, so each count is one higher than the number of files, matching the 937 and 971 in the first log above.)

/fileserver # pwd
/fileserver
/fileserver # ls -l _smb-cache-test/ | wc -l
938

[root@k8s-node-03 wind3-speech-file-extractor-pre-pv]# pwd
/var/lib/kubelet/pods/b4dd4252-9492-11e8-8796-000c299d5d24/volumes/juliohm~cifs/wind3-speech-file-extractor-pre-pv
[root@k8s-node-03 wind3-speech-file-extractor-pre-pv]# ls -l _smb-cache-test/ | wc -l
972

This is either Docker or Kubernetes doing something VERY fishy in the "hidden" layers between the mount created by flexvolume on the node and the very same mount in the pod.
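One check that is not covered above is whether the missing files can still be reached by name from inside the pod, which would point at the directory listing and not at the writes. A minimal sketch of such a check, assuming the expected file names are known (the class and method names here are hypothetical):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ListingVsLookup {

    // Returns the expected names that a direct lookup by path finds
    // but that the directory listing does not return.
    static List<String> hiddenFromListing(Path dir, List<String> expectedNames) throws IOException {
        Set<String> listed;
        try (Stream<Path> entries = Files.list(dir)) {
            listed = entries.map(p -> p.getFileName().toString()).collect(Collectors.toSet());
        }

        List<String> hidden = new ArrayList<>();
        for (String name : expectedNames) {
            // Files.exists checks the single path and does not enumerate the directory.
            if (!listed.contains(name) && Files.exists(dir.resolve(name))) {
                hidden.add(name);
            }
        }
        return hidden;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ListingVsLookup.java <directory> <file name>...
        Path dir = Paths.get(args[0]);
        List<String> names = Arrays.asList(args).subList(1, args.length);
        for (String name : hiddenFromListing(dir, names)) {
            System.out.println("exists but not listed=" + name);
        }
    }
}
```

If this prints the missing names when run inside the pod, the files are on the share and only the listing is incomplete; if it prints nothing, the lookup by name fails as well.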

Additional info:

We have ruled out an SMB cache issue:

  • We have tried with SMB v1.0, v2.0, v2.1 and v3.0.
  • We have also tried disabling the cache with the cifs option cache=none.
  • We have disabled the cache server-side on Windows with a registry hack that sets the cache timeout to 0.
  • We have used the undocumented $NOCSC$ suffix on the share host name. Issue still there.

We have also mounted the share manually on the node and added it to the pod as a "local" PersistentVolume instead of a flexvolume, and the issue was NOT there anymore.

Thanks

-- francesco paolo schiavone
cifs
docker
kubernetes
mount
smb

1 Answer

7/31/2018

It turns out the problem was caused by using the Alpine image as the base for Java on a CentOS node. Switching to a CentOS-based image solved the issue.

We tried this because we found a hint about similar file disappearances involving Ubuntu and SMB (with no Docker involved), and thought that Alpine might be affected by the same issue, on the guess that it is somewhat closer to Ubuntu than to RHEL/CentOS.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1572132

BTW, the solutions proposed in that thread (using noserverino and 0777 access rights) did not solve our issue.

-- francesco paolo schiavone
Source: StackOverflow