Data Locality in Spark on Kubernetes colocated with HDFS pods

11/24/2021

Revisiting the data locality question for Spark on Kubernetes: if the Spark pods are colocated on the same nodes as the HDFS datanode pods, does data locality work?

The Q&A session here: https://www.youtube.com/watch?v=5-4X3HylQQo seems to suggest it doesn't.

-- JHI Star
apache-spark
hadoop
hdfs
kubernetes

1 Answer

11/25/2021

Locality is an issue with Spark on Kubernetes. Basic data locality does work, but only if your Kubernetes provider supplies the network topology plugin required to resolve where the data lives and where the Spark executors should run, and if you have built Kubernetes to include that code.
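For context (my sketch, not part of the answer): one built-in way Hadoop resolves topology is a script configured via net.topology.script.file.name in core-site.xml. The namenode invokes it with IP addresses as arguments and expects one network path per address, so a locality-aware setup maps each datanode/client IP to the Kubernetes node it runs on. The CIDR-to-node mapping below is made up purely for illustration.

#!/bin/bash
# Hypothetical topology script: the namenode passes one or more IPs as
# arguments; we print one "rack" path per IP that stands in for the
# Kubernetes node, so HDFS can tell which replicas are node-local.
for ip in "$@"; do
  case "$ip" in
    10.128.0.*) echo "/k8s-node-a" ;;   # illustrative pod/node CIDRs
    10.128.1.*) echo "/k8s-node-b" ;;
    *)          echo "/default-rack" ;;
  esac
done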

There is a method to test this data locality. I have copied it here for completeness:

Here's how one can check whether data locality at the namenode level works.

Launch an HDFS client pod and go inside it.

$ kubectl run -i --tty hadoop --image=uhopper/hadoop:2.7.2 \
    --generator="run-pod/v1" --command -- /bin/bash

Inside the pod, create a simple text file on HDFS.

$ hadoop fs \
    -fs hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local \
    -cp file:/etc/hosts /hosts
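Not part of the original steps, but it can be worth confirming the copy landed before changing the replication (same -fs flag as above):

$ hadoop fs \
    -fs hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local \
    -ls /hosts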

Set the number of replicas for the file to the number of your cluster nodes. This ensures that there will be a copy of the file on the cluster node that your client pod is running on. Wait some time until this happens.

$ hadoop fs -setrep NUM-REPLICAS /hosts
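One addition to the original recipe: rather than sleeping for an arbitrary time, -setrep accepts a -w flag that blocks until replication actually completes (this can take a while on a large cluster):

$ hadoop fs \
    -fs hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local \
    -setrep -w NUM-REPLICAS /hosts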

Run the following hdfs cat command. From the debug messages, see which datanode is being used, and make sure it is your local datanode. (You can find your node's IP with $ kubectl get pods hadoop -o json | grep hostIP; run this outside the pod.)

$ hadoop --loglevel DEBUG fs \
    -fs hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local \
    -cat /hosts
...
17/04/24 20:51:28 DEBUG hdfs.DFSClient: Connecting to datanode 10.128.0.4:50010
...
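A sketch for making that comparison concrete (my addition; the pod name hadoop is the one created above, and the IPs are taken from the sample output): print the pod's hostIP outside the pod, then grep the client's debug output for the datanode connection inside the pod.

$ kubectl get pod hadoop -o jsonpath='{.status.hostIP}'    # outside the pod
10.128.0.4

$ hadoop --loglevel DEBUG fs \
    -fs hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local \
    -cat /hosts 2>&1 | grep 'Connecting to datanode'        # inside the pod
17/04/24 20:51:28 DEBUG hdfs.DFSClient: Connecting to datanode 10.128.0.4:50010

If the IP after "Connecting to datanode" matches the pod's hostIP, the read was node-local.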

If not, check whether your local datanode even appears in the list of replica locations in the debug messages (see the newInfo line below). If it does not, the replication from the -setrep step has not finished yet; wait longer. (You can use a smaller cluster for this test if that is possible.)

`17/04/24 20:51:28 DEBUG hdfs.DFSClient: newInfo = LocatedBlocks{ fileLength=199 underConstruction=false blocks=[LocatedBlock{BP-347555225-10.128.0.2-1493066928989:blk_1073741825_1001; getBlockSize()=199; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.128.0.4:50010,DS-d2de9d29-6962-4435-a4b4-aadf4ea67e46,DISK], DatanodeInfoWithStorage[10.128.0.3:50010,DS-0728ffcf-f400-4919-86bf-af0f9af36685,DISK], DatanodeInfoWithStorage[10.128.0.2:50010,DS-3a881114-af08-47de-89cf-37dec051c5c2,DISK]]}] lastLocatedBlock=LocatedBlock{BP-347555225-10.128.0.2-1493066928989:blk_1073741825_1001;`

Repeat the hdfs cat command multiple times and check whether the same datanode is consistently used.
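A small loop (a hypothetical wrapper around the same command) makes the repetition less tedious:

$ for i in 1 2 3 4 5; do
    # 2>&1 >/dev/null keeps the DEBUG log lines (stderr) for grep
    # and drops the file contents (stdout).
    hadoop --loglevel DEBUG fs \
      -fs hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local \
      -cat /hosts 2>&1 >/dev/null | grep 'Connecting to datanode'
  done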

-- Matt Andruff
Source: StackOverflow