I have deployed a Hadoop cluster in Kubernetes. HDFS runs with three datanodes (a StatefulSet) and one namenode. I want to access data in HDFS from outside the cluster, so I created a Service of type NodePort to expose the namenode. When I tried to download a file from HDFS, the namenode redirected me to a datanode. The problem is that the host in the redirect URL is a cluster-internal domain such as hadoop-hdfs-dn-0.hadoop-hdfs-dn.hadoop.svc.cluster.local:50075, which cannot be reached externally.
My first thought was to resolve the domain on the client side, like this:
hadoop-hdfs-dn-0.hadoop-hdfs-dn.hadoop.svc.cluster.local:50075 => IP0:50075
hadoop-hdfs-dn-1.hadoop-hdfs-dn.hadoop.svc.cluster.local:50075 => IP1:50075
hadoop-hdfs-dn-2.hadoop-hdfs-dn.hadoop.svc.cluster.local:50075 => IP2:50075
However, a NodePort is opened on every node in the Kubernetes cluster, so all three IPs above lead to the same Service and the request may reach the wrong datanode.
Is there a solution for this situation, on either the Hadoop or the Kubernetes side? For example, can the namenode be forced to redirect like this?
hadoop-hdfs-dn-0.hadoop-hdfs-dn.hadoop.svc.cluster.local:50075 => <node IP>:50001
hadoop-hdfs-dn-1.hadoop-hdfs-dn.hadoop.svc.cluster.local:50075 => <node IP>:50002
hadoop-hdfs-dn-2.hadoop-hdfs-dn.hadoop.svc.cluster.local:50075 => <node IP>:50003
That way I could create three Services, one for each pod in the StatefulSet.
I would suggest trying externalIPs.
Suppose your datanode is listening on port 50000. You can create a separate Service for each datanode and use the IP of the node it is running on as the externalIP, something like this:
apiVersion: v1
kind: Service
metadata:
  name: datanode-1
spec:
  externalIPs:
  - node1-ip
  ports:
  - name: datanode
    port: 50000
  selector:
    app: datanode
    id: "1"
---
apiVersion: v1
kind: Service
metadata:
  name: datanode-2
spec:
  externalIPs:
  - node2-ip
  ports:
  - name: datanode
    port: 50000
  selector:
    app: datanode
    id: "2"
---
apiVersion: v1
kind: Service
metadata:
  name: datanode-3
spec:
  externalIPs:
  - node3-ip
  ports:
  - name: datanode
    port: 50000
  selector:
    app: datanode
    id: "3"
Then you can resolve each pod's domain name to the IP of the node it is running on.
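One way to do that client-side resolution is with hosts-file entries. The following is only a minimal sketch: the node IPs are placeholder addresses, the pod-to-node mapping is hypothetical and would have to be filled in from your own cluster, and the helper name hosts_entries is made up for illustration. The pod names and the domain suffix are taken from the question.

```python
# Sketch: build hosts-file lines that map each datanode pod's
# cluster-internal domain name to the IP of the node it runs on.

# Hypothetical placement of pods on nodes; the IPs are placeholders
# and must be replaced with the real node IPs of your cluster.
POD_TO_NODE_IP = {
    "hadoop-hdfs-dn-0": "192.0.2.11",
    "hadoop-hdfs-dn-1": "192.0.2.12",
    "hadoop-hdfs-dn-2": "192.0.2.13",
}

# Domain suffix of the datanode pods, as it appears in the redirect URL.
DOMAIN_SUFFIX = "hadoop-hdfs-dn.hadoop.svc.cluster.local"


def hosts_entries(pod_to_node_ip, suffix):
    """Return one '<ip> <fqdn>' line per pod, sorted by pod name."""
    return [
        f"{ip} {pod}.{suffix}"
        for pod, ip in sorted(pod_to_node_ip.items())
    ]


if __name__ == "__main__":
    # Print the lines so they can be reviewed before use.
    print("\n".join(hosts_entries(POD_TO_NODE_IP, DOMAIN_SUFFIX)))
```

The printed lines could then be added to the hosts file on the client machine (for example /etc/hosts on Linux), so that the host name in the namenode's redirect resolves to the matching node IP.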