Externally Access Hadoop HDFS deployed in Kubernetes

11/5/2018

I have deployed a Hadoop cluster in Kubernetes, with three datanodes (a StatefulSet) and one namenode for HDFS. I want to access data in HDFS from outside the cluster, so I created a Service of type NodePort to expose the namenode. When I tried to download a file from HDFS, the namenode redirected me to a datanode. The problem is that the redirect URL uses a cluster-internal domain such as hadoop-hdfs-dn-0.hadoop-hdfs-dn.hadoop.svc.cluster.local:50075, which cannot be reached from outside the cluster.

My first thought was to resolve the domains on the client side, like this:

hadoop-hdfs-dn-0.hadoop-hdfs-dn.hadoop.svc.cluster.local:50075 => IP0:50075
hadoop-hdfs-dn-1.hadoop-hdfs-dn.hadoop.svc.cluster.local:50075 => IP1:50075
hadoop-hdfs-dn-2.hadoop-hdfs-dn.hadoop.svc.cluster.local:50075 => IP2:50075

However, a NodePort is opened on every node in the Kubernetes cluster, so all three IPs above lead to the same Service, and a request may end up at the wrong datanode.

Is there a solution for this, from either the Hadoop side or the Kubernetes side? For example, can the namenode be forced to redirect like this?

hadoop-hdfs-dn-0.hadoop-hdfs-dn.hadoop.svc.cluster.local:50075 => <node IP>:50001
hadoop-hdfs-dn-1.hadoop-hdfs-dn.hadoop.svc.cluster.local:50075 => <node IP>:50002
hadoop-hdfs-dn-2.hadoop-hdfs-dn.hadoop.svc.cluster.local:50075 => <node IP>:50003

That way I could create three Services, one for each pod in the StatefulSet.
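A minimal sketch of what one such per-pod Service might look like, assuming the StatefulSet is named hadoop-hdfs-dn in the hadoop namespace (inferred from the pod DNS names above). The StatefulSet controller sets the label statefulset.kubernetes.io/pod-name on each pod, which lets a Service select exactly one pod. The Service name and the nodePort value are hypothetical; 30001 is used rather than 50001 because the default NodePort range is 30000-32767.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: hadoop-hdfs-dn-0-external   # hypothetical name
  namespace: hadoop                 # assumed from the pod DNS names
spec:
  type: NodePort
  selector:
    # Label added automatically to each StatefulSet pod
    statefulset.kubernetes.io/pod-name: hadoop-hdfs-dn-0
  ports:
  - name: datanode-http
    port: 50075
    targetPort: 50075
    nodePort: 30001                 # assumed value; must be unique per datanode
```

The same manifest would be repeated for hadoop-hdfs-dn-1 and hadoop-hdfs-dn-2 with different nodePort values. On its own this does not change the host and port that the namenode puts in the redirect URL, so the client would still have to map each internal name to the matching node IP and port.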

-- Ryan Yang
dns
hadoop
hdfs
kubernetes

1 Answer

11/5/2018

I would suggest trying externalIPs.

Suppose your datanode is listening on port 50000. You can create a separate Service for each datanode and use the IP of the node it is running on as the external IP, something like this:

apiVersion: v1
kind: Service
metadata:
  name: datanode-1
spec:
  externalIPs:
  - node1-ip
  ports:
  - name: datanode
    port: 50000
  selector:
    app: datanode
    id: "1"
---
apiVersion: v1
kind: Service
metadata:
  name: datanode-2
spec:
  externalIPs:
  - node2-ip
  ports:
  - name: datanode
    port: 50000
  selector:
    app: datanode
    id: "2"
---
apiVersion: v1
kind: Service
metadata:
  name: datanode-3
spec:
  externalIPs:
  - node3-ip
  ports:
  - name: datanode
    port: 50000
  selector:
    app: datanode
    id: "3"

Then you can resolve each pod's domain name to the IP of the node it is running on.
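The selectors above assume every datanode pod carries its own id label. For pods created by a StatefulSet, one possible alternative is to select on the statefulset.kubernetes.io/pod-name label, which is set automatically. The sketch below assumes the pod name, namespace, and port 50075 from the question; 192.0.2.10 is a placeholder documentation address standing in for the node IP, and the Service name is hypothetical.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: datanode-0-external         # hypothetical name
  namespace: hadoop                 # assumed from the question's DNS names
spec:
  externalIPs:
  - 192.0.2.10                      # placeholder; replace with the node's IP
  ports:
  - name: datanode
    port: 50075                     # same port the namenode redirects to
    targetPort: 50075
  selector:
    # Matches exactly one pod of the StatefulSet
    statefulset.kubernetes.io/pod-name: hadoop-hdfs-dn-0
```

If the Service port matches the port in the redirect URL, as assumed here, the client only needs a name-to-IP mapping for each datanode and no port rewriting.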

-- Kun Li
Source: StackOverflow