HDFS namenode not showing datanodes list correctly on Kubernetes

7/5/2020

I am trying to install HDFS on an EKS cluster. I deployed a namenode and two datanodes, and all of them come up successfully.

But a strange issue is happening. When I check the namenode web UI or query the dfsadmin client for the datanode list, only one datanode shows up, at random, i.e. sometimes datanode-0 and sometimes datanode-1. It never displays both/all datanodes.

What could the issue be here? I am even using a headless service for the datanodes. Please help.

#clusterIP service of namenode
apiVersion: v1
kind: Service
metadata:
  name: hdfs-name
  namespace: pulse
  labels:
    app.kubernetes.io/name: hdfs-name
    app.kubernetes.io/version: "1.0"
spec:
  ports:
    - port: 8020
      protocol: TCP
      name: nn-rpc
    - port: 9870
      protocol: TCP
      name: nn-web
  selector:
    app.kubernetes.io/name: hdfs-name
    app.kubernetes.io/version: "1.0"
  type: ClusterIP
---
#namenode stateful deployment 
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hdfs-name
  namespace: pulse
  labels:
    app.kubernetes.io/name: hdfs-name
    app.kubernetes.io/version: "1.0"
spec:
  serviceName: hdfs-name
  replicas: 1       #TODO 2 namenodes (1 active, 1 standby)
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs-name
      app.kubernetes.io/version: "1.0"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: hdfs-name
        app.kubernetes.io/version: "1.0"
    spec:
      initContainers:
      - name: delete-lost-found
        image: busybox
        command: ["sh", "-c", "rm -rf /hadoop/dfs/name/lost+found"]
        volumeMounts:
        - name: hdfs-name-pv-claim
          mountPath: /hadoop/dfs/name
      containers:
      - name: hdfs-name
        image: bde2020/hadoop-namenode
        env:
        - name: CLUSTER_NAME
          value: hdfs-k8s
        - name: HDFS_CONF_dfs_permissions_enabled
          value: "false"
        #- name: HDFS_CONF_dfs_replication              #not needed
        #  value: "2"  
        ports:
        - containerPort: 8020
          name: nn-rpc
        - containerPort: 9870
          name: nn-web
        resources:
          limits:
            cpu: "500m"
            memory: 1Gi
          requests:
            cpu: "500m"
            memory: 1Gi
        volumeMounts:
        - name: hdfs-name-pv-claim
          mountPath: /hadoop/dfs/name
  volumeClaimTemplates:
  - metadata:
      name: hdfs-name-pv-claim
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: ebs
      resources:
        requests:
          storage: 1Gi
---
#headless service of datanode
apiVersion: v1
kind: Service
metadata:
  name: hdfs-data
  namespace: pulse
  labels:
    app.kubernetes.io/name: hdfs-data
    app.kubernetes.io/version: "1.0"
spec:
  ports:
    - port: 9866
      protocol: TCP
      name: dn-rpc
    - port: 9864
      protocol: TCP
      name: dn-web
  selector:
    app.kubernetes.io/name: hdfs-data
    app.kubernetes.io/version: "1.0"
  clusterIP: None
  type: ClusterIP
---
#datanode stateful deployment
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hdfs-data
  namespace: pulse
  labels:
    app.kubernetes.io/name: hdfs-data
    app.kubernetes.io/version: "1.0"
spec:
  serviceName: hdfs-data
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs-data
      app.kubernetes.io/version: "1.0"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: hdfs-data
        app.kubernetes.io/version: "1.0"
    spec:
      containers:
      - name: hdfs-data
        image: bde2020/hadoop-datanode
        env:
        - name: CORE_CONF_fs_defaultFS
          value: hdfs://hdfs-name:8020
        ports:           
        - containerPort: 9866
          name: dn-rpc
        - containerPort: 9864
          name: dn-web
        resources:
          limits:
            cpu: "500m"
            memory: 1Gi
          requests:
            cpu: "500m"
            memory: 1Gi
        volumeMounts:
        - name: hdfs-data-pv-claim
          mountPath: /hadoop/dfs/data 
  volumeClaimTemplates:
  - metadata:
      name: hdfs-data-pv-claim
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: ebs
      resources:
        requests:
          storage: 1Gi

     

Running hdfs dfsadmin -report shows only one datanode at random, e.g. sometimes datanode-0 and sometimes datanode-1.
The datanodes' hostnames are different (datanode-0, datanode-1), but their registered name is the same (127.0.0.1:9866 (localhost)). Can this be the issue? If yes, how do I solve it?
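
As a sanity check (the pod name hdfs-name-0 and the pulse namespace are taken from the manifests above; adjust if yours differ), the registration addresses can be compared against the pod IPs like this:

#run the dfsadmin report from inside the namenode pod
kubectl exec -n pulse hdfs-name-0 -- hdfs dfsadmin -report

#list the datanode pods with their pod IPs for comparison
kubectl get pods -n pulse -l app.kubernetes.io/name=hdfs-data -o wide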

Also, I don't see any HDFS block replication happening, even though the replication factor is 3.
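
A quick way to confirm whether blocks are under-replicated (again assuming the hdfs-name-0 pod name) is to run fsck against the root path from the namenode pod:

#report files, blocks, and the datanodes holding each replica
kubectl exec -n pulse hdfs-name-0 -- hdfs fsck / -files -blocks -locations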

UPDATE
It turned out to be an Istio proxy issue. I uninstalled Istio and it worked. The Istio proxy was setting the name to 127.0.0.1 instead of the actual IP.

-- NumeroUno
amazon-eks
hdfs
kubernetes

2 Answers

8/27/2020

It turned out to be an Istio proxy issue. I uninstalled Istio and it worked. The Istio proxy was setting the name to 127.0.0.1 instead of the actual IP.

-- NumeroUno
Source: StackOverflow

8/28/2020

I ran into this same issue, and the workaround I'm currently using is to disable the Envoy redirect for inbound traffic to the namenode on port 9000 (8020 in your case) by adding this annotation to the Hadoop namenode:

traffic.sidecar.istio.io/excludeInboundPorts: "9000"

Reference: https://istio.io/v1.4/docs/reference/config/annotations/
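
For context, a sketch of where that annotation would sit in the namenode StatefulSet from the question (the port value 8020 matches its nn-rpc port; this is an assumption about the manifest, not a verified config):

#pod template metadata of the hdfs-name StatefulSet
  template:
    metadata:
      labels:
        app.kubernetes.io/name: hdfs-name
        app.kubernetes.io/version: "1.0"
      annotations:
        #tell the Istio sidecar not to intercept inbound namenode RPC traffic
        traffic.sidecar.istio.io/excludeInboundPorts: "8020"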

After reading through some Istio issues, it seems the source IP is not retained when traffic is redirected through Envoy.

Related issues:
https://github.com/istio/istio/issues/5679
https://github.com/istio/istio/pull/23275

I have not tried the TPROXY approach yet, since I'm not currently running Istio 1.6, which includes the TPROXY source IP preservation fix.
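
If you do try it, the interception mode is switched per pod with another annotation from the same reference page; a minimal, untested sketch on the namenode pod template would be:

  template:
    metadata:
      annotations:
        #switch the sidecar's traffic interception from REDIRECT to TPROXY
        sidecar.istio.io/interceptionMode: TPROXY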

-- awells
Source: StackOverflow