I am trying to install HDFS on an EKS cluster. I deployed a namenode and two datanodes, and all of them come up successfully.
But a strange problem occurs: when I check the namenode GUI or run the dfsadmin client to get the datanode list, it randomly shows only one datanode, i.e. sometimes datanode-0, sometimes datanode-1. It never displays both/all datanodes.
What could be the issue here? I am even using a headless service for the datanodes. Please help.
#clusterIP service of namenode
apiVersion: v1
kind: Service
metadata:
  name: hdfs-name
  namespace: pulse
  labels:
    app.kubernetes.io/name: hdfs-name
    app.kubernetes.io/version: "1.0"
spec:
  ports:
    - port: 8020
      protocol: TCP
      name: nn-rpc
    - port: 9870
      protocol: TCP
      name: nn-web
  selector:
    app.kubernetes.io/name: hdfs-name
    app.kubernetes.io/version: "1.0"
  type: ClusterIP
---
#namenode stateful deployment
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hdfs-name
  namespace: pulse
  labels:
    app.kubernetes.io/name: hdfs-name
    app.kubernetes.io/version: "1.0"
spec:
  serviceName: hdfs-name
  replicas: 1 #TODO 2 namenodes (1 active, 1 standby)
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs-name
      app.kubernetes.io/version: "1.0"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: hdfs-name
        app.kubernetes.io/version: "1.0"
    spec:
      initContainers:
        - name: delete-lost-found
          image: busybox
          command: ["sh", "-c", "rm -rf /hadoop/dfs/name/lost+found"]
          volumeMounts:
            - name: hdfs-name-pv-claim
              mountPath: /hadoop/dfs/name
      containers:
        - name: hdfs-name
          image: bde2020/hadoop-namenode
          env:
            - name: CLUSTER_NAME
              value: hdfs-k8s
            - name: HDFS_CONF_dfs_permissions_enabled
              value: "false"
            #- name: HDFS_CONF_dfs_replication #not needed
            #  value: "2"
          ports:
            - containerPort: 8020
              name: nn-rpc
            - containerPort: 9870
              name: nn-web
          resources:
            limits:
              cpu: "500m"
              memory: 1Gi
            requests:
              cpu: "500m"
              memory: 1Gi
          volumeMounts:
            - name: hdfs-name-pv-claim
              mountPath: /hadoop/dfs/name
  volumeClaimTemplates:
    - metadata:
        name: hdfs-name-pv-claim
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: ebs
        resources:
          requests:
            storage: 1Gi
---
#headless service of datanode
apiVersion: v1
kind: Service
metadata:
  name: hdfs-data
  namespace: pulse
  labels:
    app.kubernetes.io/name: hdfs-data
    app.kubernetes.io/version: "1.0"
spec:
  ports:
    - port: 9866
      protocol: TCP
      name: dn-rpc
    - port: 9864
      protocol: TCP
      name: dn-web
  selector:
    app.kubernetes.io/name: hdfs-data
    app.kubernetes.io/version: "1.0"
  clusterIP: None
  type: ClusterIP
---
#datanode stateful deployment
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hdfs-data
  namespace: pulse
  labels:
    app.kubernetes.io/name: hdfs-data
    app.kubernetes.io/version: "1.0"
spec:
  serviceName: hdfs-data
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs-data
      app.kubernetes.io/version: "1.0"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: hdfs-data
        app.kubernetes.io/version: "1.0"
    spec:
      containers:
        - name: hdfs-data
          image: bde2020/hadoop-datanode
          env:
            - name: CORE_CONF_fs_defaultFS
              value: hdfs://hdfs-name:8020
          ports:
            - containerPort: 9866
              name: dn-rpc
            - containerPort: 9864
              name: dn-web
          resources:
            limits:
              cpu: "500m"
              memory: 1Gi
            requests:
              cpu: "500m"
              memory: 1Gi
          volumeMounts:
            - name: hdfs-data-pv-claim
              mountPath: /hadoop/dfs/data
  volumeClaimTemplates:
    - metadata:
        name: hdfs-data-pv-claim
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: ebs
        resources:
          requests:
            storage: 1Gi
Running hdfs dfsadmin -report shows only one datanode, chosen at random, e.g. sometimes datanode-0 and sometimes datanode-1.
The datanodes' hostnames are different (datanode-0, datanode-1), but their name is the same: 127.0.0.1:9866 (localhost). Could this be the issue? If yes, how do I solve it?
Also, I don't see any HDFS block replication happening, even though the replication factor is 3.
UPDATE
Hi, it turned out to be an Istio proxy issue. I uninstalled Istio and it worked. The Istio proxy was setting the datanode name to 127.0.0.1 instead of the actual IP.
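If uninstalling Istio completely is more than you want, an alternative worth trying (assuming automatic sidecar injection is what added the proxy to these pods) is to opt just the HDFS pods out of injection via the standard Istio annotation. A minimal sketch of the pod template metadata in the namenode/datanode StatefulSets:

  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false" #no Envoy sidecar for this pod

Without the sidecar in the path, datanode registrations should reach the namenode with their real pod IPs again.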
I ran into this same issue, and the workaround I'm currently using is to disable the Envoy redirect for inbound traffic to the namenode on port 9000 (8020 in your case) by adding this annotation to the Hadoop namenode pod:
traffic.sidecar.istio.io/excludeInboundPorts: "9000"
Reference: https://istio.io/v1.4/docs/reference/config/annotations/
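For the manifests above, that roughly translates to annotating the namenode pod template and swapping in your RPC port 8020 instead of 9000. A sketch of just the template metadata (untested against your exact setup):

  template:
    metadata:
      annotations:
        traffic.sidecar.istio.io/excludeInboundPorts: "8020" #skip the Envoy inbound redirect for the NN RPC port
      labels:
        app.kubernetes.io/name: hdfs-name
        app.kubernetes.io/version: "1.0"

The idea is that, with the RPC port excluded from redirection, the namenode sees the datanodes' real pod IPs when they register instead of 127.0.0.1.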
After reading through some Istio issues, it seems the source IP is not retained when traffic is redirected through Envoy.
Related issues:
https://github.com/istio/istio/issues/5679
https://github.com/istio/istio/pull/23275
I have not tried the TPROXY approach yet, since I'm not currently running Istio 1.6, which includes the TPROXY source IP preservation fix.
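For reference, the interception mode can be switched per pod with another annotation from the same reference page. A sketch only, since I have not tested TPROXY myself:

  template:
    metadata:
      annotations:
        sidecar.istio.io/interceptionMode: TPROXY #preserve the client source IP instead of the default REDIRECT mode

This should only help on an Istio version that actually includes the source IP preservation fix mentioned above.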