I have an Elasticsearch cluster of 3 nodes (a StatefulSet) running on EKS (Server Version: v1.13.7-eks-c57ff8) using Persistent Volumes. I performed an EKS cluster upgrade from 1.12 to 1.13, which was successful. But one of the Elasticsearch pods fails to start and is stuck in the Init state:
NAME          READY   STATUS     RESTARTS   AGE
es-master-0   0/1     Init:0/3   0          15h
es-master-1   1/1     Running    0          44h
es-master-2   1/1     Running    0          44h
I tried killing the pod es-master-0, but the new pod gets stuck in the same state.
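For completeness, deleting the pod was just the following (the namespace is the one from the manifest below):

kubectl delete pod es-master-0 -n kube-logging

The StatefulSet controller recreates es-master-0 immediately, but it goes straight back into Init:0/3.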
When I describe the pod (kubectl describe pod es-master-0), I can see that it is unable to mount its volume:
Events:
  Type     Reason                  Age    From                                               Message
  ----     ------                  ----   ----                                               -------
  Normal   Scheduled               2m13s  default-scheduler                                  Successfully assigned kube-logging/es-master-0 to ip-10-2-18-16.us-west-2.compute.internal
  Normal   SuccessfulAttachVolume  2m10s  attachdetach-controller                            AttachVolume.Attach succeeded for volume "pvc-f2e27430-af11-11e9-b10d-02a8eba067e2"
  Warning  FailedMount             10s    kubelet, ip-10-2-18-16.us-west-2.compute.internal  Unable to mount volumes for pod "es-master-0_kube-logging(bc27e29c-b539-11e9-9958-06eeabb0603e)": timeout expired waiting for volumes to attach or mount for pod "kube-logging"/"es-master-0". list of unmounted volumes=[data]. list of unattached volumes=[data default-token-bz6w9]
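Since the attach succeeds and only the mount times out, the kubelet side and the EBS volume itself can also be inspected; a sketch of those checks (the tag filter assumes the default tags set by the in-tree EBS provisioner):

# on ip-10-2-18-16.us-west-2.compute.internal, the node from the events above
sudo journalctl -u kubelet | grep pvc-f2e27430-af11-11e9-b10d-02a8eba067e2
# EBS volume state from the AWS side, located via the provisioner's PVC tag
aws ec2 describe-volumes --filters Name=tag:kubernetes.io/created-for/pvc/name,Values=data-es-master-0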
Output of kubectl get pv:
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                           STORAGECLASS   REASON   AGE
pvc-06cd5cfe-af12-11e9-b10d-02a8eba067e2   100Gi      RWO            Retain           Bound    kube-logging/data-es-master-1   aws-gp2                 7d19h
pvc-178b5aba-af12-11e9-b10d-02a8eba067e2   100Gi      RWO            Retain           Bound    kube-logging/data-es-master-2   aws-gp2                 7d19h
pvc-f2e27430-af11-11e9-b10d-02a8eba067e2   100Gi      RWO            Retain           Bound    kube-logging/data-es-master-0   aws-gp2                 7d19h
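Since the 1.12 to 1.13 upgrade replaced worker nodes, one thing worth ruling out is an availability-zone mismatch between the volume and the node the pod lands on; a sketch of that check (the label key is the pre-1.17 failure-domain.beta.kubernetes.io/zone):

# zone of the PV backing es-master-0
kubectl get pv pvc-f2e27430-af11-11e9-b10d-02a8eba067e2 -o jsonpath='{.metadata.labels.failure-domain\.beta\.kubernetes\.io/zone}'
# zone of the node the pod was scheduled on
kubectl get node ip-10-2-18-16.us-west-2.compute.internal -o jsonpath='{.metadata.labels.failure-domain\.beta\.kubernetes\.io/zone}'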
Output of kubectl get pvc:
NAME               STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-es-master-0   Bound    pvc-f2e27430-af11-11e9-b10d-02a8eba067e2   100Gi      RWO            aws-gp2        7d19h
data-es-master-1   Bound    pvc-06cd5cfe-af12-11e9-b10d-02a8eba067e2   100Gi      RWO            aws-gp2        7d19h
data-es-master-2   Bound    pvc-178b5aba-af12-11e9-b10d-02a8eba067e2   100Gi      RWO            aws-gp2        7d19h
I also tried rebooting the node on which this pod is scheduled, but the pod is still stuck.
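For reference, the reboot was roughly this sequence (the EC2 instance ID is hypothetical):

kubectl drain ip-10-2-18-16.us-west-2.compute.internal --ignore-daemonsets --delete-local-data
aws ec2 reboot-instances --instance-ids i-0123456789abcdef0   # hypothetical instance ID
kubectl uncordon ip-10-2-18-16.us-west-2.compute.internal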
This is my manifest file:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: es-master
  namespace: kube-logging
spec:
  serviceName: elasticsearch
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
      - name: elasticsearch
        image: docker.elastic.co/elasticsearch/elasticsearch:7.2.0
        resources:
          limits:
            cpu: 1000m
            memory: 2.5G
          requests:
            cpu: 100m
        ports:
        - containerPort: 9200
          name: rest
          protocol: TCP
        - containerPort: 9300
          name: inter-node
          protocol: TCP
        volumeMounts:
        - name: data
          mountPath: /usr/share/elasticsearch/data
        env:
        - name: cluster.name
          value: prod-eks-logs
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: node.name
          value: "$(NODE_NAME).elasticsearch"
        - name: discovery.zen.ping.unicast.hosts
          value: "es-master-0.elasticsearch,es-master-1.elasticsearch,es-master-2.elasticsearch"
        - name: cluster.initial_master_nodes
          value: "es-master-0.elasticsearch,es-master-1.elasticsearch,es-master-2.elasticsearch"
        - name: discovery.zen.minimum_master_nodes
          value: "2"
        - name: ES_JAVA_OPTS
value: "-Xmx1g -Xmx1g"
      initContainers:
      - name: fix-permissions
        image: busybox
        command: ["sh", "-c", "chown -R 1000:1000 /usr/share/elasticsearch/data"]
        securityContext:
          privileged: true
        volumeMounts:
        - name: data
          mountPath: /usr/share/elasticsearch/data
      - name: increase-vm-max-map
        image: busybox
        command: ["sysctl", "-w", "vm.max_map_count=262144"]
        securityContext:
          privileged: true
      - name: increase-fd-ulimit
        image: busybox
        command: ["sh", "-c", "ulimit -n 65536"]
        securityContext:
          privileged: true
  volumeClaimTemplates:
  - metadata:
      name: data
      labels:
        app: elasticsearch
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: aws-gp2
      resources:
        requests:
          storage: 100Gi
Any help on how I can get past this state would be appreciated.