Letting only one Elasticsearch pod come up on a node in Kubernetes

12/26/2018

We have a multi-node setup of our product where we need to deploy multiple Elasticsearch pods. As all these are data nodes and have volume mounts for persistent storage, we don't want to bring two pods up on the same node. I'm trying to use the anti-affinity feature of Kubernetes, but to no avail.

The cluster deployment is done through Rancher. We have 5 nodes in the cluster, and three of them (let's say node-1, node-2 and node-3) have the label test.service.es-master: "true". When I deploy the Helm chart and scale it up to 3, Elasticsearch pods come up on all three of these nodes. But if I scale it to 4, the 4th data node comes up on one of the above-mentioned nodes. Is that the correct behavior? My understanding was that imposing strict anti-affinity should prevent the pods from coming up on the same node. I've referred to multiple blogs and forums (e.g. this and this), and they suggest changes similar to mine. I'm attaching the relevant section of the Helm chart.

The requirement is that ES should come up only on nodes labelled with the specific key-value pair mentioned above, and each of those nodes should contain only one pod. Any feedback is appreciated.

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: null
  labels:
    test.service.es-master: "true"
  name: {{ .Values.service.name }}
  namespace: default
spec:
  clusterIP: None
  ports:
  ...
  selector:
    test.service.es-master: "true"
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    test.service.es-master: "true"
  name: {{ .Values.service.name }}
  namespace: default
spec:
  selector:
    matchLabels:
      test.service.es-master: "true"
  serviceName: {{ .Values.service.name }}
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: test.service.es-master
            operator: In
            values:
            - "true"
        topologyKey: kubernetes.io/hostname
  replicas: {{ .Values.replicaCount }}
  template:
    metadata:
      creationTimestamp: null
      labels:
        test.service.es-master: "true"
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: test.service.es-master
                operator: In
                values:
                  - "true"
              topologyKey: kubernetes.io/hostname
      securityContext:
             ...
      volumes:
        ...
      ...
status: {}

Update-1

As per the suggestions in the comments and answers, I've added the anti-affinity section in template.spec. Unfortunately, the issue still remains. The updated YAML looks as follows:

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: null
  labels:
    test.service.es-master: "true"
  name: {{ .Values.service.name }}
  namespace: default
spec:
  clusterIP: None
  ports:
  - name: {{ .Values.service.httpport | quote }}
    port: {{ .Values.service.httpport }}
    targetPort: {{ .Values.service.httpport }}
  - name: {{ .Values.service.tcpport | quote }}
    port: {{ .Values.service.tcpport }}
    targetPort: {{ .Values.service.tcpport }}
  selector:
    test.service.es-master: "true"
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    test.service.es-master: "true"
  name: {{ .Values.service.name }}
  namespace: default
spec:
  selector:
    matchLabels:
      test.service.es-master: "true"
  serviceName: {{ .Values.service.name }}
  replicas: {{ .Values.replicaCount }}
  template:
    metadata:
      creationTimestamp: null
      labels:
        test.service.es-master: "true"
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
            matchExpressions:
            - key: test.service.es-master
              operator: In
              values:
              - "true"
            topologyKey: kubernetes.io/hostname
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: test.service.es-master
                operator: In
                values:
                  - "true"
              topologyKey: kubernetes.io/hostname
      securityContext:
             readOnlyRootFilesystem: false
      volumes:
       - name: elasticsearch-data-volume
         hostPath:
            path: /opt/ca/elasticsearch/data
      initContainers:
         - name: elasticsearch-data-volume
           image: busybox
           securityContext:
                  privileged: true
           command: ["sh", "-c", "chown -R 1010:1010 /var/data/elasticsearch/nodes"]
           volumeMounts:
              - name: elasticsearch-data-volume
                mountPath: /var/data/elasticsearch/nodes
      containers:
      - env:
        {{- range $key, $val := .Values.data }}
        - name: {{ $key }} 
          value: {{ $val | quote }}
        {{- end}}
        image: {{ .Values.image.registry }}/analytics/{{ .Values.image.repository }}:{{ .Values.image.tag }}
        name: {{ .Values.service.name }}
        ports:
        - containerPort: {{ .Values.service.httpport }}
        - containerPort: {{ .Values.service.tcpport }}
        volumeMounts:
              - name: elasticsearch-data-volume
                mountPath: /var/data/elasticsearch/nodes    
        resources:
          limits:
            memory: {{ .Values.resources.limits.memory }}
          requests:
            memory: {{ .Values.resources.requests.memory }}
        restartPolicy: Always
status: {}
-- Bitswazsky
elasticsearch
kubernetes
rancher

3 Answers

12/26/2018

As Egor suggested, you need podAntiAffinity:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: store
  replicas: 3
  template:
    metadata:
      labels:
        app: store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"

Source: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#always-co-located-in-the-same-node

So, with your current label, it might look like this:

spec:
  affinity:
    nodeAffinity:
    # node affinity stuff here
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: "test.service.es-master"
            operator: In
            values:
            - "true"
        topologyKey: "kubernetes.io/hostname"

Ensure that you put this in the correct place in your YAML, i.e. under the pod template's spec (spec.template.spec of the Deployment), or else it won't work.
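
To make the placement concrete, here is a minimal sketch of where the block sits relative to the Deployment spec. It reuses the label from the question; the replica count is only an example, and everything else a real manifest needs (selector, containers, volumes) is left out.

```yaml
spec:                          # Deployment spec
  replicas: 3                  # example value
  # An affinity block at this level is outside the pod template,
  # which is where the question's first manifest placed podAntiAffinity.
  template:
    metadata:
      labels:
        test.service.es-master: "true"
    spec:                      # pod spec: affinity belongs here
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:        # nested under labelSelector, not beside it
              - key: test.service.es-master
                operator: In
                values:
                - "true"
            topologyKey: kubernetes.io/hostname
```

With a required rule like this, a replica that cannot find a node without a matching pod would be expected to stay Pending rather than share a node.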

-- John
Source: StackOverflow

1/14/2019

Firstly, both your initial manifest and the updated manifest use topologyKey under nodeAffinity. That will give you an error when you deploy those manifests with kubectl create or kubectl apply, because nodeAffinity has no API field called topologyKey (ref doc).
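
For illustration, a nodeAffinity block without topologyKey might look like the sketch below. The label key and value are taken from the question; the fragment is assumed to sit under spec.template.spec of the Deployment.

```yaml
# Pod spec fragment (assumed location: spec.template.spec).
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: test.service.es-master
          operator: In
          values:
          - "true"
      # No topologyKey here: it is a field of podAffinity and
      # podAntiAffinity terms, not of node selector terms.
```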

Secondly, you are using a key called test.service.es-master for your nodeAffinity. Are you sure your nodes have that label? Please confirm with this command: kubectl get nodes --show-labels

Lastly, building on @Laszlo's answer and @bitswazsky's comment on it, you can simplify things with the code below.

Here I have used a node label with the key role to identify the node. You can add it to a node in your existing cluster by running this command: kubectl label nodes <node-name> role=platform

  selector:
    matchLabels:
      component: nginx
  template:
    metadata:
      labels:
        component: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: role
                operator: In
                values:
                - platform
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: component
                operator: In
                values:
                - nginx
            topologyKey: kubernetes.io/hostname
-- garlicFrancium
Source: StackOverflow

1/14/2019

This works for me with Kubernetes 1.11.5:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      test.service.es-master: "true"
  template:
    metadata:
      labels:
        test.service.es-master: "true"
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: test.service.es-master
                operator: In
                values:
                - "true"
            topologyKey: kubernetes.io/hostname
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: test.service.es-master
                operator: In
                values:
                  - "true"
      containers:
      - image: nginx:1.7.10
        name: nginx

I don't know why you chose the same key/value for the pod label used by the Deployment selector as for the node selector. That is confusing, to say the least...
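
One way to separate the two, sketched under assumptions: keep test.service.es-master: "true" as the node label (from the question) and give the pods a different label for the selector and the anti-affinity term. The pod label app: es-data, the Deployment name, and the image below are hypothetical placeholders, not taken from the question's chart.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: es-data                      # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: es-data                   # hypothetical pod label
  template:
    metadata:
      labels:
        app: es-data
    spec:
      affinity:
        nodeAffinity:                # which nodes are eligible (node label from the question)
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: test.service.es-master
                operator: In
                values:
                - "true"
        podAntiAffinity:             # no two pods with the pod label on one hostname
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - es-data
            topologyKey: kubernetes.io/hostname
      containers:
      - name: elasticsearch
        image: example.registry/elasticsearch:placeholder   # hypothetical image
```

Keeping the node label and the pod label distinct makes it easier to see which rule selects nodes and which one keeps pods apart.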

-- Laszlo Valko
Source: StackOverflow