How to prevent data inconsistency when one node loses network connectivity in Kubernetes

1/5/2022

I have a situation where I have a cluster with a service (we'll call it A1) whose data lives on remote storage, CephFS in my case. The replica count for my service is 1. Assume I have 5 nodes in my cluster and service A1 runs on node 1. Something happens to node 1's network and it loses connectivity with the CephFS cluster and with my Kubernetes cluster as well (or Docker Swarm). The cluster marks the node as unreachable and starts a new instance of the service (we'll call it A2) on node 2 to keep the replica count at 1. After, say, 15 minutes node 1's network is fixed, node 1 rejoins the cluster, and it still has service A1 running (assume it didn't crash while it had lost connectivity with the remote storage).

I worked with Docker Swarm and recently switched to Kubernetes. I see Kubernetes has a feature called StatefulSet, but from what I've read it doesn't answer my question (or maybe I missed something when I read about it).

Question A: What does the cluster do? Does it keep A2 and shut down A1, or let A1 keep working and shut down A2? (Logically it should shut down A1.)

Question B (and my primary question as well!): Assume the cluster wants to shut down one of these services (for example A1). This service saves some data to storage when it shuts down. In that case A1 would write its stale state to disk, while A2 had already saved newer state before A1's network was fixed. There should be some lock taken when the volume is mounted into the container, so that while it is attached to one container no other container can write to it (so that A1 fails when it tries to save its old state to disk).
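
What I have in mind is something like the advisory-lock sketch below. The paths and the whole approach are hypothetical, it is purely application-level, and it assumes the filesystem actually enforces advisory locks across nodes; what I'm asking is whether the orchestrator or the volume layer can enforce something like this for me.

```python
import fcntl
import os

STATE_FILE = "/mnt/cephfs/a1/state.dat"   # hypothetical paths on the shared mount
LOCK_FILE = "/mnt/cephfs/a1/state.lock"

def save_state(data: bytes) -> None:
    # Hold an exclusive, non-blocking advisory lock while writing, so a stale
    # instance (A1) fails instead of silently overwriting A2's newer state.
    lock_fd = os.open(LOCK_FILE, os.O_CREAT | os.O_RDWR)
    try:
        try:
            fcntl.flock(lock_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            raise RuntimeError("state file is locked by another instance; refusing to write")
        with open(STATE_FILE, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
    finally:
        os.close(lock_fd)
```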

-- Mohammad Amin
ceph
docker-swarm
high-availability
kubernetes
remotestorage

1 Answer

1/6/2022

The way it works - using docker swarm terminology - is as follows.

You have a service. A service is a description of some image you'd like to run, how many replicas, and so on. Assuming the service specifies that at least 1 replica should be running, it will create a task, which in turn schedules a container on a swarm node. So a service is associated with 0 to many tasks, and each task has either 0 containers (if it is still starting) or 1 container (if the task is running or stopped), which lives on a node.
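
As a mental model only (not the real swarm API), the relationship looks something like this:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Toy model of the objects described above -- illustration only, not swarm's API.

@dataclass
class Container:
    node: str              # the swarm node the container was scheduled on
    running: bool = True

@dataclass
class Task:
    # 0 containers while the task is still starting, exactly 1 once it is
    # running or has stopped.
    container: Optional[Container] = None
    state: str = "new"     # e.g. "new", "running", "shutdown"

@dataclass
class Service:
    image: str
    desired_replicas: int
    tasks: List[Task] = field(default_factory=list)
```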

So, when swarm (the orchestrator) detects a node going offline, what it principally sees is that a number of tasks associated with a service have lost their containers, so the replication (in terms of running tasks) is no longer correct for the service, and it creates new tasks, which in turn schedule new containers on the available nodes.
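
A toy sketch of that reconciliation idea (hypothetical names, nothing like swarm's real scheduler):

```python
# Illustration only: create tasks until the number of running tasks matches
# the desired replica count.
def reconcile(service: dict, reachable_nodes: list) -> None:
    running = [t for t in service["tasks"] if t["state"] == "running"]
    for _ in range(service["desired_replicas"] - len(running)):
        node = reachable_nodes[0]  # a real scheduler picks a node far more carefully
        service["tasks"].append({"state": "running", "node": node})

# Example: node1 dropped off, so its task is no longer counted as running.
svc = {"desired_replicas": 1, "tasks": [{"state": "orphaned", "node": "node1"}]}
reconcile(svc, reachable_nodes=["node2", "node3"])
# svc now also contains a running task on node2 -- that is "A2".
```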

On the disconnected node, the swarm worker notices that it has lost its connection to the swarm managers, so it cleans up all the tasks it is holding onto, as it no longer has current information about them. In the process of cleaning up the tasks, the associated containers get stopped.
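
Sketched out, that worker-side behaviour amounts to something like this (again an illustration, not the real agent code):

```python
def on_managers_unreachable(local_tasks: list, stop_container) -> None:
    # The agent no longer has authoritative information about these tasks,
    # so it stops their containers and forgets about them.
    for task in local_tasks:
        stop_container(task["container_id"])
    local_tasks.clear()

# Hypothetical usage on the disconnected node: "A1" gets stopped locally.
tasks = [{"container_id": "a1"}]
on_managers_unreachable(tasks, stop_container=lambda cid: print("stopping", cid))
```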

This is good because, when the node finally reconnects, there is no race condition with two tasks running: only "A2" is running and "A1" has been shut down. It is bad if nodes lose connectivity to the managers frequently but you need the services to keep running on those nodes regardless, as they will be shut down each time the workers detach.

The process on K8s is pretty much the same; just change the terminology.
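
If you want to watch the equivalent happen on Kubernetes, here is a small sketch with the official Python client; the app=a1 label and the default namespace are assumptions about how the workload is set up.

```python
# pip install kubernetes -- lists the pods behind the workload and where they run.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("default", label_selector="app=a1").items:
    print(pod.metadata.name,
          pod.spec.node_name,
          pod.status.phase,
          "terminating" if pod.metadata.deletion_timestamp else "")
# While node1 is unreachable you would typically see the old pod stuck in
# Terminating on node1 and (for a Deployment) a replacement pod on another node.
```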

-- Chris Becke
Source: StackOverflow