I had planned to go with Service Fabric (on premises) for my service and container orchestration. But, due to internal discussions, I am giving Kubernetes a look. Mostly because it is so very popular.
Service Fabric has concepts called Upgrade Domains and Failure Domains. A "domain" is a grouping of host nodes.
Upgrade Domains are used when pushing out an application service or container update. Service Fabric makes sure that the upgrading service/container is still available by only taking down one Upgrade Domain at a time. (These are also used when updating the Service Fabric cluster software itself.)
Failure Domains work in a similar way. The idea is that the Failure Domains are created in alignment with hardware failure groups. Service Fabric makes sure that there are service/container instances running in each failure domain. (To allow for up time during a hardware failure.)
As I look at docs and listen to podcasts on Kubernetes I don't see any of these concepts. It seems it just hosts containers (Pods). I have heard a bit about "scheduling" and "tags". But it seems it is just the way to manually configure pods.
Are application upgrades and failure tolerance things that are done manually in Kubernetes? (via scheduling and/or tags perhaps)
Or is there a feature I am missing?
A "domain" is a grouping of host nodes.
It is not that simple; it would be more accurate to say "a 'domain' is a logical grouping of resources".
To understand it correctly, you first have to understand most of the components in isolation. I recommend reading the Service Fabric and Azure Virtual Machine documentation on these concepts first.
Then we can take some points out of it:
Nodes are not Virtual Machines; nodes run on top of Azure Virtual Machines.
They often have a 1:1 mapping, but in some cases you can have a 5:1 node-to-VM mapping; one example is when you install a local development cluster.
Azure Virtual Machines have Update Domains and Fault Domains; Service Fabric nodes have Upgrade Domains and Fault Domains.
As much as they look the same, they have their differences:
Fault Domains: a Fault Domain is a group of machines that share a common source of failure, such as the same power supply and network switch (typically a rack), so they are likely to go down together.
Update\Upgrade Domains: an Update\Upgrade Domain is a logical group of machines (or service instances) that are taken down and updated at the same time; an update rolls through the cluster one domain at a time.
Based on that, you can see FD & UD as a matrix of reliable deployment slots: the more of them you have, the higher the reliability (with trade-offs, like the time required for an update). The SF docs illustrate this with the cluster laid out as a grid of FDs and UDs.
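As a rough illustration (this is not the actual figure from the SF docs), nine nodes spread across three Fault Domains and three Upgrade Domains could be laid out like this, so that losing any single FD or UD only ever takes out a third of the nodes:

```
        UD0     UD1     UD2
FD0     N1      N4      N7
FD1     N2      N5      N8
FD2     N3      N6      N9
```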
Out of the box, Service Fabric tries to place your service instances on different FD\UDs on a best-effort basis; that means that, if possible, they will be on different FD\UDs, otherwise it will pick the FD\UD with the fewest instances of the service being deployed.
And about Kubernetes:
On Kubernetes, these features do not come out of the box. K8s has the concept of zones, but according to the docs, they are limited to a region; a cluster cannot span regions.
Kubernetes will automatically spread the pods in a replication controller or service across nodes in a single-zone cluster (to reduce the impact of failures). With multiple-zone clusters, this spreading behaviour is extended across zones (to reduce the impact of zone failures). This is achieved via SelectorSpreadPriority.
This is a best-effort placement, and so if the zones in your cluster are heterogeneous (e.g. different numbers of nodes, different types of nodes, or different pod resource requirements), this might prevent equal spreading of your pods across zones. If desired, you can use homogenous zones (same number and types of nodes) to reduce the probability of unequal spreading.
It is not the same as an FD, but it is a very similar concept.
To achieve a result similar to SF, you will need to deploy your cluster across zones or map the nodes to the VM FD\UDs, so that they behave like nodes on SF. Add labels to the nodes to identify these domains. You would also need to create NodeType labels on the nodes across the different FDs, so that you can use them to deploy your pods to a delimited set of nodes.
For example:
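A sketch of how the node labels could be applied, assuming illustrative label keys fault-domain and node-type and nodes named node1 through node3:

```sh
# Tag each node with the fault domain it sits in and the type of workload it should host
# (label keys and values are illustrative; pick names that fit your cluster)
kubectl label nodes node1 fault-domain=fd1 node-type=frontend
kubectl label nodes node2 fault-domain=fd2 node-type=frontend
kubectl label nodes node3 fault-domain=fd3 node-type=frontend
```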
When you deploy your application, you should make use of the affinity feature to assign Pods to nodes, and in this case your service would have:
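A minimal sketch of what that could look like, assuming the illustrative fault-domain and node-type labels from above and a hypothetical service called my-service:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      # Only schedule onto nodes of the desired NodeType
      nodeSelector:
        node-type: frontend
      affinity:
        podAntiAffinity:
          # Prefer not to co-locate replicas within the same fault domain
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: my-service
              topologyKey: fault-domain
      containers:
      - name: my-service
        image: myregistry/my-service:1.0   # illustrative image
```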
With these settings, using affinity and anti-affinity, k8s will try to place replicas\instances of your container on separate nodes, and the nodes will already be separated by FD\zone, delimited by the NodeType labels; k8s will then handle rolling updates much as SF does.
Because the anti-affinity rules are preferred rather than required, k8s will try to balance across these nodes on a best-effort basis; if no valid nodes are available, it will start adding more instances to nodes that already contain instances of the same container.
Conclusion
It is a bit of extra work, but not much different from what is currently used with other solutions. The major concern here will be configuring the nodes into FDs\zones; once you place your nodes in the right FD, the rest will work smoothly.
On SF you don't have to worry about this when you deploy a cluster on Azure, but if you do it from scratch, it is a lot of work, even more than with k8s.
NOTE: If you use AKS, it will distribute the nodes across availability sets (a set that specifies VM fault domains and update domains). Currently, according to this post, AKS does not provide zone distribution for you, so you would have to do it from scratch if you need that level of distribution.
Those abstractions don't currently exist in kubernetes, though the desired behaviors can often be achieved in an automated fashion.
The meta-model for Kubernetes involves agents (called Controllers and Operators) continuously watching events and configuration on the cluster and gradually reconciling cluster state with the Controllers' declarative configuration. The sudden loss of a Node hosting Pods will result in the IPs corresponding to the lost Pods being removed from Services, and in the ReplicationControllers for those Pods spinning up replacements on other Nodes, while ensuring that co- and anti-scheduling constraints are met.
Similarly, application upgrades usually occur through changes to a Deployment, which results in new Pods being scheduled and old Pods being unscheduled in an automated, gradual manner.
Custom declarative configurations are now possible with CustomResourceDefinitions, so this model is extensible. The underlying primitives and machinery are there for someone to introduce top level declarative abstractions like FailureDomains and UpgradeDomains, managed by custom Operators.
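As a sketch of what that could look like (a hypothetical resource, not an existing project), a custom FailureDomain type could be declared like this and then reconciled by a purpose-built Operator:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: failuredomains.example.com    # hypothetical group and name
spec:
  group: example.com
  scope: Cluster
  names:
    kind: FailureDomain
    singular: failuredomain
    plural: failuredomains
  versions:
  - name: v1alpha1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              # which nodes belong to this domain
              nodeSelector:
                type: object
                additionalProperties:
                  type: string
              # how many instances of each labelled workload the Operator should keep here
              minReplicasPerWorkload:
                type: integer
```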
The kube ecosystem is so enormous and moving so quickly that something like this will likely emerge, and will also likely be met by competitor concepts.
Bottom line for a plant owner considering adoption is that Kubernetes is really still a toolsmith's world. There are an enormous number of tools, and a similarly enormous amount of unfinished product.
Process Health Checking
The simplest form of health-checking is just process level health checking. The Kubelet constantly asks the Docker daemon if the container process is still running, and if not, the container process is restarted. In all of the Kubernetes examples you have run so far, this health checking was actually already enabled. It’s on for every single container that runs in Kubernetes.
Kubernetes supports user implemented application health-checks. These checks are performed by the Kubelet to ensure that your application is operating correctly for a definition of “correctly” that you provide.
Currently, there are three types of application health checks that you can choose from:
HTTP Health Checks - The Kubelet will call a web endpoint inside your container. If the HTTP status code returned is between 200 and 399, the container is considered healthy; anything else is considered a failure.
Container Exec - The Kubelet will execute a command inside your container. If it exits with status code 0, the container is considered healthy; otherwise it is considered a failure.
TCP Socket - The Kubelet will attempt to open a socket to your container. If it can establish a connection, the container is considered healthy; if it can't, it is considered a failure.
In all cases, if the Kubelet discovers a failure the container is restarted.
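For example, a minimal liveness probe using the TCP Socket check might look like this (name, image, port, and timings are all illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: my-app
    image: myregistry/my-app:1.0
    livenessProbe:
      tcpSocket:
        port: 8080            # the Kubelet tries to open a TCP connection to this port
      initialDelaySeconds: 15 # give the app time to start before the first probe
      periodSeconds: 10       # probe every 10 seconds; repeated failures restart the container
```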
If the Status of the Ready condition [of a node] is “Unknown” or “False” for longer than the pod-eviction-timeout (an argument passed to the kube-controller-manager), all of the Pods on the node are scheduled for deletion by the Node Controller. The default eviction timeout duration is five minutes. In some cases when the node is unreachable, the apiserver is unable to communicate with the kubelet on it. The decision to delete the pods cannot be communicated to the kubelet until it re-establishes communication with the apiserver. In the meantime, the pods which are scheduled for deletion may continue to run on the partitioned node.
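If you run your own control plane, that timeout is a flag on the kube-controller-manager (managed offerings generally do not expose it); for example:

```sh
# Evict Pods from unreachable nodes after 1 minute instead of the default 5m0s
# (remaining controller-manager flags omitted)
kube-controller-manager --pod-eviction-timeout=1m0s ...
```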
Users expect applications to be available all the time and developers are expected to deploy new versions of them several times a day. In Kubernetes this is done with rolling updates. Rolling updates allow Deployments' update to take place with zero downtime by incrementally updating Pods instances with new ones. The new Pods will be scheduled on Nodes with available resources.
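As a sketch, the Deployment spec lets you bound how aggressive that rollout is, which is roughly the knob that plays the role of an Upgrade Domain in SF (the values below are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # at most one Pod may be unavailable during the rollout
      maxSurge: 1         # at most one extra Pod may be created above the desired count
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
      - name: my-service
        image: myregistry/my-service:2.0   # changing this image triggers the rolling update
```

You can then watch the rollout with kubectl rollout status deployment/my-service and roll it back with kubectl rollout undo deployment/my-service.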