I'm looking at deploying Kubernetes on top of a CoreOS cluster, but I think I've run into a deal breaker of sorts.
If I'm using just CoreOS and fleet, I can specify within the unit files that certain services must not run on the same physical machine as other services (anti-affinity). This is more or less essential for high availability. But it doesn't look like Kubernetes has this functionality yet.
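With fleet, that looks something like this (a sketch; the es@.service unit name and the command are made up):

[Unit]
Description=Elasticsearch node %i

[Service]
ExecStart=/usr/bin/docker run --name es-%i elasticsearch

[X-Fleet]
# never schedule two instances of this template on the same machine
Conflicts=es@*.service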
In my specific use case, I'm going to need to run a few clusters of elasticsearch machines that need to always be available. If, for any reason, Kubernetes decides to schedule all of my elasticsearch node containers for a given ES cluster on a single machine (or even the majority on a single machine), and that machine dies, then my elasticsearch cluster will die with it. That can't be allowed to happen.
It seems like there could be workarounds. I could set up the resource requirements and machine specs such that only one elasticsearch instance could fit on each machine. Or I could probably use labels in some way to specify that certain elasticsearch containers should go on certain machines. I could also just provision way more machines than necessary, and way more ES nodes than necessary, and assume Kubernetes will spread them out enough to be reasonably certain of high availability.
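For example, the resource-sizing workaround might look something like this (a sketch using the current resource-request syntax; the image name and numbers are made up, and the requests would have to nearly fill each machine):

apiVersion: v1
kind: Pod
metadata:
  name: es-node
spec:
  containers:
  - name: elasticsearch
    image: elasticsearch       # placeholder image
    resources:
      requests:
        memory: "28Gi"         # sized so a second ES pod can't fit on a 32Gi node
        cpu: "3500m"           # likewise for CPU on a 4-core machine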
But all of that seems awkward. It's much more elegant from a resource-management standpoint to just specify required hardware and anti-affinity, and let the scheduler optimize from there.
So does Kubernetes support anti-affinity in some way I couldn't find? Or does anyone know if it will any time soon?
Or should I be thinking about this another way? Do I have to write my own scheduler?
Anti-affinity is when you don't want certain pods to run on certain nodes. For the present scenario, I think taints and tolerations can come in handy. You can mark a node with a taint; then only pods that explicitly declare a toleration for that taint are allowed to run on that specific node.
Below I describe how the anti-affinity concept can be implemented:
Taint any node you want.
kubectl taint nodes gke-cluster1-default-pool-4db7fabf-726z env=dev:NoSchedule
Here, env=dev is the key:value pair of the taint (it works much like a label on the node), and NoSchedule is the effect, which says not to schedule any pod on this node unless it has a matching toleration.
Create a deployment
kubectl run newdep1 --image=nginx
Add a toleration for that taint to the deployment's pod template. Labeling the deployment object alone is not enough: a label is not a toleration, and a toleration only permits scheduling on the tainted node, it doesn't force it. To actually pull the pods onto that node, also label the node itself and add a matching nodeSelector to the pod template:
kubectl label nodes gke-cluster1-default-pool-4db7fabf-726z env=dev
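The pod template changes would look roughly like this (a sketch; apply it with kubectl edit deployments/newdep1 or an equivalent patch):

spec:
  template:
    spec:
      nodeSelector:
        env: dev               # matches the label we put on the node
      tolerations:
      - key: "env"
        operator: "Equal"
        value: "dev"
        effect: "NoSchedule"   # matches the taint, so scheduling is permitted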
kubectl scale deployments/newdep1 --replicas=5
kubectl get pods -o wide
Running this, you should see all pods of newdep1 land on the tainted node, while pods without the toleration are kept off it.
Looks like there are a few ways that kubernetes decides how to spread containers, and these are in active development.
Firstly, of course there have to be the necessary resources on any machine for the scheduler to consider bringing up a pod there.
After that, kubernetes spreads pods by replication controller, attempting to keep the different instances created by a given replication controller on different nodes.
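For example, with a replication controller like the one below (a sketch; the names and image are placeholders, and I'm using the current API version for readability), that spreading behavior should try to put the three replicas on three different nodes:

apiVersion: v1
kind: ReplicationController
metadata:
  name: es-cluster1
spec:
  replicas: 3
  selector:
    app: es-cluster1
  template:
    metadata:
      labels:
        app: es-cluster1
    spec:
      containers:
      - name: elasticsearch
        image: elasticsearch   # placeholder image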
It seems a method of scheduling that considers services and various other parameters was recently implemented: https://github.com/GoogleCloudPlatform/kubernetes/pull/2906. Though I'm not completely clear on exactly how to use it, perhaps it's meant to be used in coordination with this scheduler config: https://github.com/GoogleCloudPlatform/kubernetes/pull/4674
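I haven't tried it, but from skimming the scheduler code, that config appears to be a policy file along these lines (a sketch; the predicate and priority names are taken from the scheduler source and may well change):

{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {"name": "PodFitsResources"},
    {"name": "MatchNodeSelector"}
  ],
  "priorities": [
    {"name": "ServiceSpreadingPriority", "weight": 1},
    {"name": "LeastRequestedPriority", "weight": 1}
  ]
}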
Probably the most interesting issue to me is that none of these scheduling priorities are considered during scale-down, only scale-up: https://github.com/GoogleCloudPlatform/kubernetes/issues/4301. That's a bit of a big deal; it seems like over time you could end up with weird distributions of pods, because they stay wherever they were originally placed.
Overall, I think the answer to my question at the moment is that this is an area of kubernetes that is in flux (as to be expected with pre-v1). However, it looks like much of what I need will be done automatically with sufficient nodes, and proper use of replication controllers and services.