Our group has recently set up a 3-node Kubernetes cluster, and we've been using Jobs to schedule batch processing tasks on it. We have a lot of work to do and not a particularly large cluster to do it on, so at any given time there are a bunch of "pending" pods waiting to run on the cluster.
These pods have different resource requests; some are much larger than others. For example, some pods need 4 GB of RAM and some need 100 GB.
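For reference, here is roughly what one of our Job specs looks like. The names, image, and exact values are illustrative (memory quantities shown with Kubernetes' binary suffixes); the only thing that really differs between the small and large jobs is the memory request:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-task-large        # illustrative name; the small variant is identical apart from memory
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never      # Jobs require Never or OnFailure
      containers:
        - name: worker
          image: registry.example.com/batch-worker:latest   # placeholder image
          resources:
            requests:
              memory: "100Gi"   # the small jobs request "4Gi" here instead
            limits:
              memory: "100Gi"
```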
The problem we are having is that our large pods never actually run as long as there are enough small pods available to keep the cluster busy. As soon as one 4 GB pod finishes, Kubernetes sees that another 4 GB pod will fit while a 100 GB pod won't, and schedules a new 4 GB pod. It never seems to decide that a 100 GB pod has been waiting long enough and hold off scheduling new pods on a particular node until enough have finished that the 100 GB pod will fit there. Perhaps it can't tell that our pods come from Jobs and are expected to eventually finish, unlike, say, a web server.
How can Kubernetes be configured so that small pods cannot starve big pods indefinitely? Is there a third-party scheduler with this behavior that we need to add to our installation, or is there some way to configure the default scheduler to avoid it?