MongoDB doesn't scale properly when adding a new shard with the collection already filled

8/10/2019

My MongoDB sharded cluster's ingestion performance doesn't scale up when I add a new shard.

I have a small cluster setup with 1 mongos + 1 config replica set (3 nodes) + N shard replica sets (3 nodes each).

Mongos is on a dedicated Kubernetes node, and each mongod process hosting a shard has its own dedicated k8s node, while the config server processes run wherever they happen to be deployed.

The cluster is used mainly for GridFS file hosting, with a typical file being around 100 MB.

I am doing stress tests with 1, 2 and 3 shards to see if it scales properly, and it doesn't.

If I start a brand new cluster with 2 shards and run my test, it ingests files at roughly twice the speed I had with 1 shard. But if I start the cluster with 1 shard, run the test, add 1 more shard (total 2 shards), and then run the test again, the ingestion speed is roughly the same as before with 1 shard.

Looking at where chunks go: when I start the cluster with 2 shards from the outset, the load is evenly balanced between the shards. If I start with 1 shard and add a second one after some insertions, the chunks tend to all land on the old shard, and the balancer has to migrate them to the second shard later.

Quick facts:

  • chunk size: 1024 MB

  • shard key: GridFS file_id, hashed
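
For reference, the setup above maps to commands along these lines (a sketch; the database/collection names are hypothetical, and in a standard GridFS deployment the chunks collection field is files_id):

    // Raise the chunk size to 1024 MB (config.settings stores the value in MB)
    use config
    db.settings.updateOne(
        { _id: "chunksize" },
        { $set: { value: 1024 } },
        { upsert: true }
    )

    // Shard the GridFS chunks collection on the hashed file id
    sh.shardCollection("mydb.fs.chunks", { files_id: "hashed" })

    // Inspect where the chunks ended up
    sh.status()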

-- Federico Bonelli
gridfs
kubernetes
mongodb
sharding

1 Answer

8/13/2019

This is due to how hashed sharding and balancing work.

In an empty collection (from Shard an Empty Collection):

The sharding operation creates empty chunks to cover the entire range of the shard key values and performs an initial chunk distribution. By default, the operation creates 2 chunks per shard and migrates across the cluster.

So if you execute sh.shardCollection() on a cluster with x shards, it will create 2 chunks per shard and distribute them across the shards, totalling 2x chunks across the cluster. Since the collection is empty, moving the chunks around takes little time. Your ingestion will then be distributed evenly across the shards (assuming, among other things, good cardinality of the hashed field).
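
For illustration (namespace hypothetical), the pre-split for a hashed key can also be sized explicitly with the numInitialChunks option, and the resulting distribution verified afterwards:

    // Shard an empty collection with a hashed key. On an N-shard cluster this
    // pre-splits into 2 chunks per shard by default; numInitialChunks overrides it.
    sh.shardCollection("mydb.fs.chunks", { files_id: "hashed" }, false,
                       { numInitialChunks: 8 })

    // Check how chunks and data are spread across the shards
    db.getSiblingDB("mydb").fs.chunks.getShardDistribution()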

Now if you add a new shard after those chunks were created, that shard starts out empty, and the balancer will begin sending chunks to it according to the Migration Thresholds. In a populated collection, this process may take a while to finish.
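
To watch the migration progress, one can count chunks per shard from the config database (a sketch; in MongoDB versions of that era, config.chunks keys entries by the ns field):

    // Count chunks per shard for the sharded namespace (hypothetical name)
    use config
    db.chunks.aggregate([
        { $match: { ns: "mydb.fs.chunks" } },
        { $group: { _id: "$shard", nChunks: { $sum: 1 } } }
    ])

When the counts stop changing and are roughly equal, the balancer has caught up.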

If you run another ingestion while the balancer is still moving chunks around (chunks which by now may not be empty), the cluster is doing two different jobs at the same time: 1) ingestion, and 2) balancing.

When you start with 1 shard and then add another, it's likely that the chunks you're ingesting into are still located on shard 1 and haven't moved to the new shard yet, so most data will go into that shard.

Thus you should wait until the cluster is balanced after adding the new shard before doing another ingestion. After it's balanced, the ingestion load should be more evenly distributed.
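
For example, in the legacy mongo shell (where sh.isBalancerRunning() returns a boolean) a rough wait loop could look like the sketch below; comparing the per-shard chunk counts from the aggregation above gives a stronger signal:

    // Rough check: wait for the balancer to go idle before the next test run
    while (sh.isBalancerRunning()) {
        print("balancer still migrating chunks, waiting...")
        sleep(60 * 1000)  // milliseconds
    }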

Note: since your shard key is file_id, I'm assuming that each file is approximately the same size (~100 MB). If some files are much larger than others, some chunks will be busier than others as well.
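
If in doubt, the file size distribution is easy to check from the GridFS files collection (a sketch; length is the standard GridFS size field, in bytes, and the database name is hypothetical):

    // Summarise file sizes stored in GridFS
    db.getSiblingDB("mydb").fs.files.aggregate([
        { $group: {
            _id: null,
            files:   { $sum: 1 },
            avgSize: { $avg: "$length" },
            maxSize: { $max: "$length" }
        } }
    ])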

-- kevinadi
Source: StackOverflow