Sharded MongoDB stalls randomly

8/5/2019

I have set up a sharded MongoDB cluster using hashed sharding in Kubernetes. I first created the config server replica set and then created 2 shard replica sets. Finally, I created a mongos to connect to the sharded cluster.

I followed this tutorial to set up the sharded cluster: https://docs.mongodb.com/manual/tutorial/deploy-sharded-cluster-hashed-sharding/

After creating the mongos, I enabled sharding for the database and sharded the collection using the hashed sharding strategy.

After all this setup, I'm able to connect to mongos, add data to some of the collections in the database, and check the distribution of data across the shards.

The issue I'm facing is that when I access MongoDB from my Java Spring Boot project, the connection stalls randomly. Once the connection is established for a particular query, that query won't stall for the next few tries. After some idle time, if I make a request to MongoDB again, it starts to stall again.

Note: MongoDB is hosted on "DS2 v2" VMs, and this cluster has 4 nodes: 1 for the config server, 2 for the shards, and 1 for mongos.

  • One of the links I found recommended setting a proper shard key for every collection, since this affects MongoDB performance. There are a couple of things to consider before selecting the right shard key, and I considered all of those factors before choosing one. I used this article to select the shard key: https://www.mongodb.com/blog/post/on-selecting-a-shard-key-for-mongodb

  • Another solution I came across was to set ShardingTaskExecutorPoolMaxConnecting to limit the rate at which mongos nodes add connections to their connection pools. I tried setting it to 20, 5, 100, and 150, and none of these values resolved the stalling issue. This is the link: https://jira.mongodb.org/browse/SERVER-29237

  • I tried tweaking other parameters such as ShardingTaskExecutorPoolMinSize and taskExecutorPoolSize. This did not resolve the stalling issue either.

  • I also set --serviceExecutor to adaptive.

  • I increased wiredTigerCacheSizeGB from 0.25 to 2. This also didn't make any difference to the stalling issue.
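
For reference, the mongos-side settings from the list above could be passed on the mongos command line roughly as follows. This is a minimal sketch of the container `command` list only, not the manifest that was actually deployed. The value 20 is one of the values mentioned above; the values for ShardingTaskExecutorPoolMinSize and taskExecutorPoolSize are assumptions chosen for illustration.

```yaml
# Sketch only: fragment of the mongos container spec, not a complete manifest.
# Parameter values are illustrative assumptions, not the ones actually deployed.
command:
  - "mongos"
  - "--port"
  - "27017"
  - "--bind_ip"
  - "0.0.0.0"
  - "--configdb"
  - "ConfigDBRepSet/mongo-conf-service:27017"
  # Threading model option mentioned in the list above.
  - "--serviceExecutor"
  - "adaptive"
  # Limit how many connections each pool may be establishing at the same time.
  - "--setParameter"
  - "ShardingTaskExecutorPoolMaxConnecting=20"
  # Minimum number of connections kept per pool (value assumed).
  - "--setParameter"
  - "ShardingTaskExecutorPoolMinSize=10"
  # Number of task executor connection pools (value assumed).
  - "--setParameter"
  - "taskExecutorPoolSize=4"
```

Note that wiredTigerCacheSizeGB is a mongod option, so it belongs in the config server and shard manifests rather than in the mongos one.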

1) The YAML file for the Service and Deployment of the MongoDB config server:

apiVersion: v1
items:
- apiVersion: v1
  kind: Service
  metadata:
    annotations:
      kompose.cmd: kompose convert -d -f docker-compose.yml -o azure-deployment.yaml
      kompose.version: 1.12.0 (0ab07be)
    creationTimestamp: null
    labels:
      io.kompose.service: mongo-conf-service
    name: mongo-conf-service
  spec:
    type: LoadBalancer
    ports:
    - name: "27017"
      port: 27017
      targetPort: 27017
    selector:
      io.kompose.service: mongo-conf-service
  status:
    loadBalancer: {}
- apiVersion: extensions/v1beta1
  kind: Deployment
  metadata:
    annotations:
      kompose.cmd: kompose convert -d -f docker-compose.yml -o azure-deployment.yaml
      kompose.version: 1.12.0 (0ab07be)
    creationTimestamp: null
    labels:
      io.kompose.service: mongo-conf-service
    name: mongo-conf-service
  spec:
    replicas: 1
    strategy: {}
    template:
      metadata:
        creationTimestamp: null
        labels:
          io.kompose.service: mongo-conf-service
      spec:
        containers:
        - env:
          - name: MONGO_INITDB_ROOT_USERNAME
            value: #Username
          - name: MONGO_INITDB_ROOT_PASSWORD
            value: #Password
          command:
          - "mongod"
          - "--storageEngine"
          - "wiredTiger"
          - "--port"
          - "27017"
          - "--bind_ip"
          - "0.0.0.0"
          - "--wiredTigerCacheSizeGB"
          - "2"
          - "--configsvr"
          - "--replSet"
          - "ConfigDBRepSet"
          image: #MongoImageName
          name: mongo-conf-service
          ports:
          - containerPort: 27017
          resources: {}
          volumeMounts:
          - name: mongo-conf
            mountPath: /data/db
        restartPolicy: Always
        volumes:
          - name: mongo-conf
            persistentVolumeClaim:
              claimName: mongo-conf

2) The YAML file for the Service and Deployment of a MongoDB shard:

apiVersion: v1
items:
- apiVersion: v1
  kind: Service
  metadata:
    annotations:
      kompose.cmd: kompose convert -d -f docker-compose.yml -o azure-deployment.yaml
      kompose.version: 1.12.0 (0ab07be)
    creationTimestamp: null
    labels:
      io.kompose.service: mongo-shard
    name: mongo-shard
  spec:
    type: LoadBalancer
    ports:
    - name: "27017"
      port: 27017
      targetPort: 27017
    selector:
      io.kompose.service: mongo-shard
  status:
    loadBalancer: {}
- apiVersion: extensions/v1beta1
  kind: Deployment
  metadata:
    annotations:
      kompose.cmd: kompose convert -d -f docker-compose.yml -o azure-deployment.yaml
      kompose.version: 1.12.0 (0ab07be)
    creationTimestamp: null
    labels:
      io.kompose.service: mongo-shard
    name: mongo-shard
  spec:
    replicas: 1
    strategy: {}
    template:
      metadata:
        creationTimestamp: null
        labels:
          io.kompose.service: mongo-shard
      spec:
        containers:
        - env:
          - name: MONGO_INITDB_ROOT_USERNAME
            value: #Username
          - name: MONGO_INITDB_ROOT_PASSWORD
            value: #Password
          command:
          - "mongod"
          - "--storageEngine"
          - "wiredTiger"
          - "--port"
          - "27017"
          - "--bind_ip"
          - "0.0.0.0"
          - "--wiredTigerCacheSizeGB"
          - "2"
          - "--shardsvr"
          - "--replSet"
          - "Shard1RepSet"
          image: #MongoImage
          name: mongo-shard
          ports:
          - containerPort: 27017
          resources: {}

3) The YAML file for the mongos server:

apiVersion: v1
items:
- apiVersion: v1
  kind: Service
  metadata:
    annotations:
      kompose.cmd: kompose convert -d -f docker-compose.yml -o azure-deployment.yaml
      kompose.version: 1.12.0 (0ab07be)
    creationTimestamp: null
    labels:
      io.kompose.service: mongos-service
    name: mongos-service
  spec:
    type: LoadBalancer
    ports:
    - name: "27017"
      port: 27017
      targetPort: 27017
    selector:
      io.kompose.service: mongos-service
  status:
    loadBalancer: {}
- apiVersion: extensions/v1beta1
  kind: Deployment
  metadata:
    annotations:
      kompose.cmd: kompose convert -d -f docker-compose.yml -o azure-deployment.yaml
      kompose.version: 1.12.0 (0ab07be)
    creationTimestamp: null
    labels:
      io.kompose.service: mongos-service
    name: mongos-service
  spec:
    replicas: 1
    strategy: {}
    template:
      metadata:
        creationTimestamp: null
        labels:
          io.kompose.service: mongos-service
      spec:
        containers:
        - env:
          - name: MONGO_INITDB_ROOT_USERNAME
            value: #USername
          - name: MONGO_INITDB_ROOT_PASSWORD
            value: #Password
          command:
            - "numactl"
            - "--interleave=all"
            - "mongos"
            - "--port"
            - "27017"
            - "--bind_ip"
            - "0.0.0.0"
            - "--configdb"
            - "ConfigDBRepSet/mongo-conf-service:27017"
          image: #MongoImageName
          name: mongos-service
          ports:
          - containerPort: 27017
          resources: {}

Logs of the mongos server:

2019-08-05T05:27:52.942+0000 I NETWORK  [listener] connection accepted from 10.0.0.0:5058 #308807 (79 connections now open)
2019-08-05T05:27:52.964+0000 I ACCESS   [conn308807] Successfully authenticated as principal Assist_Random_Workspace on Random_Workspace from client 10.0.0.0:5058
2019-08-05T05:27:54.267+0000 I NETWORK  [worker-3] end connection 10.0.0.0:52954 (78 connections now open)
2019-08-05T05:27:54.269+0000 I NETWORK  [listener] connection accepted from 10.0.0.0:52988 #308808 (79 connections now open)
2019-08-05T05:27:54.275+0000 I NETWORK  [listener] connection accepted from 10.0.0.0:7174 #308809 (80 connections now open)
2019-08-05T05:27:54.279+0000 I ACCESS   [conn308809] SASL SCRAM-SHA-1 authentication failed for Assist_Refactored_Code_DB on Refactored_Code_DB from client 10.0.0.:7174 ; UserNotFound: User "Assist_Refactored_Code_DB@Refactored_Code_DB" not found
2019-08-05T05:27:54.281+0000 I NETWORK  [worker-1] end connection 10.0.0.5:7174 (79 connections now open)
2019-08-05T05:27:54.342+0000 I NETWORK  [worker-1] end connection 10.0.0.6:57391 (78 connections now open)
2019-08-05T05:27:54.343+0000 I NETWORK  [listener] connection accepted from 10.0.0.0:57527 #308810 (79 connections now open)
2019-08-05T05:27:55.080+0000 I NETWORK  [worker-3] end connection 10.0.0.0:56021 (78 connections now open)
2019-08-05T05:27:55.081+0000 I NETWORK  [listener] connection accepted from 10.0.0.0:56057 #308811 (79 connections now open)
2019-08-05T05:27:56.054+0000 I NETWORK  [worker-1] end connection 10.0.0.0:59137 (78 connections now open)
2019-08-05T05:27:56.055+0000 I NETWORK  [listener] connection accepted from 10.0.0.0:59184 #308812 (79 connections now open)
2019-08-05T05:27:59.268+0000 I NETWORK  [worker-1] end connection 10.0.0.5:52988 (78 connections now open)
2019-08-05T05:27:59.270+0000 I NETWORK  [listener] connection accepted from 10.0.0.0:53047 #308813 (79 connections now open)
2019-08-05T05:27:59.343+0000 I NETWORK  [worker-3] end connection 10.0.0.6:57527 (78 connections now open)
2019-08-05T05:27:59.344+0000 I NETWORK  [listener] connection accepted from 10.0.0.0:57672 #308814 (79 connections now open)
2019-08-05T05:28:00.080+0000 I NETWORK  [worker-3] end connection 10.0.1.1:56057 (78 connections now open)
2019-08-05T05:28:00.081+0000 I NETWORK  [listener] connection accepted from 10.0.0.0:56116 #308815 (79 connections now open)
2019-08-05T05:28:01.054+0000 I NETWORK  [worker-3] end connection 10.0.0.0:59184 (78 connections now open)
2019-08-05T05:28:01.058+0000 I NETWORK  [listener] connection accepted from 10.0.0.0:59225 #308816 (79 connections now open)
2019-08-05T05:28:01.763+0000 I NETWORK  [listener] connection accepted from 10.0.0.0:7173 #308817 (80 connections now open)
2019-08-05T05:28:01.768+0000 I ACCESS   [conn308817] SASL SCRAM-SHA-1 authentication failed for Assist_Sharded_Database on Sharded_Database from client 10.0.0.0:7173 ; UserNotFound: User "Assist_Sharded_Database@Sharded_Database" not found
2019-08-05T05:28:01.770+0000 I NETWORK  [worker-3] end connection 10.0.0.0:7173 (79 connections now open)
2019-08-05T05:28:04.271+0000 I NETWORK  [worker-3] end connection 10.0.0.0:53047 (78 connections now open)
2019-08-05T05:28:04.272+0000 I NETWORK  [listener] connection accepted from 10.0.0.0:53083 #308818 (79 connections now open)
2019-08-05T05:28:04.283+0000 I NETWORK  [listener] connection accepted from 10.0.0.0:7105 #308819 (80 connections now open)
2019-08-05T05:28:04.287+0000 I ACCESS   [conn308819] SASL SCRAM-SHA-1 authentication failed for Assist_Refactored_Code_DB on Refactored_Code_DB from client 10.0.0.0:7105 ; UserNotFound: User "Assist_Refactored_Code_DB@Refactored_Code_DB" not found

In the above logs, there is an authentication error for Assist_Refactored_Code_DB (this database was not created by me). I'm not sure why this authentication is failing or in which Mongo URI the username and password should be specified. I'm also not sure whether this is one of the reasons for the stalling. These are the only errors that I could find in the mongos logs. The logs of the config server and the shard mongod instances don't contain any errors.
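
On the question of where the credentials go: in a Spring Boot project they are commonly part of the connection URI that the application passes to the driver. The following is a minimal sketch of an `application.yml`, under several assumptions: the application can resolve the name mongos-service, the user was created in the `admin` database, and `appUser` / `appPassword` are placeholder credentials that do not come from the actual project.

```yaml
# Sketch of a Spring Boot application.yml; all credentials are placeholders.
spring:
  data:
    mongodb:
      # authSource names the database in which the user was created.
      # If it is omitted, the driver authenticates against the database in the
      # URI path, which fails with UserNotFound when the user lives elsewhere.
      uri: "mongodb://appUser:appPassword@mongos-service:27017/Sharded_Database?authSource=admin"
```

The UserNotFound entries in the log are consistent with a client authenticating against a database in which that user does not exist, but the log alone does not show which application sent those requests.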

Logs of Shard1RepSet:

2019-08-06T10:48:08.926+0000 I NETWORK  [listener] connection accepted from 10.0.0.4:58010 #782186 (10 connections now open)
2019-08-06T10:48:11.585+0000 I NETWORK  [conn782183] end connection 10.0.0.0:64938 (9 connections now open)
2019-08-06T10:48:11.586+0000 I NETWORK  [listener] connection accepted from 10.0.0.7:64989 #782187 (10 connections now open)
2019-08-06T10:48:11.765+0000 I NETWORK  [conn782184] end connection 10.0.0.0:62126 (9 connections now open)
2019-08-06T10:48:11.766+0000 I NETWORK  [listener] connection accepted from 10.0.0.6:62302 #782188 (10 connections now open)
2019-08-06T10:48:13.763+0000 I NETWORK  [conn782185] end connection 10.0.0.0:52907 (9 connections now open)
2019-08-06T10:48:13.763+0000 I NETWORK  [listener] connection accepted from 10.0.0.1:52947 #782189 (10 connections now open)
2019-08-06T10:48:13.926+0000 I NETWORK  [conn782186] end connection 10.0.0.0:58010 (9 connections now open)
2019-08-06T10:48:13.927+0000 I NETWORK  [listener] connection accepted from 10.0.0.0:58051 #782190 (10 connections now open)
2019-08-06T10:48:16.586+0000 I NETWORK  [conn782187] end connection 10.0.0.0:64989 (9 connections now open)
2019-08-06T10:48:16.587+0000 I NETWORK  [listener] connection accepted from 10.0.0.0:65054 #782191 (10 connections now open)
2019-08-06T10:48:16.766+0000 I NETWORK  [conn782188] end connection 10.0.0.6:62302 (9 connections now open)
2019-08-06T10:48:16.767+0000 I NETWORK  [listener] connection accepted from 10.0.0.6:62445 #782192 (10 connections now open)
2019-08-06T10:48:18.765+0000 I NETWORK  [conn782189] end connection 10.0.2.1:52947 (9 connections now open)
2019-08-06T10:48:18.765+0000 I NETWORK  [listener] connection accepted from 10.0.2.1:52989 #782193 (10 connections now open)
2019-08-06T10:48:18.927+0000 I NETWORK  [conn782190] end connection 10.0.0.4:58051 (9 connections now open)
2019-08-06T10:48:18.929+0000 I NETWORK  [listener] connection accepted from 10.0.0.4:58100 #782194 (10 connections now open)
2019-08-06T10:48:21.588+0000 I NETWORK  [conn782191] end connection 10.0.0.7:65054 (9 connections now open)
2019-08-06T10:48:21.589+0000 I NETWORK  [listener] connection accepted from 10.0.0.7:65105 #782195 (10 connections now open)
2019-08-06T10:48:21.767+0000 I NETWORK  [conn782192] end connection 10.0.0.6:62445 (9 connections now open)
2019-08-06T10:48:21.768+0000 I NETWORK  [listener] connection accepted from 10.0.0.6:62581 #782196 (10 connections now open)
2019-08-06T10:48:23.766+0000 I NETWORK  [conn782193] end connection 10.0.2.1:52989 (9 connections now open)
2019-08-06T10:48:23.766+0000 I NETWORK  [listener] connection accepted from 10.0.2.1:53030 #782197 (10 connections now open)
2019-08-06T10:48:23.928+0000 I NETWORK  [conn782194] end connection 10.0.0.4:58100 (9 connections now open)
2019-08-06T10:48:23.930+0000 I NETWORK  [listener] connection accepted from 10.0.0.4:58145 #782198 (10 connections now open)
2019-08-06T10:48:26.589+0000 I NETWORK  [conn782195] end connection 10.0.0.7:65105 (9 connections now open)
2019-08-06T10:48:26.590+0000 I NETWORK  [listener] connection accepted from 10.0.0.7:65148 #782199 (10 connections now open)
2019-08-06T10:48:26.768+0000 I NETWORK  [conn782196] end connection 10.0.0.6:62581 (9 connections now open)
2019-08-06T10:48:26.770+0000 I NETWORK  [listener] connection accepted from 10.0.0.6:62746 #782200 (10 connections now open)
2019-08-06T10:48:28.766+0000 I NETWORK  [conn782197] end connection 10.0.2.1:53030 (9 connections now open)
2019-08-06T10:48:28.767+0000 I NETWORK  [listener] connection accepted from 10.0.2.1:53081 #782201 (10 connections now open)
2019-08-06T10:48:28.930+0000 I NETWORK  [conn782198] end connection 10.0.0.4:58145 (9 connections now open)
2019-08-06T10:48:28.931+0000 I NETWORK  [listener] connection accepted from 10.0.0.4:58217 #782202 (10 connections now open)
2019-08-06T10:48:31.590+0000 I NETWORK  [conn782199] end connection 10.0.0.7:65148 (9 connections now open)

Logs of ConfigDBRepSet:

2019-08-06T10:52:18.962+0000 I NETWORK  [conn781553] end connection 10.0.0.4:60257 (10 connections now open)
2019-08-06T10:52:18.963+0000 I NETWORK  [listener] connection accepted from 10.0.0.4:60306 #781557 (11 connections now open)
2019-08-06T10:52:21.296+0000 I NETWORK  [conn781554] end connection 10.0.0.7:50910 (10 connections now open)
2019-08-06T10:52:21.297+0000 I NETWORK  [listener] connection accepted from 10.0.0.7:50956 #781558 (11 connections now open)
2019-08-06T10:52:22.380+0000 I NETWORK  [conn781555] end connection 10.0.0.5:54999 (10 connections now open)
2019-08-06T10:52:22.381+0000 I NETWORK  [listener] connection accepted from 10.0.0.5:55043 #781559 (11 connections now open)
2019-08-06T10:52:22.554+0000 I NETWORK  [conn781556] end connection 10.0.3.1:57125 (10 connections now open)
2019-08-06T10:52:22.555+0000 I NETWORK  [listener] connection accepted from 10.0.3.1:57258 #781560 (11 connections now open)
2019-08-06T10:52:23.963+0000 I NETWORK  [conn781557] end connection 10.0.0.4:60306 (10 connections now open)
2019-08-06T10:52:23.964+0000 I NETWORK  [listener] connection accepted from 10.0.0.4:60341 #781561 (11 connections now open)
2019-08-06T10:52:26.298+0000 I NETWORK  [conn781558] end connection 10.0.0.7:50956 (10 connections now open)
2019-08-06T10:52:26.299+0000 I NETWORK  [listener] connection accepted from 10.0.0.7:50998 #781562 (11 connections now open)
2019-08-06T10:52:27.382+0000 I NETWORK  [conn781559] end connection 10.0.0.5:55043 (10 connections now open)
2019-08-06T10:52:27.383+0000 I NETWORK  [listener] connection accepted from 10.0.0.5:55086 #781563 (11 connections now open)
2019-08-06T10:52:27.555+0000 I NETWORK  [conn781560] end connection 10.0.3.1:57258 (10 connections now open)
2019-08-06T10:52:27.556+0000 I NETWORK  [listener] connection accepted from 10.0.3.1:57415 #781564 (11 connections now open)
2019-08-06T10:52:28.964+0000 I NETWORK  [conn781561] end connection 10.0.0.4:60341 (10 connections now open)
2019-08-06T10:52:28.965+0000 I NETWORK  [listener] connection accepted from 10.0.0.4:60406 #781565 (11 connections now open)
2019-08-06T10:52:31.299+0000 I NETWORK  [conn781562] end connection 10.0.0.7:50998 (10 connections now open)
2019-08-06T10:52:31.300+0000 I NETWORK  [listener] connection accepted from 10.0.0.7:51043 #781566 (11 connections now open)
2019-08-06T10:52:32.383+0000 I NETWORK  [conn781563] end connection 10.0.0.5:55086 (10 connections now open)
2019-08-06T10:52:32.384+0000 I NETWORK  [listener] connection accepted from 10.0.0.5:55136 #781567 (11 connections now open)
2019-08-06T10:52:32.556+0000 I NETWORK  [conn781564] end connection 10.0.3.1:57415 (10 connections now open)
2019-08-06T10:52:32.556+0000 I NETWORK  [listener] connection accepted from 10.0.3.1:57535 #781568 (11 connections now open)
2019-08-06T10:52:33.966+0000 I NETWORK  [conn781565] end connection 10.0.0.4:60406 (10 connections now open)
2019-08-06T10:52:33.967+0000 I NETWORK  [listener] connection accepted from 10.0.0.4:60461 #781569 (11 connections now open)

Output of sh.status() :

--- Sharding Status --- 
  sharding version: {
    "_id" : 1,
    "minCompatibleVersion" : 5,
    "currentVersion" : 6,
    "clusterId" : ObjectId("5d3a7c7d035b4525a7de5eaa")
  }
  shards:
        {  "_id" : "Shard1RepSet",  "host" : "Shard1RepSet/94.245.111.162:27017",  "state" : 1 }
        {  "_id" : "Shard2RepSet",  "host" : "Shard2RepSet/13.74.42.35:27017",  "state" : 1 }
  active mongoses:
        "4.0.10" : 1
  autosplit:
        Currently enabled: yes
  balancer:
        Currently enabled:  yes
        Currently running:  no
        Failed balancer rounds in last 5 attempts:  0
        Migration Results for the last 24 hours: 
                2 : Success
  databases:
#Databases sharding Information

I expect the sharded MongoDB cluster to never stall and to work like a standalone MongoDB instance.

Can anyone guide me in resolving this stalling issue with sharded MongoDB?

-- Prajwal M
azure
kubernetes
mongodb
sharding

1 Answer

8/5/2019

First of all, if you are using the mongo images provided by Docker Hub, env and command should not be specified together: command overrides the image's entrypoint, which in this case is responsible for creating the user and password, so the MONGO_INITDB_* variables will have no effect. See the Kubernetes documentation: "The command field corresponds to entrypoint in some container runtimes."
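
One way to apply this, sketched here for the shard container, is to pass the flags through `args` and leave `command` unset so that the image's entrypoint script still runs. This assumes the official Docker Hub `mongo` image; the tag `mongo:4.0` is an assumption based on the 4.0.10 version shown in sh.status(), since the question redacts the image name, and the credential values are placeholders.

```yaml
# Sketch only: container fragment for the shard Deployment.
containers:
  - name: mongo-shard
    image: mongo:4.0   # assumed tag; the question redacts the image name
    env:
      - name: MONGO_INITDB_ROOT_USERNAME
        value: "<username>"   # placeholder
      - name: MONGO_INITDB_ROOT_PASSWORD
        value: "<password>"   # placeholder
    # args replaces the image's CMD but keeps its ENTRYPOINT, so the
    # entrypoint script can still process the MONGO_INITDB_* variables.
    args:
      - "--storageEngine"
      - "wiredTiger"
      - "--port"
      - "27017"
      - "--bind_ip"
      - "0.0.0.0"
      - "--wiredTigerCacheSizeGB"
      - "2"
      - "--shardsvr"
      - "--replSet"
      - "Shard1RepSet"
    ports:
      - containerPort: 27017
```

Because the arguments start with a flag rather than a binary name, the official image's entrypoint is expected to prepend `mongod` itself.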

-- FL3SH
Source: StackOverflow