We all choose one or the other database to easily store, query and fetch the data. The point is, what type of database would best support the project at hand. The question goes deeper than “SQL vs. NoSQL” . For example, if I need a flexible schema with recursive graph queries, I go with MongoDB, which is a NoSQL database with JSON documents which represent field and value pairs.

{ "name": "Angelina", "place": "Bangalore", // field: value }

Since the table structure is not fixed, queries are themselves JSON, and thus easily composable.

What is Database Sharding?

Consider a scenario where I have standalone mongoDB running on one machine where I have 2 million user data items. Now, my business is reaching that break-point and will likely surpass 2.5 million users soon. So, I decided to break the database up into two:

The system capacity is now doubled!

What is MongoDB Sharding?

MongoDB sharding is used for deployment consists large sets of data where machines are connected with a few simple rules. This result in high throughput operations.

Components:

  1. Shards: It contains a subset of sharded data for a sharded cluster.
  2. Mongos (query router): Provides an interface between client applications and the sharded cluster.
  3. Config-servers: Config servers store metadata and configuration settings for the cluster.

Sharding is done based on the Shard Key. It is a field or a set of fields that you select from a document of a targeted collection.

Types of Sharding:

  1. Ranged Sharding: Dividing data into contiguous ranges determined by the shard key values.
  2. Hashed Sharding: Uses a hashed index to partition data across your shared cluster.

Demo on how to enable hashed sharding for your database with shard key as “_id”.

Deploy a Shard cluster

1. Config Server replica set

For production deployment, deploy a config server with at least three members. Below is mongoDB configuration for config servers.

sharding: clusterRole: configsvr replication: replSetName: <replica_set_name>

Deploying config server with 2 replica members ie. 1 primary and 2 secondary replica members.

config-server replica set name: "test-config-server" ... mongod --configsvr --replSet testing-config-server --dbpath /data --bind_ip localhost,<hostname(s)|ip address(es)>

2. Shard Replica set

Below is mongod setting for each shard cluster

sharding: clusterRole: shardsvr replication: replSetName: <replica_set_name>

Create three mongoDB shard clusters with primary, secondary and one arbiter members for each shard or you can add more replica members if needed. Give a specific name for each shard.

mongod --shardsvr --replSet testing-shard-1 --dbpath <path> --bind_ip localhost,<hostname(s)|ip address(es)> mongod --shardsvr --replSet testing-shard-2 --dbpath <path> --bind_ip localhost,<hostname(s)|ip address(es)> mongod --shardsvr --replSet testing-shard-3 --dbpath <path> --bind_ip localhost,<hostname(s)|ip address(es)>

3. Mongos for the shard cluster

While running mongos add the below setting in configuration file i.e config server replica set name and at least one member of the replica set.

sharding: configDB: <configReplSetName>/<cfg-server1-ip:<port>,<cfg-server2-ip:<port>,<cfg-server3-ip:<port>.

Running mongos:

mongos --configdb test-config-server/cfg1.example.net:27019,cfg2.example.net:27019,cfg3.example.net:27019 --bind_ip localhost,<hostname(s)|ip address(es)>

4. Connect to the Shard Cluster

mongo --host <hostname> --port <port>

4.1 Add Shards to the Cluster

sh.addShard("test-shard-1/<hostname(s)|ip address(es)") sh.addShard("test-shard-2/<hostname(s)|ip address(es)") sh.addShard("test-shard-3/<hostname(s)|ip address(es)") ... sh.addShard("<shard-names>/<hostname(s)|ip address(es)")

4.2 Create database

Creating users database

use users ... use <db name>

4.3 Create collection

Creating collection with name indianUsers

db.createCollection("indianUsers") ... db.createCollection("<collectionName>")

4.4 Check indexes for collection

By default the sharding index will be {_id: “1”} i.e Range based sharding.

db.indianUsers.getIndexes() ... db.<collectionName>.getIndexes()

Output:

[ { "key": { "_id": 1 }, "name": "_id_", "ns": "users.indianUsers", "v": 2 } ]

4.5 Create hashed index

We are going with hash based sharding i.e create an index with {_id: “hashed”} value.

db.indianUsers.ensureIndex({_id: "hashed"}) ... db.<collectionName>.ensureIndex({_id: "hashed"}) ... # Check indexes again db.indianUsers.getIndexes()

Output:

[ { "key": { "_id": 1 }, "name": "_id_", "ns": "users.indianUsers", "v": 2 }, { "key": { "_id": "hashed" }, "name": "_id_hashed", "ns": "users.indianUsers", "v": 2 } ]

4.6 Enable sharding for database

sh.enableSharding('users') ... sh.enableSharding('<db name>')

4.7 Enable hash based sharding for collection

sh.shardCollection("<db name>.<collectionName>",{_id: "hashed"}) sh.shardCollection("users.indianUsers",{_id: "hashed"})

4.8 Check collection distribution

db.indianUsers.getShardDistribution() ... db.<collectionName>.getShardDistribution()

Output:

Shard testing-shard-1-v2 at testing-shard-1-v2/<replica-server-ip:port> data : 0B docs : 0 chunks : 2 estimated data per chunk : 0B estimated docs per chunk : 0 Shard testing-shard-3-v2 at testing-shard-3-v2/<replica-server-ip:port> data : 0B docs : 0 chunks : 2 estimated data per chunk : 0B estimated docs per chunk : 0 Shard testing-shard-2-v2 at testing-shard-2-v2/<replica-server-ip:port> data : 0B docs : 0 chunks : 2 estimated data per chunk : 0B estimated docs per chunk : 0 Totals data : 0B docs : 0 chunks : 6 Shard testing-shard-1-v2 contains 0% data, 0% docs in cluster, avg obj size on shard : 0B Shard testing-shard-3-v2 contains 0% data, 0% docs in cluster, avg obj size on shard : 0B Shard testing-shard-2-v2 contains 0% data, 0% docs in cluster, avg obj size on shard : 0B

4.9 Insert Data into the collection

Using simple for loop to insert some data into the collection.

for (var i = 1; i <= 1000; i++) { db.indianUsers.insert( { 'first': 'helloWorld', 'rollno': i } ) }

Output:

WriteResult({ "nInserted" : 1 }) # This inserts 1000 documents into the collection.# Check collection count db.indianUsers.count() ... db.<collection name>.count()

Output: 1000

Check the shard distribution again. the data will be divided across 3 different shards. Mongos automatically rebalance the data across all shards.

db.indianUsers.getShardDistribution() ... db.<collectionName>.getShardDistribution()Shard testing-shard-2-v2 at testing-shard-2-v2/<replica-server-ip:port> data : 17KiB docs : 324 chunks : 2 estimated data per chunk : 8KiB estimated docs per chunk : 162 Shard testing-shard-1-v2 at testing-shard-1-v2/<replica-server-ip:port> data : 17KiB docs : 332 chunks : 2 estimated data per chunk : 8KiB estimated docs per chunk : 166 Shard testing-shard-3-v2 at testing-shard-3-v2/<replica-server-ip:port> data : 18KiB docs : 344 chunks : 2 estimated data per chunk : 9KiB estimated docs per chunk : 172 Totals data : 53KiB docs : 1000 chunks : 6 Shard testing-shard-2-v2 contains 32.4% data, 32.4% docs in cluster, avg obj size on shard : 55B Shard testing-shard-1-v2 contains 33.2% data, 33.2% docs in cluster, avg obj size on shard : 55B Shard testing-shard-3-v2 contains 34.39% data, 34.39% docs in cluster, avg obj size on shard : 55B

Conclusion

MongoDB makes it more manageable when it comes for sharding, scaling than other popular databases.

I hope this article was able to shed some light on what sharding is in MongoDB.

Originally published at https://www.coffeebeans.io.

CoffeeBeans helps small, medium and large businesses unlock the true potential of technology and AI to solve some of their most pressing challenges