MongoDB¶

Intermediate · 40 min · Prereqs: nosql-concepts · databases, nosql, mongodb

Learning outcomes

Model data as BSON documents and perform CRUD operations with mongosh
Build aggregation pipelines and create compound, text, and geospatial indexes
Configure replica sets for availability and understand sharding fundamentals

MongoDB is the most widely adopted document database. Instead of rows and columns, you work with flexible JSON-like documents that map naturally to objects in your application code. No rigid schema, no required joins, no impedance mismatch between your data layer and your programming language.

This guide covers the document model, CRUD operations, the aggregation pipeline, indexing, replica sets, sharding, and the mongosh shell.

The Document Model¶

MongoDB stores data as BSON (Binary JSON) documents - a binary-encoded superset of JSON that adds data types JSON lacks. A document is a set of key-value pairs, analogous to a row in a relational table but far more flexible.

A Simple Document¶

{
  "_id": ObjectId("65a1b2c3d4e5f6a7b8c9d0e1"),
  "name": "Jane Chen",
  "email": "jane@example.com",
  "age": 34,
  "roles": ["admin", "developer"],
  "address": {
    "street": "742 Evergreen Terrace",
    "city": "Portland",
    "state": "OR",
    "zip": "97201"
  },
  "created_at": ISODate("2024-01-15T09:30:00Z"),
  "active": true
}

This single document combines an array, a nested sub-document, and typed fields (ObjectId, ISODate) that plain JSON cannot express.

BSON Types¶

Type	Example	Notes
String	`"hello"`	UTF-8 encoded
Int32	`NumberInt(42)`	32-bit signed integer
Int64	`NumberLong("9007199254740993")`	64-bit signed integer; pass large values as a string, because a numeric literal above 2^53 loses precision
Double	`3.14159`	64-bit IEEE 754 floating point
Decimal128	`NumberDecimal("19.99")`	128-bit decimal - use for currency
Boolean	`true` / `false`	Logical true or false
Date	`ISODate("2024-01-15T09:30:00Z")`	UTC datetime, millisecond precision
ObjectId	`ObjectId("65a1...")`	12-byte unique identifier
Array	`["a", "b", "c"]`	Ordered list of any BSON types
Embedded Document	`{ "key": "value" }`	Nested document
Binary	`BinData(0, "base64...")`	Arbitrary binary data
Null	`null`	Explicit null value

Decimal128 for money

Never use Double for financial data. Floating-point arithmetic produces rounding errors (0.1 + 0.2 = 0.30000000000000004). Decimal128 provides exact decimal representation - use it for prices, balances, and any value where precision matters.
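To see the rounding problem concretely, the sketch below adds two amounts as IEEE 754 doubles and then as integer cents in plain JavaScript. It only illustrates why binary floating point is a poor fit for money; the `toCents` helper is hypothetical, and in MongoDB you would store such values as Decimal128.

```javascript
// Doubles cannot represent most decimal fractions exactly.
const doubleSum = 0.1 + 0.2;
console.log(doubleSum);          // 0.30000000000000004
console.log(doubleSum === 0.3);  // false

// Hypothetical helper: convert a decimal string such as "19.99" to integer cents.
// Assumes a non-negative amount with at most two fractional digits.
function toCents(amount) {
  const [whole, fraction = ""] = amount.split(".");
  return BigInt(whole) * 100n + BigInt(fraction.padEnd(2, "0"));
}

// Exact integer arithmetic has no rounding error.
const totalCents = toCents("0.10") + toCents("0.20");
console.log(totalCents === 30n); // true
```
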

The _id Field¶

Every document must have an _id field, which acts as the primary key and must be unique within its collection. If you don't provide one, MongoDB generates an ObjectId automatically - a 12-byte value containing a timestamp (4 bytes), random value (5 bytes), and incrementing counter (3 bytes). ObjectIds are roughly time-ordered and, for practical purposes, unique without coordination between servers.
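The byte layout described above can be checked by hand. The following sketch splits an ObjectId's 24-character hex form into its three parts using plain JavaScript; `parseObjectId` is a hypothetical helper written for illustration, not part of any MongoDB API.

```javascript
// Split a 24-character ObjectId hex string into its three components.
function parseObjectId(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected 24 hex characters");
  }
  const seconds = parseInt(hex.slice(0, 8), 16); // 4-byte timestamp (Unix seconds)
  return {
    createdAt: new Date(seconds * 1000),
    random: hex.slice(8, 18),                    // 5-byte random value
    counter: parseInt(hex.slice(18, 24), 16)     // 3-byte incrementing counter
  };
}

const parts = parseObjectId("65a1b2c3d4e5f6a7b8c9d0e1");
console.log(parts.createdAt.toISOString()); // 2024-01-12T21:44:35.000Z
console.log(parts.random);                  // d4e5f6a7b8
console.log(parts.counter);                 // 13226209
```

Inside mongosh you would not need this helper: calling getTimestamp() on an ObjectId returns the embedded creation time.
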

A single BSON document cannot exceed 16 MB. If your data model approaches this limit, restructure it into separate documents or use GridFS for large files.

The mongosh Shell¶

mongosh is MongoDB's modern command-line shell - a JavaScript REPL with syntax highlighting, auto-completion, and full MongoDB API access.

Connecting¶

# Connect to localhost on default port 27017
mongosh

# Connect to a specific host and database
mongosh "mongodb://192.168.1.50:27017/myapp"

# Connect to a replica set
mongosh "mongodb://host1:27017,host2:27017,host3:27017/myapp?replicaSet=rs0"

# Connect with authentication
mongosh "mongodb://admin:password@localhost:27017/admin"

Essential Commands¶

show dbs              // List databases
use myapp             // Switch database (creates on first write)
show collections      // List collections in current database
db.users.stats()      // Collection statistics
db.users.help()       // Method help

Customizing with .mongoshrc.js¶

Create ~/.mongoshrc.js to define helper functions (or a custom prompt) that mongosh loads at startup:

// ~/.mongoshrc.js
const last = (coll, n = 5) => db[coll].find().sort({ _id: -1 }).limit(n);
const count = (coll, query = {}) => db[coll].countDocuments(query);

CRUD Operations¶

All operations target a single collection (the rough equivalent of a relational table).

Create: Inserting Documents¶

// Insert a single document
db.users.insertOne({ name: "Alice Rivera", email: "alice@example.com", age: 28, department: "Engineering" })

// Insert multiple documents
db.users.insertMany([
  { name: "Bob Park", email: "bob@example.com", age: 35, department: "Marketing" },
  { name: "Carol Okafor", email: "carol@example.com", age: 42, department: "Engineering" },
  { name: "Dave Singh", email: "dave@example.com", age: 31, department: "Sales" }
])

Both methods generate _id values automatically if omitted.

Read: Finding Documents¶

Pass a filter document to match records and an optional projection to control which fields come back.

db.users.find()                                             // all documents
db.users.findOne({ email: "alice@example.com" })            // single match
db.users.find({ department: "Engineering" })                // equality filter
db.users.find({ department: "Engineering" }, { name: 1, email: 1, _id: 0 })  // with projection

Query Operators¶

Query operators prefixed with $ handle comparisons beyond simple equality:

// Comparison
db.users.find({ age: { $gt: 30 } })                        // greater than
db.users.find({ age: { $gte: 28, $lte: 42 } })             // range

// Membership
db.users.find({ department: { $in: ["Engineering", "Sales"] } })

// Logical
db.users.find({ $and: [{ age: { $gt: 25 } }, { department: "Engineering" }] })
db.users.find({ $or: [{ department: "Engineering" }, { department: "Sales" }] })

// Pattern matching and existence
db.users.find({ name: { $regex: /^A/i } })                 // names starting with A
db.users.find({ phone: { $exists: true } })                 // has a phone field

Implicit $and

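When you specify multiple conditions in the same filter document, MongoDB treats them as an implicit $and. Writing { age: { $gt: 25 }, department: "Engineering" } is equivalent to using $and explicitly. Several operators on one field can also share a sub-document, as in { age: { $gte: 28, $lte: 42 } }. You need an explicit $and only when the same key would have to appear twice in the filter, for example two separate $or clauses.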
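The reason a repeated key needs special handling comes from JavaScript itself rather than from MongoDB: an object literal keeps only the last value for a duplicated key. The sketch below shows this with plain filter objects (no database involved); the field names are reused from the examples above.

```javascript
// Duplicate keys in a JavaScript object literal: the last one silently wins.
const clobbered = {
  age: { $gt: 25 },
  age: { $lt: 40 }   // overwrites the $gt condition above
};
console.log(JSON.stringify(clobbered)); // {"age":{"$lt":40}}

// Option 1: merge operators on the same field into one sub-document.
const merged = { age: { $gt: 25, $lt: 40 } };

// Option 2: explicit $and, needed when the repeated key is an operator such as $or.
const twoOrs = {
  $and: [
    { $or: [{ department: "Engineering" }, { department: "Sales" }] },
    { $or: [{ age: { $lt: 30 } }, { age: { $gt: 40 } }] }
  ]
};

console.log(Object.keys(merged.age)); // [ '$gt', '$lt' ]
console.log(twoOrs.$and.length);      // 2
```
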

Sorting, Limiting, and Skipping¶

db.users.find().sort({ age: -1, name: 1 })     // sort by age desc, name asc
db.users.find().limit(10)                       // limit results
db.users.find().skip(20).limit(10)              // pagination (page 3, 10 per page)
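The page-to-offset arithmetic behind the last line can be sketched as follows. `pageWindow` is a hypothetical helper, and an in-memory array stands in for a sorted query result so that the snippet runs without a server.

```javascript
// Hypothetical helper: translate a 1-based page number into skip/limit values.
function pageWindow(page, perPage) {
  if (!Number.isInteger(page) || page < 1) {
    throw new RangeError("page must be an integer >= 1");
  }
  return { skip: (page - 1) * perPage, limit: perPage };
}

// In-memory stand-in for a sorted result set of 35 documents.
const docs = Array.from({ length: 35 }, (_, i) => ({ n: i + 1 }));

const { skip, limit } = pageWindow(3, 10);
const page3 = docs.slice(skip, skip + limit); // same window as .skip(20).limit(10)

console.log(skip, limit);                           // 20 10
console.log(page3[0].n, page3[page3.length - 1].n); // 21 30
```

In mongosh the same values would be passed to skip() and limit(). Pair them with an explicit sort(), since without a defined order the contents of a page are not guaranteed to be stable between calls.
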

Update: Modifying Documents¶

Update operators modify specific fields without replacing the entire document:

// Set or change fields
db.users.updateOne({ email: "alice@example.com" }, { $set: { age: 29, title: "Senior Engineer" } })

// Remove a field entirely
db.users.updateOne({ email: "bob@example.com" }, { $unset: { phone: "" } })

// Increment a numeric field (orderId is assumed to hold the _id of an existing order)
db.orders.updateOne({ _id: orderId }, { $inc: { quantity: 1 } })

// Add to / remove from an array
db.users.updateOne({ email: "alice@example.com" }, { $push: { roles: "team-lead" } })
db.users.updateOne({ email: "alice@example.com" }, { $pull: { roles: "junior" } })

// Update multiple documents
db.users.updateMany({ department: "Engineering" }, { $set: { building: "HQ-3" } })

Don't forget the operator

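If you pass a plain document without update operators as the second argument, the outcome depends on the method. updateOne and updateMany reject it with an error, while replaceOne (and the deprecated update method) replaces the entire document except _id with that object, discarding every field you did not include. Unless you intend a full replacement, always use update operators such as $set, $inc, and $push.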

Delete: Removing Documents¶

db.users.deleteOne({ email: "dave@example.com" })    // Delete one match
db.sessions.deleteMany({ expired: true })             // Delete all matches
db.temp_data.deleteMany({})                           // Delete everything (careful!)

The Aggregation Pipeline¶

The aggregation pipeline processes documents through a sequence of stages, each transforming data before passing it to the next - like a Unix pipeline for your database.

db.collection.aggregate([
  { $stage1: { ... } },
  { $stage2: { ... } },
  { $stage3: { ... } }
])

Pipeline Stages¶

$match - Filter Documents¶

Works like find but as a pipeline stage. Place $match as early as possible to reduce documents flowing through later stages.

{ $match: { status: "active", age: { $gte: 18 } } }

$group - Aggregate Values¶

Groups documents by a key and applies accumulator operators ($sum, $avg, $min, $max, $first, $last, $push, $addToSet):

{ $group: { _id: "$department", avgAge: { $avg: "$age" }, total: { $sum: 1 } } }

$sort, $project, $limit, $skip¶

{ $sort: { total: -1 } }          // order results
{ $project: { name: 1, email: 1, fullName: { $concat: ["$first", " ", "$last"] } } }  // reshape
{ $limit: 10 }                     // cap output
{ $skip: 20 }                      // offset

$lookup - Join Collections¶

Performs a left outer join against another collection:

{ $lookup: { from: "orders", localField: "_id", foreignField: "customer_id", as: "customer_orders" } }

$unwind - Flatten Arrays¶

Deconstructs an array field, outputting one document per element:

{ $unwind: "$customer_orders" }
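To make the join and flatten semantics concrete, here is an in-memory approximation in plain JavaScript. The two arrays are made-up sample data standing in for the customers and orders collections; this mimics the behavior of the two stages and is not how the server implements them.

```javascript
// Made-up sample data standing in for two collections.
const customers = [
  { _id: 1, name: "Alice" },
  { _id: 2, name: "Bob" }
];
const orders = [
  { _id: 101, customer_id: 1, amount: 40 },
  { _id: 102, customer_id: 1, amount: 15 }
];

// $lookup: attach every matching order as an array (left outer join).
const joined = customers.map((c) => ({
  ...c,
  customer_orders: orders.filter((o) => o.customer_id === c._id)
}));

// $unwind: emit one output document per array element.
// By default, documents whose array is empty are dropped.
const unwound = joined.flatMap((c) =>
  c.customer_orders.map((o) => ({ ...c, customer_orders: o }))
);

console.log(joined[1].customer_orders.length); // 0  (Bob is kept by the join)
console.log(unwound.length);                   // 2  (Bob disappears after $unwind)
```

If customers without orders must survive the flatten step, $unwind accepts a document form with preserveNullAndEmptyArrays: true.
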

Practical Example: Revenue by Category¶

db.orders.aggregate([
  { $match: { status: "completed" } },
  { $group: { _id: "$category", totalRevenue: { $sum: "$amount" }, orderCount: { $sum: 1 } } },
  { $sort: { totalRevenue: -1 } },
  { $limit: 5 },
  { $project: { category: "$_id", totalRevenue: { $round: ["$totalRevenue", 2] }, orderCount: 1, _id: 0 } }
])
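It can help to trace what each stage of this pipeline does to the data. The sketch below replays the same stages over a small in-memory array in plain JavaScript; the sample orders are invented, amounts are integers so the $round step is unnecessary, and the code is only a model of the stage semantics.

```javascript
// Made-up orders; amounts are integers to keep the arithmetic exact.
const sampleOrders = [
  { category: "books", amount: 20, status: "completed" },
  { category: "games", amount: 60, status: "completed" },
  { category: "books", amount: 15, status: "completed" },
  { category: "games", amount: 30, status: "cancelled" },
  { category: "music", amount: 10, status: "completed" }
];

// $match: keep completed orders only.
const matched = sampleOrders.filter((o) => o.status === "completed");

// $group: one bucket per category, with two $sum accumulators.
const groups = new Map();
for (const o of matched) {
  const g = groups.get(o.category) ?? { _id: o.category, totalRevenue: 0, orderCount: 0 };
  g.totalRevenue += o.amount;
  g.orderCount += 1;
  groups.set(o.category, g);
}

// $sort (descending), $limit, then $project to rename _id to category.
const report = [...groups.values()]
  .sort((a, b) => b.totalRevenue - a.totalRevenue)
  .slice(0, 5)
  .map(({ _id, totalRevenue, orderCount }) => ({ category: _id, totalRevenue, orderCount }));

console.log(report);
// games: 60 from 1 order, books: 35 from 2 orders, music: 10 from 1 order
```
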

Indexes¶

Without indexes, MongoDB performs a collection scan - reading every document. Indexes are B-tree structures that let MongoDB locate documents without examining the entire collection.

Single Field Indexes¶

db.users.createIndex({ email: 1 })          // ascending
db.users.createIndex({ created_at: -1 })    // descending (a single-field index can be read in either direction; order matters in compound indexes)

Compound Indexes¶

A compound index covers multiple fields. Field order matters - MongoDB uses the index for prefix queries (left to right) but not for queries that skip leading fields.

// Supports queries on: department alone, department+age, department+age+name
// Does NOT efficiently support queries on age alone or name alone
db.users.createIndex({ department: 1, age: -1, name: 1 })
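The prefix rule can be expressed as a small function. This is a deliberately simplified model that assumes equality conditions only: `usablePrefix` is a hypothetical helper, and the real query planner also weighs range conditions, sort order, and competing indexes.

```javascript
// Simplified model of the prefix rule: return the leading index fields
// that a query with equality conditions on `queryFields` can use.
function usablePrefix(indexFields, queryFields) {
  const wanted = new Set(queryFields);
  const prefix = [];
  for (const field of indexFields) {
    if (!wanted.has(field)) break; // stop at the first gap
    prefix.push(field);
  }
  return prefix;
}

const index = ["department", "age", "name"];
console.log(usablePrefix(index, ["department"]));         // [ 'department' ]
console.log(usablePrefix(index, ["age", "department"]));  // [ 'department', 'age' ]
console.log(usablePrefix(index, ["department", "name"])); // [ 'department' ]
console.log(usablePrefix(index, ["age"]));                // []
```

The order of fields in the query does not matter; only the order in the index definition does.
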

Unique Indexes¶

db.users.createIndex({ email: 1 }, { unique: true })  // duplicate values throw an error

Text Indexes¶

Text indexes support full-text search across string content. A collection can have at most one text index, but it can cover multiple fields:

db.articles.createIndex({ title: "text", body: "text" })
db.articles.find(
  { $text: { $search: "mongodb aggregation" } },
  { score: { $meta: "textScore" } }
).sort({ score: { $meta: "textScore" } })

Geospatial Indexes¶

2dsphere indexes support queries on GeoJSON data - finding documents near a point, within a polygon, or intersecting a geometry:

// Create a 2dsphere index
db.restaurants.createIndex({ location: "2dsphere" })

// Find restaurants within 2km of a point
db.restaurants.find({
  location: {
    $near: {
      $geometry: { type: "Point", coordinates: [-122.6750, 45.5120] },
      $maxDistance: 2000  // meters
    }
  }
})

TTL Indexes¶

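TTL (Time to Live) indexes automatically delete documents once a date field is older than a specified duration - ideal for sessions, logs, or temporary records. A background task removes expired documents roughly once a minute, so deletion is not instantaneous: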

db.sessions.createIndex({ createdAt: 1 }, { expireAfterSeconds: 86400 })  // expire after 24h

Query Planning with explain()¶

Use explain() to understand query execution. Key fields in the output:

Field	Meaning
`winningPlan.stage`	`IXSCAN` (index used) vs `COLLSCAN` (full scan)
`totalKeysExamined`	Number of index entries scanned
`totalDocsExamined`	Number of documents loaded
`nReturned`	Number of documents returned
`executionTimeMillis`	Total execution time

db.users.find({ department: "Engineering", age: { $gt: 30 } }).explain("executionStats")

The goal: get totalDocsExamined as close to nReturned as possible.
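One way to apply this rule of thumb is to compute the ratio directly. The helper below is a hypothetical sketch that reads the three counters named in the table from an executionStats-shaped object; the two sample objects are hand-written numbers, not real server output.

```javascript
// Hypothetical helper: summarize the executionStats section of explain() output.
function scanEfficiency(executionStats) {
  const { nReturned, totalKeysExamined, totalDocsExamined } = executionStats;
  // Avoid dividing by zero when the query returns nothing.
  return {
    docsPerResult: nReturned === 0 ? totalDocsExamined : totalDocsExamined / nReturned,
    keysPerResult: nReturned === 0 ? totalKeysExamined : totalKeysExamined / nReturned
  };
}

// Hand-written numbers for illustration only.
const collectionScan = { nReturned: 12, totalKeysExamined: 0, totalDocsExamined: 48000 };
const indexScan = { nReturned: 12, totalKeysExamined: 12, totalDocsExamined: 12 };

console.log(scanEfficiency(collectionScan).docsPerResult); // 4000
console.log(scanEfficiency(indexScan).docsPerResult);      // 1
```

A ratio near 1 suggests the index is selective for the query; a ratio in the thousands suggests a missing or poorly ordered index.
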

Replica Sets¶

A replica set is a group of MongoDB instances maintaining the same data for redundancy and high availability. Always run MongoDB as a replica set in production - it gives you automatic failover and read scaling.

Architecture¶

A replica set consists of:

Primary: Receives all write operations. A healthy replica set has exactly one primary; during an election there is briefly none.
Secondary: Replicates data from the primary. Can serve read operations if configured. You typically have two or more secondaries.
Arbiter: Participates in elections but holds no data. Used when you need an odd number of voting members but don't want to store a third full copy of the data.

graph TB
    C[Application] --> P[Primary]
    P -->|Replication| S1[Secondary 1]
    P -->|Replication| S2[Secondary 2]
    P -.->|Heartbeat| A[Arbiter]
    S1 -.->|Heartbeat| P
    S2 -.->|Heartbeat| P
    A -.->|Heartbeat| P
    S1 -.->|Heartbeat| S2
    S1 -.->|Heartbeat| A
    S2 -.->|Heartbeat| A

Elections¶

If the primary becomes unreachable, remaining members hold an election. A majority of voting members must agree - a 3-member set tolerates 1 failure, a 5-member set tolerates 2. A 2-member set cannot elect a new primary after losing a member (the survivor is not a majority), which is why you always need at least 3 voting members. With default settings, failover (detecting the failure and electing a new primary) typically completes within about 12 seconds.
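The majority arithmetic is easy to verify. This short sketch computes the votes required and the number of members that can fail for a few set sizes; the function names are illustrative only.

```javascript
// Votes needed to elect a primary: strictly more than half of the voting members.
function majority(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// How many voting members may fail while a majority can still be formed.
function faultTolerance(votingMembers) {
  return votingMembers - majority(votingMembers);
}

for (const n of [2, 3, 4, 5]) {
  console.log(n, majority(n), faultTolerance(n));
}
// 2 2 0
// 3 2 1
// 4 3 1
// 5 3 2
```

Note that a fourth member adds no fault tolerance over three, which is why replica sets are normally sized with an odd number of voting members.
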

Read Preferences¶

Read preference controls which members receive read operations:

Mode	Behavior	Use Case
`primary`	All reads go to the primary	Default. Guaranteed latest data
`primaryPreferred`	Reads from primary; falls back to secondary if primary unavailable	Availability over consistency
`secondary`	Reads go to secondaries only	Offload analytics queries from primary
`secondaryPreferred`	Reads from secondary; falls back to primary	Analytics with fallback
`nearest`	Reads from the member with lowest network latency	Geographically distributed deployments

// In a connection string
"mongodb://host1,host2,host3/myapp?readPreference=secondaryPreferred"

Stale reads from secondaries

Secondaries replicate asynchronously, so they may lag behind the primary. Reading from a secondary can return data that's a few seconds (or, under heavy load, longer) behind. If your application requires reading the data it just wrote, use primary read preference for those queries.

Write Concern¶

Write concern controls acknowledgment requirements before MongoDB confirms a write:

Write Concern	Behavior
`w: 1`	Acknowledged by the primary only (the default before MongoDB 5.0)
`w: "majority"`	Acknowledged by a majority of data-bearing voting members (the default since MongoDB 5.0 in most deployments)
`w: 0`	No acknowledgment - fire and forget

db.orders.insertOne({ item: "widget", qty: 5 }, { writeConcern: { w: "majority", wtimeout: 5000 } })

Use w: "majority" for critical data. Use w: 1 for high-throughput, loss-tolerant workloads.

Sharding¶

Sharding distributes data across multiple machines. When a single replica set can't handle your data volume or throughput, you split data across shards - each shard being its own replica set. Config servers store metadata about data placement, and mongos routers route queries to the correct shard(s). Your application connects to mongos, not directly to shards.

Shard Key Selection¶

The shard key determines how documents are distributed. A good shard key has high cardinality (many distinct values), even distribution (no hotspots), and supports query targeting (queries including the shard key go to one shard instead of all).

Hashed vs Ranged Sharding¶

Strategy	How it works	Pros	Cons
Ranged	Documents with nearby key values go to the same shard	Efficient range queries	Hotspots if writes cluster at one end
Hashed	A hash of the key determines the shard	Even write distribution	Range queries hit all shards

sh.enableSharding("myapp")
sh.shardCollection("myapp.orders", { customer_id: 1 })   // ranged
sh.shardCollection("myapp.events", { _id: "hashed" })    // hashed
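The hotspot trade-off in the table can be simulated in a few lines. The sketch below assigns 100 monotonically increasing keys to three shards, first by range and then by hash. The split points are hypothetical, and the multiplicative hash is a toy stand-in chosen for brevity; it is not the hash function MongoDB uses.

```javascript
const SHARDS = 3;

// Ranged: hypothetical split points. Shard 0 owns keys < 100,
// shard 1 owns keys < 200, and shard 2 owns everything above.
const splitPoints = [100, 200];
function rangedShard(key) {
  const i = splitPoints.findIndex((upper) => key < upper);
  return i === -1 ? SHARDS - 1 : i;
}

// Hashed: toy multiplicative hash (NOT MongoDB's actual hash function).
function hashedShard(key) {
  return ((key * 2654435761) % 2 ** 32) % SHARDS;
}

// 100 new documents with increasing keys, such as counters or timestamps.
const newKeys = Array.from({ length: 100 }, (_, i) => 201 + i);

function distribution(assign) {
  const counts = new Array(SHARDS).fill(0);
  for (const k of newKeys) counts[assign(k)] += 1;
  return counts;
}

console.log(distribution(rangedShard)); // [ 0, 0, 100 ]  every insert hits the last shard
console.log(distribution(hashedShard)); // writes are spread across all three shards
```

The flip side is visible in the same model: consecutive keys land on different shards under hashing, so a range query over them has to consult every shard.
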

Chunks and Balancing¶

MongoDB divides the shard key range into chunks (128 MB by default in current releases; 64 MB before MongoDB 6.0). The balancer migrates chunks between shards in the background to keep distribution even.