
NoSQL Concepts & Architecture

Relational databases dominated for decades, and for good reason - they provide strong guarantees, a mature query language, and well-understood operational patterns. But the web changed what "normal" looks like for data. Social networks generating billions of events per day, e-commerce catalogs with wildly different product attributes, real-time gaming leaderboards, IoT sensor streams - these workloads pushed relational systems to their limits and created demand for alternatives.

NoSQL ("Not Only SQL") is not a single technology. It is a family of database systems that relax one or more relational constraints - fixed schemas, JOIN-heavy queries, single-node scaling, ACID transactions - in exchange for benefits like horizontal scalability, schema flexibility, or specialized data models. The term emerged around 2009, but the ideas go back further: key-value stores like BerkeleyDB existed in the 1990s.

The core motivations behind NoSQL adoption:

  • Horizontal scaling - distribute data across commodity servers instead of buying increasingly expensive hardware
  • Schema flexibility - store documents with varying structures without ALTER TABLE migrations
  • Developer productivity - work with data models that map directly to application objects
  • Specialized query patterns - graph traversals, time-series aggregations, and full-text search that relational systems handle poorly
  • High availability - stay operational even when individual nodes fail

NoSQL does not mean "no SQL" or "better than SQL"

NoSQL databases are not replacements for relational systems. They are alternatives optimized for specific access patterns. Most production applications use both relational and NoSQL databases together. Choosing the wrong database type for your workload causes more pain than the problem you were trying to solve.


NoSQL Database Categories

NoSQL databases fall into four major categories, each built around a different data model and optimized for different access patterns.

flowchart TD
    A[NoSQL Databases] --> B[Document Stores]
    A --> C[Key-Value Stores]
    A --> D[Wide-Column Stores]
    A --> E[Graph Databases]
    B --> B1[MongoDB]
    B --> B2[CouchDB]
    B --> B3[Amazon DocumentDB]
    C --> C1[Redis]
    C --> C2[Memcached]
    C --> C3[Amazon DynamoDB]
    D --> D1[Apache Cassandra]
    D --> D2[Apache HBase]
    D --> D3[ScyllaDB]
    E --> E1[Neo4j]
    E --> E2[Amazon Neptune]
    E --> E3[ArangoDB]

Document Stores

A document store organizes data as self-contained documents - typically JSON, BSON (binary JSON), or XML. Each document holds all the data for a single entity, including nested objects and arrays. Documents are grouped into collections (analogous to tables), but unlike rows in a relational table, documents in the same collection can have completely different fields.

The Document Model

Consider a product catalog. In a relational database, you might need a products table, a product_attributes table, a product_images table, and JOINs to reassemble them. In a document store, each product is a single document:

{
  "_id": "prod_8842",
  "name": "Mechanical Keyboard",
  "brand": "KeyCraft",
  "price": 149.99,
  "specs": {
    "switch_type": "Cherry MX Brown",
    "layout": "TKL",
    "backlight": "RGB",
    "connectivity": ["USB-C", "Bluetooth 5.0"]
  },
  "images": [
    {"url": "/img/kb-front.jpg", "alt": "Front view"},
    {"url": "/img/kb-side.jpg", "alt": "Side profile"}
  ],
  "reviews_count": 342,
  "avg_rating": 4.6
}

This document embeds everything the application needs in a single read. No JOINs, no multi-table lookups. The specs object has fields specific to keyboards - a document for headphones in the same collection would have entirely different spec fields, and that is perfectly valid.

Strengths and Trade-offs

| Strength | Trade-off |
|---|---|
| Schema flexibility - add fields without migrations | No enforced schema means the application must handle inconsistencies |
| Nested data reduces JOINs | Deep nesting creates update complexity (changing a nested value requires rewriting the document) |
| Developer-friendly JSON mapping | Cross-document queries (equivalent to JOINs) are expensive or unsupported |
| Horizontal scaling via sharding | Denormalized data means updates may need to touch multiple documents |

Major Document Stores

MongoDB is the most widely used document database. It stores BSON documents, supports rich queries with an aggregation pipeline, and provides replica sets for high availability and sharding for horizontal scaling. You will work with MongoDB in depth in the next guide.

CouchDB uses a RESTful HTTP API and stores plain JSON. It pioneered multi-master replication with eventual consistency - useful for offline-first applications that sync when connectivity returns.


Key-Value Stores

A key-value store is the simplest NoSQL model. You store a value (a string, a number, a serialized object, a blob) under a unique key, and you retrieve it by that key. The database treats the value as opaque - it does not parse or index the contents. This simplicity enables extreme performance.

The Data Model

Key                          Value
─────────────────────────    ────────────────────────────────
session:abc123               {"user_id": 7042, "role": "admin", "expires": 1735689600}
cache:product:8842           {"name": "Mechanical Keyboard", "price": 149.99, ...}
feature:dark-mode            "enabled"
rate-limit:192.168.1.50      "47"

Every operation is a GET(key), SET(key, value), or DELETE(key). Some key-value stores add atomic counters, expiration (TTL), and basic data structures, but the core model stays the same.
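A toy in-memory store makes the model concrete - this is an illustrative sketch (not any real client library) of GET/SET/DELETE plus the lazy TTL expiration that caches like Memcached use:

```python
import time

class KVStore:
    """Minimal in-memory key-value store sketch with optional TTL expiry."""
    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        expires_at = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]  # lazy expiration: purge on read
            return None
        return value

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.set("session:abc123", {"user_id": 7042, "role": "admin"})
store.set("feature:dark-mode", "enabled", ttl=60)
print(store.get("session:abc123")["role"])  # admin
```

Note that the store never inspects the value - it only hashes the key, which is what makes the model so fast.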

Use Cases

  • Caching - store expensive query results or rendered page fragments with a TTL
  • Session management - fast read/write for user session data
  • Feature flags - simple key lookups to toggle application behavior
  • Rate limiting - atomic increment counters per IP or API key
  • Leaderboards - sorted sets (in stores like Redis that support them)

Major Key-Value Stores

Redis goes well beyond simple key-value. It supports strings, hashes, lists, sets, sorted sets, streams, and more - all held in memory for sub-millisecond latency. Redis is covered in its own guide later in this course.

Memcached is a pure in-memory cache with a simpler feature set than Redis. It excels at distributed caching for web applications and has been a staple since the mid-2000s.

Amazon DynamoDB is a managed key-value and document database. Despite supporting richer queries than a pure key-value store, its core access pattern is key-based lookup, and it scales horizontally with minimal operational overhead.

Key design matters

In key-value stores, the key is your only query mechanism. Design keys with namespaces and hierarchies (user:7042:preferences, cache:v2:product:8842) so you can reason about your data and implement expiration policies per namespace.
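A tiny sketch of that idea - namespaced key construction plus a per-namespace policy lookup (the namespace names and TTL values here are invented for illustration):

```python
def make_key(*parts):
    """Build a namespaced key like 'cache:v2:product:8842'."""
    return ":".join(str(p) for p in parts)

# Per-namespace expiration policy, keyed on the first key segment.
TTL_BY_NAMESPACE = {"cache": 300, "session": 3600, "rate-limit": 60}

def ttl_for(key):
    """Return the TTL (seconds) for a key's namespace, or None."""
    return TTL_BY_NAMESPACE.get(key.split(":", 1)[0])

print(make_key("user", 7042, "preferences"))  # user:7042:preferences
print(ttl_for("session:abc123"))              # 3600
```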


Wide-Column Stores

A wide-column store (sometimes called a column-family store) organizes data into rows and columns, but unlike a relational table, each row can have a different set of columns. Columns are grouped into column families, and the database is optimized for reading and writing entire column families efficiently.

The Data Model

Think of a wide-column store as a two-dimensional key-value store: a row key maps to a set of column-family entries, and each column family contains a dynamic set of columns.

| Row Key | profile (column family) | activity (column family) |
|---|---|---|
| user:1001 | name: "Alice", email: "alice@example.com" | last_login: "2025-03-15", page_views: 4821 |
| user:1002 | name: "Bob", org: "Acme Corp" | last_login: "2025-03-14" |
| user:1003 | name: "Charlie", email: "charlie@dev.io", phone: "+1-555-0199" | last_login: "2025-03-15", page_views: 127, api_calls: 9483 |

Notice that user:1001 has no phone column, user:1002 has an org column the others lack, and user:1003 has api_calls that the others do not. This sparsity is the defining characteristic - you do not waste storage on NULL columns, and adding a new column does not require a schema migration.
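The sparse two-dimensional model can be sketched with nested dictionaries - row key, then column family, then columns. This is purely illustrative (not a real wide-column client), but it shows how absent columns cost nothing:

```python
from collections import defaultdict

# row_key -> column_family -> {column: value}; rows simply omit
# columns they don't have, so no NULLs are ever stored.
table = defaultdict(lambda: defaultdict(dict))

table["user:1001"]["profile"].update(name="Alice", email="alice@example.com")
table["user:1001"]["activity"].update(last_login="2025-03-15", page_views=4821)
table["user:1002"]["profile"].update(name="Bob", org="Acme Corp")

# Reading one column family touches only that family's columns:
print(table["user:1002"]["profile"])             # {'name': 'Bob', 'org': 'Acme Corp'}
print("phone" in table["user:1001"]["profile"])  # False - sparse, not NULL
```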

Use Cases

  • Time-series data - sensor readings, application metrics, event logs (row key = sensor ID + time bucket)
  • Analytics - aggregation queries over large datasets where you read specific column families
  • Content management - storing metadata alongside content with varying attributes
  • IoT data ingestion - high write throughput for millions of devices

Major Wide-Column Stores

Apache Cassandra is designed for massive write throughput and high availability across multiple data centers. It uses a peer-to-peer architecture with no single point of failure, tunable consistency per query, and a SQL-like query language called CQL.

Apache HBase runs on top of Hadoop's HDFS and provides strong consistency with a single-master architecture. It is commonly used for random read/write access to large datasets in the Hadoop ecosystem.

ScyllaDB is a Cassandra-compatible database rewritten in C++ for lower latency. It provides the same CQL interface with significantly better performance per node.


Graph Databases

A graph database stores data as nodes (entities), edges (relationships), and properties (attributes on both nodes and edges). The relationships are first-class citizens - stored directly as pointers between nodes rather than computed via JOINs at query time. This makes traversing connections fast regardless of dataset size.

The Data Model

In a social network:

  • Nodes: User("Alice"), User("Bob"), Post("Graph databases are underrated")
  • Edges: Alice -[:FOLLOWS]-> Bob, Alice -[:AUTHORED]-> Post, Bob -[:LIKED]-> Post
  • Properties: FOLLOWS {since: "2024-01-15"}, User {name: "Alice", joined: "2023-06-01"}

A query like "find all users who follow someone who liked a post authored by Alice" requires three JOINs in a relational database but is a direct graph traversal:

(Alice)-[:AUTHORED]->(post)<-[:LIKED]-(liker)<-[:FOLLOWS]-(follower)

This traversal stays fast because following an edge is a pointer lookup, not a table scan. In relational databases, each additional JOIN multiplies the cost. In graph databases, each additional hop is roughly constant time.
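A minimal adjacency-set sketch of that traversal in Python (the users and posts are invented; a real graph database stores edges as on-disk pointers rather than dictionaries, but the hop-by-hop lookup is the same idea):

```python
# Edges as adjacency sets; following an edge is a set lookup, not a scan.
follows  = {"bob": {"alice"}, "carol": {"bob"}}  # follower -> followees
authored = {"alice": {"post1"}}                  # user -> posts authored
liked    = {"bob": {"post1"}}                    # user -> posts liked

def followers_of_likers(author):
    """Users who follow someone who liked a post authored by `author`."""
    posts  = authored.get(author, set())                       # hop 1: AUTHORED
    likers = {u for u, ps in liked.items() if ps & posts}      # hop 2: LIKED
    return {f for f, tgts in follows.items() if tgts & likers} # hop 3: FOLLOWS

print(followers_of_likers("alice"))  # {'carol'}
```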

Use Cases

  • Social networks - friend recommendations, mutual connections, influence analysis
  • Recommendation engines - "users who bought X also bought Y" via shared purchase patterns
  • Fraud detection - identifying clusters of accounts with suspicious relationship patterns
  • Knowledge graphs - connecting entities across domains (products, categories, suppliers, regulations)
  • Network topology - modeling infrastructure dependencies, routing, impact analysis

Major Graph Databases

Neo4j is the most established graph database. Its query language, Cypher, reads almost like English: MATCH (a:User)-[:FOLLOWS]->(b:User) WHERE a.name = "Alice" RETURN b.name. Neo4j uses native graph storage - relationships are stored as direct pointers, not looked up in an index.

Amazon Neptune is a managed graph database supporting both property graph (Gremlin) and RDF (SPARQL) query models.

When to use a graph database

If your queries are primarily about relationships between entities - especially multi-hop traversals - a graph database will dramatically outperform a relational database. If your queries are primarily about filtering and aggregating attributes of individual entities, a relational or document database is the better choice.


SQL vs. Document Query Comparison

To make the difference concrete, here is the same data retrieval in SQL (relational) and in a MongoDB-style document query. The scenario: find all orders for a specific customer, including product details.
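A runnable sketch of both sides, using Python's built-in sqlite3 for the relational version and plain dicts for the document version. The table names, schema, and sample rows here are invented for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE products  (id INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE orders    (id INTEGER PRIMARY KEY, customer_id INTEGER,
                        product_id INTEGER, qty INTEGER);
INSERT INTO customers VALUES (1, 'Alice');
INSERT INTO products  VALUES (10, 'Mechanical Keyboard', 149.99);
INSERT INTO orders    VALUES (100, 1, 10, 2);
""")

# Relational: reassemble each order from three tables with JOINs.
sql_rows = db.execute("""
    SELECT o.id, p.name, p.price, o.qty
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    JOIN products  p ON p.id = o.product_id
    WHERE c.name = 'Alice'
""").fetchall()
print(sql_rows)  # [(100, 'Mechanical Keyboard', 149.99, 2)]

# Document-style: each order already embeds customer and product
# details, so the equivalent query is a single filter - no JOINs.
orders = [
    {"_id": 100,
     "customer": {"id": 1, "name": "Alice"},
     "items": [{"product": "Mechanical Keyboard", "price": 149.99, "qty": 2}]},
]
doc_rows = [o for o in orders if o["customer"]["name"] == "Alice"]
print(doc_rows[0]["items"][0]["product"])  # Mechanical Keyboard
```

The trade-off runs the other way too: the relational version stores the product name once, while the document version repeats it in every order that references it.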


The CAP Theorem

The CAP theorem (proposed by Eric Brewer in 2000, proven by Seth Gilbert and Nancy Lynch in 2002) states that a distributed data store can provide at most two of these three guarantees simultaneously:

  • Consistency (C) - every read receives the most recent write or an error
  • Availability (A) - every request receives a non-error response (though it may not contain the most recent write)
  • Partition tolerance (P) - the system continues operating despite network partitions between nodes

Why Only Two?

In any distributed system, network partitions will happen - switches fail, cables get cut, cloud availability zones lose connectivity. During a partition, the system must make a choice:

Choose CP (Consistency + Partition Tolerance): When a partition occurs, the system blocks or returns errors for requests it cannot guarantee are consistent. Nodes that cannot confirm they have the latest data refuse to serve reads. You get correct data or no data.

Choose AP (Availability + Partition Tolerance): When a partition occurs, every node continues serving requests using whatever data it has locally. Reads might return stale data, but the system never goes down. Nodes reconcile differences after the partition heals.

CA (Consistency + Availability): Only possible when there are no partitions - meaning a single-node system. The moment you distribute data across a network, you must handle partitions, and CA is off the table.
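One way to see the CP/AP choice is a toy node that must answer reads while partitioned (illustrative only - real systems make this decision per quorum, not per node):

```python
class Unavailable(Exception):
    """Raised when a CP node cannot guarantee a consistent answer."""

class Node:
    def __init__(self, mode, local_value):
        self.mode = mode            # "CP" or "AP"
        self.local_value = local_value
        self.partitioned = False

    def read(self):
        if self.partitioned and self.mode == "CP":
            # CP: correct data or no data.
            raise Unavailable("cannot confirm latest value")
        # AP: serve whatever is held locally, possibly stale.
        return self.local_value

cp, ap = Node("CP", 42), Node("AP", 42)
cp.partitioned = ap.partitioned = True
print(ap.read())  # 42 - possibly stale, but available
try:
    cp.read()
except Unavailable:
    print("CP node refused the read")
```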

Real-World CAP Classifications

| System | CAP Choice | Behavior During Partition |
|---|---|---|
| HBase | CP | Regions on the partitioned side become unavailable until the partition heals |
| MongoDB (majority write concern) | CP | Writes require acknowledgment from a majority of replica set members; the partitioned minority cannot accept writes |
| Cassandra | AP | All nodes continue accepting reads and writes; conflicts resolved by timestamp after the partition heals |
| DynamoDB | AP | Continues serving requests across partitions; eventual consistency by default |
| PostgreSQL (single node) | CA | No partitions possible on a single node; both consistent and available |

CAP is about partitions, not normal operation

CAP choices only matter during a network partition. Under normal operation, well-designed distributed databases provide all three properties. The CAP classification describes what the system sacrifices when things go wrong, not its everyday behavior.


Consistency Models

CAP describes a binary choice during failures. In practice, distributed databases offer a spectrum of consistency models that determine how up-to-date your reads are relative to writes, even during normal operation.

Strong Consistency

Every read sees the result of the most recent write. After a write completes, all subsequent reads from any node return that value. This is what relational databases provide by default within a transaction.

Cost: Higher latency (writes must propagate before reads can proceed), lower throughput, reduced availability during partitions.

Example: A bank transfer. After moving $500 from checking to savings, reading either account balance from any node must reflect the transfer.

Eventual Consistency

If no new writes occur, all nodes will eventually converge to the same value. There is no guarantee about how long "eventually" takes - it could be milliseconds or seconds.

Cost: Reads may return stale data. The application must tolerate temporarily inconsistent views.

Example: A social media "like" count. If Alice likes a post, Bob might see the old count for a few seconds. This is acceptable because the counter will converge, and the brief inconsistency has no real consequence.

Causal Consistency

If operation A causally precedes operation B (B depends on or was influenced by A), then every node sees A before B. Operations with no causal relationship may appear in different orders on different nodes.

Example: In a comment thread, a reply always appears after the message it replies to, but independent top-level comments may appear in different orders on different nodes.

Read-Your-Writes Consistency

After you perform a write, your subsequent reads are guaranteed to see that write. Other users may still see stale data.

Example: After updating your profile picture, you immediately see the new picture. Other users might briefly see your old picture. This is usually implemented by routing your reads to the same node that accepted your write.

Monotonic Reads

Once you read a value, subsequent reads will never return an older value. Your view of the data only moves forward in time, never backward.

Example: If you read a post that has 47 likes, the next time you read it you will see 47 or more likes, never 45. Without monotonic reads, load-balanced reads across replicas with different lag could show the count jumping backward.
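A sketch of how a client might enforce monotonic reads by remembering the highest version it has seen and skipping replicas that lag behind it (replica snapshots here are invented; real systems usually do this with session tokens):

```python
class MonotonicClient:
    def __init__(self, replicas):
        self.replicas = replicas  # list of (version, value) snapshots
        self.last_seen = -1       # highest version this client has read

    def read(self):
        for version, value in self.replicas:
            if version >= self.last_seen:   # never go backward in time
                self.last_seen = version
                return value
        raise RuntimeError("no replica is caught up to this client")

client = MonotonicClient([(5, "47 likes"), (3, "45 likes")])
print(client.read())  # 47 likes (version 5)

# Load balancing later routes us to a lagging replica first; the
# client skips it rather than show an older value.
client.replicas = [(3, "45 likes"), (6, "48 likes")]
print(client.read())  # 48 likes
```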

Choosing a Consistency Level

Most distributed databases let you choose consistency per operation. Cassandra, for example, lets you specify ONE, QUORUM, or ALL for each read or write:

| Cassandra Level | Behavior | Latency | Consistency |
|---|---|---|---|
| ONE | Read/write acknowledged by a single node | Lowest | Weakest (eventual) |
| QUORUM | Acknowledged by a majority of replicas | Medium | Strong if read + write quorum > replication factor |
| ALL | Acknowledged by every replica | Highest | Strongest, but one node failure blocks the operation |

The formula for strong consistency: if R + W > N, where R is the number of replicas a read must contact, W is the number of replicas that must acknowledge a write, and N is the replication factor, then every read is guaranteed to overlap with at least one node that has the latest acknowledged write.
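The overlap condition is simple enough to encode directly (a sketch, using R, W, and N as defined above):

```python
def quorum_is_strong(r, w, n):
    """True if a read quorum of r and write quorum of w over n replicas
    guarantee that reads overlap the latest acknowledged write."""
    return r + w > n

print(quorum_is_strong(2, 2, 3))  # True  - QUORUM reads + QUORUM writes, RF=3
print(quorum_is_strong(1, 1, 3))  # False - ONE + ONE: a read may miss the write
print(quorum_is_strong(1, 3, 3))  # True  - ALL writes permit ONE reads
```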


Polyglot Persistence

Polyglot persistence means using multiple database technologies within a single application, choosing each based on the access pattern of the data it stores.

Consider an e-commerce platform:

| Data | Database | Reasoning |
|---|---|---|
| Product catalog | MongoDB | Varying attributes per product category, flexible schema |
| User sessions | Redis | Sub-millisecond reads, automatic TTL expiration |
| Order transactions | PostgreSQL | ACID guarantees for financial data, complex reporting queries |
| Product recommendations | Neo4j | Traversing purchase history graphs for "also bought" suggestions |
| Search index | Elasticsearch | Full-text search with fuzzy matching and faceted navigation |
| Event stream | Apache Kafka | High-throughput append-only log for analytics pipeline |

Each database handles the workload it was designed for. The alternative - forcing all data into a single relational database - means fighting against the database for workloads it was not optimized for.

The Costs of Polyglot Persistence

This approach is not free:

  • Operational complexity - every database engine is another system to monitor, patch, back up, and recover
  • Data synchronization - keeping data consistent across systems requires event-driven architecture or change data capture
  • Team expertise - your team needs operational knowledge of each database
  • Transaction boundaries - distributed transactions across different databases are hard; most teams use eventual consistency between systems

Start simple

Begin with a single well-chosen database. Add specialized databases only when you have a measurable problem - not because an architecture diagram looks impressive with more boxes. Most applications work fine with one relational database for years.


Decision Framework: Relational vs. NoSQL

Choosing a database is not about ideology. It is about matching your workload to the system designed for it. Work through these criteria:

Data Structure

| Your data looks like... | Consider |
|---|---|
| Highly structured, well-defined relationships between entities | Relational (PostgreSQL, MySQL) |
| Semi-structured with varying attributes per record | Document store (MongoDB, CouchDB) |
| Simple lookups by identifier, no complex queries | Key-value (Redis, DynamoDB) |
| Sparse columns, high-volume append writes | Wide-column (Cassandra, HBase) |
| Heavily interconnected with multi-hop relationship queries | Graph (Neo4j, Neptune) |

Query Patterns

| You mostly need to... | Consider |
|---|---|
| Run ad-hoc queries with complex filters, JOINs, aggregations | Relational |
| Fetch complete entities by ID or simple filters | Document or key-value |
| Write massive volumes with simple reads by partition key | Wide-column |
| Traverse relationships between entities | Graph |

Consistency Requirements

| You need... | Consider |
|---|---|
| ACID transactions across multiple tables | Relational |
| Strong consistency for financial or inventory data | Relational or CP-configured NoSQL |
| Eventual consistency is acceptable | AP NoSQL (Cassandra, DynamoDB) |
| Tunable consistency per operation | Cassandra, DynamoDB, MongoDB |

Scale Requirements

| Your scale is... | Consider |
|---|---|
| Single server handles the load | Relational (simplest operational model) |
| Read-heavy, needs read replicas | Relational with replicas, or any NoSQL |
| Write-heavy, needs horizontal partitioning | Wide-column or sharded document store |
| Global distribution across data centers | Cassandra, DynamoDB, CockroachDB |

The Default Choice

If you are unsure, start with PostgreSQL. It handles JSON documents (via JSONB), full-text search, and horizontal read scaling with replicas. You can add specialized databases later when you have concrete evidence that PostgreSQL is not meeting a specific need.



