
NoSQL Concepts & Architecture

Relational databases dominated for decades, and for good reason - they provide strong guarantees, a mature query language, and well-understood operational patterns. But the web changed what "normal" looks like for data. Social networks generating billions of events per day, e-commerce catalogs with wildly different product attributes, real-time gaming leaderboards, IoT sensor streams - these workloads pushed relational systems to their limits and created demand for alternatives.

NoSQL ("Not Only SQL") is not a single technology. It is a family of database systems that relax one or more relational constraints - fixed schemas, JOIN-heavy queries, single-node scaling, ACID transactions - in exchange for benefits like horizontal scalability, schema flexibility, or specialized data models. The term emerged around 2009, but the ideas go back further: key-value stores like BerkeleyDB existed in the 1990s.

The core motivations behind NoSQL adoption:

  • Horizontal scaling - distribute data across commodity servers instead of buying increasingly expensive hardware
  • Schema flexibility - store documents with varying structures without ALTER TABLE migrations
  • Developer productivity - work with data models that map directly to application objects
  • Specialized query patterns - graph traversals, time-series aggregations, and full-text search that relational systems handle poorly
  • High availability - stay operational even when individual nodes fail

NoSQL does not mean "no SQL" or "better than SQL"

NoSQL databases are not replacements for relational systems. They are alternatives optimized for specific access patterns. Most production applications use both relational and NoSQL databases together. Choosing the wrong database type for your workload causes more pain than the problem you were trying to solve.


NoSQL Database Categories

NoSQL databases fall into four major categories, each built around a different data model and optimized for different access patterns.

flowchart TD
    A[NoSQL Databases] --> B[Document Stores]
    A --> C[Key-Value Stores]
    A --> D[Wide-Column Stores]
    A --> E[Graph Databases]
    B --> B1[MongoDB]
    B --> B2[CouchDB]
    B --> B3[Amazon DocumentDB]
    C --> C1[Redis]
    C --> C2[Memcached]
    C --> C3[Amazon DynamoDB]
    D --> D1[Apache Cassandra]
    D --> D2[Apache HBase]
    D --> D3[ScyllaDB]
    E --> E1[Neo4j]
    E --> E2[Amazon Neptune]
    E --> E3[ArangoDB]

Document Stores

A document store organizes data as self-contained documents - typically JSON, BSON (binary JSON), or XML. Each document holds all the data for a single entity, including nested objects and arrays. Documents are grouped into collections (analogous to tables), but unlike rows in a relational table, documents in the same collection can have completely different fields.

The Document Model

Consider a product catalog. In a relational database, you might need a products table, a product_attributes table, a product_images table, and JOINs to reassemble them. In a document store, each product is a single document:

{
  "_id": "prod_8842",
  "name": "Mechanical Keyboard",
  "brand": "KeyCraft",
  "price": 149.99,
  "specs": {
    "switch_type": "Cherry MX Brown",
    "layout": "TKL",
    "backlight": "RGB",
    "connectivity": ["USB-C", "Bluetooth 5.0"]
  },
  "images": [
    {"url": "/img/kb-front.jpg", "alt": "Front view"},
    {"url": "/img/kb-side.jpg", "alt": "Side profile"}
  ],
  "reviews_count": 342,
  "avg_rating": 4.6
}

This document embeds everything the application needs in a single read. No JOINs, no multi-table lookups. The specs object has fields specific to keyboards - a document for headphones in the same collection would have entirely different spec fields, and that is perfectly valid.

Strengths and Trade-offs

| Strength | Trade-off |
|---|---|
| Schema flexibility - add fields without migrations | No enforced schema means the application must handle inconsistencies |
| Nested data reduces JOINs | Deep nesting creates update complexity (changing a nested value requires rewriting the document) |
| Developer-friendly JSON mapping | Cross-document queries (equivalent to JOINs) are expensive or unsupported |
| Horizontal scaling via sharding | Denormalized data means updates may need to touch multiple documents |

Major Document Stores

MongoDB is the most widely used document database. It stores BSON documents, supports rich queries with an aggregation pipeline, and provides replica sets for high availability and sharding for horizontal scaling. You will work with MongoDB in depth in the next guide.

CouchDB uses a RESTful HTTP API and stores plain JSON. It pioneered multi-master replication with eventual consistency - useful for offline-first applications that sync when connectivity returns.


Key-Value Stores

A key-value store is the simplest NoSQL model. You store a value (a string, a number, a serialized object, a blob) under a unique key, and you retrieve it by that key. The database treats the value as opaque - it does not parse or index the contents. This simplicity enables extreme performance.

The Data Model

Key                          Value
─────────────────────────    ────────────────────────────────
session:abc123               {"user_id": 7042, "role": "admin", "expires": 1735689600}
cache:product:8842           {"name": "Mechanical Keyboard", "price": 149.99, ...}
feature:dark-mode            "enabled"
rate-limit:192.168.1.50      "47"

Every operation is a GET(key), SET(key, value), or DELETE(key). Some key-value stores add atomic counters, expiration (TTL), and basic data structures, but the core model stays the same.
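A toy in-memory store makes the model concrete - this is an illustrative sketch (not any real client library) of GET/SET/DELETE plus the lazy TTL expiration that caches like Memcached use:

```python
import time

class KVStore:
    """Minimal in-memory key-value store sketch with optional TTL expiry."""
    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        expires_at = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]  # lazy expiration: purge on read
            return None
        return value

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.set("session:abc123", {"user_id": 7042, "role": "admin"})
store.set("feature:dark-mode", "enabled", ttl=60)
print(store.get("session:abc123")["role"])  # admin
```

Note that the store never inspects the value - it only hashes the key, which is what makes the model so fast.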

Use Cases

  • Caching - store expensive query results or rendered page fragments with a TTL
  • Session management - fast read/write for user session data
  • Feature flags - simple key lookups to toggle application behavior
  • Rate limiting - atomic increment counters per IP or API key
  • Leaderboards - sorted sets (in stores like Redis that support them)

Major Key-Value Stores

Redis goes well beyond simple key-value. It supports strings, hashes, lists, sets, sorted sets, streams, and more - all held in memory for sub-millisecond latency. Redis is covered in its own guide later in this course.

Memcached is a pure in-memory cache with a simpler feature set than Redis. It excels at distributed caching for web applications and has been a staple since the mid-2000s.

Amazon DynamoDB is a managed key-value and document database. Despite supporting richer queries than a pure key-value store, its core access pattern is key-based lookup, and it scales horizontally with minimal operational overhead.

Key design matters

In key-value stores, the key is your only query mechanism. Design keys with namespaces and hierarchies (user:7042:preferences, cache:v2:product:8842) so you can reason about your data and implement expiration policies per namespace.
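A tiny sketch of that idea - namespaced key construction plus a per-namespace policy lookup (the namespace names and TTL values here are invented for illustration):

```python
def make_key(*parts):
    """Build a namespaced key like 'cache:v2:product:8842'."""
    return ":".join(str(p) for p in parts)

# Per-namespace expiration policy, keyed on the first key segment.
TTL_BY_NAMESPACE = {"cache": 300, "session": 3600, "rate-limit": 60}

def ttl_for(key):
    """Return the TTL (seconds) for a key's namespace, or None."""
    return TTL_BY_NAMESPACE.get(key.split(":", 1)[0])

print(make_key("user", 7042, "preferences"))  # user:7042:preferences
print(ttl_for("session:abc123"))              # 3600
```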


Wide-Column Stores

A wide-column store (sometimes called a column-family store) organizes data into rows and columns, but unlike a relational table, each row can have a different set of columns. Columns are grouped into column families, and the database is optimized for reading and writing entire column families efficiently.

The Data Model

Think of a wide-column store as a two-dimensional key-value store: a row key maps to a set of column-family entries, and each column family contains a dynamic set of columns.

| Row Key | profile (column family) | activity (column family) |
|---|---|---|
| user:1001 | name: "Alice", email: "alice@example.com" | last_login: "2025-03-15", page_views: 4821 |
| user:1002 | name: "Bob", org: "Acme Corp" | last_login: "2025-03-14" |
| user:1003 | name: "Charlie", email: "charlie@dev.io", phone: "+1-555-0199" | last_login: "2025-03-15", page_views: 127, api_calls: 9483 |

Notice that user:1001 has no phone column, user:1002 has an org column the others lack, and user:1003 has api_calls that the others do not. This sparsity is the defining characteristic - you do not waste storage on NULL columns, and adding a new column does not require a schema migration.
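The sparse two-dimensional model can be sketched with nested dictionaries - row key, then column family, then columns. This is purely illustrative (not a real wide-column client), but it shows how absent columns cost nothing:

```python
from collections import defaultdict

# row_key -> column_family -> {column: value}; rows simply omit
# columns they don't have, so no NULLs are ever stored.
table = defaultdict(lambda: defaultdict(dict))

table["user:1001"]["profile"].update(name="Alice", email="alice@example.com")
table["user:1001"]["activity"].update(last_login="2025-03-15", page_views=4821)
table["user:1002"]["profile"].update(name="Bob", org="Acme Corp")

# Reading one column family touches only that family's columns:
print(table["user:1002"]["profile"])             # {'name': 'Bob', 'org': 'Acme Corp'}
print("phone" in table["user:1001"]["profile"])  # False - sparse, not NULL
```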

Use Cases

  • Time-series data - sensor readings, application metrics, event logs (row key = sensor ID + time bucket)
  • Analytics - aggregation queries over large datasets where you read specific column families
  • Content management - storing metadata alongside content with varying attributes
  • IoT data ingestion - high write throughput for millions of devices

Major Wide-Column Stores

Apache Cassandra is designed for massive write throughput and high availability across multiple data centers. It uses a peer-to-peer architecture with no single point of failure, tunable consistency per query, and a SQL-like query language called CQL.

Apache HBase runs on top of Hadoop's HDFS and provides strong consistency with a single-master architecture. It is commonly used for random read/write access to large datasets in the Hadoop ecosystem.

ScyllaDB is a Cassandra-compatible database rewritten in C++ for lower latency. It provides the same CQL interface with significantly better performance per node.


Graph Databases

A graph database stores data as nodes (entities), edges (relationships), and properties (attributes on both nodes and edges). The relationships are first-class citizens - stored directly as pointers between nodes rather than computed via JOINs at query time. This makes traversing connections fast regardless of dataset size.

The Data Model

In a social network:

  • Nodes: User("Alice"), User("Bob"), Post("Graph databases are underrated")
  • Edges: Alice -[:FOLLOWS]-> Bob, Alice -[:AUTHORED]-> Post, Bob -[:LIKED]-> Post
  • Properties: FOLLOWS {since: "2024-01-15"}, User {name: "Alice", joined: "2023-06-01"}

A query like "find all users who follow someone who liked a post authored by Alice" requires three JOINs in a relational database but is a direct graph traversal:

(Alice)-[:AUTHORED]->(post)<-[:LIKED]-(liker)<-[:FOLLOWS]-(follower)

This traversal stays fast because following an edge is a pointer lookup, not a table scan. In relational databases, each additional JOIN multiplies the cost. In graph databases, each additional hop is roughly constant time.
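A minimal adjacency-set sketch of that traversal in Python (the users and posts are invented; a real graph database stores edges as on-disk pointers rather than dictionaries, but the hop-by-hop lookup is the same idea):

```python
# Edges as adjacency sets; following an edge is a set lookup, not a scan.
follows  = {"bob": {"alice"}, "carol": {"bob"}}  # follower -> followees
authored = {"alice": {"post1"}}                  # user -> posts authored
liked    = {"bob": {"post1"}}                    # user -> posts liked

def followers_of_likers(author):
    """Users who follow someone who liked a post authored by `author`."""
    posts  = authored.get(author, set())                       # hop 1: AUTHORED
    likers = {u for u, ps in liked.items() if ps & posts}      # hop 2: LIKED
    return {f for f, tgts in follows.items() if tgts & likers} # hop 3: FOLLOWS

print(followers_of_likers("alice"))  # {'carol'}
```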

Use Cases

  • Social networks - friend recommendations, mutual connections, influence analysis
  • Recommendation engines - "users who bought X also bought Y" via shared purchase patterns
  • Fraud detection - identifying clusters of accounts with suspicious relationship patterns
  • Knowledge graphs - connecting entities across domains (products, categories, suppliers, regulations)
  • Network topology - modeling infrastructure dependencies, routing, impact analysis

Major Graph Databases

Neo4j is the most established graph database. Its query language, Cypher, reads almost like English: MATCH (a:User)-[:FOLLOWS]->(b:User) WHERE a.name = "Alice" RETURN b.name. Neo4j uses native graph storage - relationships are stored as direct pointers, not looked up in an index.

Amazon Neptune is a managed graph database supporting both property graph (Gremlin) and RDF (SPARQL) query models.

When to use a graph database

If your queries are primarily about relationships between entities - especially multi-hop traversals - a graph database will dramatically outperform a relational database. If your queries are primarily about filtering and aggregating attributes of individual entities, a relational or document database is the better choice.


SQL vs. Document Query Comparison

To make the difference concrete, here is the same data retrieval in SQL (relational) and in a MongoDB-style document query. The scenario: find all orders for a specific customer, including product details.
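A runnable sketch of both sides, using Python's built-in sqlite3 for the relational version and plain dicts for the document version. The table names, schema, and sample rows here are invented for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE products  (id INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE orders    (id INTEGER PRIMARY KEY, customer_id INTEGER,
                        product_id INTEGER, qty INTEGER);
INSERT INTO customers VALUES (1, 'Alice');
INSERT INTO products  VALUES (10, 'Mechanical Keyboard', 149.99);
INSERT INTO orders    VALUES (100, 1, 10, 2);
""")

# Relational: reassemble each order from three tables with JOINs.
sql_rows = db.execute("""
    SELECT o.id, p.name, p.price, o.qty
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    JOIN products  p ON p.id = o.product_id
    WHERE c.name = 'Alice'
""").fetchall()
print(sql_rows)  # [(100, 'Mechanical Keyboard', 149.99, 2)]

# Document-style: each order already embeds customer and product
# details, so the equivalent query is a single filter - no JOINs.
orders = [
    {"_id": 100,
     "customer": {"id": 1, "name": "Alice"},
     "items": [{"product": "Mechanical Keyboard", "price": 149.99, "qty": 2}]},
]
doc_rows = [o for o in orders if o["customer"]["name"] == "Alice"]
print(doc_rows[0]["items"][0]["product"])  # Mechanical Keyboard
```

The trade-off runs the other way too: the relational version stores the product name once, while the document version repeats it in every order that references it.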


The CAP Theorem

The CAP theorem (proposed by Eric Brewer in 2000, proven by Seth Gilbert and Nancy Lynch in 2002) states that a distributed data store can provide at most two of these three guarantees simultaneously:

  • Consistency (C) - every read receives the most recent write or an error
  • Availability (A) - every request receives a non-error response (though it may not contain the most recent write)
  • Partition tolerance (P) - the system continues operating despite network partitions between nodes

Why Only Two?

In any distributed system, network partitions will happen - switches fail, cables get cut, cloud availability zones lose connectivity. During a partition, the system must make a choice:

Choose CP (Consistency + Partition Tolerance): When a partition occurs, the system blocks or returns errors for requests it cannot guarantee are consistent. Nodes that cannot confirm they have the latest data refuse to serve reads. You get correct data or no data.

Choose AP (Availability + Partition Tolerance): When a partition occurs, every node continues serving requests using whatever data it has locally. Reads might return stale data, but the system never goes down. Nodes reconcile differences after the partition heals.

CA (Consistency + Availability): Only possible when there are no partitions - meaning a single-node system. The moment you distribute data across a network, you must handle partitions, and CA is off the table.
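One way to see the CP/AP choice is a toy node that must answer reads while partitioned (illustrative only - real systems make this decision per quorum, not per node):

```python
class Unavailable(Exception):
    """Raised when a CP node cannot guarantee a consistent answer."""

class Node:
    def __init__(self, mode, local_value):
        self.mode = mode            # "CP" or "AP"
        self.local_value = local_value
        self.partitioned = False

    def read(self):
        if self.partitioned and self.mode == "CP":
            # CP: correct data or no data.
            raise Unavailable("cannot confirm latest value")
        # AP: serve whatever is held locally, possibly stale.
        return self.local_value

cp, ap = Node("CP", 42), Node("AP", 42)
cp.partitioned = ap.partitioned = True
print(ap.read())  # 42 - possibly stale, but available
try:
    cp.read()
except Unavailable:
    print("CP node refused the read")
```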

Real-World CAP Classifications

| System | CAP Choice | Behavior During Partition |
|---|---|---|
| HBase | CP | Regions on the partitioned side become unavailable until the partition heals |
| MongoDB (majority write concern) | CP | Writes require acknowledgment from a majority of replica set members; the partitioned minority cannot accept writes |
| Cassandra | AP | All nodes continue accepting reads and writes; conflicts resolved by timestamp after the partition heals |
| DynamoDB | AP | Continues serving requests across partitions; eventual consistency by default |
| PostgreSQL (single node) | CA | No partitions possible on a single node; both consistent and available |

CAP is about partitions, not normal operation

CAP choices only matter during a network partition. Under normal operation, well-designed distributed databases provide all three properties. The CAP classification describes what the system sacrifices when things go wrong, not its everyday behavior.


Consistency Models

CAP describes a binary choice during failures. In practice, distributed databases offer a spectrum of consistency models that determine how up-to-date your reads are relative to writes, even during normal operation.

Strong Consistency

Every read sees the result of the most recent write. After a write completes, all subsequent reads from any node return that value. This is what relational databases provide by default within a transaction.

Cost: Higher latency (writes must propagate before reads can proceed), lower throughput, reduced availability during partitions.

Example: A bank transfer. After moving $500 from checking to savings, reading either account balance from any node must reflect the transfer.

Eventual Consistency

If no new writes occur, all nodes will eventually converge to the same value. There is no guarantee about how long "eventually" takes - it could be milliseconds or seconds.

Cost: Reads may return stale data. The application must tolerate temporarily inconsistent views.

Example: A social media "like" count. If Alice likes a post, Bob might see the old count for a few seconds. This is acceptable because the counter will converge, and the brief inconsistency has no real consequence.

Causal Consistency

If operation A causally precedes operation B (B depends on or was influenced by A), then every node sees A before B. Operations with no causal relationship may appear in different orders on different nodes.

Example: In a comment thread, a reply always appears after the message it replies to, but independent top-level comments may appear in different orders on different nodes.

Read-Your-Writes Consistency

After you perform a write, your subsequent reads are guaranteed to see that write. Other users may still see stale data.

Example: After updating your profile picture, you immediately see the new picture. Other users might briefly see your old picture. This is usually implemented by routing your reads to the same node that accepted your write.

Monotonic Reads

Once you read a value, subsequent reads will never return an older value. Your view of the data only moves forward in time, never backward.

Example: If you read a post that has 47 likes, the next time you read it you will see 47 or more likes, never 45. Without monotonic reads, load-balanced reads across replicas with different lag could show the count jumping backward.
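A sketch of how a client might enforce monotonic reads by remembering the highest version it has seen and skipping replicas that lag behind it (replica snapshots here are invented; real systems usually do this with session tokens):

```python
class MonotonicClient:
    def __init__(self, replicas):
        self.replicas = replicas  # list of (version, value) snapshots
        self.last_seen = -1       # highest version this client has read

    def read(self):
        for version, value in self.replicas:
            if version >= self.last_seen:   # never go backward in time
                self.last_seen = version
                return value
        raise RuntimeError("no replica is caught up to this client")

client = MonotonicClient([(5, "47 likes"), (3, "45 likes")])
print(client.read())  # 47 likes (version 5)

# Load balancing later routes us to a lagging replica first; the
# client skips it rather than show an older value.
client.replicas = [(3, "45 likes"), (6, "48 likes")]
print(client.read())  # 48 likes
```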

Choosing a Consistency Level

Most distributed databases let you choose consistency per operation. Cassandra, for example, lets you specify ONE, QUORUM, or ALL for each read or write:

| Cassandra Level | Behavior | Latency | Consistency |
|---|---|---|---|
| ONE | Read/write acknowledged by a single node | Lowest | Weakest (eventual) |
| QUORUM | Acknowledged by a majority of replicas | Medium | Strong if read + write quorum > replication factor |
| ALL | Acknowledged by every replica | Highest | Strongest, but one node failure blocks the operation |

The formula for strong consistency: if R + W > N, where R is the number of replicas a read must contact, W is the number of replicas that must acknowledge a write, and N is the replication factor, then every read is guaranteed to overlap with at least one node that has the latest acknowledged write.
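The overlap condition is simple enough to encode directly (a sketch, using R, W, and N as defined above):

```python
def quorum_is_strong(r, w, n):
    """True if a read quorum of r and write quorum of w over n replicas
    guarantee that reads overlap the latest acknowledged write."""
    return r + w > n

print(quorum_is_strong(2, 2, 3))  # True  - QUORUM reads + QUORUM writes, RF=3
print(quorum_is_strong(1, 1, 3))  # False - ONE + ONE: a read may miss the write
print(quorum_is_strong(1, 3, 3))  # True  - ALL writes permit ONE reads
```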


Polyglot Persistence

Polyglot persistence means using multiple database technologies within a single application, choosing each based on the access pattern of the data it stores.

Consider an e-commerce platform:

| Data | Database | Reasoning |
|---|---|---|
| Product catalog | MongoDB | Varying attributes per product category, flexible schema |
| User sessions | Redis | Sub-millisecond reads, automatic TTL expiration |
| Order transactions | PostgreSQL | ACID guarantees for financial data, complex reporting queries |
| Product recommendations | Neo4j | Traversing purchase history graphs for "also bought" suggestions |
| Search index | Elasticsearch | Full-text search with fuzzy matching and faceted navigation |
| Event stream | Apache Kafka | High-throughput append-only log for analytics pipeline |

Each database handles the workload it was designed for. The alternative - forcing all data into a single relational database - means fighting against the database for workloads it was not optimized for.

The Costs of Polyglot Persistence

This approach is not free:

  • Operational complexity - every database engine is another system to monitor, patch, back up, and recover
  • Data synchronization - keeping data consistent across systems requires event-driven architecture or change data capture
  • Team expertise - your team needs operational knowledge of each database
  • Transaction boundaries - distributed transactions across different databases are hard; most teams use eventual consistency between systems

Start simple

Begin with a single well-chosen database. Add specialized databases only when you have a measurable problem - not because an architecture diagram looks impressive with more boxes. Most applications work fine with one relational database for years.


Decision Framework: Relational vs. NoSQL

Choosing a database is not about ideology. It is about matching your workload to the system designed for it. Work through these criteria:

Data Structure

| Your data looks like... | Consider |
|---|---|
| Highly structured, well-defined relationships between entities | Relational (PostgreSQL, MySQL) |
| Semi-structured with varying attributes per record | Document store (MongoDB, CouchDB) |
| Simple lookups by identifier, no complex queries | Key-value (Redis, DynamoDB) |
| Sparse columns, high-volume append writes | Wide-column (Cassandra, HBase) |
| Heavily interconnected with multi-hop relationship queries | Graph (Neo4j, Neptune) |

Query Patterns

| You mostly need to... | Consider |
|---|---|
| Run ad-hoc queries with complex filters, JOINs, aggregations | Relational |
| Fetch complete entities by ID or simple filters | Document or key-value |
| Write massive volumes with simple reads by partition key | Wide-column |
| Traverse relationships between entities | Graph |

Consistency Requirements

| You need... | Consider |
|---|---|
| ACID transactions across multiple tables | Relational |
| Strong consistency for financial or inventory data | Relational or CP-configured NoSQL |
| Eventual consistency is acceptable | AP NoSQL (Cassandra, DynamoDB) |
| Tunable consistency per operation | Cassandra, DynamoDB, MongoDB |

Scale Requirements

| Your scale is... | Consider |
|---|---|
| Single server handles the load | Relational (simplest operational model) |
| Read-heavy, needs read replicas | Relational with replicas, or any NoSQL |
| Write-heavy, needs horizontal partitioning | Wide-column or sharded document store |
| Global distribution across data centers | Cassandra, DynamoDB, CockroachDB |

The Default Choice

If you are unsure, start with PostgreSQL. It handles JSON documents (via JSONB), full-text search, and horizontal read scaling with replicas. You can add specialized databases later when you have concrete evidence that PostgreSQL is not meeting a specific need.



