Database Design & Modeling

Good schema design is the foundation that everything else rests on - query performance, data integrity, application complexity, and operational burden. A poorly designed schema creates problems that no amount of indexing or hardware can fix. A well-designed one makes queries obvious, constraints enforceable, and future changes manageable.

This guide covers the full lifecycle of turning requirements into schema: modeling entities and relationships, normalizing to eliminate redundancy, choosing when to denormalize for performance, selecting appropriate data types, enforcing integrity through constraints, and building indexes that make your queries fast.


Entity-Relationship Modeling

Before writing any CREATE TABLE statements, you need a clear picture of what your data looks like and how the pieces connect. Entity-Relationship (ER) diagrams are the standard tool for this. They capture the nouns in your system (entities), the attributes those nouns have, and the connections (relationships) between them.

Entities and Attributes

An entity is a thing your system tracks - a customer, an order, a product, an employee. Each entity has attributes: a customer has a name and email, an order has a date and total, a product has a price and description.

In a relational database, each entity becomes a table. Each attribute becomes a column.
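
For example, a customer entity might map to a table like this (a minimal sketch; the table name, column names, and sizes are illustrative):

CREATE TABLE customers (
    customer_id INT PRIMARY KEY,        -- identifies each customer row
    name        VARCHAR(255) NOT NULL,  -- attribute: name
    email       VARCHAR(255) NOT NULL,  -- attribute: email
    created_at  TIMESTAMP NOT NULL      -- attribute: signup time
);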

Relationships and Cardinality

Entities connect to each other through relationships, and those relationships have cardinality - rules about how many of one entity can relate to another:

| Cardinality | Meaning | Example |
| --- | --- | --- |
| One-to-one (1:1) | Each row in A matches exactly one row in B | User and user profile |
| One-to-many (1:N) | Each row in A matches zero or more rows in B | Customer and orders |
| Many-to-many (M:N) | Rows in A can match multiple rows in B and vice versa | Students and courses |

Many-to-many relationships require a junction table (also called a join table or associative table) to implement in a relational database. The junction table holds foreign keys pointing to both sides.
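
For the students-and-courses example above, a junction table might look like this (a sketch; table and column names are illustrative):

CREATE TABLE enrollments (
    student_id INT NOT NULL REFERENCES students(student_id),
    course_id  INT NOT NULL REFERENCES courses(course_id),
    PRIMARY KEY (student_id, course_id)  -- each student/course pair appears at most once
);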

Reading an ER Diagram

Here is an ER diagram for a simple e-commerce system. Each line represents a relationship, and the symbols at each end indicate cardinality:

erDiagram
    CUSTOMER ||--o{ ORDER : places
    ORDER ||--|{ ORDER_LINE : contains
    PRODUCT ||--o{ ORDER_LINE : "appears in"
    PRODUCT }o--|| CATEGORY : "belongs to"
    CUSTOMER {
        int customer_id PK
        varchar name
        varchar email
        timestamp created_at
    }
    ORDER {
        int order_id PK
        int customer_id FK
        date order_date
        decimal total
        varchar status
    }
    ORDER_LINE {
        int order_id FK
        int product_id FK
        int quantity
        decimal unit_price
    }
    PRODUCT {
        int product_id PK
        int category_id FK
        varchar name
        decimal price
        text description
    }
    CATEGORY {
        int category_id PK
        varchar name
    }

The notation works like this:

  • || means "exactly one"
  • o{ means "zero or more"
  • |{ means "one or more"

So CUSTOMER ||--o{ ORDER reads as: each customer places zero or more orders, and each order belongs to exactly one customer.

Start with the diagram

Sketch your ER diagram before writing SQL. It forces you to think about relationships and cardinality up front, and catches design mistakes that are painful to fix after tables are populated with data.


Normalization

Normalization is the process of organizing columns and tables to reduce data redundancy and prevent update anomalies. Each normal form builds on the previous one, imposing stricter rules about how data is structured.

The goal is not normalization for its own sake. The goal is a schema where each fact is stored exactly once, so updates cannot create inconsistencies.

First Normal Form (1NF)

A table is in first normal form if:

  1. Every column contains atomic (indivisible) values - no lists, sets, or nested structures
  2. Each row is unique (the table has a primary key)

Problem - a table violating 1NF:

| order_id | customer | products |
| --- | --- | --- |
| 1 | Alice | Widget, Gadget, Sprocket |
| 2 | Bob | Gadget |

The products column contains a comma-separated list. You cannot query for all orders containing "Widget" without string parsing. You cannot count products per order without splitting strings. You cannot enforce that each product name is valid.

Fix - split into atomic values with a proper relationship:

orders:

| order_id | customer |
| --- | --- |
| 1 | Alice |
| 2 | Bob |

order_products:

| order_id | product |
| --- | --- |
| 1 | Widget |
| 1 | Gadget |
| 1 | Sprocket |
| 2 | Gadget |

Now each cell holds one value, each row is unique, and you can query, index, and constrain the data properly.
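
In SQL, the fixed design might look like this (a sketch; table names are illustrative):

CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer VARCHAR(255) NOT NULL
);

CREATE TABLE order_products (
    order_id INT NOT NULL REFERENCES orders(order_id),
    product  VARCHAR(255) NOT NULL,
    PRIMARY KEY (order_id, product)  -- one row per order/product pair
);

-- Finding every order containing "Widget" is now a plain lookup:
SELECT order_id FROM order_products WHERE product = 'Widget';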

Second Normal Form (2NF)

A table is in second normal form if:

  1. It is in 1NF
  2. Every non-key column depends on the entire primary key, not just part of it

2NF only matters for tables with composite primary keys. If your primary key is a single column, 1NF compliance automatically gives you 2NF.

Problem - a table violating 2NF:

| order_id | product_id | product_name | quantity |
| --- | --- | --- | --- |
| 1 | 10 | Widget | 3 |
| 1 | 20 | Gadget | 1 |
| 2 | 10 | Widget | 5 |

The primary key is (order_id, product_id). But product_name depends only on product_id, not on the full composite key. If you rename "Widget" to "Super Widget", you must update every row where product_id = 10. Miss one and your data is inconsistent.

Fix - move partial dependencies to their own table:

order_lines:

| order_id | product_id | quantity |
| --- | --- | --- |
| 1 | 10 | 3 |
| 1 | 20 | 1 |
| 2 | 10 | 5 |

products:

| product_id | product_name |
| --- | --- |
| 10 | Widget |
| 20 | Gadget |

Now product_name is stored once and referenced by product_id.
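
The decomposition loses nothing: the original rows are still available with a join (a sketch against the two tables above):

SELECT ol.order_id, ol.product_id, p.product_name, ol.quantity
FROM order_lines ol
JOIN products p ON p.product_id = ol.product_id;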

Third Normal Form (3NF)

A table is in third normal form if:

  1. It is in 2NF
  2. No non-key column depends on another non-key column (no transitive dependencies)

Problem - a table violating 3NF:

| employee_id | name | department_id | department_name | department_location |
| --- | --- | --- | --- | --- |
| 1 | Alice | 10 | Engineering | Building A |
| 2 | Bob | 10 | Engineering | Building A |
| 3 | Carol | 20 | Marketing | Building B |

Here, department_name and department_location depend on department_id, not on employee_id. If Engineering moves to Building C, you update multiple rows. Miss one and you have Engineering in two buildings.

Fix - extract the transitive dependency:

employees:

| employee_id | name | department_id |
| --- | --- | --- |
| 1 | Alice | 10 |
| 2 | Bob | 10 |
| 3 | Carol | 20 |

departments:

| department_id | department_name | department_location |
| --- | --- | --- |
| 10 | Engineering | Building A |
| 20 | Marketing | Building B |

The department fact is stored once. Updates happen in one place.
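
Moving Engineering to Building C is now a single-row change instead of a sweep over every employee:

UPDATE departments
SET department_location = 'Building C'
WHERE department_id = 10;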

Boyce-Codd Normal Form (BCNF)

Boyce-Codd Normal Form is a stricter version of 3NF. A table is in BCNF if every determinant (a column or set of columns that other columns depend on) is a candidate key.

The difference from 3NF is subtle and only surfaces in tables with multiple overlapping candidate keys. Here is a classic example:

Problem - a table in 3NF but not BCNF:

Consider a university where each subject is taught by exactly one professor, and each professor teaches exactly one subject. Students can enroll in multiple subjects.

| student_id | subject | professor |
| --- | --- | --- |
| 1 | Math | Dr. Smith |
| 1 | Physics | Dr. Jones |
| 2 | Math | Dr. Smith |

The candidate keys are (student_id, subject) and (student_id, professor). But subject determines professor and professor determines subject - and neither is a candidate key on its own. If Dr. Smith switches from Math to Chemistry, you update multiple rows.

Fix - decompose so every determinant is a candidate key:

professor_subjects:

| professor | subject |
| --- | --- |
| Dr. Smith | Math |
| Dr. Jones | Physics |

student_enrollments:

| student_id | professor |
| --- | --- |
| 1 | Dr. Smith |
| 1 | Dr. Jones |
| 2 | Dr. Smith |

Now the professor -> subject dependency lives in a table where professor is the key.
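
In DDL, the decomposition can make the database enforce both dependencies (a sketch; types and sizes are illustrative):

CREATE TABLE professor_subjects (
    professor VARCHAR(100) PRIMARY KEY,     -- professor -> subject: professor is now a key
    subject   VARCHAR(100) NOT NULL UNIQUE  -- subject -> professor: each subject taught once
);

CREATE TABLE student_enrollments (
    student_id INT NOT NULL,
    professor  VARCHAR(100) NOT NULL REFERENCES professor_subjects(professor),
    PRIMARY KEY (student_id, professor)
);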

Normalization is a tool, not a religion

In practice, most production schemas aim for 3NF. BCNF violations are rare and usually harmless. Going beyond 3NF (there are 4NF, 5NF, and even 6NF) brings diminishing returns and increasing complexity. Know the rules so you can make informed decisions about when to follow them and when to break them.


Denormalization

Fully normalized schemas store each fact once, which is excellent for writes and data integrity. But reads can suffer because answering a single question may require joining five or six tables. Denormalization is the deliberate introduction of redundancy to speed up reads.

When to Denormalize

Denormalization makes sense when:

  • A query is run frequently and the joins are measurably slow
  • The data changes rarely but is read constantly (high read-to-write ratio)
  • Reporting queries need pre-aggregated data to avoid scanning millions of rows
  • You have measured the problem with EXPLAIN and confirmed that joins are the bottleneck

Denormalization does not make sense when:

  • You are guessing about performance before measuring
  • Write consistency is critical and you cannot tolerate stale duplicated data
  • The application is write-heavy

Common Denormalization Strategies

Pre-computed columns. Store a calculated value directly on the row instead of computing it at query time:

-- Instead of SUM(order_lines.quantity * order_lines.unit_price) every time:
ALTER TABLE orders ADD COLUMN total_amount DECIMAL(10,2);

You update total_amount whenever an order line changes. The trade-off is that your application code (or a trigger) must keep it in sync.
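
If you choose the trigger route, a PostgreSQL-flavored sketch might look like this (function and trigger names are illustrative; EXECUTE FUNCTION requires PostgreSQL 11+):

CREATE OR REPLACE FUNCTION refresh_order_total() RETURNS trigger AS $$
DECLARE
    affected_order INT;
BEGIN
    -- OLD is the only row visible on DELETE, NEW on INSERT
    IF TG_OP = 'DELETE' THEN
        affected_order := OLD.order_id;
    ELSE
        affected_order := NEW.order_id;
    END IF;

    -- Recompute the total from the order lines that remain
    UPDATE orders
    SET total_amount = (SELECT COALESCE(SUM(quantity * unit_price), 0)
                        FROM order_lines
                        WHERE order_id = affected_order)
    WHERE order_id = affected_order;

    RETURN NULL;  -- return value is ignored for AFTER row triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_refresh_order_total
AFTER INSERT OR UPDATE OR DELETE ON order_lines
FOR EACH ROW EXECUTE FUNCTION refresh_order_total();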

Duplicated columns. Copy a frequently accessed column from a related table:

-- Instead of joining to customers for every order display:
ALTER TABLE orders ADD COLUMN customer_name VARCHAR(255);

If the customer's name changes, every order must be updated. This is acceptable if name changes are rare and order listing queries are constant.

Summary tables. Maintain a separate table with pre-aggregated data:

CREATE TABLE daily_sales_summary (
    summary_date DATE PRIMARY KEY,
    total_orders INT NOT NULL,
    total_revenue DECIMAL(12,2) NOT NULL,
    avg_order_value DECIMAL(10,2) NOT NULL
);

Populated by a nightly batch job or maintained by triggers. Dashboard queries hit this table instead of scanning the entire orders table.
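
A nightly job might refresh yesterday's row with an upsert like this (a PostgreSQL-flavored sketch; it assumes the orders table has an order_date column and the total_amount column from earlier):

INSERT INTO daily_sales_summary (summary_date, total_orders, total_revenue, avg_order_value)
SELECT order_date, COUNT(*), SUM(total_amount), AVG(total_amount)
FROM orders
WHERE order_date = CURRENT_DATE - 1
GROUP BY order_date
ON CONFLICT (summary_date) DO UPDATE
SET total_orders    = EXCLUDED.total_orders,
    total_revenue   = EXCLUDED.total_revenue,
    avg_order_value = EXCLUDED.avg_order_value;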

Denormalization creates maintenance burden

Every piece of duplicated data is a consistency risk. When you denormalize, document which columns are derived, what the source of truth is, and how synchronization happens (trigger, application code, batch job). If you cannot answer those questions, do not denormalize.


Index Theory

Indexes are separate data structures that help the database find rows without scanning every page in the table. Without an index, a query against a million-row table reads every row. With the right index, it reads a handful of pages.

B-tree Indexes

The B-tree (balanced tree) is the default index structure in virtually every relational database. It keeps data sorted in a tree structure where:

  • The root node points to intermediate nodes
  • Intermediate nodes point to leaf nodes
  • Leaf nodes contain the indexed values and pointers to the actual table rows
  • All leaf nodes are at the same depth, so lookup time is consistent

B-tree indexes support:

  • Exact match lookups: WHERE email = 'alice@example.com'
  • Range scans: WHERE price BETWEEN 10 AND 50
  • Prefix matching: WHERE name LIKE 'Ali%'
  • Sorting: ORDER BY created_at DESC
  • Min/max: SELECT MIN(price) FROM products

They do not efficiently support:

  • Suffix or infix matching: WHERE name LIKE '%ice'
  • Functions on the indexed column: WHERE YEAR(created_at) = 2025 (unless you create a functional index)

Composite indexes (multi-column) follow the leftmost prefix rule. An index on (country, city, zipcode) can satisfy queries on country, country + city, or country + city + zipcode, but not city alone or zipcode alone.
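
A quick illustration of the leftmost prefix rule (a sketch; table and index names are illustrative):

CREATE INDEX idx_addr ON addresses (country, city, zipcode);

-- Can use idx_addr (leftmost prefix present):
SELECT * FROM addresses WHERE country = 'US';
SELECT * FROM addresses WHERE country = 'US' AND city = 'Boston';

-- Cannot use idx_addr efficiently (leftmost column missing):
SELECT * FROM addresses WHERE city = 'Boston';
SELECT * FROM addresses WHERE zipcode = '02101';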

Hash Indexes

Hash indexes use a hash function to map values to buckets. They are fast for exact equality lookups (= and IN) but cannot help with range queries, sorting, or prefix matching.

In MySQL, the MEMORY storage engine supports explicit hash indexes. InnoDB uses an internal adaptive hash index that it builds automatically on top of B-tree indexes for frequently accessed pages - you do not create it manually.

In PostgreSQL, you can create hash indexes explicitly:

CREATE INDEX idx_sessions_token ON sessions USING HASH (token);

Hash indexes in PostgreSQL were not crash-safe until version 10. If you are on PostgreSQL 10+, they are a reasonable choice for columns that are only ever queried with =.

GIN and GiST Indexes (PostgreSQL)

PostgreSQL offers specialized index types for non-scalar data:

GIN (Generalized Inverted Index) is designed for values that contain multiple elements - arrays, JSONB documents, and full-text search vectors. A GIN index on a JSONB column lets you query for documents containing specific keys or values efficiently:

CREATE INDEX idx_metadata ON events USING GIN (metadata);
-- Now this query uses the index:
SELECT * FROM events WHERE metadata @> '{"type": "click"}';

GiST (Generalized Search Tree) supports geometric data, ranges, and full-text search. It is the index type for PostGIS spatial queries and range type operations:

CREATE INDEX idx_location ON stores USING GIST (location);
-- Nearest-neighbor search using the index (GiST supports <-> in ORDER BY):
SELECT * FROM stores ORDER BY location <-> point '(40.7, -74.0)' LIMIT 5;

| Index Type | Best For | Range Queries | Equality | Multi-Value |
| --- | --- | --- | --- | --- |
| B-tree | Most columns | Yes | Yes | No |
| Hash | Equality-only lookups | No | Yes | No |
| GIN | JSONB, arrays, full-text | No | Yes | Yes |
| GiST | Spatial, ranges, full-text | Yes | Yes | Yes |

Index only what you query

Every index speeds up reads but slows down writes because the database must update the index on every INSERT, UPDATE, and DELETE. Do not create indexes speculatively. Look at your actual query patterns, identify slow queries with EXPLAIN, and add indexes to address specific bottlenecks.


Data Type Selection

Choosing the right data type affects storage size, query performance, and the correctness of your application. The wrong type can waste disk space, prevent useful optimizations, or silently lose precision.

Integers

| Type | Storage | Range (signed) | Use When |
| --- | --- | --- | --- |
| TINYINT / SMALLINT | 1-2 bytes | Up to 127 / 32,767 | Status codes, small enumerations |
| INT | 4 bytes | Up to ~2.1 billion | Most primary keys, counters, foreign keys |
| BIGINT | 8 bytes | Up to ~9.2 quintillion | Tables expected to exceed 2 billion rows, timestamps as epoch milliseconds |

Use INT for primary keys unless you have a concrete reason for BIGINT. A table with a BIGINT primary key and three BIGINT foreign keys uses 16 bytes more per row than the INT equivalent. Over 100 million rows, that is 1.6 GB of extra storage just for those four columns - and the indexes are larger too, which means more memory pressure and slower scans.

Use BIGINT when you know the table will grow past 2 billion rows, or when you are storing values like Unix timestamps in milliseconds or external IDs from systems that use 64-bit identifiers.

Strings

| Type | Behavior | Use When |
| --- | --- | --- |
| CHAR(N) | Fixed-length, padded with spaces | Codes with known length (country codes, currency codes) |
| VARCHAR(N) | Variable-length, stores only what you need | Most string data (names, emails, addresses) |
| TEXT | Variable-length, no practical limit | Large content (descriptions, comments, articles) |

In PostgreSQL, there is no performance difference between VARCHAR(N) and TEXT. The length limit on VARCHAR(N) is purely a constraint - if you want to enforce a maximum length, use it. If you do not need a maximum, TEXT is simpler.

In MySQL's InnoDB, index keys are limited to a prefix of the column value (767 bytes with the older COMPACT row format, 3072 bytes with DYNAMIC), so very long VARCHAR columns cannot be fully indexed. Keep indexed VARCHAR columns reasonably sized.

Numeric Precision

| Type | Behavior | Use When |
| --- | --- | --- |
| DECIMAL(P,S) / NUMERIC(P,S) | Exact arithmetic | Money, financial calculations, anything where rounding errors are unacceptable |
| FLOAT / DOUBLE | Approximate (IEEE 754 floating point) | Scientific measurements, coordinates, values where small rounding errors are acceptable |

Never store money as FLOAT or DOUBLE. The value 0.1 cannot be represented exactly in binary floating point:

-- This might not return your row:
SELECT * FROM accounts WHERE balance = 0.10;

-- Use DECIMAL instead:
CREATE TABLE accounts (
    account_id INT PRIMARY KEY,
    balance DECIMAL(12,2) NOT NULL DEFAULT 0.00
);

Timestamps

| Type | Stores | Use When |
| --- | --- | --- |
| TIMESTAMP | Date and time, usually UTC-aware | Event times, created_at, updated_at |
| DATETIME (MySQL) | Date and time, no timezone conversion | Scheduling (you want the literal date/time stored) |
| TIMESTAMPTZ (PostgreSQL) | Timestamp with timezone awareness | Almost always - PostgreSQL converts to UTC on storage and back to the session timezone on retrieval |
| DATE | Date only | Birthdays, due dates, report dates |

Always store timestamps in UTC

Store all timestamps in UTC and convert to local time in the application layer. Mixing timezones in the database leads to subtle bugs that are extraordinarily painful to debug. In PostgreSQL, use TIMESTAMPTZ. In MySQL, set your server to UTC and use TIMESTAMP.
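
In PostgreSQL, that advice might translate to (a sketch):

CREATE TABLE events (
    event_id    INT PRIMARY KEY,
    occurred_at TIMESTAMPTZ NOT NULL DEFAULT now()  -- stored internally as UTC
);

-- Convert at query time, not at storage time:
SELECT event_id, occurred_at AT TIME ZONE 'America/New_York' AS local_time
FROM events;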

Boolean and Enumeration Types

For columns with a small fixed set of values, you have options:

  • BOOLEAN (or TINYINT(1) in MySQL) for true/false flags
  • ENUM types for small fixed sets - convenient, but they create schema coupling (adding a new value requires ALTER TABLE)
  • A VARCHAR column with a CHECK constraint - more flexible than ENUM and easier to extend
  • A separate lookup table with a foreign key - the most normalized approach, best when the set of values may grow or has its own attributes (the last two options are sketched below)
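
Here is what the last two options might look like side by side (a minimal sketch; the tickets table and status values are illustrative, and the two designs are alternatives, not meant to coexist):

-- Option 1: VARCHAR with a CHECK constraint
CREATE TABLE tickets (
    ticket_id INT PRIMARY KEY,
    status VARCHAR(20) NOT NULL DEFAULT 'open'
        CHECK (status IN ('open', 'in_progress', 'closed'))
);

-- Option 2: a lookup table with a foreign key
CREATE TABLE ticket_statuses (
    status_id INT PRIMARY KEY,
    name      VARCHAR(20) NOT NULL UNIQUE
);

CREATE TABLE tickets (
    ticket_id INT PRIMARY KEY,
    status_id INT NOT NULL REFERENCES ticket_statuses(status_id)
);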

Constraints

Constraints are rules enforced by the database engine that prevent invalid data from entering your tables. They are your last line of defense against application bugs, bad imports, and manual mistakes.

PRIMARY KEY

Every table should have a primary key - a column or combination of columns that uniquely identifies each row. The database enforces uniqueness and creates an index automatically.

CREATE TABLE products (
    product_id INT PRIMARY KEY,
    name VARCHAR(255) NOT NULL
);

-- Composite primary key:
CREATE TABLE order_lines (
    order_id INT,
    product_id INT,
    quantity INT NOT NULL,
    PRIMARY KEY (order_id, product_id)
);

FOREIGN KEY

A foreign key constraint ensures that a value in one table references a valid row in another table. It prevents orphaned records and enforces referential integrity.

The ON DELETE and ON UPDATE clauses control what happens when the referenced row changes:

| Action | Behavior |
| --- | --- |
| CASCADE | Delete/update the child rows automatically |
| SET NULL | Set the foreign key column to NULL |
| SET DEFAULT | Set the foreign key column to its default value |
| RESTRICT | Block the delete/update if child rows exist (checked immediately) |
| NO ACTION | Same as RESTRICT in most databases (checked at end of statement in PostgreSQL) |

CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
        ON DELETE RESTRICT
        ON UPDATE CASCADE
);

This means: if you try to delete a customer who has orders, the database blocks it (RESTRICT). If you change a customer's ID, the change propagates to all their orders (CASCADE).

UNIQUE, CHECK, NOT NULL, and DEFAULT

CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    email VARCHAR(255) NOT NULL UNIQUE,
    name VARCHAR(255) NOT NULL,
    department_id INT REFERENCES departments(department_id),
    salary DECIMAL(10,2) CHECK (salary > 0),
    hire_date DATE NOT NULL DEFAULT CURRENT_DATE,
    status VARCHAR(20) NOT NULL DEFAULT 'active'
        CHECK (status IN ('active', 'inactive', 'terminated'))
);

  • NOT NULL prevents missing values - use it on every column that should always have a value
  • UNIQUE prevents duplicates - use it for natural identifiers like email, username, or SKU
  • CHECK enforces arbitrary conditions - salaries must be positive, status must be one of a known set
  • DEFAULT provides a value when one is not specified during insert

Putting It All Together

Here is a complete schema demonstrating constraints, data types, and relationships. The following ER diagram shows the structure:

erDiagram
    DEPARTMENT ||--o{ EMPLOYEE : employs
    EMPLOYEE ||--o{ PROJECT_ASSIGNMENT : "assigned to"
    PROJECT ||--o{ PROJECT_ASSIGNMENT : "staffed by"
    DEPARTMENT {
        int department_id PK
        varchar name UK
        varchar location
    }
    EMPLOYEE {
        int employee_id PK
        varchar email UK
        varchar name
        int department_id FK
        decimal salary
        date hire_date
    }
    PROJECT {
        int project_id PK
        varchar name
        date start_date
        date end_date
    }
    PROJECT_ASSIGNMENT {
        int employee_id FK
        int project_id FK
        varchar role
        date assigned_date
    }
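
Translated into DDL, the schema might look like this (a sketch; types and sizes are illustrative):

CREATE TABLE departments (
    department_id INT PRIMARY KEY,
    name          VARCHAR(255) NOT NULL UNIQUE,
    location      VARCHAR(255)
);

CREATE TABLE employees (
    employee_id   INT PRIMARY KEY,
    email         VARCHAR(255) NOT NULL UNIQUE,
    name          VARCHAR(255) NOT NULL,
    department_id INT REFERENCES departments(department_id),
    salary        DECIMAL(10,2) CHECK (salary > 0),
    hire_date     DATE NOT NULL DEFAULT CURRENT_DATE
);

CREATE TABLE projects (
    project_id INT PRIMARY KEY,
    name       VARCHAR(255) NOT NULL,
    start_date DATE NOT NULL,
    end_date   DATE,
    CHECK (end_date IS NULL OR end_date >= start_date)
);

CREATE TABLE project_assignments (
    employee_id   INT NOT NULL REFERENCES employees(employee_id),
    project_id    INT NOT NULL REFERENCES projects(project_id),
    role          VARCHAR(100) NOT NULL,
    assigned_date DATE NOT NULL DEFAULT CURRENT_DATE,
    PRIMARY KEY (employee_id, project_id)
);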

Design Decisions in Practice

Real schema design involves trade-offs. Here is a summary of the key decisions and when each option wins:

| Decision | Option A | Option B | Choose A When | Choose B When |
| --- | --- | --- | --- | --- |
| Normalization level | 3NF (fully normalized) | Denormalized with redundancy | Write-heavy, data integrity critical | Read-heavy dashboards, rare updates |
| Primary key type | Natural key (email, SSN) | Surrogate key (auto-increment INT) | Key is truly immutable and unique | Natural keys can change or are large |
| INT vs BIGINT | INT (4 bytes) | BIGINT (8 bytes) | Table stays under 2B rows | Table will exceed 2B rows |
| VARCHAR(N) vs TEXT | VARCHAR(N) | TEXT | You want a length constraint | No meaningful max length |
| DECIMAL vs FLOAT | DECIMAL | FLOAT/DOUBLE | Money, exact arithmetic | Scientific data, acceptable rounding |
| ENUM vs CHECK | ENUM type | VARCHAR + CHECK | MySQL with stable small sets | PostgreSQL, or sets that may grow |
| CASCADE vs RESTRICT | ON DELETE CASCADE | ON DELETE RESTRICT | Child rows meaningless without parent | Child rows should block parent deletion |

Further Reading

  • Database Normalization Basics - Open textbook chapter with detailed worked examples through 3NF
  • Use The Index, Luke - The definitive free resource on SQL indexing and query performance, covering B-tree internals, composite indexes, and execution plans
  • PostgreSQL Index Types - Official documentation on B-tree, Hash, GIN, GiST, SP-GiST, and BRIN index types
  • MySQL Data Types - Official MySQL reference for storage requirements and behavior of each data type
  • SQL Constraints on W3Schools - Quick reference for PRIMARY KEY, FOREIGN KEY, UNIQUE, CHECK, NOT NULL, and DEFAULT
