Database Design & Modeling¶
Good schema design is the foundation that everything else rests on - query performance, data integrity, application complexity, and operational burden. A poorly designed schema creates problems that no amount of indexing or hardware can fix. A well-designed one makes queries obvious, constraints enforceable, and future changes manageable.
This guide covers the full lifecycle of turning requirements into schema: modeling entities and relationships, normalizing to eliminate redundancy, choosing when to denormalize for performance, selecting appropriate data types, enforcing integrity through constraints, and building indexes that make your queries fast.
Entity-Relationship Modeling¶
Before writing any CREATE TABLE statements, you need a clear picture of what your data looks like and how the pieces connect. Entity-Relationship (ER) diagrams are the standard tool for this. They capture the nouns in your system (entities), the attributes those nouns have, and the connections (relationships) between them.
Entities and Attributes¶
An entity is a thing your system tracks - a customer, an order, a product, an employee. Each entity has attributes: a customer has a name and email, an order has a date and total, a product has a price and description.
In a relational database, each entity becomes a table. Each attribute becomes a column.
Relationships and Cardinality¶
Entities connect to each other through relationships, and those relationships have cardinality - rules about how many of one entity can relate to another:
| Cardinality | Meaning | Example |
|---|---|---|
| One-to-one (1:1) | Each row in A matches exactly one row in B | User and user profile |
| One-to-many (1:N) | Each row in A matches zero or more rows in B | Customer and orders |
| Many-to-many (M:N) | Rows in A can match multiple rows in B and vice versa | Students and courses |
Many-to-many relationships require a junction table (also called a join table or associative table) to implement in a relational database. The junction table holds foreign keys pointing to both sides.
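For example, the students-and-courses relationship from the table above might be implemented with a junction table like this (a sketch; table and column names are illustrative):

```sql
-- Junction table implementing the M:N relationship between students and courses.
-- The composite primary key guarantees a student enrolls in a course at most once.
CREATE TABLE enrollments (
    student_id  INT NOT NULL REFERENCES students(student_id),
    course_id   INT NOT NULL REFERENCES courses(course_id),
    enrolled_at DATE NOT NULL DEFAULT CURRENT_DATE,
    PRIMARY KEY (student_id, course_id)
);
```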
Reading an ER Diagram¶
Here is an ER diagram for a simple e-commerce system. Each line represents a relationship, and the symbols at each end indicate cardinality:
```mermaid
erDiagram
    CUSTOMER ||--o{ ORDER : places
    ORDER ||--|{ ORDER_LINE : contains
    PRODUCT ||--o{ ORDER_LINE : "appears in"
    PRODUCT }o--|| CATEGORY : "belongs to"

    CUSTOMER {
        int customer_id PK
        varchar name
        varchar email
        timestamp created_at
    }
    ORDER {
        int order_id PK
        int customer_id FK
        date order_date
        decimal total
        varchar status
    }
    ORDER_LINE {
        int order_id FK
        int product_id FK
        int quantity
        decimal unit_price
    }
    PRODUCT {
        int product_id PK
        int category_id FK
        varchar name
        decimal price
        text description
    }
    CATEGORY {
        int category_id PK
        varchar name
    }
```
The notation works like this:
- `||` means "exactly one"
- `o{` means "zero or more"
- `|{` means "one or more"
So `CUSTOMER ||--o{ ORDER` reads as: each customer places zero or more orders, and each order belongs to exactly one customer.
Start with the diagram
Sketch your ER diagram before writing SQL. It forces you to think about relationships and cardinality up front, and catches design mistakes that are painful to fix after tables are populated with data.
Normalization¶
Normalization is the process of organizing columns and tables to reduce data redundancy and prevent update anomalies. Each normal form builds on the previous one, imposing stricter rules about how data is structured.
The goal is not normalization for its own sake. The goal is a schema where each fact is stored exactly once, so updates cannot create inconsistencies.
First Normal Form (1NF)¶
A table is in first normal form if:
- Every column contains atomic (indivisible) values - no lists, sets, or nested structures
- Each row is unique (the table has a primary key)
Problem - a table violating 1NF:
| order_id | customer | products |
|---|---|---|
| 1 | Alice | Widget, Gadget, Sprocket |
| 2 | Bob | Gadget |
The products column contains a comma-separated list. You cannot query for all orders containing "Widget" without string parsing. You cannot count products per order without splitting strings. You cannot enforce that each product name is valid.
Fix - split into atomic values with a proper relationship:
| order_id | customer |
|---|---|
| 1 | Alice |
| 2 | Bob |
| order_id | product |
|---|---|
| 1 | Widget |
| 1 | Gadget |
| 1 | Sprocket |
| 2 | Gadget |
Now each cell holds one value, each row is unique, and you can query, index, and constrain the data properly.
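In SQL, the fixed design might be sketched like this (illustrative names; the `order_products` table replaces the comma-separated column):

```sql
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer VARCHAR(255) NOT NULL
);

-- One row per product on an order: each cell holds a single atomic value.
CREATE TABLE order_products (
    order_id INT NOT NULL REFERENCES orders(order_id),
    product  VARCHAR(255) NOT NULL,
    PRIMARY KEY (order_id, product)
);

-- "All orders containing Widget" is now a plain indexable lookup,
-- with no string parsing:
SELECT order_id FROM order_products WHERE product = 'Widget';
```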
Second Normal Form (2NF)¶
A table is in second normal form if:
- It is in 1NF
- Every non-key column depends on the entire primary key, not just part of it
2NF only matters for tables with composite primary keys. If your primary key is a single column, 1NF compliance automatically gives you 2NF.
Problem - a table violating 2NF:
| order_id | product_id | product_name | quantity |
|---|---|---|---|
| 1 | 10 | Widget | 3 |
| 1 | 20 | Gadget | 1 |
| 2 | 10 | Widget | 5 |
The primary key is (order_id, product_id). But product_name depends only on product_id, not on the full composite key. If you rename "Widget" to "Super Widget", you must update every row where product_id = 10. Miss one and your data is inconsistent.
Fix - move partial dependencies to their own table:
order_lines:
| order_id | product_id | quantity |
|---|---|---|
| 1 | 10 | 3 |
| 1 | 20 | 1 |
| 2 | 10 | 5 |
products:
| product_id | product_name |
|---|---|
| 10 | Widget |
| 20 | Gadget |
Now product_name is stored once and referenced by product_id.
Third Normal Form (3NF)¶
A table is in third normal form if:
- It is in 2NF
- No non-key column depends on another non-key column (no transitive dependencies)
Problem - a table violating 3NF:
| employee_id | name | department_id | department_name | department_location |
|---|---|---|---|---|
| 1 | Alice | 10 | Engineering | Building A |
| 2 | Bob | 10 | Engineering | Building A |
| 3 | Carol | 20 | Marketing | Building B |
Here, department_name and department_location depend on department_id, not on employee_id. If Engineering moves to Building C, you update multiple rows. Miss one and you have Engineering in two buildings.
Fix - extract the transitive dependency:
employees:
| employee_id | name | department_id |
|---|---|---|
| 1 | Alice | 10 |
| 2 | Bob | 10 |
| 3 | Carol | 20 |
departments:
| department_id | department_name | department_location |
|---|---|---|
| 10 | Engineering | Building A |
| 20 | Marketing | Building B |
The department fact is stored once. Updates happen in one place.
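As DDL, the decomposition could look like this (a sketch with illustrative names):

```sql
CREATE TABLE departments (
    department_id       INT PRIMARY KEY,
    department_name     VARCHAR(255) NOT NULL,
    department_location VARCHAR(255) NOT NULL
);

CREATE TABLE employees (
    employee_id   INT PRIMARY KEY,
    name          VARCHAR(255) NOT NULL,
    department_id INT NOT NULL REFERENCES departments(department_id)
);

-- Moving Engineering to Building C now touches exactly one row:
UPDATE departments
SET department_location = 'Building C'
WHERE department_id = 10;
```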
Boyce-Codd Normal Form (BCNF)¶
Boyce-Codd Normal Form is a stricter version of 3NF. A table is in BCNF if every determinant (a column or set of columns that other columns depend on) is a candidate key.
The difference from 3NF is subtle and only surfaces in tables with multiple overlapping candidate keys. Here is a classic example:
Problem - a table in 3NF but not BCNF:
Consider a university where each subject is taught by one professor, but a professor can teach only one subject. Students can enroll in multiple subjects.
| student_id | subject | professor |
|---|---|---|
| 1 | Math | Dr. Smith |
| 1 | Physics | Dr. Jones |
| 2 | Math | Dr. Smith |
The candidate keys are (student_id, subject) and (student_id, professor). But subject determines professor and professor determines subject - and neither is a candidate key on its own. If Dr. Smith switches from Math to Chemistry, you update multiple rows.
Fix - decompose so every determinant is a candidate key:
professor_subjects:
| professor | subject |
|---|---|
| Dr. Smith | Math |
| Dr. Jones | Physics |
student_enrollments:
| student_id | professor |
|---|---|
| 1 | Dr. Smith |
| 1 | Dr. Jones |
| 2 | Dr. Smith |
Now the professor -> subject dependency lives in a table where professor is the key.
Normalization is a tool, not a religion
In practice, most production schemas aim for 3NF. BCNF violations are rare and usually harmless. Going beyond 3NF (there are 4NF, 5NF, and even 6NF) brings diminishing returns and increasing complexity. Know the rules so you can make informed decisions about when to follow them and when to break them.
Denormalization¶
Fully normalized schemas store each fact once, which is excellent for writes and data integrity. But reads can suffer because answering a single question may require joining five or six tables. Denormalization is the deliberate introduction of redundancy to speed up reads.
When to Denormalize¶
Denormalization makes sense when:
- A query is run frequently and the joins are measurably slow
- The data changes rarely but is read constantly (high read-to-write ratio)
- Reporting queries need pre-aggregated data to avoid scanning millions of rows
- You have measured the problem with `EXPLAIN` and confirmed that joins are the bottleneck
Denormalization does not make sense when:
- You are guessing about performance before measuring
- Write consistency is critical and you cannot tolerate stale duplicated data
- The application is write-heavy
Common Denormalization Strategies¶
Pre-computed columns. Store a calculated value directly on the row instead of computing it at query time:
```sql
-- Instead of SUM(order_lines.quantity * order_lines.unit_price) every time:
ALTER TABLE orders ADD COLUMN total_amount DECIMAL(10,2);
```
You update total_amount whenever an order line changes. The trade-off is that your application code (or a trigger) must keep it in sync.
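One way to keep the derived column in sync is to recompute it in the same transaction that modifies the order lines (a sketch; a trigger could do the same job, and the order ID here is illustrative):

```sql
BEGIN;

INSERT INTO order_lines (order_id, product_id, quantity, unit_price)
VALUES (42, 10, 3, 9.99);

-- Recompute the denormalized total from the source-of-truth rows:
UPDATE orders
SET total_amount = (
    SELECT COALESCE(SUM(quantity * unit_price), 0)
    FROM order_lines
    WHERE order_id = 42
)
WHERE order_id = 42;

COMMIT;
```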
Duplicated columns. Copy a frequently accessed column from a related table:
```sql
-- Instead of joining to customers for every order display:
ALTER TABLE orders ADD COLUMN customer_name VARCHAR(255);
```
If the customer's name changes, every order must be updated. This is acceptable if name changes are rare and order listing queries are constant.
Summary tables. Maintain a separate table with pre-aggregated data:
```sql
CREATE TABLE daily_sales_summary (
    summary_date DATE PRIMARY KEY,
    total_orders INT NOT NULL,
    total_revenue DECIMAL(12,2) NOT NULL,
    avg_order_value DECIMAL(10,2) NOT NULL
);
```
Populated by a nightly batch job or maintained by triggers. Dashboard queries hit this table instead of scanning the entire orders table.
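A nightly refresh job might look like this in PostgreSQL (a sketch; the upsert makes reruns safe):

```sql
-- Aggregate yesterday's orders into the summary table.
INSERT INTO daily_sales_summary (summary_date, total_orders, total_revenue, avg_order_value)
SELECT
    order_date,
    COUNT(*),
    SUM(total),
    AVG(total)
FROM orders
WHERE order_date = CURRENT_DATE - 1
GROUP BY order_date
ON CONFLICT (summary_date) DO UPDATE
SET total_orders    = EXCLUDED.total_orders,
    total_revenue   = EXCLUDED.total_revenue,
    avg_order_value = EXCLUDED.avg_order_value;
```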
Denormalization creates maintenance burden
Every piece of duplicated data is a consistency risk. When you denormalize, document which columns are derived, what the source of truth is, and how synchronization happens (trigger, application code, batch job). If you cannot answer those questions, do not denormalize.
Index Theory¶
Indexes are separate data structures that help the database find rows without scanning every page in the table. Without an index, a query against a million-row table reads every row. With the right index, it reads a handful of pages.
B-tree Indexes¶
The B-tree (balanced tree) is the default index structure in virtually every relational database. It keeps data sorted in a tree structure where:
- The root node points to intermediate nodes
- Intermediate nodes point to leaf nodes
- Leaf nodes contain the indexed values and pointers to the actual table rows
- All leaf nodes are at the same depth, so lookup time is consistent
B-tree indexes support:
- Exact match lookups: `WHERE email = 'alice@example.com'`
- Range scans: `WHERE price BETWEEN 10 AND 50`
- Prefix matching: `WHERE name LIKE 'Ali%'`
- Sorting: `ORDER BY created_at DESC`
- Min/max: `SELECT MIN(price) FROM products`

They do not efficiently support:

- Suffix or infix matching: `WHERE name LIKE '%ice'`
- Functions on the indexed column: `WHERE YEAR(created_at) = 2025` (unless you create a functional index)
Composite indexes (multi-column) follow the leftmost prefix rule. An index on (country, city, zipcode) can satisfy queries on country, country + city, or country + city + zipcode, but not city alone or zipcode alone.
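A sketch of the leftmost prefix rule in action (the `addresses` table is illustrative):

```sql
CREATE INDEX idx_addr_location ON addresses (country, city, zipcode);

-- Can use the index (queries start from the leftmost column):
SELECT * FROM addresses WHERE country = 'US';
SELECT * FROM addresses WHERE country = 'US' AND city = 'Austin';

-- Cannot seek on the index (city alone is not a leftmost prefix):
SELECT * FROM addresses WHERE city = 'Austin';
```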
Hash Indexes¶
Hash indexes use a hash function to map values to buckets. They are fast for exact equality lookups (= and IN) but cannot help with range queries, sorting, or prefix matching.
In MySQL, the MEMORY storage engine supports explicit hash indexes. InnoDB uses an internal adaptive hash index that it builds automatically on top of B-tree indexes for frequently accessed pages - you do not create it manually.
In PostgreSQL, you can create hash indexes explicitly:
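A minimal example (the `sessions` table and `token` column are illustrative):

```sql
-- Hash index: useful only for equality lookups on this column.
CREATE INDEX idx_sessions_token ON sessions USING HASH (token);

-- This query can use the hash index:
SELECT * FROM sessions WHERE token = 'abc123';
```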
Hash indexes in PostgreSQL were not crash-safe until version 10. If you are on PostgreSQL 10+, they are a reasonable choice for columns that are only ever queried with =.
GIN and GiST Indexes (PostgreSQL)¶
PostgreSQL offers specialized index types for non-scalar data:
GIN (Generalized Inverted Index) is designed for values that contain multiple elements - arrays, JSONB documents, and full-text search vectors. A GIN index on a JSONB column lets you query for documents containing specific keys or values efficiently:
```sql
CREATE INDEX idx_metadata ON events USING GIN (metadata);

-- Now this query uses the index:
SELECT * FROM events WHERE metadata @> '{"type": "click"}';
```
GiST (Generalized Search Tree) supports geometric data, ranges, and full-text search. It is the index type for PostGIS spatial queries and range type operations:
```sql
CREATE INDEX idx_location ON stores USING GIST (location);

-- Nearest-neighbor search: GiST supports ordering by the distance
-- operator, so this finds the closest stores using the index:
SELECT * FROM stores ORDER BY location <-> point '(40.7, -74.0)' LIMIT 5;
```
| Index Type | Best For | Range Queries | Equality | Multi-Value |
|---|---|---|---|---|
| B-tree | Most columns | Yes | Yes | No |
| Hash | Equality-only lookups | No | Yes | No |
| GIN | JSONB, arrays, full-text | No | Yes | Yes |
| GiST | Spatial, ranges, full-text | Yes | Yes | Yes |
Index only what you query
Every index speeds up reads but slows down writes because the database must update the index on every INSERT, UPDATE, and DELETE. Do not create indexes speculatively. Look at your actual query patterns, identify slow queries with EXPLAIN, and add indexes to address specific bottlenecks.
Data Type Selection¶
Choosing the right data type affects storage size, query performance, and the correctness of your application. The wrong type can waste disk space, prevent useful optimizations, or silently lose precision.
Integers¶
| Type | Storage | Range | Use When |
|---|---|---|---|
| `TINYINT` / `SMALLINT` | 1-2 bytes | Up to 32,767 (signed) | Status codes, small enumerations |
| `INT` | 4 bytes | Up to ~2.1 billion | Most primary keys, counters, foreign keys |
| `BIGINT` | 8 bytes | Up to ~9.2 quintillion | Tables expected to exceed 2 billion rows, timestamps as epoch milliseconds |
Use INT for primary keys unless you have a concrete reason for BIGINT. A table with a BIGINT primary key and three BIGINT foreign keys uses 16 bytes more per row than the INT equivalent. Over 100 million rows, that is 1.6 GB of extra storage just for those four columns - and the indexes are larger too, which means more memory pressure and slower scans.
Use BIGINT when you know the table will grow past 2 billion rows, or when you are storing values like Unix timestamps in milliseconds or external IDs from systems that use 64-bit identifiers.
Strings¶
| Type | Behavior | Use When |
|---|---|---|
| `CHAR(N)` | Fixed-length, padded with spaces | Codes with known length (country codes, currency codes) |
| `VARCHAR(N)` | Variable-length, stores only what you need | Most string data (names, emails, addresses) |
| `TEXT` | Variable-length, no practical limit | Large content (descriptions, comments, articles) |
In PostgreSQL, there is no performance difference between VARCHAR(N) and TEXT. The length limit on VARCHAR(N) is purely a constraint - if you want to enforce a maximum length, use it. If you do not need a maximum, TEXT is simpler.
In MySQL, InnoDB limits index key size (767 bytes with the older COMPACT row format, 3072 bytes with DYNAMIC), so long VARCHAR columns cannot be fully included in B-tree indexes. Keep indexed VARCHAR columns reasonably sized, or index only a prefix.
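When a column is too long to index in full, MySQL lets you index just a leading prefix of it (table name and prefix length here are illustrative):

```sql
-- Index only the first 100 characters of the title column:
CREATE INDEX idx_articles_title ON articles (title(100));
```

Prefix indexes save space and keep writes fast, at the cost of the index no longer covering queries that need the full column value.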
Numeric Precision¶
| Type | Behavior | Use When |
|---|---|---|
| `DECIMAL(P,S)` / `NUMERIC(P,S)` | Exact arithmetic | Money, financial calculations, anything where rounding errors are unacceptable |
| `FLOAT` / `DOUBLE` | Approximate (IEEE 754 floating point) | Scientific measurements, coordinates, values where small rounding errors are acceptable |
Never store money as FLOAT or DOUBLE. The value 0.1 cannot be represented exactly in binary floating point:
```sql
-- This might not return your row:
SELECT * FROM accounts WHERE balance = 0.10;

-- Use DECIMAL instead:
CREATE TABLE accounts (
    account_id INT PRIMARY KEY,
    balance DECIMAL(12,2) NOT NULL DEFAULT 0.00
);
```
Timestamps¶
| Type | Stores | Use When |
|---|---|---|
| `TIMESTAMP` | Date and time, usually UTC-aware | Event times, created_at, updated_at |
| `DATETIME` (MySQL) | Date and time, no timezone conversion | Scheduling (you want the literal date/time stored) |
| `TIMESTAMPTZ` (PostgreSQL) | Timestamp with timezone awareness | Almost always - PostgreSQL converts to UTC on storage and back to the session timezone on retrieval |
| `DATE` | Date only | Birthdays, due dates, report dates |
Always store timestamps in UTC
Store all timestamps in UTC and convert to local time in the application layer. Mixing timezones in the database leads to subtle bugs that are extraordinarily painful to debug. In PostgreSQL, use TIMESTAMPTZ. In MySQL, set your server to UTC and use TIMESTAMP.
Boolean and Enumeration Types¶
For columns with a small fixed set of values, you have options:
- `BOOLEAN` (or `TINYINT(1)` in MySQL) for true/false flags
- `ENUM` types for small fixed sets - convenient but creates schema coupling (adding a new value requires `ALTER TABLE`)
- A `VARCHAR` column with a `CHECK` constraint - more flexible than `ENUM` and easier to extend
- A separate lookup table with a foreign key - the most normalized approach, best when the set of values may grow or has its own attributes
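The last two options might be sketched like this (table and value names are illustrative):

```sql
-- Option A: VARCHAR with a CHECK constraint - extending the set only
-- means changing the constraint, not the column type:
CREATE TABLE tickets_checked (
    ticket_id INT PRIMARY KEY,
    status VARCHAR(20) NOT NULL DEFAULT 'open'
        CHECK (status IN ('open', 'in_progress', 'closed'))
);

-- Option B: a lookup table - values can carry their own attributes,
-- and new values are added with a plain INSERT:
CREATE TABLE ticket_statuses (
    status_code VARCHAR(20) PRIMARY KEY,
    description VARCHAR(255) NOT NULL
);

CREATE TABLE tickets_fk (
    ticket_id INT PRIMARY KEY,
    status_code VARCHAR(20) NOT NULL REFERENCES ticket_statuses(status_code)
);
```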
Constraints¶
Constraints are rules enforced by the database engine that prevent invalid data from entering your tables. They are your last line of defense against application bugs, bad imports, and manual mistakes.
PRIMARY KEY¶
Every table should have a primary key - a column or combination of columns that uniquely identifies each row. The database enforces uniqueness and creates an index automatically.
```sql
CREATE TABLE products (
    product_id INT PRIMARY KEY,
    name VARCHAR(255) NOT NULL
);

-- Composite primary key:
CREATE TABLE order_lines (
    order_id INT,
    product_id INT,
    quantity INT NOT NULL,
    PRIMARY KEY (order_id, product_id)
);
```
FOREIGN KEY¶
A foreign key constraint ensures that a value in one table references a valid row in another table. It prevents orphaned records and enforces referential integrity.
The ON DELETE and ON UPDATE clauses control what happens when the referenced row changes:
| Action | Behavior |
|---|---|
| `CASCADE` | Delete/update the child rows automatically |
| `SET NULL` | Set the foreign key column to NULL |
| `SET DEFAULT` | Set the foreign key column to its default value |
| `RESTRICT` | Block the delete/update if child rows exist (checked immediately) |
| `NO ACTION` | Same as RESTRICT in most databases (checked at end of statement in PostgreSQL) |
```sql
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
        ON DELETE RESTRICT
        ON UPDATE CASCADE
);
```
This means: if you try to delete a customer who has orders, the database blocks it (RESTRICT). If you change a customer's ID, the change propagates to all their orders (CASCADE).
UNIQUE, CHECK, NOT NULL, and DEFAULT¶
```sql
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    email VARCHAR(255) NOT NULL UNIQUE,
    name VARCHAR(255) NOT NULL,
    department_id INT REFERENCES departments(department_id),
    salary DECIMAL(10,2) CHECK (salary > 0),
    hire_date DATE NOT NULL DEFAULT CURRENT_DATE,
    status VARCHAR(20) NOT NULL DEFAULT 'active'
        CHECK (status IN ('active', 'inactive', 'terminated'))
);
```
- `NOT NULL` prevents missing values - use it on every column that should always have a value
- `UNIQUE` prevents duplicates - use it for natural identifiers like email, username, or SKU
- `CHECK` enforces arbitrary conditions - salaries must be positive, status must be one of a known set
- `DEFAULT` provides a value when one is not specified during insert
Putting It All Together¶
Here is a complete schema demonstrating constraints, data types, and relationships. The following ER diagram shows the structure:
```mermaid
erDiagram
    DEPARTMENT ||--o{ EMPLOYEE : employs
    EMPLOYEE ||--o{ PROJECT_ASSIGNMENT : "assigned to"
    PROJECT ||--o{ PROJECT_ASSIGNMENT : "staffed by"

    DEPARTMENT {
        int department_id PK
        varchar name UK
        varchar location
    }
    EMPLOYEE {
        int employee_id PK
        varchar email UK
        varchar name
        int department_id FK
        decimal salary
        date hire_date
    }
    PROJECT {
        int project_id PK
        varchar name
        date start_date
        date end_date
    }
    PROJECT_ASSIGNMENT {
        int employee_id FK
        int project_id FK
        varchar role
        date assigned_date
    }
```
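One way to translate that diagram into DDL (a sketch in PostgreSQL-flavored SQL; the exact types, defaults, and referential actions are design choices, not requirements):

```sql
CREATE TABLE departments (
    department_id INT PRIMARY KEY,
    name     VARCHAR(255) NOT NULL UNIQUE,
    location VARCHAR(255)
);

CREATE TABLE employees (
    employee_id   INT PRIMARY KEY,
    email         VARCHAR(255) NOT NULL UNIQUE,
    name          VARCHAR(255) NOT NULL,
    department_id INT REFERENCES departments(department_id) ON DELETE SET NULL,
    salary        DECIMAL(10,2) CHECK (salary > 0),
    hire_date     DATE NOT NULL DEFAULT CURRENT_DATE
);

CREATE TABLE projects (
    project_id INT PRIMARY KEY,
    name       VARCHAR(255) NOT NULL,
    start_date DATE NOT NULL,
    end_date   DATE,
    CHECK (end_date IS NULL OR end_date >= start_date)
);

-- Junction table for the M:N employee/project relationship.
-- Assignments are meaningless without their employee or project,
-- so CASCADE is a reasonable choice here:
CREATE TABLE project_assignments (
    employee_id   INT NOT NULL REFERENCES employees(employee_id) ON DELETE CASCADE,
    project_id    INT NOT NULL REFERENCES projects(project_id) ON DELETE CASCADE,
    role          VARCHAR(100) NOT NULL,
    assigned_date DATE NOT NULL DEFAULT CURRENT_DATE,
    PRIMARY KEY (employee_id, project_id)
);
```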
Design Decisions in Practice¶
Real schema design involves trade-offs. Here is a summary of the key decisions and when each option wins:
| Decision | Option A | Option B | Choose A When | Choose B When |
|---|---|---|---|---|
| Normalization level | 3NF (fully normalized) | Denormalized with redundancy | Write-heavy, data integrity critical | Read-heavy dashboards, rare updates |
| Primary key type | Natural key (email, SSN) | Surrogate key (auto-increment INT) | Key is truly immutable and unique | Natural keys can change or are large |
| `INT` vs `BIGINT` | `INT` (4 bytes) | `BIGINT` (8 bytes) | Table stays under 2B rows | Table will exceed 2B rows |
| `VARCHAR(N)` vs `TEXT` | `VARCHAR(N)` | `TEXT` | You want a length constraint | No meaningful max length |
| `DECIMAL` vs `FLOAT` | `DECIMAL` | `FLOAT`/`DOUBLE` | Money, exact arithmetic | Scientific data, acceptable rounding |
| `ENUM` vs `CHECK` | `ENUM` type | `VARCHAR` + `CHECK` | MySQL with stable small sets | PostgreSQL, or sets that may grow |
| `CASCADE` vs `RESTRICT` | `ON DELETE CASCADE` | `ON DELETE RESTRICT` | Child rows meaningless without parent | Child rows should block parent deletion |
Further Reading¶
- Database Normalization Basics - Open textbook chapter with detailed worked examples through 3NF
- Use The Index, Luke - The definitive free resource on SQL indexing and query performance, covering B-tree internals, composite indexes, and execution plans
- PostgreSQL Index Types - Official documentation on B-tree, Hash, GIN, GiST, SP-GiST, and BRIN index types
- MySQL Data Types - Official MySQL reference for storage requirements and behavior of each data type
- SQL Constraints on W3Schools - Quick reference for PRIMARY KEY, FOREIGN KEY, UNIQUE, CHECK, NOT NULL, and DEFAULT
Previous: SQL Essentials | Next: MySQL Installation & Configuration | Back to Index