
PostgreSQL Fundamentals

PostgreSQL began in 1986 as the POSTGRES project at UC Berkeley, led by Michael Stonebraker. The original goal was to push beyond the limitations of existing relational databases by adding support for complex data types, user-defined functions, and extensible type systems. In the mid-1990s the project replaced its original QUEL-based query language with SQL (briefly under the name Postgres95), and in 1996 it was renamed PostgreSQL to reflect that change. The project has been community-driven ever since, with no single corporate owner.

Three design principles distinguish PostgreSQL from other relational databases:

  • Object-relational design - tables can inherit from other tables, custom types can be defined with operators and functions, and the type system is extensible at runtime
  • Standards compliance - PostgreSQL implements more of the SQL standard than any other open-source database, including window functions, CTEs, lateral joins, and MERGE (added in version 15)
  • Extensibility - you can add custom data types, operators, index methods, procedural languages, and foreign data wrappers without modifying core source code

PostgreSQL uses a multi-version concurrency control (MVCC) model that allows readers and writers to operate without blocking each other. Every transaction sees a consistent snapshot of the database, and old row versions are cleaned up later by the VACUUM process. This architecture trades some write overhead and storage for excellent concurrent read performance.
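
One way to see MVCC bookkeeping first-hand is through the hidden xmin and xmax system columns that every row carries (a quick sketch using a throwaway table):

```sql
-- xmin = ID of the transaction that inserted this row version;
-- xmax = ID of the transaction that deleted or locked it (0 while live).
CREATE TABLE mvcc_demo (id int, val text);
INSERT INTO mvcc_demo VALUES (1, 'original');
SELECT xmin, xmax, id, val FROM mvcc_demo;

UPDATE mvcc_demo SET val = 'updated' WHERE id = 1;
-- The UPDATE wrote a brand-new row version with a new xmin; the old
-- version is now dead and will be reclaimed by a later VACUUM.
SELECT xmin, xmax, id, val FROM mvcc_demo;
```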

[Diagram: PostgreSQL MVCC snapshot isolation, showing how concurrent transactions see different row versions based on transaction IDs]

Installation

Package Managers

On Debian and Ubuntu, the official PostgreSQL repository provides the latest versions. Add the repository and install:

# Add the PostgreSQL APT repository
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
curl -fsSL https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/postgresql.gpg
sudo apt update
sudo apt install postgresql-16

On RHEL, Fedora, and Rocky Linux:

sudo dnf install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-9-x86_64/pgdg-redhat-repo-latest.noarch.rpm
sudo dnf install -y postgresql16-server postgresql16

On macOS with Homebrew:

brew install postgresql@16
brew services start postgresql@16

Cluster Initialization with initdb

On Linux, after installing the packages, you need to initialize a database cluster - the directory structure where PostgreSQL stores all its data. The initdb command creates this structure:

# RHEL/Fedora (the Debian package runs initdb automatically)
sudo /usr/pgsql-16/bin/postgresql-16-setup initdb

Under the hood, initdb creates:

  • The base/ directory for database files
  • The global/ directory for cluster-wide tables
  • Default configuration files (postgresql.conf, pg_hba.conf, pg_ident.conf)
  • The template0 and template1 template databases used as blueprints when you create new databases
  • The initial postgres database and postgres superuser role
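
As a consequence, every new database is a clone of a template. A sketch of how the templates are typically used (database names here are illustrative):

```sql
-- CREATE DATABASE copies template1 by default, including any objects
-- you have added to it.
CREATE DATABASE app_db;

-- Clone the pristine template0 when you need an encoding or locale
-- different from template1's.
CREATE DATABASE utf8_db
  TEMPLATE template0
  ENCODING 'UTF8'
  LC_COLLATE 'C'
  LC_CTYPE 'C';
```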

Data directory location

The default data directory varies by platform: /var/lib/pgsql/16/data on RHEL, /var/lib/postgresql/16/main on Debian/Ubuntu. Homebrew uses /opt/homebrew/var/postgresql@16 on Apple Silicon. You can always find it with SHOW data_directory; from within psql.

Starting and Stopping

Use systemctl on Linux systems with systemd:

sudo systemctl start postgresql-16
sudo systemctl enable postgresql-16
sudo systemctl status postgresql-16

For manual control, pg_ctl operates directly on the data directory:

pg_ctl -D /var/lib/pgsql/16/data start
pg_ctl -D /var/lib/pgsql/16/data stop -m fast
pg_ctl -D /var/lib/pgsql/16/data restart
pg_ctl -D /var/lib/pgsql/16/data status

The -m fast flag (the default since PostgreSQL 9.5) sends SIGTERM to active connections, rolls back open transactions, and shuts down without waiting for clients to disconnect. The alternative -m smart waits for all connections to close on their own, and -m immediate aborts all server processes without a clean shutdown, so the next startup performs crash recovery.



The psql CLI

psql is the interactive terminal for PostgreSQL. It supports SQL execution, meta-commands (backslash commands that query system catalogs), tab completion, command history, and scripting.

Connecting

# Connect as the postgres superuser to the postgres database
sudo -u postgres psql

# Connect to a specific database as a specific user
psql -h localhost -p 5432 -U myuser -d mydb

# Connection string format
psql "host=db.example.com port=5432 dbname=mydb user=myuser sslmode=require"

# Switch databases within psql
\c mydb

The \c (or \connect) meta-command switches to a different database without leaving the psql session. You can also change the user at the same time with \c dbname username.

Essential Meta-Commands

Meta-commands start with a backslash and are processed by psql itself, not sent to the server. These are the ones you will use daily:

Command       Description
\l            List all databases
\dt           List tables in the current schema
\dt+          List tables with size and description
\d tablename  Describe a table (columns, types, indexes, constraints)
\di           List indexes
\du           List roles (users and groups)
\df           List functions
\dn           List schemas
\x            Toggle expanded display (vertical output for wide rows)
\timing       Toggle query execution time display
\e            Open the last query in your $EDITOR
\i filename   Execute commands from a file
\o filename   Send query output to a file
\!            Execute a shell command

Pattern filtering

Most meta-commands accept a pattern argument. \dt public.* lists tables in the public schema only. \df *json* lists functions with "json" in the name. Patterns use shell-glob style wildcards: * matches any sequence of characters and ? matches any single character (this is not SQL LIKE syntax).

COPY vs \copy

COPY is a server-side SQL command that reads and writes files on the server's filesystem, running as the postgres OS user:

COPY customers TO '/tmp/customers.csv' WITH (FORMAT csv, HEADER);
COPY customers FROM '/tmp/import.csv' WITH (FORMAT csv, HEADER);

\copy is a psql meta-command that transfers data between the server and the local client machine:

\copy customers TO 'customers.csv' WITH (FORMAT csv, HEADER)
\copy customers FROM 'import.csv' WITH (FORMAT csv, HEADER)

Use \copy when you are connecting to a remote server and need files on your local machine, or when the postgres user does not have filesystem access to the target path. Use COPY for server-local bulk operations - it is faster because data does not travel over the client connection.

Customizing with .psqlrc

Create a ~/.psqlrc file to customize your psql sessions:

-- Quiet mode during startup
\set QUIET 1

-- Better NULL display
\pset null '(null)'

-- Show query timing
\timing

-- Expanded auto mode (vertical display when rows are wide)
\x auto

-- History per database
\set HISTFILE ~/.psql_history-:DBNAME

-- Verbose error messages
\set VERBOSITY verbose

-- Prompt: user@host:port/dbname
\set PROMPT1 '%n@%M:%>/%/ %# '

-- Restore normal output
\set QUIET 0



Configuring postgresql.conf

The main server configuration file is postgresql.conf, located in the data directory. Changes to most parameters require a server reload (SELECT pg_reload_conf(); or pg_ctl reload), though some require a full restart.

Connection Settings

listen_addresses = 'localhost'    # Which interfaces to listen on ('*' for all)
port = 5432                       # TCP port
max_connections = 100             # Maximum concurrent connections

listen_addresses controls which network interfaces PostgreSQL binds to. The default localhost only accepts local connections. Set it to '*' to listen on all interfaces, but always pair this with proper pg_hba.conf rules.

max_connections has a direct impact on memory usage because each connection consumes work_mem and other per-session resources. Connection poolers like PgBouncer allow hundreds of application connections to share a smaller pool of database connections.
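
To gauge how close you are to the limit, you can count client backends against the configured maximum (a simple sketch):

```sql
-- Compare current client connections with max_connections.
SELECT count(*) AS in_use,
       current_setting('max_connections')::int AS max_allowed
FROM pg_stat_activity
WHERE backend_type = 'client backend';
```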

Memory Settings

shared_buffers = '256MB'          # Shared memory for caching data pages
work_mem = '4MB'                  # Per-operation memory for sorts and hashes
maintenance_work_mem = '64MB'     # Memory for VACUUM, CREATE INDEX, etc.
effective_cache_size = '1GB'      # Planner's estimate of available OS cache

shared_buffers is the most important memory parameter. It controls how much memory PostgreSQL allocates for caching table and index data in shared memory. Start with 25% of total RAM and adjust based on your workload. Setting it too high wastes memory that the OS file cache could use more effectively.

work_mem is allocated per sort or hash operation within a query - a complex query with multiple sorts can allocate work_mem multiple times. Keep this conservative (4-16 MB) for OLTP workloads with many connections. Increase it for analytical queries that need large in-memory sorts.

maintenance_work_mem affects operations like VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY. Setting this higher (256 MB - 1 GB) speeds up these maintenance operations. It only matters during maintenance, so higher values are safe.

effective_cache_size does not allocate any memory. It tells the query planner how much memory is available for disk caching (shared_buffers plus OS file cache). Set it to approximately 75% of total RAM on a dedicated database server. A higher value makes the planner more likely to choose index scans over sequential scans.
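
These parameters can also be changed with ALTER SYSTEM instead of editing postgresql.conf by hand; it writes overrides to postgresql.auto.conf. The values below are illustrative, sized for a hypothetical 8 GB dedicated server, not recommendations:

```sql
ALTER SYSTEM SET shared_buffers = '2GB';          -- needs a full restart
ALTER SYSTEM SET work_mem = '8MB';                -- reload is enough
ALTER SYSTEM SET effective_cache_size = '6GB';    -- reload is enough
SELECT pg_reload_conf();

-- The context column tells you whether a parameter needs a restart
-- ('postmaster') or just a reload or session change.
SELECT name, setting, unit, context
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'effective_cache_size');
```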

WAL Settings

wal_level = 'replica'             # Logging level: minimal, replica, or logical

Write-Ahead Logging (WAL) ensures crash recovery by writing changes to a log before applying them to data files. The wal_level parameter controls how much information is recorded:

  • minimal - enough for crash recovery only; no replication support
  • replica - adds information needed for streaming replication and point-in-time recovery (the default since PostgreSQL 10)
  • logical - adds information needed for logical decoding and logical replication

WAL level changes require restart

Changing wal_level requires a full server restart. Set it to replica from the start if you might ever need replication or point-in-time recovery. Upgrading from minimal later means a restart window.

Transaction Isolation Levels

PostgreSQL's MVCC engine supports four transaction isolation levels, defined by the SQL standard. Each level determines which concurrency anomalies a transaction can observe. Higher isolation provides stronger consistency guarantees but may increase contention between concurrent transactions.

The three anomalies are:

  • Dirty read - a transaction reads data written by another transaction that has not yet committed. If that transaction rolls back, the reader saw data that never existed.
  • Non-repeatable read - a transaction reads the same row twice and gets different values because another transaction modified and committed the row between the two reads.
  • Phantom read - a transaction re-executes a query with a range condition and gets a different set of rows because another transaction inserted or deleted rows that match the condition.

Which anomalies each level permits, per the SQL standard:

Isolation Level     Dirty Read   Non-repeatable Read   Phantom Read
Read Uncommitted    possible     possible              possible
Read Committed      no           possible              possible
Repeatable Read     no           no                    possible*
Serializable        no           no                    no

* Per the standard. PostgreSQL implements Repeatable Read with snapshot isolation, which in practice prevents phantom reads as well.

PostgreSQL does not actually implement Read Uncommitted - if you request it, you get Read Committed behavior instead. The default isolation level is Read Committed, which prevents dirty reads but allows non-repeatable reads and phantom reads. Most applications never need to change this.

Set the isolation level per transaction:

BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- All reads within this transaction see a consistent snapshot
-- Conflicting concurrent transactions will be rolled back
COMMIT;

Or set the default for all transactions on a session or role:

SET default_transaction_isolation = 'repeatable read';
ALTER ROLE app_user SET default_transaction_isolation = 'serializable';

Serializable isolation in PostgreSQL uses Serializable Snapshot Isolation (SSI), which detects dependency cycles between concurrent transactions and aborts one of them. Unlike traditional locking-based serializable implementations, SSI does not block reads - it detects conflicts at commit time. Applications using Serializable must be prepared to retry transactions that fail with a serialization error (SQLSTATE 40001).
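
A classic write-skew scenario shows why retries are needed. The sketch below assumes a hypothetical accounts table and two psql sessions run side by side:

```sql
-- Session A:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT sum(balance) FROM accounts WHERE owner = 'alice';
UPDATE accounts SET balance = balance - 100 WHERE id = 1;

-- Session B (concurrently):
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT sum(balance) FROM accounts WHERE owner = 'alice';
UPDATE accounts SET balance = balance - 100 WHERE id = 2;

-- Session A:
COMMIT;   -- succeeds

-- Session B:
COMMIT;   -- may abort with SQLSTATE 40001 ("could not serialize access");
          -- the application should catch this and re-run the transaction
```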


Configuring pg_hba.conf

The pg_hba.conf file (Host-Based Authentication) controls who can connect to the database, from where, and how they must authenticate. PostgreSQL evaluates rules top to bottom and uses the first matching rule. If no rule matches, the connection is rejected.

Record Format

Each line in pg_hba.conf has the format:

TYPE  DATABASE  USER  ADDRESS  METHOD

Field     Values
TYPE      local (Unix socket), host (TCP/IP with or without SSL), hostssl (SSL only), hostnossl (non-SSL only)
DATABASE  all, a database name, a comma-separated list, or sameuser
USER      all, a role name, a comma-separated list, or +groupname for group membership
ADDRESS   IP address with CIDR mask (for host types); omitted for local
METHOD    trust, peer, md5, scram-sha-256, cert, reject, and others

Authentication Methods

trust - allows the connection unconditionally without a password. Only appropriate for local development on single-user machines:

local   all   all   trust

peer - uses the operating system username to authenticate. The connecting OS user must match the PostgreSQL role name. This is the default for local Unix socket connections:

local   all   all   peer

md5 - requires an MD5-hashed password. Supported by all clients but less secure than scram-sha-256:

host    all   all   192.168.1.0/24   md5

scram-sha-256 - the strongest password-based authentication method. Uses the SCRAM-SHA-256 challenge-response protocol. Recommended for all password-authenticated connections:

host    all   all   10.0.0.0/8   scram-sha-256

cert - authenticates using client SSL certificates. The client must present a valid certificate signed by a trusted CA:

hostssl   all   all   0.0.0.0/0   cert

reject - unconditionally rejects the connection. Useful as a catch-all at the end of the file or to block specific networks:

host    all   all   0.0.0.0/0   reject

Order matters

PostgreSQL evaluates pg_hba.conf rules top to bottom and uses the first match. A trust rule above a scram-sha-256 rule will override the stronger method. Always place more restrictive rules before less restrictive ones, and end with a reject rule for defense in depth.
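
Putting the pieces together, a complete file might look like this (addresses and names are illustrative, not a recommendation for your network):

```
# TYPE    DATABASE   USER       ADDRESS          METHOD
local     all        postgres                    peer
host      all        all        127.0.0.1/32     scram-sha-256
hostssl   app_db     app_user   10.0.0.0/8       scram-sha-256
host      all        all        0.0.0.0/0        reject
```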

After editing pg_hba.conf, reload the configuration:

sudo -u postgres psql -c "SELECT pg_reload_conf();"
# Or
pg_ctl -D /var/lib/pgsql/16/data reload

Schemas vs Databases

PostgreSQL organizes objects in a hierarchy: cluster > database > schema > objects (tables, views, functions, indexes). Understanding this hierarchy is essential for organizing your data and managing access.

The Catalog Hierarchy

A cluster is the entire PostgreSQL instance managed by one postmaster process. It contains multiple databases, but cross-database queries are not possible in a single SQL statement (unlike MySQL, where you can SELECT across databases freely) unless you use an extension such as postgres_fdw or dblink.

A database is an isolated container of schemas and objects. Each connection targets exactly one database. You create databases with:

CREATE DATABASE analytics OWNER app_user;

A schema is a namespace within a database. It allows you to group related objects and control access at the namespace level. Multiple schemas can contain tables with the same name without conflict:

CREATE SCHEMA inventory;
CREATE TABLE inventory.products (id serial PRIMARY KEY, name text);

CREATE SCHEMA reporting;
CREATE TABLE reporting.products (id serial PRIMARY KEY, name text, summary text);

The public Schema

Every new database contains a public schema. When you create objects without specifying a schema, they go into public. When you query without specifying a schema, PostgreSQL searches according to the search_path:

-- These are equivalent when search_path includes 'public'
SELECT * FROM users;
SELECT * FROM public.users;

-- Check the current search path
SHOW search_path;
-- "$user", public

-- Modify the search path for the session
SET search_path TO myapp, public;

The default search_path is "$user", public. PostgreSQL first looks for a schema matching the current role name, then falls back to public. Setting a custom search_path at the role or database level organizes multi-tenant or multi-module applications:

-- Set a default search_path for a role
ALTER ROLE app_user SET search_path TO myapp, public;

-- Set a default search_path for a database
ALTER DATABASE mydb SET search_path TO myapp, public;

When to Use Schemas vs Databases

Use Case                                                               Choose
Fully isolated applications with different users and permissions       Separate databases
Logical grouping within one application (e.g., auth, billing,          Schemas within one database
inventory)
Multi-tenant applications where tenants share the same table           One schema per tenant
structure
Extensions that need to be shared (e.g., pg_stat_statements;           Schemas within one database
extensions are per-database)
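
The schema-per-tenant pattern, for example, can be sketched like this (tenant names are hypothetical):

```sql
CREATE SCHEMA tenant_acme;
CREATE SCHEMA tenant_globex;

-- Identical table structure in each tenant schema.
CREATE TABLE tenant_acme.orders   (id serial PRIMARY KEY, total numeric);
CREATE TABLE tenant_globex.orders (id serial PRIMARY KEY, total numeric);

-- Route a connection to one tenant by setting search_path; unqualified
-- names now resolve inside that tenant's schema.
SET search_path TO tenant_acme;
SELECT * FROM orders;   -- reads tenant_acme.orders
```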

System Catalogs

PostgreSQL stores all metadata - table definitions, column types, indexes, functions, permissions, and statistics - in system catalogs. These are regular tables that you can query with SQL.

The pg_catalog Schema

Every database contains a pg_catalog schema with the catalog tables. The most commonly queried catalogs:

Catalog        Contents
pg_class       All relations (tables, indexes, sequences, views)
pg_attribute   All columns of all relations
pg_index       Index definitions and properties
pg_namespace   Schemas
pg_roles       Roles (users and groups)
pg_database    Databases in the cluster
pg_proc        Functions and procedures
pg_type        Data types
pg_settings    Server configuration parameters (queryable view of postgresql.conf)

You can query these directly:

-- Find all tables in the public schema
SELECT relname, relkind, reltuples::bigint AS row_estimate
FROM pg_class
WHERE relnamespace = 'public'::regnamespace
  AND relkind = 'r'
ORDER BY reltuples DESC;

-- Find all indexes on a table
SELECT indexrelid::regclass AS index_name,
       indisunique AS is_unique,
       indisprimary AS is_primary
FROM pg_index
WHERE indrelid = 'users'::regclass;

Statistical Views

PostgreSQL collects runtime statistics through its cumulative statistics system (a separate statistics collector process before version 15, shared memory since). The pg_stat_* views expose this data:

pg_stat_activity - shows every current connection, its query, and its state:

SELECT pid, usename, datname, state, query, query_start
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

This is your first stop when diagnosing slow queries or connection issues. The state column shows active (running a query), idle (waiting for a command), idle in transaction (inside an open transaction but not executing), and idle in transaction (aborted).

Idle in transaction

Connections in idle in transaction state hold locks and prevent VACUUM from cleaning up dead rows. If you see long-running idle-in-transaction sessions, investigate the application code. Set idle_in_transaction_session_timeout to automatically terminate sessions that sit idle inside a transaction too long.
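
A quick way to spot offenders and put a safety net in place (the timeout value and pid below are illustrative):

```sql
-- List sessions stuck inside an open transaction, oldest first.
SELECT pid, usename, now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_age DESC;

-- Automatically terminate sessions idle in a transaction for over 5 minutes.
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
SELECT pg_reload_conf();

-- Or terminate a single offending backend by pid:
-- SELECT pg_terminate_backend(12345);
```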

pg_stat_user_tables - per-table statistics including sequential scan counts, index scan counts, and dead tuple counts:

SELECT schemaname, relname,
       seq_scan, seq_tup_read,
       idx_scan, idx_tup_fetch,
       n_tup_ins, n_tup_upd, n_tup_del,
       n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

Tables with high seq_scan counts and low idx_scan counts may need better indexes. Tables with high n_dead_tup counts may need more aggressive autovacuum settings.
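
One heuristic is to compute the fraction of scans that used an index; tables near the top of this output are candidates for new indexes. This is a rough sketch, not a hard rule - for small tables a sequential scan is often the cheapest plan:

```sql
SELECT relname,
       seq_scan,
       coalesce(idx_scan, 0) AS idx_scan,
       round(100.0 * coalesce(idx_scan, 0)
             / nullif(seq_scan + coalesce(idx_scan, 0), 0), 1) AS idx_scan_pct
FROM pg_stat_user_tables
WHERE seq_scan + coalesce(idx_scan, 0) > 0
ORDER BY idx_scan_pct;
```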

The information_schema

PostgreSQL also provides the SQL-standard information_schema - a set of views that present catalog data in a standardized, portable format:

-- List all columns in a table (portable across databases)
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'users'
ORDER BY ordinal_position;

-- List all tables
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
ORDER BY table_schema, table_name;

The information_schema is portable across SQL databases, but it is slower than querying pg_catalog directly because it uses complex views. For PostgreSQL-specific metadata (like tablespace, storage parameters, or toast tables), you need the pg_catalog tables.

