The Object Model

Everything you've learned so far - commits, branches, staging, merging - is built on top of a surprisingly simple storage system. Git is, at its core, a content-addressable filesystem: a key-value store where the key is a SHA-1 hash of the content and the value is the content itself. Understanding this layer explains why Git behaves the way it does and gives you the tools to inspect and repair repositories at the lowest level.


Content-Addressable Storage

The term content-addressable means that the address (name) of every piece of data is derived from the data itself. Git computes a SHA-1 hash of each object's content, and that hash becomes the object's identity. Two files with identical content produce the same hash and are stored once. Change a single byte and the hash - and therefore the identity - changes completely.

This has profound implications:

  • Deduplication is automatic. If the same file content appears unchanged in 1,000 commits, Git stores one copy.
  • Integrity is verifiable. If any bit of a stored object changes (disk corruption, tampering), its content no longer matches its hash, and checks such as git fsck report the mismatch.
  • History is tamper-evident. Since each commit's hash includes its parent hash, changing any commit changes every subsequent hash in the chain. You can't alter history silently.
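
The first property is easy to check with git hash-object, which prints the ID Git would assign to a piece of content. The sketch below is illustrative only: the temporary directory and the file names (a.txt, copy-of-a.txt, b.txt) are assumptions made for the example.

```shell
# Work in a throwaway directory (hypothetical file names).
workdir=$(mktemp -d)
cd "$workdir"

printf 'hello\n' > a.txt
printf 'hello\n' > copy-of-a.txt   # same bytes, different name
printf 'jello\n' > b.txt           # one byte differs

id_a=$(git hash-object a.txt)
id_copy=$(git hash-object copy-of-a.txt)
id_b=$(git hash-object b.txt)

echo "a.txt:         $id_a"
echo "copy-of-a.txt: $id_copy"   # identical to a.txt
echo "b.txt:         $id_b"      # unrelated to the other two
```

The file name plays no part in the result: only the bytes do, which is why the copy gets the same ID and the one-byte change gets a completely different one.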

SHA-1 and SHA-256

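Git has historically used SHA-1 (160-bit, 40 hex characters). SHA-1 is no longer collision-resistant: a practical collision was demonstrated in 2017 (the SHAttered attack), and Git responded by adopting a SHA-1 implementation that detects the known collision techniques. Git is also transitioning to SHA-256 (256-bit, 64 hex characters), with a compatibility layer still being developed. A new repository can opt into SHA-256 at creation time with git init --object-format=sha256, but interoperability with SHA-1 repositories remains limited, and most repositories still use SHA-1.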


The Four Object Types

Every object in Git's database is one of four types:

flowchart TD
    C["commit<br/>snapshot + metadata"] --> T["tree<br/>directory listing"]
    T --> B1["blob<br/>file content"]
    T --> B2["blob<br/>file content"]
    T --> T2["tree<br/>subdirectory"]
    T2 --> B3["blob<br/>file content"]
    TAG["tag<br/>named reference + message"] --> C

Blob (Binary Large Object)

A blob stores the content of a single file - nothing else. No filename, no permissions, no metadata. Just the raw bytes. Two files with identical content, regardless of their names or locations, produce the same blob.

# Hash a string as a blob
echo "Hello, Git" | git hash-object --stdin
# 41e40e5a20c7e8657a8a92e2ce0bfa39a9e0d40c

# Hash a file
git hash-object README.md

Tree

A tree represents a directory. It contains entries, each pointing to a blob (file) or another tree (subdirectory), along with the file's name and permission mode:

100644 blob a1b2c3d4...  README.md
100644 blob e5f6a7b8...  app.py
040000 tree c9d0e1f2...  src

Permission modes:

Mode     Meaning
100644   Regular file
100755   Executable file
120000   Symbolic link
040000   Subdirectory (tree)
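
To see these entries in a real tree, the following sketch builds a small index and writes it out as a tree object. The repository location and file names (README.md, run.sh, src/utils.py) are hypothetical. The executable bit is set with git update-index --chmod=+x so that the recorded mode does not depend on the filesystem.

```shell
# Throwaway repository with hypothetical files.
repo=$(mktemp -d)
cd "$repo"
git init -q

printf '# Demo\n' > README.md
printf '#!/bin/sh\necho hi\n' > run.sh
mkdir src
printf 'x = 1\n' > src/utils.py

git add .
git update-index --chmod=+x run.sh   # record run.sh as 100755 in the index

tree=$(git write-tree)               # snapshot the index as a tree object
git ls-tree "$tree"                  # one line per entry: mode, type, hash, name
```

README.md should be listed as a 100644 blob, run.sh as a 100755 blob, and src as a 040000 tree that points to a second tree object.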

Commit

A commit ties everything together. It references:

  • A tree (the root directory snapshot)
  • Zero or more parent commits
  • Author and committer identity with timestamps
  • A message

The first commit in a repository has no parent. A merge commit has two or more parents. Every other commit has exactly one parent.
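
The parent rule can be observed by printing raw commit objects. This is a sketch in a throwaway repository; the identity values and the file app.py are placeholders chosen for the example.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
# Placeholder identity so the example does not depend on global config.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

printf 'v1\n' > app.py
git add app.py
git commit -q -m "Initial commit"
printf 'v2\n' > app.py
git commit -q -am "Update app"

# Root commit: tree, author, committer, message, and no parent line.
git cat-file -p HEAD~1
# Second commit: the same fields plus exactly one parent line.
git cat-file -p HEAD
```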

Annotated Tag

An annotated tag is a named reference to a commit (or any object) with additional metadata: a tagger identity, timestamp, and message. Unlike lightweight tags (which are just refs), annotated tags are full objects stored in the database.

# Create an annotated tag
git tag -a v1.0 -m "First stable release"

# Show the tag object
git cat-file -p v1.0
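
The difference between the two kinds of tag shows up when you ask Git for the object type behind each name. The sketch below uses a throwaway repository; the tag name v1.0-light and the identity values are made up for the comparison.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

printf 'v1\n' > app.py
git add app.py
git commit -q -m "Initial commit"

git tag v1.0-light                          # lightweight: only a ref
git tag -a v1.0 -m "First stable release"   # annotated: a tag object plus a ref

git cat-file -t v1.0-light      # commit
git cat-file -t v1.0            # tag
git rev-parse 'v1.0^{commit}'   # peel the tag object to the commit it names
```

The lightweight tag resolves straight to the commit, while the annotated tag resolves to its own object, which in turn points at the commit.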

The .git/objects Directory

All objects are stored in .git/objects/. Git uses the first two characters of the hash as a directory name and the remaining 38 as the filename:

.git/objects/
├── 41/
│   └── e40e5a20c7e8657a8a92e2ce0bfa39a9e0d40c  (a blob)
├── 8f/
│   └── a3c9b1d2e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8  (a tree)
├── e4/
│   └── f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9c0d1e2f3  (a commit)
├── info/
└── pack/
    ├── pack-abc123.idx
    └── pack-abc123.pack

Individual objects are called loose objects. As a repository grows, Git periodically packs loose objects into packfiles (.pack with an .idx index) for efficiency. Packfiles use delta compression - storing only the differences between similar objects. The Refs, the Reflog, and the DAG guide covers packfiles in depth.

Each loose object is stored as: type size\0content, compressed with zlib.
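
Because the hash covers the header as well as the content, an object ID can be reproduced without Git's help. The sketch below assumes sha1sum from GNU coreutils is available (on macOS, shasum prints the same digest) and that Git is using its default SHA-1 object format.

```shell
content='hello'

# Size in bytes of the content, including the trailing newline.
size=$(( $(printf '%s\n' "$content" | wc -c) ))

# Hash "blob <size>\0<content>" by hand...
manual=$(printf 'blob %d\0%s\n' "$size" "$content" | sha1sum | cut -d' ' -f1)

# ...and let Git hash the same content.
via_git=$(printf '%s\n' "$content" | git hash-object --stdin)

echo "manual:  $manual"
echo "via git: $via_git"
```

Both lines should print the same 40-character ID. The zlib compression applies only to how the object is written to disk and has no effect on the hash.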


Plumbing Commands

Git has two categories of commands: porcelain (user-facing: commit, merge, push) and plumbing (low-level: hash-object, cat-file, write-tree). Plumbing commands let you interact directly with the object database.

git hash-object - Store Content

# Hash content from stdin (just compute the hash, don't store)
echo "hello" | git hash-object --stdin
# ce013625030ba8dba906f756967f9e9ca394464a

# Hash and store into the object database
echo "hello" | git hash-object --stdin -w

# Hash a file
git hash-object README.md

git cat-file - Read Objects

# Show object type
git cat-file -t a1b2c3d
# blob, tree, commit, or tag

# Show object size
git cat-file -s a1b2c3d
# 42

# Pretty-print object content
git cat-file -p a1b2c3d

git ls-tree - List Tree Contents

# List the root tree of HEAD
git ls-tree HEAD

# List recursively (all files in all subdirectories)
git ls-tree -r HEAD

# List a specific directory
git ls-tree HEAD src/

git write-tree - Create a Tree from the Index

# Write the current index as a tree object
git write-tree
# Returns the hash of the new tree

git commit-tree - Create a Commit Object

# Create a commit from a tree, with a parent and message
echo "My commit message" | git commit-tree <tree-hash> -p <parent-hash>
# Returns the hash of the new commit

Building a Commit with Plumbing Commands

This is the most illuminating exercise in the entire course. Instead of using git add and git commit, you'll create a commit entirely with low-level plumbing commands - the same operations Git performs internally.
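
A minimal version of the exercise might look like the following. It is a sketch rather than the exact sequence Git runs internally: the throwaway repository, the file name hello.txt, and the identity values are all assumptions made for the example.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

# 1. Store the file content as a blob (-w writes it to .git/objects).
blob=$(printf 'Hello, plumbing\n' | git hash-object -w --stdin)

# 2. Add an index entry that gives the blob a mode and a name.
git update-index --add --cacheinfo 100644,"$blob",hello.txt

# 3. Snapshot the index as a tree object.
tree=$(git write-tree)

# 4. Wrap the tree in a commit (no -p, so this is a root commit).
commit=$(echo "Initial commit via plumbing" | git commit-tree "$tree")

# 5. Point the current branch at the new commit.
git update-ref HEAD "$commit"

git log --oneline
git cat-file -p "$commit"
```

Steps 1 and 2 together correspond roughly to git add, and steps 3 to 5 to git commit. Nothing here touches the working directory, so hello.txt exists only in the object database and the index until you check it out.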


Tracing the Object Graph

Every commit points to a tree, every tree points to blobs and subtrees, and everything is connected by SHA-1 hashes. Starting from any commit, you can trace the entire object graph by following those hashes with git cat-file.
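
One way to walk that chain by hand is sketched here in a throwaway repository. The files README.md and src/utils.py and the identity values are hypothetical.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

mkdir src
printf '# Demo\n' > README.md
printf 'def helper():\n    return 42\n' > src/utils.py
git add .
git commit -q -m "Initial commit"

commit=$(git rev-parse HEAD)
git cat-file -p "$commit"        # the commit names its root tree

root=$(git rev-parse "${commit}^{tree}")
git cat-file -p "$root"          # root tree: README.md (blob) and src (tree)

subtree=$(git rev-parse "${commit}:src")
git cat-file -p "$subtree"       # subtree: utils.py (blob)

blob=$(git rev-parse "${commit}:src/utils.py")
git cat-file -p "$blob"          # the file's bytes
```

Each step only uses a hash printed by the previous one, so the same walk works from any commit in any repository.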


Object Graph Visualization

The relationships between objects form a directed acyclic graph (DAG). Here's what a small repository's object graph looks like:

flowchart TD
    C2["commit: e4f5a6b<br/>Add utils module"] --> C1["commit: a1b2c3d<br/>Initial commit"]
    C2 --> T2["tree: 8fa3c9b<br/>(root)"]
    C1 --> T1["tree: c3b8bb1<br/>(root)"]

    T1 --> B1["blob: d670460<br/>README.md content"]
    T1 --> B2["blob: f1e2d3c<br/>app.py content v1"]

    T2 --> B1
    T2 --> B3["blob: a9b8c7d<br/>app.py content v2"]
    T2 --> ST["tree: 7e8f9a0<br/>src/"]
    ST --> B4["blob: 5d6e7f8<br/>utils.py content"]

Notice that B1 (README.md) is referenced by both trees - the file didn't change between commits, so Git reuses the same blob. This is content-addressable deduplication in action.
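
The same sharing can be confirmed in a live repository by comparing blob IDs across commits. This sketch uses a throwaway repository with hypothetical files and placeholder identity values; the hashes it prints will not match the abbreviated ones in the diagram.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

printf '# Demo\n' > README.md
printf 'print("v1")\n' > app.py
git add .
git commit -q -m "Initial commit"

printf 'print("v2")\n' > app.py   # only app.py changes
git commit -q -am "Update app.py"

# README.md: the same blob ID in both commits.
git rev-parse HEAD~1:README.md HEAD:README.md
# app.py: a different blob ID in each commit.
git rev-parse HEAD~1:app.py HEAD:app.py
```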

