Monorepos and Scaling Git¶

Git was designed for the Linux kernel - a large project, but one with a relatively straightforward directory structure. When organizations put hundreds of projects, millions of files, and decades of history into a single repository, Git's default behavior starts to struggle. Clone times balloon, git status takes seconds, and CI builds trigger unnecessarily. This guide covers the tools and strategies for making Git perform at scale.

When Monorepos Make Sense¶

A monorepo stores multiple projects, services, or packages in a single Git repository. Major organizations use them:

Google - billions of lines of code in a single repository (custom VCS, not Git)
Meta - millions of files, originally Mercurial with custom extensions, now Sapling (a Mercurial-derived client with Git support)
Microsoft - Windows codebase moved to Git using VFS for Git and later Scalar

Advantage	Trade-off
Atomic cross-project changes	Clone and checkout take longer
Shared code without versioning overhead	`git status` is slower with many files
Unified CI/CD and tooling	CI must scope to affected code
Single source of truth	Access control is repository-wide
Easier refactoring across boundaries	Git wasn't designed for millions of files

The challenges are real but solvable with the right configuration.

Sparse Checkout¶

Sparse checkout tells Git to materialize only a subset of files in your working directory. The full history remains available, and the other paths still exist in the repository; they are simply not written to disk.

Setting Up Sparse Checkout¶

# Clone with sparse checkout enabled
git clone --sparse https://github.com/org/monorepo.git
cd monorepo

# Check out specific directories
git sparse-checkout set services/auth services/api shared/utils

# Add more directories later
git sparse-checkout add services/web

# List current sparse checkout patterns
git sparse-checkout list

# Disable sparse checkout (check out everything)
git sparse-checkout disable

Cone Mode vs Non-Cone Mode¶

Sparse checkout has two modes:

Cone mode (default, recommended) - specifies directories. Fast because Git can skip entire subtrees without pattern matching.
Non-cone mode - specifies file glob patterns. More flexible but slower.

# Cone mode (default) - specify directories
git sparse-checkout set services/auth tests/auth

# Non-cone mode - specify patterns
git sparse-checkout set --no-cone '!/*' '/README.md' '/services/auth/**'

Stick with cone mode unless you need file-level granularity.
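To make the effect of cone mode concrete, here is a minimal sketch that runs entirely on the local machine. It assumes Git 2.37 or later (where cone mode is the default) and uses a throwaway repository with a made-up `services/` layout, so none of the paths refer to a real project.

```shell
set -eu

# Build a throwaway "monorepo" with three service directories (hypothetical layout).
work=$(mktemp -d)
git init -q "$work/monorepo"
cd "$work/monorepo"
mkdir -p services/auth services/api services/web
echo "auth" > services/auth/main.txt
echo "api"  > services/api/main.txt
echo "web"  > services/web/main.txt
echo "docs" > README.md
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "initial layout"

# Clone it sparsely: no service directory is written to disk yet.
git clone -q --sparse "$work/monorepo" "$work/sparse"
cd "$work/sparse"

# Cone mode: name directories, not patterns.
git sparse-checkout set services/auth

# Show the active directories and what actually exists on disk.
git sparse-checkout list
ls services
```

Only `services/auth` should appear under `services/`; the other two directories are still tracked in the index but are not materialized.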

Partial Clone¶

Partial clone (Git 2.22+) lets you clone a repository without downloading all objects. Git fetches missing objects on demand when you need them.

# Blobless clone - skip file content, download on checkout
git clone --filter=blob:none https://github.com/org/monorepo.git

# Treeless clone - skip trees too (minimal download)
git clone --filter=tree:0 https://github.com/org/monorepo.git

# Size-limited - skip blobs larger than a threshold
git clone --filter=blob:limit=1m https://github.com/org/monorepo.git

Combining Partial Clone and Sparse Checkout¶

The power combo for monorepos:

# Clone with partial objects and sparse checkout
git clone --filter=blob:none --sparse https://github.com/org/monorepo.git
cd monorepo

# Check out only what you need
git sparse-checkout set services/my-service tests/my-service

# Git fetches blobs only for files in your sparse checkout

This gives you:

Full commit history (git log works without any blobs; git blame fetches the blobs it needs on demand)
Minimal disk usage (only blobs for your directories)
Fast initial clone
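The following sketch shows one way to observe this behavior without a network connection. It is an illustration under assumptions: the "server" is a local throwaway repository reached over `file://` (so the real transport is used), the directory names are invented, and the two `uploadpack.*` settings are there because the serving side has to opt in to filtered and on-demand fetches.

```shell
set -eu

# A local stand-in for the remote monorepo (hypothetical layout).
work=$(mktemp -d)
git init -q "$work/origin"
cd "$work/origin"
mkdir -p services/my-service services/other
echo "mine"  > services/my-service/app.txt
echo "other" > services/other/app.txt
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "initial"

# The serving side must allow filters and on-demand object requests.
git config uploadpack.allowFilter true
git config uploadpack.allowAnySHA1InWant true

# Blobless + sparse clone over file:// so that --filter is honoured.
git clone -q --filter=blob:none --sparse "file://$work/origin" "$work/clone"
cd "$work/clone"
git sparse-checkout set services/my-service

# Objects that are referenced but not downloaded are printed with a leading "?".
missing=$(git rev-list --objects --all --missing=print | grep -c '^?' || true)
echo "blobs still missing locally: $missing"
```

The blob for `services/my-service/app.txt` is fetched when the sparse checkout materializes it, while the blob under `services/other` stays on the remote until something asks for it.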

The Commit Graph File¶

The commit graph file (.git/objects/info/commit-graph) is a pre-computed index of the commit DAG. It stores commit hashes, parent hashes, root tree hashes, commit dates, and generation numbers in a compact binary format.

Without the commit graph file, Git must decompress and parse individual commit objects to traverse history. With it, operations like git log --graph, git merge-base, and reachability queries are significantly faster.

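# Generate/update the commit graph for every commit reachable from a ref
# (without --reachable, only commits already stored in packfiles are scanned)
git commit-graph write --reachable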

# Verify the commit graph
git commit-graph verify

# It's also maintained by git maintenance
git maintenance run --task=commit-graph
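As a small end-to-end check, the sketch below creates a scratch repository, writes the commit graph, and confirms that the file lands at the path mentioned above. The repository and its three commits are invented for the example; `--reachable` is used so that commits still stored as loose objects are included.

```shell
set -eu

# Scratch repository with a short linear history.
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
for i in 1 2 3; do
  echo "change $i" > file.txt
  git add file.txt
  git -c user.name=demo -c user.email=demo@example.com commit -q -m "commit $i"
done

# Walk from all refs and write a single commit-graph file.
git commit-graph write --reachable

# Check the file's internal consistency against the object database.
git commit-graph verify

# The graph is a plain file inside the object directory.
graph=.git/objects/info/commit-graph
ls -l "$graph"
```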

`git maintenance` - Background Optimization¶

Git 2.29 introduced git maintenance, and Git 2.30 added the register, start, and stop subcommands for scheduling automatic optimization:

# Register this repo for background maintenance
git maintenance register

# Run the enabled tasks now (add --task=<name> to pick specific ones)
git maintenance run

# Start the scheduler (uses the system scheduler: launchd, cron, systemd timers, or Windows Task Scheduler)
git maintenance start

# Stop the scheduler
git maintenance stop

Maintenance Tasks¶

Task	Frequency	What it does
`commit-graph`	Hourly	Updates the commit graph file
`prefetch`	Hourly	Background fetch from remotes (no merge)
`loose-objects`	Daily	Packs loose objects
`incremental-repack`	Daily	Consolidates packfiles
`pack-refs`	Weekly	Packs loose refs into a single file
`gc`	Not scheduled	Full garbage collection; the default `incremental` schedule leaves it out
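Tasks can also be run by name without waiting for the scheduler. The sketch below, which assumes Git 2.29 or later and a scratch repository created just for the demonstration, runs two of them and prints the object counts before and after.

```shell
set -eu

# Scratch repository whose objects start out loose.
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
echo "hello" > file.txt
git add file.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -m "initial"

echo "before:"
git count-objects -v | grep -E '^(count|in-pack):'

# Run two tasks explicitly, in the order given.
git maintenance run --task=loose-objects --task=commit-graph

echo "after:"
git count-objects -v | grep -E '^(count|in-pack):'

# Number of objects that now live in a packfile.
in_pack=$(git count-objects -v | awk '/^in-pack:/ {print $2}')
```

The `loose-objects` task works in two steps across runs: it first packs loose objects, and only on a later run deletes the loose copies that are already packed, so the loose count may not drop immediately.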

Scalar¶

Scalar is Microsoft's tool for optimizing large Git repositories. Since Git 2.38, a subset of Scalar is built into Git itself:

# Initialize a repo with Scalar optimizations
scalar clone https://github.com/org/large-repo.git

# Register an existing repo for Scalar management
scalar register

# Scalar automatically configures:
# - Sparse checkout
# - Partial clone (blob:none)
# - Commit graph
# - Multi-pack index
# - File system monitor (fsmonitor)
# - Background maintenance

Scalar is essentially a convenience wrapper that enables all the individual optimizations covered in this guide in one command.

File System Monitor¶

git status needs to check every tracked file for changes. On large repositories, this filesystem scan is the bottleneck. The file system monitor (fsmonitor) uses OS-level file change notifications to skip files that haven't changed.

# Enable the built-in fsmonitor daemon (Git 2.37+)
git config core.fsmonitor true

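# Or use Watchman (Meta's file watcher) through Git's sample hook script;
# core.fsmonitor expects a hook, not the watchman binary itself
cp .git/hooks/fsmonitor-watchman.sample .git/hooks/fsmonitor-watchman
git config core.fsmonitor .git/hooks/fsmonitor-watchman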

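The built-in fsmonitor--daemon (Git 2.37+) watches for filesystem events and tells Git which files have changed since the last query. This can make git status near-instantaneous on repos with hundreds of thousands of files. As of Git 2.37 the daemon is implemented for macOS and Windows; on other platforms, the Watchman hook is the alternative.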

Submodules vs Subtrees¶

For projects that need to include code from other repositories, Git offers two mechanisms:

Submodules¶

A submodule is a pointer to a specific commit in another repository:

# Add a submodule
git submodule add https://github.com/lib/awesome-lib.git vendor/awesome-lib

# Clone a repo with submodules
git clone --recurse-submodules https://github.com/org/project.git

# Update submodules to their tracked commits
git submodule update --init --recursive

# Update submodules to the latest remote commit
git submodule update --remote

Submodules are simple in concept but have notorious UX issues: detached HEAD state inside the submodule, forgetting to init/update after clone, and confusing merge conflicts.
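The "pointer to a specific commit" can be inspected directly. The sketch below is a local-only illustration: both repositories are throwaway stand-ins (`lib` for the external library, `project` for the consumer), and `protocol.file.allow=always` is passed because recent Git versions refuse to clone submodules from local paths by default.

```shell
set -eu

work=$(mktemp -d)
ident="-c user.name=demo -c user.email=demo@example.com"

# A tiny stand-in for the external library.
git init -q "$work/lib"
echo "lib" > "$work/lib/lib.txt"
git -C "$work/lib" add lib.txt
git -C "$work/lib" $ident commit -q -m "lib v1"

# The consuming project.
git init -q "$work/project"
cd "$work/project"
git -c protocol.file.allow=always submodule --quiet add "$work/lib" vendor/lib
git $ident commit -q -m "add lib as submodule"

# Mode 160000 marks a gitlink: the tree stores a commit ID, not file contents.
git ls-tree HEAD vendor/lib

# The pinned commit is exactly the library's HEAD at the time of `submodule add`.
pinned=$(git ls-tree HEAD vendor/lib | awk '{print $3}')
lib_head=$(git -C "$work/lib" rev-parse HEAD)
echo "pinned: $pinned"
echo "lib:    $lib_head"
```

New commits in the library do not change `pinned` until the parent repository records an updated pointer, which is why a fresh checkout of a submodule ends up on a detached HEAD.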

Subtrees¶

A subtree merges another repository's content directly into a subdirectory:

# Add a subtree
git subtree add --prefix=vendor/awesome-lib https://github.com/lib/awesome-lib.git main --squash

# Pull updates
git subtree pull --prefix=vendor/awesome-lib https://github.com/lib/awesome-lib.git main --squash

# Push changes back upstream
git subtree push --prefix=vendor/awesome-lib https://github.com/lib/awesome-lib.git main

Subtrees are simpler for consumers (no special init commands), but the history can be messy and push-back workflows are less intuitive.

Aspect	Submodules	Subtrees
Storage	Pointer to external commit	Content merged into repo
Clone experience	Requires `--recurse-submodules`	Works with normal clone
Updating	`submodule update --remote`	`subtree pull`
History	Separate (external repo)	Mixed into your repo
Push changes back	cd into submodule, commit, push	`subtree push`

Build System Integration¶

Large monorepos need build systems that understand which projects are affected by a change:

Bazel - Google's build system, tracks dependencies explicitly
Nx - smart monorepo build system for JavaScript/TypeScript
Turborepo - incremental build system for JS monorepos
Pants - scalable build system for Python, Go, Java

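These tools, or helper scripts built around them, typically compare a change against a base commit with Git and then walk their own dependency graph to find the affected packages. Building and testing only those packages is what makes CI feasible for large monorepos.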
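The Git half of that process can be approximated in a few lines. The sketch below is a simplification under assumptions: it uses an invented two-level layout (`services/<name>`, `shared/<name>`), treats each such directory as a project, and ignores dependency edges entirely, which real build tools do follow.

```shell
set -eu

work=$(mktemp -d)
ident="-c user.name=demo -c user.email=demo@example.com"

# Scratch monorepo with three hypothetical projects.
git init -q "$work/repo"
cd "$work/repo"
mkdir -p services/auth services/api shared/utils
echo a > services/auth/a.txt
echo b > services/api/b.txt
echo u > shared/utils/u.txt
git add .
git $ident commit -q -m "base"
base=$(git rev-parse HEAD)

# A change that touches a single service.
echo "fix" >> services/auth/a.txt
git $ident commit -q -am "fix auth"

# Reduce each changed path to its two-level project directory.
affected=$(git diff --name-only "$base" HEAD | cut -d/ -f1-2 | sort -u)
echo "$affected"   # prints: services/auth
```

A CI job could feed this list to its build tool. A change under `shared/utils` would also need to mark every project that depends on it, which is the part the dependency graph supplies.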
