Monorepos and Scaling Git
Git was designed for the Linux kernel - a large project, but one with a relatively straightforward directory structure. When organizations put hundreds of projects, millions of files, and decades of history into a single repository, Git's default behavior starts to struggle. Clone times balloon, git status takes seconds, and CI builds trigger unnecessarily. This guide covers the tools and strategies for making Git perform at scale.
When Monorepos Make Sense
A monorepo stores multiple projects, services, or packages in a single Git repository. Major organizations use them:
- Google - billions of lines of code in a single repository (custom VCS, not Git)
- Meta - millions of files in a Mercurial-based system with custom extensions (since evolved into Sapling, which can also work with Git repositories)
- Microsoft - Windows codebase moved to Git using VFS for Git and later Scalar
| Advantage | Trade-off |
|---|---|
| Atomic cross-project changes | Clone and checkout take longer |
| Shared code without versioning overhead | git status is slower with many files |
| Unified CI/CD and tooling | CI must scope to affected code |
| Single source of truth | Access control is repository-wide |
| Easier refactoring across boundaries | Git wasn't designed for millions of files |
The challenges are real but solvable with the right configuration.
Sparse Checkout
Sparse checkout tells Git to only materialize a subset of files in your working directory. The full history is available, but you only see the directories you need.
Setting Up Sparse Checkout
# Clone with sparse checkout enabled
git clone --sparse https://github.com/org/monorepo.git
cd monorepo
# Check out specific directories
git sparse-checkout set services/auth services/api shared/utils
# Add more directories later
git sparse-checkout add services/web
# List current sparse checkout patterns
git sparse-checkout list
# Disable sparse checkout (check out everything)
git sparse-checkout disable
Cone Mode vs Non-Cone Mode
Sparse checkout has two modes:
- Cone mode (default, recommended) - specifies directories. Fast because Git can skip entire subtrees without pattern matching.
- Non-cone mode - specifies file glob patterns. More flexible but slower.
# Cone mode (default) - specify directories
git sparse-checkout set services/auth tests/auth
# Non-cone mode - specify patterns
git sparse-checkout set --no-cone '!/*' '/README.md' '/services/auth/**'
Stick with cone mode unless you need file-level granularity.
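To see what cone mode actually materializes, the following sketch builds a throwaway repository and restricts the working tree to a single directory. The directory layout and the `demo` identity are made up for illustration, and it assumes Git 2.25 or newer, where `git sparse-checkout` is available.

```shell
set -e
# Scratch repository with a hypothetical monorepo layout
repo="$(mktemp -d)"
cd "$repo"
git init -q .
mkdir -p services/auth services/api docs
echo auth > services/auth/main.txt
echo api > services/api/main.txt
echo guide > docs/guide.txt
echo readme > README.md
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "initial layout"

# Request cone mode explicitly so the sketch behaves the same on older versions
git sparse-checkout init --cone
git sparse-checkout set services/auth

# Root-level files stay; sibling directories are removed from the working tree
ls
ls services
```

Cone mode always keeps the files that sit directly in the repository root and in the parent directories of the selected paths, which is why `README.md` survives here.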
Partial Clone
Partial clone (Git 2.22+) lets you clone a repository without downloading all objects. Git fetches missing objects on demand when you need them.
Filter Options
# Blobless clone - skip file content, download on checkout
git clone --filter=blob:none https://github.com/org/monorepo.git
# Treeless clone - skip trees too (minimal download)
git clone --filter=tree:0 https://github.com/org/monorepo.git
# Size-limited - skip blobs larger than a threshold
git clone --filter=blob:limit=1m https://github.com/org/monorepo.git
Combining Partial Clone and Sparse Checkout
The power combo for monorepos:
# Clone with partial objects and sparse checkout
git clone --filter=blob:none --sparse https://github.com/org/monorepo.git
cd monorepo
# Check out only what you need
git sparse-checkout set services/my-service tests/my-service
# Git fetches blobs only for files in your sparse checkout
This gives you:
- Full commit history (for git log, git blame)
- Minimal disk usage (only blobs for your directories)
- Fast initial clone
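One way to check how much a blobless clone has left on the server is to list objects with `git rev-list --missing=print`, which prefixes each object that is not present locally with `?`. The sketch below is a self-contained experiment, not a recipe for a real remote: it uses a local stand-in "server" repository (hypothetical paths and file names) with `uploadpack.allowFilter` enabled so that the `--filter` request is honored.

```shell
set -e
work="$(mktemp -d)"
cd "$work"

# Stand-in "server" repository for the experiment
git init -q server
cd server
echo one > a.txt
echo two > b.txt
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "two files"
git config uploadpack.allowFilter true
cd ..

# Blobless clone without a checkout, so no blob has been requested yet
git clone -q --filter=blob:none --no-checkout "file://$work/server" client

# Objects that are promised but not downloaded are printed with a leading "?"
missing="$(git -C client rev-list --objects --all --missing=print | grep -c '^?')"
echo "objects not yet downloaded: $missing"
```

Running the same `rev-list` command after checking out a sparse set of directories should show the count drop only by the blobs those directories needed.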
The Commit Graph File
The commit graph file (.git/objects/info/commit-graph) is a pre-computed index of the commit DAG. It stores commit hashes, parent hashes, root tree hashes, commit dates, and generation numbers in a compact binary format.
Without the commit graph file, Git must decompress and parse individual commit objects to traverse history. With it, operations like git log --graph, git merge-base, and reachability queries are significantly faster.
# Generate/update the commit graph
git commit-graph write
# Verify the commit graph
git commit-graph verify
# It's also maintained by git maintenance
git maintenance run --task=commit-graph
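Beyond the basic command, two options are worth knowing: `--reachable` covers every commit reachable from any ref, and `--changed-paths` stores Bloom filters that help path-limited history queries. The sketch below tries them in a scratch repository (the file name and `demo` identity are placeholders) and assumes Git 2.27 or newer for `--changed-paths`.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q .
for i in 1 2 3; do
  echo "$i" > file.txt
  git add file.txt
  git -c user.name=demo -c user.email=demo@example.com commit -q -m "commit $i"
done

# Cover all commits reachable from any ref, and store changed-path Bloom
# filters so queries such as `git log -- <path>` can skip unrelated commits
git commit-graph write --reachable --changed-paths

# Refresh the graph automatically after each fetch
git config fetch.writeCommitGraph true

git commit-graph verify
ls -l .git/objects/info/commit-graph
```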
git maintenance - Background Optimization
Git 2.29 introduced git maintenance, and later releases added the subcommands for scheduling automatic optimization:
# Register this repo for background maintenance
git maintenance register
# Run the enabled tasks now (by default only gc; add --task=<name> for others)
git maintenance run
# Start the scheduler (uses the system scheduler: launchd, cron, systemd timers, or Windows Task Scheduler)
git maintenance start
# Stop the scheduler
git maintenance stop
Maintenance Tasks
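| Task | Frequency | What it does |
|---|---|---|
| commit-graph | Hourly | Updates the commit graph file |
| prefetch | Hourly | Background fetch from remotes (no merge) |
| loose-objects | Daily | Packs loose objects |
| incremental-repack | Daily | Consolidates packfiles |
| pack-refs | Weekly | Consolidates loose refs into a packed-refs file |
| gc | Not scheduled | Full garbage collection (run git gc manually when needed) |

These frequencies reflect the incremental strategy that git maintenance register sets by default. That strategy deliberately leaves gc out of the schedule because it is the most disruptive task.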
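The schedule can be adjusted per repository through the `maintenance.strategy`, `maintenance.<task>.schedule`, and `maintenance.<task>.enabled` config keys. The following is a minimal sketch in a scratch repository; the particular overrides (hourly `loose-objects`, no `prefetch`) are arbitrary examples rather than recommendations.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q .

# Use the built-in incremental schedule as a baseline
git config maintenance.strategy incremental

# Override individual tasks: pack loose objects hourly, skip background fetches
git config maintenance.loose-objects.schedule hourly
git config maintenance.prefetch.enabled false

# Show the resulting maintenance settings
git config --get-regexp '^maintenance\.'
```

Per-task settings take precedence over the strategy, so the baseline stays in effect for every task that is not overridden.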
Scalar
Scalar is Microsoft's tool for optimizing large Git repositories. Since Git 2.38, a subset of Scalar is built into Git itself:
# Initialize a repo with Scalar optimizations
scalar clone https://github.com/org/large-repo.git
# Register an existing repo for Scalar management
scalar register
# Scalar automatically configures:
# - Sparse checkout
# - Partial clone (blob:none)
# - Commit graph
# - Multi-pack index
# - File system monitor (fsmonitor)
# - Background maintenance
Scalar is essentially a convenience wrapper that enables all the individual optimizations covered in this guide in one command.
File System Monitor
git status needs to check every tracked file for changes. On large repositories, this filesystem scan is the bottleneck. The file system monitor (fsmonitor) uses OS-level file change notifications to skip files that haven't changed.
# Enable the built-in fsmonitor daemon (Git 2.37+)
git config core.fsmonitor true
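# Or use Watchman (Meta's file watcher) via the sample hook that Git ships;
# core.fsmonitor expects a hook script here, not the watchman binary itself
cp .git/hooks/fsmonitor-watchman.sample .git/hooks/fsmonitor-watchman
git config core.fsmonitor .git/hooks/fsmonitor-watchman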
The built-in fsmonitor--daemon (Git 2.37+, available on Windows and macOS at its introduction) watches for filesystem events and tells Git which files have changed since the last query. This can make git status near-instantaneous on repos with hundreds of thousands of files.
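The file system monitor covers tracked files; the untracked cache (`core.untrackedCache`) is a companion setting that lets `git status` skip unchanged directories when it looks for untracked files. The sketch below enables it in a scratch repository and then asks whether a daemon is watching that repository. The fallback message is illustrative only.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q .

# Cache untracked-file scans so unchanged directories are not re-read
git config core.untrackedCache true

# Ask whether a daemon is watching this repo; the command exits non-zero
# when none is running or when this Git build lacks the built-in daemon
git fsmonitor--daemon status || echo "no fsmonitor daemon for $repo"
```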
Submodules vs Subtrees
For projects that need to include code from other repositories, Git offers two mechanisms:
Submodules
A submodule is a pointer to a specific commit in another repository:
# Add a submodule
git submodule add https://github.com/lib/awesome-lib.git vendor/awesome-lib
# Clone a repo with submodules
git clone --recurse-submodules https://github.com/org/project.git
# Update submodules to their tracked commits
git submodule update --init --recursive
# Update submodules to the latest remote commit
git submodule update --remote
Submodules are simple in concept but have notorious UX issues: detached HEAD state inside the submodule, forgetting to init/update after clone, and confusing merge conflicts.
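Some of these rough edges can be softened with configuration. The settings below are a sketch of one reasonable combination, shown in a scratch repository; whether they suit a given team is an assumption to validate, not a universal recommendation.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q .

# Make checkout, pull, and similar commands update submodules automatically
git config submodule.recurse true

# Include a summary of submodule changes in git status output
git config status.submoduleSummary true

# Refuse to push when a referenced submodule commit has not been pushed
git config push.recurseSubmodules check

git config --get-regexp 'submodule'
```

Note that `submodule.recurse` does not apply to `git clone`, so `--recurse-submodules` is still needed there.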
Subtrees
A subtree merges another repository's content directly into a subdirectory:
# Add a subtree
git subtree add --prefix=vendor/awesome-lib https://github.com/lib/awesome-lib.git main --squash
# Pull updates
git subtree pull --prefix=vendor/awesome-lib https://github.com/lib/awesome-lib.git main --squash
# Push changes back upstream
git subtree push --prefix=vendor/awesome-lib https://github.com/lib/awesome-lib.git main
Subtrees are simpler for consumers (no special init commands), but the history can be messy and push-back workflows are less intuitive.
| | Submodules | Subtrees |
|---|---|---|
| Storage | Pointer to external commit | Content merged into repo |
| Clone experience | Requires --recurse-submodules | Works with normal clone |
| Updating | submodule update --remote | subtree pull |
| History | Separate (external repo) | Mixed into your repo |
| Push changes back | cd into submodule, commit, push | subtree push |
Build System Integration
Large monorepos need build systems that understand which projects are affected by a change:
- Bazel - Google's build system, tracks dependencies explicitly
- Nx - smart monorepo build system for JavaScript/TypeScript
- Turborepo - incremental build system for JS monorepos
- Pants - scalable build system for Python, Go, Java
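These tools combine a project dependency graph with change detection (typically a git diff against a base commit; Bazel relies on content hashing of its inputs instead) to determine which packages are affected and build/test only those, making CI feasible for large monorepos.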
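The change-detection half of that idea can be approximated with plain Git. The sketch below is a simplification under stated assumptions: projects live two directory levels deep (a hypothetical `services/<name>` layout), and dependencies between projects are ignored, which is exactly the part the build tools above add.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q .
mkdir -p services/auth services/api
echo v1 > services/auth/app.txt
echo v1 > services/api/app.txt
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "base"
base="$(git rev-parse HEAD)"

# A change that touches only one service
echo v2 > services/auth/app.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -am "update auth"

# Map each changed file to its two-level project directory
affected="$(git diff --name-only "$base" HEAD | cut -d/ -f1-2 | sort -u)"
echo "$affected"
```

In a CI job, the base would more likely come from something like `git merge-base origin/main HEAD` than from a saved commit hash.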
Further Reading
- Pro Git - Chapter 7.11: Submodules - submodule setup and workflow
- Official git-sparse-checkout documentation - sparse checkout modes and patterns
- Official git-maintenance documentation - background optimization tasks
- Scalar Documentation - Microsoft's monorepo optimization tool
- Git at Scale (Microsoft DevOps Blog) - Scalar and VFS for Git
- Partial Clone Documentation - filtering objects during clone