Monorepos and Scaling Git

Git was designed for the Linux kernel - a large project, but one with a relatively straightforward directory structure. When organizations put hundreds of projects, millions of files, and decades of history into a single repository, Git's default behavior starts to struggle. Clone times balloon, git status takes seconds, and CI builds trigger unnecessarily. This guide covers the tools and strategies for making Git perform at scale.


When Monorepos Make Sense

A monorepo stores multiple projects, services, or packages in a single Git repository. Major organizations use them:

  • Google - billions of lines of code in a single repository (custom VCS, not Git)
  • Meta - millions of files, managed with a custom Mercurial-derived system (Sapling, whose client can also work with Git repositories)
  • Microsoft - Windows codebase moved to Git using VFS for Git and later Scalar

| Advantage | Trade-off |
| --- | --- |
| Atomic cross-project changes | Clone and checkout take longer |
| Shared code without versioning overhead | git status is slower with many files |
| Unified CI/CD and tooling | CI must scope to affected code |
| Single source of truth | Access control is repository-wide |
| Easier refactoring across boundaries | Git wasn't designed for millions of files |

The challenges are real but solvable with the right configuration.


Sparse Checkout

Sparse checkout tells Git to only materialize a subset of files in your working directory. The full history is available, but you only see the directories you need.

Setting Up Sparse Checkout

# Clone with sparse checkout enabled
git clone --sparse https://github.com/org/monorepo.git
cd monorepo

# Check out specific directories
git sparse-checkout set services/auth services/api shared/utils

# Add more directories later
git sparse-checkout add services/web

# List current sparse checkout patterns
git sparse-checkout list

# Disable sparse checkout (check out everything)
git sparse-checkout disable

Cone Mode vs Non-Cone Mode

Sparse checkout has two modes:

  • Cone mode (default, recommended) - specifies directories. Fast because Git can skip entire subtrees without pattern matching.
  • Non-cone mode - specifies file glob patterns. More flexible but slower.

# Cone mode (default) - specify directories
git sparse-checkout set services/auth tests/auth

# Non-cone mode - specify patterns
git sparse-checkout set --no-cone '!/*' '/README.md' '/services/auth/**'

Stick with cone mode unless you need file-level granularity.
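
To try cone mode without a real monorepo, the sketch below builds a throwaway repository in a temporary directory and clones it sparsely. The directory layout and the demo commit identity are illustrative assumptions. It calls git sparse-checkout init --cone explicitly so the script also behaves the same on Git versions where cone mode is not yet the default.

```shell
set -eu

# Build a small throwaway "monorepo" (hypothetical layout).
demo=$(mktemp -d)
git init -q "$demo/origin"
cd "$demo/origin"
mkdir -p services/auth services/api shared/utils
echo auth  > services/auth/main.txt
echo api   > services/api/main.txt
echo utils > shared/utils/lib.txt
echo readme > README.md
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "initial layout"

# Clone sparsely, then materialize a single directory in cone mode.
git clone -q --sparse "$demo/origin" "$demo/work"
cd "$demo/work"
git sparse-checkout init --cone
git sparse-checkout set services/auth

# Show which files actually exist in the working directory.
find . -path ./.git -prune -o -type f -print
```

Only README.md and the contents of services/auth should be listed: cone mode always keeps files in the repository root and skips every directory you did not name.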


Partial Clone

Partial clone (Git 2.22+) lets you clone a repository without downloading all objects. Git fetches missing objects on demand when you need them.

Filter Options

# Blobless clone - skip file content, download on checkout
git clone --filter=blob:none https://github.com/org/monorepo.git

# Treeless clone - skip trees too (minimal download)
git clone --filter=tree:0 https://github.com/org/monorepo.git

# Size-limited - skip blobs larger than a threshold
git clone --filter=blob:limit=1m https://github.com/org/monorepo.git
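
The effect of a blobless clone can be observed locally. The following is a sketch under a few assumptions: the origin is a throwaway repository served over a file:// URL, and the two uploadpack settings are enabled by hand because a bare local repository does not honor filters by default (hosted services usually take care of this for you). The names demo, origin, and blobless are made up for the example.

```shell
set -eu

demo=$(mktemp -d)
git init -q "$demo/origin"
cd "$demo/origin"
# Let this local "server" honor --filter and serve individual objects on demand.
git config uploadpack.allowFilter true
git config uploadpack.allowAnySHA1InWant true

echo v1 > data.txt
git add data.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -m "v1"
echo v2 > data.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -am "v2"

# Blobless clone: commits and trees are downloaded, blobs only when needed.
git clone -q --filter=blob:none "file://$demo/origin" "$demo/blobless"
cd "$demo/blobless"

# Objects that are referenced by history but not present locally are printed with a leading "?".
missing=$(git rev-list --objects --all --missing=print | grep -c '^?')
echo "objects not downloaded yet: $missing"
```

The checkout fetches the blob for the current version of data.txt, while the older version stays on the server until a command such as git show or git blame asks for it.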

Combining Partial Clone and Sparse Checkout

The power combo for monorepos:

# Clone with partial objects and sparse checkout
git clone --filter=blob:none --sparse https://github.com/org/monorepo.git
cd monorepo

# Check out only what you need
git sparse-checkout set services/my-service tests/my-service

# Git fetches blobs only for files in your sparse checkout

This gives you:

  • Full commit history (git log works immediately; git blame fetches the historical blobs it needs on demand)
  • Minimal disk usage (only blobs for your directories)
  • Fast initial clone

The Commit Graph File

The commit graph file (.git/objects/info/commit-graph) is a pre-computed index of the commit DAG. It stores commit hashes, parent hashes, root tree hashes, commit dates, and generation numbers in a compact binary format.

Without the commit graph file, Git must decompress and parse individual commit objects to traverse history. With it, operations like git log --graph, git merge-base, and reachability queries are significantly faster.

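# Generate/update the commit graph for every commit reachable from your refs
# (without --reachable, only commits already stored in packfiles are considered)
git commit-graph write --reachable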

# Verify the commit graph
git commit-graph verify

# It's also maintained by git maintenance
git maintenance run --task=commit-graph
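
As a quick check that the file ends up where described, this sketch creates a few empty commits in a throwaway repository and writes the graph. The repository path and commit identity are placeholders. The --reachable flag is used because the demo's commits are still loose objects, which a plain write would not pick up.

```shell
set -eu

demo=$(mktemp -d)
git init -q "$demo/repo"
cd "$demo/repo"
for i in 1 2 3; do
  git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "commit $i"
done

# Write the graph starting from all refs, then check its integrity.
git commit-graph write --reachable
git commit-graph verify

# The single-file graph lives under the object directory.
ls -l .git/objects/info/commit-graph
```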

git maintenance - Background Optimization

Git 2.29 introduced git maintenance, and later releases added the scheduling subcommands shown here:

# Register this repo for background maintenance
git maintenance register

# Run the enabled tasks now (by default only gc; add --task=<name> to pick others)
git maintenance run

# Start the scheduler (uses system scheduler: launchd/cron/systemd)
git maintenance start

# Stop the scheduler
git maintenance stop

Maintenance Tasks

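| Task | Frequency | What it does |
| --- | --- | --- |
| commit-graph | Hourly | Updates the commit graph file |
| prefetch | Hourly | Background fetch from remotes (no merge) |
| loose-objects | Daily | Packs loose objects |
| incremental-repack | Daily | Consolidates packfiles |
| pack-refs | Weekly | Packs loose refs into a single file |
| gc | Not scheduled | Full garbage collection; left out of the incremental schedule that git maintenance register configures, so run it manually when needed |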
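
Individual tasks can also be run once in the foreground, which is a convenient way to see what they do before handing them to the scheduler. The sketch below uses a throwaway repository with placeholder file names and deliberately avoids git maintenance register and start, since those modify your global configuration and system scheduler.

```shell
set -eu

demo=$(mktemp -d)
git init -q "$demo/repo"
cd "$demo/repo"
for i in 1 2 3; do
  echo "$i" > "file$i.txt"
  git add "file$i.txt"
  git -c user.name=demo -c user.email=demo@example.com commit -q -m "commit $i"
done

# Run two specific tasks once instead of the default task set.
git maintenance run --task=commit-graph --task=loose-objects

# The loose-objects task collects loose objects into a packfile.
ls .git/objects/pack
```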

Scalar

Scalar is Microsoft's tool for optimizing large Git repositories. Since Git 2.38, a subset of Scalar is built into Git itself:

# Initialize a repo with Scalar optimizations
scalar clone https://github.com/org/large-repo.git

# Register an existing repo for Scalar management
scalar register

# Scalar automatically configures:
# - Sparse checkout
# - Partial clone (blob:none)
# - Commit graph
# - Multi-pack index
# - File system monitor (fsmonitor)
# - Background maintenance

Scalar is essentially a convenience wrapper that enables all the individual optimizations covered in this guide in one command.


File System Monitor

git status needs to check every tracked file for changes. On large repositories, this filesystem scan is the bottleneck. The file system monitor (fsmonitor) uses OS-level file change notifications to skip files that haven't changed.

# Enable the built-in fsmonitor daemon (Git 2.37+)
git config core.fsmonitor true

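# Or use Watchman (Facebook's file watcher) through the sample hook that ships with Git
cp .git/hooks/fsmonitor-watchman.sample .git/hooks/fsmonitor-watchman
git config core.fsmonitor .git/hooks/fsmonitor-watchman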

The built-in fsmonitor--daemon (Git 2.37+) watches for filesystem events and tells Git which files have changed since the last query. This can make git status near-instantaneous on repos with hundreds of thousands of files.


Submodules vs Subtrees

For projects that need to include code from other repositories, Git offers two mechanisms:

Submodules

A submodule is a pointer to a specific commit in another repository:

# Add a submodule
git submodule add https://github.com/lib/awesome-lib.git vendor/awesome-lib

# Clone a repo with submodules
git clone --recurse-submodules https://github.com/org/project.git

# Update submodules to their tracked commits
git submodule update --init --recursive

# Update submodules to the latest remote commit
git submodule update --remote

Submodules are simple in concept but have notorious UX issues: detached HEAD state inside the submodule, forgetting to init/update after clone, and confusing merge conflicts.
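
The "pointer to a specific commit" is visible in the parent repository's tree. This sketch uses two throwaway local repositories; the lib repository stands in for awesome-lib, and the protocol.file.allow override is only there because the submodule URL is a local path, which recent Git versions refuse by default.

```shell
set -eu

demo=$(mktemp -d)

# Hypothetical library repository standing in for awesome-lib.
git init -q "$demo/lib"
git -C "$demo/lib" -c user.name=demo -c user.email=demo@example.com \
  commit -q --allow-empty -m "lib: initial commit"

git init -q "$demo/project"
cd "$demo/project"
git -c protocol.file.allow=always submodule add "$demo/lib" vendor/awesome-lib
git -c user.name=demo -c user.email=demo@example.com \
  commit -q -m "Add awesome-lib submodule"

# Mode 160000 marks a gitlink: the tree records a commit ID, not file content.
git ls-tree HEAD vendor/awesome-lib

# The URL and path are tracked separately in .gitmodules.
cat .gitmodules
```

The commit ID printed by git ls-tree is the library's HEAD at the time of git submodule add. It only moves when someone updates the submodule and commits the new pointer in the parent repository.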

Subtrees

A subtree merges another repository's content directly into a subdirectory:

# Add a subtree
git subtree add --prefix=vendor/awesome-lib https://github.com/lib/awesome-lib.git main --squash

# Pull updates
git subtree pull --prefix=vendor/awesome-lib https://github.com/lib/awesome-lib.git main --squash

# Push changes back upstream
git subtree push --prefix=vendor/awesome-lib https://github.com/lib/awesome-lib.git main

Subtrees are simpler for consumers (no special init commands), but the history can be messy and push-back workflows are less intuitive.

| Aspect | Submodules | Subtrees |
| --- | --- | --- |
| Storage | Pointer to external commit | Content merged into repo |
| Clone experience | Requires --recurse-submodules | Works with normal clone |
| Updating | submodule update --remote | subtree pull |
| History | Separate (external repo) | Mixed into your repo |
| Push changes back | cd into submodule, commit, push | subtree push |

Build System Integration

Large monorepos need build systems that understand which projects are affected by a change:

  • Bazel - Google's build system, tracks dependencies explicitly
  • Nx - smart monorepo build system for JavaScript/TypeScript
  • Turborepo - incremental build system for JS monorepos
  • Pants - scalable build system for Python, Go, Java

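These tools, or wrappers around them, typically combine a Git diff (the files changed between a base commit and HEAD) with their own dependency graph to determine which packages are affected, then build and test only those. That is what makes CI feasible for large monorepos.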
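
The change-detection step can be approximated with plain Git. The sketch below is not how any of these tools is implemented; it assumes a hypothetical layout where each project lives two levels deep (services/<name>) and simply maps changed files to their project directory. Real build systems go further and follow dependency edges, so a change in a shared library also rebuilds its consumers.

```shell
set -eu

demo=$(mktemp -d)
git init -q "$demo/repo"
cd "$demo/repo"
mkdir -p services/auth services/api
echo v1 > services/auth/main.txt
echo v1 > services/api/main.txt
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "base"
base=$(git rev-parse HEAD)

# Change only one project.
echo v2 > services/auth/main.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -am "change auth only"

# Map changed files to their two-level project directory and deduplicate.
affected=$(git diff --name-only "$base" HEAD | cut -d/ -f1-2 | sort -u)
echo "$affected"
```

In CI, the base commit would usually be the merge base with the target branch rather than a hard-coded commit.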

