Back to blog

pgit: I Imported the Linux Kernel into PostgreSQL

Mar 26, 2026 10 min read

TL;DR: Imported the full Linux kernel history into pgit. 1,428,882 commits, 24.4 million file versions, 20 years of development, stored in PostgreSQL with delta compression. Actual data: 2.7 GB (git gc --aggressive gets 1.95 GB). The import took 2 hours on a dedicated server. Then I started asking questions. 7 f-bombs in 1.4 million commit messages (all from 2 people). 665 bug fixes pointing at a single commit. A filesystem that took 13 years to merge. Here's what the Linux kernel looks like as a SQL database.


The import

This post builds on pgit: What If Your Git History Was a SQL Database?. If you haven't read it, start there. Short version: pgit is a Git-like CLI where everything lives in PostgreSQL instead of the filesystem. It uses pg-xpatch for transparent delta compression and makes your entire commit history SQL-queryable. After the pgit post hit the HN front page and got picked up by TLDR, console.dev, and dailydev, I teased that I was importing the Linux kernel. Here's what happened.

The Linux kernel is one of the largest actively developed repositories in the world. 1.4 million commits spanning 20 years, 171,000 files, 38,000 contributors. From what I've found, only a handful of VCS besides git have ever managed a full import of the kernel's history. Fossil (SQLite-based, by the SQLite team) never did. Darcs and Monotone attempted it with severe performance problems. Mercurial can do it. Correct me if I'm wrong on any of this.

pgit handled it.

MetricValue
Commits1,428,882
File versions (file refs)24,384,844
Unique blobs3,089,589
Unique paths171,525
Path groups (delta chains)137,600
Import time2h 0m 48s

The import ran on a Hetzner dedicated server in Finland: AMD EPYC 7401P (24 cores / 48 threads), 512 GB DDR4 ECC RAM, 2×1.92 TB SSD in RAID 0. With a 350 GB xpatch content cache, the entire decoded repository fits in memory.

Full server setup, git baseline, and pgit configuration

The server

Hetzner Dedicated "Server Auction" from their Finland datacenter (HEL1):

ComponentSpec
CPUAMD EPYC 7401P (24 cores / 48 threads)
RAM16×32 GB DDR4 ECC reg. (512 GB total)
Storage2×Micron SSD SATA 1.92 TB Datacenter (RAID 0)
NIC1 Gbit Intel I350
Cost~€272/month

OS installation

Hetzner installimage with Ubuntu 24.04 LTS. Two changes from the default config: RAID 0 (SWRAIDLEVEL 0) for maximum throughput (no redundancy needed for ephemeral analysis work), and a simple partition layout:

PART /boot ext3 1024M
PART swap swap 4G
PART / ext4 all

This gives ~3.5 TB usable storage across the two 1.92 TB SSDs.

OS tuning

After booting into the installed image:

# --- Packages ---
apt update && apt upgrade -y
apt install -y \
  tmux btop htop iotop \
  cpufrequtils numactl \
  git curl wget unzip \
  build-essential \
  ufw \
  linux-tools-common linux-tools-$(uname -r)

# --- CPU governor → performance (all 48 threads) ---
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$cpu"
done
cat > /etc/default/cpufrequtils << 'EOF'
GOVERNOR="performance"
EOF
systemctl enable cpufrequtils
systemctl restart cpufrequtils

# --- Kernel mitigations off ---
sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="consoleblank=0"/GRUB_CMDLINE_LINUX_DEFAULT="consoleblank=0 mitigations=off"/' /etc/default/grub.d/hetzner.cfg
update-grub

# --- sysctl ---
cat >> /etc/sysctl.conf << 'EOF'

vm.swappiness = 1
vm.dirty_ratio = 5
vm.dirty_background_ratio = 2
kernel.numa_balancing = 1
EOF
sysctl -p

# --- Disable Transparent Huge Pages ---
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
cat > /etc/systemd/system/disable-thp.service << 'EOF'
[Unit]
Description=Disable Transparent Huge Pages
DefaultDependencies=no
After=sysinit.target local-fs.target
Before=basic.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled && echo never > /sys/kernel/mm/transparent_hugepage/defrag'

[Install]
WantedBy=basic.target
EOF
systemctl daemon-reload
systemctl enable disable-thp

# --- noatime ---
sed -i 's|relatime|noatime|g' /etc/fstab
mount -o remount,noatime /

# --- Firewall ---
ufw default deny incoming
ufw default allow outgoing
ufw allow ssh
ufw --force enable

# --- Go 1.26.0 ---
wget https://go.dev/dl/go1.26.0.linux-amd64.tar.gz
rm -rf /usr/local/go && tar -C /usr/local -xzf go1.26.0.linux-amd64.tar.gz
rm go1.26.0.linux-amd64.tar.gz
echo 'export PATH=$PATH:/usr/local/go/bin:$HOME/go/bin' >> ~/.bashrc
source ~/.bashrc

# --- Docker ---
apt install -y docker.io
systemctl enable docker
systemctl start docker

# --- Reboot for mitigations ---
reboot

pg-xpatch container

Pulled the standard latest pg-xpatch Docker image:

docker pull ghcr.io/imgajeed76/pg-xpatch:latest

pgit version

pgit v4 with a few local changes that weren't released at the time of the import. By the time you're reading this, they should be included in the latest version, so everything here is reproducible with a normal go install. The main change is a seq ordering fix that replaces a monotonic timestamp hack with an explicit seq INTEGER NOT NULL column for commit ordering. This makes delta chain decompression significantly faster for sequential scans. Full changelist:

  • db/schema.go — Added seq INTEGER NOT NULL column, order_by => 'seq' in xpatch.configure()
  • db/commits.go — Added Seq field to struct, updated all INSERT/COPY statements
  • cli/import.go — Populates Seq (1-indexed), removed monotonic timestamp hack
  • repo/commit.gopgit commit computes MAX(seq)+1, includes seq in INSERT
  • cli/analyze.go--timeout flag (replaces hardcoded 5min), ORDER BY seq ASC in queries
  • cli/local.go — Shows actual container image instead of hardcoded DefaultImage
  • container/runtime.go — New GetContainerImage() function
  • Docs/text cleanup across README, sql.go, commits.go, ulid.go, config/reflect.go

pgit configuration

# --- PostgreSQL core ---
pgit config --global container.shared_buffers 64GB
pgit config --global container.effective_cache_size 400GB
pgit config --global container.work_mem 256MB
pgit config --global container.wal_buffers 512MB
pgit config --global container.max_wal_size 32GB
pgit config --global container.checkpoint_timeout 60min
pgit config --global container.max_connections 50
pgit config --global container.port 5433
pgit config --global container.shm_size 450g

# --- Parallelism (24c/48t EPYC 7401P) ---
pgit config --global container.max_worker_processes 28
pgit config --global container.max_parallel_workers 24
pgit config --global container.max_parallel_per_gather 12

# --- xpatch content cache (350 GB) ---
pgit config --global container.xpatch_cache_size_mb 358400            # 350 GB
pgit config --global container.xpatch_cache_max_entries 41000000      # 31.5M needed + 30%
pgit config --global container.xpatch_cache_max_entry_kb 16384        # 16 MB max single entry
pgit config --global container.xpatch_cache_slot_size_kb 4            # default, fine for mixed sizes
pgit config --global container.xpatch_cache_partitions 24             # one per core

# --- xpatch sub-caches ---
pgit config --global container.xpatch_group_cache_size_mb 256         # 85K groups need 4 MB (62×)
pgit config --global container.xpatch_tid_cache_size_mb 4096          # 25.8M rows need 1.1 GB (3.7×)
pgit config --global container.xpatch_seq_tid_cache_size_mb 4096      # 25.8M rows need 1.5 GB (2.7×)

# --- xpatch insert / encoding ---
pgit config --global container.xpatch_insert_cache_slots 256          # 24 workers × concurrent groups
pgit config --global container.xpatch_encode_threads 2                # ×24 workers = 48 HW threads
pgit config --global container.xpatch_warm_cache_workers 24           # one per core

# --- Import ---
pgit config --global import.workers 24

Configuration rationale

ParameterValueReasoning
shared_buffers64 GBDataset ~20 GB on disk, 3× headroom, entire DB fits in buffer pool
effective_cache_size400 GBHint to planner: shared_buffers + OS page cache + xpatch caches
work_mem256 MBPer-op memory for sorts/hash joins; 50 connections × 256 MB = 12.8 GB worst case
wal_buffers512 MBAbsorb write bursts during bulk import
max_wal_size32 GBDelay checkpoints during heavy writes
checkpoint_timeout60 minMinimize I/O stalls from forced checkpoints
shm_size450 GB64 GB shared_buffers + 350 GB cache + 8 GB sub-caches + ~2 GB internals ≈ 425 GB, rounded up
xpatch_cache_size_mb350 GBEstimated decoded footprint ~55 GB, 84% headroom ensures everything fits
xpatch_cache_max_entries41M1.4M commits × 5 delta cols + 24.4M file versions = 31.5M worst case, +30% padding
xpatch_encode_threads2×24 workers = 48 encoding threads, matching 48 HW threads exactly
import.workers24One per physical core

Git baseline measurements

Before importing, we cloned the kernel and measured the raw git repository:

cd /root
git clone --single-branch --branch master https://github.com/torvalds/linux.git
cd /root/linux

git rev-list --count HEAD                                              # 1,428,882
git gc --quiet && du -sb .git/objects/pack/*.pack                      # 6,213,222,259 (5.79 GB)
git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectsize)' \
  | awk '{sum += $2} END {printf "%.2f GB\n", sum/1024/1024/1024}'     # 144.43 GB
time git fast-export --reencode=yes --show-original-ids master \
  > /root/linux.fastexport                                             # 17m18s, 126 GB
time git gc --aggressive --quiet && du -sb .git/objects/pack/*.pack    # 24m46s, 2,093,181,079 (1.95 GB)
MetricValue
Commits1,428,882
Raw uncompressed object size144.43 GB
Packfile after git gc5.79 GB
Packfile after git gc --aggressive1.95 GB
git gc --aggressive time24m 46s (338m CPU)
Fast-export size126 GB
Fast-export time17m 18s

The import

Started the pg-xpatch container with pgit local start (the start check times out because allocating 450 GB of shared memory takes a while, but it starts eventually). Waited for the database to be ready with docker logs pgit-local -f.

cd /root/linux-analysis
pgit init
time pgit import /root/linux --branch master --fastexport /root/linux.fastexport
PhaseTime
Commit import39m 57s
Commit graph build13s (max depth 75,641)
Path group computation9.6s (137,600 groups from 171,525 paths)
Blob import1h 17m
Index rebuild38s
Total2h 0m 48s (wall), 336m 50s CPU, 21m 12s sys

Compression

Let's be honest about the numbers.

SizeRatio
Raw uncompressed objects144.43 GB1.0x
pgit (on-disk)6.6 GB21.9x
git gc (normal)5.79 GB24.9x
pgit (actual data)2.7 GB53.5x
git gc --aggressive1.95 GB74.1x

git gc --aggressive wins. 1.95 GB vs pgit's 2.7 GB of actual data. About 38% smaller.

This is expected. The Linux kernel is basically git's dream scenario for cross-object delta compression: massive code reuse across architectures, SPDX license headers duplicated in 70,000+ files, and 70% of all files are .c/.h with shared patterns across completely unrelated subsystems. git's packfile format can delta-compress any object against any other object in the entire repository, regardless of file path. pgit compresses within file-level delta chains.

What pgit does get: 114.4x compression on text content alone. xpatch compressed 123 GB of source text into 1.1 GB. The other 1.6 GB is metadata (which file is in which commit, path mappings, refs). Only 52 binary blobs exist in the entire kernel history. It's almost entirely text.

But the comparison isn't really about bytes. git gc --aggressive takes 25 minutes and gives you a packfile. pgit takes 2 hours and gives you a SQL database. The question is what you can do after.

Full on-disk breakdown
ComponentSize
Commits (xpatch)600.6 MB
Text content (xpatch)1.3 GB
Binary content (xpatch)2.0 MB
File refs (heap)2.1 GB
Paths (heap)13.2 MB
Other (refs, metadata, sync)40.0 KB
Indexes2.7 GB
Total on disk6.6 GB
LayerSize
xpatch (commits + text + binary)1.5 GB
Normal tables (file_refs, paths, refs, metadata)1.2 GB
Actual data2.7 GB
PostgreSQL overhead + indexes3.9 GB

Note: The second table strips PostgreSQL overhead from each component, so the rows won't sum to match the on-disk table above.

171,525 paths collapsed into 137,600 delta groups (rename/copy detection). 7.9x blob deduplication: 24.4M file refs point to only 3.1M unique content versions.

What 1.4 million commits reveal

Everything below was queried directly from PostgreSQL. Most queries completed in under 10 seconds. No materialized views, no preprocessing, no scripts parsing git log output. Just SQL on delta-compressed tables.

38,506 authors. 36% never came back.

The kernel has 38,506 unique authors (by email) but only 1,540 unique committers. In the kernel's mailing list workflow, you write a patch and a maintainer merges it. So 38,506 people wrote code, but only 1,540 had merge authority. A 25:1 ratio.

Nearly 14,000 of those authors contributed exactly one patch and never came back.

90% of commits touch 5 files or fewer

Files TouchedCommits% of total
1 file875,54161.3%
2-5 files414,01829.0%
6-10 files70,9515.0%
11-50 files49,7003.5%
51+ files18,5231.3%

The kernel's "one logical change per commit" rule holds up. The largest single commit touched 53,003 files, but every single one of the top 5 biggest commits turned out to be merge commits from subsystem maintainers. Not sweeping API changes. Just plumbing.

File coupling: the hidden dependencies

pgit analyze coupling computes which files always change together. This is the kind of analysis that's painful to do with git (parsing git log output, building co-change matrices, filtering noise) but trivial when your history is a SQL database. On 1.4 million commits, it completed in 48 seconds.

File AFile BCo-changes
i915/intel_drv.hi915/intel_display.c1,117
net/core/dev.cinclude/linux/netdevice.h1,087
i915/i915_gem.ci915/i915_drv.h1,072
arch/x86/kvm/x86.carch/x86/include/asm/kvm_host.h1,066
include/uapi/linux/bpf.htools/include/uapi/linux/bpf.h742
net/ipv4/tcp_ipv4.cnet/ipv6/tcp_ipv6.c739

The Intel i915 GPU driver owns the top spot: intel_drv.h and intel_display.c have been changed together 1,117 times. The i915 driver appears 8 times in the top 30 coupled pairs.

The most interesting entry: include/uapi/linux/bpf.h and tools/include/uapi/linux/bpf.h at 742 co-changes. Every kernel BPF header change requires a manual copy to the tools directory. And tcp_ipv4.c with tcp_ipv6.c at 739: fixing a TCP bug in IPv4 almost always means the same fix in IPv6.

Full coupling top 15
RankFile AFile BCo-changes
1i915/intel_drv.hi915/intel_display.c1,117
2net/core/dev.cinclude/linux/netdevice.h1,087
3i915/i915_gem.ci915/i915_drv.h1,072
4arch/x86/kvm/x86.carch/x86/include/asm/kvm_host.h1,066
5i915/intel_display.ci915/i915_drv.h892
6include/net/cfg80211.hnet/wireless/nl80211.c783
7mlx5/core/en_main.cmlx5/core/en.h778
8fs/btrfs/inode.cfs/btrfs/ctree.h776
9fs/btrfs/ctree.hfs/btrfs/extent-tree.c769
10fs/btrfs/disk-io.cfs/btrfs/ctree.h757
11include/uapi/linux/bpf.htools/include/uapi/linux/bpf.h742
12net/ipv4/tcp_ipv4.cnet/ipv6/tcp_ipv6.c739
13i915/i915_drv.ci915/i915_drv.h739
14sound/soc/codecs/Makefilesound/soc/codecs/Kconfig674
15net/mac80211/ieee80211_i.hnet/mac80211/mlme.c670

Btrfs has 7 entries in the top 30, all radiating from ctree.h. That single header is the coupling hub of the entire Btrfs filesystem.

Three people merge 22.5% of all commits

CommitterPatches MergedSelf-AuthoredMerge Ratio
David S. Miller113,45615,6177.3x
Greg Kroah-Hartman105,7337,07315.0x
Linus Torvalds102,32245,1252.3x

David S. Miller (networking) is the single busiest merge point: 7.9% of all kernel commits flow through him. Greg Kroah-Hartman authored 7K patches but merged 106K. 15:1. John W. Linville is even more lopsided: 18.9K merged, 1.1K written. 16.5:1.

Full committer table (top 10)
CommitterPatches MergedSelf-AuthoredMerge Ratio
David S. Miller113,45615,6177.3x
Greg Kroah-Hartman105,7337,07315.0x
Linus Torvalds102,32245,1252.3x
Mark Brown49,6748,7595.7x
Mauro Carvalho Chehab39,8696,5716.1x
Alex Deucher37,0534,2018.8x
Ingo Molnar27,8705,6484.9x
Jakub Kicinski24,5095,0364.9x
Jens Axboe19,9853,7585.3x
John W. Linville18,8821,14616.5x

Who pays for the kernel

OrganizationCommitsAuthorsCommits/Author
Intel83,1871,70449
Red Hat72,695658110
kernel.org69,451227306
Linaro43,524263166
AMD42,2701,01742
SUSE35,711222161
Google29,27680936
Huawei24,15654045
Amazon1,68812114

Intel is #1 by volume (83K commits, 1,704 engineers). Red Hat is #2 but with the most productive team: 110 commits per engineer. The kernel.org maintainers (227 people) average 306 commits each. These are the elite core.

Amazon stands out at the bottom: 14 commits per person. They're focused on Xen/KVM virtualization for AWS, not broad kernel work. (Note: IBM shows only 53 commits because their engineers use @linux.ibm.com and @linux.vnet.ibm.com, landing them in the "Other" bucket. Similarly, many Huawei engineers use @hisilicon.com.)

Individual contributors (Gmail addresses, a proxy for hobbyists) peaked at 12% of all commits in 2010. By 2025, that's down to 8%. The absolute numbers are stable (~7K/year), but the kernel has become more corporate over time.

Gmail vs corporate trend over time
YearGmail CommitsTotalGmail %
200545216,6962%
20085,60448,84711%
20106,09149,81912% (peak)
20148,60275,65911%
20187,39380,3309%
20227,43886,8108%
20257,16085,1638%

The "buggiest" commit in Linux history

The kernel has a convention: the Fixes: tag in a commit message references the exact commit that introduced a bug. By 2026, more than 1 in 4 commits use it (up from basically zero before 2013).

The commit with the most Fixes: references? 1da177e4c3f4. Linus Torvalds's initial git import. April 16, 2005. 665 bug fixes pointing back at it.

It's not actually buggy. When a bug has existed "forever" and there's no specific commit to blame, developers cite 1da177e4c3f4 as shorthand for "this was always broken."

The second-most-fixed: dd08ebf6c352, Intel's Xe GPU driver introduction. 196 fixes in ~2 years. One bug fix every 4 days since the driver landed.

Polite commits, angry code

7 f-bombs in 1.4 million commit messages. All from exactly 2 people: Al Viro (5) and Linus Torvalds (2).

But the source code tells a different story. pgit search "fuck" --path "*.c" --path "*.h" found 8 matches in the current codebase. Running the same search with --all (every version of every file across 20 years of history) found 50+ matches in 44 seconds. Highlights:

  • "Am I fucking pedantic or what?" (SCSI driver header, still in the code today)
  • "Ugly, ugly fucker." (netfilter header, present since the very first commit, survived 20 years of code review)
  • "fucking gcc" (XFS B-tree header, twice)
  • "If you fuck with this, update ret_from_syscall code too" (SPARC architecture)
Full profanity count in commit messages

These counts are from commit messages only, not source code. Using PostgreSQL word boundary regex (\y) to avoid false positives ("ass" matching "class", "hell" matching "shell"):

WordCountNotes
workaround8,435Not profanity, but reveals pain: 8,435 times someone couldn't fix the real problem
hack2,438The kernel's honest self-assessment
ugly2,161
stupid533
crap268
damn81
shit29
fuck7

And in the source code, some gems:

FileComment
sound/oss/forte.c"FIXME HACK FROM HELL!"
arch/powerpc/sysdev/todc.c"XXXX BAD HACK -> FIX" (×4, and it was copied to arch/ppc/syslib/todc_time.c without being fixed)
drivers/staging/dgap/ (5 files)NOTE TO LINUX KERNEL HACKERS: DO NOT REFORMAT THIS CODE!

Triple reverts

Only 3 commits in all of Linux history are triple reverts (a revert of a revert of a revert).

Greg KH on Lustre (the HPC filesystem): "How many times can we do this..."

Linus on a memory management flag: "This is a revert of a revert of a revert." The i915 GPU driver was using a flag it shouldn't have been, and untangling it took three revert cycles.

All three are in memory management or staging. Subsystems where changes have far-reaching, hard-to-predict effects.

Kent Overstreet: 13 years, one filesystem

Kent Overstreet's first kernel commit was in 2011: bcache, a block cache layer. By 2013 it was in mainline with 213 commits. Then he went quiet. 4 to 34 commits per year from 2014 to 2017, rewriting the whole thing out-of-tree into a full filesystem.

In 2023, bcachefs merged into mainline (kernel 6.7). 904 commits that year. 1,194 the next. He codes on New Year's Day across 3 different years, between 1am and 4am. 27% of his commits are on weekends. When the merge was controversial, he kept going.

More kernel stories

Weekend warriors:

AuthorWeekend CommitsTotalWeekend %
Jonathan Cameron1,1762,29551.2%
Christophe JAILLET1,0612,08650.9%
Kent Overstreet1,4155,21727.1%
Hans de Goede1,3944,62730.1%
Linus Torvalds10,72945,27423.7%

Jonathan Cameron and Christophe JAILLET do more than half their kernel work on weekends. They're hobby contributors maintaining drivers in their free time.

Single-file devotees:

Connor McAdams's entire kernel career (89 commits) is a single Creative Labs sound card driver (patch_ca0132.c). Lydia Wang (63 commits) lived in the VIA audio codec. Two separate people, Pavel Rojtberg and Cameron Gutman, independently devoted their kernel contributions exclusively to the Xbox controller driver (xpad.c).

Career journeys:

James Bottomley (SCSI maintainer) has committed from 19 different email domains over 20 years. His git log IS his resume: SteelEye → HP → Parallels → Odin → and more. Linus has 12 domains. David S. Miller has 11.

Christmas Day commits:

The kernel has never had a zero-commit Christmas in 21 years. 2008 had 157 commits on December 25th (the x86 unification frenzy). Even on the quietest Christmas (2005, 1 commit), someone was working.

The busiest single day:

November 18, 2022: 662 commits. One person (Uwe Kleine-König) did 583 of them in a mass platform_driver return type conversion. The second busiest was January 30, 2008 (640 commits): three developers sustaining 600+ commits/day for three consecutive days during the x86 unification.

Commit message culture gap:

Jeff Garzik averages 161 bytes per commit message. Filipe Manana (Btrfs) averages 1,950 bytes. A 12x gap. The entire memory management subsystem (mm/) skews verbose: its top 3 contributors are all in the top 5 longest-message writers. When your bugs are subtle and catastrophic, you explain yourself.

GPU drivers are revert magnets:

drm/amd/display (191 reverts) + drm/amdgpu (127) + drm/i915 (147) = 516 GPU reverts, nearly 10% of all reverts in the entire kernel. Display code has complex hardware interactions, power management state machines, and regression-sensitive userspace APIs.

Query performance

All of the above was queried on a 1.4 million commit database. Here's how long things took:

QueryTime
Stats + compression info2.1s
Churn (24.4M file refs)2.3s
Hotspots1.8s
Coupling (computed in Go)48.3s
Authors (full commit scan)34.3s
Activity (yearly)9.4s
Day-of-week / hour-of-day4.2-4.4s
Full-text search (current files)25s
Full-text search (all history)44s

No materialized views. No preprocessing. Just SQL on PostgreSQL with pg-xpatch delta compression.

All SQL commands used in this analysis
# Storage & compression (2.1s)
pgit stats --xpatch

# Churn (2.3s)
pgit analyze churn --limit 30

# Hotspots (1.8s)
pgit analyze hotspots --depth 1 --limit 25

# Authors (34.3s)
pgit analyze authors --limit 30 --timeout 60m

# Activity (9.4s)
pgit analyze activity --period year --limit 25 --timeout 60m

# Coupling (48.3s)
pgit analyze coupling --limit 30 --timeout 60m

# Merge hierarchy (4.6s)
pgit sql "SELECT committer_name, COUNT(*) as merged,
  SUM(CASE WHEN author_name = committer_name THEN 1 ELSE 0 END) as self_authored
  FROM pgit_commits GROUP BY committer_name ORDER BY merged DESC LIMIT 15"

# Corporate contributions (5.4s)
pgit sql "SELECT
  CASE
    WHEN author_email LIKE '%%@intel.com' THEN 'Intel'
    WHEN author_email LIKE '%%@redhat.com' THEN 'Red Hat'
    WHEN author_email LIKE '%%@kernel.org' THEN 'kernel.org'
    -- ... (full domain mapping)
  END as org,
  COUNT(*) as commits, COUNT(DISTINCT author_email) as authors
  FROM pgit_commits GROUP BY org ORDER BY commits DESC"

# Fixes: tag evolution (4.5s)
pgit sql "SELECT EXTRACT(YEAR FROM authored_at)::int as year,
  SUM(CASE WHEN message ~* 'Fixes:\s+[0-9a-f]{12}' THEN 1 ELSE 0 END) as fixes_commits,
  COUNT(*) as total
  FROM pgit_commits GROUP BY year ORDER BY year"

# Most-fixed commits (4.9s)
pgit sql "SELECT SUBSTRING(message FROM 'Fixes:\s+([0-9a-f]{12})') as broken_commit,
  COUNT(*) as times_fixed
  FROM pgit_commits
  WHERE message ~* 'Fixes:\s+[0-9a-f]{12}'
  GROUP BY broken_commit ORDER BY times_fixed DESC LIMIT 10"

# Profanity with word boundaries (~5s)
pgit sql "WITH words(word) AS (VALUES ('fuck'),('shit'),('damn'),('stupid'),('crap'),('ugly'),('hack'),('workaround'))
  SELECT w.word, COUNT(*) as cnt
  FROM pgit_commits c, words w
  WHERE c.message ~* ('\y' || w.word || '\y')
  GROUP BY w.word ORDER BY cnt DESC"

# Source code search
pgit search "fuck" --path "*.c" --path "*.h"
pgit search "fuck" --path "*.c" --path "*.h" --all

# Triple reverts (5.7s)
pgit sql "SELECT author_name, authored_at::date, LEFT(message, 200)
  FROM pgit_commits
  WHERE message ILIKE 'Revert \"Revert \"Revert%%'"

# Kent Overstreet trajectory (4.1s)
pgit sql "SELECT EXTRACT(YEAR FROM authored_at)::int as year, COUNT(*)
  FROM pgit_commits WHERE author_name = 'Kent Overstreet'
  GROUP BY year ORDER BY year"

# Weekend warriors (1m 41s)
pgit sql "SELECT author_name, COUNT(*) as weekend_commits,
  (SELECT COUNT(*) FROM pgit_commits c2 WHERE c2.author_name = c.author_name) as total
  FROM pgit_commits c WHERE EXTRACT(DOW FROM authored_at) IN (0, 6)
  GROUP BY author_name ORDER BY weekend_commits DESC LIMIT 15"

Links

pgit can handle the Linux kernel. That was the question I wanted to answer. It imported, it compressed, and it made 20 years of history queryable in seconds.

Since the pgit post, ripgit was built on top of xpatch: a self-hostable git remote backed by Cloudflare Durable Objects with SQLite storage and delta compression. It's wild to see the ecosystem growing.

go install github.com/imgajeed76/pgit/v4/cmd/pgit@latest

Most Similar to pgit: I Imported the Linux Kernel into PostgreSQL

Explore Something Different from pgit: I Imported the Linux Kernel into PostgreSQL