Prithvi Rajan

I built p-grep to answer a simple question: how far can I push fixed-string code search in C++ before I run into the parts of the problem that tools like ripgrep already solve really well?

I wanted a small project with a clear algorithmic core, real benchmark targets, and enough surface area to learn something from implementation details. That led me to Aho-Corasick.

Starting with Aho-Corasick

I started from the classic cp-algorithms version of Aho-Corasick: trie construction, suffix links, then a DFA-style scan over input bytes.

The initial version was intentionally narrow:

fixed-string patterns only
recursive search through a folder
count matching lines first
benchmark against a real repo instead of generated text

That got me a clean baseline. It also made the next lesson obvious: once the algorithm is right, the rest of the runtime is not really "about Aho-Corasick" anymore. It is about files, scheduling, and output.

Making It Search Real Codebases

The first step after the matcher was making it walk a codebase and behave like a practical search tool.

I added recursive directory search, line counting, and later path:line:text output. I also kept the file filtering intentionally simple:

skip hidden names
skip symlinks
skip empty files
skip binary files when a NUL byte shows up
skip common generated directories like .git, node_modules, build, dist, target, and cmake-build-*

I did not implement full .gitignore, .ignore, or .rgignore semantics. That is a real chunk of complexity, and I wanted this project to stay focused on the matcher and the search pipeline.

Threads, Processes, and Bad Work Splits

Once the single-threaded version worked, I added three modes:

single
threads
processes

The first work split was naive: scan all the folders, then distribute folders evenly. Some workers got tiny subtrees, while others got most of the repository, and the timing output made the imbalance obvious.

The second version went one level down the file tree and distributed those folders evenly. Still not great. It was still heuristic-based, and not a solid algorithm.

That pushed me toward a better plan:

walk the tree once
build a list of searchable files and sizes
either assign work greedily for processes, or let threads pull work from a shared queue

That was much better. The thread version usually ended up being the best tradeoff in my runs. Processes were still useful to compare, but they paid extra overhead in spawn, coordination, and output handling.

I also learned that queue granularity matters. Pulling one file at a time created more coordination overhead than necessary. Letting each worker grab 5 files at once helped without making load balance much worse.

The Optimizations That Actually Helped

The most useful changes were not glamorous.

1. Replace `getline` with chunked reads

Line-by-line file reads were expensive. Switching to chunked binary reads and tracking line boundaries manually helped a lot.

That also made it easy to keep the automaton state in the hot loop:

read a chunk
advance the Aho-Corasick state for each byte
mark the line as matched if the state is terminal
reset on newline

2. Inline the binary check

An earlier version did a separate probe read to decide whether the file looked binary, then rewound and read it again. That was wasted work.

Inlining the NUL-byte check into the main read loop was cleaner and faster.

3. Improve the directory walk

std::filesystem::recursive_directory_iterator worked, but it was not especially cheap. Replacing it with a POSIX opendir / readdir / lstat walk trimmed planning time a bit.

Once the scan got faster, planning and file discovery started to matter more.

4. Make the hot loop a little hotter

I added has_output and split hot transition data from colder metadata in the Aho-Corasick representation.

The important point was not that this produced a dramatic speedup. It did not. The point was that the scan loop only really needs two things per byte:

the next state for this byte
whether that state is terminal

Separating that from the colder match metadata made the structure better aligned with the actual access pattern. That meant more cache hits.

Why I Skipped Full `.gitignore`

Too complicated.

Output Is a Different Benchmark

Count-only benchmarks and output-producing benchmarks are not the same workload.

That became obvious once I added path:line:text output.

If I only count matching lines, I am mostly measuring:

file walking
file reading
automaton transitions
worker scheduling

Once I emit actual output, I also pay for:

reconstructing line text
formatting path and line numbers
synchronization between workers
filesystem write overhead

I tried both buffered output and direct line-by-line writes. Direct writes were much more expensive, which is not surprising in hindsight. Output can dominate the result quickly.

I can optimize this way more, but maybe another day. Right now, with printing output, ripgrep is still faster by around 15%.

Benchmark Results

I used three pattern sets against the OpenClaw repository:

Pattern file	Patterns	Matches	Best p-grep	p-grep ms	rg ms
`few-similar.txt`	4	60,211	threads x8	200.87	227.81
`code-search.txt`	33	46,503	threads x8	202.53	223.08
`many-code-search.txt`	250	1,123,976	threads x8	196.58	706.17

The 250-pattern case is especially favorable to Aho-Corasick, but who really needs to search for 250 patterns? Output mode also changes the picture a lot.

Still, the pattern is useful:

the algorithm scales well as the pattern set gets larger
thread scheduling helped more than process-based parallelism in most runs
once the matcher is decent, I/O and file handling dominate a surprising amount of total time

What I'd Change Next

There are still a few obvious next steps:

buffered per-worker output instead of direct line-by-line writes
open / read instead of C++ streams in the hot path
walk the file tree as the trie is being built

Building a Faster Multi-Pattern Code Searcher in C++ with Aho-Corasick