I built p-grep to answer a simple question: how far can I push fixed-string code search in C++ before I run into the parts of the problem that tools like ripgrep already solve really well?
I wanted a small project with a clear algorithmic core, real benchmark targets, and enough surface area to learn something from implementation details. That led me to Aho-Corasick.
Starting with Aho-Corasick
I started from the classic cp-algorithms version of Aho-Corasick: trie construction, suffix links, then a DFA-style scan over input bytes.
The initial version was intentionally narrow:
- fixed-string patterns only
- recursive search through a folder
- count matching lines first
- benchmark against a real repo instead of generated text
That got me a clean baseline. It also made the next lesson obvious: once the algorithm is right, the rest of the runtime is not really "about Aho-Corasick" anymore. It is about files, scheduling, and output.
Making It Search Real Codebases
The first step after the matcher was making it walk a codebase and behave like a practical search tool.
I added recursive directory search, line counting, and later path:line:text output. I also kept the file filtering intentionally simple:
- skip hidden names
- skip symlinks
- skip empty files
- skip binary files when a NUL byte shows up
- skip common generated directories like
.git,node_modules,build,dist,target, andcmake-build-*
I did not implement full .gitignore, .ignore, or .rgignore semantics. That is a real chunk of complexity, and I wanted this project to stay focused on the matcher and the search pipeline.
Threads, Processes, and Bad Work Splits
Once the single-threaded version worked, I added three modes:
- single
- threads
- processes
The first work split was naive: scan all the folders, then distribute folders evenly. Some workers got tiny subtrees, while others got most of the repository, and the timing output made the imbalance obvious.
The second version went one level down the file tree and distributed those folders evenly. Still not great. It was still heuristic-based, and not a solid algorithm.
That pushed me toward a better plan:
- walk the tree once
- build a list of searchable files and sizes
- either assign work greedily for processes, or let threads pull work from a shared queue
That was much better. The thread version usually ended up being the best tradeoff in my runs. Processes were still useful to compare, but they paid extra overhead in spawn, coordination, and output handling.
I also learned that queue granularity matters. Pulling one file at a time created more coordination overhead than necessary. Letting each worker grab 5 files at once helped without making load balance much worse.
The Optimizations That Actually Helped
The most useful changes were not glamorous.
1. Replace getline with chunked reads
Line-by-line file reads were expensive. Switching to chunked binary reads and tracking line boundaries manually helped a lot.
That also made it easy to keep the automaton state in the hot loop:
- read a chunk
- advance the Aho-Corasick state for each byte
- mark the line as matched if the state is terminal
- reset on newline
2. Inline the binary check
An earlier version did a separate probe read to decide whether the file looked binary, then rewound and read it again. That was wasted work.
Inlining the NUL-byte check into the main read loop was cleaner and faster.
3. Improve the directory walk
std::filesystem::recursive_directory_iterator worked, but it was not especially cheap. Replacing it with a POSIX opendir / readdir / lstat walk trimmed planning time a bit.
Once the scan got faster, planning and file discovery started to matter more.
4. Make the hot loop a little hotter
I added has_output and split hot transition data from colder metadata in the Aho-Corasick representation.
The important point was not that this produced a dramatic speedup. It did not. The point was that the scan loop only really needs two things per byte:
- the next state for this byte
- whether that state is terminal
Separating that from the colder match metadata made the structure better aligned with the actual access pattern. That meant more cache hits.
Why I Skipped Full .gitignore
Too complicated.
Output Is a Different Benchmark
Count-only benchmarks and output-producing benchmarks are not the same workload.
That became obvious once I added path:line:text output.
If I only count matching lines, I am mostly measuring:
- file walking
- file reading
- automaton transitions
- worker scheduling
Once I emit actual output, I also pay for:
- reconstructing line text
- formatting path and line numbers
- synchronization between workers
- filesystem write overhead
I tried both buffered output and direct line-by-line writes. Direct writes were much more expensive, which is not surprising in hindsight. Output can dominate the result quickly.
I can optimize this way more, but maybe another day. Right now, with printing output, ripgrep is still faster by around 15%.
Benchmark Results
I used three pattern sets against the OpenClaw repository:
| Pattern file | Patterns | Matches | Best p-grep | p-grep ms | rg ms |
|---|---|---|---|---|---|
few-similar.txt | 4 | 60,211 | threads x8 | 200.87 | 227.81 |
code-search.txt | 33 | 46,503 | threads x8 | 202.53 | 223.08 |
many-code-search.txt | 250 | 1,123,976 | threads x8 | 196.58 | 706.17 |
The 250-pattern case is especially favorable to Aho-Corasick, but who really needs to search for 250 patterns? Output mode also changes the picture a lot.
Still, the pattern is useful:
- the algorithm scales well as the pattern set gets larger
- thread scheduling helped more than process-based parallelism in most runs
- once the matcher is decent, I/O and file handling dominate a surprising amount of total time
What I'd Change Next
There are still a few obvious next steps:
- buffered per-worker output instead of direct line-by-line writes
open/readinstead of C++ streams in the hot path- walk the file tree as the trie is being built