Code Search Ranking: A Benchmark Across 4 Tools and 41 Queries

How good is code search ranking, really? When you search for router in a web framework, do you get the file that defines routing — or a changelog entry that mentions the word? When you search for context in Go’s standard library, do you get context/context.go — or context_test.go?

We benchmarked four code search tools across 41 queries and 8 repositories to find out. The results were stark: across all 41 queries, searchcode returned the correct #1 result 85% of the time, compared to 29% for Tool A. On the 16 queries where all four tools could be compared head-to-head, Tool B reached 63% and Tool C 50%.

This post walks through the methodology, the raw results, and the technical reasons behind the gap.

Methodology

Tools tested

searchcode — BM25-based ranking with code-aware heuristics
Tool A — enterprise code search (public instance)
Tool B — platform-native code search
Tool C — code search (repo filtering available for indexed repos only)

Repositories

We chose well-known open source projects across multiple languages:

Repository	Language	Stars	Why chosen
`golang/go`	Go	135k+	Massive stdlib, deep package hierarchy
`gin-gonic/gin`	Go	80k+	Popular web framework, clear file structure
`expressjs/express`	JavaScript	65k+	Node.js web framework, well-organized
`pallets/flask`	Python	70k+	Python web framework, clean codebase
`rust-lang/regex`	Rust	3.9k	Complex parsing/compilation pipeline
`servo/servo`	Rust	36k+	Browser engine, deep component hierarchy
`jetbrains/kotlin`	Kotlin/Java	50k+	Compiler, massive codebase
`aquasecurity/vuln-list-update`	Go	191	Vulnerability updater, many subpackages

How we judged “correct”

For each query, we defined the expected #1 result before searching: the file a developer would most likely want to find. For router in gin, that’s routergroup.go or gin.go (where routing is implemented), not BENCHMARKS.md or README.md. For context in Go, it’s context/context.go, not context_test.go.

A result was marked correct if the #1 result was a core implementation file relevant to the query. We gave partial credit for results in the right package but wrong file. Documentation, changelogs, test files, and example files were marked incorrect — a developer searching for parser wants the parser implementation, not a changelog entry mentioning a parser fix.
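The judging rule can be approximated in a few lines. This is a minimal sketch, not the harness used for the benchmark (judging was done by hand against an expected file chosen in advance): the `judge` helper, its arguments, and the path patterns are all assumptions made for illustration.

```python
import re

# Hypothetical patterns for files that never count as a correct #1 result.
NON_IMPLEMENTATION = re.compile(
    r"(^|/)(tests?|examples?|docs?)/"   # test, example, and docs directories
    r"|_test\.\w+$"                     # Go/Rust style test files
    r"|\.(md|rst)$",                    # documentation and changelogs
    re.IGNORECASE,
)


def judge(top_result: str, expected: set[str], expected_dirs: set[str]) -> str:
    """Return 'yes', '~' (partial credit), or 'no' for a #1 result."""
    if NON_IMPLEMENTATION.search(top_result):
        return "no"
    if top_result in expected:
        return "yes"
    # Partial credit: right package (directory), wrong file.
    directory = top_result.rpartition("/")[0]
    if directory in expected_dirs:
        return "~"
    return "no"
```

For Go’s context query, with `context/context.go` as the expected file, `context/context_test.go` is judged `no`, and any other non-test file in `context/` is judged `~`.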

Results: Four-Way Comparison

We ran 8 queries across gin-gonic/gin and expressjs/express where all four tools could be compared head-to-head.

gin-gonic/gin

Query	searchcode #1	Tool A #1	Tool C #1	Tool B #1
`router`	`gin.go`	`BENCHMARKS.md`	`routergroup.go`	`routergroup.go`
`context`	`context.go`	`context_test.go`	`context.go`	`context.go`
`middleware`	`gin.go`	`README.md`	`routergroup.go`	`README.md`
`binding`	`binding/binding.go`	`binding_nomsgpack.go`	`context.go`	`binding/binding.go`

expressjs/express

Query	searchcode #1	Tool A #1	Tool C #1	Tool B #1
`router`	`lib/application.js`	`History.md`	`test/Router.js`	`lib/application.js`
`request`	`lib/request.js`	`test/req.xhr.js`	`test/express.static.js`	`lib/request.js`
`response`	`lib/response.js`	`test/res.status.js`	`lib/response.js`	`lib/response.js`
`middleware`	`lib/application.js`	`README.md`	`test/app.use.js`	`examples/route-middleware/index.js`

Four-way scorecard

Tool	Correct	Accuracy
searchcode	8/8	100%
Tool B	6/8	75%
Tool C	3/8	38%
Tool A	0/8	0%

Tool A returned a documentation or test file for seven of the eight queries. The one exception, binding_nomsgpack.go for binding in gin, was still not the expected file.

Results: Four-Way on Large Codebases

We extended the four-way comparison to two much larger repositories: servo/servo (a browser engine in Rust) and jetbrains/kotlin (the Kotlin compiler).

A note on Tool C: Tool C can filter to a single repository, but only for repos that appear in its faceted sidebar — essentially popular repos already in its index. The URL parameter filter[repo] is silently ignored; you must use f.repo= or click from the sidebar. For smaller repos like aquasecurity/vuln-list-update, Tool C cannot scope at all.

servo/servo

Query	searchcode #1	Tool A #1	Tool C #1	Tool B #1
`layout`	`components/layout/layout_impl.rs`	`components/layout/flow/mod.rs`	`components/layout/dom.rs`	`components/layout/flow/float.rs`
`script`	`components/script/script_thread.rs`	`tests/wpt/.../client.py`	`components/script/dom/html/htmlscriptelement.rs`	`components/shared/embedder/user_contents.rs`
`render`	`components/paint/painter.rs`	`tests/wpt/.../serializer.py`	`components/paint/painter.rs`	`components/media/.../render.rs`
`parse`	`components/script/dom/html/htmlimageelement.rs`	`python/servo/try_parser.py`	`components/script/dom/servoparser/async_html.rs`	`python/servo/try_parser.py`

For script, Tool A returned a WebDriver test tool from tests/wpt/ — a third-party Python file completely unrelated to Servo’s script engine. For render, it returned an html5lib serializer from the same test tools directory.

Tool	Correct	Accuracy
searchcode	3/4	75%
Tool C	3/4	75%
Tool B	2/4	50%
Tool A	1/4	25%

Tool C performed well here — htmlscriptelement.rs for script and async_html.rs for parse are both strong results for a tool with no code-aware ranking.

jetbrains/kotlin

Query	searchcode #1	Tool A #1	Tool C #1	Tool B #1
`compiler`	`cli/.../KotlinToJVMBytecodeCompiler.kt`	`repo/gradle-build-conventions/.../ideaExtKotlinDsl.kt`	`compiler/build-tools/.../compat/...`	`plugins/compose/design/compiler-metrics.md`
`parser`	`compiler/psi/parser/.../KDocParser.java`	`kotlin-native/performance/.../JsonParser.kt`	`js/js.parser/.../JavaScriptParserListener.java`	`compiler/psi/parser/.../KDocParser.java`
`type`	`compiler/tests-spec/testData/...`	`wasm/wasm.ir/.../Types.kt`	`core/compiler.common/.../AbstractTypeChecker.kt`	`kotlin-native/runtime/.../Types.h`
`resolve`	`compiler/fir/resolve/.../FirExpressionsResolveTransformer.kt`	`analysis/.../testData/lazyResolve/superTypes.kt`	`analysis/analysis-api/.../KaResolver.kt`	`js/js.ast/.../JsNameRef.java`

The Kotlin compiler is a stress test — 778k matches for type alone. Tool A returned a gradle build convention file for compiler and test data for resolve. Tool B returned a Markdown design doc for compiler. searchcode hit the actual KotlinToJVMBytecodeCompiler.kt but stumbled on type (returning test spec data).

Tool	Correct	Accuracy
searchcode	3/4	75%
Tool C	2/4	50%
Tool B	2/4	50%
Tool A	0/4	0%

aquasecurity/vuln-list-update (3-way, Tool C cannot scope)

Query	searchcode #1	Tool A #1	Tool B #1
`main`	`main.go`	`main.go`	`main.go`
`update`	`redhat/csaf/vex.go`	`cwe/cwe.go`	`nvd/nvd.go`
`fetch`	`redhat/csaf/vex.go`	`utils/utils.go`	`nvd/nvd.go`
`config`	`redhat/csaf/vex.go`	`git/git.go`	`git/git.go`
`debian`	`debian/tracker/debian.go`	`debian/tracker/debian.go`	`README.md`
`alpine`	`alpine/alpine.go`	`alpine-unfixed/alpine_test.go`	`alpine/alpine.go`

For update, fetch, and config, the tools disagreed but each returned a valid implementation file (the only overlap was git/git.go, which both Tool A and Tool B returned for config). These queries are genuinely ambiguous in a repo where every subpackage has its own Update() method and Config struct. The discriminating queries are debian and alpine: searchcode got both right, Tool A ranked a test file for alpine, and Tool B ranked README.md for debian.

Tool	Correct	Accuracy
searchcode	5/6	83%
Tool A	4/6	67%
Tool B	4/6	67%

Results: Deep Dive on golang/go

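The Go standard library is the hardest test case: thousands of packages, many files with overlapping terminology. We tested 7 queries comparing searchcode and Tool A. In the SC and A columns of the tables that follow, “yes” marks a correct #1 result, “~” marks partial credit (right package, wrong file), and “no” marks an incorrect one. The scores count partial-credit results as correct.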

Query	searchcode #1	Tool A #1	SC	A
`sort`	`sort/zsortinterface.go`	`slices/sort.go`	~	~
`mutex`	`runtime/mprof.go`	`cmd/go/internal/lockedfile/mutex.go`	no	~
`context cancel`	`context/context.go`	`context/context.go`	yes	yes
`handler`	`log/slog/handler.go`	(wrong)	yes	no
`scanner`	`go/scanner/scanner.go`	–	yes	–
`http client request`	`net/http/request.go`	`runtime/valgrind_amd64.s`	yes	no
`json marshal`	`html/template/js.go`	`encoding/json/v2/errors.go`	no	no

Score: searchcode 5/7, Tool A 3/7

Notable: for http client request, Tool A returned an assembly file from the runtime (valgrind_amd64.s) — completely unrelated to HTTP.

Results: searchcode vs Tool A (All Repos)

rust-lang/regex (5 queries)

Query	searchcode #1	Tool A #1	SC	A
`parser`	`ast/parse.rs`	`CHANGELOG.md`	yes	no
`compile`	`regex-test/lib.rs`	`regex-test/lib.rs`	~	no
`match`	`dfa/dense.rs`	`regex-test/lib.rs`	~	no
`literal`	`ast/parse.rs`	`nfa/thompson/literal_trie.rs`	no	~
`error`	`ast/parse.rs`	`hir/mod.rs`	~	yes

Score: searchcode 4/5, Tool A 2/5

pallets/flask (5 queries)

Query	searchcode #1	Tool A #1	SC	A
`route`	`sansio/scaffold.py`	`CHANGES.rst`	yes	no
`blueprint`	`sansio/blueprints.py`	`docs/blueprints.rst`	yes	no
`request response`	`app.py`	`app.py`	yes	yes
`template render`	`sansio/scaffold.py`	`docs/tutorial/templates.rst`	yes	no
`config`	`config.py`	`docs/config.rst`	yes	no

Score: searchcode 5/5, Tool A 1/5

expressjs/express (5 queries)

Query	searchcode #1	Tool A #1	SC	A
`router`	`lib/application.js`	`History.md`	yes	no
`middleware`	`lib/application.js`	`README.md`	yes	no
`request`	`lib/request.js`	`test/req.xhr.js`	yes	no
`response`	`lib/response.js`	`test/res.status.js`	yes	no
`view render`	`lib/application.js`	`examples/view-constructor/index.js`	yes	no

Score: searchcode 5/5, Tool A 0/5

Aggregate Scorecard

searchcode vs Tool A (all 41 queries)

Repository	Queries	searchcode	Tool A
golang/go	7	5 (71%)	3 (43%)
rust-lang/regex	5	4 (80%)	2 (40%)
gin-gonic/gin	5	5 (100%)	1 (20%)
pallets/flask	5	5 (100%)	1 (20%)
expressjs/express	5	5 (100%)	0 (0%)
servo/servo	4	3 (75%)	1 (25%)
jetbrains/kotlin	4	3 (75%)	0 (0%)
aquasecurity/vuln-list-update	6	5 (83%)	4 (67%)
Total	41	35 (85%)	12 (29%)

searchcode is about 2.9 times as accurate as Tool A at returning the correct #1 result (35 correct versus 12).
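The totals and the 2.9x figure follow directly from the per-repository rows. A short Python check, with the counts copied from the table above (the variable names are illustrative):

```python
# (queries, searchcode correct, Tool A correct) per repository.
rows = {
    "golang/go": (7, 5, 3),
    "rust-lang/regex": (5, 4, 2),
    "gin-gonic/gin": (5, 5, 1),
    "pallets/flask": (5, 5, 1),
    "expressjs/express": (5, 5, 0),
    "servo/servo": (4, 3, 1),
    "jetbrains/kotlin": (4, 3, 0),
    "aquasecurity/vuln-list-update": (6, 5, 4),
}

total_queries = sum(q for q, _, _ in rows.values())
searchcode_correct = sum(sc for _, sc, _ in rows.values())
tool_a_correct = sum(a for _, _, a in rows.values())

print(total_queries, searchcode_correct, tool_a_correct)  # 41 35 12
print(round(100 * searchcode_correct / total_queries))    # 85
print(round(100 * tool_a_correct / total_queries))        # 29
print(round(searchcode_correct / tool_a_correct, 1))      # 2.9
```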

Four-way comparison (16 queries across gin, express, servo, kotlin)

Tool	Correct	Accuracy
searchcode	14/16	88%
Tool B	10/16	63%
Tool C	8/16	50%
Tool A	1/16	6%
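These totals are the sums of the three four-way scorecards above (gin and express, servo, kotlin). A quick check in Python follows. The `pct` helper is illustrative and rounds halves up, because Python’s built-in `round` uses round-half-to-even and would turn 62.5 into 62.

```python
# Correct #1 results per tool, copied from the three four-way scorecards.
scorecards = [
    # (queries, {tool: correct})
    (8, {"searchcode": 8, "Tool B": 6, "Tool C": 3, "Tool A": 0}),  # gin + express
    (4, {"searchcode": 3, "Tool B": 2, "Tool C": 3, "Tool A": 1}),  # servo
    (4, {"searchcode": 3, "Tool B": 2, "Tool C": 2, "Tool A": 0}),  # kotlin
]


def pct(correct: int, total: int) -> int:
    """Percentage rounded half up, using integer arithmetic only."""
    return (200 * correct + total) // (2 * total)


total_queries = sum(queries for queries, _ in scorecards)
totals: dict[str, int] = {}
for _, correct_by_tool in scorecards:
    for tool, correct in correct_by_tool.items():
        totals[tool] = totals.get(tool, 0) + correct

# Print tools from most to least accurate.
for tool, correct in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tool}: {correct}/{total_queries} = {pct(correct, total_queries)}%")
```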

Why searchcode Wins

searchcode’s ranking advantage comes from five code-aware heuristics layered on top of BM25 text relevance scoring. None is individually complex (the total implementation is roughly 50 lines of code), but together they model what a developer actually wants when searching code.

1. Test dampening

Files matching test patterns (_test.go, *_test.rs, /test/, /tests/, -test/) have their ranking score multiplied by 0.4. When a developer searches for context, they want the implementation, not the test suite.

This single heuristic addresses Tool A’s most common failure mode. Across our benchmark, Tool A’s #1 result was a test file in 6 of 27 queries — including context_test.go for “context” in gin, test/req.xhr.js for “request” in express, and reactiveArray.spec.ts for “reactive” in Vue.
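A minimal sketch of test dampening, using the patterns and the 0.4 multiplier given above. The function name, the regular expression, and the leading-slash normalization (so a top-level `test/` directory also matches `/test/`) are assumptions, not searchcode’s actual code.

```python
import re

TEST_DAMPENING = 0.4  # multiplier stated in the text

# Patterns from the text: _test.go, *_test.rs, /test/, /tests/, -test/.
TEST_PATTERN = re.compile(r"_test\.(go|rs)$|/tests?/|-test/")


def dampen_tests(path: str, score: float) -> float:
    """Scale down the score of files that look like tests."""
    # Assumption: normalize to a leading "/" so "test/x.js" matches "/test/".
    normalized = "/" + path.lstrip("/")
    if TEST_PATTERN.search(normalized):
        return score * TEST_DAMPENING
    return score
```

With equal text relevance, `context_test.go` ends up at 40% of the score of `context.go`, which is enough to push the implementation file above its test suite.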

2. Complexity gravity

Files with higher cyclomatic complexity get a ranking boost. Implementation files are inherently more complex than documentation, configuration, or boilerplate — they contain the actual logic. A file with branching, loops, and error handling is more likely to be what a developer is looking for than a flat list of exports.
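The exact boost formula is not given here, so the following is only one plausible shape. It approximates cyclomatic complexity by counting branching tokens and applies a logarithmic boost. The token list, the `weight` parameter, and both function names are hypothetical.

```python
import math
import re

# Rough stand-in for cyclomatic complexity: 1 + number of branching tokens.
BRANCH_TOKENS = re.compile(r"\b(?:if|elif|for|while|case|catch|except)\b|&&|\|\|")


def approx_complexity(source: str) -> int:
    """Count decision points in source text (a crude, language-agnostic estimate)."""
    return 1 + len(BRANCH_TOKENS.findall(source))


def complexity_boost(score: float, complexity: int, weight: float = 0.1) -> float:
    """Boost the score by a factor that grows slowly with complexity."""
    # log1p keeps the boost gentle: doubling complexity does not double the score.
    return score * (1.0 + weight * math.log1p(complexity))
```

A logarithm is used so that very large files do not dominate purely through size. Any monotonic, saturating function would serve the same purpose.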

3. Noise penalty

The ratio of complexity to file size penalizes large, low-complexity files. Changelogs, READMEs, and JSON configs are typically long but contain minimal logic. This pushes them down in results.

Tool A ranked a documentation or changelog file #1 in 11 of 27 queries: BENCHMARKS.md, README.md (3x), History.md, CHANGELOG.md, CHANGES.rst, docs/blueprints.rst, docs/config.rst, docs/tutorial/templates.rst, docs/doc.md.
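The noise penalty can be sketched as a density check: complexity per line, with a penalty when a large file falls below some target density. The thresholds (`density_target`, `floor`, `min_lines`) and the function name are assumptions chosen for illustration. The text specifies only that the ratio of complexity to size is used.

```python
def noise_penalty(
    score: float,
    complexity: int,
    lines: int,
    density_target: float = 0.02,  # hypothetical: complexity per line considered "normal"
    floor: float = 0.2,            # hypothetical: strongest penalty applied
    min_lines: int = 100,          # hypothetical: short files are left alone
) -> float:
    """Penalize large files whose complexity is low relative to their size."""
    if lines < min_lines:
        return score
    density = complexity / lines
    factor = min(1.0, density / density_target)
    return score * max(floor, factor)
```

Using the figures reported for rust-lang/regex later in this post, `ast/parse.rs` (complexity 304 over 5,497 lines) is untouched, while a 1,000-line changelog with no branching logic keeps only 20% of its score under these assumed thresholds.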

4. Filename boost

When the query term matches the filename stem exactly, the file gets a 1.0 boost. Substring matches get a 0.5 boost. Searching for context boosts context.go. Searching for scanner boosts scanner.go. This is intuitive — if someone names a file router.go, it’s probably the canonical file for routing.
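The two boost values translate into a very small function. This sketch assumes case-insensitive matching and returns only the boost. How the boost is combined with the BM25 score is not specified above, so that step is left out.

```python
from pathlib import PurePosixPath


def filename_boost(path: str, term: str) -> float:
    """Return 1.0 for an exact filename-stem match, 0.5 for a substring match."""
    stem = PurePosixPath(path).stem.lower()  # "context/context.go" -> "context"
    term = term.lower()
    if not term:
        return 0.0
    if stem == term:
        return 1.0
    if term in stem:
        return 0.5
    return 0.0
```

For the query router in gin, `routergroup.go` receives the 0.5 substring boost, while `gin.go` receives none.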

5. Directory name matching

Parent directory names matching the query get an additional boost. For context cancel, the file context/context.go gets a double boost — directory match plus filename match. This handles the common Go pattern of package/package.go.
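Combining the directory and filename signals for a multi-term query might look like the sketch below. The size of the directory boost is not stated above, so `DIRECTORY_BOOST = 0.5` is a placeholder, and matching any ancestor directory (rather than only the immediate parent) is likewise an assumption.

```python
from pathlib import PurePosixPath

DIRECTORY_BOOST = 0.5  # hypothetical value; the text does not give one


def path_boost(path: str, query: str) -> float:
    """Sum directory and filename boosts over every term in the query."""
    parsed = PurePosixPath(path)
    stem = parsed.stem.lower()
    directories = {part.lower() for part in parsed.parent.parts}

    boost = 0.0
    for term in query.lower().split():
        if term in directories:
            boost += DIRECTORY_BOOST
        if stem == term:
            boost += 1.0   # exact filename match
        elif term in stem:
            boost += 0.5   # substring filename match
    return boost
```

For context cancel, `context/context.go` collects both the directory boost and the exact filename boost, while `context/context_test.go` collects the directory boost and only the substring boost.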

The structural advantage

searchcode computes ranking at query time. Every heuristic improvement applies instantly to every query across every indexed repository, with no re-indexing required. Tools that bake ranking signals into their index need to re-index millions of repositories to deploy a ranking change — making iteration on relevance painfully slow.
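The idea can be illustrated with a toy ranker in which the index stores only raw signals and the heuristics are plain functions applied at query time. Everything here (`IndexedFile`, `rank`, the two heuristics and their constants) is a hypothetical sketch of that design, not searchcode’s implementation.

```python
import math
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class IndexedFile:
    path: str
    bm25: float      # text relevance for the current query
    complexity: int  # raw signal stored once, at index time


Heuristic = Callable[[IndexedFile, float], float]


def rank(files: Iterable[IndexedFile], heuristics: list[Heuristic]) -> list[IndexedFile]:
    """Apply every heuristic to the BM25 score at query time, then sort."""
    def final_score(f: IndexedFile) -> float:
        score = f.bm25
        for heuristic in heuristics:
            score = heuristic(f, score)
        return score

    return sorted(files, key=final_score, reverse=True)


def dampen_tests(f: IndexedFile, score: float) -> float:
    return score * 0.4 if "_test." in f.path else score


def boost_complexity(f: IndexedFile, score: float) -> float:
    return score * (1.0 + 0.1 * math.log1p(f.complexity))


files = [
    IndexedFile("context_test.go", bm25=9.0, complexity=40),
    IndexedFile("context.go", bm25=8.0, complexity=60),
]
print([f.path for f in rank(files, [])])
# ['context_test.go', 'context.go']
print([f.path for f in rank(files, [dampen_tests, boost_complexity])])
# ['context.go', 'context_test.go']
```

Changing a constant or adding a heuristic only changes the list passed to `rank`. The stored `IndexedFile` records stay as they are, which is why no re-indexing is needed.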

Why Others Struggle

Each competing tool has a characteristic failure mode:

Tool A: documentation and changelogs

Tool A’s ranking appears to weight raw term frequency heavily. Changelogs mention every feature by name. READMEs describe every module. Documentation references every API. These files contain every keyword — but they’re the last place a developer wants to land when searching for an implementation.

Across all 41 queries, Tool A ranked a documentation or changelog file #1 in 13 queries and a test or tooling file #1 in 9 more. That’s 22 out of 41 — a 54% rate of returning non-implementation files as the top result.

Tool C: inconsistent but improving

Tool C’s results are a mixed bag. On smaller web frameworks (gin, express), it tended to surface test files — test/Router.js for router, test/app.use.js for middleware. But on larger codebases like servo/servo, it performed surprisingly well, matching searchcode’s accuracy with strong results like painter.rs for render and async_html.rs for parse.

Tool C can scope to a single repository, but only for repos in its index. You must use the f.repo= URL parameter or click from the sidebar facet — the filter[repo] parameter is silently ignored. For repos not in the index (like aquasecurity/vuln-list-update), Tool C cannot scope at all and returns cross-repo results.

Tool B: examples and docs

Tool B performed well overall (75% on the gin and express queries, 63% across all 16 four-way queries), but its failures skewed toward example files and documentation. For middleware in gin, it returned README.md. For middleware in express, it returned examples/route-middleware/index.js. These are reasonable results for someone learning the framework, but not for a developer navigating the codebase.

Tool B also requires authentication — you must be signed in to use it.

Repository Coverage

Separately from the ranking benchmark, we checked whether searchcode and Tool A had indexed 9 repositories across multiple hosting platforms:

Repository	searchcode	Tool A
torvalds/linux	yes	yes
anomalyco/opencode	yes	yes
vuejs/core	yes	yes
rust-lang/regex	yes	yes
earthboundkid/requests	yes	yes
boyter/dcd	yes	yes
boyter/pincer	yes	no
golang-io/requests	yes	no
esr/loccount (non-GitHub)	yes	no

Tool A’s public instance indexed 6 of 9 repos (67%). The three failures were smaller repos and a non-GitHub-hosted repo. searchcode indexed all 9 (100%).

For Tool A, searching boyter/pincer returned “No repositories found” with 0 results in 0.01 seconds — the repo simply isn’t in the index. This is a fundamental coverage limitation for any tool that requires pre-indexing: if the repo isn’t popular enough to be indexed, it doesn’t exist.

Beyond Search: code_analyze

searchcode offers structural analysis capabilities that none of the other tools we tested provides. A single code_analyze call returns:

File count, lines of code, and total complexity score
Language breakdown
Top 20 most complex files, ranked
Tech stack detection
Code quality findings with counts
Credential scanning

For example, analyzing rust-lang/regex:

Metric	Value
Files	381
Code lines	127,000
Total complexity	5,512
Languages	220 Rust files
Quality findings	3,588

The most complex files list immediately reveals the architectural core:

File	Complexity	Lines
`ast/parse.rs`	304	5,497
`hir/parse.rs`	234	1,768
`dfa/dense.rs`	221	2,189

For a smaller project like erikbern/git-of-theseus, the analysis reveals the entire architecture at a glance:

File	Complexity	Lines	Role
`analyze.py`	99	540	Core (68% of complexity)
`survival_plot.py`	17	112	Plotting
`line_plot.py`	11	62	Plotting
`stack_plot.py`	11	59	Plotting
`utils.py`	3	13	Helpers

None of the other code search tools we tested offers anything comparable. Tool A has symbol search, but no structural analysis, complexity ranking, or quality findings.

MCP and AI Agent Integration

searchcode exposes its full capabilities through MCP (Model Context Protocol), making it directly usable by AI agents. The comparison with browser-based tools is significant:

Capability	searchcode (MCP)	Browser-based tools
Output format	Structured JSON	HTML (requires parsing)
Code context	Configurable line context	Collapsed matches
Filtering	`lang:`, `path:`, regex, `only-declarations`, `only-comments`, `only-strings`, `only-code`	`lang:`, `type:`, `repo:`
Repo analysis	`code_analyze` (complexity, LOC, tech stack)	None
Auth required	No	Tool B requires sign-in
Repo coverage	Any public git repo	Varies by index

The structural filters deserve special mention. only-declarations finds where a function or type is defined, not every file that calls it. only-comments finds design notes, TODOs, and documentation within code. only-strings finds error messages and user-facing text. These filters have no equivalent in any other tool tested.

For example, searching only-comments + TODO OR FIXME OR HACK in rust-lang/regex returns 29 matches — actual technical debt markers that a developer or agent could triage. No other tool can isolate these without manually filtering results.
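To see why this is awkward without a structural filter, here is a rough local approximation of the same query over Rust source text. It is not how searchcode implements only-comments: it handles only `//` line comments, ignores block comments, and would be fooled by `//` inside a string literal. The `debt_markers` name is hypothetical.

```python
import re

MARKER = re.compile(r"\b(TODO|FIXME|HACK)\b")


def debt_markers(source: str) -> list[tuple[int, str]]:
    """Return (line_number, comment_text) for // comments containing a marker."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Everything after the first "//" is treated as the comment.
        _, separator, comment = line.partition("//")
        if separator and MARKER.search(comment):
            hits.append((number, comment.strip()))
    return hits


sample = (
    'let s = "TODO";\n'
    "// TODO: handle overflow\n"
    "fn f() {} // FIXME later\n"
    "// plain note\n"
)
print(debt_markers(sample))
# [(2, 'TODO: handle overflow'), (3, 'FIXME later')]
```

The string literal on the first line is correctly skipped, which a plain text search for TODO would not do.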

Conclusion

Code search ranking is a solved problem that most tools haven’t solved. Across 41 queries and 8 repositories, searchcode returned the correct #1 result 85% of the time, nearly 3x the rate of Tool A (29%). On the 16 queries where all four tools were compared, it scored 88%, ahead of Tool B (63%) and Tool C (50%). The gap isn’t due to sophisticated machine learning or massive infrastructure. It comes from five simple heuristics that model what developers actually want: implementation files over tests, code over documentation, complex logic over boilerplate, and files and directories whose names match the query.

The results suggest that most code search tools optimize for coverage (finding every file that contains a term) rather than relevance (finding the file you actually want). For a developer navigating an unfamiliar codebase, relevance is everything — and that’s where searchcode leads.