---
title: "Code Search Ranking: A Benchmark Across 4 Tools and 41 Queries"
date: 2026-03-16
description: "We benchmarked four code search tools across 41 queries and 8 repositories. searchcode returned the correct #1 result 85% of the time. Here's how and why."
author: "Ben Boyter"
tags: ["ranking","benchmark","code-search"]
---


How good is code search ranking, really? When you search for `router` in a web framework, do you get the file that *defines* routing — or a changelog entry that mentions the word? When you search for `context` in Go's standard library, do you get `context/context.go` — or `context_test.go`?

We benchmarked four code search tools across 41 queries and 8 repositories to find out. The results were stark: **searchcode returned the correct #1 result 85% of the time** across all 41 queries, compared to 29% for Tool A. On the 16 queries where all four tools could be compared head-to-head, Tool B scored 63% and Tool C 50%.

This post walks through the methodology, the raw results, and the technical reasons behind the gap.

## Methodology

### Tools tested

- **[searchcode](https://searchcode.com)** — BM25-based ranking with code-aware heuristics
- **Tool A** — enterprise code search (public instance)
- **Tool B** — platform-native code search
- **Tool C** — code search (repo filtering available for indexed repos only)

### Repositories

We chose well-known open source projects across multiple languages:

| Repository | Language | Stars | Why chosen |
|-----------|----------|-------|------------|
| `golang/go` | Go | 135k+ | Massive stdlib, deep package hierarchy |
| `gin-gonic/gin` | Go | 80k+ | Popular web framework, clear file structure |
| `expressjs/express` | JavaScript | 65k+ | Node.js web framework, well-organized |
| `pallets/flask` | Python | 70k+ | Python web framework, clean codebase |
| `rust-lang/regex` | Rust | 3.9k | Complex parsing/compilation pipeline |
| `servo/servo` | Rust | 36k+ | Browser engine, deep component hierarchy |
| `jetbrains/kotlin` | Kotlin/Java | 50k+ | Compiler, massive codebase |
| `aquasecurity/vuln-list-update` | Go | 191 | Vulnerability updater, many subpackages |

### How we judged "correct"

For each query, we defined the expected #1 result before searching: the file a developer would most likely want to find. For `router` in gin, that's `routergroup.go` or `gin.go` (where routing is implemented), not `BENCHMARKS.md` or `README.md`. For `context` in Go, it's `context/context.go`, not `context_test.go`.

A result was marked correct if the #1 result was a core implementation file relevant to the query. We gave partial credit for results in the right package but wrong file, and the per-repository scores below count partial-credit results as correct. Documentation, changelogs, test files, and example files were marked incorrect: a developer searching for `parser` wants the parser implementation, not a changelog entry mentioning a parser fix.
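
The judging was done by hand, but the rubric can be written down as a small function. This is only a sketch: `judge`, `is_non_implementation`, and the marker lists are illustrative names, and the markers cover just the kinds of files mentioned in this post.

```python
from pathlib import PurePosixPath

# Hypothetical markers for files the rubric always rejects
# (documentation, changelogs, tests, examples).
NON_IMPL_SUFFIXES = (".md", ".rst", "_test.go")
NON_IMPL_DIRS = {"test", "tests", "docs", "examples"}


def is_non_implementation(path: str) -> bool:
    """True for docs, changelogs, tests, and example files."""
    lowered = path.lower()
    parent_dirs = PurePosixPath(lowered).parts[:-1]
    return lowered.endswith(NON_IMPL_SUFFIXES) or any(
        part in NON_IMPL_DIRS for part in parent_dirs
    )


def judge(top_result: str, expected_files: set, expected_package: str) -> str:
    """Return 'yes', '~' (partial credit), or 'no' for a #1 result."""
    if is_non_implementation(top_result):
        return "no"
    if top_result in expected_files:
        return "yes"
    # Right package, wrong file: partial credit.
    if str(PurePosixPath(top_result).parent) == expected_package:
        return "~"
    return "no"
```

For example, with `context/context.go` as the expected file, `context/context_test.go` is rejected even though it sits in the right package, because test files are always marked incorrect.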

## Results: Four-Way Comparison

We ran 8 queries across gin-gonic/gin and expressjs/express where all four tools could be compared head-to-head.

### gin-gonic/gin

| Query | searchcode #1 | Tool A #1 | Tool C #1 | Tool B #1 |
|-------|--------------|-----------------|-------------|-----------|
| `router` | `gin.go` | `BENCHMARKS.md` | `routergroup.go` | `routergroup.go` |
| `context` | `context.go` | `context_test.go` | `context.go` | `context.go` |
| `middleware` | `gin.go` | `README.md` | `routergroup.go` | `README.md` |
| `binding` | `binding/binding.go` | `binding_nomsgpack.go` | `context.go` | `binding/binding.go` |

### expressjs/express

| Query | searchcode #1 | Tool A #1 | Tool C #1 | Tool B #1 |
|-------|--------------|-----------------|-------------|-----------|
| `router` | `lib/application.js` | `History.md` | `test/Router.js` | `lib/application.js` |
| `request` | `lib/request.js` | `test/req.xhr.js` | `test/express.static.js` | `lib/request.js` |
| `response` | `lib/response.js` | `test/res.status.js` | `lib/response.js` | `lib/response.js` |
| `middleware` | `lib/application.js` | `README.md` | `test/app.use.js` | `examples/route-middleware/index.js` |

### Four-way scorecard

| Tool | Correct | Accuracy |
|------|---------|----------|
| **searchcode** | **8/8** | **100%** |
| Tool B | 6/8 | 75% |
| Tool C | 3/8 | 38% |
| Tool A | 0/8 | 0% |

Tool A returned a documentation or test file for seven of the eight queries. The exception was `binding` in gin, where it returned `binding_nomsgpack.go` rather than `binding/binding.go`.

## Results: Four-Way on Large Codebases

We extended the four-way comparison to two much larger repositories: servo/servo (a browser engine in Rust) and jetbrains/kotlin (the Kotlin compiler).

**A note on Tool C:** Tool C can filter to a single repository, but only for repos that appear in its faceted sidebar — essentially popular repos already in its index. The URL parameter `filter[repo]` is silently ignored; you must use `f.repo=` or click from the sidebar. For smaller repos like `aquasecurity/vuln-list-update`, Tool C cannot scope at all.

### servo/servo

| Query | searchcode #1 | Tool A #1 | Tool C #1 | Tool B #1 |
|-------|--------------|-----------------|-------------|-----------|
| `layout` | `components/layout/layout_impl.rs` | `components/layout/flow/mod.rs` | `components/layout/dom.rs` | `components/layout/flow/float.rs` |
| `script` | `components/script/script_thread.rs` | `tests/wpt/.../client.py` | `components/script/dom/html/htmlscriptelement.rs` | `components/shared/embedder/user_contents.rs` |
| `render` | `components/paint/painter.rs` | `tests/wpt/.../serializer.py` | `components/paint/painter.rs` | `components/media/.../render.rs` |
| `parse` | `components/script/dom/html/htmlimageelement.rs` | `python/servo/try_parser.py` | `components/script/dom/servoparser/async_html.rs` | `python/servo/try_parser.py` |

For `script`, Tool A returned a WebDriver test tool from `tests/wpt/` — a third-party Python file completely unrelated to Servo's script engine. For `render`, it returned an html5lib serializer from the same test tools directory.

| Tool | Correct | Accuracy |
|------|---------|----------|
| **searchcode** | **3/4** | **75%** |
| Tool C | 3/4 | 75% |
| Tool B | 2/4 | 50% |
| Tool A | 1/4 | 25% |

Tool C performed well here — `htmlscriptelement.rs` for `script` and `async_html.rs` for `parse` are both strong results for a tool with no code-aware ranking.

### jetbrains/kotlin

| Query | searchcode #1 | Tool A #1 | Tool C #1 | Tool B #1 |
|-------|--------------|-----------------|-------------|-----------|
| `compiler` | `cli/.../KotlinToJVMBytecodeCompiler.kt` | `repo/gradle-build-conventions/.../ideaExtKotlinDsl.kt` | `compiler/build-tools/.../compat/...` | `plugins/compose/design/compiler-metrics.md` |
| `parser` | `compiler/psi/parser/.../KDocParser.java` | `kotlin-native/performance/.../JsonParser.kt` | `js/js.parser/.../JavaScriptParserListener.java` | `compiler/psi/parser/.../KDocParser.java` |
| `type` | `compiler/tests-spec/testData/...` | `wasm/wasm.ir/.../Types.kt` | `core/compiler.common/.../AbstractTypeChecker.kt` | `kotlin-native/runtime/.../Types.h` |
| `resolve` | `compiler/fir/resolve/.../FirExpressionsResolveTransformer.kt` | `analysis/.../testData/lazyResolve/superTypes.kt` | `analysis/analysis-api/.../KaResolver.kt` | `js/js.ast/.../JsNameRef.java` |

The Kotlin compiler is a stress test — 778k matches for `type` alone. Tool A returned a gradle build convention file for `compiler` and test data for `resolve`. Tool B returned a Markdown design doc for `compiler`. searchcode hit the actual `KotlinToJVMBytecodeCompiler.kt` but stumbled on `type` (returning test spec data).

| Tool | Correct | Accuracy |
|------|---------|----------|
| **searchcode** | **3/4** | **75%** |
| Tool C | 2/4 | 50% |
| Tool B | 2/4 | 50% |
| Tool A | 0/4 | 0% |

### aquasecurity/vuln-list-update (3-way, Tool C cannot scope)

| Query | searchcode #1 | Tool A #1 | Tool B #1 |
|-------|--------------|-----------------|-----------|
| `main` | `main.go` | `main.go` | `main.go` |
| `update` | `redhat/csaf/vex.go` | `cwe/cwe.go` | `nvd/nvd.go` |
| `fetch` | `redhat/csaf/vex.go` | `utils/utils.go` | `nvd/nvd.go` |
| `config` | `redhat/csaf/vex.go` | `git/git.go` | `git/git.go` |
| `debian` | `debian/tracker/debian.go` | `debian/tracker/debian.go` | `README.md` |
| `alpine` | `alpine/alpine.go` | `alpine-unfixed/alpine_test.go` | `alpine/alpine.go` |

For `update`, `fetch`, and `config`, the tools mostly disagreed, and each returned a plausible implementation file. These queries are genuinely ambiguous in a repo where every subpackage has its own `Update()` method and `Config` struct. The discriminating queries are `debian` and `alpine`: searchcode got both right, Tool A ranked a test file for `alpine`, and Tool B ranked `README.md` for `debian`.

| Tool | Correct | Accuracy |
|------|---------|----------|
| **searchcode** | **5/6** | **83%** |
| Tool A | 4/6 | 67% |
| Tool B | 4/6 | 67% |

## Results: Deep Dive on golang/go

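The Go standard library is the hardest test case: thousands of packages, many files with overlapping terminology. We tested 7 queries comparing searchcode and Tool A. In this and the following head-to-head tables, the `SC` and `A` columns score searchcode and Tool A respectively: `yes` is correct, `~` is partial credit, and `no` is incorrect.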

| Query | searchcode #1 | Tool A #1 | SC | A |
|-------|--------------|-----------------|----|----|
| `sort` | `sort/zsortinterface.go` | `slices/sort.go` | ~  | ~  |
| `mutex` | `runtime/mprof.go` | `cmd/go/internal/lockedfile/mutex.go` | no | ~  |
| `context cancel` | `context/context.go` | `context/context.go` | yes | yes |
| `handler` | `log/slog/handler.go` | *(wrong)* | yes | no |
| `scanner` | `go/scanner/scanner.go` | -- | yes | -- |
| `http client request` | `net/http/request.go` | `runtime/valgrind_amd64.s` | yes | no |
| `json marshal` | `html/template/js.go` | `encoding/json/v2/errors.go` | no | no |

**Score: searchcode 5/7, Tool A 3/7**

Notable: for `http client request`, Tool A returned an assembly file from the runtime (`valgrind_amd64.s`) — completely unrelated to HTTP.

## Results: searchcode vs Tool A (All Repos)

### rust-lang/regex (5 queries)

| Query | searchcode #1 | Tool A #1 | SC | A |
|-------|--------------|-----------------|----|----|
| `parser` | `ast/parse.rs` | `CHANGELOG.md` | yes | no |
| `compile` | `regex-test/lib.rs` | `regex-test/lib.rs` | ~ | no |
| `match` | `dfa/dense.rs` | `regex-test/lib.rs` | ~ | no |
| `literal` | `ast/parse.rs` | `nfa/thompson/literal_trie.rs` | no | ~ |
| `error` | `ast/parse.rs` | `hir/mod.rs` | ~ | yes |

**Score: searchcode 4/5, Tool A 2/5**

### pallets/flask (5 queries)

| Query | searchcode #1 | Tool A #1 | SC | A |
|-------|--------------|-----------------|----|----|
| `route` | `sansio/scaffold.py` | `CHANGES.rst` | yes | no |
| `blueprint` | `sansio/blueprints.py` | `docs/blueprints.rst` | yes | no |
| `request response` | `app.py` | `app.py` | yes | yes |
| `template render` | `sansio/scaffold.py` | `docs/tutorial/templates.rst` | yes | no |
| `config` | `config.py` | `docs/config.rst` | yes | no |

**Score: searchcode 5/5, Tool A 1/5**

### expressjs/express (5 queries)

| Query | searchcode #1 | Tool A #1 | SC | A |
|-------|--------------|-----------------|----|----|
| `router` | `lib/application.js` | `History.md` | yes | no |
| `middleware` | `lib/application.js` | `README.md` | yes | no |
| `request` | `lib/request.js` | `test/req.xhr.js` | yes | no |
| `response` | `lib/response.js` | `test/res.status.js` | yes | no |
| `view render` | `lib/application.js` | `examples/view-constructor/index.js` | yes | no |

**Score: searchcode 5/5, Tool A 0/5**

## Aggregate Scorecard

### searchcode vs Tool A (all 41 queries)

| Repository | Queries | searchcode | Tool A |
|-----------|---------|------------|-------------|
| golang/go | 7 | 5 (71%) | 3 (43%) |
| rust-lang/regex | 5 | 4 (80%) | 2 (40%) |
| gin-gonic/gin | 5 | 5 (100%) | 1 (20%) |
| pallets/flask | 5 | 5 (100%) | 1 (20%) |
| expressjs/express | 5 | 5 (100%) | 0 (0%) |
| servo/servo | 4 | 3 (75%) | 1 (25%) |
| jetbrains/kotlin | 4 | 3 (75%) | 0 (0%) |
| aquasecurity/vuln-list-update | 6 | 5 (83%) | 4 (67%) |
| **Total** | **41** | **35 (85%)** | **12 (29%)** |

searchcode is about **2.9x as accurate** as Tool A at returning the correct #1 result (35 vs 12 of 41 queries).
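
The totals can be re-derived from the per-repository rows. The snippet below only reproduces that arithmetic from the table; the variable names are ours.

```python
# (queries, searchcode correct, Tool A correct), copied from the table above.
rows = {
    "golang/go": (7, 5, 3),
    "rust-lang/regex": (5, 4, 2),
    "gin-gonic/gin": (5, 5, 1),
    "pallets/flask": (5, 5, 1),
    "expressjs/express": (5, 5, 0),
    "servo/servo": (4, 3, 1),
    "jetbrains/kotlin": (4, 3, 0),
    "aquasecurity/vuln-list-update": (6, 5, 4),
}

total_queries = sum(q for q, _, _ in rows.values())
searchcode_correct = sum(sc for _, sc, _ in rows.values())
tool_a_correct = sum(a for _, _, a in rows.values())

searchcode_pct = round(100 * searchcode_correct / total_queries)
tool_a_pct = round(100 * tool_a_correct / total_queries)
ratio = round(searchcode_correct / tool_a_correct, 1)

print(total_queries, searchcode_correct, tool_a_correct)  # 41 35 12
print(searchcode_pct, tool_a_pct, ratio)                  # 85 29 2.9
```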

### Four-way comparison (16 queries across gin, express, servo, kotlin)

| Tool | Correct | Accuracy |
|------|---------|----------|
| **searchcode** | **14/16** | **88%** |
| Tool B | 10/16 | 63% |
| Tool C | 8/16 | 50% |
| Tool A | 1/16 | 6% |

## Why searchcode Wins

searchcode's ranking advantage comes from a handful of code-aware heuristics layered on top of BM25 text relevance scoring. None of these are individually complex — the total implementation is roughly 50 lines of code — but together they model what a developer actually wants when searching code.

### 1. Test dampening

Files matching test patterns (`_test.go`, `*_test.rs`, `/test/`, `/tests/`, `-test/`) have their ranking score multiplied by 0.4. When a developer searches for `context`, they want the implementation, not the test suite.

This single heuristic addresses one of Tool A's most common failure modes. Across all 41 queries, Tool A's #1 result was a test or tooling file in 9 cases, including `context_test.go` for `context` in gin and `test/req.xhr.js` for `request` in express.
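
A minimal sketch of test dampening, assuming the patterns are checked as plain suffix and substring matches on the file path. The patterns and the 0.4 multiplier come from the description above; the function names and the matching details are our assumptions, not searchcode's actual code.

```python
TEST_DAMPENING = 0.4  # multiplier stated in the text


def looks_like_test(path: str) -> bool:
    """Match the test patterns listed above against a repo-relative path."""
    # Prefix a slash so a leading "test/" directory also matches "/test/".
    normalized = "/" + path.lower()
    return (
        normalized.endswith(("_test.go", "_test.rs"))
        or "/test/" in normalized
        or "/tests/" in normalized
        or "-test/" in normalized
    )


def dampen(score: float, path: str) -> float:
    """Scale down the ranking score of files that look like tests."""
    return score * TEST_DAMPENING if looks_like_test(path) else score
```

With this in place, a test file needs a text score more than 2.5x higher than the implementation file to outrank it.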

### 2. Complexity gravity

Files with higher cyclomatic complexity get a ranking boost. Implementation files are inherently more complex than documentation, configuration, or boilerplate — they contain the actual logic. A file with branching, loops, and error handling is more likely to be what a developer is looking for than a flat list of exports.
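
The post does not give the exact formula, so the following is only one plausible shape for such a boost: a logarithmic multiplier, so that very complex files do not swamp text relevance. The function name, the `weight` parameter, and the use of a logarithm are all assumptions.

```python
import math


def complexity_boost(complexity: int, weight: float = 0.1) -> float:
    """Hypothetical multiplier that grows slowly with cyclomatic complexity.

    A file with zero complexity (for example a changelog) gets 1.0,
    meaning no boost at all.
    """
    return 1.0 + weight * math.log1p(max(complexity, 0))
```

A logarithm keeps the ordering (more complex files get a larger boost) while flattening the difference between, say, a complexity of 200 and 300.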

### 3. Noise penalty

The ratio of complexity to file size penalizes large, low-complexity files. Changelogs, READMEs, and JSON configs are typically long but contain minimal logic. This pushes them down in results.

Across all 41 queries, Tool A ranked a documentation or changelog file #1 in 13, including `BENCHMARKS.md`, `README.md` (3x), `History.md`, `CHANGELOG.md`, `CHANGES.rst`, `docs/blueprints.rst`, `docs/config.rst`, `docs/tutorial/templates.rst`, and `docs/doc.md`.
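
One way to turn a complexity-to-size ratio into a penalty is sketched below. The thresholds (`target_density`, `floor`) and the linear ramp are illustrative assumptions; only the idea of penalizing large, low-complexity files comes from the text.

```python
def noise_factor(complexity: int, lines: int,
                 target_density: float = 0.02, floor: float = 0.5) -> float:
    """Hypothetical multiplier in [floor, 1.0] based on complexity per line.

    Files at or above target_density are left alone; files with no logic
    at all are scaled down to `floor`.
    """
    if lines <= 0:
        return 1.0
    density = complexity / lines
    if density >= target_density:
        return 1.0
    return floor + (1.0 - floor) * (density / target_density)
```

Under these assumed thresholds, a 2,000-line changelog with zero complexity is scaled to 0.5, while a dense implementation file is untouched.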

### 4. Filename boost

When the query term matches the filename stem exactly, the file gets a 1.0 boost. Substring matches get a 0.5 boost. Searching for `context` boosts `context.go`. Searching for `scanner` boosts `scanner.go`. This is intuitive — if someone names a file `router.go`, it's probably the canonical file for routing.
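
A sketch of the filename boost, using the 1.0 and 0.5 values from the text. We assume "substring match" means the query term appears inside the filename stem, and that multi-word queries are checked term by term; neither detail is stated in the post.

```python
from pathlib import PurePosixPath


def filename_boost(query: str, path: str) -> float:
    """Return 1.0 for an exact stem match, 0.5 for a substring match, else 0.0."""
    stem = PurePosixPath(path).stem.lower()
    best = 0.0
    for term in query.lower().split():
        if term == stem:
            return 1.0  # exact match wins immediately
        if term in stem:
            best = 0.5
    return best
```

So `context` gives `context.go` the full boost, `context_test.go` only half, and `gin.go` nothing.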

### 5. Directory name matching

Parent directory names matching the query get an additional boost. For `context cancel`, the file `context/context.go` gets a double boost — directory match plus filename match. This handles the common Go pattern of `package/package.go`.
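
Directory matching can be sketched the same way. The post does not say how large this boost is, so `per_match=0.5` is a placeholder, as is the function name.

```python
from pathlib import PurePosixPath


def directory_boost(query: str, path: str, per_match: float = 0.5) -> float:
    """Return a boost when any parent directory name equals a query term."""
    parent_dirs = {part.lower() for part in PurePosixPath(path).parts[:-1]}
    terms = set(query.lower().split())
    return per_match if parent_dirs & terms else 0.0
```

For `context cancel`, `context/context.go` picks up this boost in addition to any filename boost, while `net/http/request.go` gets neither.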

### The structural advantage

searchcode computes ranking at query time. Every heuristic improvement applies instantly to every query across every indexed repository, with no re-indexing required. Tools that bake ranking signals into their index need to re-index millions of repositories to deploy a ranking change — making iteration on relevance painfully slow.
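
The difference is easiest to see in code. In the sketch below the index stores only a text relevance score, and the heuristic is an ordinary function applied when the query runs, so swapping it changes the ranking without touching the index. The `Hit` type, the function names, and the BM25 values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Hit:
    path: str
    bm25: float  # text relevance score as read from the index


def rerank(hits: Iterable[Hit], adjust: Callable[[Hit], float]) -> List[Hit]:
    """Sort hits by an adjusted score computed at query time."""
    return sorted(hits, key=adjust, reverse=True)


def text_only(hit: Hit) -> float:
    return hit.bm25


def with_test_dampening(hit: Hit) -> float:
    # Same index data, different ranking rule.
    return hit.bm25 * 0.4 if hit.path.endswith("_test.go") else hit.bm25


hits = [Hit("context_test.go", 9.0), Hit("context.go", 8.0)]
print(rerank(hits, text_only)[0].path)            # context_test.go
print(rerank(hits, with_test_dampening)[0].path)  # context.go
```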

## Why Others Struggle

Each competing tool has a characteristic failure mode:

### Tool A: documentation and changelogs

Tool A's ranking appears to weight raw term frequency heavily. Changelogs mention every feature by name. READMEs describe every module. Documentation references every API. These files contain every keyword — but they're the *last* place a developer wants to land when searching for an implementation.

Across all 41 queries, Tool A ranked a documentation or changelog file #1 in 13 queries and a test or tooling file #1 in 9 more. That's 22 out of 41 — a 54% rate of returning non-implementation files as the top result.

### Tool C: inconsistent but improving

Tool C's results are a mixed bag. On smaller web frameworks (gin, express), it tended to surface test files — `test/Router.js` for `router`, `test/app.use.js` for `middleware`. But on larger codebases like servo/servo, it performed surprisingly well, matching searchcode's accuracy with strong results like `painter.rs` for `render` and `async_html.rs` for `parse`.

Tool C can scope to a single repository, but only for repos in its index. You must use the `f.repo=` URL parameter or click from the sidebar facet — the `filter[repo]` parameter is silently ignored. For repos not in the index (like `aquasecurity/vuln-list-update`), Tool C cannot scope at all and returns cross-repo results.

### Tool B: examples and docs

Tool B performed well overall (75% on the gin and express comparison, 63% across all 16 four-way queries), but its failures skewed toward example files and documentation. For `middleware` in gin, it returned `README.md`. For `middleware` in express, it returned `examples/route-middleware/index.js`. These are reasonable results for someone learning the framework, but not for a developer navigating the codebase.

Tool B also requires authentication — you must be signed in to use it.

## Repository Coverage

We tested 9 repositories across multiple hosting platforms:

| Repository | searchcode | Tool A |
|-----------|-----------|-------------|
| torvalds/linux | yes | yes |
| anomalyco/opencode | yes | yes |
| vuejs/core | yes | yes |
| rust-lang/regex | yes | yes |
| earthboundkid/requests | yes | yes |
| boyter/dcd | yes | yes |
| boyter/pincer | yes | **no** |
| golang-io/requests | yes | **no** |
| esr/loccount (non-GitHub) | yes | **no** |

Tool A's public instance indexed **6 of 9 repos (67%)**. The three failures were smaller repos and a non-GitHub-hosted repo. searchcode indexed all 9 (**100%**).

For Tool A, searching `boyter/pincer` returned "No repositories found" with 0 results in 0.01 seconds — the repo simply isn't in the index. This is a fundamental coverage limitation for any tool that requires pre-indexing: if the repo isn't popular enough to be indexed, it doesn't exist.

## Beyond Search: code_analyze

searchcode offers structural analysis capabilities that none of the other tools we tested provides. A single `code_analyze` call returns:

- File count, lines of code, and total complexity score
- Language breakdown
- Top 20 most complex files, ranked
- Tech stack detection
- Code quality findings with counts
- Credential scanning

For example, analyzing `rust-lang/regex`:

| Metric | Value |
|--------|-------|
| Files | 381 |
| Code lines | 127,000 |
| Total complexity | 5,512 |
| Languages | 220 Rust files |
| Quality findings | 3,588 |

The most complex files list immediately reveals the architectural core:

| File | Complexity | Lines |
|------|-----------|-------|
| `ast/parse.rs` | 304 | 5,497 |
| `hir/parse.rs` | 234 | 1,768 |
| `dfa/dense.rs` | 221 | 2,189 |

For a smaller project like `erikbern/git-of-theseus`, the analysis reveals the entire architecture at a glance:

| File | Complexity | Lines | Role |
|------|-----------|-------|------|
| `analyze.py` | 99 | 540 | Core (68% of complexity) |
| `survival_plot.py` | 17 | 112 | Plotting |
| `line_plot.py` | 11 | 62 | Plotting |
| `stack_plot.py` | 11 | 59 | Plotting |
| `utils.py` | 3 | 13 | Helpers |

None of the other tools we tested offers anything comparable. Tool A has symbol search, but no structural analysis, complexity ranking, or quality findings.

## MCP and AI Agent Integration

searchcode exposes its full capabilities through MCP (Model Context Protocol), making it directly usable by AI agents. The comparison with browser-based tools is significant:

| Capability | searchcode (MCP) | Browser-based tools |
|-----------|------------------|-------------------|
| Output format | Structured JSON | HTML (requires parsing) |
| Code context | Configurable line context | Collapsed matches |
| Filtering | `lang:`, `path:`, regex, `only-declarations`, `only-comments`, `only-strings`, `only-code` | `lang:`, `type:`, `repo:` |
| Repo analysis | `code_analyze` (complexity, LOC, tech stack) | None |
| Auth required | No | Tool B requires sign-in |
| Repo coverage | Any public git repo | Varies by index |
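
For readers unfamiliar with MCP, a tool invocation is a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds one for the `code_analyze` tool. The envelope follows the MCP specification, but the argument name `repository` is a placeholder of ours, not searchcode's documented schema.

```python
import json


def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> dict:
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# "repository" is a hypothetical argument name used for illustration.
request = build_tool_call(
    1, "code_analyze", {"repository": "https://github.com/rust-lang/regex"}
)
print(json.dumps(request, indent=2))
```

An agent sends this message over its MCP transport and receives a structured result, so there is no HTML to scrape.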

The structural filters deserve special mention. `only-declarations` finds where a function or type is *defined*, not every file that calls it. `only-comments` finds design notes, TODOs, and documentation within code. `only-strings` finds error messages and user-facing text. These filters have no equivalent in any other tool tested.

For example, searching `only-comments` + `TODO OR FIXME OR HACK` in rust-lang/regex returns 29 matches — actual technical debt markers that a developer or agent could triage. No other tool can isolate these without manually filtering results.

## Conclusion

Code search ranking is a solved problem that most tools haven't solved. Across 41 queries and 8 repositories, searchcode returned the correct #1 result 85% of the time, nearly 3x the rate of Tool A (29%). On the 16 queries where all four tools were compared, searchcode scored 88%, ahead of Tool B (63%) and Tool C (50%). The gap isn't due to sophisticated machine learning or massive infrastructure. It comes from five simple heuristics that model what developers actually want: implementation files over tests, code over documentation, complex logic over boilerplate, and files whose names and directories match the query.

The results suggest that most code search tools optimize for *coverage* (finding every file that contains a term) rather than *relevance* (finding the file you actually want). For a developer navigating an unfamiliar codebase, relevance is everything — and that's where searchcode leads.

