5,000 Relevance Judgments a Week, Zero Human Annotators

1 min read
Work — classifieds marketplace100 queries × ~50 results = ~5,000 judgments weekly · precision tracked over time

We had the classic search-quality blind spot on the marketplace I work on: the database gave us the top-100 user queries per month, but nobody could say how good the results for those queries actually were. Relevance judgment is manual, boring, and expensive — so historically it just didn't happen.

Now an AI agent does it weekly.

The setup

Once a week, a computer-use agent (OpenClaw):

  1. Opens the site like a regular visitor.
  2. Runs all 100 top queries through the real search UI — the same path users take, not an internal API.
  3. Collects roughly 50 listings per query.
  4. Judges every listing for relevance to the query.
  5. Aggregates precision and related quality metrics per query and overall.

That's ~5,000 relevance judgments a week. As a manual annotation effort that's multiple people doing joyless work indefinitely; as an agent run, it's a scheduled job.

What it changed

We can finally see the search engine. Some query classes turned out to be served well; others were quietly bad in ways nobody had reported — users don't file tickets about mediocre result lists, they just leave.

Prioritization got evidence. "Improve search" became "these twelve query patterns have the worst precision and the highest volume — start there." That's a roadmap conversation, not a vibes conversation.

Quality became a time series. Because the run repeats weekly on the same query set, precision is now trackable across releases. A ranking change that degrades results shows up in the next run — not in a quarterly complaint pattern.

What I learned

Evals aren't just for LLM products. The eval mindset — fixed query set, repeatable judgments, metrics over time — maps perfectly onto classic search quality, and AI agents make it affordable for the first time.

Going through the real UI matters. The agent sees what users see: the ranking, the ads, the layout quirks. An internal-API harness would measure a system that users never actually experience.

Scale changes what's worth measuring. 5,000 judgments a week is beyond any manual process a mid-size product team would sustain. Once judgments are effectively free, you stop sampling and start measuring.

Originally discussed (in Russian) on my Telegram channel, where I share product and AI cases from work and side projects.