We had the classic search-quality blind spot on the marketplace I work on: the database gave us the top-100 user queries per month, but nobody could say how good the results for those queries actually were. Relevance judgment is manual, boring, and expensive — so historically it just didn't happen.
Now an AI agent does it weekly.
The setup
Once a week, a computer-use agent (OpenClaw):
- Opens the site like a regular visitor.
- Runs all 100 top queries through the real search UI — the same path users take, not an internal API.
- Collects roughly 50 listings per query.
- Judges every listing for relevance to the query.
- Aggregates precision and related quality metrics per query and overall.
That's ~5,000 relevance judgments a week (100 queries × roughly 50 listings each). As a manual annotation effort that's multiple people doing joyless work indefinitely; as an agent run, it's a scheduled job.
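A minimal sketch of the aggregation step, to make "precision per query and overall" concrete. It assumes each judgment boils down to a query, a rank, and a yes/no relevance verdict. The `Judgment` record, the function names, and the sample data are illustrative assumptions, not the actual pipeline or anything OpenClaw provides.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One agent verdict (hypothetical record shape)."""
    query: str
    rank: int        # 1-based position of the listing in the result list
    relevant: bool   # the agent's relevance verdict for this listing


def precision_by_query(judgments, k=50):
    """Precision@k per query: relevant listings / judged listings within the top k."""
    hits = defaultdict(int)
    seen = defaultdict(int)
    for j in judgments:
        if j.rank <= k:
            seen[j.query] += 1
            hits[j.query] += int(j.relevant)
    return {q: hits[q] / seen[q] for q in seen}


def macro_precision(per_query):
    """Unweighted mean over queries, so every query counts equally."""
    return sum(per_query.values()) / len(per_query) if per_query else 0.0


# Made-up sample data, only to show the shape of the computation.
sample = [
    Judgment("red sofa", 1, True),
    Judgment("red sofa", 2, False),
    Judgment("red sofa", 3, True),
    Judgment("red sofa", 4, True),
    Judgment("iphone case", 1, True),
    Judgment("iphone case", 2, False),
]
per_query = precision_by_query(sample)
overall = macro_precision(per_query)
```

The overall number here is a macro average. Weighting each query by its traffic is an equally reasonable choice and answers a slightly different question: how good search is for the average search, rather than for the average query.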
What it changed
We can finally see the search engine. Some query classes turned out to be served well; others were quietly bad in ways nobody had reported — users don't file tickets about mediocre result lists, they just leave.
Prioritization got evidence. "Improve search" became "these twelve query patterns have the worst precision and the highest volume — start there." That's a roadmap conversation, not a vibes conversation.
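One way to turn "worst precision and highest volume" into a sortable number is to estimate how many searches end in bad results. The sketch below scores individual queries and does not group them into patterns. The `volume * (1 - precision)` score, the function name, and the sample numbers are assumptions for illustration, not the formula we actually used.

```python
def rank_by_impact(stats, top_n=12):
    """Rank queries by volume * (1 - precision).

    stats maps query -> (monthly_volume, precision). The score is a rough
    proxy for how much traffic lands on irrelevant results.
    """
    scored = [(query, volume * (1.0 - precision))
              for query, (volume, precision) in stats.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Made-up numbers, only to show the ordering.
stats = {
    "red sofa": (4000, 0.75),     # decent precision, medium volume
    "iphone case": (9000, 0.90),  # high volume, but served well
    "kids bike": (1500, 0.20),    # low volume, very poor precision
}
worklist = rank_by_impact(stats, top_n=2)
```

In this example the low-volume but badly served query outranks the high-volume query that already works well, which is the kind of ordering that is hard to reach by intuition alone.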
Quality became a time series. Because the run repeats weekly on the same query set, precision is now trackable across releases. A ranking change that degrades results shows up in the next weekly run, instead of surfacing a quarter later as a pattern in user complaints.
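Because the query set is fixed, comparing two consecutive runs is a dictionary diff. A minimal sketch, assuming each run is stored as a query-to-precision mapping. The function name and the 0.10 threshold are hypothetical, and the threshold in particular would need tuning against how much the agent's verdicts vary between runs.

```python
def regressions(previous, current, min_drop=0.10):
    """Return (query, previous_precision, current_precision) for every query
    whose precision fell by at least min_drop between two runs, worst first."""
    flagged = []
    for query, prev in previous.items():
        cur = current.get(query)
        if cur is None:
            continue  # query missing from this run; nothing to compare
        if prev - cur >= min_drop:
            flagged.append((query, prev, cur))
    flagged.sort(key=lambda row: row[1] - row[2], reverse=True)
    return flagged


# Made-up weekly snapshots.
last_week = {"red sofa": 0.75, "iphone case": 0.5, "kids bike": 0.25}
this_week = {"red sofa": 0.5, "iphone case": 0.5, "kids bike": 0.5}
flagged = regressions(last_week, this_week)
```

Here only "red sofa" is flagged: "iphone case" is unchanged and "kids bike" improved.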
What I learned
Evals aren't just for LLM products. The eval mindset (fixed query set, repeatable judgments, metrics over time) maps directly onto classic search quality, and AI agents make it affordable for the first time.
Going through the real UI matters. The agent sees what users see: the ranking, the ads, the layout quirks. An internal-API harness would measure a system that users never actually experience.
Scale changes what's worth measuring. 5,000 judgments a week is beyond any manual process a mid-size product team would sustain. Once each judgment costs close to nothing, you stop spot-checking a handful of results and start judging every result for every tracked query.
Originally discussed (in Russian) on my Telegram channel, where I share product and AI cases from work and side projects.