OpenSeeker-v2 Claims SOTA Search Agent Results
A purely academic research team claims to have matched and surpassed the results of industrial-scale AI training pipelines using a fraction of the data and none of the complexity. OpenSeeker-v2, a 30B-parameter search agent trained on just 10.6k data samples using supervised fine-tuning alone, has posted state-of-the-art scores across four deep search benchmarks, outperforming Tongyi DeepResearch, which was built on a far heavier continual pre-training, supervised fine-tuning, and reinforcement learning stack.
What OpenSeeker-v2 Scored and Who It Beat
According to the paper, OpenSeeker-v2 scored 46.0% on BrowseComp, 58.1% on BrowseComp-ZH, 34.6% on Humanity's Last Exam, and 78.0% on xbench. Tongyi DeepResearch, the industrial baseline it was measured against, posted 43.4%, 46.7%, 32.9%, and 75.0% on the same four tests.
The results position OpenSeeker-v2 as the first SOTA search agent at the 30B scale within the ReAct paradigm to emerge from an academic team relying solely on SFT. That distinction matters because it challenges the assumption that cutting-edge search agent performance requires massive compute budgets and multi-stage training pipelines.
Three Data Changes That Drove the Gap
The paper attributes the performance gains to three targeted modifications in how training trajectories were synthesized, rather than any architectural overhaul. The team scaled up knowledge graph size to encourage richer multi-hop exploration, expanded the available tool set to give the agent broader functional range, and applied strict low-step filtering to eliminate trajectories that resolved too easily.
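Of the three changes, low-step filtering is the most concrete to sketch. A minimal illustration of the idea, assuming trajectories are stored as lists of role-tagged turns: drop any synthesized trajectory that resolves in too few tool-call steps, since easy trajectories teach little multi-hop reasoning. All names and the threshold below are illustrative, not the paper's actual implementation.

```python
# Hypothetical sketch of low-step filtering for synthesized trajectories.
# A trajectory is a list of turns; turns with role "tool_call" count as steps.
MIN_STEPS = 5  # assumed cutoff; the paper's exact threshold is not stated here


def count_tool_steps(trajectory):
    """Count the tool-call turns in a ReAct-style trajectory."""
    return sum(1 for turn in trajectory if turn.get("role") == "tool_call")


def filter_low_step(trajectories, min_steps=MIN_STEPS):
    """Keep only trajectories that required at least `min_steps` tool calls."""
    return [t for t in trajectories if count_tool_steps(t) >= min_steps]


# Example: a 2-step trajectory is dropped, a 6-step one is kept.
easy = [{"role": "tool_call"}] * 2 + [{"role": "answer"}]
hard = [{"role": "tool_call"}] * 6 + [{"role": "answer"}]
kept = filter_low_step([easy, hard])
```

The filter enforces difficulty after synthesis, which composes naturally with the other two changes: a larger knowledge graph and a broader tool set make hard trajectories more likely to be generated in the first place.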
Together, these changes produced training data that is both more informative and harder to game: a curriculum designed to force the model to practice the kind of extended, multi-step reasoning that deep search benchmarks actually test. The paper's framing is direct: high-difficulty trajectories, not pipeline complexity, are what move the needle for supervised fine-tuning in this domain.
Supporting that, the model was built with a 256k context window and supports up to 200 tool calls per trajectory, giving it the headroom to handle long-horizon search tasks without truncation or forced shortcuts.
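To make the 200-call budget concrete, here is a minimal sketch of a ReAct-style agent loop with a hard cap on tool calls. The `model_step` and `run_tool` callables are hypothetical stand-ins for the model and tool layer, not the released implementation; only the budget figure comes from the article.

```python
# Illustrative ReAct-style loop capped at a fixed tool-call budget.
# On each turn the model either issues a tool call or emits a final answer.
MAX_TOOL_CALLS = 200  # the per-trajectory budget mentioned above


def react_loop(question, model_step, run_tool, max_calls=MAX_TOOL_CALLS):
    history = [{"role": "user", "content": question}]
    for _ in range(max_calls):
        action = model_step(history)  # hypothetical: returns a tool call or an answer
        if action["type"] == "answer":
            return action["content"]
        observation = run_tool(action["tool"], action["args"])
        history.append({"role": "tool_call", "content": action})
        history.append({"role": "observation", "content": observation})
    return None  # budget exhausted without a final answer
```

The loop terminates either when the model answers or when the budget runs out; combined with a 256k context window, long multi-hop searches can finish without truncating history or forcing an early answer.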
Why This Matters for Deep Search Research
The gap between academic and industrial AI development in agentic domains has been widening. Industrial labs can stack pre-training runs, proprietary data pipelines, and reinforcement learning phases that most university teams simply cannot replicate. OpenSeeker-v2 is a direct challenge to that gap, specifically in the deep search agent space.
By open-sourcing both the model weights and the findings, the team is making a case that carefully constructed SFT data, not scale or training complexity, may be the primary lever for advancing search agent capabilities at this model size. That has practical implications for researchers and developers who want to build capable agents without enterprise-level infrastructure.
The paper was published on May 5 and quickly surfaced as the second most-discussed paper on Hugging Face the following day. Model weights are available through the PolarSeeker organization. Full methodology, benchmark results, and open-source access are detailed in the paper page on Hugging Face.