arxiv:2601.07641

Beyond Static Tools: Test-Time Tool Evolution for Scientific Reasoning

Published on Jan 12 · Submitted by Jiaxuan Lu on Jan 16 · shanghai ailab

Authors: Jiaxuan Lu, Ziyu Kong, Haiyuan Wan, Cheng Yang, Lilong Wang, Yankai Jiang, Dongzhan Zhou

Abstract

AI-generated summary: Test-Time Tool Evolution enables AI agents to dynamically create and refine computational tools during inference, overcoming limitations of static tool libraries in scientific applications.

The central challenge of AI for Science is not reasoning alone, but the ability to create computational methods in an open-ended scientific world. Existing LLM-based agents rely on static, pre-defined tool libraries, a paradigm that fundamentally fails in scientific domains where tools are sparse, heterogeneous, and intrinsically incomplete. In this paper, we propose Test-Time Tool Evolution (TTE), a new paradigm that enables agents to synthesize, verify, and evolve executable tools during inference. By transforming tools from fixed resources into problem-driven artifacts, TTE overcomes the rigidity and long-tail limitations of static tool libraries. To facilitate rigorous evaluation, we introduce SciEvo, a benchmark comprising 1,590 scientific reasoning tasks supported by 925 automatically evolved tools. Extensive experiments show that TTE achieves state-of-the-art performance in both accuracy and tool efficiency, while enabling effective cross-domain adaptation of computational tools. The code and benchmark have been released at https://github.com/lujiaxuan0520/Test-Time-Tool-Evol.
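The synthesize, verify, and evolve cycle described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `Tool`, `verify`, `toy_synthesize`, and `solve` are hypothetical, the LLM call is replaced by a hard-coded list of drafts (the first deliberately buggy), and verification is assumed to mean checking a candidate against a few known input/output pairs. It only shows how a tool library might grow at inference time instead of being fixed in advance.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Check = Tuple[tuple, float]  # (arguments, expected result)


@dataclass
class Tool:
    """A tool stored as source code so it can be inspected and rewritten."""
    name: str
    source: str
    uses: int = 0


def compile_tool(tool: Tool) -> Callable:
    """Execute the tool's source in a fresh namespace and return the function."""
    namespace: dict = {}
    exec(tool.source, namespace)
    return namespace[tool.name]


def verify(tool: Tool, checks: List[Check], tol: float = 1e-9) -> bool:
    """Assumed verification step: the tool must reproduce known results."""
    try:
        fn = compile_tool(tool)
        return all(abs(fn(*args) - expected) <= tol for args, expected in checks)
    except Exception:
        return False


def toy_synthesize(task: str, attempt: int) -> str:
    """Hypothetical stand-in for an LLM: draft 0 is wrong, draft 1 is fixed."""
    drafts = {
        "kinetic_energy": [
            "def kinetic_energy(m, v):\n    return m * v ** 2\n",
            "def kinetic_energy(m, v):\n    return 0.5 * m * v ** 2\n",
        ]
    }
    candidates = drafts[task]
    return candidates[min(attempt, len(candidates) - 1)]


def solve(task: str, args: tuple, checks: List[Check],
          library: Dict[str, Tool],
          synthesize: Callable[[str, int], str] = toy_synthesize,
          max_attempts: int = 3) -> float:
    """Reuse a stored tool if one exists; otherwise synthesize until verified."""
    tool = library.get(task)
    if tool is None:
        for attempt in range(max_attempts):
            candidate = Tool(task, synthesize(task, attempt))
            if verify(candidate, checks):
                library[task] = candidate  # the library grows at test time
                tool = candidate
                break
        if tool is None:
            raise RuntimeError(f"no verified tool for task {task!r}")
    tool.uses += 1
    return compile_tool(tool)(*args)


if __name__ == "__main__":
    library: Dict[str, Tool] = {}
    checks = [((2.0, 3.0), 9.0)]
    print(solve("kinetic_energy", (4.0, 1.0), checks, library))
```

In this sketch the library starts empty, the first draft fails verification, the second passes and is stored, and later calls for the same task reuse the stored tool rather than synthesizing again.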

Community

Blue-Giant (paper author, paper submitter) · 1 day ago

🎉 Introducing Test-Time Tool Evolution (TTE) — Beyond Static Tools: Test-Time Tool Evolution for Scientific Reasoning (arXiv:2601.07641)!

Why it matters: Scientific problems don’t come with a complete tool library. TTE lets agents synthesize → validate → evolve executable tools at inference time, turning tools from “fixed resources” into problem-driven, self-improving artifacts.

✅ Code + dataset are fully open-sourced.