FinTrace Evaluates LLM Tool-Calling for Financial Tasks

FinTrace delivers a trajectory-level benchmark and training dataset for long-horizon financial tool-calling. The benchmark contains 800 expert-annotated trajectories across 34 real-world financial categories and evaluates models on a rubric of nine metrics organized into four axes: action correctness, execution efficiency, process quality, and output quality. Evaluation of 13 LLMs finds reliable tool selection but weak information utilization and poor final answer quality. The authors release FinTrace-Training, an 8,196-trajectory preference dataset, and show that supervised fine-tuning plus DPO on Qwen-3.5-9B improves intermediate reasoning metrics and suppresses failure modes, yet end-to-end answer quality remains the main bottleneck. The work targets practitioners building tool-augmented agents for finance and supplies both diagnostic metrics and a path toward trajectory-level preference tuning.
What happened
FinTrace introduces a trajectory-level benchmark and training corpus for evaluating LLM tool-calling in long-horizon financial tasks. The benchmark includes 800 expert-annotated trajectories spanning 34 real-world financial task categories and a rubric of nine metrics grouped into four axes to capture action-level and outcome-level behavior. The authors evaluate 13 LLMs and find that while models often choose correct tools, they fail to reliably use tool outputs to produce high-quality final answers. The paper also releases FinTrace-Training, a trajectory-level preference dataset with 8,196 curated trajectories, and demonstrates fine-tuning of Qwen-3.5-9B with supervised fine-tuning then DPO to improve intermediate reasoning metrics.
Technical details
The rubric splits evaluation into four axes to measure different failure modes. The four axes are:
- Action correctness: whether the model calls the right tools at the right time
- Execution efficiency: how economically the model sequences calls and avoids redundant actions
- Process quality: the model's intermediate reasoning and use of tool outputs
- Output quality: final answer accuracy and completeness
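The four axes above can be sketched as a simple scoring record. This is a minimal illustration, not FinTrace's actual implementation: the paper's nine underlying metrics and their weighting are not specified here, so the per-axis fields and the unweighted mean are assumptions.

```python
from dataclasses import dataclass


@dataclass
class TrajectoryScores:
    """Hypothetical per-axis scores for one trajectory, each in [0, 1]."""
    action_correctness: float    # right tools at the right time
    execution_efficiency: float  # economical sequencing, no redundant calls
    process_quality: float       # intermediate reasoning, use of tool outputs
    output_quality: float        # final answer accuracy and completeness


def aggregate(s: TrajectoryScores) -> float:
    """Unweighted mean over the four axes (illustrative choice only)."""
    axes = (s.action_correctness, s.execution_efficiency,
            s.process_quality, s.output_quality)
    return sum(axes) / len(axes)


# A profile matching the paper's headline finding: strong tool selection,
# weaker information use, poor final answers.
traj = TrajectoryScores(0.9, 0.7, 0.5, 0.4)
print(round(aggregate(traj), 3))  # 0.625
```

Keeping the axes separate, rather than collapsing them immediately, is what lets the benchmark distinguish "chose the right tool" from "used its output well".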
The authors annotate full trajectories rather than isolated calls, enabling trajectory-level preference labels. They construct FinTrace-Training with pairwise preferences over tool-augmented contexts and apply supervised fine-tuning followed by DPO on Qwen-3.5-9B. DPO notably reduces common failure modes, improving intermediate metrics more than raw end-to-end answer quality.
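The trajectory-level preference pairs described above are consumed by the standard DPO objective. A minimal sketch of that loss on a single chosen/rejected pair follows; the log-probabilities here would come from the policy and a frozen reference model (e.g. the SFT checkpoint), and the beta value is an assumed placeholder, not a value reported by the paper.

```python
import math


def dpo_loss(logp_chosen_policy: float, logp_rejected_policy: float,
             logp_chosen_ref: float, logp_rejected_ref: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair:
    -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))),
    where pi/ref are log-probs of the chosen (c) and rejected (r)
    trajectories under the policy and the frozen reference model."""
    margin = beta * ((logp_chosen_policy - logp_chosen_ref)
                     - (logp_rejected_policy - logp_rejected_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# When the policy raises the chosen trajectory's log-prob relative to the
# reference (and lowers the rejected one's), the loss is smaller:
good = dpo_loss(-10.0, -12.0, -11.0, -11.0)  # policy prefers chosen
bad = dpo_loss(-12.0, -10.0, -11.0, -11.0)   # policy prefers rejected
print(good < bad)  # True
```

In practice the log-probabilities are summed over whole trajectories rather than single responses, which is what makes this "trajectory-level" preference tuning as opposed to turn-level DPO.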
Context and significance
Tool calling and agentic LLM workflows are converging on long-horizon, multi-step domains such as finance, where chaining external tools is necessary but error-prone. Existing benchmarks focused on call-level correctness and short scenarios; FinTrace fills the gap by measuring trajectory-level reasoning and process fidelity. The dataset and preference pairs make this a practical resource for researchers and engineers who need to align agent behavior over extended sequences rather than single API calls. The finding that trajectory-level gains do not fully translate into final answer improvements highlights a recurring problem: bridging process improvements and end-to-end utility remains a core research frontier.
What to watch
Apply trajectory-level diagnostics when developing tool-augmented agents and consider DPO over curated trajectories to suppress brittle behaviors. The crucial open question is how to propagate intermediate reasoning gains into consistent final-answer quality, a research direction likely to attract follow-up work and alternative alignment techniques.
Scoring Rationale
FinTrace addresses a practical and under-evaluated issue: trajectory-level tool-calling in finance, supplying a sizable benchmark and preference dataset. The work is notable for practitioners building agentic systems, and the empirical finding that DPO improves process metrics but not final answer quality identifies a meaningful research gap. Freshness adjustment applied.

