FinTrace Evaluates LLM Tool-Calling for Financial Tasks

FinTrace delivers a trajectory-level benchmark and training dataset for long-horizon financial tool-calling. The benchmark contains 800 expert-annotated trajectories across 34 real-world financial categories and evaluates models on a rubric of nine metrics organized into four axes: action correctness, execution efficiency, process quality, and output quality. Evaluation of 13 LLMs finds reliable tool selection but weak information utilization and poor final answer quality. The authors release FinTrace-Training, an 8,196-trajectory preference dataset, and show that supervised fine-tuning plus DPO on Qwen-3.5-9B improves intermediate reasoning metrics and suppresses failure modes, yet end-to-end answer quality remains the main bottleneck. The work targets practitioners building tool-augmented agents for finance and supplies both diagnostic metrics and a path toward trajectory-level preference tuning.
What happened
FinTrace introduces a trajectory-level benchmark and training corpus for evaluating LLM tool-calling in long-horizon financial tasks. The benchmark includes 800 expert-annotated trajectories spanning 34 real-world financial task categories and a rubric of nine metrics grouped into four axes to capture action-level and outcome-level behavior. The authors evaluate 13 LLMs and find that while models often choose correct tools, they fail to reliably use tool outputs to produce high-quality final answers. The paper also releases FinTrace-Training, a trajectory-level preference dataset with 8,196 curated trajectories, and demonstrates fine-tuning of Qwen-3.5-9B with supervised fine-tuning then DPO to improve intermediate reasoning metrics.
Technical details
The rubric splits evaluation into four axes to measure different failure modes. The four axes are:
- Action correctness: whether the model calls the right tools at the right time
- Execution efficiency: how economically the model sequences calls and avoids redundant actions
- Process quality: the model's intermediate reasoning and use of tool outputs
- Output quality: final answer accuracy and completeness
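The four axes above can be sketched as a simple scoring record. This is a minimal illustration, not FinTrace's actual implementation: the paper's nine underlying metrics and their weighting are not specified here, so the per-axis fields and the unweighted mean are assumptions.

```python
from dataclasses import dataclass


@dataclass
class TrajectoryScores:
    """Hypothetical per-axis scores for one trajectory, each in [0, 1]."""
    action_correctness: float    # right tools at the right time
    execution_efficiency: float  # economical sequencing, no redundant calls
    process_quality: float       # intermediate reasoning, use of tool outputs
    output_quality: float        # final answer accuracy and completeness


def aggregate(s: TrajectoryScores) -> float:
    """Unweighted mean over the four axes (illustrative choice only)."""
    axes = (s.action_correctness, s.execution_efficiency,
            s.process_quality, s.output_quality)
    return sum(axes) / len(axes)


# A profile matching the paper's headline finding: strong tool selection,
# weaker information use, poor final answers.
traj = TrajectoryScores(0.9, 0.7, 0.5, 0.4)
print(round(aggregate(traj), 3))  # 0.625
```

Keeping the axes separate, rather than collapsing them immediately, is what lets the benchmark distinguish "chose the right tool" from "used its output well".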
The authors annotate full trajectories rather than isolated calls, enabling trajectory-level preference labels. They construct FinTrace-Training with pairwise preferences over tool-augmented contexts and apply supervised fine-tuning followed by DPO on Qwen-3.5-9B. DPO notably reduces common failure modes, improving intermediate metrics more than raw end-to-end answer quality.
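The trajectory-level preference pairs described above are consumed by the standard DPO objective. A minimal sketch of that loss on a single chosen/rejected pair follows; the log-probabilities here would come from the policy and a frozen reference model (e.g. the SFT checkpoint), and the beta value is an assumed placeholder, not a value reported by the paper.

```python
import math


def dpo_loss(logp_chosen_policy: float, logp_rejected_policy: float,
             logp_chosen_ref: float, logp_rejected_ref: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair:
    -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))),
    where pi/ref are log-probs of the chosen (c) and rejected (r)
    trajectories under the policy and the frozen reference model."""
    margin = beta * ((logp_chosen_policy - logp_chosen_ref)
                     - (logp_rejected_policy - logp_rejected_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# When the policy raises the chosen trajectory's log-prob relative to the
# reference (and lowers the rejected one's), the loss is smaller:
good = dpo_loss(-10.0, -12.0, -11.0, -11.0)  # policy prefers chosen
bad = dpo_loss(-12.0, -10.0, -11.0, -11.0)   # policy prefers rejected
print(good < bad)  # True
```

In practice the log-probabilities are summed over whole trajectories rather than single responses, which is what makes this "trajectory-level" preference tuning as opposed to turn-level DPO.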
Context and significance
Tool calling and agentic LLM workflows are converging on long-horizon, multi-step domains such as finance, where chaining external tools is necessary but error-prone. Existing benchmarks focused on call-level correctness and short scenarios; FinTrace fills the gap by measuring trajectory-level reasoning and process fidelity. The dataset and preference pairs make this a practical resource for researchers and engineers who need to align agent behavior over extended sequences rather than single API calls. The finding that trajectory-level gains do not fully translate into final answer improvements highlights a recurring problem: bridging process improvements and end-to-end utility remains a core research frontier.
What to watch
Apply trajectory-level diagnostics when developing tool-augmented agents and consider DPO over curated trajectories to suppress brittle behaviors. The crucial open question is how to propagate intermediate reasoning gains into consistent final-answer quality, a research direction likely to attract follow-up work and alternative alignment techniques.
Scoring Rationale
FinTrace addresses a practical and under-evaluated issue: trajectory-level tool-calling in finance, supplying a sizable benchmark and preference dataset. The work is notable for practitioners building agentic systems, and the empirical finding that DPO improves process metrics but not final answer quality identifies a meaningful research gap. Freshness adjustment applied.

