Authors introduce Step Rejection Fine-Tuning for trajectory distillation

Per the arXiv paper "Step Rejection Fine-Tuning: A Practical Distillation Recipe" (submitted May 11, 2026), the authors propose Step Rejection Fine-Tuning (SRFT) to leverage unresolved trajectories that standard Rejection Fine-Tuning (RFT) discards. SRFT uses a critic LLM to evaluate each step in a trajectory, masks the loss for steps judged erroneous while keeping the entire trajectory in the context window, and improves resolution on the SWE-bench Verified benchmark. According to the paper, excluding unresolved trajectories with RFT raises resolution by 2.4%, while SRFT raises it by 3.7%, reaching a total resolution rate of 32.2%. Editorial analysis: This approach converts partial, unresolved traces into usable supervision, which can be valuable for hard, multi-step code generation and reasoning tasks.
What happened
Per the arXiv submission "Step Rejection Fine-Tuning: A Practical Distillation Recipe" (submitted May 11, 2026), the authors introduce Step Rejection Fine-Tuning (SRFT) as a modification of standard Rejection Fine-Tuning (RFT) for training large language models on multi-step agent trajectories. The paper reports that standard RFT, which discards trajectories that fail to resolve their task, improves resolution on SWE-bench Verified by 2.4%, while SRFT improves it by 3.7% and achieves a total resolution rate of 32.2%.
Technical details
Per the paper, SRFT uses a separate critic LLM to assess correctness at the granularity of individual steps within a trajectory. During training the method retains the full trajectory in the model's context window but masks the loss for steps flagged as erroneous, so the model does not learn to reproduce those specific mistakes. The authors evaluate on SWE-bench Verified, a software-engineering-oriented benchmark, and report the numerical improvements cited above.
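The step-level masking described above can be sketched in a few lines. This is a minimal illustration under assumptions, not the authors' released code: the function name `build_srft_labels`, the per-step `(token_ids, critic_ok)` representation, and the critic verdicts themselves are hypothetical, and `-100` is the ignore-index convention common in LM training loops. The key property is that rejected steps remain in the input sequence (so the model conditions on them) but contribute no loss.

```python
# Sketch of SRFT-style selective loss masking (assumed helper names,
# not the paper's implementation). Each step carries a critic verdict;
# the full trajectory stays in context, but tokens of critic-rejected
# steps are excluded from the training loss.

IGNORE_INDEX = -100  # ignore-index convention used by common LM loss functions


def build_srft_labels(steps):
    """steps: list of (token_ids, critic_ok) pairs for one trajectory.

    Returns (input_ids, labels): labels mirror input_ids except that
    tokens from critic-rejected steps are set to IGNORE_INDEX, so the
    loss skips them while the model still attends to them as context.
    """
    input_ids, labels = [], []
    for token_ids, critic_ok in steps:
        input_ids.extend(token_ids)
        if critic_ok:
            labels.extend(token_ids)  # supervise correct steps
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))  # mask erroneous steps
    return input_ids, labels


# Example: a three-step trajectory where the critic rejects step 2.
trajectory = [([1, 2, 3], True), ([4, 5], False), ([6, 7, 8], True)]
ids, labels = build_srft_labels(trajectory)
```

In a typical training loop, `labels` would be passed to a cross-entropy loss with `ignore_index=-100`, so the masked positions neither add loss nor gradients; standard RFT would instead drop the whole trajectory.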
Editorial analysis - technical context
Converting unresolved trajectories into partially supervised training examples by masking loss at erroneous steps echoes selective-loss masking and targeted distillation techniques used elsewhere in model compression and imitation learning. For practitioners, the main practical benefit is more efficient use of generated trajectories: instead of discarding long, partially-correct episodes, selective masking preserves corrective signal while preventing error replication.
Context and significance
Editorial analysis: The paper addresses a practical bottleneck for improving agentic LLM performance on hard, long-horizon tasks where full success is rare. By increasing usable training signal from failed runs, SRFT targets sample-efficiency and robustness of multi-step generation, which are active concerns for research on code generation, theorem proving, and agent frameworks.
What to watch
Editorial analysis: Replicability on other benchmarks and sensitivity to the critic's accuracy are the natural next checks. Observers should look for released code, critic model configurations, and ablations showing how masking thresholds and critic errors affect downstream generalization.
Scoring Rationale
This is a notable methodological contribution improving how unresolved trajectories are used for LLM fine-tuning, with measurable gains on a software-engineering benchmark. The advance is incremental rather than paradigm-shifting but relevant to practitioners working on multi-step generation and agent training.
