Decision-Aware Memory Cards improve tool-using LLM agents

An arXiv paper titled "Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents" was submitted on 6 June 2026 by Xinyu Guan and two coauthors, per arXiv. According to the abstract, the paper introduces CICL, a "decision-aware context layer" that converts instance evidence into a context graph, routes judgments through a shared eight-field schema, scores evidence units by action shift, outcome uplift, necessity, and negative-transfer risk, and packs high-utility items as typed memory cards for budgeted agents. Empirically, the paper reports that reranking BM25 top-50 candidates with Qwen3.6-plus on 50 SWE-bench Verified file-retrieval instances raises hit@1 from 0.58 to 0.78 and MRR@10 from 0.634 to 0.790. In the authors' additional diagnostics, CICL at budget 120 reaches F1 0.620 on v1 and 0.425 on v3, and removing a top-utility semantic v3 unit collapses F1 to 0.000. The paper frames CICL as a reproducible measurement and selection layer rather than an end-to-end coding-agent repair claim.
What happened
The paper "Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents" was submitted to arXiv on 6 June 2026 by Xinyu Guan and two coauthors. According to the abstract, it proposes CICL, a "decision-aware context layer" that turns instance evidence into a context graph and produces typed memory cards for budgeted agents. The submission reports empirical gains on a file-retrieval task and includes controlled diagnostics and auxiliary agreement checks.
Technical details
Per the arXiv abstract, CICL routes deterministic and model-assisted judgments from several sources (Opus-assisted, Qwen, Codex/GPT-5.5, and Qwen-QLoRA) through a shared eight-field schema. It scores candidate evidence units on four signals, labeled action shift, outcome uplift, necessity, and negative-transfer risk, and packs high-utility evidence into compact "memory cards" for an agent with a context budget. The paper reports that Qwen3.6-plus reranking of BM25 top-50 candidates on 50 SWE-bench Verified file-retrieval instances raises hit@1 from 0.58 to 0.78 and MRR@10 from 0.634 to 0.790, with all 2,500 judgments parseable. In controlled diagnostics at budget 120, CICL reaches F1 0.620 on v1 and 0.425 on v3, and removing the top-utility semantic v3 unit collapses F1 to 0.000. The abstract also notes smaller agreement checks using Qwen-QLoRA, a 200-label Opus-assisted signal, and a three-instance patch smoke test validating the retrieval-to-patch plumbing. It states that RepoBench-R summaries still outperform the cards and that compact rankers do not yet replace the heuristic.
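For readers unfamiliar with the two retrieval metrics, the sketch below shows how hit@k and MRR@k are commonly computed. It assumes the standard definitions (the source does not spell out the paper's exact formulas), and the function names and toy file lists are hypothetical illustrations, not the paper's data or code.

```python
from typing import Sequence, Set, Tuple


def hit_at_k(ranked: Sequence[str], gold: Set[str], k: int = 1) -> float:
    """Return 1.0 if any gold item appears in the top-k of the ranking, else 0.0."""
    return 1.0 if any(item in gold for item in ranked[:k]) else 0.0


def reciprocal_rank_at_k(ranked: Sequence[str], gold: Set[str], k: int = 10) -> float:
    """Return 1/rank of the first gold item within the top-k, or 0.0 if none is found."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in gold:
            return 1.0 / rank
    return 0.0


def evaluate(instances: Sequence[Tuple[Sequence[str], Set[str]]]) -> Tuple[float, float]:
    """Average hit@1 and MRR@10 over (ranking, gold set) pairs."""
    n = len(instances)
    hit1 = sum(hit_at_k(r, g, k=1) for r, g in instances) / n
    mrr10 = sum(reciprocal_rank_at_k(r, g, k=10) for r, g in instances) / n
    return hit1, mrr10


# Toy example (hypothetical file names): the gold file is ranked 2nd, then 1st.
toy = [
    (["a.py", "b.py", "c.py"], {"b.py"}),
    (["x.py", "y.py"], {"x.py"}),
]
hit1, mrr10 = evaluate(toy)
print(f"hit@1={hit1:.2f} MRR@10={mrr10:.2f}")
```

Under these definitions, and if hit@1 is a per-instance average over the 50 instances, the reported move from 0.58 to 0.78 would correspond to 29 versus 39 instances with a gold file ranked first.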
Editorial analysis
Decision-aware selection addresses a common failure mode in retrieval-augmented agents: decisive evidence exists but is not surfaced, or is compressed away, by the time the agent acts. A recurring pattern in agent tooling is that scoring candidates by their downstream effect on the decision, rather than by raw similarity alone, can materially change which evidence guides a tool-using agent's actions. The reported jump in hit@1 and MRR@10 suggests that reranking with a decision-focused signal can improve retrieval precision on task-specific benchmarks, though the paper's own caveats about its limits and about summary baselines that still outperform the cards temper any broad generalization.
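To make the idea of utility-scored, budgeted selection concrete, here is a minimal sketch. The four score fields mirror the signal names in the abstract, but everything else is an assumption made for illustration: the additive utility formula, the `risk_weight` parameter, the greedy utility-per-token ordering, and the toy numbers are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class EvidenceUnit:
    unit_id: str
    tokens: int                    # cost of including this unit in the context
    action_shift: float
    outcome_uplift: float
    necessity: float
    negative_transfer_risk: float


def utility(u: EvidenceUnit, risk_weight: float = 1.0) -> float:
    """Assumed combination: sum the three benefit signals, subtract weighted risk."""
    return (u.action_shift + u.outcome_uplift + u.necessity
            - risk_weight * u.negative_transfer_risk)


def pack_cards(units: Sequence[EvidenceUnit], budget: int) -> List[EvidenceUnit]:
    """Greedily pick positive-utility units by utility per token until the budget is spent."""
    ranked = sorted(units, key=lambda u: utility(u) / max(u.tokens, 1), reverse=True)
    chosen: List[EvidenceUnit] = []
    remaining = budget
    for u in ranked:
        if utility(u) <= 0:
            continue  # skip units whose assumed risk outweighs their benefit
        if u.tokens <= remaining:
            chosen.append(u)
            remaining -= u.tokens
    return chosen


# Hypothetical evidence units with made-up scores.
units = [
    EvidenceUnit("A", 60, 0.5, 0.4, 0.3, 0.1),
    EvidenceUnit("B", 80, 0.2, 0.2, 0.2, 0.0),
    EvidenceUnit("C", 50, 0.3, 0.3, 0.2, 0.1),
    EvidenceUnit("D", 20, 0.0, 0.0, 0.1, 0.5),
]
cards = pack_cards(units, budget=120)
print([c.unit_id for c in cards])
```

A greedy density ordering is only one plausible way to respect a context budget; it is simple and fast but not guaranteed to be optimal, which an exact knapsack solver would be at higher cost.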
What to watch
Practitioners and researchers should watch for: reproduction of the reported metrics on larger, more diverse benchmarks; head-to-head comparisons with dense retrieval and learned rankers; the latency and compute cost of the judgment layer; and how decision-aware selection composes with retrieval-augmented generation pipelines. The arXiv submission frames CICL as a measurement and selection layer rather than a turnkey agent fix.
Scoring Rationale
This is a single arXiv preprint introducing a decision-aware context selection and compression layer (CICL) for tool-using LLM agents, with reported retrieval gains on a small SWE-bench Verified slice. It is relevant to agent-tooling practitioners, but it is an early, narrow measurement-layer contribution rather than an end-to-end agent advance.


