Decision-Aware Memory Cards improve tool-using LLM agents
An arXiv paper titled "Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents" was submitted on 6 June 2026 by Xinyu Guan and two coauthors. According to the arXiv abstract, the paper introduces CICL, a "decision-aware context layer" that converts instance evidence into a context graph, routes judgments through a shared eight-field schema, scores evidence units by action shift, outcome uplift, necessity, and negative-transfer risk, and packs high-utility items as typed memory cards for budgeted agents. Empirically, the paper reports that Qwen3.6-plus reranking of BM25 top-50 candidates on 50 SWE-bench Verified file-retrieval instances raises hit@1 from 0.58 to 0.78 and MRR@10 from 0.634 to 0.790. In additional diagnostics, the authors report that CICL reaches F1 0.620 on v1 and 0.425 on v3 at budget 120, and that removing a top-utility semantic v3 unit collapses F1 to 0.000. The paper frames CICL as a reproducible measurement and selection layer rather than an end-to-end coding-agent repair claim, per arXiv.
What happened
The paper was submitted to arXiv on 6 June 2026 by Xinyu Guan and two coauthors. It proposes CICL, a "decision-aware context layer" that turns instance evidence into a context graph and produces typed memory cards for budgeted agents, according to the arXiv abstract. The submission reports empirical gains on a file-retrieval task and includes controlled diagnostics and auxiliary agreement checks, per arXiv.
Technical details
Per the arXiv abstract, CICL routes deterministic and model-assisted judgments through a shared eight-field schema, with model sources including Opus-assisted labels, Qwen, Codex/GPT-5.5, and Qwen-QLoRA. It scores candidate evidence units on four signals (action shift, outcome uplift, necessity, and negative-transfer risk) and packs high-utility evidence into compact "memory cards" for an agent with a context budget. The paper reports that Qwen3.6-plus reranking of BM25 top-50 candidates on 50 SWE-bench Verified file-retrieval instances raises hit@1 from 0.58 to 0.78 and MRR@10 from 0.634 to 0.790, with all 2,500 judgments parseable. In controlled diagnostics at budget 120, CICL reaches F1 0.620 on v1 and 0.425 on v3, and removing the top-utility semantic v3 unit collapses F1 to 0.000. The abstract also notes smaller agreement checks using Qwen-QLoRA, a 200-label Opus-assisted signal, and a three-instance patch smoke test that validates the retrieval-to-patch plumbing. It further states that RepoBench-R summaries still outperform the cards and that compact rankers do not yet replace the heuristic, per arXiv.
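The abstract does not say how the four scores are combined or how cards are packed, so the following is only a minimal sketch of the general idea under stated assumptions. The `EvidenceUnit` fields, the equal-weight `utility` formula, and the greedy utility-per-token packing in `pack_cards` are all hypothetical choices made for illustration, not the paper's method, and the scores are made-up numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceUnit:
    """Hypothetical evidence unit; field names are illustrative assumptions."""
    unit_id: str
    tokens: int                    # cost against the context budget (must be > 0)
    action_shift: float
    outcome_uplift: float
    necessity: float
    negative_transfer_risk: float


def utility(unit: EvidenceUnit, risk_weight: float = 1.0) -> float:
    # Assumed combination: sum the three positive signals, subtract weighted risk.
    return (unit.action_shift + unit.outcome_uplift + unit.necessity
            - risk_weight * unit.negative_transfer_risk)


def pack_cards(units: list[EvidenceUnit], budget: int) -> list[EvidenceUnit]:
    """Greedily pick units by utility per token until the budget is used up."""
    ranked = sorted(units, key=lambda u: utility(u) / u.tokens, reverse=True)
    chosen, remaining = [], budget
    for unit in ranked:
        if utility(unit) <= 0:
            continue  # skip units whose risk outweighs their benefit
        if unit.tokens <= remaining:
            chosen.append(unit)
            remaining -= unit.tokens
    return chosen


# Toy data with invented scores, packed under a budget of 120 tokens.
units = [
    EvidenceUnit("a", 60, 0.9, 0.8, 0.7, 0.1),
    EvidenceUnit("b", 50, 0.5, 0.4, 0.3, 0.0),
    EvidenceUnit("c", 40, 0.2, 0.1, 0.0, 0.6),  # negative utility, never packed
    EvidenceUnit("d", 30, 0.3, 0.2, 0.1, 0.0),
]
selected = pack_cards(units, budget=120)
print([u.unit_id for u in selected])
```

In this toy run, units "a" and "b" fill 110 of the 120 tokens, "d" no longer fits, and "c" is excluded because its risk exceeds its combined benefit. A leave-one-out check like the one the paper describes could be approximated by re-running a downstream evaluation with one selected unit removed.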
Editorial analysis
Decision-aware selection addresses a common failure mode in retrieval-augmented agents: decisive evidence exists but is not surfaced or compressed at action time. A common industry observation is that scoring candidates by downstream decision impact, rather than by raw similarity alone, can materially change which evidence guides actions in tool-using agents. The reported jump in hit@1 and MRR@10 suggests that reranking with a decision-focused signal can improve retrieval precision on task-specific benchmarks. The small sample (50 instances) and the paper's own caveats, including that RepoBench-R summaries still outperform the cards, temper broad generalization.
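For readers less familiar with the two retrieval metrics, here is a short sketch of how hit@k and MRR@k are conventionally computed. The function names and the toy rankings are illustrative assumptions and are not taken from the paper or its data.

```python
def hit_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """1.0 if any gold item appears in the top k results, else 0.0."""
    return 1.0 if any(item in gold for item in ranked[:k]) else 0.0


def reciprocal_rank_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """1 / rank of the first gold item within the top k, or 0.0 if none."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in gold:
            return 1.0 / rank
    return 0.0


def mean(values: list[float]) -> float:
    return sum(values) / len(values) if values else 0.0


# Toy rankings (invented): each entry is (ranked candidate files, gold files).
runs = [
    (["a.py", "b.py", "c.py"], {"a.py"}),  # gold at rank 1
    (["x.py", "y.py"], {"y.py"}),          # gold at rank 2
    (["m.py", "n.py"], {"m.py"}),          # gold at rank 1
    (["p.py", "q.py"], {"z.py"}),          # gold not retrieved
]
hit1 = mean([hit_at_k(r, g, 1) for r, g in runs])
mrr10 = mean([reciprocal_rank_at_k(r, g, 10) for r, g in runs])
print(hit1, mrr10)
```

If hit@1 is a per-instance 0/1 score averaged over the 50 instances, the reported values of 0.58 and 0.78 would correspond to 29 and 39 instances with a correct file ranked first.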
What to watch
For practitioners and researchers, the open questions are: reproducibility of the reported metrics on larger, more diverse benchmarks; head-to-head comparisons with dense retrieval and learned rankers; latency and compute costs of the judgment layer; and how decision-aware selection composes with retrieval-augmented generation pipelines. The arXiv submission frames CICL as a measurement and selection layer rather than a turnkey agent fix, per arXiv.
Key Points
- Decision-aware scoring prioritizes action-critical evidence, which can boost retrieval precision beyond similarity-only reranking approaches.
- The paper reports a hit@1 jump from 0.58 to 0.78 using Qwen3.6-plus reranking on 50 SWE-bench Verified file-retrieval instances, suggesting that task-specific rerankers can change which evidence an agent sees first.
- Industry-pattern observation: adding an auditable selection layer enables controlled diagnostics but raises cost and integration questions for production agents.
Scoring Rationale
A single arXiv preprint introducing a decision-aware context selection/compression layer (CICL) for tool-using LLM agents, with reported retrieval gains on a small SWE-bench Verified slice. Relevant to agent-tooling practitioners but an early, narrow measurement-layer contribution rather than an end-to-end agent advance.