DUAL-BLADE offloads the KV cache for edge LLM inference

An arXiv paper submitted 29 Apr 2026 by Bodon Jeong and colleagues introduces DUAL-BLADE, a dual-path Key-Value (KV) cache residency framework for edge LLM inference. Per the abstract, the system assigns KV tensors at runtime to either a page-cache path or an NVMe-direct path; the latter bypasses the filesystem by mapping tensors to contiguous logical block address (LBA) regions, and adaptive pipeline parallelism overlaps storage I/O with GPU DMA. The authors report reductions in prefill and decode latency of up to 33.1% and 42.4% respectively, and a 2.2x increase in SSD utilization across varied memory budgets.
What happened
Per the arXiv submission of 29 Apr 2026, DUAL-BLADE dynamically assigns each KV tensor to the page-cache path or the NVMe-direct path based on runtime memory availability. The paper states that the NVMe-direct path avoids filesystem overhead by placing KV tensors in contiguous LBA regions, and that adaptive pipeline parallelism overlaps storage I/O with GPU DMA transfers. The abstract reports gains of up to 33.1% for prefill latency, 42.4% for decode latency, and a 2.2x improvement in SSD utilization.
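The abstract does not spell out the assignment policy itself. Purely as an illustration, a runtime residency decision driven by available host memory could look like the Python sketch below; the threshold constant, the /proc/meminfo probe, and the function names are assumptions, not details from the paper.

```python
import os

# Hypothetical threshold; the paper does not publish its policy constants.
PAGE_CACHE_MIN_FREE_BYTES = 512 * 1024 * 1024

def free_memory_bytes() -> int:
    """Read available host memory from /proc/meminfo (Linux-specific)."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) * 1024  # reported in kB
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

def choose_kv_path(tensor_bytes: int) -> str:
    """Keep a KV tensor on the page-cache path while host memory is plentiful;
    fall back to the NVMe-direct path (which bypasses the page cache) otherwise."""
    if free_memory_bytes() - tensor_bytes >= PAGE_CACHE_MIN_FREE_BYTES:
        return "page-cache"
    return "nvme-direct"
```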
Technical details
Per the paper, the NVMe-direct path avoids dependence on the kernel page cache by addressing KV tensors through contiguous LBA regions, enabling lower-overhead direct storage access; a minimal sketch of such a direct read follows. The submission also describes adaptive pipeline parallelism that schedules NVMe I/O to overlap with GPU DMA, improving throughput under tight device-memory budgets. The evaluation summarized in the abstract covers a range of memory budgets and SSD behaviors.
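For concreteness: on Linux, bypassing the page cache typically means opening the device with O_DIRECT and using block-aligned offsets, lengths, and buffers. The sketch below reads one fixed-size KV slot from a contiguous region of a raw NVMe block device; the device path, the back-to-back slot layout, and the 4 KiB alignment are illustrative assumptions rather than the paper's on-device format.

```python
import mmap
import os

BLOCK = 4096  # O_DIRECT requires block-aligned offsets, lengths, and buffers

def read_kv_slot(device: str, slot: int, slot_bytes: int) -> bytes:
    """Read one KV slot from a contiguous region of a raw block device
    (e.g. "/dev/nvme0n1"), bypassing the kernel page cache. The fixed-size,
    back-to-back slot layout is an assumption for illustration. Linux-only;
    opening the raw device normally requires elevated privileges."""
    assert slot_bytes % BLOCK == 0
    # An anonymous mmap is page-aligned, which satisfies O_DIRECT's buffer rule.
    buf = mmap.mmap(-1, slot_bytes)
    fd = os.open(device, os.O_RDONLY | os.O_DIRECT)
    try:
        os.preadv(fd, [buf], slot * slot_bytes)  # offset stays block-aligned
        return bytes(buf)
    finally:
        os.close(fd)
        buf.close()
```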
Editorial analysis: technical context
Industry pattern: Offloading large KV caches to local NVMe is a common approach to extend working sets beyond device DRAM on edge systems; the DUAL-BLADE paper contributes a specific dual-path mechanism that aims to reduce page-cache thrashing and software overhead compared with file-based designs.
Industry pattern: Overlapping storage I/O with GPU DMA is a standard throughput optimization for memory-constrained inference; adaptive pipeline parallelism described in the paper mirrors techniques used in prior storage-GPU co-scheduling research.
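As a concrete illustration of that overlap pattern (not the paper's scheduler), the double-buffered sketch below reads the next KV chunk from storage on an I/O thread while the current chunk is DMA-copied to the GPU on a side CUDA stream. It assumes PyTorch with CUDA available; read_chunk(i, out) is a hypothetical callable that fills a pinned host tensor from NVMe via whichever residency path holds the data.

```python
from concurrent.futures import ThreadPoolExecutor
import torch

def stream_kv_to_gpu(read_chunk, num_chunks: int, chunk_elems: int) -> list:
    """Double-buffered upload: while chunk i is copied host-to-device on a side
    stream, chunk i+1 is read from storage into the other pinned host buffer."""
    stream = torch.cuda.Stream()
    # Pinned (page-locked) buffers are required for truly asynchronous DMA copies.
    host = [torch.empty(chunk_elems, dtype=torch.float16, pin_memory=True)
            for _ in range(2)]
    out = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(read_chunk, 0, host[0])
        for i in range(num_chunks):
            pending.result()  # chunk i is now resident in host[i % 2]
            if i + 1 < num_chunks:
                # Prefetch chunk i+1; this read overlaps the copy issued below.
                pending = io.submit(read_chunk, i + 1, host[(i + 1) % 2])
            with torch.cuda.stream(stream):
                out.append(host[i % 2].to("cuda", non_blocking=True))
            # Finish the copy before this buffer is refilled two iterations later.
            stream.synchronize()
    return out
```

The single-worker executor keeps storage reads ordered while still running them concurrently with the device copies, and the synchronize point prevents a pinned buffer from being overwritten mid-transfer.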
Context and significance
For practitioners deploying LLMs at the edge, KV-cache management becomes a practical bottleneck as context length grows while device RAM stays fixed. The reported latency reductions (33.1% prefill, 42.4% decode) and the 2.2x SSD utilization improvement, if reproduced, would be meaningful for latency-sensitive applications on constrained hardware.
What to watch
- Whether the full paper and code release provide reproducible evaluation details and workloads.
- Real-world behavior across different NVMe devices, and SSD endurance tradeoffs when using NVMe-direct mappings.
- Integration complexity with existing inference runtimes and portability across operating systems.
Scoring rationale
This paper addresses a practical infrastructure bottleneck for edge LLM deployments: KV-cache memory pressure and storage I/O. Results reported on arXiv show measurable latency and SSD utilization improvements; the work is notable for practitioners building low-latency, memory-constrained inference systems.