Redditor runs 1T-parameter LLM from 768GB Optane DIMMs

Tom's Hardware reports that a Reddit user, APFrisco, combined six Intel Optane Persistent Memory (DCPMM) modules into 768GB of addressable persistent memory and used that capacity to run a 1-trillion-parameter Kimi K2.5 model locally on a Xeon workstation with a single GPU. According to APFrisco's post, the setup reached roughly 4 tokens/second. The article notes that the Optane DCPMM format was designed to bridge DRAM and SSD, that it offers much lower latency than an SSD while remaining about two to three times slower than DRAM, and that used Optane modules cost significantly less than the equivalent DRAM capacity.
What happened
Tom's Hardware reports a Reddit user, APFrisco, bought six Intel Optane Persistent Memory (DCPMM) modules totaling 768GB (6 x 128GB) and used them as system memory on a Xeon workstation to host a 1-trillion-parameter Kimi K2.5 model while driving inference with a single GPU. Per Tom's Hardware's coverage of APFrisco's Local LLaMA subreddit post, the local install yielded approximately 4 tokens/second.
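The capacity arithmetic behind the build can be sketched in a few lines. This is a back-of-the-envelope illustration, not a description of APFrisco's actual configuration: the report does not state the quantization used, the function names are hypothetical, and the sketch counts only weights in decimal gigabytes (no KV cache, activations, or OS overhead).

```python
# Rough sizing sketch. Assumptions (not from the report): decimal GB,
# weights only, and a uniform bit width across all parameters.

def weights_footprint_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate size of the model weights in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


def max_bits_per_param(capacity_gb: float, n_params: float) -> float:
    """Largest average bit width at which the weights still fit in capacity_gb."""
    return capacity_gb * 1e9 * 8 / n_params


def fits(capacity_gb: float, n_params: float, bits_per_param: float) -> bool:
    """True if the weights alone fit in the given capacity."""
    return weights_footprint_gb(n_params, bits_per_param) <= capacity_gb


CAPACITY_GB = 6 * 128   # six 128GB modules, as reported
N_PARAMS = 1e12         # 1 trillion parameters, as reported

if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = weights_footprint_gb(N_PARAMS, bits)
        print(f"{bits:>2}-bit weights: {size:.0f} GB, fits: {fits(CAPACITY_GB, N_PARAMS, bits)}")
    print(f"break-even width: {max_bits_per_param(CAPACITY_GB, N_PARAMS):.3f} bits/param")
```

Under these assumptions, 16-bit and 8-bit weights for a 1T-parameter model exceed 768GB, while anything averaging a little over 6 bits per parameter or less would fit.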
Technical details
Tom's Hardware notes the modules used are the now-discontinued Intel Optane DCPMM format designed to sit between traditional DRAM and SSD. The coverage states Optane offers much lower latency than SSD but remains about two to three times slower than DRAM, which is why the report frames this as an "exotic solution." The article attributes the performance figure (~4 tokens/second) and the hardware list to APFrisco's Reddit post.
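One common way to reason about a token rate like this is a memory-bound estimate: if each generated token requires streaming a fixed number of weight bytes from memory, throughput is roughly bandwidth divided by bytes per token. The sketch below is a simplified model under that assumption. The report gives neither the memory bandwidth nor how many parameters are read per token, so `active_params` and `bits_per_param` are hypothetical inputs, and the model ignores the GPU's share of the work.

```python
# Memory-bound decoding estimate. Assumption: every token streams
# `active_params` weights from memory once; nothing else limits speed.

def bytes_per_token(active_params: float, bits_per_param: float) -> float:
    """Weight bytes read from memory to produce one token."""
    return active_params * bits_per_param / 8


def implied_bandwidth_gb_s(tokens_per_s: float, active_params: float,
                           bits_per_param: float) -> float:
    """Effective memory bandwidth (GB/s) implied by an observed token rate."""
    return tokens_per_s * bytes_per_token(active_params, bits_per_param) / 1e9


def bound_tokens_per_s(bandwidth_gb_s: float, active_params: float,
                       bits_per_param: float) -> float:
    """Upper bound on tokens/second for a given effective bandwidth."""
    return bandwidth_gb_s * 1e9 / bytes_per_token(active_params, bits_per_param)


if __name__ == "__main__":
    # Placeholder inputs for illustration only; they are not measurements.
    example_active_params = 1e10
    example_bits = 8
    bw = implied_bandwidth_gb_s(4, example_active_params, example_bits)
    print(f"4 tokens/s would imply about {bw:.0f} GB/s of weight traffic")
```

The model is useful mainly for comparing configurations: halving the bytes read per token, or doubling the effective bandwidth, doubles the bound.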
Editorial analysis
Industry context
Using large persistent-memory pools to host model weights is an emergent community technique for increasing working capacity on commodity workstation platforms. Observed patterns in similar community experiments show a tradeoff: much larger addressable memory at materially lower cost, in exchange for reduced throughput and higher latency compared with DRAM-backed systems.
What to watch
Practitioners and infrastructure teams will watch for reproducibility across different models, software-level support for persistent-memory (DAX/pmem-aware allocators), and the point at which increased capacity outweighs throughput loss. Community posts, benchmarked comparisons against DRAM+NVMe swapping, and tooling that optimizes for high-capacity, higher-latency memory will determine whether this pattern scales beyond hobbyist builds.
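As a rough illustration of what software-level access to a large weight file can look like, the sketch below memory-maps a file and reads a single value without loading the whole file into DRAM. It uses only Python's standard `mmap` and `struct` modules. Whether the mapping is actually backed by persistent memory depends on how the system is configured (for example, a filesystem mounted with DAX on a pmem device), which the report does not describe; the path in the usage comment and the flat float32 layout are assumptions for illustration.

```python
import mmap
import struct


def read_float32(path: str, index: int) -> float:
    """Return the index-th little-endian float32 from a memory-mapped file.

    Only the touched pages are brought in; the file is never read in full.
    Assumes a flat array of float32 values (a hypothetical layout).
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes it read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return struct.unpack_from("<f", mm, index * 4)[0]


# Usage (hypothetical path on a DAX-mounted filesystem):
#   value = read_float32("/mnt/pmem/weights.bin", 12345)
```

Inference runtimes that already memory-map their weight files follow the same basic pattern, which is one reason high-capacity, higher-latency memory can be used without model-specific changes.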
Key Points
- Second-hand Intel Optane modules can provide 768GB of addressable memory for local LLM hosting at far lower cost than DRAM.
- Persistent memory trades throughput for capacity: community tests show usable inference at low token rates (about 4 tokens/second) for 1T-parameter models.
- Software support (pmem-aware allocators, model sharding) is a key factor for whether persistent memory becomes a practical stopgap for large-model inference.
Scoring Rationale
The report documents a practical, low-cost hardware workaround that could matter to practitioners trying to host very large models locally. It is notable for hardware-savvy operators but remains niche because throughput is low and the approach relies on discontinued hardware and community tinkering.
