Redditor runs 1T-parameter LLM from 768GB Optane DIMMs

Tom's Hardware reports that a Reddit user, APFrisco, assembled six Intel Optane Persistent Memory (DCPMM) modules into 768GB of addressable persistent memory and used that capacity to run the 1-trillion-parameter Kimi K2.5 model locally on a Xeon workstation with a single GPU. According to APFrisco's post, the setup achieved roughly 4 tokens/second. The article notes that Optane DCPMM was designed to bridge DRAM and SSD, that it offers much lower latency than SSD but remains about two to three times slower than DRAM, and that used Optane modules were significantly cheaper than equivalent DRAM capacity.
What happened
Tom's Hardware reports that a Reddit user, APFrisco, bought six Intel Optane Persistent Memory (DCPMM) modules totaling 768GB (6 x 128GB) and used them as system memory on a Xeon workstation to host the 1-trillion-parameter Kimi K2.5 model, with a single GPU driving inference. Per Tom's Hardware's coverage of APFrisco's post on the Local LLaMA subreddit, the local setup ran at approximately 4 tokens/second.
Technical details
Tom's Hardware notes that the modules are in the now-discontinued Intel Optane DCPMM format, which was designed to sit between traditional DRAM and SSD. According to the coverage, Optane offers much lower latency than SSD but remains about two to three times slower than DRAM, which is why the report frames this as an "exotic solution." The article attributes both the performance figure (~4 tokens/second) and the hardware list to APFrisco's Reddit post.
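The capacity figures in the report imply a rough ceiling on how many bits each weight can occupy. The sketch below works through that arithmetic. It assumes the parameter count is exactly 10^12 and ignores everything else that needs memory (KV cache, activations, the operating system), so the real budget would be lower. The report does not state which quantization APFrisco used, and the `fits` helper is illustrative only.

```python
# Back-of-envelope: how many bits per weight fit in the reported memory pool?
MODULES = 6                  # from the report
GB_PER_MODULE = 128          # from the report
PARAMS = 1_000_000_000_000   # "1-trillion-parameter", assumed to be exactly 1e12


def bits_per_param_budget(capacity_bytes: int, params: int) -> float:
    """Upper bound on bits per weight if the weights used all of the capacity."""
    return capacity_bytes * 8 / params


capacity_gb = MODULES * GB_PER_MODULE          # 768
capacity_bytes_decimal = capacity_gb * 10**9   # if "GB" means 10^9 bytes
capacity_bytes_binary = capacity_gb * 2**30    # if "GB" means 2^30 bytes (GiB)

budget_decimal = bits_per_param_budget(capacity_bytes_decimal, PARAMS)
budget_binary = bits_per_param_budget(capacity_bytes_binary, PARAMS)


def fits(bits_per_weight: float,
         capacity_bytes: int = capacity_bytes_decimal,
         params: int = PARAMS) -> bool:
    """True if weights stored at this precision fit in the pool (weights only)."""
    return params * bits_per_weight / 8 <= capacity_bytes


if __name__ == "__main__":
    print(f"capacity: {capacity_gb} GB")
    print(f"budget (decimal GB): {budget_decimal:.3f} bits/param")
    print(f"budget (binary GiB): {budget_binary:.3f} bits/param")
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights fit: {fits(bits)}")
```

Under these assumptions, 16-bit and 8-bit weights would not fit in 768GB, while 4-bit weights would, which suggests a configuration like this depends on some form of low-bit quantization.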
Industry context
Editorial analysis: Using large persistent-memory pools to host model weights is an emerging community technique for increasing working memory capacity on commodity workstation platforms. Similar community experiments show a consistent tradeoff: much larger addressable memory at materially lower cost, in exchange for lower throughput and higher latency than DRAM-backed systems.
What to watch
Editorial analysis: Practitioners and infrastructure teams will watch for reproducibility across different models, software-level support for persistent memory (DAX, pmem-aware allocators), and the point at which increased capacity outweighs the throughput loss. Community posts, benchmark comparisons against DRAM+NVMe swapping, and tooling that optimizes for high-capacity, higher-latency memory will determine whether this pattern scales beyond hobbyist builds.
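As an illustration of what software-level support can look like at its simplest, the sketch below memory-maps a file read-only so that data is faulted in on access rather than copied into process memory up front. On Linux, a file on a filesystem mounted with DAX can be mapped this way without going through the page cache. This is a generic example, not a description of APFrisco's configuration, which the report describes as using the modules as system memory. The temporary file is a small stand-in for a real weights file, and `map_weights_readonly` and `demo` are hypothetical names.

```python
import mmap
import os
import tempfile


def map_weights_readonly(path: str) -> mmap.mmap:
    """Map a whole file read-only; pages are faulted in on first access."""
    with open(path, "rb") as f:
        # Length 0 maps the entire file. mmap keeps its own duplicate of the
        # descriptor, so closing `f` afterwards is safe.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def demo() -> bytes:
    """Write a tiny stand-in 'weights' file, map it, and read one slice."""
    # Hypothetical stand-in: a real run would point at a large weights file,
    # for example one stored on a DAX-mounted persistent-memory filesystem.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes(range(256)) * 16)  # 4096 bytes of dummy data
        path = tmp.name
    try:
        weights = map_weights_readonly(path)
        try:
            # Slicing copies only the requested bytes out of the mapping.
            return weights[1024:1028]
        finally:
            weights.close()
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

The design choice worth noting is that the mapping is read-only and lazy: nothing is loaded until a slice is touched, which is the access pattern that matters when the backing store is large and slower than DRAM.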
Scoring Rationale
The report documents a practical, low-cost hardware workaround that could matter to practitioners trying to host very large models locally. It is notable for hardware-savvy operators but remains niche because throughput is low and the approach relies on discontinued hardware and community tinkering.


