Redditor runs 1T-parameter LLM from 768GB Optane DIMMs

May 23, 2026 | By LDS Team

Relevance Score: 6.2

Tom's Hardware reports that a Reddit user, APFrisco, assembled six Intel Optane Persistent Memory (DCPMM) modules for 768GB of addressable memory and used that capacity to run the 1-trillion-parameter Kimi K2.5 model locally on a Xeon workstation with a single GPU. According to APFrisco's post, the setup achieved roughly 4 tokens/second. The article notes that the Optane DCPMM format was designed to bridge DRAM and SSD: it offers much lower latency than an SSD but remains about two to three times slower than DRAM. Used Optane modules were also significantly cheaper than the equivalent DRAM capacity, per Tom's Hardware.

What happened

Tom's Hardware reports that a Reddit user, APFrisco, bought six Intel Optane Persistent Memory (DCPMM) modules totaling 768GB (6 x 128GB) and used them as system memory on a Xeon workstation to host the 1-trillion-parameter Kimi K2.5 model, with a single GPU driving inference. Per Tom's Hardware's coverage of APFrisco's post on the Local LLaMA subreddit, the local install yielded approximately 4 tokens/second.
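
To make the capacity figures concrete, here is a minimal sketch of the arithmetic behind fitting a 1-trillion-parameter model into 768GB. The report does not say what precision the weights were stored at, so the bit widths below are illustrative assumptions, as is the 10% allowance for runtime overhead. Sizes use decimal gigabytes (1 GB = 1e9 bytes).

```python
GB = 1e9  # decimal gigabyte, matching how module capacities are marketed


def weights_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB."""
    return n_params * bits_per_weight / 8 / GB


def fits(n_params: float, bits_per_weight: float, capacity_gb: float,
         overhead_fraction: float = 0.10) -> bool:
    """True if weights plus an ASSUMED overhead allowance fit in capacity_gb."""
    needed = weights_footprint_gb(n_params, bits_per_weight) * (1 + overhead_fraction)
    return needed <= capacity_gb


CAPACITY_GB = 6 * 128   # six 128GB modules, as reported
N_PARAMS = 1e12         # 1 trillion parameters, as reported

for bits in (16, 8, 4):  # ASSUMED precisions; the report does not state one
    size = weights_footprint_gb(N_PARAMS, bits)
    ok = fits(N_PARAMS, bits, CAPACITY_GB)
    print(f"{bits:>2}-bit weights: {size:7.1f} GB, fits in {CAPACITY_GB} GB: {ok}")
```

Under these assumptions, only a heavily quantized copy of the weights would fit in 768GB; 8-bit and 16-bit copies would not.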

Technical details

Tom's Hardware notes the modules used are the now-discontinued Intel Optane DCPMM format designed to sit between traditional DRAM and SSD. The coverage states Optane offers much lower latency than SSD but remains about two to three times slower than DRAM, which is why the report frames this as an "exotic solution." The article attributes the performance figure (~4 tokens/second) and the hardware list to APFrisco's Reddit post.
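
A rough way to reason about the reported token rate is a memory-bound decode model, where each generated token requires reading the active weights once, so tokens/second is approximately effective bandwidth divided by bytes read per token. The sketch below applies that model. The active-parameter count and bit width are placeholder assumptions (a sparse mixture-of-experts model reads only part of its weights per token), not figures from the report. Treating "two to three times slower" as a bandwidth ratio is also a simplification.

```python
def bytes_per_token(active_params: float, bits_per_weight: float) -> float:
    """Bytes of weights read to generate one token (memory-bound model)."""
    return active_params * bits_per_weight / 8


def tokens_per_second(bandwidth_bytes_per_s: float, active_params: float,
                      bits_per_weight: float) -> float:
    """Upper-bound decode rate if weight reads are the only bottleneck."""
    return bandwidth_bytes_per_s / bytes_per_token(active_params, bits_per_weight)


def implied_bandwidth_gb_s(tokens_per_s: float, active_params: float,
                           bits_per_weight: float) -> float:
    """Effective bandwidth (GB/s) that a measured token rate would imply."""
    return tokens_per_s * bytes_per_token(active_params, bits_per_weight) / 1e9


ACTIVE_PARAMS = 32e9   # ASSUMPTION: parameters touched per token
BITS = 4               # ASSUMPTION: quantized weight width
REPORTED_RATE = 4.0    # tokens/second, as reported

bw = implied_bandwidth_gb_s(REPORTED_RATE, ACTIVE_PARAMS, BITS)
print(f"Implied effective bandwidth: {bw:.1f} GB/s")

# If memory were 2x to 3x faster and nothing else limited throughput,
# the same model would scale proportionally.
for speedup in (2, 3):
    rate = tokens_per_second(bw * 1e9 * speedup, ACTIVE_PARAMS, BITS)
    print(f"{speedup}x faster memory: about {rate:.0f} tokens/s")
```

Real systems also spend time on GPU transfers, attention caches, and compute, so this is best read as an order-of-magnitude check rather than a prediction.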

Industry context

What to watch

Editorial analysis

Using large persistent-memory pools to host model weights is an emerging community technique for increasing working capacity on commodity workstation platforms. Similar community experiments show a consistent tradeoff: much larger addressable memory at materially lower cost, in exchange for lower throughput and higher latency than DRAM-backed systems.

Practitioners and infrastructure teams will watch for reproducibility across different models, software-level support for persistent memory (DAX, pmem-aware allocators), and the point at which increased capacity outweighs throughput loss. Community posts, benchmarked comparisons against DRAM+NVMe swapping, and tooling that optimizes for high-capacity, higher-latency memory will determine whether this pattern scales beyond hobbyist builds.
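
As an illustration of what software-level support can look like at its simplest, the sketch below memory-maps a weights file read-only so that pages are loaded on first access rather than copied up front. On a filesystem mounted with DAX on persistent memory, the same call would map the media directly instead of going through the page cache. The report does not say whether the build described used this approach, and the file name and helper functions here are hypothetical. The demo uses a temporary file so that it runs anywhere.

```python
import mmap
import os
import tempfile


def map_weights_readonly(path: str) -> mmap.mmap:
    """Map an entire file read-only; data is paged in lazily on access."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file. The mapping stays valid after
        # the file object is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def read_slice(mapped: mmap.mmap, offset: int, length: int) -> bytes:
    """Copy `length` bytes starting at `offset` out of the mapping."""
    return mapped[offset:offset + length]


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a real weights file; on a pmem system the path would
    # sit on a DAX-mounted filesystem (hypothetical, e.g. /mnt/pmem/...).
    path = os.path.join(tmp, "weights.bin")
    with open(path, "wb") as f:
        f.write(bytes(range(256)) * 16)  # 4096 bytes of dummy data

    mapped = map_weights_readonly(path)
    try:
        chunk = read_slice(mapped, 256, 4)
        print(len(mapped), list(chunk))
    finally:
        mapped.close()
```

The design point is that the process addresses far more data than it copies: only the slices actually read are brought in, which is the property that makes high-capacity, higher-latency memory usable for weights.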

Key Points

1. Second-hand Intel Optane modules can provide 768GB of addressable memory for local LLM hosting at far lower cost than DRAM.
2. Persistent memory trades throughput for capacity: community tests show usable inference at low token rates (about 4 tokens/second) for 1T-parameter models.
3. Software support (pmem-aware allocators, model sharding) is a key factor for whether persistent memory becomes a practical stopgap for large-model inference.

Scoring Rationale

The report documents a practical, low-cost hardware workaround that could matter to practitioners trying to host very large models locally. It is notable for hardware-savvy operators but remains niche because throughput is low and the approach relies on discontinued hardware and community tinkering.

Sources

Public references used for this report.

Tom's Hardware (tomshardware.com): "768GB of cheap Intel Optane DIMM memory sticks used to run 1-trillion-parameter LLM on a system with a single GPU — local Kimi K2.5 install achieved roughly 4 tokens per second"
