Models & Research | diffusion models | image generation | generative decoding

Diffusion Decoders Replace VAE Decoders for 4K Images

June 11, 2026 | By LDS Team

Relevance Score: 5.5

NVIDIA Research and Tencent YoutuResearch have each released systems that replace or eliminate the VAE decoder, a bottleneck in latent diffusion models. NVIDIA's PiD (arXiv 2605.23902) reformulates decoding as conditional pixel diffusion and decodes 512x512 latents to 2048x2048 in under 1 second on an RTX 5090, about 6 times faster than cascaded super-resolution pipelines. L2P (arXiv 2605.12013), from Tencent and NJU PCaLab, goes further and removes the VAE entirely: pretrained latent diffusion models are converted to pixel-space generators through shallow adaptation layers, enabling native 4K generation and zero-shot extrapolation to 8K, with 4K inference reported as 97.67% faster than the source latent model. Both projects release code and weights and support popular existing backbones.

Background

Most text-to-image systems run diffusion in a compact latent space and then decode back to pixels through a Variational Autoencoder (VAE). The VAE decoder is reconstruction-oriented - optimized to invert the encoder rather than synthesize new detail - and becomes costly and quality-limited at megapixel scale. Two independent research groups published systems in May 2026 that address this bottleneck from different angles.

NVIDIA PiD

NVIDIA's Spatial Intelligence Lab published PiD (Pixel diffusion Decoder) as a preprint in May 2026 (arXiv 2605.23902). PiD reformulates latent decoding as conditional pixel diffusion, training a single generative module that unifies decoding and upsampling. Rather than deterministically reconstructing pixels from a latent, PiD denoises in high-resolution pixel space, conditioned on that latent. A lightweight sigma-aware adapter injects noise-corrupted latents into the diffusion backbone, and DMD2 distillation reduces inference to 4 steps. NVIDIA Research reports the following performance: 512x512 latents decoded to 2048x2048 in under 1 second with 13 GB peak memory on a consumer RTX 5090, or in 210 ms on a GB200 GPU. That is approximately 6 times faster than cascaded super-resolution pipelines and, according to an evaluation that used Gemini-3-Flash as a judge, higher in visual fidelity. PiD supports FLUX.1[dev], FLUX.2[dev], Z-Image, SD3, DINOv2, and SigLIP backbones. Code and model weights are available on GitHub (nv-tlabs/PiD) and Hugging Face.
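The few-step idea can be illustrated with a toy loop. The sketch below is not PiD's architecture or sampler. It is a generic predict-then-renoise loop over a 1-D signal in pure Python, and everything in it is an assumption made for illustration: `toy_denoiser` stands in for a trained network, the four-value sigma schedule is made up, and nearest-neighbour repetition stands in for whatever conditioning path the real adapter uses. It shows only the control flow of decoding as conditional denoising: start from noise, repeatedly estimate the clean high-resolution signal given the upsampled latent, and add back less noise at each step.

```python
import random

def upsample_nearest(latent, factor):
    """Repeat each latent value `factor` times (1-D stand-in for spatial upsampling)."""
    return [v for v in latent for _ in range(factor)]

def toy_denoiser(x, sigma, cond, tau=0.1):
    """Hypothetical stand-in for a trained network.

    Returns the posterior mean of the clean signal under the toy assumption
    that it is Gaussian around `cond` with standard deviation `tau`.
    """
    w = tau**2 / (tau**2 + sigma**2)  # trust in the noisy input grows as sigma shrinks
    return [w * xi + (1.0 - w) * ci for xi, ci in zip(x, cond)]

def few_step_decode(latent, factor=4, sigmas=(1.0, 0.6, 0.3, 0.1), seed=0):
    """Decode by conditional denoising in len(sigmas) steps (illustrative only)."""
    rng = random.Random(seed)
    cond = upsample_nearest(latent, factor)        # conditioning at target resolution
    x = [rng.gauss(0.0, sigmas[0]) for _ in cond]  # start from noise at the highest sigma
    for i, sigma in enumerate(sigmas):
        x0 = toy_denoiser(x, sigma, cond)          # estimate of the clean signal
        next_sigma = sigmas[i + 1] if i + 1 < len(sigmas) else 0.0
        x = [v + next_sigma * rng.gauss(0.0, 1.0) for v in x0]  # re-noise to the next level
    return x

decoded = few_step_decode([0.0, 1.0, 0.5, -1.0])
print(len(decoded))  # 4 latent values * factor 4 = 16 samples
```

The last step re-noises with sigma 0, so the loop returns the final clean estimate. A distilled model makes a schedule this short workable because each call is trained to produce a good estimate directly.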

Tencent L2P

L2P (arXiv 2605.12013), from researchers at NJU PCaLab and Tencent YoutuResearch, goes further: it removes the VAE entirely rather than replacing only the decode step. L2P converts pretrained latent diffusion models into pixel-space generators by freezing most of the source model and training only shallow transformation layers. Training requires no real images, only synthetic outputs from the source model itself, and runs on 8 GPUs with, per the authors, negligible additional compute. Reported results: native 4K generation, zero-shot extrapolation to 8K, and single-step 4K inference that is 97.67% faster than the source latent model at that resolution. GenEval scores reach 93% of the source LDM's score, indicating that prompt adherence is largely preserved through the conversion. Code is available on GitHub (TencentYoutuResearch/T2I-L2P); weights are on Hugging Face. The repository does not specify an explicit open-source license, so commercial use terms should be confirmed with the research team.
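A figure such as "97.67% faster" can be read in two ways, and this article does not say which one the authors intend. The short sketch below works out both readings so the difference is visible. The function names are illustrative and come from neither paper.

```python
def speedup_if_latency_reduced(percent):
    """Reading A: latency drops by `percent` percent of the baseline latency."""
    return 1.0 / (1.0 - percent / 100.0)

def speedup_if_throughput_increased(percent):
    """Reading B: throughput rises by `percent` percent over the baseline."""
    return 1.0 + percent / 100.0

p = 97.67
print(f"latency-reduction reading: {speedup_if_latency_reduced(p):.1f}x")
print(f"throughput-gain reading:   {speedup_if_throughput_increased(p):.2f}x")
```

Under reading A the figure corresponds to roughly a 43x speedup. Under reading B it is just under 2x. The paper is the place to check which definition applies.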

Practitioner implications

Both approaches are designed to slot into existing pipelines. PiD acts as a drop-in replacement for the VAE decode step in latent diffusion pipelines built on its supported backbones; L2P converts a pretrained LDM checkpoint through a short adaptation run. The practical outcome is that generation at 4K and above becomes accessible on consumer hardware without cascaded pipelines or separate upscaling models. For inference at scale, PiD's lower decode latency bears directly on throughput and cost.
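What "drop-in replacement for the decode step" means structurally can be sketched with a pipeline that takes its decoder as a parameter. All names here (`Pipeline`, `sample_latent`, `reconstructing_decode`, `upsampling_decode`) are hypothetical stand-ins, not the API of PiD, L2P, or any diffusion library, and the "images" are plain lists of floats.

```python
from typing import Callable, List

Latent = List[float]
Image = List[float]
Decoder = Callable[[Latent], Image]

def sample_latent(prompt: str) -> Latent:
    """Hypothetical stand-in for the latent diffusion sampler."""
    return [float(len(prompt))] * 4

def reconstructing_decode(latent: Latent) -> Image:
    """Stand-in for a VAE-style decoder: output has the same size as the input here."""
    return list(latent)

def upsampling_decode(latent: Latent, factor: int = 4) -> Image:
    """Stand-in for a decoder that also upsamples in the same call."""
    return [v for v in latent for _ in range(factor)]

class Pipeline:
    """Latent sampling followed by a pluggable decode step."""

    def __init__(self, sampler: Callable[[str], Latent], decoder: Decoder) -> None:
        self.sampler = sampler
        self.decoder = decoder

    def __call__(self, prompt: str) -> Image:
        return self.decoder(self.sampler(prompt))

baseline = Pipeline(sample_latent, reconstructing_decode)
swapped = Pipeline(sample_latent, upsampling_decode)  # only the decoder changes
print(len(baseline("a cat")), len(swapped("a cat")))
```

The sampler is untouched in both pipelines. That is the property a drop-in decoder relies on, whereas a conversion such as L2P changes the generator itself.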

Key Points

1. What: NVIDIA PiD decodes 512x512 latents to 2048x2048 in under 1 second (approx. 6x faster than cascaded SR pipelines) by replacing the VAE decode step with pixel diffusion.
2. Why: VAE decoders are reconstruction-oriented and become costly and quality-limited at megapixel scale; pixel-space diffusion generates new detail rather than inverting the encoder.
3. So what: Both PiD and L2P release code and weights and slot into existing pipelines, making 4K and 8K generation feasible on consumer hardware without fully retraining base models.

Scoring Rationale

Two genuine research advances: NVIDIA and Tencent have released systems that replace the VAE decoder, a bottleneck in latent diffusion models, with measurable speed and quality gains at 4K. Covered primarily through a tutorial explainer; the underlying papers are from May 2026. Relevant to ML practitioners building image generation pipelines, but not a breaking frontier news event.

Sources

Public references used for this report.

PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion (research.nvidia.com)

PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion, arXiv 2605.23902 (arxiv.org)

nv-tlabs/PiD (github.com)
