ERNIE-Image Delivers Accurate Text-inclusive Image Generation

Baidu released ERNIE-Image, an open-source 8B diffusion transformer that targets cases that break other generators: legible in-image text, strict layouts, multi-panel comics, and bilingual prompts. The release includes a distilled ERNIE-Image-Turbo 8-step variant for rapid iteration alongside a higher-quality 50-step SFT model. Reported benchmarks are strong (0.8856 on GENEval, 0.9733 on LongTextBench), and the code and weights are available under Apache-2.0. A built-in Prompt Enhancer expands terse prompts into structured instructions, and the repo and Hugging Face packages make self-hosting and fine-tuning straightforward. For creatives and product teams, ERNIE-Image lowers the barrier to production-grade poster, comic, storyboard, and UI asset generation without license friction.
What happened
Baidu published ERNIE-Image, an open-source, single-stream Diffusion Transformer with 8B DiT parameters designed for high-fidelity text-in-image generation, layout fidelity, and bilingual prompts. The release includes a distilled 8-step ERNIE-Image-Turbo for draft-speed generation and a 50-step SFT model for final renders. Core benchmark numbers reported include GENEval 0.8856 and LongTextBench 0.9733, and the project is released under Apache-2.0 so weights and outputs are commercially usable.
Technical details
ERNIE-Image is built on a single-stream DiT architecture and pairs the generator with a lightweight Prompt Enhancer that expands terse user inputs into structured descriptions. The team ships two runtime variants: the distilled ERNIE-Image-Turbo (8 steps) for quick iteration and a full 50-step SFT model for quality. Supported resolutions include 1024x1024, and the codebase exposes generation, edit, composite, and upscale primitives so designers can consolidate an asset pipeline in one place.
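The two-variant split lends itself to a draft-then-final workflow. The sketch below is illustrative only: the model names, step counts, and 1024x1024 resolution come from the release notes, but the wrapper API itself is hypothetical, not ERNIE-Image's actual interface.

```python
from dataclasses import dataclass

@dataclass
class RenderConfig:
    """Settings for one render pass. Width/height default to the
    1024x1024 resolution the release documents as supported."""
    model: str
    steps: int
    width: int = 1024
    height: int = 1024

def config_for(stage: str) -> RenderConfig:
    """Pick the distilled Turbo variant for fast drafts and the
    50-step SFT model for final renders (variant names assumed)."""
    if stage == "draft":
        return RenderConfig(model="ERNIE-Image-Turbo", steps=8)
    if stage == "final":
        return RenderConfig(model="ERNIE-Image-SFT", steps=50)
    raise ValueError(f"unknown stage: {stage!r}")
```

In practice a team would iterate with `config_for("draft")` until the composition is right, then re-render once with `config_for("final")`.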
Core capabilities: The model is explicitly trained to solve three failure modes common to diffusion-based image models: accurate long-form glyph rendering, instruction-faithful multi-object composition, and page-level layout reasoning. Reported benchmark results include LongTextBench 0.9733 for text rendering and GENEval 0.8856 for instruction fidelity, positioning ERNIE-Image ahead of several open competitors on text- and layout-centric tasks.
Feature set:
- Distilled 8-step ERNIE-Image-Turbo for rapid drafts and a 50-step SFT model for production frames
- Built-in Prompt Enhancer to reduce manual prompt engineering and expand short prompts into structured instructions
- Pipeline features: generate, edit, composite, upscale, and export in a single surface
- Apache-2.0 licensing, enabling commercial use, fine-tuning, and self-hosting
Context and significance
ERNIE-Image addresses a persistent pain point for practitioners: the inability of many diffusion models to render legible, layout-sensitive text and strict compositions. Where Stable Diffusion and some proprietary models produce blurred or hallucinated glyphs, ERNIE-Image targets fidelity with dedicated training and evaluation on bilingual prompts. The Apache-2.0 license is strategically important for design teams and agencies that need clear commercial rights and the option to self-host or fine-tune. This release also reflects continued maturation of the Chinese foundation-model ecosystem, building on lineage work such as ERNIE-ViLG and signaling stronger competition in the open-weight image model space.
Practical implications for practitioners: If you operate creative pipelines, ERNIE-Image lowers the integration cost for generating posters, comics, storyboards, UI mockups, and ad frames that require accurate typography and layout constraints. Teams can self-host the weights, plug the Prompt Enhancer into existing prompt flows, and use the Turbo variant for iterative authoring followed by the SFT model for high-quality outputs.
What to watch
Adoption by the open-source community, forks that add safety filters or domain-specific fine-tunes, and real-world asset pipelines that replace stock shoots. Also monitor compute costs for self-hosting, model safety and copyright moderation in downstream tooling, and how competitors respond on the text-in-image benchmark front.
Scoring Rationale
This is a major open-source model release that directly fixes text-in-image and layout weaknesses in diffusion models and ships with permissive Apache-2.0 licensing. That combination makes it immediately useful to practitioners, meriting a high impact score in the open-model category.