ERNIE-Image Delivers Accurate Text-inclusive Image Generation

Baidu released ERNIE-Image, an open-source 8B diffusion transformer that targets cases that break other generators: legible in-image text, strict layouts, multi-panel comics, and bilingual prompts. The release pairs a distilled 8-step ERNIE-Image-Turbo variant for rapid iteration with a higher-quality 50-step SFT model. Reported benchmarks are strong, with GENEval 0.8856 and LongTextBench 0.9733, and the code and weights are available under Apache-2.0. A built-in Prompt Enhancer expands terse prompts into structured instructions, and the hosted repo and Hugging Face packages make self-hosting and fine-tuning straightforward. For creatives and product teams, ERNIE-Image lowers the barrier to production-grade poster, comic, storyboard, and UI asset generation without license friction.
What happened
Baidu published ERNIE-Image, an open-source, single-stream Diffusion Transformer with 8B DiT parameters designed for high-fidelity text-in-image generation, layout fidelity, and bilingual prompts. The release includes a distilled 8-step ERNIE-Image-Turbo for draft-speed generation and a 50-step SFT model for final renders. Core benchmark numbers reported include GENEval 0.8856 and LongTextBench 0.9733, and the project is released under Apache-2.0 so weights and outputs are commercially usable.
Technical details
ERNIE-Image is built on a single-stream DiT architecture and pairs the generator with a lightweight Prompt Enhancer that expands terse user inputs into structured descriptions. The release ships two runtime configurations: the distilled ERNIE-Image-Turbo (8 steps) for quick iteration and the full 50-step SFT model for quality. Supported resolutions include 1024x1024, and the codebase exposes generate, edit, composite, and upscale primitives so designers can centralize an asset pipeline.
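The two runtime configurations lend themselves to a small routing helper in a build pipeline. A minimal sketch follows; `RenderConfig` and `pick_config` are illustrative names, not part of the actual ERNIE-Image codebase:

```python
from dataclasses import dataclass

# Illustrative only: these names are assumptions, not the real ERNIE-Image API.
@dataclass(frozen=True)
class RenderConfig:
    model: str
    steps: int
    resolution: tuple

# The two runtime points described above: distilled Turbo for drafts,
# the full SFT model for final renders.
DRAFT = RenderConfig("ERNIE-Image-Turbo", 8, (1024, 1024))
FINAL = RenderConfig("ERNIE-Image-SFT", 50, (1024, 1024))

def pick_config(stage: str) -> RenderConfig:
    """Route draft iterations to the 8-step model, everything else to SFT."""
    return DRAFT if stage == "draft" else FINAL
```

In practice a team would iterate on `DRAFT` until the composition settles, then rerun the same prompt under `FINAL` for the production frame.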
Core capabilities
The model is explicitly trained to solve three failure modes common to diffusion-based image models: accurate long-form glyph rendering, instruction-faithful multi-object composition, and page-level layout reasoning. Reported benchmark results include LongTextBench 0.9733 for text rendering and GENEval 0.8856 for instruction fidelity, which would position ERNIE-Image ahead of several open competitors on text- and layout-centric tasks.
Feature set
- Distilled 8-step ERNIE-Image-Turbo for rapid drafts and a 50-step SFT model for production frames
- Built-in Prompt Enhancer to reduce manual prompt engineering and expand short prompts into structured instructions
- Pipeline features: generate, edit, composite, upscale, and export in a single surface
- Apache-2.0 licensing, enabling commercial use, fine-tuning, and self-hosting
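The "single surface" idea in the feature list above can be sketched as a fluent pipeline object. This is a hypothetical interface; the class and method names are assumptions drawn from the feature names, not the real ERNIE-Image API:

```python
# Hypothetical single-surface asset pipeline; names are illustrative only.
class AssetPipeline:
    def __init__(self, model: str):
        self.model = model
        self.ops = []  # record of chained operations, oldest first

    def generate(self, prompt: str):
        self.ops.append(("generate", prompt))
        return self  # return self so calls chain fluently

    def edit(self, instruction: str):
        self.ops.append(("edit", instruction))
        return self

    def composite(self, layers: list):
        self.ops.append(("composite", layers))
        return self

    def upscale(self, factor: int):
        self.ops.append(("upscale", factor))
        return self

# Draft authoring pass on the distilled model.
pipe = AssetPipeline("ERNIE-Image-Turbo")
pipe.generate("bilingual movie poster, bold headline").edit("enlarge the title").upscale(2)
```

The point of a single surface is that generation, correction, and finishing steps live in one object rather than being stitched across separate tools.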
Context and significance
ERNIE-Image addresses a persistent pain point for practitioners: the inability of many diffusion models to render legible, layout-sensitive text and strict compositions. Where Stable Diffusion and some proprietary models produce blurred or hallucinated glyphs, ERNIE-Image targets fidelity with dedicated training and evaluation on bilingual prompts. The Apache-2.0 license is strategically important for design teams and agencies that need clear commercial rights and the option to self-host or fine-tune. This release also reflects continued maturation of the Chinese foundation-model ecosystem, building on lineage work such as ERNIE-ViLG and signaling stronger competition in the open-weight image model space.
Practical implications for practitioners
If you operate creative pipelines, ERNIE-Image lowers the integration cost for generating posters, comics, storyboards, UI mockups, and ad frames that require accurate typography and layout constraints. Teams can self-host the weights, plug the Prompt Enhancer into existing prompt flows, and use the Turbo variant for iterative authoring followed by the SFT model for high-quality outputs.
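The terse-to-structured expansion the Prompt Enhancer performs can be approximated with a template for teams prototyping before the model is wired in. The real enhancer is a learned component, so this function is purely a stand-in under assumed field names:

```python
# Stand-in for the built-in Prompt Enhancer; the actual component is a model,
# and these field names are assumptions, not its real output format.
def enhance_prompt(terse: str) -> str:
    """Expand a terse prompt into a structured instruction block."""
    return (
        f"Subject: {terse}\n"
        "Layout: single frame, centered composition, legible typography\n"
        "Text: render any in-image text exactly as written\n"
        "Language: bilingual (English/Chinese) prompts supported\n"
    )
```

Slotting a function like this into an existing prompt flow makes it trivial to swap in the shipped enhancer later without touching downstream code.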
What to watch
Adoption by the open-source community, forks that add safety filters or domain-specific fine-tunes, and real-world asset pipelines that replace stock shoots. Also monitor compute costs for self-hosting, model safety and copyright moderation in downstream tooling, and how competitors respond on the text-in-image benchmark front.
Scoring Rationale
This is a major open-source model release that directly fixes text-in-image and layout weaknesses in diffusion models and ships with permissive Apache-2.0 licensing. That combination makes it immediately useful to practitioners, meriting a high impact score in the open-model category.