Alibaba’s Qwen team has released Qwen-Image, a 20B-parameter Multimodal Diffusion Transformer (MMDiT) image foundation model. The release highlights two primary capabilities: complex text rendering across alphabetic and logographic scripts, and precise image editing that preserves both semantic content and visual realism. The model is available via Qwen Chat and accessible through the team’s API.

The announcement positions text rendering as the central differentiator. Most image generation models struggle to accurately reproduce arbitrary strings — especially long paragraphs, mixed-script content, and structured layouts. The blog post describes Qwen-Image as able to handle multi-line layouts, paragraph-level semantics, and fine-grained character details in both English and Chinese.

What the benchmarks show

The team evaluated Qwen-Image across six public benchmarks. For general image generation, they used GenEval, DPG, and OneIG-Bench. For editing they used GEdit, ImgEdit, and GSO. According to the post, Qwen-Image “achieves state-of-the-art performance on all benchmarks.”

The text-rendering results on LongText-Bench, ChineseWord, and TextCraft are described as particularly strong — the post says the model “excels in text rendering — particularly in Chinese text generation — outperforming existing state-of-the-art models by a significant margin.” The team frames this as Qwen-Image’s “unique position” at the intersection of general image capability and text rendering precision.

Chinese and multilingual rendering in practice

The demo section of the blog illustrates the text rendering capability with progressively harder prompts. In one example, a Miyazaki-styled anime scene contains shop signs reading “云存储” (cloud storage), “云计算” (cloud computing), and “云模型” (cloud model), along with “千问” on wine jars. The post notes that all characters are “rendered realistically and accurately with the depth of field” and that character poses and expressions are preserved.

A second example asks the model to generate a traditional Chinese couplet with specific characters written in calligraphy style. According to the post, the model “accurately drew the left and right couplets and the horizontal scroll, applied calligraphy effects, and accurately generated the Yueyang Tower” in the background. Blue-and-white porcelain on the table is described as “very realistic.”

English rendering is demonstrated with a bookstore window prompt containing four specific book titles. The model accurately reproduces all four covers: “The light between worlds,” “When stars are scattered,” “The silent patient,” and “The night circus.” A more complex infographic prompt with six labeled submodules — each carrying an icon, title, and descriptive text — is described as successfully completed in full layout.

The model also handles small text. In one demo, a man holds a paper containing a four-line poem, and the paper “is less than one-tenth of the entire image” — the post reports the paragraph is accurately generated despite its proportion. A harder test includes a dense paragraph of Chinese text written on a glass plate in a handwriting style, and then a bilingual version mixing Chinese and English on the same plate. Both are described as successfully rendered.

Image editing and the multi-task training approach

Beyond text rendering, the post emphasizes image editing as the second major capability. The announcement credits an “enhanced multi-task training paradigm” for the model’s ability to preserve semantic meaning and visual realism during edits. The supported editing operations described in the post include text editing within images, object addition and removal, and style transfer.

The team describes this as a unified approach — one foundation model handling both generation from scratch and targeted editing — rather than separate specialized models. On the editing benchmarks cited earlier (GEdit, ImgEdit, and GSO), the post reports state-of-the-art results for this single unified model.

The commercial context is visible in the Chinese-language demos, several of which feature Alibaba Cloud product names (“阿里云,” “通义千问”) embedded naturally in generated scenes. One demo shows a corporate-style PPT slide using Alibaba branding alongside the model’s name, suggesting the team is demonstrating enterprise presentation generation as a target use case.

Practical access

The post directs readers to Qwen Chat to try the model directly, selecting “Image Generation” in the interface. No separate API documentation is linked in the excerpt, but the model is also listed on GitHub, Hugging Face, and ModelScope. The announcement does not specify pricing or rate limits.
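For readers who prefer a programmatic route, the Hugging Face listing suggests the usual diffusers workflow. The sketch below is an assumption, not something confirmed by the excerpt: the model id `Qwen/Qwen-Image` and its compatibility with diffusers’ generic `DiffusionPipeline` are inferred from the listings above, and the sampler settings are illustrative defaults.

```python
# Hedged sketch: generating an image from the Hugging Face checkpoint.
# Assumptions (not confirmed in the announcement): the weights are published
# as "Qwen/Qwen-Image" and load through diffusers' generic DiffusionPipeline.

def generate(prompt: str, out_path: str = "qwen_image_demo.png") -> None:
    """Download the checkpoint on first call and render `prompt` to a PNG."""
    # Imports are deferred on purpose: the ~20B-parameter download and a
    # CUDA GPU are only needed when this function actually runs.
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
    )
    pipe.to("cuda")
    image = pipe(prompt=prompt, num_inference_steps=50).images[0]
    image.save(out_path)
```

Calling `generate('A bookstore window display with a book titled "The Night Circus"')` would exercise the text-rendering claim in the spirit of the bookstore demo; if the checkpoint turns out to be gated or named differently, only the `from_pretrained` id needs to change.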

The key claim from the post is that Qwen-Image establishes “a strong foundation model for image generation” that is competitive across both broad visual generation and the narrower but commercially valuable problem of accurate typographic rendering in both Latin and CJK scripts.