DiffUTE: A Universal Multilingual Text Editing Diffusion Model for High-Fidelity Image Text Manipulation
This article presents DiffUTE, an end‑to‑end, self‑supervised, multilingual text‑editing diffusion model. By combining fine‑grained position and glyph guidance with large‑language‑model control, DiffUTE achieves high‑quality, high‑fidelity text modifications in images; its effectiveness is demonstrated through extensive experiments and real‑world deployments at Ant Group.
Ant Group’s security laboratory faces constantly evolving black‑market attacks for which samples are often sparse, leaving traditional risk‑identification models insufficient. To cover new threats quickly, the lab built a security‑AIGC image‑generation platform that both evaluates model defenses and augments them with synthetic data.
In collaboration with Ant Security’s Tianjian and Tiancun labs and Nanjing University, they introduced DiffUTE, the industry’s first end‑to‑end multilingual controllable text‑editing diffusion model, which was accepted at NeurIPS 2023.
Technical Background: Scene text editing modifies text in images for applications such as advertising, image restoration, and film post‑production. Recent AIGC advances (e.g., Stable Diffusion, ControlNet) excel at image‑to‑image tasks but perform poorly on text editing, often producing unreadable characters. Existing works focus on single‑character or low‑resolution English text, failing to meet multilingual, high‑fidelity requirements.
Method: DiffUTE introduces two fine‑grained guides—position guidance using binary masks to isolate text regions, and glyph guidance that feeds character images as additional conditioning to preserve stroke details. Because large labeled datasets are unavailable, a self‑supervised task is devised: randomly select an OCR region, regenerate its glyph in a uniform font, encode it, combine mask, noise, and glyph representations, and train the U‑Net to predict the noise at a given diffusion step.
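The conditioning assembly for one self‑supervised training step can be sketched as follows. This is a minimal, illustrative sketch using NumPy; the function name, channel layout, and noise schedule are our assumptions, not DiffUTE's actual implementation.

```python
import numpy as np

def make_unet_input(latent, mask, glyph_latent, t, alphas_cumprod, rng):
    """Assemble the conditioned U-Net input for one training step (sketch).

    latent:       (C, H, W) VAE latent of the source image
    mask:         (1, H, W) binary mask over the text region (position guidance)
    glyph_latent: (C, H, W) encoding of the glyph re-rendered in a uniform font
    t:            diffusion timestep index
    """
    # Forward diffusion: add noise to the latent at timestep t.
    noise = rng.standard_normal(latent.shape)
    a_bar = alphas_cumprod[t]
    noisy = np.sqrt(a_bar) * latent + np.sqrt(1.0 - a_bar) * noise
    # Blank out the text region so the model must regenerate it,
    # while the background is preserved as context.
    masked_latent = latent * (1.0 - mask)
    # Concatenate all conditioning signals along the channel axis.
    unet_input = np.concatenate([noisy, masked_latent, mask, glyph_latent], axis=0)
    return unet_input, noise  # the U-Net is trained to predict `noise`
```

The training loss would then be a mean‑squared error between the U‑Net's prediction and the returned `noise`, as in standard denoising diffusion training.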
During inference, DiffUTE can be steered by a large language model (LLM) that parses user‑provided edit instructions, extracts the target text and region from OCR results, and feeds them to the diffusion process, enabling natural‑language‑driven image text editing.
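The instruction‑parsing step at inference can be sketched as below. This is a toy stand‑in: DiffUTE uses an actual LLM to interpret the user's instruction, whereas the regex and the function name here are purely illustrative assumptions.

```python
import re

def parse_instruction(instruction, ocr_results):
    """Toy stand-in for the LLM step: extract the replacement text and the
    image region to edit from a natural-language instruction plus OCR output.

    ocr_results: list of (recognized_text, bounding_box) pairs (hypothetical format).
    """
    # Illustrative pattern; a real LLM handles free-form instructions.
    m = re.search(r'replace "(.+?)" with "(.+?)"', instruction)
    if m is None:
        raise ValueError("unsupported instruction")
    old_text, new_text = m.group(1), m.group(2)
    # Find the OCR box whose recognized text matches the text to replace.
    region = next(box for text, box in ocr_results if text == old_text)
    return new_text, region
```

`new_text` would then be rendered in the uniform font to form the glyph guidance, and `region` rasterized into the binary position mask fed to the diffusion process.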
Experiments: Qualitative and quantitative evaluations show that DiffUTE markedly improves text accuracy compared with prior models (as shown in Table 1). Visual results demonstrate that the model preserves the background while generating correctly oriented, high‑quality text (Figure 4). Ablation studies (Table 2, Figure 5) confirm the importance of both position and glyph guidance.
Conclusion and Outlook: DiffUTE achieves high‑fidelity multilingual text editing through fine‑grained control and self‑supervised training, and is already deployed in Ant Group’s credential verification and anti‑fraud services to generate synthetic data for attack evaluation and rapid model adaptation. Future work includes extending support to low‑resource languages and researching Chinese text‑to‑image generation.