
Ele.me Vertical Business AIGC Image Model: Architecture, Training Pipeline, and Evaluation

Ele.me built a domain-specific AIGC image model from scratch on its own data: a DiT backbone, a three-stage training pipeline (transformer pre-training, prompt alignment, aesthetic fine-tuning), custom T5 + E‑CLIP text and visual encoders, and ControlNet for layout control. Evaluated with FID, CLIP alignment scores, and a human rubric, the model powers automated dish-image generation and UI asset creation for Ele.me's vertical business.

Ele.me Technology

Introduction – This article describes the development of an AIGC (AI‑generated content) image model for Ele.me's vertical business scenarios. The model is trained from scratch on Ele.me's own data using the latest DiT architecture and natively supports image prompts, in the spirit of "one image is worth a thousand words." It has been applied across various domains such as intelligent UI assets for search‑push, merchant‑side dish‑image generation tools, and automated visual material production.

1. Background & Pain Points

Since the release of DALL·E (2021) and Stable Diffusion 1.5 (2022), text‑to‑image generation has become a hot research area, and visual AIGC models have dramatically changed how visual content is produced. Within Ele.me, visual content is needed in many scenarios (merchant side, search‑push, marketing, etc.), especially for dish images, which dominate the vertical domain. The sheer volume and long‑tail distribution of dish images are the key challenges for deploying AIGC here.

2. Self‑Developed AIGC Model

2.1 Training Process Overview – The training pipeline follows a progressive three‑stage approach:

Stage 1: Transformer Pre‑train – learns basic pixel distribution and semantic relations of food categories with low cost.

Stage 2: Prompt Condition Alignment – aligns text prompts and image prompts.

Stage 3: Aesthetic Finetune – uses high‑quality image data to improve visual aesthetics.
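As a sketch, the progressive schedule can be expressed as an ordered list of stage configurations run sequentially, each stage resuming from the previous checkpoint. The stage names follow the article; the goal strings paraphrase it, and the learning rates are purely illustrative assumptions, not published values.

```python
# Hypothetical sketch of the three-stage progressive training schedule.
# Stage names follow the article; learning rates are illustrative only.

STAGES = [
    {"name": "transformer_pretrain",
     "goal": "basic pixel distribution and food-category semantics",
     "lr": 1e-4},
    {"name": "prompt_condition_alignment",
     "goal": "align text prompts and image prompts",
     "lr": 5e-5},
    {"name": "aesthetic_finetune",
     "goal": "improve visual aesthetics on curated high-quality data",
     "lr": 1e-5},
]

def run_pipeline(stages, train_stage):
    """Run the stages in order, carrying the checkpoint forward."""
    checkpoint = None
    for stage in stages:
        checkpoint = train_stage(stage, checkpoint)
    return checkpoint
```

The point of the structure is that each stage is cheap relative to a monolithic run: the expensive pre-train is done once at low cost per sample, and only small curated data is used at the end.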

2.2 Model Architecture

The backbone is based on the DiT architecture, extended with both image and text conditioning. The text encoder combines a pretrained T5 encoder and a self‑developed E‑CLIP encoder to enhance domain‑specific textual understanding. The visual encoder uses an E‑CLIP image encoder trained on Ele.me’s domain data, followed by a projection layer and an image‑multi‑head cross‑attention layer. Classic LLM components such as RMSNorm, SwiGLU, RoPE, and QK‑norm are incorporated for training stability and speed.
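Two of the LLM components mentioned, RMSNorm and SwiGLU, can be sketched in a few lines of NumPy. The production model would use framework modules (e.g. PyTorch); these reference implementations only illustrate the math.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: rescale by the root-mean-square of the features. Unlike
    # LayerNorm there is no mean subtraction, which is cheaper and tends
    # to be more stable in deep transformer stacks.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: a SiLU-gated linear unit in place of the
    # classic two-layer MLP with GELU.
    silu = lambda z: z / (1.0 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down
```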

2.2.1 Text Encoder – Utilises a pretrained T5 encoder for general semantic understanding and a custom E‑CLIP encoder to capture dish‑specific information. Their embeddings are concatenated and projected before entering the cross‑attention denoising network.
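A hedged sketch of the fusion step, assuming each encoder's tokens are first projected to the shared model width and then concatenated along the sequence axis; the article only says the embeddings are "concatenated and projected", so the concatenation axis and the per-encoder projections are assumptions.

```python
import numpy as np

def fuse_text_features(t5_seq, eclip_seq, w_t5, w_eclip):
    # Project each encoder's token sequence to the denoiser's model width,
    # then concatenate along the sequence axis so the cross-attention
    # layers can attend over both general (T5) and domain-specific
    # (E-CLIP) text tokens. Layout is an assumption, not the published one.
    t5_proj = t5_seq @ w_t5           # (L1, d_model)
    eclip_proj = eclip_seq @ w_eclip  # (L2, d_model)
    return np.concatenate([t5_proj, eclip_proj], axis=0)  # (L1 + L2, d_model)
```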

2.2.2 Visual Encoder – The E‑CLIP image encoder extracts visual semantics; a small trainable projection layer converts these features into a sequence compatible with the DiT blocks, and an image‑multi‑head cross‑attention layer enables interaction with text features.
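A minimal NumPy sketch of the image cross-attention step: latent tokens attend over projected E‑CLIP image tokens, with QK-norm (L2-normalising queries and keys per head) applied since the article lists it among the stabilisation tricks. Head count, shapes, and the exact placement of QK-norm inside this layer are assumptions.

```python
import numpy as np

def l2norm(x, eps=1e-6):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def image_cross_attention(x, img_tokens, w_q, w_k, w_v, w_o, n_heads):
    # Latent tokens x (L, d) attend over image tokens (S, d).
    L, d = x.shape
    S = img_tokens.shape[0]
    hd = d // n_heads
    q = (x @ w_q).reshape(L, n_heads, hd)
    k = (img_tokens @ w_k).reshape(S, n_heads, hd)
    v = (img_tokens @ w_v).reshape(S, n_heads, hd)
    q, k = l2norm(q), l2norm(k)  # QK-norm for training stability
    att = np.einsum("lhd,shd->hls", q, k) / np.sqrt(hd)
    att = np.exp(att - att.max(axis=-1, keepdims=True))
    att = att / att.sum(axis=-1, keepdims=True)  # softmax over image tokens
    out = np.einsum("hls,shd->lhd", att, v).reshape(L, d)
    return out @ w_o
```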

2.3 ControlNet

After achieving basic text‑to‑image generation, a ControlNet structure is added to provide fine‑grained control over layout, dish shape, and plating. Models for Canny, depth, and HED conditions are trained, as well as a ControlNet‑inpainting model for localized re‑painting of dishes.
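The key ControlNet property, that zero-initialised projections make the control branch a no-op at initialisation so the frozen backbone's behaviour is preserved, can be sketched generically (this is the standard ControlNet-style residual injection, not Ele.me's exact implementation).

```python
import numpy as np

def controlnet_inject(backbone_feats, control_feats, zero_proj):
    # Each control feature passes through a zero-initialised projection
    # before being added to the corresponding frozen-backbone feature map.
    # With all-zero weights the sum is identical to the base model's
    # features, so fine-tuning starts from the original text-to-image
    # behaviour and gradually learns the Canny/depth/HED conditioning.
    return [h + c @ w for h, c, w in zip(backbone_feats, control_feats, zero_proj)]
```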

3. Model Capability Evaluation

A domain‑specific dish evaluation dataset was built for assessment.

3.1 Objective Metrics – Focus on FID (comparing generated images with human‑matched ground‑truth dishes) and CLIP Alignment scores.
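The two objective metrics can be sketched directly from their definitions. This is a generic reference implementation (using the symmetrised form of the FID covariance term to keep the matrix square root symmetric PSD), not the team's evaluation code.

```python
import numpy as np

def _psd_sqrt(m):
    # Matrix square root of a symmetric PSD matrix via eigendecomposition.
    w, v = np.linalg.eigh(m)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def fid(mu1, sigma1, mu2, sigma2):
    # Frechet distance between Gaussians fitted to feature activations:
    # ||mu1 - mu2||^2 + Tr(S1) + Tr(S2) - 2 Tr((S2^{1/2} S1 S2^{1/2})^{1/2})
    # (equivalent to the usual Tr((S1 S2)^{1/2}) term, but symmetrised).
    diff = mu1 - mu2
    s2h = _psd_sqrt(sigma2)
    tr_cov = np.trace(_psd_sqrt(s2h @ sigma1 @ s2h))
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_cov)

def clip_alignment(img_emb, txt_emb):
    # CLIP alignment score: cosine similarity of L2-normalised embeddings.
    a = img_emb / np.linalg.norm(img_emb)
    b = txt_emb / np.linalg.norm(txt_emb)
    return float(a @ b)
```

A lower FID against the human-matched ground-truth set indicates the generated dishes match the real distribution; a higher CLIP alignment score indicates the image matches its prompt.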

3.2 Subjective Evaluation – A custom AIGC dish‑evaluation rubric was designed for human rating, reflecting Ele.me’s practical use cases.

The work was contributed by Luo Te, Ke Lai, Qing Chang, Mo Li, Cai Ying, and Xuan Dong.

Tags: multimodal, Image Generation, AIGC, ControlNet, DiT, Ele.me