DreamO: Multi‑Condition Image Customization with a 400M Flux‑Based Model

DreamO, a collaboration between ByteDance and Peking University, is a unified 400M‑parameter framework built on Flux‑1.0‑dev. A single model controls identity, style, appearance, and virtual try‑on at the same time. It is open source, inexpensive to run, and fast, and its image customization results are presented as comparable to those of commercial large models.

AI Frontier Lectures

Method Overview

DreamO builds on the Flux‑1.0‑dev architecture. A shared VAE encodes each conditional image into a latent representation; these latents are flattened into token sequences and concatenated with the text and image tokens before being fed into the Flux transformer. Dedicated mapping layers process the conditional image inputs, and learnable condition embeddings (CE) and index embeddings (IE) are added to the conditional latents so the transformer can distinguish them from the image being generated and from one another. The model is fine‑tuned with low‑rank adaptation (LoRA) modules, which lets a single network handle multi‑condition generation.
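
The token layout described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not DreamO's actual code: the function names (`tag_condition_tokens`, `build_sequence`) are hypothetical, tokens are plain lists of floats rather than tensors, and the sketch assumes one condition embedding shared by all conditional images plus one index embedding per conditional image.

```python
from typing import List, Sequence

Token = List[float]  # one latent token, kept as a plain feature vector


def add_vectors(a: Sequence[float], b: Sequence[float]) -> Token:
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def tag_condition_tokens(
    cond_tokens: Sequence[Sequence[float]],
    condition_embedding: Sequence[float],
    index_embedding: Sequence[float],
) -> List[Token]:
    """Add the condition embedding (CE) and this image's index embedding (IE)
    to every token of one conditional image (assumed scheme)."""
    offset = add_vectors(condition_embedding, index_embedding)
    return [add_vectors(token, offset) for token in cond_tokens]


def build_sequence(
    text_tokens: Sequence[Sequence[float]],
    image_tokens: Sequence[Sequence[float]],
    conditions: Sequence[Sequence[Sequence[float]]],
    condition_embedding: Sequence[float],
    index_embeddings: Sequence[Sequence[float]],
) -> List[Token]:
    """Concatenate text tokens, image tokens, and tagged condition tokens
    into the single sequence a transformer would consume."""
    sequence = [list(t) for t in text_tokens] + [list(t) for t in image_tokens]
    for i, cond_tokens in enumerate(conditions):
        sequence.extend(
            tag_condition_tokens(cond_tokens, condition_embedding, index_embeddings[i])
        )
    return sequence


if __name__ == "__main__":
    text = [[0.0, 0.0], [0.0, 0.0]]
    image = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
    conds = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0]]]  # two conditional images
    ce = [0.5, 0.5]
    ie = [[0.0, 0.0], [0.0, 1.0]]
    seq = build_sequence(text, image, conds, ce, ie)
    print(len(seq), seq[-1])
```

Because the two conditional images receive different index embeddings, identical latent tokens from different references end up as different vectors in the sequence, which is what allows the network to tell the references apart.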

Progressive Training Strategy

Training on the full dataset at once makes convergence difficult: the limited parameter capacity cannot absorb every task‑specific ability at the same time, and low‑quality training images degrade generation quality. DreamO therefore adopts a two‑stage progressive strategy. Stage 1 optimizes the model on subject‑driven data (including the Subjects200K dataset) to establish strong identity consistency. Stage 2 expands training to the full dataset to acquire diverse task capabilities. A subsequent image‑quality‑optimization phase mitigates the impact of low‑quality samples: about 40K high‑quality images are generated with Flux, heavily degraded copies of them are used as the reference input, and the original images serve as reconstruction targets, which aligns the model's output with Flux's high‑quality prior.
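
One way to picture the image‑quality phase is as a pair‑building step. The sketch below reads the degraded copy as the reference input and the clean sample as the target, and it assumes that "degrading" means randomly dropping most of the reference tokens; both the function names and the `keep_ratio` value are hypothetical choices for illustration, not figures taken from this article.

```python
import random
from typing import List, Sequence, Tuple

Token = List[float]


def degrade_reference(
    tokens: Sequence[Token], keep_ratio: float, rng: random.Random
) -> List[Token]:
    """Keep a small random subset of tokens, preserving their original order."""
    n_keep = max(1, round(len(tokens) * keep_ratio))
    kept_indices = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept_indices]


def make_quality_pair(
    tokens: Sequence[Token], keep_ratio: float = 0.05, seed: int = 0
) -> Tuple[List[Token], List[Token]]:
    """Return (degraded reference, clean target) for one high-quality sample."""
    rng = random.Random(seed)
    reference = degrade_reference(tokens, keep_ratio, rng)
    target = [list(t) for t in tokens]  # the clean sample is the reconstruction target
    return reference, target


if __name__ == "__main__":
    sample = [[float(i)] for i in range(100)]  # stand-in for one image's latent tokens
    reference, target = make_quality_pair(sample)
    print(len(reference), len(target))
```

With so little of the reference left, the model cannot simply copy it and has to rely on its own generative prior to reconstruct the clean target, which is the intuition behind aligning the output with the high‑quality samples.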

Routing Constraint for Reference Images

Within the DiT architecture, a routing constraint limits the region of the generated image that a reference image can influence. Cross‑attention maps between the conditional image and the generated result are averaged over the conditional‑image dimension to obtain a global similarity response, that is, one value per position of the generated image. During training, the mask of the target object in the generated image serves as the ground truth for this response, which constrains the similarity to the region where the object appears. Concentrating the similarity response in this way improves fidelity.
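
The routing constraint can be written down as a small loss function. The following is a minimal sketch under assumptions: similarity is taken as a scaled dot product between condition features and image features (no softmax), the loss is a mean squared error against the object mask, and the names `similarity_response` and `routing_loss` are hypothetical. The real model computes this on attention features inside the transformer.

```python
import math
from typing import List, Sequence


def similarity_response(
    cond_feats: Sequence[Sequence[float]], image_feats: Sequence[Sequence[float]]
) -> List[float]:
    """For each image token, average its scaled dot-product similarity
    over all condition tokens, giving one response value per image position."""
    scale = 1.0 / math.sqrt(len(cond_feats[0]))
    response = []
    for img in image_feats:
        sims = [sum(c * x for c, x in zip(cond, img)) * scale for cond in cond_feats]
        response.append(sum(sims) / len(sims))
    return response


def routing_loss(response: Sequence[float], target_mask: Sequence[float]) -> float:
    """Mean squared error between the similarity response and the object mask
    (1.0 where the target object is, 0.0 elsewhere)."""
    return sum((r - m) ** 2 for r, m in zip(response, target_mask)) / len(response)


if __name__ == "__main__":
    cond = [[2.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]]   # two condition tokens
    image = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # two image positions
    response = similarity_response(cond, image)
    print(response)
    print(routing_loss(response, [1.0, 0.0]))  # response already inside the mask
    print(routing_loss(response, [0.0, 1.0]))  # response in the wrong region
```

The loss is zero when the reference only "lights up" positions inside the object mask and grows as similarity leaks to other regions, so minimizing it pushes each reference toward its own object.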

Dataset Construction

A large‑scale multi‑task dataset is assembled covering style transfer, single‑subject preservation, multi‑subject preservation, single‑ID preservation, multi‑ID preservation, ID‑style transfer, and virtual try‑on. Detailed statistics are provided in the technical report.

Results

With a single model, DreamO handles a range of customization tasks, including identity preservation, style transfer, and virtual try‑on, and produces high‑quality results in 8–10 seconds per image. In the qualitative comparisons, its consistency matches or exceeds that of commercial large models, while the model remains open source, cheaper, and faster.

Resources

Paper: https://arxiv.org/pdf/2504.16915

Project page: https://mc-e.github.io/project/DreamO/

Code repository: https://github.com/bytedance/DreamO

Hugging Face demo: https://huggingface.co/spaces/ByteDance/DreamO
