DreamO: Multi‑Condition Image Customization with a 400M Flux‑Based Model
DreamO, a collaboration between ByteDance and Peking University, is a unified image customization framework built on Flux‑1.0‑dev with roughly 400M trainable parameters. A single model controls identity, style, appearance, and virtual try‑on, alone or in combination, and is positioned as an open‑source, low‑cost, fast alternative with results comparable to commercial large models.
Method Overview
DreamO builds on the Flux‑1.0‑dev architecture. A shared VAE encodes conditional images into latent representations, which are then serialized and concatenated with text and image tokens before being fed into the Flux transformer. Dedicated mapping layers process the conditional image inputs, and learnable condition embeddings (CE) and index embeddings (IE) are added to the latent variables. Low‑rank adaptation (LoRA) modules fine‑tune the model, enabling multi‑condition generation within a single network.
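The token assembly described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the DreamO implementation: the widths, the single shared CE vector, the per-slot IE table, and the names `build_sequence` and `W_map` are all hypothetical, and random arrays stand in for the VAE latents and the learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 64         # transformer width (hypothetical, far smaller than the real model)
LATENT_C = 32  # channels of a patchified VAE latent token (hypothetical)
MAX_CONDS = 4  # maximum number of reference images (hypothetical)

# Learnable in a real model; random placeholders here.
condition_embedding = rng.normal(size=(D,))        # CE: marks tokens as condition tokens
index_embedding = rng.normal(size=(MAX_CONDS, D))  # IE: tells reference images apart


def build_sequence(text_tokens, noisy_tokens, cond_latents, W_map):
    """Concatenate text, noisy-image, and condition tokens into one sequence."""
    cond_tokens = []
    for i, latent in enumerate(cond_latents):
        tokens = latent @ W_map                # mapping layer for condition inputs
        tokens = tokens + condition_embedding  # add CE to every condition token
        tokens = tokens + index_embedding[i]   # add the IE of reference slot i
        cond_tokens.append(tokens)
    return np.concatenate([text_tokens, noisy_tokens, *cond_tokens], axis=0)


text = rng.normal(size=(8, D))     # 8 text tokens
noisy = rng.normal(size=(16, D))   # 16 tokens of the image being generated
conds = [rng.normal(size=(16, LATENT_C)), rng.normal(size=(9, LATENT_C))]
W_map = rng.normal(size=(LATENT_C, D))

seq = build_sequence(text, noisy, conds, W_map)
print(seq.shape)  # (49, 64): 8 text + 16 image + 16 + 9 condition tokens
```

Because every condition ends up as extra tokens in the same sequence, adding a second or third reference image only lengthens the input; the transformer itself is unchanged.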
Progressive Training Strategy
Training on the full dataset from the start makes convergence difficult: the limited trainable parameter budget struggles to capture every task‑specific ability at once, and low‑quality training images degrade generation quality. DreamO therefore adopts a two‑stage progressive strategy, followed by a quality‑tuning phase. Stage 1 optimizes the model on subject‑driven data (including the roughly 200K‑sample Subjects200K dataset) to establish strong identity consistency. Stage 2 expands training to the full dataset to acquire the remaining task capabilities. To offset the low‑quality samples, the image‑quality phase generates about 40K high‑quality samples with Flux and trains the model to reconstruct each one from a heavily degraded copy of itself supplied as the reference, which pulls the model's output back toward Flux's high‑quality prior.
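One way to picture the progression is as a schedule that switches the active data mix at fixed step boundaries. The sketch below is illustrative only: the stage names, dataset labels, and step counts are hypothetical placeholders, since the article does not state the actual training lengths.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    datasets: tuple
    steps: int  # hypothetical length, not taken from the paper


SCHEDULE = (
    Stage("subject_warmup", ("subject_driven",), 1000),
    Stage("full_data", ("subject_driven", "identity", "style", "try_on"), 3000),
    Stage("quality_tuning", ("flux_generated",), 500),
)


def stage_at(step, schedule=SCHEDULE):
    """Return the stage that is active at a global training step."""
    boundary = 0
    for stage in schedule:
        boundary += stage.steps
        if step < boundary:
            return stage
    return schedule[-1]  # past the end: stay in the final stage


for step in (0, 999, 1000, 4200):
    print(step, stage_at(step).name)
```

A training loop would call `stage_at(step)` before drawing each batch and sample only from the datasets listed for that stage.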
Routing Constraint for Reference Images
Within the DiT architecture, a routing constraint precisely limits the influence region of reference images. Cross‑attention maps between the conditional image and the generated result are averaged over the conditional image dimension to obtain a global similarity response. During training, the mask of the target object in the generated image serves as ground‑truth, constraining the similarity response to the appropriate region. This concentrates the similarity response and improves fidelity.
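As a rough sketch of this constraint, the function below computes a scaled dot‑product similarity between condition tokens and generated‑image tokens, averages it over the condition dimension, and penalizes the squared difference from the object mask. The function name, the tensor shapes, and the use of raw (unnormalized) similarities with a mean‑squared error are assumptions made for illustration; the technical report defines the exact formulation.

```python
import numpy as np


def routing_loss(cond_tokens, image_tokens, target_mask):
    """Mean-squared error between the averaged similarity response and a mask.

    cond_tokens:  (n_cond, d) features of the reference-image tokens
    image_tokens: (n_img, d)  features of the generated-image tokens
    target_mask:  (n_img,)    1 where the target object lies, 0 elsewhere
    """
    d = cond_tokens.shape[-1]
    sim = cond_tokens @ image_tokens.T / np.sqrt(d)  # (n_cond, n_img)
    response = sim.mean(axis=0)                      # average over condition tokens
    loss = float(np.mean((response - target_mask) ** 2))
    return loss, response


# Toy case: only the first two image positions resemble the reference.
cond = np.array([[2.0, 0.0, 0.0, 0.0]] * 3)
image = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])

loss_good, response = routing_loss(cond, image, np.array([1.0, 1.0, 0.0, 0.0]))
loss_bad, _ = routing_loss(cond, image, np.array([0.0, 0.0, 1.0, 1.0]))
print(response)             # [1. 1. 0. 0.]
print(loss_good, loss_bad)  # 0.0 1.0
```

The loss is zero when the response is concentrated exactly on the masked region and grows as the reference image's influence leaks into other positions.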
Dataset Construction
A large‑scale multi‑task dataset is assembled covering style transfer, single‑subject preservation, multi‑subject preservation, single‑ID preservation, multi‑ID preservation, ID‑style transfer, and virtual try‑on. Detailed statistics are provided in the technical report.
Results
DreamO performs a range of customization tasks, including identity preservation, style transfer, and virtual try‑on, with a single model, producing high‑quality results in 8–10 seconds per image. Qualitative comparisons indicate that the model matches or exceeds commercial large models in consistency while being open‑source, cheaper, and faster.
Resources
Paper: https://arxiv.org/pdf/2504.16915
Project page: https://mc-e.github.io/project/DreamO/
Code repository: https://github.com/bytedance/DreamO
Hugging Face demo: https://huggingface.co/spaces/ByteDance/DreamO
