Breaking the Binary: FlexIP Enables Both Identity Preservation and Personalized Editing
FlexIP introduces a dual‑adapter architecture and a dynamic weight‑gating mechanism that decouple identity preservation from personalized editing, allowing continuous control over image generation and outperforming prior SOTA methods in both fidelity and flexibility.
Highlights
Dual-Adapter Decoupling: Explicitly separates a Preservation Adapter (identity retention) from a Personalization Adapter (creative editing), avoiding feature competition.
Dynamic Weight Gating: Continuously balances the two adapters with a tunable parameter, breaking the traditional "either-or" limitation.
Modality-Aware Training: Adapts adapter weights to static images versus video frames, strengthening identity locking for images and temporal deformation for video.
Problem Statement
Existing zero-shot subject-driven image generation models struggle to achieve high-fidelity identity preservation and diverse personalized editing simultaneously: they typically force a trade-off between the two goals, suffer from insufficient cross-modal alignment, and offer only coarse, binary control over editing intensity.
Proposed Solution
The FlexIP framework addresses these issues with three core components:
Preservation Adapter: Captures fine-grained identity cues using learnable queries and the global CLIP-[CLS] embedding, combined via cross-attention over DINO block features.
Personalization Adapter: Conditions the diffusion model on text embeddings that are re-sampled to attend to the CLIP-[CLS] token, ensuring edits respect the subject's visual identity.
Dynamic Weight Gating (DWG): During inference, a continuous gate adjusts the contribution of the two adapters based on a user-controlled scalar, enabling smooth transitions from strong identity retention to high stylistic diversity.
Preservation Adapter Details
The adapter first learns query vectors that adapt to diverse subjects, then augments them with the global CLIP‑[CLS] embedding, which provides stable high‑level semantics. The two representations are concatenated (⊕) and fed into a cross‑attention module that re‑samples both sources, yielding a composite identity feature P.
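A minimal PyTorch sketch of that flow follows; the class name, dimensions, and initialization are illustrative assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn

class PreservationAdapter(nn.Module):
    """Sketch of the preservation branch: learnable queries, concatenated
    with the global CLIP-[CLS] token, cross-attend over DINO patch
    features to produce the identity feature P. Names and dimensions are
    assumptions, not the released implementation."""

    def __init__(self, dim: int = 768, num_queries: int = 16, num_heads: int = 8):
        super().__init__()
        # Learnable queries that adapt to diverse subjects.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, clip_cls: torch.Tensor, dino_feats: torch.Tensor) -> torch.Tensor:
        # clip_cls:   (B, 1, dim) global CLIP-[CLS] embedding (stable semantics)
        # dino_feats: (B, N, dim) patch features from DINO blocks (fine details)
        q = self.queries.unsqueeze(0).expand(clip_cls.shape[0], -1, -1)
        q = torch.cat([q, clip_cls], dim=1)  # the ⊕ concatenation from the text
        # Cross-attention re-samples both sources into identity tokens P.
        P, _ = self.cross_attn(q, dino_feats, dino_feats)
        return self.norm(P)
```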
Personalization Adapter Details
Standard Stable Diffusion conditions the UNet on text embeddings alone, which lack explicit visual grounding. FlexIP injects the CLIP-[CLS] token into the text query, anchoring the language guidance in the subject's visual context and producing edits that are both semantically aligned and identity-consistent.
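A hedged sketch of this re-sampling step, with illustrative module and argument names:

```python
import torch
import torch.nn as nn

class PersonalizationAdapter(nn.Module):
    """Sketch of the personalization branch: prompt embeddings act as
    queries that attend to the CLIP-[CLS] token, so the language guidance
    stays anchored to the subject's visual identity. Names are
    illustrative, not the authors' code."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_emb: torch.Tensor, clip_cls: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, T, dim) prompt embeddings from the text encoder
        # clip_cls: (B, 1, dim) global visual token of the subject image
        anchored, _ = self.cross_attn(text_emb, clip_cls, clip_cls)
        # Residual connection keeps the prompt's semantics while injecting
        # visual grounding; the result conditions the UNet cross-attention.
        return self.norm(text_emb + anchored)
```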
Dynamic Weight Gating Mechanism
DWG computes a gate value g based on the data modality (image or video) and a learnable scalar. The final adapter output is a weighted sum: output = g · P + (1 − g) · S, where P is the preservation output and S is the personalization output. For image-centric training, g is biased toward P, preserving fine details; for video-centric training, g favors S, encouraging temporal diversity.
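A sketch of the gating logic is below. Only the blending formula comes from the text; the sigmoid parameterization and bias values are assumptions:

```python
import torch
import torch.nn as nn

class DynamicWeightGate(nn.Module):
    """Sketch of DWG following the formula above: output = g·P + (1−g)·S.
    The modality biases and sigmoid are guesses; the paper specifies a
    learnable scalar plus modality-aware weighting."""

    def __init__(self, image_bias: float = 2.0, video_bias: float = -2.0):
        super().__init__()
        self.scale = nn.Parameter(torch.zeros(1))  # learnable scalar
        self.image_bias = image_bias   # biases g toward P on static images
        self.video_bias = video_bias   # biases g toward S on video frames

    def forward(self, P, S, is_video: bool = False, user_g=None):
        if user_g is not None:  # inference: user-controlled, continuous in [0, 1]
            g = torch.as_tensor(user_g, device=P.device, dtype=P.dtype)
        else:                   # training: gate adapts to the data modality
            bias = self.video_bias if is_video else self.image_bias
            g = torch.sigmoid(self.scale + bias)
        return g * P + (1.0 - g) * S
```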
Experiments
Training Data
FlexIP is trained on 1.23 M varied samples and 11 M invariant images, covering faces, scenes, virtual try‑ons, human actions, and multi‑view objects. Video frames are re‑sampled to maintain a 1:1 ratio with static images. Text prompts for each video frame are generated by Qwen2‑VL to improve instruction fidelity.
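A toy sketch of the 1:1 balancing step; the concrete sampling strategy shown (uniform, without replacement) is an assumption:

```python
import random

def balance_modalities(static_images, video_frames, seed=0):
    """Sketch of the 1:1 modality balancing described above: subsample
    video frames so each epoch sees equal counts of both modalities."""
    rng = random.Random(seed)
    k = min(len(static_images), len(video_frames))
    return list(static_images)[:k] + rng.sample(list(video_frames), k)
```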
Evaluation Datasets & Metrics
Benchmarks are drawn from DreamBench+ and MSBench (187 subjects, 9 prompts each, 10 generations per prompt → 16 830 images). Metrics include DINO‑I and CLIP‑I for identity similarity, CLIP‑T for personalization alignment, CLIP‑IQA and CLIP‑Aesthetic for image quality, and a composite mean rank (mRank).
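For concreteness, here is how such embedding-similarity metrics are conventionally computed; this reflects common practice, not the benchmark's exact code:

```python
import torch.nn.functional as F

def identity_score(gen_emb, ref_emb):
    """CLIP-I/DINO-I style score: mean cosine similarity between the
    embeddings of generated images and the reference subject image.
    CLIP-T pairs the image embedding with the prompt's text embedding
    the same way. Encoder choice and pooling are assumptions."""
    return F.cosine_similarity(gen_emb, ref_emb, dim=-1).mean()
```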
Quantitative Comparison
FlexIP leads the baselines on nearly every metric: CLIP-I reaches 0.873 and DINO-I 0.739, while CLIP-IQA and CLIP-Aesthetic reach 0.598 and 6.039, respectively. λ-Eclipse scores slightly higher on CLIP-T, but only by sacrificing identity preservation.
Human Evaluation
33 sample sets were shown to 60 participants, who selected the images that best matched the text prompt (Flex) and best preserved subject identity (ID-Pres). FlexIP received the highest preference in both categories.
Qualitative Comparison
Visual side‑by‑side comparisons with five state‑of‑the‑art methods demonstrate FlexIP’s superior fidelity, editability, and consistent identity across diverse prompts.
Ablation Study
Varying the weight of the preservation versus personalization adapters shows a smooth trade‑off: increasing the preservation weight yields near‑perfect identity reconstruction, while boosting the personalization weight enables stronger stylistic transformations without abrupt artifacts.
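A hypothetical inference-time sweep illustrating this trade-off; `pipeline` is a stand-in callable, not a released FlexIP API:

```python
def gate_sweep(pipeline, prompt, subject_image,
               gates=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Illustrative ablation loop: fix the prompt and subject, then sweep
    the user-controlled gate g. g→1 biases toward identity reconstruction,
    g→0 toward stylistic freedom; intermediate values interpolate smoothly."""
    return [pipeline(prompt=prompt, subject_image=subject_image, gate=g)
            for g in gates]
```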
Conclusion
FlexIP provides a flexible framework for image synthesis that decouples identity preservation from personalized editing. Its dual‑adapter design captures both high‑level semantics and low‑level details, while the dynamic weight‑gating mechanism transforms the traditional binary trade‑off into a continuous control surface, delivering robust and controllable subject‑driven generation.
References
[1] FlexIP: Dynamic Control of Preservation and Personality for Customized Image Generation
