Arc2Face: Identity‑Conditioned Face Generation Model Delivering High‑Consistency, High‑Quality AI Portraits
Arc2Face is an identity‑conditioned foundation model for face synthesis. It projects ArcFace embeddings into the CLIP space of a fine‑tuned Stable Diffusion model and is trained on up‑sampled WebFace42M plus high‑quality FFHQ/CelebA‑HQ data. Extensive quantitative and visual experiments show far higher facial similarity and consistency than existing methods such as FaceSwap and InstantID.
Introduction
Arc2Face is a foundation model for face synthesis that conditions generation on a person’s ArcFace embedding. By feeding only the ID vector, the model produces diverse, photorealistic images whose facial similarity far exceeds that of existing approaches such as FaceSwap and InstantID.
Method
The backbone is the pre‑trained Stable Diffusion v1‑5 model. ArcFace embeddings are passed through the text encoder inside a fixed (frozen) prompt and thereby projected into the CLIP latent space, so cross‑attention is controlled by the identity alone, with no user‑written text prompt. The encoder and UNet are jointly fine‑tuned on a large‑scale face‑recognition (FR) dataset, then further refined on high‑quality datasets (FFHQ, CelebA‑HQ) to improve visual fidelity.
Identity Conditioning
To align the ID vector with the CLIP space, the ArcFace embedding takes the place of a placeholder token <id> in a simple fixed prompt, “a photo of a <id> person”. After tokenization, the placeholder's token embedding is replaced by the ArcFace vector, zero‑padded to the token‑embedding width. The sequence, padded to the maximum token length N, is then fed to the text encoder τ, which maps it into CLIP space. Because the surrounding prompt never changes, the attention mechanism is driven by the ID vector alone rather than by unrelated context.
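The placeholder substitution can be illustrated with a minimal sketch. It is not the authors' implementation: the sizes are toy values, `toy_token_embedding` is a hypothetical stand-in for a learned embedding table, whitespace splitting stands in for a real tokenizer (which would also add start/end tokens), and it assumes the ID vector is zero-padded to the token-embedding width while the sequence is padded to N rows.

```python
from typing import List

EMBED_DIM = 8    # toy stand-in for the token-embedding width (assumption)
MAX_TOKENS = 10  # toy stand-in for the maximum token length N (assumption)
PROMPT = "a photo of a <id> person"
PLACEHOLDER = "<id>"


def toy_token_embedding(token: str) -> List[float]:
    """Hypothetical lookup; a real model uses a learned embedding table."""
    return [float((len(token) + i) % 5) for i in range(EMBED_DIM)]


def zero_pad(vector: List[float], width: int) -> List[float]:
    """Right-pad a vector with zeros up to the requested width."""
    if len(vector) > width:
        raise ValueError("vector is wider than the target width")
    return list(vector) + [0.0] * (width - len(vector))


def build_id_sequence(id_vector: List[float]) -> List[List[float]]:
    """Embed the fixed prompt, swapping the placeholder row for the ID vector."""
    tokens = PROMPT.split()  # simplification: whitespace split, no special tokens
    if len(tokens) > MAX_TOKENS:
        raise ValueError("prompt is longer than MAX_TOKENS")
    rows = []
    for token in tokens:
        if token == PLACEHOLDER:
            rows.append(zero_pad(id_vector, EMBED_DIM))
        else:
            rows.append(toy_token_embedding(token))
    # Pad the sequence itself to the fixed length expected by the encoder.
    rows.extend([0.0] * EMBED_DIM for _ in range(MAX_TOKENS - len(rows)))
    return rows


id_vector = [0.5, -0.25, 0.75, 0.1]  # toy identity vector
sequence = build_id_sequence(id_vector)
placeholder_index = PROMPT.split().index(PLACEHOLDER)
```

Only the row at `placeholder_index` depends on the identity; every other row is identical across inputs, which is why the conditioning signal comes from the ID vector alone.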
Dataset
Training starts from WebFace42M, a large‑scale FR dataset. Low‑resolution images are up‑sampled four‑fold to 448×448 using GFPGAN v1.4, then filtered and cropped, yielding roughly 21 million images for about 1 million identities. Because the up‑sampled data still contain tightly cropped faces, the model is subsequently fine‑tuned on FFHQ and CelebA‑HQ, which provide more loosely framed, high‑resolution faces, resulting in a final 512×512 output aligned with FFHQ resolution.
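The figures in this section can be sanity-checked with a few lines of arithmetic. The snippet below only restates the numbers quoted above; the derived values (source side length, images per identity, pixel-count gain) are simple consequences of them rather than additional reported results.

```python
UPSCALE_FACTOR = 4        # four-fold up-sampling
UPSAMPLED_SIDE = 448      # side length after up-sampling, in pixels
NUM_IMAGES = 21_000_000   # roughly 21 million images after filtering
NUM_IDENTITIES = 1_000_000  # about 1 million identities

# Side length of the images before up-sampling.
source_side = UPSAMPLED_SIDE // UPSCALE_FACTOR

# Average number of images available per identity.
images_per_identity = NUM_IMAGES / NUM_IDENTITIES

# A four-fold increase per side multiplies the pixel count by 4 squared.
pixel_gain = UPSCALE_FACTOR ** 2

print(f"source side: {source_side}px, "
      f"images per identity: {images_per_identity:.0f}, "
      f"pixel gain: {pixel_gain}x")
```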
Experiments
Quantitative Comparison
Arc2Face is evaluated against several synthesis methods by measuring FR model accuracy on the generated images. The results show a clear advantage over competing pipelines, with higher identity preservation scores across all benchmarks.
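The summary does not spell out the exact metric, so the following is only an assumption about how identity preservation is commonly scored: cosine similarity between the FR embedding of a reference photo and the embeddings of generated images, averaged over the samples. The function names and toy vectors are illustrative.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_identity_similarity(
    reference: Sequence[float], generated: Sequence[Sequence[float]]
) -> float:
    """Average similarity between one reference and many generated samples."""
    if not generated:
        raise ValueError("need at least one generated embedding")
    return sum(cosine_similarity(reference, g) for g in generated) / len(generated)


# Toy embeddings: one sample matches the reference, one is orthogonal to it.
reference = [1.0, 0.0, 0.0]
generated = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
score = mean_identity_similarity(reference, generated)
```

A score near 1 means the generated faces embed close to the reference identity; comparing this average across pipelines gives one simple ranking of identity preservation.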
Visual Comparison
Side‑by‑side visual samples illustrate that Arc2Face maintains consistent identity while producing richer textures and fewer artifacts than other methods.
FR Accuracy Across Synthesis Techniques
A bar chart compares the FR model’s identification accuracy on images generated by different pipelines, confirming that Arc2Face achieves the highest scores.
Overall, the paper demonstrates that projecting ArcFace embeddings into CLIP space and fine‑tuning Stable Diffusion yields a powerful, identity‑preserving face generator that outperforms prior art both quantitatively and qualitatively.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.