Arc2Face: Identity‑Conditioned Face Generation Model Delivering High‑Consistency, High‑Quality AI Portraits

Arc2Face is an identity‑conditioned foundation model for face synthesis. It projects ArcFace embeddings into the CLIP space of a fine‑tuned Stable Diffusion model and is trained on up‑sampled WebFace42M plus high‑quality FFHQ/CelebA‑HQ data. Quantitative and visual experiments show higher facial similarity and consistency than existing methods such as FaceSwap and InstantID.

AIWalker

Jan 11, 2025

Introduction

Arc2Face is a foundation model for face synthesis that conditions generation on a person’s ArcFace embedding. Given only this ID vector, with no text prompt or reference image, the model produces diverse, photorealistic images whose facial similarity to the target identity exceeds that of existing approaches such as FaceSwap and InstantID.

Method

The backbone is the pre‑trained Stable Diffusion v1‑5 model. The ArcFace embedding is placed inside a fixed (frozen) prompt and passed through the text encoder, which projects it into the CLIP latent space, so cross‑attention is steered by identity alone rather than by a free‑form textual prompt. The encoder and UNet are jointly fine‑tuned on a massive face‑recognition (FR) dataset, then further refined on high‑quality datasets (FFHQ, CelebA‑HQ) to improve visual fidelity.
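
To make the cross‑attention step concrete, here is a minimal sketch in plain Python. It is not the model's implementation: real diffusion UNets use learned query, key, and value projections and multiple heads, whereas this toy version assumes identity projections and a single head. The names `cross_attention`, `queries`, and `context` are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head attention with identity projections (a simplifying assumption).

    queries: one vector per image location (stand-in for UNet features).
    context: one vector per conditioning token (stand-in for encoder output).
    Returns one attended vector per query.
    """
    dim = len(context[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score of this location against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        # Weighted sum of the token vectors.
        outputs.append(
            [sum(w * k[i] for w, k in zip(weights, context)) for i in range(dim)]
        )
    return outputs


# Toy example: 2 image locations, 3 conditioning tokens, width 4.
queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
context = [[4.0, 0.0, 0.0, 0.0], [0.0, 4.0, 0.0, 0.0], [0.0, 0.0, 4.0, 0.0]]
attended = cross_attention(queries, context)
```

Each image location pulls most strongly from the conditioning token it matches best, which is how the encoder output can steer generation once it carries the identity information.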

Identity Conditioning

To align the ID vector with the CLIP space, a simple prompt “a photo of a <id> person” is used, where <id> is a placeholder token. After tokenization, the placeholder's token embedding is replaced by the ArcFace vector, zero‑padded to match the token‑embedding dimension, and the full sequence of N tokens is fed to the encoder τ, which maps it into CLIP space. Because the rest of the prompt never changes, the ID vector is the only varying input, so the conditioning carries identity information and no unrelated context.
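
The substitution step can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes a 512‑dimensional ArcFace vector zero‑padded to the 768‑dimensional token width of the Stable Diffusion v1‑5 text encoder and a sequence length of N = 77. The placeholder position (`PLACEHOLDER_INDEX`) and the helper names are hypothetical.

```python
ARCFACE_DIM = 512       # width of an ArcFace embedding
TOKEN_DIM = 768         # token-embedding width of the SD v1-5 text encoder
SEQ_LEN = 77            # N: padded prompt length used by the text encoder
PLACEHOLDER_INDEX = 5   # hypothetical position of <id> after the start token


def zero_pad(vector, target_dim):
    """Append zeros so the vector reaches target_dim entries."""
    if len(vector) > target_dim:
        raise ValueError("vector is longer than the target dimension")
    return list(vector) + [0.0] * (target_dim - len(vector))


def inject_identity(token_embeddings, placeholder_index, id_vector):
    """Return a copy of the sequence with the placeholder slot replaced."""
    width = len(token_embeddings[placeholder_index])
    sequence = [list(row) for row in token_embeddings]
    sequence[placeholder_index] = zero_pad(id_vector, width)
    return sequence


# Stand-in values; real embeddings come from the tokenizer and ArcFace.
prompt_embeddings = [[0.1] * TOKEN_DIM for _ in range(SEQ_LEN)]
id_vector = [0.5] * ARCFACE_DIM

conditioned = inject_identity(prompt_embeddings, PLACEHOLDER_INDEX, id_vector)
```

The resulting sequence keeps its original shape, so it can be passed to the encoder exactly like an ordinary prompt; only the placeholder slot differs from one identity to the next.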

Dataset

Training starts from WebFace42M, a large‑scale FR dataset. Its low‑resolution images are up‑sampled four‑fold to 448×448 using GFPGAN v1.4, then filtered and cropped, yielding roughly 21 million images of about 1 million identities. Because the up‑sampled data still contain tightly cropped faces, the model is subsequently fine‑tuned on FFHQ and CelebA‑HQ, which provide more loosely framed, high‑resolution faces. The final model outputs 512×512 images, aligned with FFHQ resolution.
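
The figures quoted above imply a few quantities that are easy to check. The short calculation below uses only the numbers in this section; the variable names are illustrative.

```python
UPSCALE_FACTOR = 4          # four-fold up-sampling
UPSAMPLED_SIDE = 448        # side length after up-sampling, in pixels
NUM_IMAGES = 21_000_000     # approximate image count after filtering
NUM_IDENTITIES = 1_000_000  # approximate identity count

# Side length of the images before up-sampling.
source_side = UPSAMPLED_SIDE // UPSCALE_FACTOR

# Up-sampling each side four-fold multiplies the pixel count by 16.
pixel_gain = UPSCALE_FACTOR ** 2

# Average number of training images available per identity.
images_per_identity = NUM_IMAGES / NUM_IDENTITIES

print(source_side, pixel_gain, images_per_identity)
```

So the source crops are 112×112, and each identity contributes about 21 images on average, which gives the model many views of the same face during the first training stage.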

Experiments

Quantitative Comparison

Arc2Face is evaluated against several synthesis methods by measuring FR model accuracy on the generated images. The results show a clear advantage over competing pipelines, with higher identity preservation scores across all benchmarks.
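
The article does not spell out how identity preservation is scored. One common choice, assumed here purely for illustration, is the cosine similarity between the face‑recognition embedding of the reference identity and the embeddings of the generated images. The function names below are hypothetical, and the three‑dimensional vectors stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_identity_similarity(reference, generated):
    """Average similarity of each generated embedding to the reference."""
    return sum(cosine_similarity(reference, g) for g in generated) / len(generated)


# Toy embeddings: one generated sample matches the reference, one is orthogonal.
reference = [1.0, 0.0, 0.0]
generated = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
score = mean_identity_similarity(reference, generated)
```

A score close to 1 means the generated faces sit near the reference identity in embedding space, while a score near 0 means the identity was largely lost.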

Visual Comparison

Side‑by‑side visual samples illustrate that Arc2Face maintains consistent identity while producing richer textures and fewer artifacts than other methods.

FR Accuracy Across Synthesis Techniques

A bar chart compares the FR model’s identification accuracy on images generated by different pipelines, confirming that Arc2Face achieves the highest scores.

Overall, the paper demonstrates that projecting ArcFace embeddings into CLIP space and fine‑tuning Stable Diffusion yields a powerful, identity‑preserving face generator that outperforms prior art both quantitatively and qualitatively.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
