MagicColor: First Multi‑Instance AI Sketch‑Coloring System for Professional‑Grade Comics
MagicColor introduces a novel multi‑instance sketch‑coloring framework that uses a two‑stage self‑play training strategy, instance guidance, and edge‑aware pixel‑level color matching to automatically produce high‑quality, consistent colors for multiple line‑art instances, outperforming prior GAN and diffusion‑based methods.
Problem Definition
Multi‑instance sketch colorization with instance‑level control: given a line‑art image L, a set of reference images R_i and corresponding binary masks M_i, generate a colored output I such that each masked region matches the color palette of its paired reference.
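To make the input/output contract concrete, here is a deliberately naive baseline that only illustrates the task specification: it paints each masked region with the mean color of its paired reference. This is a toy stand-in, not MagicColor's method (which uses a diffusion model), and all names here are illustrative.

```python
import numpy as np

def naive_palette_transfer(line_art, references, masks):
    """Toy baseline for the task spec: fill each masked region of the
    line art L with the mean color of its paired reference R_i.
    Illustrates the contract only; MagicColor replaces this with a
    conditioned diffusion model."""
    if line_art.ndim == 2:  # grayscale line art -> 3-channel canvas
        out = np.repeat(line_art[..., None], 3, axis=-1).astype(np.float32)
    else:
        out = line_art.astype(np.float32)
    for ref, mask in zip(references, masks):
        mean_color = ref.reshape(-1, 3).mean(axis=0)  # reference palette proxy
        out[mask.astype(bool)] = mean_color           # paint masked region M_i
    return out.astype(np.uint8)
```

A real system must of course preserve the line art's structure inside each region rather than flat-fill it; that is exactly what the diffusion backbone and instance guidance below provide.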
Method Overview
MagicColor builds on a pretrained Stable Diffusion 1.5 UNet and introduces three mechanisms that enable a single forward pass to color multiple instances.
Two‑stage self‑play training: Stage 1 learns single‑reference coloring; Stage 2 adds multiple references using SAM‑extracted instances and random augmentations, mitigating the scarcity of paired multi‑instance data.
Instance Guidance Module: For each mask, ROI features are extracted from DINOv2 feature maps, 10% of spatial tokens are randomly dropped and replaced with the global DINO embedding, and the result is reshaped into a latent control tensor injected into the diffusion process.
Edge‑aware color matching loss: Combines a VGG‑based perceptual loss, a re‑weighted edge loss that emphasizes high‑frequency boundaries, and cosine‑similarity‑based pixel‑wise color matching between reference and source features.
Architecture Details
Backbone: Stable Diffusion 1.5 UNet. A Reference Net encodes each reference image with DINOv2 followed by a small feed‑forward projector. Sketch Guider and Instance Guider are implemented as ControlNet adapters initialized from public weights. The Instance Encoder shares the DINOv2 backbone.
Instance Control Module
For each instance mask M_i, compute its bounding box, sample a dense grid of DINOv2 features inside the box (ROI‑Align‑like), randomly drop 10 % of tokens, replace them with the global DINO embedding, and reshape the result to a latent tensor C_i (shape C×H×W). C_i is concatenated with the diffusion timestep embedding and added to the UNet conditioning, providing instance‑level guidance.
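The token preparation described above can be sketched as follows. Shapes, the grid size, and nearest-neighbor resampling (standing in for ROI-Align) are assumptions for illustration; only the crop-sample-drop-replace-reshape sequence comes from the paper's description.

```python
import numpy as np

def instance_control_tensor(feat_map, box, global_emb, drop_ratio=0.1,
                            grid=8, rng=None):
    """Sketch of instance guidance: crop DINOv2 features inside the
    instance bounding box, resample to a fixed grid (ROI-Align-like,
    here nearest-neighbor), drop `drop_ratio` of spatial tokens and
    replace them with the global DINO embedding, then reshape into the
    latent control tensor C_i."""
    rng = np.random.default_rng() if rng is None else rng
    y0, x0, y1, x1 = box
    roi = feat_map[y0:y1, x0:x1]                          # (h, w, C)
    ys = np.linspace(0, roi.shape[0] - 1, grid).round().astype(int)
    xs = np.linspace(0, roi.shape[1] - 1, grid).round().astype(int)
    tokens = roi[np.ix_(ys, xs)].reshape(grid * grid, -1)  # (grid^2, C)
    n_drop = int(round(drop_ratio * len(tokens)))
    drop_idx = rng.choice(len(tokens), size=n_drop, replace=False)
    tokens[drop_idx] = global_emb    # replaced with global embedding, not zeroed
    return tokens.reshape(grid, grid, -1)                  # C_i as (H, W, C)
```

The random replacement acts as a regularizer: the model cannot rely on any single spatial token, so it learns to use both local detail and the instance's global appearance.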
Edge Loss and Color Matching
Standard diffusion training uses pixel‑wise MSE. MagicColor adds:
Perceptual loss on VGG‑19 conv layers.
Edge loss that weights pixels by an edge map derived from a Sobel filter on the ground‑truth image, encouraging accurate reconstruction of object boundaries.
Color matching loss : For each pixel, compute cosine similarity between dense DINO features of the reference and source images; a nearest‑neighbor search yields a correspondence map that enforces color consistency.
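A minimal NumPy sketch of the two auxiliary terms, under stated assumptions: a grayscale ground truth for the Sobel map, a [1, 2] weight range for the edge re-weighting, and nearest-neighbor matching over L2-normalized feature tokens. The exact weighting scheme and feature resolution in the paper may differ.

```python
import numpy as np

def sobel_edge_weights(gt):
    """Edge map from Sobel gradients on a grayscale ground truth,
    rescaled to [1, 2] so boundary pixels count up to twice as much."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    ky = kx.T
    h, w = gt.shape
    pad = np.pad(gt.astype(float), 1, mode="edge")
    gx = sum(kx[i, j] * pad[i:i + h, j:j + w] for i in range(3) for j in range(3))
    gy = sum(ky[i, j] * pad[i:i + h, j:j + w] for i in range(3) for j in range(3))
    mag = np.hypot(gx, gy)
    return 1.0 + mag / (mag.max() + 1e-8)

def edge_weighted_mse(pred, gt):
    """Reconstruction loss re-weighted toward high-frequency boundaries."""
    return float((sobel_edge_weights(gt) * (pred - gt) ** 2).mean())

def color_matching_loss(ref_feats, src_feats):
    """Cosine-similarity matching between dense feature tokens: each
    source token is matched to its nearest reference token; the loss is
    1 minus that best similarity, averaged over source tokens."""
    ref = ref_feats / (np.linalg.norm(ref_feats, axis=-1, keepdims=True) + 1e-8)
    src = src_feats / (np.linalg.norm(src_feats, axis=-1, keepdims=True) + 1e-8)
    sim = src @ ref.T                      # (N_src, N_ref) cosine similarities
    return float((1.0 - sim.max(axis=1)).mean())
```

The VGG-19 perceptual term is omitted here since it requires pretrained weights; it would be added to these two terms with its own scalar weight.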
Training Procedure
Datasets: anime video frames from Sakuga and ATD‑12K plus manually curated image pairs. 1,000 pairs are held out for testing; the remainder form the training set. Images are resized to 512 px (aspect ratio preserved). Training runs for 100k optimization steps with batch size 1 and learning rate 2e‑5, on two NVIDIA A800 80 GB GPUs for ~7 days. Data augmentation includes random horizontal flip and brightness jitter.
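The two-stage self-play curriculum can be sketched as a reference-sampling schedule. The 50/50 stage split, the function names, and the flip-only augmentation are assumptions for illustration; the paper specifies only single-reference training in Stage 1 and SAM-extracted, randomly augmented multi-reference training in Stage 2.

```python
import numpy as np

TOTAL_STEPS = 100_000    # from the paper's training budget
STAGE1_STEPS = 50_000    # assumed split between the two stages

def references_for_step(step, image, sam_instances, rng):
    """Stage 1: a single reference (the image itself, self-play style).
    Stage 2: a random subset of SAM-extracted instances, each randomly
    augmented (here only a horizontal flip, as a stand-in)."""
    if step < STAGE1_STEPS:
        return [image]
    k = int(rng.integers(1, len(sam_instances) + 1))
    picks = rng.choice(len(sam_instances), size=k, replace=False)
    return [inst[:, ::-1] if rng.random() < 0.5 else inst
            for inst in (sam_instances[i] for i in picks)]
```

The point of the curriculum is data efficiency: Stage 2 never needs human-paired multi-instance data, because instances segmented out of a frame are, by construction, correctly colored references for that same frame.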
Experimental Results
Qualitative Comparison
Compared with the GAN‑based RSIC and SGA, and the diffusion‑based AnimeDiffusion, ColorizeDiffusion, and MangaNinja, MagicColor better preserves instance‑level color correspondence, especially in small details, thanks to its explicit instance guidance and edge‑aware loss.
Quantitative Evaluation
On the held‑out test set MagicColor achieves lower FID, higher PSNR and SSIM, and lower LPIPS than all baselines, indicating superior visual similarity and structural fidelity (see Table 1 in the paper).
Ablation Study
Edge loss removed → degraded edge preservation, visible color bleed at object boundaries.
Color matching removed → washed‑out colors and loss of reference palette.
Instance guider removed → failure to transfer instance‑level colors, increased noise.
User Study
Sixteen participants used a demo UI to color sketches. Across four criteria (image quality, usability, reference similarity, result relevance) MagicColor received the highest average scores; average task time exceeded 20 minutes per sketch.
Limitations
Extreme occlusions can hinder accurate color transfer.
Heavy overlap among objects reduces semantic awareness.
Memory and compute scale roughly linearly with the number of reference instances.
Conclusion
MagicColor shows that a diffusion‑based framework equipped with a two‑stage self‑play curriculum, instance guidance, and edge‑aware color matching can achieve high‑fidelity multi‑instance sketch colorization with fine‑grained control. The code and model weights will be released to foster further research.
References
[1] MagicColor: Multi‑Instance Sketch Colorization. arXiv:2503.16948. https://arxiv.org/pdf/2503.16948
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
