How HarmoniCa Boosts Diffusion Model Speed with Joint Training‑Inference Caching
HarmoniCa, a new feature‑caching framework co‑designed by HKUST, Beihang University, and SenseTime, tackles diffusion model inference bottlenecks by aligning training and inference through Step‑Wise Denoising Training and an Image Error Proxy Objective, achieving up to 2× speedup while preserving image quality.
Diffusion Acceleration Challenges
Diffusion Transformers (DiT) are state‑of‑the‑art for high‑resolution image synthesis, but inference is expensive (e.g., generating a 2048×2048 image with PIXART‑α takes ~14 s).
Existing feature‑caching methods suffer from two mismatches:
Pre‑step obliviousness: Training ignores the cache history while inference relies on it.
Misaligned training objective: Training optimizes intermediate noise error, whereas inference cares about final image quality.
HarmoniCa Framework
HarmoniCa introduces a collaborative training‑inference cache acceleration paradigm that aligns the two stages through two mechanisms.
Step‑Wise Denoising Training (SDT)
SDT constructs the full T-step denoising trajectory during training so that training matches inference. A teacher-student setup is used: the student runs with the cache while the teacher provides a cache-free target, and each time-step router is updated independently. Crucially, the student's output at step t becomes its input at step t+1, so every router is trained under the same accumulated cache error it will face at inference.
Effect: mitigates error propagation across steps, improving the clarity and stability of the final image.
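The SDT rollout can be sketched as a toy loop. Everything here is illustrative, not the paper's implementation: `denoise_step` is a stand-in for a DiT forward pass, and `router` holds hypothetical per-step scores (in HarmoniCa these are the learnable routing parameters).

```python
import numpy as np

T = 10  # toy number of denoising steps (real schedules differ)
rng = np.random.default_rng(0)

def denoise_step(x, t, cached_feat=None):
    # Toy stand-in for one DiT forward pass: when cached_feat is
    # given, the block's feature is reused instead of recomputed.
    feat = np.tanh(x + 0.1 * t) if cached_feat is None else cached_feat
    return 0.9 * x + 0.1 * feat, feat

# Hypothetical per-step router scores; in HarmoniCa these are the
# learnable parameters trained by SDT (one decision per time step).
router = np.linspace(0.2, 0.9, T)

def sdt_rollout(x0):
    # Roll out the FULL T-step trajectory, as SDT does: the student's
    # output at step t feeds step t+1, so each step's router is trained
    # under the cache errors accumulated at all earlier steps.
    x_student, x_teacher = x0.copy(), x0.copy()
    cache, losses = None, []
    for t in range(T):
        use_cache = cache is not None and router[t] > 0.5
        x_student, feat = denoise_step(x_student, t,
                                       cache if use_cache else None)
        if not use_cache:
            cache = feat  # cache is refreshed whenever we fully compute
        x_teacher, _ = denoise_step(x_teacher, t)  # cache-free target
        losses.append(float(np.mean((x_student - x_teacher) ** 2)))
    return x_student, losses

x_final, losses = sdt_rollout(rng.normal(size=(4,)))
```

Note how the per-step loss is zero until the first cached step and grows afterwards: exactly the accumulated-error signal that pre-step-oblivious training never sees.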
Image Error Proxy Objective (IEPO)
IEPO shifts the optimization target from intermediate noise to the final image x₀. A proxy term λ(t) estimates the impact of using the cache at step t on the final image; larger λ(t) discourages cache reuse at critical steps. λ(t) is refreshed periodically by regenerating a batch of images, keeping the objective aligned with the current model state.
Result: Enables a controllable trade‑off between image quality and acceleration.
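A minimal sketch of the proxy weighting, under toy assumptions (function names like `toy_generate_final` and `refresh_lambdas` are made up here, not the paper's API): λ(t) is estimated by regenerating images with and without caching at step t and measuring the damage to the final output.

```python
import numpy as np

T = 10  # toy step count

def toy_generate_final(x, cache_step=None):
    # Toy sampler: "caching at step t" replaces step t's freshly
    # computed feature with the one cached at step t-1.
    cached = None
    for t in range(T):
        feat = np.tanh(x + 0.1 * t)
        if t == cache_step and cached is not None:
            feat = cached
        x = 0.9 * x + 0.1 * feat
        cached = feat
    return x

def refresh_lambdas(generate_final, x0):
    # Periodic refresh: regenerate images and measure, for each step t,
    # how much caching at t perturbs the final image x0 -- this proxy
    # lambda(t) keeps the objective aligned with the current model.
    clean = generate_final(x0)
    return np.array([
        float(np.mean((generate_final(x0, cache_step=t) - clean) ** 2))
        for t in range(T)
    ])

def iepo_loss(student_feat, teacher_feat, lam_t):
    # Larger lambda(t) -> cache reuse at step t is penalized harder,
    # steering reuse away from steps critical to final image quality.
    return lam_t * float(np.mean((student_feat - teacher_feat) ** 2))

lams = refresh_lambdas(toy_generate_final, np.linspace(-1.0, 1.0, 4))
```

In this toy, λ(0) is exactly zero (there is nothing to reuse at the first step), while later steps get positive weights reflecting their impact on the final image.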
Experimental Evaluation
Evaluated on two tasks:
Class‑conditional generation with DiT‑XL/2 on ImageNet (256×256).
Text‑to‑image generation with PIXART‑α on COCO at multiple resolutions.
Baselines include Learning‑to‑Cache (LTC), heuristic caches FORA/Δ‑DiT, DDIM step reduction, and model quantization/pruning.
Class-Conditional Generation (DiT-XL/2 256×256)
Under a high-compression setting (10 inference steps), HarmoniCa achieves lower FID and higher IS than LTC while reusing the cache more often (a higher cache ratio) and delivering a higher measured speedup.
Text‑to‑Image Generation (PIXART‑α)
At 2048×2048 resolution, HarmoniCa delivers a 1.69× real‑world speedup and outperforms FORA on CLIP score, FID, and other metrics.
Combination with Quantization
When applied to a 4‑bit quantized PIXART‑α model (EfficientDM), HarmoniCa increases inference speed from 1.18× to 1.85× with only a 0.12 increase in FID, acting as an “acceleration plug‑in” compatible with quantized models.
Overhead Analysis
Training side: HarmoniCa requires no image data; training uses only the model and noise, reducing training time by ~25 % compared to LTC with comparable memory usage.
Inference side: The added router occupies ~0.03 % of parameters and adds <0.001 % of total FLOPs, resulting in a theoretical 2.07× speedup and a measured 1.69× acceleration on PIXART‑α.
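The gap between theoretical and measured speedup follows directly from how many steps reuse the cache. A back-of-the-envelope model (the schedule below is an illustrative example, not the paper's learned schedule):

```python
# Hypothetical learned cache schedule for a 10-step sampler:
# True = reuse cached features (skip most of the DiT forward).
schedule = [False, True, False, True, True, False, True, True, False, True]

def theoretical_speedup(schedule, router_cost=1e-5):
    # A full forward costs 1 unit; a cached step costs ~0.  The router
    # itself is negligible (<0.001% of total FLOPs per the overhead
    # analysis above), modeled here as a tiny constant per step.
    full_cost = float(len(schedule))
    actual_cost = sum(0.0 if reuse else 1.0 for reuse in schedule)
    actual_cost += router_cost * len(schedule)
    return full_cost / actual_cost

print(theoretical_speedup(schedule))  # ~2.5x for this toy schedule
```

Real wall-clock gains (1.69× vs. 2.07× theoretical) are lower because cache reads/writes and non-cacheable ops are not free.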
Conclusion
HarmoniCa synchronizes training and inference via SDT and IEPO, delivering faster inference, higher image quality, lower training barriers, and seamless compatibility with quantization, making it suitable for industrial deployment of high‑resolution diffusion models.
Paper: https://arxiv.org/abs/2410.01723
Code: https://github.com/ModelTC/HarmoniCa