Can Human Feedback Make Advertising Image Generation Reliable? Introducing RFNet
This paper presents a multimodal Reliable Feedback Network (RFNet) and a consistency regularization method that use human feedback to automatically evaluate and fine‑tune diffusion models, dramatically increasing the usable rate of e‑commerce advertising images while preserving visual quality.
Background and Motivation
Attractive advertising images are crucial for e‑commerce success, yet manual design is costly. Recent advances combine Stable Diffusion with ControlNet to generate product‑centric images, but many outputs suffer from spatial mismatches, low prominence, or hallucinated shapes, requiring extensive human review.
Reliable Feedback Network (RFNet)
To replace manual inspection, the authors propose RFNet, a multimodal network that predicts the usability of generated ads. RFNet integrates visual, textual, and product‑specific cues to detect issues such as size errors or misleading backgrounds. By feeding RFNet scores back into the generation loop (looped generation), the system repeatedly samples until a usable image is produced.
The looped generation process is illustrated with pseudocode (omitted here) and a diagram of RFNet’s architecture.
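As a rough illustration of the looped generation idea described above, the following minimal sketch keeps sampling with a fresh seed until a scorer accepts the image or an attempt budget runs out. The names `looped_generation`, `generate`, `score_usability`, `threshold`, and `max_attempts` are hypothetical, and the two stand-in functions only imitate a diffusion model and RFNet; none of this is the authors' pseudocode.

```python
import random
from typing import Callable, Optional, Tuple


def looped_generation(
    generate: Callable[[int], object],
    score_usability: Callable[[object], float],
    threshold: float = 0.5,
    max_attempts: int = 10,
    base_seed: int = 0,
) -> Tuple[Optional[object], int]:
    """Sample until the scorer accepts an image or the budget is exhausted.

    Returns (image, attempts_used); image is None if no sample passed.
    """
    for attempt in range(1, max_attempts + 1):
        image = generate(base_seed + attempt)  # a new seed on every attempt
        if score_usability(image) >= threshold:
            return image, attempt
    return None, max_attempts


# Stand-ins for the real models (assumptions, for illustration only).
def fake_generate(seed: int) -> dict:
    rng = random.Random(seed)
    return {"seed": seed, "quality": rng.random()}


def fake_rfnet(image: dict) -> float:
    # Pretend the usability score is simply the stored quality value.
    return image["quality"]


image, attempts = looped_generation(
    fake_generate, fake_rfnet, threshold=0.8, max_attempts=50
)
print(image, attempts)
```

The attempt cap matters in practice: without it, a product that the generator handles poorly could keep the loop running indefinitely.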
Human Feedback and Consistency Regularization
Because repeated sampling is inefficient, the authors fine‑tune the diffusion model directly, in the style of reinforcement learning from human feedback (RLHF). RFNet’s output is treated as a proxy for human judgment, and its gradients are back‑propagated to the generator; this fine‑tuning procedure is called RFFT. To avoid degrading visual aesthetics, a KL‑divergence loss keeps the fine‑tuned model’s distribution close to the original, while a novel condition‑consistency loss L_CC preserves the text‑condition direction during optimization.
The combined loss is L_total = L_KL + L_CC, where L_KL prevents drift from the pretrained distribution and L_CC ensures that improving usability does not alter the intended textual prompt.
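The summary does not give the exact form of either term, so the sketch below is only one plausible reading. It assumes that L_KL can be approximated by the mean squared difference between the fine‑tuned and original noise predictions (the KL divergence between two Gaussians with equal variance is proportional to the squared distance between their means), and that the "text‑condition direction" is the difference between conditional and unconditional noise predictions, compared by cosine similarity. All function names are hypothetical, and the RFNet usability signal itself is not shown.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def kl_proxy(eps_tuned: Vector, eps_ref: Vector) -> float:
    """Mean squared difference between two noise predictions (assumed L_KL form)."""
    return sum((a - b) ** 2 for a, b in zip(eps_tuned, eps_ref)) / len(eps_ref)


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def condition_consistency(
    tuned_cond: Vector, tuned_uncond: Vector, ref_cond: Vector, ref_uncond: Vector
) -> float:
    """1 - cosine between the tuned and original condition directions (assumed L_CC form)."""
    d_tuned = [c - u for c, u in zip(tuned_cond, tuned_uncond)]
    d_ref = [c - u for c, u in zip(ref_cond, ref_uncond)]
    return 1.0 - cosine(d_tuned, d_ref)


def total_loss(
    tuned_cond: Vector, tuned_uncond: Vector, ref_cond: Vector, ref_uncond: Vector
) -> float:
    """L_total = L_KL + L_CC, using the assumed forms above."""
    l_kl = kl_proxy(tuned_cond, ref_cond)
    l_cc = condition_consistency(tuned_cond, tuned_uncond, ref_cond, ref_uncond)
    return l_kl + l_cc


# Toy 2-D "noise predictions": the tuned direction is orthogonal to the original.
print(total_loss([1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]))
```

Under these assumptions, both terms vanish when the fine‑tuned model reproduces the original predictions, and the consistency term grows as the condition direction rotates away from the original one.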
Experiments
Dataset: The authors construct RF1M, a collection of over one million advertising images annotated with human usability labels, used to train RFNet.
Advertising Image Audit Performance: Table 1 shows RFNet outperforms baselines on all metrics (AP, recall, etc.). Component ablation (Table 2) confirms each modality contributes significantly.
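For readers unfamiliar with the audit metrics, here is a small sketch of how average precision (AP) and recall can be computed for a binary usability classifier. It uses the standard non‑interpolated definition of AP; the labels and scores are made‑up toy values, the convention that 1 is the positive class is an assumption, and the exact evaluation protocol behind Table 1 is not specified in this summary.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Non-interpolated AP: mean of precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / hits if hits else 0.0


def recall_at_threshold(
    labels: Sequence[int], scores: Sequence[float], threshold: float
) -> float:
    """Fraction of positives whose score reaches the threshold."""
    positives = sum(labels)
    true_positives = sum(
        1 for label, score in zip(labels, scores) if label == 1 and score >= threshold
    )
    return true_positives / positives if positives else 0.0


# Toy data (illustrative only): 1 marks the positive class.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
print(average_precision(toy_labels, toy_scores))         # (1/1 + 2/3) / 2
print(recall_at_threshold(toy_labels, toy_scores, 0.75))  # 1 of 2 positives
```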
Reliability: Table 3 reports higher usable rates for RFFT compared with competing methods. Both automatic “Ava” scores and human “Human Ava” scores follow the same upward trend, indicating that RFNet’s judgments align with human feedback. Looped generation (RG) reduces the number of attempts needed, cutting production time.
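The link between usable rate and the number of attempts can be made concrete with a short calculation. Assuming each sample is usable independently with probability p (a simplification, not a claim from the paper), the number of attempts until the first usable image follows a geometric distribution with mean 1/p. The rates below are illustrative values, not figures from Table 3, and the function names are hypothetical.

```python
def expected_attempts(usable_rate: float) -> float:
    """Mean number of samples until the first usable one (geometric distribution)."""
    if not 0.0 < usable_rate <= 1.0:
        raise ValueError("usable_rate must be in (0, 1]")
    return 1.0 / usable_rate


def success_within(usable_rate: float, max_attempts: int) -> float:
    """Probability that at least one of max_attempts samples is usable."""
    return 1.0 - (1.0 - usable_rate) ** max_attempts


# Illustrative rates only: doubling the usable rate halves the expected attempts.
print(expected_attempts(0.25))  # 4.0
print(expected_attempts(0.5))   # 2.0
print(success_within(0.5, 3))   # 0.875
```

Under this simplified model, any gain in usable rate from fine‑tuning translates directly into fewer generation attempts per accepted image.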
Aesthetic Quality: Despite higher usability, the proposed consistency constraint maintains visual quality comparable to the original model, as shown in Figure 6.
Qualitative Comparison: Figure 7 presents example images where the method improves usability and efficiency while preserving appearance.
Generalization: The fine‑tuned ControlNet is evaluated with various LoRA adapters and diffusion model weights. Table 4 indicates a substantial increase in usable rates across all configurations, confirming the approach’s flexibility.
Conclusion
The integration of RFNet with looped generation and the consistency‑regularized fine‑tuning (RFFT) offers a reliable, efficient pipeline for advertising image generation, achieving higher usable rates without sacrificing aesthetics.
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.
