How SICOG Enables Self‑Evolving Multimodal Models with Zero‑Label Data

The paper introduces SICOG, a three‑stage collaborative framework that combines post‑training enhancement, inference optimization, and re‑pretraining in a self‑generated data loop, allowing large multimodal models to improve continuously without massive human‑annotated datasets, and demonstrates consistent gains across a dozen benchmarks.


Background

Multimodal large models rely on massive high‑quality image‑text pairs, but such data are becoming scarce. Researchers have warned that the traditional pre‑training paradigm will soon reach its limits.

SICOG Framework

Self‑Improving Systematic Cognition (SICOG) introduces a closed‑loop, three‑stage pipeline:

Post‑training enhancement: fine‑tune the base model on a small set of high‑quality labeled multimodal data to instill systematic cognition and basic reasoning.

Inference optimization: run the model on large unlabeled multimodal corpora, apply a self‑consistency voting mechanism to select high‑confidence answers, and generate pseudo‑labels.

Re‑pretraining reinforcement: feed the filtered pseudo‑labeled data back into the pre‑training stage, continuously refining the model (a minimal sketch of the full loop follows this list).
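
To make the loop concrete, here is a minimal Python sketch of one SICOG round. The helpers `fine_tune`, `generate`, and `pretrain` are hypothetical stand‑ins for the real training stack, and the agreement threshold is an assumed value, not a number from the paper; only the self‑consistency vote is spelled out.

```python
# Minimal sketch of one SICOG round. fine_tune / generate / pretrain are
# hypothetical stand-ins for the real training stack (not the paper's API).
from collections import Counter

def self_consistency_vote(model, image, question, n_samples=8, min_agreement=0.6):
    """Sample several answers; keep the majority answer if agreement is high."""
    answers = [generate(model, image, question, temperature=0.7)
               for _ in range(n_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    confidence = count / n_samples  # fraction of samples that agree
    return (answer, confidence) if confidence >= min_agreement else (None, confidence)

def sicog_round(model, seed_labeled_data, unlabeled_corpus):
    # Stage 1: post-training enhancement on a small labeled seed set.
    model = fine_tune(model, seed_labeled_data)

    # Stage 2: inference optimization -- pseudo-label unlabeled examples,
    # keeping only high-confidence, self-consistent answers.
    pseudo_labeled = []
    for image, question in unlabeled_corpus:
        answer, conf = self_consistency_vote(model, image, question)
        if answer is not None:
            pseudo_labeled.append((image, question, answer))

    # Stage 3: re-pretraining on the filtered self-generated data.
    model = pretrain(model, pseudo_labeled)
    return model
```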

Structured reasoning mechanisms

Chain‑of‑Description (CoD): a five‑step visual parsing process (subject → fine‑grained details → relational attributes → background → integrated description) that yields a coherent image caption.

Chain‑of‑Thought (CoT): a task‑driven multi‑step reasoning chain (goal clarification → key information extraction → logical analysis → answer synthesis) for complex multimodal problems such as geometry (illustrative prompt templates follow below).
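
As an illustration, both mechanisms reduce to structured prompt templates. The step order matches the lists above; the exact wording below is an assumption, not the paper's prompts.

```python
# Illustrative prompt templates for SICOG's two structured-reasoning
# mechanisms. Step order follows the paper; the wording is assumed.
COD_PROMPT = (
    "Describe the image step by step:\n"
    "1. Identify the main subject.\n"
    "2. Note fine-grained details.\n"
    "3. Describe relational attributes (spatial layout, interactions).\n"
    "4. Describe the background.\n"
    "5. Integrate everything into one coherent description."
)

COT_PROMPT = (
    "Solve the problem step by step:\n"
    "1. Clarify the goal of the question.\n"
    "2. Extract the key information from the image and text.\n"
    "3. Analyze the problem logically.\n"
    "4. Synthesize the final answer."
)
```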

Key contributions

Eliminates reliance on large manually annotated datasets by using self‑generated data.

Enables a lifelong‑learning style where the model continuously evolves, replacing the static “train‑once‑use‑forever” paradigm.

Integrates perception and reasoning during pre‑training, reducing hallucinations.

Experimental evaluation

Evaluated on 12 multimodal benchmarks (including ScienceQA, POPE, and various VQA suites). Results:

Average improvement of 2 %–4 % over the baseline.

Significant gains on multi‑step reasoning tasks (e.g., ScienceQA).

Hallucination error reduction of 1 %–2 % on POPE.

Scaling experiments show steady performance growth as the amount of synthetic data increases from 118 k to 213 k samples (a power‑law fit sketch follows below).
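
One standard way to check such a trend is a power‑law fit in log‑log space. The sketch below is generic, and the numbers in the usage example are placeholders for illustration, not measurements from the paper.

```python
# Generic power-law fit: score ≈ a * size**b, linear in log-log space.
import numpy as np

def fit_power_law(sizes, scores):
    """Return (a, b) for score ≈ a * size**b via a log-log linear fit."""
    b, log_a = np.polyfit(np.log(sizes), np.log(scores), 1)
    return np.exp(log_a), b

# Usage with PLACEHOLDER numbers (not the paper's measurements):
sizes = [118_000, 150_000, 180_000, 213_000]
scores = [60.5, 61.3, 61.9, 62.6]
a, b = fit_power_law(sizes, scores)
print(f"fitted exponent b ≈ {b:.3f}")
```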

Additional findings

Stronger base models (e.g., LLaVA‑Qwen2‑7B‑UHD) benefit more from the self‑evolution loop, achieving roughly 50 % larger improvement than weaker models such as LLaVA‑Llama3.1‑8B‑UHD.

Synthetic data generated by SICOG follow the same scaling law as real data, confirming their effectiveness.

A variant that replaces the first‑stage supervised fine‑tuning with preference learning (RLHF‑style) yields better generalization on complex tasks, supporting the claim that preference‑based fine‑tuning can surpass standard supervised fine‑tuning (a sketch of one such objective follows below).
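
For context, one common form of RLHF‑style preference learning is Direct Preference Optimization (DPO). Whether SICOG's variant uses exactly this objective is an assumption, so the sketch below shows the generic loss rather than the paper's method.

```python
# Generic DPO-style preference loss over sequence log-probabilities.
# One standard preference-learning objective, not necessarily the exact
# variant evaluated in the SICOG paper.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Penalize the policy when its reference-relative log-prob margin for
    the preferred response over the dispreferred one is small."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```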

Future directions

Extending SICOG with environment feedback (e.g., embodied agents) and continuous optimization mechanisms to achieve truly autonomous lifelong‑learning agents that can identify knowledge gaps and adapt learning strategies on the fly.

Reference

Paper: https://arxiv.org/abs/2503.12303v5