How 100 Samples Let LLMs Master New Domains – The DOMINO Agent Breakthrough

The article explains how the DOMINO method lets large language models learn a domain from just dozens of real examples instead of hand‑written prompts, describes its trainable "domain switch" architecture, and shows experimental gains on time‑varying code tasks, highlighting more robust and diverse data synthesis.

PaperAgent

1. The Problem: "Domain Description" Is Often Undefined

Typical data‑synthesis pipelines (Self‑Instruct, Evol‑Instruct, MAGPIE) rest on one assumption:

"You can first use natural language to clarify the target domain, then fuse that description into prompts for the model to generate data."

In many high‑value scenarios, such as time‑drifting topics, implicit corporate conventions, or mixed‑style domains, no such description can be written down.

When only a few dozen real samples are available, models either overfit (memorizing details) or produce mere synonym rewrites of the references.
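
The copying failure mode described above can be checked with a simple overlap score. The following is a minimal sketch, not something taken from the DOMINO paper: the function names, the choice of word trigrams, and the toy sentences are all assumptions made for illustration. It only catches near-verbatim copies; synonym rewrites would need a semantic similarity measure instead.

```python
def ngrams(text, n=3):
    """Return the set of word n-grams in text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def max_reference_overlap(candidate, references, n=3):
    """Highest fraction of the candidate's n-grams found in any single reference."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    return max(
        (len(cand & ngrams(ref, n)) / len(cand) for ref in references),
        default=0.0,  # no references means nothing to copy from
    )


# Toy data, invented for this example.
references = [
    "write a function that reverses a linked list in place",
    "implement binary search over a sorted array of integers",
]
copied = "write a function that reverses a linked list in place"
novel = "merge two sorted linked lists without allocating new nodes"

print(max_reference_overlap(copied, references))  # 1.0: every trigram is in a reference
print(max_reference_overlap(novel, references))   # 0.0: no trigram is shared
```

A high score across a batch of synthetic samples would suggest the generator is reproducing its references rather than expanding beyond them.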

2. DOMINO’s Approach: A Trainable "Domain Switch"

DOMINO treats the domain as a learnable switch:

The switch is trained on a small set of reference examples.

When activated, the model generates content that more closely follows the domain.

Crucially, it expands beyond the reference examples instead of copying them.

This resembles prompt tuning, but plain prompt tuning has two classic pain points in this setting, which DOMINO targets:

With few samples, the model memorizes their details, so the generated data has low diversity.

What the domain actually requires are the underlying rules shared across samples (question patterns, constraints, common pitfalls), and memorized details do not capture them.

Key intuition: Separate "domain‑wide regularities" from "individual sample noise" so the model learns the core domain while ignoring idiosyncratic details.

Implementation-wise, DOMINO trains two groups of soft tokens: one encoding domain commonalities, the other encoding sample‑specific traits, and uses a contrastive objective to decouple them. After training, only the domain tokens are kept for data synthesis.
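
The two token groups can be pictured with a small data structure. This is a rough sketch under stated assumptions, not the authors' implementation: the class name `SoftPrompt`, the token counts, and the embedding width are hypothetical, and the loss shown is a generic InfoNCE-style contrastive loss standing in for DOMINO's objective, whose exact form (including how anchors, positives, and negatives are chosen) the article does not specify.

```python
import math
import random

EMBED_DIM = 8          # hypothetical embedding width
NUM_DOMAIN_TOKENS = 4  # hypothetical number of shared soft tokens
NUM_SAMPLE_TOKENS = 2  # hypothetical number of per-sample soft tokens


def random_tokens(count, rng):
    """Create `count` soft tokens, each a list of EMBED_DIM small random floats."""
    return [[rng.gauss(0.0, 0.02) for _ in range(EMBED_DIM)] for _ in range(count)]


class SoftPrompt:
    """Shared domain tokens plus one private token group per reference sample."""

    def __init__(self, num_references, seed=0):
        rng = random.Random(seed)
        self.domain_tokens = random_tokens(NUM_DOMAIN_TOKENS, rng)
        self.sample_tokens = [
            random_tokens(NUM_SAMPLE_TOKENS, rng) for _ in range(num_references)
        ]

    def training_prefix(self, ref_index):
        """During training, a reference sees the shared tokens and its own tokens."""
        return self.domain_tokens + self.sample_tokens[ref_index]

    def synthesis_prefix(self):
        """After training, only the domain tokens are kept."""
        return list(self.domain_tokens)


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss: small when anchor is closer to positive than to negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]


prompt = SoftPrompt(num_references=3)
print(len(prompt.training_prefix(0)))  # 6 tokens: 4 shared + 2 sample-specific
print(len(prompt.synthesis_prefix()))  # 4 tokens: shared only
```

In a real system the tokens would be trainable parameters prepended to the input embeddings of a frozen LLM. The sketch only shows which tokens are present at training time versus synthesis time.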

"The model learns a minimal sufficient representation that retains necessary domain information while discarding irrelevant noise, theoretically guaranteeing broader generation diversity."
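
One common way to formalize "minimal sufficient" comes from the information-bottleneck literature, and it may help in reading the quote above. The symbols here are assumptions for illustration and need not match the paper's notation: $X$ denotes the reference samples, $D$ the domain-level information to be preserved, $Z$ the learned representation (the domain tokens), and $I(\cdot\,;\cdot)$ mutual information.

```latex
\begin{aligned}
\text{sufficiency:}\quad & I(Z; D) = I(X; D) \\
\text{minimality:}\quad & Z^{*} = \arg\min_{Z \,:\, I(Z; D) = I(X; D)} I(Z; X)
\end{aligned}
```

Under this reading, sufficiency means $Z$ loses nothing about the domain that the samples contained, while minimality removes everything else the samples carried, which corresponds to the sample-specific noise.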

3. Comparing DOMINO with Traditional Synthesis

Using a cooking analogy:

Traditional synthesis (write prompt): Write a recipe (domain description) and let the chef follow it.

DOMINO (learn from examples): Show the chef a few finished dishes, let them grasp the flavor profile, then create new dishes with the same taste.

Often the "flavor profile" is hard to articulate, yet a few examples convey it instantly.

4. Experimental Validation on Time‑Varying Code Tasks

The authors evaluated DOMINO on a realistic scenario: a code‑generation domain whose characteristics drift over time.

Early samples serve as "reference" to let the model understand the domain.

DOMINO synthesizes a large amount of new training data.

The synthetic data is used to fine‑tune the model.

Performance is measured on later‑time test sets to check whether the model learned domain regularities rather than memorizing old questions.
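
The time-based protocol above depends on a split in which no test item predates the reference set. A minimal sketch of such a split follows. The field names, the cutoff date, and the toy records are hypothetical, and the synthesis and fine-tuning steps are left out because the article does not describe their interfaces.

```python
from datetime import date


def temporal_split(samples, cutoff):
    """Split samples by date: items before `cutoff` become references,
    items on or after `cutoff` become the later-time test set."""
    ordered = sorted(samples, key=lambda s: s["date"])
    reference = [s for s in ordered if s["date"] < cutoff]
    later_test = [s for s in ordered if s["date"] >= cutoff]
    return reference, later_test


# Toy records, invented for this example.
samples = [
    {"id": "q3", "date": date(2024, 9, 1)},
    {"id": "q1", "date": date(2024, 1, 15)},
    {"id": "q2", "date": date(2024, 4, 2)},
    {"id": "q4", "date": date(2024, 11, 20)},
]

reference, later_test = temporal_split(samples, cutoff=date(2024, 6, 1))
print([s["id"] for s in reference])   # ['q1', 'q2']
print([s["id"] for s in later_test])  # ['q3', 'q4']
```

Only `reference` would be shown to the synthesis step, and the fine-tuned model would be scored on `later_test` alone, so memorized old questions cannot raise the score.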

Across several strong code‑LLM backbones, DOMINO consistently improved results. One interesting observation: fine‑tuning a base model on data generated by an instruction‑tuned model sometimes let the base model surpass that instruction‑tuned model on the target domain, which suggests that domain knowledge can be distilled effectively through the synthesized training data.

5. Practical Benefits for Engineering

For domain adaptation or enterprise deployment, DOMINO offers three concrete advantages:

No need to craft a potentially inaccurate textual domain description.

Reduces time spent on prompt engineering.

Works with only a handful of reference examples, even if they are private.

Thus, in domains that are hard to describe but easy to exemplify, the workflow becomes "provide examples instead of writing definitions."

6. Takeaway

"Instead of writing a textual domain description for the model, turn the domain into a small set of examples and let the model infer the underlying rules."

DOMINO aims to make this process reliable by learning shared domain regularities while ignoring sample‑specific noise, producing richer domain‑consistent data rather than mere copies of the references.

Paper: Domain‑Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning
https://arxiv.org/abs/2605.30039
Code: https://github.com/tongye98/DOMINO

Tags: LLM, Domain Adaptation, Prompt Tuning, Data Synthesis, DOMINO, KDD2026