Tagged articles

Data Synthesis

21 articles · Page 1 of 1

Jun 7, 2026 · Artificial Intelligence

How 100 Samples Let LLMs Master New Domains – The DOMINO Agent Breakthrough

The article explains how the DOMINO method lets large language models learn a domain from just dozens of real examples instead of hand‑written prompts, describes its trainable "domain switch" architecture, and shows experimental gains on time‑varying code tasks, highlighting more robust and diverse data synthesis.

DOMINO · Data Synthesis · Domain Adaptation

0 likes · 8 min read

Machine Learning Algorithms & Natural Language Processing

May 28, 2026 · Artificial Intelligence

Synthesizing Agentic Factual SFT/Mid‑train Data: Query Filtering, Trajectory Generation, and Tool Usage

The article outlines a practical pipeline for creating agentic factual SFT and mid‑train datasets, covering how to define training goals, filter and classify queries, label processing tags, format trajectory samples, differentiate SFT from mid‑train data, and avoid common pitfalls when generating evidence‑driven AI training data.

Data Synthesis · SFT · agentic AI

0 likes · 10 min read

Machine Learning Algorithms & Natural Language Processing

May 17, 2026 · Artificial Intelligence

How to Build Agentic Factual SFT and Mid‑Train Datasets: Query Selection, Trajectory Generation, and Tool Usage

This article outlines a systematic approach for creating agentic factual SFT and Mid‑train data, covering the definition of training goals, query filtering, two‑layer classification and labeling, trajectory format, differences between Mid‑train and SFT, a practical synthesis pipeline, and common pitfalls to avoid.

Data Synthesis · SFT · agentic AI

0 likes · 11 min read

360 Tech Engineering

Mar 3, 2026 · Artificial Intelligence

How MMKG‑RDS Generates High‑Quality Multimodal Reasoning Data from Knowledge Graphs

The MMKG‑RDS framework introduced by 360 AI Lab creates a complete pipeline—from multimodal document parsing and knowledge‑graph construction to customizable task synthesis and multi‑dimensional quality assessment—enabling the production of high‑quality reasoning data that significantly boosts large‑model performance across diverse domains.

AI reasoning · Data Synthesis · Knowledge Graph

0 likes · 7 min read

Meituan Technology Team

Jan 23, 2026 · Artificial Intelligence

How EvoCUA Set a New Open‑Source SOTA for Computer‑Use Agents with Evolutionary Learning

EvoCUA, a native computer‑use agent from Meituan, combines a verifiable data‑synthesis engine, a sandbox infrastructure scaled to the tens of thousands, and an experience‑driven learning paradigm to overcome data‑scaling and feedback challenges, achieving a 56.7% success rate on the OSWorld benchmark and surpassing previous open‑source models.

AI Agent · Computer Use · Data Synthesis

0 likes · 27 min read

Bighead's Algorithm Notes

Oct 9, 2025 · Artificial Intelligence

Paper Review: TradingGroup – A Multi‑Agent Quantitative Trading System with Self‑Reflection and Data Synthesis

The paper introduces TradingGroup, a five‑agent LLM‑based quantitative trading framework that incorporates a self‑reflection mechanism, dynamic risk management, and an automated data‑synthesis pipeline, and demonstrates superior cumulative returns, Sharpe ratios, and lower drawdowns than rule‑based, ML, RL, and existing LLM strategies on five real‑world stock datasets.

Data Synthesis · LLM · financial AI

0 likes · 14 min read

DataFunTalk

Sep 29, 2025 · Artificial Intelligence

How Glint-MVT Powers City‑Scale Multimodal AI: Insights from a Tech VP

In an interview before the DACon conference, Dr. Feng Ziyong reveals how Glint‑MVT and novel data‑synthesis techniques overcome distribution gaps, improve compositional understanding, and enable billion‑scale retrieval within seconds for city‑level surveillance, while balancing model efficiency and effectiveness.

Data Synthesis · Embedding Retrieval · Multimodal AI

0 likes · 11 min read

DataFunSummit

Sep 8, 2025 · Artificial Intelligence

How High‑Quality Inference Data Is Powering the Next AI Revolution

This article explores how high‑quality inference data has become a new paradigm driving AI breakthroughs, detailing Ant Group's research on inference data paradigms, financial‑sector applications, intelligent labeling and quality inspection, and the AIGD AI data synthesis platform, followed by a technical Q&A.

AI data · AIGD · Data Synthesis

0 likes · 11 min read

Data Party THU

Aug 20, 2025 · Artificial Intelligence

How Large-Scale Corpus Rewriting is Shaping LLM Training: A Deep Dive into K2, WRAP, and Beyond

This article surveys recent large‑scale corpus rewriting techniques for LLM pre‑training, covering K2’s token‑utilization strategies, domain‑specific methods like SwallowMath/Code, reStructured pretraining, the WRAP pipeline, Nemotron‑CC filtering, Pro‑X noise removal, and the MAGA multi‑style expansion, while highlighting challenges, experimental findings, and open research questions.

Data Synthesis · LLM · corpus rewriting

0 likes · 20 min read

Instant Consumer Technology Team

Jul 9, 2025 · Artificial Intelligence

How Easy Dataset Automates High‑Quality LLM Fine‑Tuning Data from Unstructured Docs

The article introduces Easy Dataset, a GUI‑driven framework that transforms heterogeneous documents into high‑quality, persona‑driven fine‑tuning data for large language models, details its architecture, core contributions, experimental validation on financial QA, and compares it with existing data‑synthesis tools.

Artificial Intelligence · Data Synthesis · GUI

0 likes · 12 min read

DataFunSummit

Jun 6, 2025 · Artificial Intelligence

Automating High‑Quality NL2SQL Data Synthesis with Intermediate Representations

This work tackles the difficulty of incorporating extensive domain knowledge into in‑domain NL2SQL tasks by proposing an intermediate‑representation‑based data synthesis method that decouples knowledge compliance from SQL generation, enabling automated creation of high‑quality training data at 60× the efficiency of manual work and with over 97% accuracy.

Data Synthesis · Large Language Models · NL2SQL

0 likes · 2 min read

DevOps

May 18, 2025 · Artificial Intelligence

Why the Focus Has Shifted from AI Agents to Agentic Workflows

Although large language models have enabled AI agents that mimic human digital interactions, their commercial accuracy remains far below production standards, prompting the industry to pivot toward agentic workflows and data synthesis, which promise more reliable task automation, reasoning, and observable, auditable processes for knowledge work.

Data Synthesis · agentic workflows · knowledge work

0 likes · 6 min read

AI Algorithm Path

Mar 15, 2025 · Artificial Intelligence

Why the Industry Is Shifting From AI Agents to Agentic Workflows

The article explains that the low accuracy and security risks of current AI agents, evidenced by a Claude AI Agent achieving only 14% of human performance and an average success rate of about 20%, are driving a move toward agentic workflows, which offer observable, auditable, and data‑synthesizing pipelines that dramatically improve enterprise productivity.

AI agents · Automation · Data Synthesis

0 likes · 7 min read

DataFunSummit

Feb 10, 2025 · Artificial Intelligence

Intelligent Decision-Making Large Model ORLM: Research, Training Challenges, Commercialization, and Future Directions

This article presents the ORLM intelligent decision‑making large model, detailing how real‑world decision problems are formalized and solved, the training difficulties and data synthesis methods, the transition from academic research to commercial platforms, and future technical improvement plans.

AI · Data Synthesis · Decision Modeling

0 likes · 10 min read

Baobao Algorithm Notes

Dec 16, 2024 · Artificial Intelligence

What Do Leading Open‑Source LLMs Do After Pretraining? A Deep Dive into Post‑Training Strategies

This article surveys the post‑training pipelines of major open‑source large language models released in 2024, detailing their alignment algorithms, data synthesis, reward modeling, DPO/GRPO variants, long‑context handling, tool use, and model‑averaging techniques, and highlights emerging trends such as data‑centric pipelines and iterative weak‑to‑strong alignment.

AI research · Data Synthesis · LLM

0 likes · 99 min read

Kuaishou Tech

Jul 23, 2024 · Artificial Intelligence

Parrot: Enhancing Multi-Turn Instruction Following for Large Language Models

This paper introduces Parrot, a system that improves the multi-turn instruction-following capabilities of large language models (LLMs) through context-aware preference optimization (CaPO) and synthetic data generation, achieving significant performance improvements with limited training data.

CaPO · Data Synthesis · Large Language Models

0 likes · 9 min read

DataFunTalk

Aug 27, 2023 · Artificial Intelligence

AIGC and Causal Inference: Mutual Empowerment and Practical Applications

This article explores how generative AI (AIGC) can be used to synthesize structured data, how such synthetic data enhances causal inference tasks, and how agent‑based modeling and the YLearn framework together enable a two‑way synergy between AIGC and causal learning for enterprise AI solutions.

AIGC · Agent-based Modeling · Artificial Intelligence

0 likes · 15 min read

Shopee Tech Team

Nov 10, 2022 · Artificial Intelligence

ShopeeVideo OCR: Multi-language Text Recognition System for E-commerce Video

ShopeeVideo OCR is a multi‑language text‑recognition system for Southeast Asian e‑commerce videos that unifies detection, Transformer‑based recognition, layout analysis, and large‑scale synthetic data generation to handle Indonesian, Filipino, English, Vietnamese, Thai and Chinese scripts, delivering industry‑leading accuracy and winning thirteen ICDAR first‑place awards.

Data Synthesis · Deep Learning · Multi-language OCR

0 likes · 15 min read

Laiye Technology Team

Sep 28, 2022 · Artificial Intelligence

Checkbox Detection and State Classification Using YOLOv5

This article describes a comprehensive solution for detecting checkboxes in document images and determining their selected or unselected status by combining YOLOv5 object detection, synthetic and semi‑synthetic data generation, specialized post‑processing, and association logic to handle varied shapes, positions, and markings.

Data Synthesis · YOLOv5 · checkbox detection

0 likes · 13 min read

Alibaba Terminal Technology

Dec 15, 2021 · Artificial Intelligence

Unlock Real-Time Mobile OCR: Inside Ant’s xNN-OCR Engine and Its Tiny, Fast AI

Ant’s self‑developed xNN‑OCR demonstrates how advanced OCR can run offline on smartphones by combining GAN‑based data synthesis, lightweight ShuffleNet‑inspired detection, NAS‑optimized recognition, and aggressive model compression, delivering near‑real‑time accuracy for diverse mobile scenarios while preserving privacy and low cost.

Data Synthesis · NAS · edge AI

0 likes · 11 min read

Architects Research Society

Oct 2, 2016 · Artificial Intelligence

Key Takeaways from Andrew Ng’s Deep Learning Talk at the Bay Area Deep Learning School

The article summarizes Andrew Ng’s presentation at BADLS, highlighting major deep‑learning trends such as the rise of big data, end‑to‑end models, the bias‑variance tradeoff, human‑level performance benchmarks, and practical advice for improving one’s AI skills.

AI trends · Data Synthesis · End-to-End

0 likes · 10 min read
