Tagged articles
19 articles
Machine Learning Algorithms & Natural Language Processing
May 17, 2026 · Artificial Intelligence

How to Build Agentic Factual SFT and Mid‑Train Datasets: Query Selection, Trajectory Generation, and Tool Usage

This article outlines a systematic approach for creating agentic factual SFT and Mid‑train data, covering the definition of training goals, query filtering, two‑layer classification and labeling, trajectory format, differences between Mid‑train and SFT, a practical synthesis pipeline, and common pitfalls to avoid.

Agentic AI · SFT · data synthesis
11 min read
360 Tech Engineering
Mar 3, 2026 · Artificial Intelligence

How MMKG‑RDS Generates High‑Quality Multimodal Reasoning Data from Knowledge Graphs

The MMKG‑RDS framework introduced by 360 AI Lab creates a complete pipeline—from multimodal document parsing and knowledge‑graph construction to customizable task synthesis and multi‑dimensional quality assessment—enabling the production of high‑quality reasoning data that significantly boosts large‑model performance across diverse domains.

AI reasoning · Knowledge Graph · data synthesis
7 min read
Meituan Technology Team
Jan 23, 2026 · Artificial Intelligence

How EvoCUA Set a New Open‑Source SOTA for Computer‑Use Agents with Evolutionary Learning

EvoCUA, a native computer‑use agent from Meituan, combines a verifiable data‑synthesis engine, a ten‑thousand‑level sandbox infrastructure, and an experience‑driven learning paradigm to overcome data‑scaling and feedback challenges, achieving a 56.7% success rate on the OSWorld benchmark and surpassing previous open‑source models.

AI Agent · Computer Use · OSWorld
27 min read
Bighead's Algorithm Notes
Oct 9, 2025 · Artificial Intelligence

Paper Review: TradingGroup – A Multi‑Agent Quantitative Trading System with Self‑Reflection and Data Synthesis

The paper introduces TradingGroup, a five‑agent LLM‑based quantitative trading framework that incorporates a self‑reflection mechanism, dynamic risk management, and an automated data‑synthesis pipeline, and demonstrates superior cumulative returns, Sharpe ratios, and lower drawdowns than rule‑based, ML, RL, and existing LLM strategies on five real‑world stock datasets.

Financial AI · LLM · Multi-Agent System
14 min read
DataFunTalk
Sep 29, 2025 · Artificial Intelligence

How Glint-MVT Powers City‑Scale Multimodal AI: Insights from a Tech VP

In an interview before the DACon conference, Dr. Feng Ziyong explains how Glint‑MVT and novel data‑synthesis techniques overcome distribution gaps, improve compositional understanding, and enable billion‑scale retrieval with second‑level latency for city‑level surveillance, while balancing model efficiency and effectiveness.

Embedding Retrieval · Multimodal AI · city surveillance
11 min read
DataFunSummit
Sep 8, 2025 · Artificial Intelligence

How High‑Quality Inference Data Is Powering the Next AI Revolution

This article explores how high‑quality inference data has become a new paradigm driving AI breakthroughs, detailing Ant Group's research on inference data paradigms, financial‑sector applications, intelligent labeling and quality inspection, and the AIGD AI data synthesis platform, followed by a technical Q&A.

AI data · AIGD · Financial AI
11 min read
Data Party THU
Aug 20, 2025 · Artificial Intelligence

How Large-Scale Corpus Rewriting is Shaping LLM Training: A Deep Dive into K2, WRAP, and Beyond

This article surveys recent large‑scale corpus rewriting techniques for LLM pre‑training, covering K2’s token‑utilization strategies, domain‑specific methods like SwallowMath/Code, reStructured pretraining, the WRAP pipeline, Nemotron‑CC filtering, Pro‑X noise removal, and the MAGA multi‑style expansion, while highlighting challenges, experimental findings, and open research questions.

LLM · corpus rewriting · data synthesis
20 min read
Instant Consumer Technology Team
Jul 9, 2025 · Artificial Intelligence

How Easy Dataset Automates High‑Quality LLM Fine‑Tuning Data from Unstructured Docs

The article introduces Easy Dataset, a GUI‑driven framework that transforms heterogeneous documents into high‑quality, persona‑driven fine‑tuning data for large language models, details its architecture, core contributions, experimental validation on financial QA, and compares it with existing data‑synthesis tools.

Fine-tuning · GUI · LLM
12 min read
DataFunSummit
Jun 6, 2025 · Artificial Intelligence

Automating High‑Quality NL2SQL Data Synthesis with Intermediate Representations

This work tackles the difficulty of incorporating extensive domain knowledge into in‑domain NL2SQL tasks by proposing an intermediate‑representation‑based data synthesis method that decouples knowledge compliance from SQL generation, enabling automated creation of high‑quality training data at 60× the efficiency of manual work and with over 97% accuracy.

NL2SQL · SQL generation · data synthesis
2 min read
DevOps
May 18, 2025 · Artificial Intelligence

Why the Focus Has Shifted from AI Agents to Agentic Workflows

Although large language models have enabled AI agents that mimic human digital interactions, their commercial accuracy remains far below production standards, prompting the industry to pivot toward agentic workflows and data synthesis, which promise more reliable task automation, reasoning, and observable, auditable processes for knowledge work.

agentic workflows · data synthesis · knowledge work
6 min read
AI Algorithm Path
Mar 15, 2025 · Artificial Intelligence

Why the Industry Is Shifting From AI Agents to Agentic Workflows

The article explains that low accuracy and security risks of current AI agents—evidenced by a Claude AI Agent achieving only 14% of human performance and an average success rate of about 20%—are driving a move toward agentic workflows, which offer observable, auditable, and data‑synthesizing pipelines that dramatically improve enterprise productivity.

AI agents · Automation · LLM
7 min read
DataFunSummit
Feb 10, 2025 · Artificial Intelligence

Intelligent Decision-Making Large Model ORLM: Research, Training Challenges, Commercialization, and Future Directions

This article presents the ORLM intelligent decision‑making large model, detailing how real‑world decision problems are formalized and solved, the training difficulties and data synthesis methods, the transition from academic research to commercial platforms, and future technical improvement plans.

AI · Decision Modeling · Model Training
10 min read
Baobao Algorithm Notes
Dec 16, 2024 · Artificial Intelligence

What Do Leading Open‑Source LLMs Do After Pretraining? A Deep Dive into Post‑Training Strategies

This article surveys the post‑training pipelines of major open‑source large language models released in 2024, detailing their alignment algorithms, data synthesis, reward modeling, DPO/GRPO variants, long‑context handling, tool use, and model‑averaging techniques, and highlights emerging trends such as data‑centric pipelines and iterative weak‑to‑strong alignment.

AI research · Alignment · LLM
99 min read
Kuaishou Tech
Jul 23, 2024 · Artificial Intelligence

Parrot: Enhancing Multi-Turn Instruction Following for Large Language Models

This paper introduces Parrot, a system that enhances large language models' (LLMs) multi-turn instruction following capabilities through context-aware preference optimization (CaPO) and synthetic data generation, achieving significant performance improvements with limited training data.

CaPO · NLP · data synthesis
9 min read
DataFunTalk
Aug 27, 2023 · Artificial Intelligence

AIGC and Causal Inference: Mutual Empowerment and Practical Applications

This article explores how generative AI (AIGC) can be used to synthesize structured data, how such synthetic data enhances causal inference tasks, and how agent‑based modeling and the YLearn framework together enable a two‑way synergy between AIGC and causal learning for enterprise AI solutions.

AIGC · Agent-Based Modeling · YLearn
15 min read
Shopee Tech Team
Nov 10, 2022 · Artificial Intelligence

ShopeeVideo OCR: Multi-language Text Recognition System for E-commerce Video

ShopeeVideo OCR is a multi‑language text‑recognition system for Southeast Asian e‑commerce videos that unifies detection, Transformer‑based recognition, layout analysis, and large‑scale synthetic data generation to handle Indonesian, Filipino, English, Vietnamese, Thai and Chinese scripts, delivering industry‑leading accuracy and winning thirteen ICDAR first‑place awards.

Computer Vision · Deep Learning · Multi-language OCR
15 min read
Laiye Technology Team
Sep 28, 2022 · Artificial Intelligence

Checkbox Detection and State Classification Using YOLOv5

This article describes a comprehensive solution for detecting checkboxes in document images and determining their selected or unselected status by combining YOLOv5 object detection, synthetic and semi‑synthetic data generation, specialized post‑processing, and association logic to handle varied shapes, positions, and markings.

YOLOv5 · checkbox detection · data synthesis
13 min read
Alibaba Terminal Technology
Dec 15, 2021 · Artificial Intelligence

Unlock Real-Time Mobile OCR: Inside Ant’s xNN-OCR Engine and Its Tiny, Fast AI

Ant’s self‑developed xNN‑OCR demonstrates how advanced OCR can run offline on smartphones by combining GAN‑based data synthesis, lightweight ShuffleNet‑inspired detection, NAS‑optimized recognition, and aggressive model compression, delivering accurate, near‑real‑time recognition for diverse mobile scenarios while preserving privacy and keeping costs low.

NAS · data synthesis · edge AI
11 min read