Step‑by‑Step EasyDataset Workflow for Building High‑Quality LLM Training Data

This guide walks readers through installing EasyDataset, creating a project, uploading documents, choosing appropriate chunking strategies, cleaning the data, generating domain tag trees, and exporting a polished pre‑training dataset, with concrete examples, configuration details, and practical recommendations for each step.


EasyDataset Installation

EasyDataset provides a client installer for Windows, macOS and Linux. Download the appropriate package from the GitHub Releases page, run the .exe (Windows) or the equivalent installer for other OSes, and follow the wizard to complete installation.

Project Preparation

Create a new project (e.g., test).

Configure a large‑model endpoint that will be used for tagging, question generation and cleaning. Supported options include locally deployed models, server‑deployed models, and external API services such as the DeepSeek API.

Enter the project, select the configured model, and start data‑processing tasks.
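Before kicking off tasks, it is worth confirming that the configured endpoint actually responds. If you use the DeepSeek API, it is OpenAI-compatible; below is a minimal connectivity-check sketch using the openai Python package. The base URL and model name follow DeepSeek's public documentation, and the key placeholder is yours to fill in.

```python
# Minimal sketch: verify the DeepSeek endpoint EasyDataset will call.
# Assumes the openai Python package (>= 1.0); DeepSeek exposes an
# OpenAI-compatible API at the base URL below.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com",  # DeepSeek's OpenAI-compatible endpoint
    api_key="sk-...",                     # your DeepSeek API key
)

resp = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
)
print(resp.choices[0].message.content)
```

If this prints a sensible reply, the same base URL, key, and model name can be entered into EasyDataset's model configuration.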

Document Upload

EasyDataset accepts Markdown, PDF, DOCX and TXT files. For PDFs, three parsers are available: a basic parser, MinerU (an OCR model from Shanghai AI Lab), and a custom visual model. The author recommends converting all source files to Markdown with a dedicated OCR tool before import. Recommended tools by scenario:

Long documents (e.g., a 158‑page contract) – use DeepSeek‑OCR (reported 89.5% annotation accuracy).

Academic papers – MinerU preserves layout better.

Formula‑heavy papers – combine MonkeyOCR (92.1% formula accuracy) with DeepSeek‑OCR for semantic understanding.

Edge‑computing or mobile scenarios – MonkeyOCR for low memory usage; PaddleOCR lightweight version for fast ID‑card recognition.
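Whichever tool you pick, the end goal is one clean Markdown file per source document. For digital (non-scanned) PDFs, a plain-text baseline without any OCR can be sketched with pdfplumber; this only illustrates the convert-before-import step and is not a substitute for the OCR tools above, since image-only pages come out empty.

```python
# Illustrative baseline only: extract text from a digital (non-scanned) PDF
# with pdfplumber and save it as Markdown for import into EasyDataset.
# Scanned documents need one of the OCR tools discussed above.
import pdfplumber

def pdf_to_markdown(pdf_path: str, md_path: str) -> None:
    parts = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""  # empty for image-only pages
            parts.append(text)
    with open(md_path, "w", encoding="utf-8") as f:
        f.write("\n\n".join(parts))

pdf_to_markdown("contract.pdf", "contract.md")
```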

Text Chunking

Why Chunk?

Normalizes varying document lengths.

Works around model input‑length limits.

Improves semantic representation by avoiding overly long contexts.

Enhances retrieval precision.

Optimizes memory usage and enables parallel processing.

Chunking Strategies

Fixed‑length chunking (token‑based or character‑based): simple and fast but may cut sentences.

Recursive text‑structure chunking: respects the paragraph → sentence → word hierarchy; better flow but higher computational cost.

Markdown‑structure chunking: leverages headings (#, ##, ###) to keep semantic units; recommended as the default for well‑structured docs.

Code chunking: parses code blocks by language to preserve syntax and logical units.

Visual custom chunking: manual selection in a preview UI for precise control, suitable for special layouts.
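To make the Markdown strategy concrete, here is a minimal sketch of heading-based splitting. EasyDataset performs this internally, so the code illustrates the idea rather than the tool's actual implementation.

```python
# Minimal sketch of Markdown-structure chunking: split on #, ##, ### headings
# so each chunk is one semantic unit (a heading plus its body).
# Illustrates the idea; not EasyDataset's internal code.
import re

def split_markdown(text: str, max_level: int = 3) -> list[str]:
    pattern = rf"^(#{{1,{max_level}}})\s"  # heading of level 1..max_level
    chunks, current = [], []
    for line in text.splitlines():
        if re.match(pattern, line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]

doc = "# Intro\nOverview text.\n## Setup\nSteps here.\n## Usage\nMore text."
for chunk in split_markdown(doc):
    print(repr(chunk))
```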

Strategy Comparison

Fixed‑length chunking – suitable for news articles, blog posts, batch processing; advantage: implementation simplicity and speed; caution: may break sentences, requires overlap configuration.

Recursive text‑structure chunking – suitable for ordinary reports and technical documents; advantage: preserves natural language flow; caution: slightly higher computational cost.

Markdown‑structure chunking – suitable for well‑structured manuals, tutorials, book chapters; advantage: strong semantic integrity and alignment with document hierarchy; caution: depends on proper Markdown formatting.

Code chunking – suitable for programming tutorials and code‑heavy documents; advantage: keeps code syntax intact; caution: requires specifying the programming language and may need combination with other strategies for mixed content.

Visual custom chunking – suitable for special‑layout documents requiring fine‑grained control; advantage: highest precision through manual control; caution: time‑consuming, best for final tuning.

Configuration Details

For fixed‑length chunking, the UI exposes three parameters:

separator: default \n\n (paragraph split); can be changed to \n or other delimiters.

chunkSize: maximum number of characters per chunk; smaller values yield finer granularity, larger values retain more context.

chunkOverlap: number of overlapping characters between adjacent chunks to preserve continuity.
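These three parameters map onto a simple sliding-window splitter. The character-based sketch below uses the same parameter names as the UI to show how they interact; it is an illustration, not EasyDataset's internal code.

```python
# Sketch of fixed-length chunking with the three UI parameters:
# split on `separator`, pack pieces into windows of at most `chunk_size`
# characters, and repeat the last `chunk_overlap` characters of each
# finished chunk at the start of the next to preserve continuity.
def fixed_length_chunks(text: str, separator: str = "\n\n",
                        chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    pieces = text.split(separator)
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = current + separator + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            # carry the tail of the finished chunk into the next one
            current = current[-chunk_overlap:] + separator + piece
        else:
            # a single piece longer than chunk_size is kept whole for simplicity
            current = piece
    if current:
        chunks.append(current)
    return chunks

for c in fixed_length_chunks("A" * 300 + "\n\n" + "B" * 300):
    print(len(c))
```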

Data Cleaning

After chunking, raw blocks often contain ads, duplicates or irrelevant content that degrade model learning. For large‑scale data the author suggests writing Python scripts with rule‑based filters, optionally supplemented by LLM‑driven prompts that automatically label and discard low‑quality blocks. EasyDataset also offers a built‑in cleaning task that calls the configured model to remove noisy text.
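A rule-based pass of the kind suggested might look like the sketch below. The specific rules (ad keywords, minimum length, exact-duplicate removal) are placeholders to adapt to your own corpus.

```python
# Example rule-based cleaning pass over chunked text blocks.
# The keywords and length threshold are illustrative; tune them per corpus.
AD_KEYWORDS = ("click here", "subscribe now", "limited-time offer")

def clean_blocks(blocks: list[str], min_chars: int = 50) -> list[str]:
    seen: set[str] = set()
    kept = []
    for block in blocks:
        text = block.strip()
        if len(text) < min_chars:                          # too short to be useful
            continue
        if any(kw in text.lower() for kw in AD_KEYWORDS):  # likely an ad
            continue
        if text in seen:                                   # exact duplicate
            continue
        seen.add(text)
        kept.append(text)
    return kept
```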

Domain Tagging

Why Build a Tag Tree?

Re‑establishes contextual relationships lost after chunking by assigning each fragment an “identity”.

Facilitates data‑balancing strategies such as up‑sampling under‑represented domains or down‑sampling over‑represented ones.

The automatically generated hierarchical tag tree (e.g., technology → sub‑technology) can be viewed under the “Domain Analysis” tab, compared with the original document outline, and manually edited if needed.
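Once every block carries a tag path, balancing reduces to counting. The sketch below up-samples under-represented top-level domains by random duplication, the crudest workable strategy; the slash-separated tag path is an assumed representation of the tag tree, used here only for illustration.

```python
# Sketch of domain balancing over tagged blocks. Each record is
# (tag_path, text), with "/" assumed as the tag-tree separator.
# Domains below `target` are up-sampled by random duplication.
import random
from collections import defaultdict

def balance(records: list[tuple[str, str]], target: int) -> list[tuple[str, str]]:
    by_domain: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for tag_path, text in records:
        top = tag_path.split("/")[0]   # top level of the tag tree
        by_domain[top].append((tag_path, text))
    balanced = []
    for domain, items in by_domain.items():
        out = list(items)
        while len(out) < target:       # up-sample by random duplication
            out.append(random.choice(items))
        balanced.extend(out)
    return balanced
```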

From Chunks to Pre‑training Dataset

Chunked, cleaned, and tagged text blocks constitute the raw material of a pre‑training dataset; their quality directly determines how well the model learns context dependencies. Export options include Alpaca, ShareGPT, and custom field mappings that preserve tags and chain‑of‑thought (CoT) answers. Post‑processing with custom scripts can further refine the dataset before large‑model pre‑training.
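Alpaca records are plain JSON objects with instruction/input/output fields. Below is a sketch of mapping cleaned question–answer pairs into that format, with a tag field added via the kind of custom mapping mentioned above; the input pair structure is assumed for illustration.

```python
# Sketch: map cleaned question/answer pairs into Alpaca-format JSON.
# Standard Alpaca has only instruction/input/output; the "tag" field is a
# custom addition of the kind EasyDataset's field mapping supports.
import json

pairs = [
    {"question": "What is text chunking?",
     "answer": "Splitting documents into smaller semantic units ...",
     "tag": "technology/nlp"},
]

records = [
    {"instruction": p["question"],
     "input": "",
     "output": p["answer"],
     "tag": p["tag"]}
    for p in pairs
]

with open("dataset_alpaca.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```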

Summary

The workflow demonstrates end‑to‑end dataset construction with EasyDataset: installation, project setup, document upload, selection of an appropriate chunking strategy, data cleaning, automatic domain‑tag generation, and export to formats ready for large‑model pre‑training or SFT fine‑tuning.
