EasyDataset: End-to-End Guide for Generating QA Datasets for LLM Fine‑Tuning
This article walks through the complete workflow of using EasyDataset to create high‑quality question‑answer pairs for supervised fine‑tuning, covering question generation (single and batch), three generation algorithms, answer generation (including chain‑of‑thought and multi‑turn dialogue), a hybrid quality‑assessment pipeline, and export to Alpaca or ShareGPT formats.
1. Question Generation
EasyDataset extracts questions from pre‑segmented text blocks. The number of questions per block is controlled by a question‑density setting (labeled "Maximum question length" in the UI), which defaults to roughly one question per 240 characters of text. In the example, a 3,573‑character block produced 14 questions. Generated questions appear above the block and can also be viewed in the "Questions" module, which lists each question, its source block, and automatically assigned domain‑tree tags.
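As a back‑of‑the‑envelope check of that density setting (the exact rounding EasyDataset applies is an assumption here), the expected question count can be estimated directly:

```python
# Estimate questions per block under the default density of one
# question per 240 characters. Floor division is an assumption;
# EasyDataset's exact rounding rule may differ.
block_chars = 3573          # size of the example block
chars_per_question = 240    # default density from the project settings
print(block_chars // chars_per_question)  # -> 14, matching the example
```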
1.1 Basic Operations
Single‑block generation: Click the question‑generation button on a block to create questions for that block.
Batch generation: Select multiple blocks and trigger generation, with a progress view.
Question management: The "Questions" module offers a list view (filter by name or tag, edit or add custom questions) and a tree view that groups questions by domain‑tree hierarchy.
1.2 Generation Algorithms
Prompt‑engineering method: Directly instruct a large model (e.g., "Generate a question based on the following paragraph"). EasyDataset's prompt templates expose {{text}} and {{question}} placeholders; users can edit the templates but must keep those variables intact (a minimal sketch follows this list).
Knowledge‑enhanced method: Store all text blocks in a knowledge base, retrieve related passages using domain tags, and feed multiple passages to the model. A further variant combines retrieved passages with knowledge‑graph triples (e.g., "Liu Cixin → birthplace → Shanxi → capital → Taiyuan") to generate reasoning‑heavy questions.
Data‑augmentation method: Use a small set of manually curated seed questions, generate new questions via prompting, then apply back‑translation, synonym rewriting, or context augmentation to increase diversity.
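A minimal sketch of the prompt‑engineering method, assuming an OpenAI‑compatible endpoint; the client setup, model name, and prompt wording are illustrative rather than EasyDataset's exact internals, but the {{text}} placeholder is filled the same way the tool's templates describe:

```python
from openai import OpenAI

# Hypothetical OpenAI-compatible client; base_url and key are placeholders.
client = OpenAI(base_url="https://api.deepseek.com", api_key="sk-...")

TEMPLATE = (
    "Generate questions based on the following paragraph, one per line, "
    "covering its key facts.\n\nParagraph:\n{{text}}"
)

def generate_questions(block_text: str) -> list[str]:
    # Fill the {{text}} placeholder, mirroring EasyDataset's template variables.
    prompt = TEMPLATE.replace("{{text}}", block_text)
    resp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": prompt}],
    )
    # One question per line, as requested in the prompt.
    return [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]
```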
2. Answer Generation
After questions are created, EasyDataset can generate answers in several modes.
2.1 Basic Operations
Single‑question answer: Click the magic‑wand icon next to a question; the tool can generate multiple answers. Using the deepseek-chat model yields plain answers, while deepseek-reasoner adds a chain‑of‑thought (CoT) trace (a sketch of capturing the trace follows this list).
Batch answer generation: Select multiple or all questions and launch a batch job with real‑time progress.
Multi‑turn dialogue construction: Configure system prompts, dialogue scenario, number of turns, and roles in the project settings, then generate multi‑turn conversations for a single question or in bulk.
Dataset management: All generated single‑turn or multi‑turn records appear in the "Dataset List" module, showing question, creation time, model used, domain tags, CoT flag, and answer summary. Users can inspect details, edit answers or CoT, and trace back to the original text block.
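A sketch of how the CoT trace can be captured when calling deepseek-reasoner through an OpenAI‑compatible client; DeepSeek's API exposes the trace as reasoning_content, though how EasyDataset stores it internally is assumed here:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="sk-...")  # placeholder key

def answer_with_cot(question: str, context: str) -> dict:
    resp = client.chat.completions.create(
        model="deepseek-reasoner",
        messages=[{
            "role": "user",
            "content": f"Based on the following document, briefly answer: {question}\n\n{context}",
        }],
    )
    msg = resp.choices[0].message
    return {
        "answer": msg.content,         # the plain answer
        "cot": msg.reasoning_content,  # the chain-of-thought trace
    }
```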
2.2 Answer Generation Algorithms
Prompt‑based generation: Craft a prompt such as "Based on the following document, briefly answer: {question}". EasyDataset lets users view and edit this prompt.
Retrieval‑augmented generation: Retrieve relevant document fragments from a vector store or search engine, combine them with the question, and feed both to the model to reduce hallucinations and capture cross‑block context (a minimal sketch follows this list).
Knowledge‑graph‑enhanced generation: Encode the question, retrieve relevant entities and relations from a knowledge graph (e.g., with graph‑neural‑network encoders), and generate answers that require multi‑hop reasoning.
Tool‑calling generation: Use function calling to invoke external tools. Examples include a code interpreter that runs Python for mathematical or data‑analysis queries, and API calls that fetch real‑time information such as weather or stock prices.
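A minimal retrieval‑augmented sketch: embed the blocks, take the top‑k most similar to the question, and condition the answer on them. The embedding model, chat model, and prompt are assumptions; any vector store could replace the in‑memory list:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes one endpoint serving both embeddings and chat

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def rag_answer(question: str, chunks: list[str], k: int = 3) -> str:
    doc_vecs, q_vec = embed(chunks), embed([question])[0]
    # Cosine similarity between the question and every chunk.
    sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
    # Keep the k most similar chunks as grounding context.
    context = "\n\n".join(chunks[i] for i in np.argsort(sims)[-k:][::-1])
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```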
3. Dataset Quality Evaluation
Generated QA pairs often contain noise, hallucinations, or incompleteness, so a mixed evaluation strategy is required.
Automated pre‑evaluation: Run statistical analyses, a lightweight BERT‑based classifier, and deduplication to filter obviously bad samples (sketched after this list).
Stratified manual sampling: Based on automated scores (e.g., low confidence or long‑tail distribution), select representative subsets for detailed human review, recording error types such as factual errors or logical inconsistencies.
Feedback loop: Categorize discovered errors; adjust prompt templates, text‑chunking strategies, or domain‑tag assignments; and regenerate data to improve quality.
Continuous monitoring: During model training, watch loss and validation performance to indirectly assess dataset quality, and trigger further analysis if degradation appears.
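A sketch of the automated pre‑evaluation step: cheap statistical filters plus exact‑duplicate removal. The thresholds are illustrative, and the BERT‑based classifier mentioned above would slot in where the keep/drop decision is made:

```python
import hashlib

def preclean(pairs: list[dict]) -> list[dict]:
    """Drop degenerate answers and duplicate questions before human review."""
    seen, kept = set(), []
    for p in pairs:
        q, a = p["question"].strip(), p["answer"].strip()
        # Statistical filter: discard empty or suspiciously short answers.
        if len(a) < 10:
            continue
        # Exact-duplicate detection via a whitespace-normalized hash of the question.
        key = hashlib.md5(" ".join(q.lower().split()).encode()).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        kept.append(p)
    return kept
```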
4. Exporting Datasets
Exporting to a standard format ensures compatibility, reproducibility, and clear semantics for downstream training.
4.1 Alpaca format (single‑turn instruction tuning)
instruction: The task description.
input (optional): Additional context for the task.
output: The expected response.
EasyDataset produces JSON lines matching this schema, suitable for frameworks such as LLaMA‑Factory or MS‑SWIFT.
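An illustrative Alpaca‑format record (the content here is made up for demonstration; the three field names are the standard schema):

```json
{
  "instruction": "Summarize the default question-generation density in EasyDataset.",
  "input": "",
  "output": "By default, EasyDataset generates roughly one question per 240 characters of source text."
}
```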
4.2 ShareGPT format (multi‑turn dialogue)
The format stores a list of conversations or messages, each entry containing a role (e.g., human, assistant) and the utterance text, preserving the full dialogue history for chat‑bot training.
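An illustrative ShareGPT‑style record; field names vary across tools (some exports use role/content under a messages key), and this sample follows the common conversations/from/value layout:

```json
{
  "conversations": [
    { "from": "human", "value": "What formats can EasyDataset export to?" },
    { "from": "gpt", "value": "It exports single-turn data in Alpaca format and multi-turn dialogues in ShareGPT format." },
    { "from": "human", "value": "Which one should I use for chat fine-tuning?" },
    { "from": "gpt", "value": "Use ShareGPT when you need the full multi-turn dialogue history." }
  ]
}
```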
5. Conclusion
The guide demonstrates how EasyDataset supports a complete QA dataset pipeline: question creation (with three algorithmic options), answer generation (including CoT and multi‑turn dialogue), hybrid quality assessment, and export in Alpaca or ShareGPT formats, providing high‑quality data for LLM supervised fine‑tuning.