Tagged articles

Text preprocessing

8 articles · Page 1 of 1
Lisa Notes
Jun 30, 2026 · Artificial Intelligence

NLP Study Notes: 4 Essential Steps for Preprocessing Chinese Text Corpora

This article walks through the four core steps of Chinese NLP corpus preparation—collecting data, cleaning it with regex and encoding detection, tokenizing using dictionary‑based or statistical methods such as jieba, HMM and CRF, and finally standardizing with stop‑word removal, vocabulary building and one‑hot encoding—while illustrating each step with concrete code snippets and practical examples.

CRF · Chinese · NLP
0 likes · 12 min read
Tech Musings
Feb 7, 2026 · Fundamentals

How to Clean and Convert a Chinese Poetry Dataset for RAG Projects

This guide explains how to clean a Chinese poetry corpus—removing special characters, filtering short entries, and converting traditional characters to simplified Chinese—using Python validation functions, batch file processing, and WSL‑based OpenCC conversion, then persisting the results as JSON.

RAG · Text preprocessing · data cleaning
0 likes · 12 min read
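As a rough illustration of two of the cleaning steps this summary mentions (removing special characters and filtering short entries), here is a minimal sketch. It is not the linked article's code: the function names, the set of characters kept, and the `min_chars` threshold are all assumptions made for this example, and the traditional-to-simplified conversion step is left out.

```python
import json
import re

# Assumption: keep CJK ideographs plus a handful of common Chinese
# punctuation marks, and treat every other character as "special".
_NOT_ALLOWED = re.compile(r"[^\u4e00-\u9fff，。？！、；：]")


def clean_poem(text, min_chars=5):
    """Strip disallowed characters; return None if the result is too short."""
    cleaned = _NOT_ALLOWED.sub("", text)
    return cleaned if len(cleaned) >= min_chars else None


def clean_corpus(poems, min_chars=5):
    """Clean every poem and drop the ones rejected by clean_poem."""
    results = []
    for poem in poems:
        cleaned = clean_poem(poem, min_chars)
        if cleaned is not None:
            results.append(cleaned)
    return results


def to_json(poems):
    """Serialize cleaned poems as JSON, keeping Chinese characters readable."""
    return json.dumps(poems, ensure_ascii=False)
```

Passing `ensure_ascii=False` keeps the Chinese text human-readable in the saved JSON instead of escaping it to `\uXXXX` sequences.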
JavaEdge
Mar 15, 2025 · Artificial Intelligence

Boost NLP Model Performance with n-gram Feature Engineering

This article explains why feature engineering is crucial for NLP tasks, introduces n‑gram enhancements, provides Python implementations for generating bi‑gram and higher‑order features, demonstrates dynamic padding for text length standardization, and offers practical deployment tips such as feature dimension control and monitoring.

Deep Learning · N-gram · NLP
0 likes · 7 min read
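For readers unfamiliar with the two ideas named in this summary, the sketch below shows one common way to generate n-gram features and to pad a batch of sequences. It is not taken from the linked article: the helper names `ngrams` and `pad_batch` are hypothetical, and "dynamic padding" is assumed here to mean padding each batch to the length of its own longest sequence.

```python
def ngrams(tokens, n=2):
    """Return the contiguous n-token tuples found in `tokens`."""
    if n < 1:
        raise ValueError("n must be >= 1")
    # A sequence shorter than n yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the longest length in this batch."""
    max_len = max((len(seq) for seq in sequences), default=0)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
```

Padding per batch rather than to a fixed global length avoids wasted computation on batches that contain only short texts.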
Code DAO
Dec 21, 2021 · Artificial Intelligence

Four Keras Techniques for Preprocessing Text for Deep Learning

This article explains four Keras utilities—text_to_word_sequence, hashing_trick, one_hot, and Tokenizer—showing how each converts raw text into token lists, hash indices, integer encodings, or document matrices, with code examples and sample outputs.

Keras · Text preprocessing · hashing_trick
0 likes · 6 min read
Yuewen Technology
Oct 15, 2021 · Artificial Intelligence

How Yuedu's TTS Platform Automates High‑Quality Audiobook Production

This article explains how Yuedu's TTS synthesis platform tackles the booming audiobook market by using AI‑driven text preprocessing, role graph construction, content structuring, emotion and effect recognition, and a streamlined post‑processing workflow to efficiently generate multi‑character, emotionally rich audiobooks at scale.

Emotion Recognition · NLP · TTS
0 likes · 13 min read
Yanxuan Tech Team
Apr 20, 2020 · Artificial Intelligence

How AI-Driven Clustering Boosts Smart Customer Service Knowledge Bases

This article outlines an AI-powered workflow for constructing and enriching a business knowledge base in intelligent customer service, covering preprocessing, intent detection, deep and shallow semantic feature engineering, hierarchical bucket clustering, and automated summary extraction to improve FAQ coverage and reduce manual workload.

AI · Clustering · Knowledge Base
0 likes · 15 min read