Tagged articles
127 articles
IT Services Circle
May 15, 2026 · Artificial Intelligence

Why Your Validation Set Fails: Outliers Are Skewing Your Data

The article explains how outliers can dramatically distort training and validation results in machine learning, outlines practical detection methods such as business rules, Z‑Score, IQR and Isolation Forest, and demonstrates cleaning techniques with a complete house‑price prediction case study in Python.

Isolation Forest · Python · data cleaning
0 likes · 19 min read
Smart Workplace Lab
May 10, 2026 · Artificial Intelligence

When Your Internal AI Is Fed Bad Data, How to Fix It?

The article recounts a real incident where an AI‑generated SOP cited outdated policy because a knowledge base was overloaded with unchecked historical documents, then outlines a step‑by‑step protocol—including corpus cleaning, version locking, and isolation zones—to prevent data contamination and ensure reliable AI outputs.

AI · Data Governance · Knowledge Base
0 likes · 7 min read
AI Architect Hub
Apr 24, 2026 · Artificial Intelligence

RAG Level 1: Avoid Dirty Data Poisoning Your AI – A Data Cleaning Guide

This article explains why noisy documents cripple Retrieval‑Augmented Generation, enumerates common garbage data types, describes three typical data‑quality problems, warns against over‑cleaning, encoding, and regex pitfalls, and provides a configurable LangChain pipeline with deduplication and validation best practices.

AI · Embedding · LangChain
0 likes · 21 min read
AI Architect Hub
Apr 19, 2026 · Artificial Intelligence

Mastering RAG: From Data Cleaning to Vector DBs in AI Applications

This article introduces the second stage of a large‑model application series, detailing the value of Retrieval‑Augmented Generation (RAG), its architecture, and a step‑by‑step outline covering data cleaning, text chunking, vectorization, vector‑DB selection, recall strategies, reranking, and prompt construction.

AI · LLM · RAG
0 likes · 4 min read
Fun with Large Models
Feb 27, 2026 · Artificial Intelligence

Step‑by‑Step EasyDataset Workflow for Building High‑Quality LLM Training Data

This guide walks readers through installing EasyDataset, creating a project, uploading documents, choosing appropriate chunking strategies, cleaning the data, generating domain tag trees, and exporting a polished pre‑training dataset, with concrete examples, configuration screenshots, and practical recommendations for each step.

AI model · EasyDataset · LLM data preparation
0 likes · 20 min read
Tech Musings
Feb 7, 2026 · Fundamentals

How to Clean and Convert a Chinese Poetry Dataset for RAG Projects

This guide explains how to clean a Chinese poetry corpus—removing special characters, filtering short entries, and converting traditional characters to simplified Chinese—using Python validation functions, batch file processing, and WSL‑based OpenCC conversion, then persisting the results as JSON.

JSON · RAG · data cleaning
0 likes · 12 min read
php Courses
Jan 15, 2026 · Backend Development

Master PHP’s trim(): Clean Up Strings Efficiently

This guide explains why unwanted whitespace harms data quality, introduces PHP's trim() function and its variants, shows practical code examples for cleaning user input, CSV files, and API parameters, and offers performance tips and best‑practice recommendations for robust backend development.

.trim · PHP · String Manipulation
0 likes · 9 min read
IT Services Circle
Oct 15, 2025 · Fundamentals

Master Fuzzy String Matching in Python with fuzzywuzzy: A Practical Guide

Learn how to efficiently clean and deduplicate textual data using Python's fuzzywuzzy library, covering Levenshtein distance fundamentals, installation, three core matching functions, advanced process extraction, and real-world code examples for handling messy Chinese strings and standardizing company names.

Levenshtein · Python · data cleaning
0 likes · 7 min read
DataFunSummit
Sep 22, 2025 · Artificial Intelligence

Explore Cutting-Edge AI-Driven Data Governance: A Comprehensive Resource Guide

This article presents a curated list of cutting‑edge topics covering AI‑powered data governance, large‑model applications, intelligent operations, and advanced analytics, offering readers a concise overview of emerging practices and case studies from industry leaders.

AI · Analytics · Intelligent Operations
0 likes · 2 min read
DataFunTalk
Sep 15, 2025 · Artificial Intelligence

Unlocking the Future: AI-Driven Data Governance and Large Model Innovations

This article presents a curated catalog of cutting‑edge topics covering AI‑powered data governance, large‑model applications, data cleaning, compliance, lakehouse integration, intelligent operations, and generative analytics, inviting readers to explore the latest innovations and download the full e‑book via QR code.

AI · Analytics · Data Governance
0 likes · 2 min read
Python Crawling & Data Mining
Aug 26, 2025 · Fundamentals

Mastering Python Web Scraping: Clean Price Extraction with XPath Tricks

This article walks through a Python web‑scraping problem, demonstrates why the original XPath extraction returns noisy or missing price data, and provides multiple refined code solutions—including filtering empty entries, correcting XPath selectors, and using map‑filter techniques—to produce clean, formatted price lists.

Python · data cleaning · lxml
0 likes · 6 min read
Alibaba Cloud Observability
Jul 14, 2025 · Cloud Native

How to Tame Massive New‑Energy‑Vehicle Logs with Cloud‑Native SLS Data Cleaning

This article explains the five major challenges of handling heterogeneous, weakly‑structured log data in the new‑energy‑vehicle ecosystem and demonstrates how Alibaba Cloud's Log Service (SLS) provides cloud‑native, real‑time data cleaning, cross‑region aggregation, cost‑optimized storage, and visual lineage to enable comprehensive operational insight and intelligent management.

Log Processing · SLS · data cleaning
0 likes · 18 min read
G7 EasyFlow Tech Circle
May 9, 2025 · Artificial Intelligence

How LLMs + Python Are Redefining Data Analysis: A Practical Guide

This article explains how large language models combined with Python's data‑science ecosystem can automate metadata extraction, data cleaning, and analysis tasks—illustrated with a step‑by‑step Titanic passenger dataset case study, complete prompts, code snippets, and best‑practice recommendations.

LLM · Python · data analysis
0 likes · 18 min read
php Courses
May 7, 2025 · Fundamentals

Comprehensive Guide to Pandas Data Processing in Python

This tutorial provides a detailed overview of Pandas, covering its core data structures, data import/export, selection, cleaning, aggregation, merging, and a practical sales analysis example, with complete code snippets for each operation.

data aggregation · data cleaning · data-analysis
0 likes · 8 min read
Big Data Tech Team
Apr 21, 2025 · Industry Insights

8 Practical Ways DeepSeek Boosts Data Quality for Better Governance

This guide outlines eight concrete methods DeepSeek uses to improve data quality—including automated cleaning, validation, classification, monitoring, governance standards, anomaly detection, integration, and intelligent analysis—providing actionable steps for organizations to enhance data accuracy, completeness, consistency, and usability.

Data Integration · Data Quality · DeepSeek
0 likes · 5 min read
Sohu Tech Products
Apr 16, 2025 · Artificial Intelligence

Comprehensive Guide to Building AI Datasets: From Source Collection to Data Augmentation and Validation

This guide walks readers through every stage of building high‑quality AI training datasets—from locating open‑source data and defining goals, through collection, annotation, cleaning, large‑scale processing, optional augmentation, and splitting, to validation—using a medical QA example for fine‑tuning DeepSeek‑R1.

AI fine-tuning · Python · data augmentation
0 likes · 18 min read
DaTaobao Tech
Mar 31, 2025 · Artificial Intelligence

AI Audio Generation and Voice Synthesis Practices at Taobao

The article surveys Taobao’s AI‑generated audio pipeline, detailing eight technical papers on image‑to‑video, OpenAI o1, multimodal video, and large‑model voice synthesis, while highlighting advances like VALL‑E, CosyVoice, F5‑TTS, data‑cleaning methods, and e‑commerce applications such as voice‑cloned live streams, multilingual TTS, AI video‑audio integration, and audiobook production.

AI audio · TTS · data cleaning
0 likes · 11 min read
Alibaba Cloud Native
Mar 25, 2025 · Cloud Native

Shift Data Cleaning Server‑Side with SPL: Boost Real‑Time Log Processing

Alibaba Cloud Log Service’s new SPL‑based rule consumption lets users move complex data‑cleaning logic from client code to the server, offering low‑code configuration, high performance, precise filtering, and significant reductions in latency, bandwidth, and compute resources across typical scenarios such as Python SDK processing and Flink integration.

Log Service · Performance · Real-time Processing
0 likes · 11 min read
Test Development Learning Exchange
Feb 11, 2025 · Fundamentals

Master Data Cleaning in Pandas: 20 Essential Scripts for E‑Commerce Sales Analysis

This guide walks you through creating a sample e‑commerce sales Excel file with Pandas and then demonstrates twenty practical data‑cleaning and transformation scripts—including handling missing values, renaming columns, filtering, grouping, and exporting—so you can efficiently prepare sales data for analysis.

Excel · data analysis · data cleaning
0 likes · 9 min read
Java Web Project
Feb 5, 2025 · Big Data

Master DeepSeek: Install, Configure, and Harness Its Data Processing Power

This guide walks you through DeepSeek’s core capabilities—including installation on Windows, macOS, and Linux, configuration of storage paths, API keys, and logging levels, as well as data import, cleaning, analysis, visualization, batch processing, scheduling, and plugin extensions—providing concrete command examples and troubleshooting tips.

DeepSeek · automation · command-line
0 likes · 8 min read
Python Crawling & Data Mining
Jan 28, 2025 · Fundamentals

Master Pandas: From Data Import to Advanced Manipulation in Python

This tutorial walks you through pandas fundamentals—including reading CSV/Excel files, creating Series and DataFrames, performing basic operations, cleaning data, using loc/iloc indexing, grouping, concatenating, merging, and handling time series—providing code examples and visual outputs for each step.

Time Series · data cleaning · groupby
0 likes · 14 min read
Baobao Algorithm Notes
Nov 14, 2024 · Artificial Intelligence

How OpenCoder’s RefineCode Dataset Powers Next‑Gen Code LLMs

The OpenCoder technical report details the creation of the RefineCode dataset, its multi‑stage preprocessing, filtering, and sampling pipelines, the pre‑training and fine‑tuning schedules for 1.5B and 8B models, and the autonomous data selection methods that together achieve performance comparable to Qwen2.5‑Coder.

Artificial Intelligence · AutoDS · Code LLM
0 likes · 18 min read
Baobao Algorithm Notes
Sep 24, 2024 · Artificial Intelligence

From Zero to One: A Practical Guide to Pretraining Large Language Models

This comprehensive guide walks you through every stage of LLM pretraining—from data sourcing, cleaning, and deduplication to tokenizer design, model architecture choices, training framework selection, optimization tricks, and evaluation methods—highlighting common pitfalls and practical solutions for building robust models.

Curriculum Learning · LLM Pretraining · Tokenizer
0 likes · 34 min read
Python Crawling & Data Mining
Aug 15, 2024 · Fundamentals

Master Pandas: Essential Data Manipulation Techniques for Beginners

This comprehensive tutorial walks you through pandas basics, including reading CSV and Excel files, creating Series and DataFrames, performing data inspection, cleaning, indexing, hierarchical indexing, time‑series handling, grouping, aggregation, concatenation, merging, and practical code examples with visual outputs.

Time Series · data cleaning · groupby
0 likes · 12 min read
Test Development Learning Exchange
Jul 14, 2024 · Fundamentals

Using pandas fillna() to Handle Missing Data: 10 Practical Examples

This article introduces pandas' fillna() method and demonstrates ten practical examples—including basic filling, column‑specific values, forward/backward filling, limiting fills, using other DataFrames, functions, conditional fills, dictionaries, and Series—to help developers effectively handle missing data in Python data analysis.

Python · data cleaning · fillna
0 likes · 6 min read
Ops Development & AI Practice
Jul 8, 2024 · Artificial Intelligence

Essential Denoising Techniques for Training Large AI Models

This article outlines key denoising methods—including data cleaning, augmentation, regularization, adversarial training, and self‑supervised learning—that improve the performance, generalization, and robustness of large neural network and transformer models.

Denoising · Regularization · adversarial training
0 likes · 5 min read
Test Development Learning Exchange
May 21, 2024 · Artificial Intelligence

Step-by-Step Data Analysis and Machine Learning Workflow with Pandas, Matplotlib, and Scikit-learn

This guide walks through loading CSV data with pandas, cleaning missing values, filtering, grouping, visualizing, performing correlation and time‑series analysis, detecting outliers, and applying linear and logistic regression models using scikit‑learn, all illustrated with complete Python code snippets.

data cleaning · machine learning · pandas
0 likes · 6 min read
Python Programming Learning Circle
Apr 28, 2024 · Fundamentals

Data Cleaning Techniques in Python: 21 Practical Examples and Code

This tutorial explains data cleaning concepts, key quality dimensions, and demonstrates 21 practical Python examples—including regex phone cleaning, temperature conversion, missing‑value detection, visualization with missingno, and record linkage using fuzzy matching—providing clear code snippets and step‑by‑step guidance for reliable data analysis.

data cleaning · missing data · pandas
0 likes · 20 min read
Model Perspective
Feb 13, 2024 · Big Data

Mastering Noisy Data: From Cleaning to Visualization and NLP with Python

This article reviews the key concepts from the Bad Data Handbook, covering noise identification, data validation, human readability, web data restructuring, special domain challenges, and data quality analysis, while also presenting practical data visualization techniques, popular analysis tools, Python web‑scraping libraries, and a basic NLP workflow with code examples.

Data visualization · NLP · Python
0 likes · 20 min read
Python Programming Learning Circle
Feb 4, 2024 · Fundamentals

Using FuzzyWuzzy for Fuzzy String Matching in Python

This article introduces the FuzzyWuzzy Python library, explains its Levenshtein‑based matching functions (Ratio, Partial Ratio, Token Sort Ratio, Token Set Ratio) and the process module, and demonstrates practical applications for fuzzy matching of company and province names with complete code examples.

Levenshtein distance · Python · data cleaning
0 likes · 10 min read
Test Development Learning Exchange
Jan 23, 2024 · Fundamentals

Common Data Preprocessing Techniques with Python Code Examples

This article presents ten essential data preprocessing methods—including handling missing values, type conversion, standardization, encoding, smoothing, outlier treatment, text cleaning, word frequency counting, sentiment analysis, and topic modeling—each explained with clear Python code snippets.

Python · data cleaning · data preprocessing
0 likes · 9 min read
Test Development Learning Exchange
Dec 4, 2023 · Fundamentals

Common Data Cleaning Techniques with Python Code Examples

This article presents a comprehensive collection of Python code snippets demonstrating essential data cleaning methods—including handling missing values, outlier detection, type conversion, formatting, duplicate removal, normalization, one‑hot encoding, text preprocessing, and dataset merging—providing practical guidance for preparing data for analysis or machine‑learning tasks.

data cleaning · data preprocessing · machine learning
0 likes · 7 min read
Python Programming Learning Circle
Jun 7, 2023 · Fundamentals

Using FuzzyWuzzy for Fuzzy String Matching in Python

This article introduces the FuzzyWuzzy Python library, explains its Levenshtein‑based matching functions such as Ratio, Partial Ratio, Token Sort Ratio and Token Set Ratio, demonstrates how to install it, and provides practical code examples for merging company and province fields with fuzzy matching thresholds.

data cleaning · pandas · string-matching
0 likes · 10 min read
Zhuanzhuan Tech
May 26, 2023 · Backend Development

ECP (Elasticsearch Chain Planning) System: Design, Features, and Implementation for Efficient Index Management

The article introduces the ECP system, a backend platform built on Elasticsearch that standardizes, automates, and visualizes index refresh workflows, addressing manual bottlenecks, data cleaning challenges, and coupling issues while providing task management, permission control, and environment isolation for high‑efficiency index operations.

Elasticsearch · Index Management · automation
0 likes · 12 min read
Python Programming Learning Circle
Mar 31, 2023 · Fundamentals

Vectorized String Operations in Pandas: Methods and Examples

This article explains how Pandas' vectorized string operations enable efficient, loop‑free processing of text data, covering basic methods like len() and lower(), advanced regex functions, and additional utilities such as split, replace, slice, and get_dummies, with code examples and usage details.

String processing · data cleaning · vectorization
0 likes · 21 min read
Python Programming Learning Circle
Dec 10, 2022 · Fundamentals

Using Python (pandas) to Perform Common Excel Data Processing Tasks

This article demonstrates how to replace typical Excel operations such as VLOOKUP, pivot tables, duplicate removal, missing‑value handling, multi‑condition filtering, fuzzy matching, column splitting, outlier replacement, grouping and labeling with concise Python pandas code to streamline data analysis workflows.

VLOOKUP · data cleaning · data-analysis
0 likes · 9 min read
Python Programming Learning Circle
Aug 13, 2022 · Big Data

Parallel Processing of Large CSV Files in Python Using multiprocessing, joblib, and tqdm

This tutorial demonstrates how to accelerate processing of a multi‑million‑row CSV dataset by splitting the work into sub‑tasks and applying Python's multiprocessing, joblib, and tqdm libraries for serial, parallel, and batch processing, showing significant speed‑ups and best‑practice code snippets.

Big Data · Python · data cleaning
0 likes · 10 min read
Model Perspective
Jun 3, 2022 · Fundamentals

Master Pandas: From Installation to Data Analysis in Python

This tutorial walks you through installing Pandas, importing data from CSV and Excel, creating DataFrames from dictionaries, describing datasets, indexing with loc/iloc, and cleaning columns using apply, all illustrated with clear code examples and visual outputs.

data cleaning · data import · data-analysis
0 likes · 6 min read
DataFunTalk
May 25, 2022 · Artificial Intelligence

Optimizing E-commerce Product Copy Generation: Challenges, Framework, and System Practices

This article presents a comprehensive overview of the challenges in e‑commerce product copy generation, introduces a unified framework comprising a copy generation system, a copy‑cleaning subsystem, and a quality evaluation module, and details practical optimization techniques applied to short and long copy scenarios.

AI · Model Optimization · Text Generation
0 likes · 17 min read
Python Programming Learning Circle
Apr 29, 2022 · Fundamentals

Using FuzzyWuzzy for Fuzzy String Matching in Python

This tutorial explains how to use the Python FuzzyWuzzy library, which relies on Levenshtein distance, to perform fuzzy string matching for tasks such as normalizing province or company names, and provides complete code examples and practical applications.

Levenshtein · data cleaning · fuzzy-matching
0 likes · 10 min read
Python Crawling & Data Mining
Apr 27, 2022 · Fundamentals

Master Pandas: From CSV to Advanced Data Manipulation in Python

This comprehensive tutorial walks you through pandas fundamentals—including reading CSV/Excel files, creating Series and DataFrames, performing basic operations, cleaning data, indexing, grouping, concatenation, merging, and handling time series—using clear examples and code snippets.

data analysis · data cleaning · data manipulation
0 likes · 12 min read
Laravel Tech Community
Mar 29, 2022 · Databases

Efficient Methods for Removing Duplicate Records in MySQL Tables

This article explains why a naïve Python‑based row‑deletion approach is slow for large MySQL tables and provides step‑by‑step SQL techniques—including identifying duplicate names, handling MySQL’s update‑from‑same‑table limitation, and deleting duplicates while preserving a single record per group—complete with executable code examples.

MySQL · data cleaning · duplicate removal
0 likes · 5 min read
Python Programming Learning Circle
Jan 22, 2022 · Fundamentals

Comprehensive Guide to Data Processing, Cleaning, and Visualization with Pandas

This tutorial walks through using pandas to import, review, preprocess (including integration, cleaning, transformation, handling missing and duplicate values, outlier detection, and sampling), analyze (descriptive statistics and correlation), and visualize e‑commerce data with Python, providing practical code examples for each step.

Preprocessing · data cleaning · data-analysis
0 likes · 17 min read
Python Crawling & Data Mining
Dec 20, 2021 · Fundamentals

What Weibo Comments Reveal About Wang Leehom’s Divorce: A Python Data Dive

This article walks through using Python to scrape Wang Leehom’s divorce‑related Weibo comments, clean the noisy dataset, visualize hourly comment trends, compare with his ex‑wife’s posts, generate word‑clouds and emoji frequency charts, and provides full code and data for reproducible analysis.

Data visualization · Emoji Analysis · Web Scraping
0 likes · 10 min read
Python Programming Learning Circle
Oct 11, 2021 · Fundamentals

Essential Pandas Techniques for Data Analysis in Python

This article presents a comprehensive guide to essential Pandas operations, including creating Series and DataFrames, common methods for data selection, indexing, grouping, reading and writing files, handling missing values, sorting, statistical analysis, and data transformation, with practical code examples for each feature.

data analysis · data cleaning · dataframe
0 likes · 16 min read
Python Crawling & Data Mining
Oct 11, 2021 · Backend Development

How to Scrape and Analyze 46k Rental Listings with Python: From Crawling to Visual Insights

Learn step‑by‑step how to crawl 46,000+ rental listings from Ziroom using Python, extract house details with regex, clean and transform the data with pandas, and visualize distribution, pricing and location insights through pyecharts, matplotlib and seaborn, revealing rental market patterns in Beijing.

Data visualization · Pyecharts · Web Scraping
0 likes · 24 min read
Python Crawling & Data Mining
Aug 23, 2021 · Fundamentals

How to Clean and Analyze Messy Taobao Data with Python Regex and Pandas

This article walks through cleaning chaotic Taobao CSV data using Python's regular expressions and pandas, removing unwanted characters with stop‑words, performing word segmentation, and generating word‑frequency statistics through both a classic approach and a pandas‑optimized method, complete with code snippets and visual results.

Word Frequency · data cleaning · regex
0 likes · 10 min read
Python Programming Learning Circle
Aug 6, 2021 · Fundamentals

A Comprehensive List of Commonly Used Pandas Functions Categorized by Purpose

This article presents a curated collection of 100 frequently used pandas functions, organized into six categories—statistical aggregation, data cleaning, data selection, plotting and element‑wise operations, time‑series utilities, and miscellaneous helpers—providing concise Chinese explanations for each function’s purpose.

Python · data analysis · data cleaning
0 likes · 10 min read
MaGe Linux Operations
Jul 29, 2021 · Databases

How to Efficiently Remove Duplicate Rows in Large MySQL Tables

This article explains why a naïve Python script for deduplicating millions of rows is too slow, then walks through a series of MySQL queries—including how to identify duplicate names, avoid the 1093 error, and delete duplicates while keeping a single representative row—demonstrating fast, reliable cleanup of large tables.

Database Optimization · MySQL · data cleaning
0 likes · 5 min read
Architect's Tech Stack
Jul 26, 2021 · Databases

Efficient Methods to Remove Duplicate Rows in MySQL Tables

This article explains how to identify and delete duplicate records in large MySQL tables, discusses why a naïve delete fails, and provides two robust SQL solutions—one that removes all duplicates and another that retains a single row per duplicated key—demonstrating fast execution even on tables with hundreds of thousands of rows.

MySQL · data cleaning · duplicate removal
0 likes · 5 min read
Architects' Tech Alliance
Mar 1, 2021 · Cloud Computing

Practices for Data Cleaning and Cutover Consistency in Cross‑Cloud Migration

This article explains the technical details of data cleaning, dirty‑data handling, and three methods—database read‑only, application termination, and network ACL isolation—to ensure data consistency during the data‑regulation and cutover phases of cross‑cloud migration, illustrated with real‑world case studies.

ACL isolation · Data Consistency · MySQL
0 likes · 12 min read
NetEase Smart Enterprise Tech+
Jan 14, 2021 · Big Data

How Yidun Achieves Real-Time, High-Performance Public-Opinion Data Cleaning with Groovy and JVM

Yidun’s public-opinion monitoring platform transforms massive raw web data into a unified format by separating dynamic Groovy-script-driven cleaning from static processing, achieving real-time source integration, high throughput, scalability, and high availability while addressing format diversity, team coordination, and performance-flexibility trade-offs.

Big Data · ETL · Groovy
0 likes · 5 min read
Top Architect
Jan 13, 2021 · Big Data

Migrating Over 2 Billion MySQL Records to the Cloud with Kafka and BigQuery

Facing a MySQL table with over 2 billion continuously growing rows, the team designed a cloud‑based solution using Kafka to stream data into Google BigQuery, applied partitioned schemas and data cleaning to improve query performance, reduce storage costs, and avoid downtime.

BigQuery · Data Migration · Partitioning
0 likes · 8 min read
Python Programming Learning Circle
Dec 18, 2020 · Fundamentals

Data Exploration and Cleaning: Core Concepts, Steps, and Example Workflow

This article explains the purpose of data exploration and cleaning, outlines core analysis tasks, details missing‑value and outlier handling techniques—including various imputation methods—and illustrates the complete workflow with example images and a histogram‑based distribution analysis.

data cleaning · data exploration · data preprocessing
0 likes · 3 min read