Tagged articles

Data preprocessing

86 articles · Page 1 of 1

Jul 1, 2026 · Artificial Intelligence

How to Convert Text into Numerical Features for NLP: Tokenization, One‑Hot Encoding, and Word Embedding

This article walks through the essential steps of turning raw natural language into machine‑readable numbers, covering categorical vs. numerical features, one‑hot encoding of categorical data, tokenization, building vocabularies, and using word embeddings, illustrated with an IMDB sentiment‑analysis example in Keras.

Data preprocessing · IMDB sentiment analysis · Keras

0 likes · 7 min read
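
As a rough illustration of the vocabulary-building and one-hot steps named in the summary above, here is a minimal pure-Python sketch. It is not taken from the article (which uses Keras and IMDB); the whitespace tokenizer, the helper names `build_vocab` and `one_hot`, and the two sample sentences are all assumptions made for the example.

```python
def build_vocab(texts):
    """Map each distinct lowercase token to an integer index, in order of first appearance."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():  # naive whitespace tokenization (assumption)
            vocab.setdefault(token, len(vocab))
    return vocab


def one_hot(token, vocab):
    """Return a vector of zeros with a single 1 at the token's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec


# Hypothetical sample reviews, not from the IMDB dataset
texts = ["the movie was great", "the plot was thin"]
vocab = build_vocab(texts)

# Encode the second review as a sequence of integer indices
encoded = [vocab[t] for t in texts[1].lower().split()]
was_vector = one_hot("was", vocab)
```

A one-hot vector grows with the vocabulary size, which is the usual motivation for moving to dense word embeddings.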

AI Architecture Hub

Jun 15, 2026 · Artificial Intelligence

Build Your Own LLM from Scratch: The 5 Essential Stages Behind GPT and Claude

This guide breaks down the complete workflow for building a large language model—from tokenization and pre‑training to data curation, scaling laws, alignment via RLHF/DPO, and robust evaluation—showing why architecture is less critical than data, scaling, and engineering.

AI Engineering · Data preprocessing · LLM training

0 likes · 12 min read

AI Large-Model Wave and Transformation Guide

Jun 1, 2026 · Artificial Intelligence

How to Build High‑Quality AI Datasets: Standards, Templates, and Practical Steps

This guide walks AI engineers and project leaders through the full lifecycle of high‑quality dataset creation—from defining requirements and setting annotation standards to data collection, preprocessing, labeling, augmentation, evaluation, and continuous iteration—providing concrete metrics, compliance rules, and tool recommendations to avoid common pitfalls.

AI dataset · Data Quality · Data preprocessing

0 likes · 16 min read

Data Party THU

May 19, 2026 · Artificial Intelligence

Model Performance Lagging? Master Feature Engineering with a Complete Step‑by‑Step Guide

This article walks through the entire feature‑engineering pipeline—data cleaning, missing‑value imputation, encoding, outlier handling, scaling, feature construction, and selection—using Pandas and Scikit‑learn, and shows how to wrap the steps into a reproducible Scikit‑learn Pipeline.

Data preprocessing · Pandas · Scikit-learn

0 likes · 9 min read

DeepHub IMBA

May 12, 2026 · Artificial Intelligence

Hands‑On Feature Engineering with Pandas and Scikit‑Learn: Complete Code Walkthrough

This article walks through a full feature‑engineering pipeline using Pandas and Scikit‑Learn, covering data inspection, missing‑value imputation, categorical encoding, outlier handling, scaling, feature construction, selection, and a final Pipeline that prepares clean, predictive features for a logistic‑regression model.

Data preprocessing · Pandas · Scikit-learn

0 likes · 9 min read

Data Party THU

Feb 2, 2026 · Fundamentals

Why Standardize Data to Mean 0 and Variance 1?

The article explains that setting the mean to zero recenters data around the origin, making optimization algorithms converge faster, while scaling variance to one equalizes feature scales so no single feature dominates, illustrated with examples and visualizations of how standardization improves machine‑learning models.

Data preprocessing · feature scaling · machine learning

0 likes · 5 min read
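
The standardization described in the summary above (subtract the mean, divide by the standard deviation) can be sketched as follows. This is an illustrative example using only the Python standard library, not code from the article; the function name `standardize` and the two sample columns are assumptions.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Rescale a list of numbers to mean 0 and (population) variance 1."""
    mu = fmean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no spread; map it to zeros instead of dividing by 0
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Two hypothetical features on very different scales
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
incomes = [20_000.0, 35_000.0, 50_000.0, 65_000.0, 80_000.0]

z_heights = standardize(heights_cm)
z_incomes = standardize(incomes)
```

After rescaling, both columns occupy the same range, so neither dominates a distance or gradient computation purely because of its units.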

AI Cyberspace

Jan 18, 2026 · Artificial Intelligence

Understanding Supervised, Unsupervised, Self‑Supervised, Semi‑Supervised, and Reinforcement Learning for Large Language Model Training

The article explains various learning paradigms (supervised, unsupervised, self‑supervised, semi‑supervised, and reinforcement), describes dataset types and quality considerations, outlines preprocessing steps like filtering, deduplication, and tokenization, and discusses scaling laws linking model size, data volume, and compute resources, with concrete examples and code.

Data preprocessing · Model Training · Scaling Laws

0 likes · 26 min read

Data STUDIO

Oct 28, 2025 · Artificial Intelligence

8 Proven Ways to Boost Machine Learning Model Accuracy

This article outlines eight practical techniques—including data augmentation, handling missing values, feature engineering, algorithm selection, hyperparameter tuning, ensemble methods, and cross‑validation—to systematically improve the accuracy of Python machine‑learning models, supported by explanations, examples, and code snippets.

Data preprocessing · cross-validation · ensemble methods

0 likes · 16 min read

DataFunSummit

Sep 13, 2025 · Artificial Intelligence

How Pinterest Scaled LLM Data Pipelines with Ray: Boosting Throughput and Cutting Costs

This article details how Pinterest’s senior staff engineer Dr. Luo leveraged the open‑source Ray framework to overcome LLM data‑preprocessing bottlenecks, describing the system’s architecture, key features such as map_batches, Carry‑Over Columns and Accumulators, and the dramatic performance and cost improvements achieved.

Data preprocessing · Distributed Computing · LLM

0 likes · 12 min read

Data STUDIO

Sep 5, 2025 · Artificial Intelligence

19 Elegant Sklearn Tricks for More Efficient Machine Learning

This article presents 19 practical Sklearn functions—ranging from outlier detection to hyper‑parameter search—that replace manual data‑science steps, each illustrated with concise code examples and performance comparisons.

Data preprocessing · Scikit-learn · feature selection

0 likes · 24 min read

Instant Consumer Technology Team

Aug 21, 2025 · Artificial Intelligence

How Data‑Juicer Supercharges LLM Training with High‑Quality Multimodal Data

Data‑Juicer is an open‑source, one‑stop multimodal data processing system that provides fine‑grained operators, scalable pipelines, and ready‑made recipes to deliver high‑quality, diverse, and model‑friendly data for large language model pre‑training, fine‑tuning, and multimodal applications.

.ai · Data preprocessing · LLM

0 likes · 22 min read

Python Programming Learning Circle

Jul 8, 2025 · Artificial Intelligence

10 One‑Line Python Tricks to Jump‑Start Your Machine Learning Projects

This article presents ten concise, practical one‑line Python code snippets—ranging from loading CSV data with Pandas to building sophisticated Scikit‑learn pipelines—that streamline common machine‑learning tasks such as data cleaning, encoding, splitting, scaling, model training, evaluation, cross‑validation, and prediction.

Data preprocessing · One-hot encoding · Pandas

0 likes · 10 min read

AI Code to Success

Feb 27, 2025 · Artificial Intelligence

Master Decision Trees: Theory, Construction, and Python Implementation

This article provides a comprehensive guide to decision tree algorithms, covering their theoretical foundations, key components, construction workflow—including data preprocessing, feature selection, tree growth, stopping criteria, and pruning—followed by an overview of popular variants like ID3, C4.5, CART, practical advantages, applications, and a complete Python implementation using scikit-learn.

Data preprocessing · Decision Tree · Python

0 likes · 29 min read

Python Programming Learning Circle

Jan 9, 2025 · Fundamentals

Python Data Preprocessing and Visualization of Jay Chou Lyrics: From JSON to Word Cloud

This tutorial demonstrates how to convert a JSON lyric database into Excel, filter Jay Chou songs, perform Chinese word segmentation with Jieba, compute word frequencies, and create visualizations such as word clouds using Python code and online tools.

Data preprocessing · Pandas · Python

0 likes · 9 min read

Test Development Learning Exchange

Dec 5, 2024 · Artificial Intelligence

End-to-End House Prices Prediction Project: Data Collection, Preprocessing, Modeling, Evaluation, and Deployment with Python

This tutorial walks through a complete house price prediction project, covering data collection from Kaggle, preprocessing with pandas and scikit‑learn, model training using RandomForestRegressor, evaluation, and deployment of a Flask API for real‑time predictions, providing full code examples.

Data preprocessing · Flask · Model Deployment

0 likes · 9 min read

Test Development Learning Exchange

Nov 26, 2024 · Artificial Intelligence

Comprehensive Python Tutorial for Data Preprocessing, Feature Engineering, Model Training, Evaluation, and Deployment

This tutorial walks through consolidating the first ten days of learning by covering data preprocessing, feature engineering, model training with linear regression, decision tree, and random forest, model evaluation using cross‑validation, and finally saving and loading the best model, all illustrated with complete Python code examples.

Data preprocessing · Model Training · Python

0 likes · 9 min read

Test Development Learning Exchange

Oct 29, 2024 · Artificial Intelligence

Data Preprocessing and Modeling with Pandas and Scikit‑learn

This guide walks through using Pandas for data cleaning, feature engineering, and preparation, then demonstrates building, evaluating, and persisting a machine‑learning model with Scikit‑learn's pipeline and RandomForestClassifier in Python.

Data preprocessing · Model Training · Scikit-learn

0 likes · 5 min read

DaTaobao Tech

Aug 21, 2024 · Artificial Intelligence

Mastering Custom Large‑Model Training: Data Strategies, LoRA Tricks, and Resource Planning

This article provides a comprehensive, step‑by‑step guide to training customized large language models, covering industry‑specific needs, data privacy, meticulous data cleaning, optimal data‑ratio balancing, token budgeting, GPU memory accounting, LoRA fine‑tuning techniques, and practical evaluation metrics for robust AI deployment.

AI training · Data preprocessing · GPU memory

0 likes · 23 min read

Rare Earth Juejin Tech Community

Jul 7, 2024 · Artificial Intelligence

Daily and Sports Activities Dataset: Description, Preprocessing Pipeline, and CNN Classification Results

This article introduces the Daily_and_Sports_Activities sensor dataset, details its structure and characteristics, provides a Python preprocessing pipeline with sliding‑window segmentation and Z‑score normalization, and reports CNN training results achieving 87.93% accuracy on activity classification.

CNN · Data preprocessing · UCI

0 likes · 9 min read

php Courses

Jun 13, 2024 · Artificial Intelligence

Using PHP for Data Dimensionality Reduction and Feature Extraction

This article explains the importance of data dimensionality reduction and feature extraction in machine learning, and provides a step‑by‑step guide with PHP code examples—including library installation, data preprocessing, PCA‑based reduction, and feature selection techniques—demonstrating how to handle large datasets efficiently.

Data preprocessing · PCA · PHP

0 likes · 6 min read

Python Programming Learning Circle

Apr 27, 2024 · Fundamentals

Data Cleaning Techniques in Python: 21 Practical Examples and Code

This article provides a comprehensive guide to data cleaning in Python, covering common data issues, methods for handling missing values, duplicates, categorical inconsistencies, and text normalization, illustrated with 21 detailed code examples using pandas and matplotlib.

Data preprocessing · Pandas · data cleaning

0 likes · 16 min read

HelloTech

Mar 14, 2024 · Artificial Intelligence

Feature Engineering: Concepts, Methods, and Automation

Feature engineering transforms existing data into new predictive variables through manual analysis or automated pipelines, encompassing single‑variable encoding, pairwise arithmetic, group‑statistics, multi‑variable combinations, time‑series and text derivations, with tools like Deep Feature Synthesis and beam‑search to generate and select useful features.

Data preprocessing · automated features · feature derivation

0 likes · 17 min read

Test Development Learning Exchange

Jan 23, 2024 · Fundamentals

Common Data Preprocessing Techniques with Python Code Examples

This article presents ten essential data preprocessing methods—including handling missing values, type conversion, standardization, encoding, smoothing, outlier treatment, text cleaning, word frequency counting, sentiment analysis, and topic modeling—each explained with clear Python code snippets.

Data preprocessing · Pandas · Python

0 likes · 9 min read

Python Crawling & Data Mining

Jan 6, 2024 · Fundamentals

How to Clean Messy CSV Data with Python Pandas: A Step-by-Step Guide

This article walks through cleaning irregular fan data—removing spaces, Chinese characters, asterisks, and missing brackets—using Python's pandas library, provides the full code snippet, demonstrates the resulting DataFrame, and shares practical tips for preparing data before analysis.

Code Tutorial · Data preprocessing · Pandas

0 likes · 5 min read

Test Development Learning Exchange

Dec 4, 2023 · Fundamentals

Common Data Cleaning Techniques with Python Code Examples

This article presents a comprehensive collection of Python code snippets demonstrating essential data cleaning methods—including handling missing values, outlier detection, type conversion, formatting, duplicate removal, normalization, one‑hot encoding, text preprocessing, and dataset merging—providing practical guidance for preparing data for analysis or machine‑learning tasks.

Data preprocessing · Pandas · data cleaning

0 likes · 7 min read

Python Programming Learning Circle

Dec 4, 2023 · Artificial Intelligence

Processing Chinese Lyrics Data with Python: From JSON Extraction to Word Cloud Visualization

This tutorial demonstrates how to preprocess a Chinese lyrics JSON dataset, extract Jay Chou's songs using Python, perform word segmentation with Jieba, compute word frequencies, and create visualizations such as word clouds both programmatically and with online tools.

Data preprocessing · NLP · jieba

0 likes · 9 min read

Test Development Learning Exchange

Oct 10, 2023 · Artificial Intelligence

Feature Engineering Techniques for Various Business Scenarios with Python Code Examples

This article presents practical feature‑engineering methods for ten common business domains, explaining the purpose of each feature, the extraction technique, and providing ready‑to‑run Python code snippets to help build more accurate predictive models.

Business Analytics · Data preprocessing · feature engineering

0 likes · 7 min read

Test Development Learning Exchange

Sep 29, 2023 · Fundamentals

Master Essential Data Preprocessing Techniques for Accurate Analysis

This guide walks through ten core data preprocessing methods—including handling missing values, type conversion, standardization, encoding, smoothing, outlier treatment, text cleaning, word‑frequency counting, sentiment analysis, and topic modeling—each illustrated with concise Python code examples.

Data preprocessing · Pandas · Python

0 likes · 9 min read

Zhengcaiyun Technology (政采云技术)

Sep 21, 2023 · Fundamentals

RFM Analysis: A Comprehensive Guide to Customer Segmentation and Marketing Optimization

RFM analysis is a powerful tool for understanding customer value and behavior, enabling businesses to optimize marketing strategies, improve customer satisfaction, and increase business revenue through data-driven customer segmentation.

Customer Behavior · Customer Lifecycle Management · Customer Segmentation

0 likes · 16 min read

DaTaobao Tech

Sep 11, 2023 · Artificial Intelligence

Large Language Model Upgrade Paths and Architecture Selection

This article analyzes upgrade paths of major LLMs—ChatGLM, LLaMA, Baichuan—detailing performance, context length, and architectural changes, then examines essential capabilities, data cleaning, tokenizer and attention design, and offers practical guidance for balanced scaling and efficient model construction.

Baichuan · ChatGLM · Data preprocessing

0 likes · 32 min read

Test Development Learning Exchange

Aug 20, 2023 · Fundamentals

Key Steps and Techniques for Data Cleaning with Python Pandas

This article outlines essential data cleaning steps—including handling missing and duplicate values, type conversion, outlier treatment, text processing, standardization, sampling, and merging—providing concise Python pandas code snippets for each technique to improve data quality for analysis.

Big Data · Data preprocessing · Pandas

0 likes · 5 min read

Alibaba Cloud Big Data AI Platform

Jun 21, 2023 · Artificial Intelligence

How GoldMiner Boosts Deep Learning Training by Up to 12× with Elastic Data Pre‑Processing

GoldMiner, a new system from Alibaba Cloud’s PAI platform, elastically scales deep learning data pre‑processing pipelines, dramatically improving training performance up to 12.1× and GPU cluster utilization by 2.5×, and its underlying research was accepted at SIGMOD 2023.

Data preprocessing · Deep Learning · GPU Utilization

0 likes · 5 min read

Python Programming Learning Circle

Jun 17, 2023 · Big Data

Accelerating Python Data Preprocessing with Multiprocessing in Three Lines of Code

This article demonstrates how to use Python's concurrent.futures module to parallelize image resizing, turning a single‑process script into a multi‑core solution with just three additional lines of code, achieving up to a six‑fold speed‑up on typical CPUs.

Data preprocessing · Performance · Python

0 likes · 7 min read

Network Intelligence Research Center (NIRC)

May 10, 2023 · Artificial Intelligence

How LLaMA Preprocesses Training Data with CCNet Before Model Training

Before training large language models like LLaMA, Meta AI applies a multi‑stage CCNet pipeline that crawls web data, stores it in WET format, deduplicates paragraphs, detects and filters languages using fastText, and further refines content by similarity to Wikipedia and citation‑based linear models.

CCNet · Data preprocessing · Deduplication

0 likes · 7 min read

Model Perspective

Mar 3, 2023 · Fundamentals

Unlock Hidden Patterns: A Practical Guide to Factor Analysis with Python

Factor analysis, a statistical technique for uncovering underlying common factors among variables, is explained alongside its distinction from PCA, detailed procedural steps, adequacy tests, and a hands‑on Python implementation using the factor_analyzer library with visualizations and factor rotation methods.

Data preprocessing · Python · factor analysis

0 likes · 10 min read

Python Programming Learning Circle

Dec 31, 2022 · Artificial Intelligence

A Beginner’s Guide to Data Preprocessing for Machine Learning in Python

This tutorial walks beginners through the essential steps of data preprocessing for any machine learning model, covering library imports, dataset loading, handling missing values, encoding categorical features, splitting into train‑test sets, and applying feature scaling using Python’s scikit‑learn.

Data preprocessing · One-hot encoding · Python

0 likes · 11 min read

Python Programming Learning Circle

Dec 7, 2022 · Artificial Intelligence

Predicting the 2022 FIFA World Cup Champion Using Machine Learning Models

This article details a data‑mining project that uses historical World Cup match data, extensive feature engineering, and various machine‑learning algorithms—including neural networks, logistic regression, SVM, decision trees, and random forests—to predict the champion of the 2022 tournament, while analyzing model errors and proposing improvements.

Data preprocessing · World Cup · classification

0 likes · 7 min read

Model Perspective

Nov 28, 2022 · Fundamentals

Master R Data Preprocessing: Sorting, Merging, and Handling Missing Values

Before statistical analysis in R, you need to preprocess data by sorting it with sort(), rank(), order(), or arrange(), merging datasets horizontally with merge() or cbind() and vertically with rbind(), and handling missing values (NA and NaN) with the na.rm argument and the na.omit() function.

Data preprocessing · Merging · R

0 likes · 4 min read

MaGe Linux Operations

Oct 1, 2022 · Artificial Intelligence

11 Powerful Feature Selection Techniques Every Data Scientist Should Master

This guide walks through a comprehensive set of feature‑selection strategies—from removing unused or missing columns to handling multicollinearity, low‑variance features, and using PCA—complete with Python code examples and visualizations to help you build leaner, more interpretable machine‑learning models.

Data preprocessing · Python · Scikit-learn

0 likes · 18 min read

ITPUB

Sep 15, 2022 · Artificial Intelligence

Why Precise Feature Engineering Still Matters in Recommendation Systems

In the era of deep learning, feature engineering remains crucial for recommendation and search advertising because it bridges raw relational data and models, improves performance, reduces complexity, and handles high‑cardinality, large‑scale, and time‑sensitive scenarios with robust transformations and statistical encoding.

.ai · Data preprocessing · Recommendation Systems

0 likes · 20 min read

NetEase LeiHuo UX Big Data Technology

Sep 5, 2022 · Artificial Intelligence

Feature Engineering in Game Data: Types, Missing Value and Outlier Handling

This article explains how feature engineering in game data involves classifying structured and unstructured, quantitative and qualitative features, and details practical methods for handling missing values and outliers to improve machine‑learning model performance.

Data preprocessing · feature engineering · game data

0 likes · 9 min read

DataFunTalk

Aug 30, 2022 · Artificial Intelligence

Feature Engineering for Recommendation and Search Advertising

This article explains why meticulous feature engineering remains crucial in recommendation and search advertising, outlines what constitutes good features, describes common transformation techniques such as scaling, binning, and encoding, and provides practical examples and Q&A for practitioners.

.ai · Data preprocessing · Recommendation Systems

0 likes · 18 min read

Model Perspective

Aug 14, 2022 · Artificial Intelligence

Mastering Feature Binning with sklearn: Uniform, Quantile, and K‑Means Methods

This article explains why discretizing continuous variables improves model stability, introduces three common binning techniques—equal-width, equal-frequency, and clustering—and demonstrates how to implement each using scikit‑learn's KBinsDiscretizer with Python code examples on a synthetic score dataset.

Data preprocessing · KBinsDiscretizer · Python

0 likes · 5 min read

Python Programming Learning Circle

Jul 4, 2022 · Fundamentals

Discretizing Numerical Variables with Pandas: between, cut, qcut, and value_counts

This article demonstrates four Pandas techniques—between with loc, cut, qcut, and value_counts—to discretize numeric variables into bins, assigning grades A, B, C to exam scores, and shows how to generate synthetic data, define bin boundaries, and count records per bin.

Data preprocessing · Pandas · Python

0 likes · 9 min read
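
Two of the four techniques named in the summary above, `cut` (fixed bin edges) and `qcut` (equal-frequency bins), might look like the sketch below. It is an illustration rather than the article's code; the six sample scores, the grade thresholds of 60 and 80, and the tercile labels are assumptions.

```python
import pandas as pd

# Hypothetical exam scores
scores = pd.Series([35, 58, 62, 71, 80, 93])

# cut: explicit bin edges; intervals are right-closed by default,
# so the bins are (0, 60], (60, 80], (80, 100]
grades = pd.cut(scores, bins=[0, 60, 80, 100], labels=["C", "B", "A"])

# qcut: edges chosen from the data so each bin holds roughly the same number of records
terciles = pd.qcut(scores, q=3, labels=["low", "mid", "high"])

# Count how many records fall into each grade, in label order
grade_counts = grades.value_counts(sort=False)
```

With `cut` the bin sizes depend on the data (here two C, three B, one A), while `qcut` fixes the bin sizes and lets the edges move.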

Model Perspective

Jun 26, 2022 · Fundamentals

How to Build and Validate a GM(1,1) Grey Prediction Model Step‑by‑Step

This article explains the GM(1,1) grey prediction model, covering data preprocessing, model construction, parameter estimation, error testing methods, and how to generate forecasts, providing a practical guide for applying the technique to time‑series data.

Data preprocessing · GM(1,1) · Grey Model

0 likes · 4 min read

Python Programming Learning Circle

Feb 28, 2022 · Artificial Intelligence

Time Series Data Preprocessing: Missing Value Imputation, Denoising, and Outlier Detection

This article explains essential time series preprocessing techniques—including data sorting, handling missing values with interpolation methods, applying rolling averages, Fourier transform denoising, and detecting anomalies using rolling statistics, isolation forests, and K‑means clustering—illustrated with Python code on the AirPassengers and Google stock datasets.

Data preprocessing · Denoising · Python

0 likes · 9 min read

Baobao Algorithm Notes

Feb 14, 2022 · Artificial Intelligence

Mastering Feature Engineering: From AutoML Dictionaries to Business‑Driven Insights

This article presents a comprehensive, practical methodology for feature engineering that combines brute‑force AutoML‑style dictionary searches, business‑logic‑driven feature creation, and feature‑importance‑guided refinement, illustrating each approach with real Kaggle competition examples and concrete code snippets.

AutoML · Data preprocessing · Kaggle

0 likes · 12 min read

Baobao Algorithm Notes

Jan 10, 2022 · Fundamentals

Why a Simple Frequency Count Can Outperform Complex Models for Categorical Features

This article explains how a five‑line pandas script can generate frequency‑based encodings for categorical columns, why such encodings help low‑frequency categories, and how they integrate with feature crossing and industry use cases like ad exposure, PV/UV counting, and fraud detection.

Data preprocessing · GBDT · categorical encoding

0 likes · 6 min read
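
A frequency-based encoding of the kind the summary above describes can be written in a few lines of pandas. The sketch below is an assumption about what such a script might look like, not the article's own code; the column name `city` and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"city": ["A", "B", "A", "C", "A", "B"]})

# Count how often each category occurs, then map the counts back onto the rows
counts = df["city"].value_counts()
df["city_count"] = df["city"].map(counts)

# Optional: express the count as a share of all rows
df["city_freq"] = df["city_count"] / len(df)
```

Rare categories end up with small values, which gives a tree-based model a single numeric split for separating them from frequent ones.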

Baobao Algorithm Notes

Dec 28, 2021 · Artificial Intelligence

Why Feature Engineering Is the Secret Sauce Behind Machine Learning Success

This article explains the concept of feature engineering, illustrates it with a height‑weight classification example, compares kernel‑enhanced models to handcrafted features like BMI, discusses its impact on model performance, and highlights practical tips and domain‑specific considerations.

.ai · Data preprocessing · feature engineering

0 likes · 10 min read

ITPUB

Sep 10, 2021 · Fundamentals

Mastering Data Cleaning: Handling Missing Values, Outliers, and Inconsistencies with Pandas

This guide explains why data cleaning is essential, categorizes common problems such as missing values, noise/outliers, and inconsistent records, and provides step‑by‑step procedures and Pandas code snippets to detect, diagnose, and remediate each issue in real‑world datasets.

Data preprocessing · Pandas · data cleaning

0 likes · 9 min read

Python Programming Learning Circle

Aug 3, 2021 · Fundamentals

Practical Python Data Cleaning Functions

This article presents a collection of straightforward yet practical Python functions for data cleaning tasks—including dropping columns, changing data types, converting categorical variables, handling missing values, removing unwanted characters, trimming whitespace, conditional concatenation, and converting string timestamps—designed to streamline preprocessing in data analysis projects.

Data preprocessing · data science

0 likes · 7 min read

JD Tech

Jul 30, 2021 · Databases

Practical Use of HBase in a Logistics HR Data Preprocessing Platform

This article details how the logistics HR data preprocessing platform processes around 20 million daily records by adopting HBase for high‑performance, scalable, column‑oriented storage, covering its architecture, read/write mechanisms, best practices, and performance considerations.

Big Data · Data preprocessing · Distributed storage

0 likes · 10 min read

58 Tech

Jul 23, 2021 · Artificial Intelligence

MMoE Model Training and Evaluation for 58.com Recruitment Recommendation Competition

This article details the background, MMoE model architecture, baseline setup, environment configuration, data preprocessing, training process, evaluation results, and department information for the 58.com recruitment recommendation AI competition using the WPAI platform.

Data preprocessing · MMoE · PyTorch

0 likes · 11 min read

Architecture Digest

Jun 21, 2021 · Databases

Using HBase for HR Performance Data Preprocessing Platform: Architecture, Concepts, and Best Practices

This article introduces the HR performance data preprocessing platform’s requirements, explains why HBase was selected as the storage solution, details its core concepts, architecture, data write/read processes, best practices, limitations, and presents performance metrics demonstrating its suitability for large‑scale, high‑throughput workloads.

Big Data · Data preprocessing · Database Architecture

0 likes · 12 min read

ITFLY8 Architecture Home

Jun 20, 2021 · Big Data

Why HBase Is the Ideal Choice for Large‑Scale HR Data Preprocessing

This article explains how HBase’s distributed column‑oriented architecture, high‑performance read/write capabilities, and flexible schema make it a cost‑effective solution for handling massive, unstructured HR performance data, covering its core concepts, cluster operation, best practices, and performance metrics.

Big Data · Data preprocessing · HBase

0 likes · 11 min read

Python Programming Learning Circle

May 8, 2021 · Artificial Intelligence

Top 10 New Features in Scikit‑learn 0.24

The article reviews the most important additions in scikit‑learn 0.24, including faster hyper‑parameter search methods, ICE plots, histogram‑based boosting improvements, new feature‑selection tools, polynomial‑feature approximations, a semi‑supervised classifier, MAPE metric, enhanced OneHotEncoder and OrdinalEncoder handling, and a more flexible RFE interface.

Data preprocessing · Python · Scikit-learn

0 likes · 8 min read

DataFunTalk

Mar 5, 2021 · Artificial Intelligence

Feature Selection Techniques for the Kaggle Mushroom Classification Dataset Using Python

This tutorial explains why and how to reduce the number of features in the Kaggle Mushroom Classification dataset with Python, covering preprocessing, various feature‑selection methods (filter, wrapper, embedded), code examples, model training, performance impact, and visualisation of results.

Data preprocessing · Mushroom dataset · Python

0 likes · 14 min read

DataFunTalk

Jan 23, 2021 · Artificial Intelligence

Feature Engineering: Mapping Raw Data to Machine‑Learning Features and Best Practices

This article explains how feature engineering transforms raw data into numerical representations for machine‑learning models, covering mapping of numeric and categorical values, one‑hot and multi‑hot encoding, sparse representations, scaling, handling outliers, binning, data quality checks, and feature interactions to capture non‑linear relationships.

Data preprocessing · encoding · feature engineering

0 likes · 20 min read

Python Programming Learning Circle

Dec 18, 2020 · Fundamentals

Data Exploration and Cleaning: Core Concepts, Steps, and Example Workflow

This article explains the purpose of data exploration and cleaning, outlines core analysis tasks, details missing‑value and outlier handling techniques—including various imputation methods—and illustrates the complete workflow with example images and a histogram‑based distribution analysis.

Data preprocessing · data cleaning · data exploration

0 likes · 3 min read

Python Crawling & Data Mining

Nov 3, 2020 · Databases

Master Data Cleaning: 5 Essential SQL & Python Techniques for Real-World Datasets

This article walks through five common data‑cleaning scenarios—dropping/renaming columns, handling duplicates and nulls, string manipulation, merging tables, and window‑function ranking—showing practical SQL and Python code that can be applied to large‑scale data warehouses.

Data preprocessing · Python · SQL

0 likes · 9 min read

Taobao Frontend Technology

Oct 27, 2020 · Artificial Intelligence

Mastering Tensors in TensorFlow.js: From Scalars to Neural Networks

This guide explains the fundamentals of tensors in TensorFlow.js—including scalars, vectors, and higher‑dimensional tensors—demonstrates how to convert real‑world data such as the Titanic dataset into tensors, and shows how to build, compile, and train a simple neural network model using appropriate layers, loss functions, and optimizers.

Data preprocessing · JavaScript · Neural Network

0 likes · 7 min read

TAL Education Technology

Sep 17, 2020 · Artificial Intelligence

Comprehensive Guide to Feature Engineering and Data Preprocessing for Machine Learning

This article provides an extensive overview of feature engineering, covering feature understanding, cleaning, construction, selection, transformation, and dimensionality reduction techniques, illustrated with Python code using the Titanic dataset, and offers practical guidelines for improving data quality and model performance in machine learning projects.

Data preprocessing · Python · Titanic dataset

0 likes · 44 min read

Alibaba Cloud Developer

Aug 23, 2020 · Artificial Intelligence

Unlocking Powerful Features: A Deep Dive into Tianchi’s Repeat Purchase Prediction

This tutorial walks through the complete feature‑engineering pipeline for the Alibaba Tianchi “Tmall User Repeat Purchase Prediction” competition, covering data acquisition, memory‑efficient preprocessing, multi‑entity feature construction, statistical aggregations, text vectorisation, embedding generation and stacking‑based model features, all illustrated with Python code and diagrams.

Data preprocessing · Stacking · feature engineering

0 likes · 16 min read

MaGe Linux Operations

Aug 23, 2020 · Fundamentals

Master Pandas: From Data Import to Visualization in Python

This tutorial walks through using pandas for data import, inspection, preprocessing, analysis, and visualization, providing practical code snippets and explanations to help readers understand essential data manipulation techniques in Python.

Data preprocessing · Pandas

0 likes · 19 min read

DataFunTalk

Aug 14, 2020 · Artificial Intelligence

Illustrated Guide to the Complete Machine Learning Workflow

This article presents a hand‑drawn, illustrated walkthrough of the entire machine‑learning pipeline—from dataset definition, exploratory data analysis, preprocessing, and data splitting to model building, algorithm selection, hyper‑parameter tuning, feature selection, and evaluation for both classification and regression tasks.

Data preprocessing · Regression · classification

0 likes · 17 min read

Python Programming Learning Circle

Jun 11, 2020 · Artificial Intelligence

Step-by-Step Guide to Building a Movie Recommendation System with TensorFlow

This tutorial walks through collecting and cleaning the MovieLens dataset, constructing rating and record matrices, normalizing ratings, defining a collaborative‑filtering model in TensorFlow, training it with Adam optimizer, evaluating performance, and finally generating personalized movie recommendations for a chosen user.

Data preprocessing · TensorFlow · collaborative filtering

0 likes · 10 min read

JD Tech Talk

Jun 4, 2020 · Artificial Intelligence

The Art and Science of Feature Engineering: Importance, Methods, and Automation

Feature engineering, which occupies the majority of data scientists' time, is essential for building high‑performing machine‑learning models and involves careful data quality control, diverse construction techniques, rigorous selection, and emerging automation efforts, all of which demand domain expertise and systematic practice.

.ai · Data preprocessing · feature engineering

0 likes · 14 min read

JD Tech Talk

May 29, 2020 · Artificial Intelligence

The Black Art of Feature Engineering: Importance, Techniques, and Automation

This article explains why feature engineering consumes most of a data scientist's time, outlines its critical steps—including data observation, cleaning, transformation, selection, and reduction—covers practical issues such as missing‑value handling, data leakage, and feature stability, and discusses both manual and automated approaches for building effective machine‑learning models.

Data preprocessing · feature engineering · machine learning

0 likes · 14 min read

Python Programming Learning Circle

May 21, 2020 · Artificial Intelligence

Time Series Forecasting and Anomaly Detection for API Traffic Using Seasonal Decomposition and ARIMA

The article presents a complete workflow for predicting next‑day API request volumes by exploring per‑minute traffic data, handling missing values, applying seasonal decomposition, training an ARIMA model on the trend component, and generating confidence intervals to flag anomalous spikes.

ARIMA · Anomaly Detection · Data preprocessing

0 likes · 12 min read

Python Crawling & Data Mining

May 21, 2020 · Fundamentals

Master Pandas: Essential Data Preprocessing, Merging, and Analysis Techniques

This article provides a comprehensive, step‑by‑step guide to using pandas for data preprocessing, merging, indexing, sorting, grouping, extraction, filtering, sampling, and statistical analysis, complete with clear code examples and visual results to help readers master essential data manipulation in Python.

Data preprocessing · Python · data analysis

0 likes · 13 min read

Python Programming Learning Circle

Mar 6, 2020 · Artificial Intelligence

Introduction to Machine Learning Concepts: Data, Features, Labels, Training, and Common Algorithms

This article provides a beginner-friendly overview of machine learning fundamentals, covering the definition of data, the distinction between features and labels, types of features, dimensionality, training and test datasets, normalization, supervised and unsupervised learning methods, algorithm selection, development workflow, and recommended Python libraries such as NumPy.

Data preprocessing · features · machine learning

0 likes · 12 min read

360 Quality & Efficiency

Jan 17, 2020 · Artificial Intelligence

File Release Application Prediction Model Using GBDT

This article describes how a GBDT‑based prediction model was built to forecast file release application parameters such as volume ratio, target audience, and gray level, covering data collection, feature engineering, model training, service deployment, and practical considerations for handling bad cases.

Data preprocessing · GBDT · file release

0 likes · 8 min read

Tencent Cloud Developer

Dec 3, 2019 · Artificial Intelligence

Feature Engineering Practices for Short‑Video Recommendation Systems

Effective short‑video recommendation relies on meticulous feature engineering that transforms raw signals—numerical counts, categorical IDs, content and user embeddings, context and session data—through bucketization, scaling, crossing, and smoothing, then selects and evaluates them via filtering, wrapping, regularization, and importance analysis to mitigate business biases and improve multi‑objective ranking performance.

Data preprocessing · Embedding · bias mitigation

0 likes · 32 min read

58 Tech

Nov 11, 2019 · Artificial Intelligence

Design and Implementation of the 58 Car Price Estimation System Using Machine Learning

The article describes the end‑to‑end architecture, data collection, preprocessing, feature engineering, model selection, training, and hyper‑parameter tuning of 58’s car price estimation platform, which leverages Spark, XGBoost, LightGBM and custom business rules to predict vehicle resale values.

Data preprocessing · LightGBM · XGBoost

0 likes · 11 min read

MaGe Linux Operations

Mar 1, 2019 · Artificial Intelligence

Master Python Data Mining & Machine Learning: From Preprocessing to Classification

This comprehensive guide introduces data mining and machine learning concepts, walks through Python data preprocessing techniques, reviews common classification algorithms, demonstrates an Iris flower classification case, and offers practical tips for selecting the most suitable algorithm for a given problem.

Classification Algorithms · Data preprocessing · Python

0 likes · 21 min read

360 Quality & Efficiency

Dec 21, 2018 · Artificial Intelligence

Machine Learning-Based Test Case Step Recommendation: Data Preprocessing, N‑gram, CBOW, and RNN/LSTM Model Construction

This article explains how to use machine‑learning techniques—including data preprocessing, N‑gram, CBOW, and various RNN/LSTM models—to automatically recommend the next function in a test‑case step sequence, improving writing speed and efficiency for developers.

CBOW · Data preprocessing · LSTM

0 likes · 4 min read

Qunar Tech Salon

Sep 19, 2018 · Artificial Intelligence

Logistic Regression Tutorial with scikit-learn

This article introduces logistic regression, explains its theoretical basis, details key scikit-learn parameters, and provides a complete Python example for breast cancer classification, covering data preprocessing, model training, prediction, and evaluation with classification reports.

Data preprocessing · Python · classification

0 likes · 7 min read

Architecture Digest

Jul 12, 2018 · Artificial Intelligence

How to Choose the Right Machine Learning Algorithm

This article explains that there is no universal solution for selecting machine learning algorithms and outlines practical factors—such as data characteristics, problem type, business constraints, and algorithm complexity—to help practitioners systematically narrow down and pick the most suitable models.

Data preprocessing · algorithm selection · machine learning

0 likes · 14 min read

Tencent Advertising Technology

Jun 12, 2018 · Artificial Intelligence

Insights on Data Preprocessing, Modeling, and Mindset from a Tencent Advertising Algorithm Competition Participant

A participant from Harbin Institute of Technology shares practical data‑preprocessing tricks, model choices, useful feature ideas, and a resilient mindset gained while competing in the Tencent Advertising Algorithm Contest, offering tips that can help other data scientists handle large‑scale ad data.

Data preprocessing · Mindset · competition

0 likes · 5 min read

MaGe Linux Operations

Apr 8, 2018 · Artificial Intelligence

Master Python Data Mining & Machine Learning: From Preprocessing to Classification

This comprehensive tutorial walks you through Python data mining and machine learning fundamentals, covering data preprocessing techniques, common classification algorithms, an Iris flower classification case study, and practical tips for selecting the right algorithm, all illustrated with clear code examples and visualizations.

Classification Algorithms · Data preprocessing · Naive Bayes

0 likes · 22 min read

Baobao Algorithm Notes

Mar 25, 2018 · Artificial Intelligence

How to Crush the Kaggle Toxic Comment Challenge: Data Prep, Models, and Ensemble Secrets

This article breaks down the Kaggle toxic comment classification competition, detailing thorough data cleaning, advanced word‑vector techniques, pseudo‑labeling, BPE tokenization, diverse neural models and ensemble strategies, and shares practical insights and pitfalls from the author's nine‑month competition journey.

BPE · Data preprocessing · Kaggle

0 likes · 9 min read

Architecture Digest

Feb 14, 2018 · Artificial Intelligence

Comparative Analysis and Optimization of Machine Learning Models on the UCI Census Income Dataset

This article walks through a complete machine‑learning workflow on the UCI Census Income dataset, covering data exploration, preprocessing (including log‑transformation and scaling), model training with Naïve Bayes, Decision Tree and SVM, performance evaluation, hyper‑parameter tuning via grid search, feature importance analysis, and feature selection, providing code snippets and visualizations.

Data preprocessing · Python · feature selection

0 likes · 24 min read

Tencent Advertising Technology

Jun 2, 2017 · Artificial Intelligence

Insights from the Tencent Social Advertising University Algorithm Competition – Week 3 Champion Team “Daodi Dui Bu Dui”

The week‑3 champion team from Peking University shares their experience in the Tencent social advertising algorithm contest, detailing data preparation, feature engineering, and model training strategies that helped them secure the top spot.

Data preprocessing · Tencent · XGBoost

0 likes · 5 min read

Architects' Tech Alliance

Dec 3, 2016 · Fundamentals

Effective Data Cleaning Practices and Tips

This article provides practical guidance on data cleaning, covering the importance of data wrangling, using assertions, handling incomplete records, checkpointing, testing on subsets, logging, optional raw data storage, and validating the cleaned dataset to ensure reliable downstream analysis.

Checkpoint · Data preprocessing · Logging

0 likes · 7 min read

Art of Distributed System Architecture Design

Oct 6, 2015 · Artificial Intelligence

Feature Engineering and PCA for Binary Classification in R

This article explains how feature engineering and principal component analysis (PCA) can be applied to a two‑feature binary classification problem in R, illustrating data exploration, model evaluation with ROC AUC, and the impact of dimensionality reduction on predictive performance.

Data preprocessing · PCA · R

0 likes · 9 min read

Qunar Tech Salon

Aug 14, 2015 · Big Data

The Nine Laws of Data Mining: Principles, Processes, and Insights

This article presents nine fundamental laws of data mining—covering goals, knowledge, preparation, experimentation, patterns, insight, prediction, value, and change—explaining how business objectives and domain expertise drive each stage of the CRISP‑DM process and why technical metrics alone cannot guarantee success.

CRISP-DM · Data preprocessing · Predictive Modeling

0 likes · 19 min read
