Tagged articles
127 articles
Python Crawling & Data Mining
Dec 2, 2020 · Artificial Intelligence

Scrape, Clean, and Visualize Tencent Video Comments with Python – A Full Guide

This article walks through using Python to crawl Tencent Video's "Offer" season 2 comments, merge and clean the CSV data, perform exploratory analysis, generate visualizations and word clouds, and apply Baidu's open‑source NLP model for sentiment scoring, providing complete code snippets for each step.

Python · Sentiment Analysis · Web Scraping
0 likes · 16 min read
Python Programming Learning Circle
Nov 10, 2020 · Backend Development

Detailed Job Description Extraction and Data Cleaning with Python and MongoDB

This article explains how to scrape detailed job description and address information from online job portals, use Python libraries such as requests, BeautifulSoup4, and pymongo for crawling, and then clean and normalize the collected data including publish dates, salaries, and work‑experience levels before storing it in MongoDB.

MongoDB · Python · beautifulsoup
0 likes · 8 min read
Liangxu Linux
Oct 20, 2020 · Operations

Mastering AWK: Quick Commands to Clean, Filter, and Analyze Text Files

This guide demonstrates how to use AWK’s sub, gsub, and pattern‑matching features to delete blank lines, trim whitespace, reverse file order, add line numbers, count or filter specific strings, replace text, and combine AWK with other Linux utilities for everyday text‑processing tasks.

Shell scripting · awk · data cleaning
0 likes · 7 min read
Efficient Ops
Jul 13, 2020 · Operations

What 13,966 Ops Job Listings Reveal About Salary, Skills, and Hot Cities

This article analyzes 13,966 Chinese operations‑engineer job postings scraped from 51job, cleaning the data with Python and Pandas, then visualizing industry demand, city concentration, salary ranges, education requirements, company size distribution, and keyword trends to guide job seekers and recruiters.

Data visualization · Operations · Python
0 likes · 14 min read
DataFunTalk
Jul 3, 2020 · Artificial Intelligence

Confident Learning: Detecting and Cleaning Noisy Labels with cleanlab

This article introduces confident learning, a principled framework for identifying and correcting mislabeled data in machine‑learning datasets, explains its three‑step process (count, clean, and retrain), demonstrates the open‑source cleanlab library with code examples, and presents experimental results showing its effectiveness on benchmarks such as CIFAR‑10 and ImageNet.

cleanlab · confident learning · data cleaning
0 likes · 13 min read
Laravel Tech Community
Jun 5, 2020 · Backend Development

Using PHP trim() to Remove Whitespace and Specified Characters

This article explains PHP's trim() function, its parameters and default character list, and provides multiple code examples showing how to trim strings, binary data, and array elements, including custom functions and array_walk usage, with the resulting outputs displayed.

.trim · data cleaning · php-functions
0 likes · 3 min read
Python Programming Learning Circle
May 14, 2020 · Fundamentals

Data Cleaning and Preprocessing for HR Attrition Dataset Using Pandas

This tutorial demonstrates how to download, read, explore, visualize, and preprocess the HR attrition dataset with pandas, covering tasks such as duplicate removal, missing‑value handling, categorical encoding, normalization, and conditional column updates to prepare the data for machine‑learning modeling.

HR dataset · data cleaning · machine learning preprocessing
0 likes · 9 min read
Python Crawling & Data Mining
Apr 7, 2020 · Fundamentals

Master 50 Essential Pandas Exercises to Boost Your Data Skills

This article presents a comprehensive collection of 50 pandas practice problems that guide you through creating Series and DataFrames, performing basic and advanced indexing, grouping, aggregation, data cleaning, hierarchical indexing, and visualisation, each illustrated with clear Python code examples.

data cleaning · dataframe · series
0 likes · 19 min read
Aikesheng Open Source Community
Feb 10, 2020 · Databases

Handling Duplicate Data in MySQL: Techniques and Examples

This article explains how to identify and remove various kinds of duplicate data in MySQL—including fully duplicated rows, records with duplicate non‑key columns, and unwanted whitespace inside fields—by using SQL statements, table cloning, OS utilities, and regular‑expression updates, with performance measurements for each method.

MySQL · data cleaning · data deduplication
0 likes · 13 min read
MaGe Linux Operations
Nov 2, 2019 · Fundamentals

Master Pandas: Essential Data Reading, Cleaning, and Merging Techniques

This article introduces essential Pandas techniques for data import, cleaning, type conversion, and merging, providing clear code examples that demonstrate reading from MySQL, handling missing values, transforming columns, and combining multiple DataFrames for comprehensive data analysis.

Python · data cleaning · data merging
0 likes · 6 min read
Python Crawling & Data Mining
Oct 9, 2019 · Fundamentals

Master Pandas Basics: From DataFrames to Quick Data Insights

This tutorial introduces Pandas fundamentals, covering installation, DataFrame creation, reading and storing CSV/Excel files, quick data inspection, column manipulation, handling different data types, and basic time series operations, providing a concise roadmap for beginners to start data analysis with Python.

data cleaning · dataframe
0 likes · 13 min read
Alibaba Cloud Developer
Sep 12, 2019 · Artificial Intelligence

How a Simple Learning‑Rate Trick Detects 90% of Noisy Labels in Image Data

Training deep neural networks on large‑scale, weakly labeled image data suffers from noisy annotations that degrade performance. A simple algorithm that adjusts the learning rate during training can automatically identify up to 90% of noisy samples, improving dataset cleanliness and model accuracy without manual intervention.

Deep Learning · Image Classification · data cleaning
0 likes · 15 min read
Tencent Advertising Technology
May 8, 2019 · Artificial Intelligence

Experience Sharing of the 2019 Tencent Advertising Algorithm Competition – Week 1 Champion’s Insights

The week‑1 champion of the 2019 Tencent Advertising Algorithm Competition shares practical experience on data cleaning, feature engineering, model selection (including LightGBM and deep learning), validation strategies, and tips for handling massive ad exposure logs to achieve strong SMAPE and monotonicity scores.

Advertising · LightGBM · algorithm competition
0 likes · 6 min read
DataFunTalk
Aug 21, 2018 · Artificial Intelligence

iQIYI Traffic Anti-Cheat: Techniques, System Architecture, and Future Directions

This article provides a comprehensive overview of iQIYI's traffic anti‑cheat mechanisms, covering definitions of fraudulent traffic, industry challenges, the relationship between anti‑cheat and data cleaning, system design, rule‑based and machine‑learning solutions, feature engineering, model evaluation, monitoring, service applications, and future prospects.

Big Data · System Architecture · Traffic analysis
0 likes · 11 min read
Qunar Tech Salon
Jul 27, 2018 · Information Security

Design and Features of an Anti‑Crawling Platform for Large‑Scale Services

The article describes the goals, architecture, core functions, and key characteristics of a comprehensive anti‑crawling platform that systematizes strategy management, data cleaning, monitoring, and rapid response to protect APIs and improve data reliability for large‑scale online services.

Backend · anti‑crawling · data cleaning
0 likes · 10 min read
MaGe Linux Operations
Apr 26, 2018 · Databases

Find Duplicate Rows in MySQL: Simple Queries for Beginners

This article shows how to identify and remove duplicate rows in a MySQL table by defining duplication, using GROUP BY with HAVING, creating temporary tables with MIN, and applying various techniques—including UNION, nested subqueries, and joins—to handle single‑column and multi‑column duplicate detection.

GROUP BY · HAVING · MySQL
0 likes · 11 min read
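
For readers who want a feel for the GROUP BY / HAVING approach this summary describes, here is a minimal sketch. It is not taken from the article: the `users` table, its `email` column, and the sample rows are hypothetical, and it runs against an in-memory SQLite database through Python's standard `sqlite3` module rather than MySQL, so that it stays self-contained.

```python
import sqlite3

# Hypothetical table and rows, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [
        ("a@example.com",),
        ("b@example.com",),
        ("a@example.com",),
        ("c@example.com",),
        ("a@example.com",),
    ],
)

# Step 1: group by the column that defines "duplicate" and keep only
# the groups that contain more than one row.
duplicates = conn.execute(
    "SELECT email, COUNT(*) AS n FROM users "
    "GROUP BY email HAVING COUNT(*) > 1"
).fetchall()

# Step 2: keep the row with the smallest id in each group, delete the rest.
conn.execute(
    "DELETE FROM users "
    "WHERE id NOT IN (SELECT MIN(id) FROM users GROUP BY email)"
)
remaining = conn.execute("SELECT id, email FROM users ORDER BY id").fetchall()

print(duplicates)
print(remaining)
```

The single-statement DELETE works in SQLite. MySQL typically rejects a DELETE whose subquery reads the table being modified, so there the MIN(id) values would usually be staged in a temporary table first, which may be why the summary mentions temporary tables with MIN.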
21CTO
Nov 6, 2017 · Artificial Intelligence

Predict Retail Sales Without Coding: A Complete KNIME Tutorial

This step‑by‑step guide shows beginners how to use the GUI‑driven KNIME platform to import, clean, visualize, and model the BigMart sales dataset, enabling accurate retail sales predictions without writing any code.

KNIME · No-code · data cleaning
0 likes · 12 min read
21CTO
Oct 14, 2017 · Backend Development

How etlpy Simplifies Python Web Scraping and Data Cleaning in Under 500 Lines

etlpy is a lightweight Python framework that lets you define web‑crawling and data‑cleaning pipelines via XML, using generators for streaming, built‑in thread pools for parallelism, and a plug‑in architecture that handles everything from regex parsing to JSON conversion, all within a single 500‑line core file.

ETL · Generators · Web Scraping
0 likes · 14 min read
MaGe Linux Operations
Aug 18, 2017 · Backend Development

How to Scrape Douban Movie Reviews and Visualize Them with a Word Cloud in Python

This tutorial walks through using Python 3.5 to fetch the latest movies from Douban, extract their IDs and titles, crawl user comments, clean the text with regular expressions, segment Chinese words using Jieba, remove stopwords, compute word frequencies, and finally generate a word‑cloud visualization of the reviews.

Douban · data cleaning · web-scraping
0 likes · 13 min read
Aotu Lab
Dec 6, 2016 · Frontend Development

How XCEL Turns Excel Data Cleaning into a High‑Performance Cross‑Platform App

XCEL is a cross‑platform Excel data‑cleaning tool built with Electron and Vue that visualizes filtering, leverages multi‑process architecture, optimizes rendering and memory usage, and provides detailed implementation steps, performance tricks, and code examples for developers.

Excel · Performance Optimization · data cleaning
0 likes · 25 min read
Architects' Tech Alliance
Dec 3, 2016 · Fundamentals

Effective Data Cleaning Practices and Tips

This article provides practical guidance on data cleaning, covering the importance of data wrangling, using assertions, handling incomplete records, checkpointing, testing on subsets, logging, optional raw data storage, and validating the cleaned dataset to ensure reliable downstream analysis.

Checkpoint · assertions · data cleaning
0 likes · 7 min read
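
Several of the practices named in this summary (setting incomplete records aside, logging, and asserting properties of the cleaned result) can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the article's code: the record layout, the `clean` helper, and the age bounds are all assumptions made for the example.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("cleaning")

# Hypothetical raw records, used only for illustration.
raw = [
    {"name": " Alice ", "age": "34"},
    {"name": "Bob", "age": ""},  # incomplete: age is missing
    {"name": "Carol", "age": "29"},
]


def clean(records):
    """Return (cleaned, skipped); incomplete records are set aside, not dropped silently."""
    cleaned, skipped = [], []
    for rec in records:
        name = rec.get("name", "").strip()
        age = rec.get("age", "").strip()
        if not name or not age:
            skipped.append(rec)
            continue
        cleaned.append({"name": name, "age": int(age)})
    # Log the counts so each run leaves a trace of what was kept and skipped.
    log.info("kept %d records, skipped %d", len(cleaned), len(skipped))
    return cleaned, skipped


cleaned, skipped = clean(raw)

# Validate the cleaned data before anything downstream consumes it.
assert len(cleaned) + len(skipped) == len(raw), "records were lost"
assert all(0 < r["age"] < 130 for r in cleaned), "age out of range"
```

Returning the skipped records alongside the cleaned ones keeps the raw input recoverable, and the assertions fail loudly if a later change to `clean` starts losing or corrupting rows.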