Tagged articles
466 articles
Python Crawling & Data Mining
Jul 31, 2018 · Big Data

Can Web‑Scraped Movie Reviews Predict Box Office? A Python Data‑Mining Case Study

Using Python to scrape more than ten thousand Maoyan comments on the comedy film “Hello Mr. Billionaire” (西虹市首富), this article demonstrates data cleaning, geographic heat‑maps, city‑wise rating analysis, word‑cloud generation, and a simple box‑office forecast based on a comparable movie, illustrating practical web‑scraping and data‑mining techniques.

Box Office Prediction · Movie Reviews · Python
0 likes · 10 min read

MaGe Linux Operations
Jul 28, 2018 · Backend Development

Master Web Scraping with Beautiful Soup: A Hands‑On Python Guide

This article introduces Beautiful Soup, a Python library for parsing HTML/XML into a navigable tree, covering installation, object initialization, tag and attribute access, tree traversal, searching techniques like find_all, find, CSS selectors, and practical code examples.

Data Extraction · Web Scraping · beautifulsoup
0 likes · 11 min read

MaGe Linux Operations
May 8, 2018 · Fundamentals

How to Scrape Lagou Python Job Data and Visualize Trends with Python

This tutorial demonstrates how to collect Python job postings from Lagou using Python's requests library, process the JSON response with pandas, and create insightful visualizations—including bar charts, word clouds, and geographic heatmaps—while handling anti‑scraping measures and data cleaning steps.

Data visualization · Lagou · Matplotlib
0 likes · 9 min read

MaGe Linux Operations
Apr 23, 2018 · Backend Development

Essential Python Libraries for Web Scraping and Data Processing

A comprehensive catalog of Python libraries covering network communication, web crawling frameworks, HTML/XML parsing, text manipulation, file format handling, natural language processing, browser automation, concurrency, cloud services, email processing, URL manipulation, multimedia extraction, WebSocket support, DNS resolution, computer vision, proxy servers, and other useful tools for developers.

Automation · Python · Web Scraping
0 likes · 16 min read

MaGe Linux Operations
Mar 13, 2018 · Backend Development

Crawl Zhihu’s “Beautiful Women” Images and Filter by AI Face Scores in Python

This guide explains how to collect images from Zhihu’s “美女” (“beautiful women”) topic using Python’s Requests and lxml, filter them with Baidu’s AipFace API based on gender, face presence, authenticity, and beauty score, and store the high‑quality results locally, including setup and optional customizations.

Baidu AI · Data Filtering · Face Detection
0 likes · 7 min read

MaGe Linux Operations
Mar 11, 2018 · Artificial Intelligence

Generate Tang Poetry with Python: Scraping, Processing, and Rhyme Creation

This tutorial explains how to build a Python program that crawls 71,000 Tang poems, extracts and tokenizes the text, analyzes word frequencies, and assembles new five‑character regulated verses with proper rhymes, including acrostic poems, while offering code snippets and future AI enhancements.

Poetry Generation · Python · Rhyme Detection
0 likes · 7 min read

Python Crawling & Data Mining
Jan 27, 2018 · Backend Development

How to Scrape Real-Time Weather Data with Python and BeautifulSoup

This guide demonstrates how to use Python's BeautifulSoup library to crawl the Green Breath website, extract real‑time weather and PM2.5 information, handle missing data with conditional checks, and display the results directly in the PyCharm console, providing a practical example of web‑scraping for environmental monitoring.

Air Quality · Python · Tutorial
0 likes · 3 min read

MaGe Linux Operations
Jan 27, 2018 · Backend Development

Scrape 2018 Chinese City Job Listings with Python and Visualize the Results

This tutorial shows how to use Python to crawl all Chinese city names from Zhaopin, retrieve the number of Android job postings for each city via HTTP GET requests, parse the results with regex, store them in a dictionary, and finally plot the data with Matplotlib for clear visual comparison.

Python · Web Scraping · job market analysis
0 likes · 6 min read

MaGe Linux Operations
Jan 11, 2018 · Backend Development

Master Python Web Scraping: From Basic Requests to Multithreaded Crawlers

This comprehensive guide walks you through Python web‑scraping techniques—including basic URL fetching, proxy usage, cookie and form handling, browser impersonation, gzip/deflate support, captcha processing, multithreading with thread pools and Twisted async I/O, plus practical tips on connection pooling, thread stack size, retries, timeouts and login automation—providing a solid foundation for building robust crawlers.

Gzip · Python · Web Scraping
0 likes · 17 min read

21CTO
Dec 15, 2017 · Backend Development

Master Web Scraping with Python: Regex, BeautifulSoup & Selenium

This guide demonstrates how to combine Python's regex, BeautifulSoup, and Selenium (including Chrome and headless PhantomJS) for powerful web scraping, covering tag matching, handling Ajax, iFrames, cookie management, and practical code examples for extracting and interacting with dynamic web content.

Headless Browser · Selenium · Web Scraping
0 likes · 10 min read

MaGe Linux Operations
Nov 14, 2017 · Backend Development

How to Use Scrapy to Crawl Zhihu Users and Analyze Their Data

This tutorial explains how a Python developer can set up a Scrapy project, write spiders to crawl Zhihu user profiles, store the results in a MySQL database, adjust settings for headers and delays, and finally perform simple gender and location analysis on the collected data.

Backend Development · Python · Scrapy
0 likes · 14 min read

21CTO
Oct 14, 2017 · Backend Development

How etlpy Simplifies Python Web Scraping and Data Cleaning in Under 500 Lines

etlpy is a lightweight Python framework that lets you define web‑crawling and data‑cleaning pipelines via XML, using generators for streaming, built‑in thread pools for parallelism, and a plug‑in architecture that handles everything from regex parsing to JSON conversion, all within a single 500‑line core file.

ETL · Generators · Web Scraping
0 likes · 14 min read

MaGe Linux Operations
Sep 16, 2017 · Backend Development

How to Scrape NetEase Cloud Music Hot Comments with Python

This tutorial walks through using Python and browser developer tools to locate, decode, and retrieve the hot comments from NetEase Cloud Music's hot song chart, including extracting song IDs, handling encrypted request parameters, and applying regex to gather song metadata.

API · NetEase Cloud Music · Python
0 likes · 9 min read

MaGe Linux Operations
Sep 9, 2017 · Backend Development

Build a Python Image Downloader: Step‑by‑Step Web Scraping Tutorial

This tutorial walks through building a Python web scraper that automatically downloads images from Baidu by analyzing requirements, inspecting page source, crafting regex patterns, and implementing the crawler with requests, offering step‑by‑step guidance, code snippets, and troubleshooting tips.

Python · Web Scraping · image-downloader
0 likes · 7 min read

MaGe Linux Operations
Sep 6, 2017 · Backend Development

How I Built a High‑Performance Novel Site Crawler with MongoDB

Inspired by a tutorial, I created a MongoDB‑backed crawler for the Yisou novel website, extracting category links, managing URL states across multiple processes, handling millions of pages, and finally deduplicating the results to obtain a clean collection of books.

MongoDB · Python · Web Scraping
0 likes · 3 min read

MaGe Linux Operations
Jul 29, 2017 · Backend Development

Build a Fast Python Web Scraper for Novel Rankings – Step by Step

This guide walks through building a Python web crawler to extract novel titles and URLs from the qu.la ranking page, explains the site’s clear HTML structure, shows how to deduplicate entries with a set, and provides complete code snippets plus performance tips and a Scrapy upgrade path.

Crawler · Python · Scrapy
0 likes · 5 min read

MaGe Linux Operations
Jul 10, 2017 · Backend Development

How to Build a Zhihu Crawler with Python, ELK, and Visual Analytics

This article walks through creating a Python-based Zhihu web crawler, detailing the tech stack, data collection, visualization of user demographics and top contributors, the crawler architecture, authorization handling, and suggestions for performance and storage improvements.

ELK · Web Scraping · zhihu
0 likes · 6 min read

MaGe Linux Operations
May 20, 2017 · Backend Development

5 Must‑Use Python Libraries to Supercharge Your Projects

This article introduces five highly practical Python packages—yagmail, requests, psutil, BeautifulSoup, and a collection of utility scripts—explaining how each simplifies common tasks such as sending emails, making HTTP calls, system monitoring, web scraping, and code reuse, complete with concise code examples.

Python · Web Scraping · libraries
0 likes · 14 min read

ITPUB
May 2, 2017 · Backend Development

How to Bypass Common Anti‑Scraping Measures with Scrapy

This guide explains why websites employ anti‑scraping defenses, outlines the most common header checks such as User‑Agent, Referer, and Cookies, and provides practical Scrapy code snippets for rotating user agents, managing proxies, handling X‑Forwarded‑For, limiting request rates, and dealing with dynamic AJAX content using Selenium or PhantomJS.

Headers · Proxy · Scrapy
0 likes · 7 min read

Huawei Cloud Developer Alliance
May 28, 2016 · Backend Development

Extract All @Mentions from a Zhihu Page with Simple Scripts

This guide shows how to collect every @mentioned user on a Zhihu question page by using a JavaScript bookmarklet or a Python script, explains the extraction process, provides the necessary code snippets, and discusses why following programmers on Zhihu may not be the most effective learning method.

JavaScript · Python · Web Scraping
0 likes · 6 min read

21CTO
Apr 12, 2016 · Backend Development

How to Build a PHP cURL Spider to Scrape Zhihu User Data and Visualize It

This article walks through using PHP's cURL extension to crawl tens of thousands of Zhihu user profiles, parse the HTML with regular expressions, store the extracted data efficiently, and present the results with responsive charts and dashboards.

Web Scraping · cURL · mysql
0 likes · 9 min read

21CTO
Mar 22, 2016 · Information Security

How to Outsmart AI-Powered Web Scrapers: Two Powerful Anti‑Crawling Tricks

Web crawlers, especially AI‑driven ones, threaten site performance and data ownership, so this article reviews common anti‑scraping methods—from IP and header analysis to behavior detection—and reveals two unconventional defenses: data poisoning and a deposit‑based access model that penalize malicious bots.

AI · Data Protection · Web Scraping
0 likes · 5 min read

ITPUB
Mar 21, 2016 · Backend Development

How to Bypass Common Anti‑Scraping Measures: Headers, Behavior, and Dynamic Pages

This guide outlines the main anti‑scraping techniques used by websites—including header validation, user‑behavior monitoring, and dynamic content loading—and provides practical methods such as header spoofing, IP proxy rotation, request throttling, and Selenium/PhantomJS automation to overcome them.

Headers · PhantomJS · Selenium
0 likes · 6 min read

Qunar Tech Salon
Jan 29, 2016 · Big Data

Python Data Analysis Learning Roadmap (16‑Week Plan)

This article presents a 16‑week Python data‑analysis learning roadmap covering environment setup, basic syntax, web‑scraping techniques, data‑analysis libraries such as pandas and NumPy, and data‑visualization with matplotlib, along with curated free resources and tutorials for each stage.

NumPy · Roadmap · Web Scraping
0 likes · 6 min read

21CTO
Jan 26, 2016 · Backend Development

How to Bypass Common Anti‑Scraping Measures: Headers, Behavior, and Dynamic Pages

This article summarizes common anti‑scraping techniques—including header checks, user‑behavior detection, and dynamic page defenses—and provides practical ways to circumvent them using custom headers, IP proxies, request timing, and tools like Selenium with PhantomJS to simulate real browsers.

Headers · Proxy · Selenium
0 likes · 6 min read

ITPUB
Dec 17, 2015 · Backend Development

Build a Simple Python Image Scraper on macOS – Step‑by‑Step Guide

This tutorial walks you through setting up a macOS environment, inspecting a web page, and writing a Python script with the requests library to locate and download all images from a target site, complete with code explanations and execution tips.

Python · Tutorial · Web Scraping
0 likes · 7 min read

21CTO
Nov 13, 2015 · Backend Development

Essential Python Libraries for Web Scraping and Data Processing

Discover a comprehensive collection of Python libraries covering network requests, web crawling frameworks, HTML/XML parsing, text manipulation, file format handling, natural language processing, browser automation, asynchronous programming, and more, providing developers with essential tools for efficient web scraping and data processing tasks.

Python · Web Scraping · data-processing
0 likes · 18 min read

21CTO
Oct 1, 2015 · Backend Development

How to Scrape 1.1 Million Zhihu Users with PHP cURL, Multi‑Threading, and Redis

This tutorial walks through collecting over a million Zhihu user profiles using PHP on Ubuntu, handling cookies, bypassing image hot‑link protection, scaling requests with curl_multi, de‑duplicating MySQL inserts, and coordinating work with Redis and multi‑process pcntl for efficient large‑scale web scraping.

Linux · Multi‑processing · PHP
0 likes · 15 min read

MaGe Linux Operations
Jul 1, 2014 · Backend Development

Master Python Web Scraping: Proxies, Login, Multithreading, and Captcha Hacks

This guide walks through practical Python web‑scraping techniques using urllib2, covering basic page fetching, proxy usage, cookie handling for logins, form submission, header spoofing, anti‑hotlink tricks, multithreaded crawling, and strategies for bypassing simple captchas, all illustrated with code snippets.

Captcha · Proxy · Web Scraping
0 likes · 7 min read