
How etlpy Simplifies Python Web Scraping and Data Cleaning in Under 500 Lines

etlpy is a lightweight Python framework that lets you define web‑crawling and data‑cleaning pipelines via XML, using generators for streaming, built‑in thread pools for parallelism, and a plug‑in architecture that handles everything from regex parsing to JSON conversion, all within a single 500‑line core file.

21CTO

Oct 14, 2017


etlpy is a web-scraping and data-cleaning tool written in Python; its core file, etl.py, is under 500 lines. Crawling and cleaning logic is declared in XML, so users do not have to hand-write scraping code. Its main features:

XML‑based crawler and cleaning logic eliminates hand‑coding.

Generator-based streaming keeps memory use low, because records are processed one at a time instead of being loaded all at once.

Built‑in thread pool supports both serial and parallel execution.

Integrated regex parsing, HTML unescaping, JSON conversion and other cleaning functions output ready‑to‑use files.

Plug‑in design makes adding new file formats or databases straightforward.

Works with most websites and can fill in cookies automatically.

GitHub address: https://github.com/ferventdesert/etlpy

Running requires Python 3 and lxml (install via pip3 install lxml). The provided project.xml contains example configurations for Lianjia and Dazhong Dianping.

How to Use

Extracting and processing data from web pages or files usually means dealing with awkward details such as character encodings, malformed HTML, and asynchronous AJAX requests. etlpy handles these issues automatically.

Usage is simple: load a project, obtain a generator, and request the desired number of items. The following code fetches all food listings from Dazhong Dianping in Shanghai (about 160 000 records, 30 MB) within 20 minutes:

import etl
etl.LoadProject('project.xml')
tool = etl.modules['大众点评门店']
datas = tool.QueryDatas()
for r in datas:
    print(r)

Sample output (truncated):

{'区域': '川沙', '标题': '胖哥俩肉蟹煲(川沙店)', '地址': '川沙镇川沙路5558弄绿地广场三号楼', '环境': '9.0', '口味': '9.1', '星级': '五星商户', '点评': '2205', '均价': 67}
{'区域': '金杨地区', '标题': '上海小南国(金桥店)', '地址': '张杨路3611弄金桥国际商业广场6座2楼', '环境': '8.8', '口味': '8.6', '星级': '准五星商户', '点评': '1973', '均价': 190}
...
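Because the result is a generator, records are produced only as they are requested, so you can stop after a fixed number of items. The following is a minimal sketch of that idea, not etlpy code: `fake_query_datas` is a hypothetical stand-in for `tool.QueryDatas()` so the example runs without project.xml.

```python
from itertools import islice

def fake_query_datas():
    """Hypothetical stand-in for tool.QueryDatas(): yields dicts lazily."""
    n = 0
    while True:  # endless stream; nothing is computed until requested
        n += 1
        yield {'id': n, 'title': 'shop-%d' % n}

# Take only the first 3 records; the generator is never exhausted.
first_three = list(islice(fake_query_datas(), 3))
for record in first_three:
    print(record)
```

With the real project loaded, the same `islice` call could presumably wrap the generator returned by `tool.QueryDatas()`.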

For faster results you can enable parallel execution:

tool.mThreadExecute(threadcount=20, execute=False, callback=lambda d: print(d))

Principles

C# Version

The C# version assembles a LINQ query dynamically. The data stream is an IEnumerable<IFreeDocument>, and each module appends one LINQ operator to it, e.g.:

result = source.Take(mount)
               .Where(d => module0.func(d))
               .Select(d => Module1.func(d))
               .Select(d => Module2.func(d)) …

Python Version

Python generators play the same role as LINQ streams. etlpy extends generators to support chaining, parallelism, and Cartesian products.

def Append(a, b):
    for r in a:
        yield r
    for r in b:
        yield r

def Cross(a, genefunc, tool):
    for r1 in a:
        for r2 in genefunc(tool, r1):
            for key in r1:
                r2[key] = r1[key]
            yield r2
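To see what these two helpers produce, here is a small usage sketch. `Append` and `Cross` are repeated from above so the block runs on its own; the `pages` child generator and the sample rows are made up for illustration.

```python
def Append(a, b):
    # Repeated from the article: yield everything from a, then from b.
    for r in a:
        yield r
    for r in b:
        yield r

def Cross(a, genefunc, tool):
    # Repeated from the article: pair each parent row with its child rows.
    for r1 in a:
        for r2 in genefunc(tool, r1):
            for key in r1:
                r2[key] = r1[key]
            yield r2

def pages(tool, parent):
    """Hypothetical child generator: two pages for every parent row."""
    for page in (1, 2):
        yield {'page': page}

# Two single-row streams are chained, then crossed with the child generator.
districts = Append([{'district': 'A'}], [{'district': 'B'}])
rows = list(Cross(districts, pages, None))
print(rows)
```

Each child row receives a copy of the parent's keys, so two districts crossed with two pages give four rows. If a parent and a child share a key, the parent's value wins, because the parent keys are copied last.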

The central generator builder works as follows:

def __generate__(self, tools, generator=None, execute=False):
    for tool in tools:
        if tool.Group == 'Generator':
            if generator is None:
                generator = tool.Func(tool, None)
        elif tool.Group == 'Transformer':
            generator = transform(tool, generator)
        elif tool.Group == 'Filter':
            generator = filter(tool, generator)
        elif tool.Group == 'Executor' and execute:
            generator = tool.Func(tool, generator)
    return generator

Modules are categorized into four types:

Generator (GE): produces a stream of dictionaries, e.g., one for each number from 1 to 100.

Transformer (TF): modifies fields, e.g., extracts numbers from an address.

Filter (FT): removes dictionaries with empty values.

Executor (EX): persists dictionaries, e.g., stores them in MongoDB.
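The generator builder and the four module types can be tried out together in a self-contained sketch. Everything here is an assumption made for illustration: the `Tool` class, the `transform` and `keep` helpers, and the sample modules are not etlpy's own implementations; only the dispatch on `Group` follows the code shown above.

```python
class Tool:
    """Hypothetical minimal module: a group name plus a function."""
    def __init__(self, group, func):
        self.Group = group
        self.Func = func

def transform(tool, generator):
    # Transformer: apply the function to every dictionary (like LINQ Select).
    for data in generator:
        tool.Func(tool, data)
        yield data

def keep(tool, generator):
    # Filter: pass on only the dictionaries the function accepts (like LINQ Where).
    for data in generator:
        if tool.Func(tool, data):
            yield data

def generate(tools, generator=None, execute=False):
    # Same dispatch shape as __generate__ in the article.
    for tool in tools:
        if tool.Group == 'Generator':
            if generator is None:
                generator = tool.Func(tool, None)
        elif tool.Group == 'Transformer':
            generator = transform(tool, generator)
        elif tool.Group == 'Filter':
            generator = keep(tool, generator)
        elif tool.Group == 'Executor' and execute:
            generator = tool.Func(tool, generator)
    return generator

def make_rows(tool, _):
    for text in [' Pudong 12 ', '', ' Xuhui 7 ']:
        yield {'address': text}

def strip_address(tool, data):
    data['address'] = data['address'].strip()

def has_address(tool, data):
    return bool(data['address'])

saved = []

def save(tool, generator):
    # Executor: persist each row (here, into a list) and pass it on.
    for data in generator:
        saved.append(data)
        yield data

tools = [Tool('Generator', make_rows), Tool('Transformer', strip_address),
         Tool('Filter', has_address), Tool('Executor', save)]
result = list(generate(tools, execute=True))
print(result)
```

Nothing runs until `list(...)` pulls rows through the chain, which is what lets a pipeline like this stream data without holding all of it in memory.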

Defining a simple trim function illustrates how a Python function becomes a module:

def TrimTF(etl, data):
    return data.strip()

At runtime, etlpy reads the XML configuration and wraps the function in a dynamically created class; instances of that class then take part in the stream.
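One way such runtime wrapping could work is Python's built-in `type()`, which creates a class from a name, a tuple of bases, and an attribute dictionary. The sketch below is an assumption about the general technique, not etlpy's actual internals: the `Transformer` base class, its `Column` attribute, and `make_module` are hypothetical names.

```python
def TrimTF(etl, data):
    # From the article: strip whitespace from one value.
    return data.strip()

class Transformer:
    """Hypothetical base class: applies Func to one column of each row."""
    Group = 'Transformer'
    Column = ''

    def process(self, rows):
        for row in rows:
            row[self.Column] = self.Func(self, row[self.Column])
            yield row

def make_module(func, column):
    # Build a new class at runtime, named after the wrapped function.
    return type(func.__name__, (Transformer,),
                {'Func': staticmethod(func), 'Column': column})

TrimClass = make_module(TrimTF, 'title')
module = TrimClass()
cleaned = list(module.process([{'title': '  Little Nanguo  '}]))
print(TrimClass.__name__, cleaned)
```

Storing the function as a `staticmethod` keeps its two-argument signature intact, so the module instance can be passed explicitly as the first argument.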

Data extraction relies on XPath expressions, which can be generated automatically, rather than on regular expressions. In effect, etlpy acts as a domain-specific language (DSL) for scraping, which sharply reduces the effort of building a web-scraping pipeline.
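As a rough illustration of path-based extraction, the sketch below pulls fields out of a small, well-formed fragment. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, so that it runs without extra packages; etlpy itself depends on lxml, and real-world HTML would generally need a more forgiving parser. The markup and the paths are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup.
page = """
<ul>
  <li class="shop"><a href="/shop/1">Shop One</a><span class="price">67</span></li>
  <li class="shop"><a href="/shop/2">Shop Two</a><span class="price">190</span></li>
</ul>
""".strip()

root = ET.fromstring(page)
shops = []
# Select every <li class="shop"> anywhere below the root.
for li in root.findall(".//li[@class='shop']"):
    link = li.find('a')
    shops.append({
        'title': link.text,
        'link': link.get('href'),
        'price': int(li.find("span[@class='price']").text),
    })
print(shops)
```

The path names the element by its position and attributes in the tree, so it does not depend on the exact text layout the way a regular expression would.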

Optimization and Details

1. Switching Cities

The city ID is a parameter at the head of each sub-stream. When the sub-streams take it from the main stream instead of hard-coding it, a single change in the main stream propagates to every child stream, so you do not have to edit several generators.

2. Parallel Optimization

Parallelism should begin as close to the source of the stream as possible. If the head of the stream yields only a single element, there is nothing to distribute, so split the stream into two stages: a serial stage that expands the head into many elements, and a parallel stage that processes them. For Dazhong Dianping, combining 14 districts with 30 food categories yields 420 elements; after the split, these elements can be processed in parallel, which greatly speeds up the crawl.
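The arithmetic and the two-stage idea can be sketched with the standard library. This is an illustration under assumptions, not etlpy's thread pool: the district and category names are placeholders, and `crawl` is a hypothetical worker that stands in for fetching and parsing one listing page.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

districts = ['district-%d' % i for i in range(14)]
categories = ['category-%d' % i for i in range(30)]

# Stage 1 (serial): expand the head into 14 * 30 = 420 independent seeds.
seeds = [{'district': d, 'category': c}
         for d, c in product(districts, categories)]

def crawl(seed):
    """Hypothetical worker: stands in for fetching and parsing one listing."""
    return '%s/%s' % (seed['district'], seed['category'])

# Stage 2 (parallel): the seeds do not depend on each other,
# so a pool of 20 threads can share the work.
with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(crawl, seeds))

print(len(seeds), results[0])
```

Threads suit this workload because a crawler spends most of its time waiting on network responses rather than computing.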

3. Parameter Notes

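OneInput=True: the function receives a single value taken from the dictionary, rather than the whole dictionary.

OneOutput=True: the function returns a single value. When it is False, the function may output multiple values, which it does by modifying the dictionary in place.

IsMultiYield=True: the function returns a generator.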
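How flags like these might change the way a module function is called can be sketched as follows. The `apply_module` dispatcher is hypothetical and covers only `OneInput` and `IsMultiYield`; it is a guess at the general shape, not a copy of etlpy's code.

```python
def apply_module(func, rows, column, OneInput=True, IsMultiYield=False):
    """Hypothetical dispatcher showing how the flags could change the call."""
    for row in rows:
        # OneInput: pass just one value instead of the whole dictionary.
        arg = row[column] if OneInput else row
        if IsMultiYield:
            # The function returns a generator: emit one row per item.
            for item in func(None, arg):
                yield item
        else:
            # The function returns one value: write it back to the column.
            row[column] = func(None, arg)
            yield row

def upper(etl, value):
    return value.upper()

def split_words(etl, row):
    for word in row['text'].split():
        yield {'word': word}

a = list(apply_module(upper, [{'text': 'a b'}], 'text'))
b = list(apply_module(split_words, [{'text': 'a b'}], 'text',
                      OneInput=False, IsMultiYield=True))
print(a, b)
```

Keeping these decisions in one dispatcher means a module function such as `TrimTF` can stay a plain function that knows nothing about rows or streams.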

Future Outlook

XML works well as a project configuration format because it is easy to read from any language, but it is verbose to edit by hand. A dedicated data-cleaning DSL or a visual programming tool would make building pipelines far more efficient.

Source: http://www.cnblogs.com/buptzym/p/5320552.html

