How WeChat’s WeAnalysis Powers Scalable User Segmentation with Big Data Architecture

This article explains the design and implementation of the user‑profile system in WeChat's WeAnalysis. It covers the basic tag and user‑group modules, multi‑source data ingestion, ETL processing, storage choices such as TDSQL and ClickHouse, bitmap handling, query performance, and the service APIs that support flexible, high‑performance user segmentation.

21CTO

Background

WeAnalysis is the official data‑analysis platform for WeChat Mini‑Programs. Its user‑profile insight module provides basic tag analysis and customizable user groups to meet diverse analytical needs.

System Design Goals

Usability: Zero learning curve for merchants; ready to use out of the box.

Stability: Reliable tag data, timely generation of user‑group packages, and fast query response.

Completeness: Rich tags, flexible rules, and comprehensive functionality covering preset tags, user‑group tags, platform behavior, and custom reported data.

Overall Architecture

The system is divided into two main modules: the basic tag module and the user‑group module. Data flows from multiple sources (user attributes, group tags, platform behavior, custom reports) through ETL and pre‑computation, then into offline storage (TDW/HDFS) and finally into online stores (TDSQL for pre‑computed results, ClickHouse for detailed behavior).

Data Sources

Four sources feed the system: user attributes (e.g., gender, region), group tags (active, churned), platform behavior (visits, shares, transactions), and custom reported events uploaded by merchants.

Processing Pipeline

Data Ingestion & ETL: Raw data is cleaned, encoded, and aggregated. String dimensions are converted to integer IDs to reduce storage and speed up queries.

Tag Encoding & Storage: Tags are stored in vertical tables, and each tag value is assigned a unique code. Bitmap (RoaringBitmap) structures hold the mapping from each tag value to the set of users that carry it.

Online Storage: Pre‑computed results are written to TDSQL; detailed behavior and bitmap data are stored in ClickHouse and queried with its bitmap aggregate functions such as groupBitmap.

Data Import: Spark jobs build the bitmaps, serialize them to Base64 strings, and load them into ClickHouse tables, where a materialized column turns each string back into a bitmap.
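
The encoding and import steps of the pipeline can be illustrated with a small, self-contained sketch. It is not the WeAnalysis implementation: the sample rows, the `tag=value` key format, and the helper names are hypothetical, a plain Python `set` stands in for a RoaringBitmap, and the byte layout passed to Base64 is a toy format rather than the one ClickHouse expects.

```python
import base64
import struct

def build_dictionary(values):
    """Assign an integer ID to each distinct string (dictionary encoding)."""
    return {value: idx for idx, value in enumerate(sorted(set(values)), start=1)}

# Hypothetical vertical tag table: one row per (user_id, tag, value).
rows = [
    (101, "gender", "female"),
    (102, "gender", "male"),
    (103, "gender", "female"),
    (101, "region", "Guangdong"),
    (103, "region", "Beijing"),
]

# One integer code per distinct (tag, value) pair.
tag_codes = build_dictionary(f"{tag}={value}" for _, tag, value in rows)

# Tag code -> set of user IDs. The set is a stand-in for a RoaringBitmap.
bitmaps = {}
for user_id, tag, value in rows:
    bitmaps.setdefault(tag_codes[f"{tag}={value}"], set()).add(user_id)

def to_base64(user_ids):
    """Pack sorted 32-bit user IDs and Base64-encode them (toy format)."""
    ids = sorted(user_ids)
    return base64.b64encode(struct.pack(f"<{len(ids)}I", *ids)).decode("ascii")

def from_base64(text):
    """Inverse of to_base64: decode the string back into a set of user IDs."""
    raw = base64.b64decode(text)
    return set(struct.unpack(f"<{len(raw) // 4}I", raw))

# Round trip for one tag value, as an import job followed by a decode would do.
encoded = to_base64(bitmaps[tag_codes["gender=female"]])
decoded = from_base64(encoded)
```

Keying the bitmaps by integer code rather than by string keeps the online table narrow, which is the same motivation the ETL step gives for converting string dimensions to IDs.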

Storage Choices

TDSQL provides up to 192 TB per instance with fast bulk import (≈40 min for 100 M rows). ClickHouse, combined with RoaringBitmap, offers efficient bitmap operations, high compression, and sub‑second query latency for large user groups.

Service Layer

Online services expose the user‑profile APIs via the svrk‑javamesh RPC framework, with a middleware layer handling traffic control, async calls, monitoring, and parameter validation.
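
As a rough illustration of what such a middleware layer does, the sketch below wraps a handler with parameter validation and a token-bucket rate limiter. It does not use or imitate the svrk‑javamesh API; `TokenBucket`, `middleware`, `query_tag_size`, the parameter names, and the response codes are all hypothetical.

```python
import time

class TokenBucket:
    """Simple token bucket: `rate` tokens per second, at most `capacity` stored."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill according to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def middleware(bucket, required):
    """Wrap a handler with parameter validation, then traffic control."""
    def wrap(handler):
        def call(params):
            missing = [key for key in required if key not in params]
            if missing:
                return {"code": 400, "error": f"missing parameters: {missing}"}
            if not bucket.allow():
                return {"code": 429, "error": "rate limit exceeded"}
            return {"code": 0, "data": handler(params)}
        return call
    return wrap

bucket = TokenBucket(rate=100, capacity=200)

@middleware(bucket, required=("app_id", "tag_code"))
def query_tag_size(params):
    # Placeholder for a real lookup against the online store.
    return {"app_id": params["app_id"], "tag_code": params["tag_code"], "users": 0}
```

Validation runs before the rate limiter in this sketch, so malformed requests are rejected without consuming a token.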

Query Performance

Local‑node execution and hash‑based sharding ensure queries run on a single machine, avoiding distributed joins. Numeric ID encoding yields >2× speedup over string‑based queries. Benchmarks show up to 5 × 10⁴ QPS for typical queries, with sampling used for very large apps to keep latency acceptable.
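
The effect of hash-based sharding on query routing can be sketched as follows. The article does not name the shard key, so this example assumes rows are sharded by a Mini‑Program identifier (`app_id`); the cluster size, the CRC32 hash, and the class and method names are likewise illustrative assumptions.

```python
import zlib

NUM_SHARDS = 4  # assumed cluster size

def shard_of(app_id):
    """Deterministically map an app to one shard (CRC32 is an illustrative choice)."""
    return zlib.crc32(app_id.encode("utf-8")) % NUM_SHARDS

class Cluster:
    def __init__(self):
        # Each shard maps app_id -> list of (user_id, action) rows.
        self.shards = [{} for _ in range(NUM_SHARDS)]

    def insert(self, app_id, user_id, action):
        self.shards[shard_of(app_id)].setdefault(app_id, []).append((user_id, action))

    def users_with_both(self, app_id, first, second):
        """Users of one app who performed both actions, computed on a single shard."""
        rows = self.shards[shard_of(app_id)].get(app_id, [])
        did_first = {user for user, action in rows if action == first}
        did_second = {user for user, action in rows if action == second}
        return did_first & did_second  # local intersection, no cross-node join

cluster = Cluster()
for app_id, user_id, action in [
    ("wx_a", 1, "visit"), ("wx_a", 1, "share"), ("wx_a", 2, "visit"),
    ("wx_a", 3, "share"), ("wx_a", 3, "visit"), ("wx_b", 1, "visit"),
]:
    cluster.insert(app_id, user_id, action)

result = cluster.users_with_both("wx_a", "visit", "share")
```

Because every row for an app lands on the same shard, the intersection is computed where the data lives and no intermediate results cross the network.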

User‑Group Features

Real‑time Estimation: Calculates the current group size based on the defined rules.

Batch Creation: Nightly Spark jobs compute daily groups for all merchants, reading once and writing once to minimize resource usage.

Tracking & Analysis: Offline jobs export group members, join them with metric tables, and store the results for online analysis (e.g., activity, transaction trends).

AB Experiment Targeting: Groups can be used as experiment cohorts for controlled tests.
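
To make rule-based group sizing concrete, the sketch below evaluates a nested AND/OR/NOT rule over tag-to-user sets and scales the count when the sets hold only a sample of users. The rule format, tag names, and function names are hypothetical, and Python sets again stand in for bitmaps; the article does not describe the actual rule representation.

```python
def evaluate(rule, tag_users, universe):
    """Recursively evaluate a rule tree against tag -> user-ID sets."""
    op = rule["op"]
    if op == "tag":
        return tag_users.get(rule["name"], set())
    parts = [evaluate(child, tag_users, universe) for child in rule["children"]]
    if op == "and":
        return set.intersection(*parts)
    if op == "or":
        return set.union(*parts)
    if op == "not":
        return universe - parts[0]
    raise ValueError(f"unknown operator: {op}")

def estimate_size(rule, tag_users, universe, sample_rate=1.0):
    """Estimate group size; if the sets hold a sample, scale the count back up."""
    return round(len(evaluate(rule, tag_users, universe)) / sample_rate)

universe = {1, 2, 3, 4, 5, 6}
tag_users = {
    "active": {1, 2, 3, 4},
    "churned": {5, 6},
    "paid": {2, 4, 6},
}

# Active users who have paid and are not churned.
rule = {"op": "and", "children": [
    {"op": "tag", "name": "active"},
    {"op": "tag", "name": "paid"},
    {"op": "not", "children": [{"op": "tag", "name": "churned"}]},
]}

members = evaluate(rule, tag_users, universe)
size = estimate_size(rule, tag_users, universe)
```

With real bitmaps the same tree maps onto bitmap AND, OR, and AND‑NOT operations, which is why a size estimate can be returned quickly for large groups.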

Key Takeaways

The architecture balances flexibility (rich, customizable tags) with performance (bitmap storage, local query execution) and scalability (supporting billions of daily events). By using Spark for heavy ETL, ClickHouse for fast bitmap queries, and TDSQL for serving pre‑computed results, WeAnalysis delivers a robust, low‑latency user‑segmentation platform.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
