We Analysis User Profiling System: Architecture and Technical Implementation
We Analysis, the official data‑analysis platform for WeChat mini‑program service providers, delivers a zero‑learning‑curve user‑profiling system that combines basic tag analysis with flexible, rule‑based segmentation. An ETL pipeline stores pre‑computed results in TDSQL, while online bitmap‑optimized queries run on ClickHouse with RoaringBitmap, ensuring low‑latency, stable, and comprehensive analytics.
We Analysis is the official data‑analysis platform for WeChat mini‑program service providers, with the user‑profiling insight module as a core feature. It offers basic tag analysis and custom user‑segmentation capabilities to meet diverse analytical needs.
The system is designed around three goals: ease of use (zero learning curve), stability (reliable, low‑latency queries), and completeness (rich tags, flexible rules, and extensive data coverage).
Overall, the platform consists of two main modules:
Basic Tag Module – provides foundational tag analysis for mini‑programs.
User Segmentation Module – enables custom group creation, real‑time estimation, and downstream applications.
Data sources include user attributes, predefined tags, platform‑generated behavior logs, and custom‑reported events. The raw data are first processed in an ETL layer (Extract‑Transform‑Load), pre‑aggregated, and stored in the distributed TDW HDFS cluster. Pre‑computed results are then exported to online storage engines: TDSQL for relational data and ClickHouse for OLAP queries.
Storage selection: After evaluating several databases (Datacube, FeatureKV, HBase, Elasticsearch, Doris, etc.), the team chose TDSQL for offline pre‑computed results due to its OLTP performance and capacity (up to 192 TB per instance). For online analytical workloads, ClickHouse was selected for its columnar architecture and native RoaringBitmap support, which efficiently handles sparse user‑group bitmaps.
Basic Tag Module: The module stores tag data in vertical tables to avoid wide‑table bottlenecks. An example table definition is shown below:
CREATE TABLE table_xxx(
ds BIGINT COMMENT 'data date',
label_name STRING COMMENT 'label name',
label_id BIGINT COMMENT 'label id',
appid STRING COMMENT 'mini-program appid',
useruin BIGINT COMMENT 'user uin',
tag_name STRING COMMENT 'tag name',
tag_id BIGINT COMMENT 'tag id',
tag_value BIGINT COMMENT 'tag weight value'
)
PARTITION BY LIST(ds)
SUBPARTITION BY LIST(label_name)(
SUBPARTITION sp_xxx VALUES IN ('xxx'),
SUBPARTITION sp_xxxx VALUES IN ('xxxx')
);
Data are partitioned by date and sub‑partitioned by label name, enabling independent, parallel generation of each tag.
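With this layout, basic tag analysis reduces to ordinary SQL over the vertical table. The following is a hedged sketch of such a query — the date, appid, and label name are hypothetical placeholders, not values from the source:

```sql
-- Hypothetical example: distribution of tag values for one label on one day.
-- LIST sub-partitioning on label_name lets the engine prune to a single
-- sub-partition, so the scan touches only this label's data.
SELECT tag_name, COUNT(DISTINCT useruin) AS users
FROM table_xxx
WHERE ds = 20240101                   -- hypothetical data date
  AND appid = 'wx_example_appid'      -- hypothetical mini-program appid
  AND label_name = 'xxx'              -- matches one sub-partition
GROUP BY tag_name
ORDER BY users DESC;
```

Because each sub‑partition corresponds to one label, queries scoped to a single `label_name` avoid scanning unrelated tags entirely.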
User Segmentation Module: This module supports flexible rule‑based group creation, real‑time size estimation, and periodic tracking. Rules combine multiple tag, behavior, and custom‑event dimensions. Bitmap representations (RoaringBitmap) are used to map users to groups, dramatically reducing storage and accelerating set operations.
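The set algebra behind rule combination can be illustrated with ClickHouse's built‑in bitmap functions. This is a toy sketch — the user IDs are made up for illustration:

```sql
-- Toy example: users {1,2,3} match tag A, users {2,3,4} match tag B.
-- An AND rule is a bitmap intersection: {1,2,3} AND {2,3,4} -> {2,3}.
SELECT bitmapCardinality(
    bitmapAnd(bitmapBuild([1, 2, 3]), bitmapBuild([2, 3, 4]))
) AS and_rule_size;

-- An OR rule is a bitmap union: {1,2,3} OR {2,3,4} -> {1,2,3,4}.
SELECT bitmapCardinality(
    bitmapOr(bitmapBuild([1, 2, 3]), bitmapBuild([2, 3, 4]))
) AS or_rule_size;
```

RoaringBitmap makes these intersections and unions cheap even when the underlying user sets are sparse, which is what enables real‑time size estimation as rules are edited.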
Data ingestion for segmentation follows a Spark‑based pipeline: Spark reads raw vertical tables, aggregates per user, generates per‑tag bitmaps, serializes them to Base64 strings, and writes them into ClickHouse tables with a materialized bitmap column. Example ClickHouse table definition:
CREATE TABLE xxxxx_table_local ON CLUSTER xxx (
`ds` UInt32,
`appid` String,
`label_group_id` UInt64,
`label_id` UInt64,
`bucket_num` UInt32,
`base64rbm` String,
`rbm` AggregateFunction(groupBitmap, UInt32) MATERIALIZED base64Decode(base64rbm)
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{layer}-{shard}/xxx_table_local', '{replica}')
PARTITION BY toYYYYMMDD(toDateTime(ds))
ORDER BY (appid, label_group_id, label_id)
TTL toDate(ds) + toIntervalDay(5)
SETTINGS index_granularity = 16;
Performance considerations include write/read speed, DDL latency, and query efficiency. The system uses hash‑based sharding to ensure that all data for a given user reside on the same node, enabling local‑only queries. For high‑DAU apps with many rules, sampling is applied to keep query latency acceptable.
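To illustrate the online estimation path, here is a hedged sketch of a segment‑size query over the materialized `rbm` column — the date, appid, and label IDs are hypothetical placeholders:

```sql
-- Hypothetical example: estimate the size of a segment defined as
-- "users having BOTH label 101 AND label 202" on one day.
-- groupBitmapAnd intersects the per-row bitmaps and returns the
-- cardinality of the result; groupBitmapOr would compute a union instead.
SELECT groupBitmapAnd(rbm) AS segment_size
FROM xxxxx_table_local
WHERE ds = 20240101                   -- hypothetical data date
  AND appid = 'wx_example_appid'      -- hypothetical mini-program appid
  AND label_id IN (101, 202);         -- hypothetical label IDs
```

Since hash‑based sharding keeps all of a user's data on one node, each shard can compute its partial intersection locally, and only small aggregate results cross the network.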
All service interfaces are built on an RPC framework, with a data‑middleware layer providing traffic control, async calls, monitoring, and parameter validation. Operational features such as instance provisioning, monitoring alerts, scaling, and slow‑query analysis are exposed via the cloud console.
Conclusion: The We Analysis profiling system combines TDSQL for reliable relational storage and ClickHouse with RoaringBitmap for high‑performance online analytics. The design balances flexibility, stability, and completeness, and will continue to evolve with richer features and new application scenarios.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.