
Tencent Game Marketing Deduplication Service: Technical Evolution from TDW to ClickHouse

Tencent's game marketing analysis system "EAS" evolved from inefficient TDW HiveSQL jobs and file-heavy real-time pipelines to a scalable ClickHouse-based deduplication service that handles hundreds of thousands of daily deduplication tasks with sub-second query times, offering fast, reliable, and maintainable participant deduplication for massive marketing campaigns.

Tencent Cloud Developer

This article introduces Tencent's game marketing activity analysis system "奕星" (EAS) and its technical approach to solving the challenging problem of deduplicating participant counts for massive marketing activities.

Background: Each marketing activity has a fixed time period, and different activities rarely share identical time windows. For example, Activity A runs from Jan 1-10, while Activity B runs from Jan 5-15. To get a deduplicated participant count for each activity, a separate calculation must be performed within that activity's own time interval. With thousands of marketing activities requiring daily calculations, plus deduplication data for non-activity links and per-channel breakdowns, the total task volume exceeds 500,000 per day.
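Because each activity must be deduplicated against its own window, overlapping windows cannot share one scan. A minimal sketch of the per-activity computation (the event log, activity IDs, and dates here are made up for illustration):

```python
from datetime import date

# Hypothetical event log: (uid, activity_id, event_date) tuples.
events = [
    ("u1", "A", date(2020, 1, 2)),
    ("u1", "A", date(2020, 1, 8)),   # repeat participant within A's window
    ("u2", "A", date(2020, 1, 6)),
    ("u2", "B", date(2020, 1, 6)),   # windows of A and B overlap on Jan 5-10
    ("u3", "B", date(2020, 1, 14)),
]

# Each activity has its own fixed window, matching the article's example.
windows = {
    "A": (date(2020, 1, 1), date(2020, 1, 10)),
    "B": (date(2020, 1, 5), date(2020, 1, 15)),
}

def dedup_counts(events, windows):
    """Count distinct participants per activity within that activity's window."""
    seen = {aid: set() for aid in windows}
    for uid, aid, d in events:
        start, end = windows[aid]
        if start <= d <= end:
            seen[aid].add(uid)
    return {aid: len(uids) for aid, uids in seen.items()}

print(dedup_counts(events, windows))  # {'A': 2, 'B': 2}
```

The per-activity `set` is what makes the task volume multiplicative: every activity (and every channel split) needs its own deduplication state.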

Previous Solutions:

1. TDW Temporary Table Solution: Using Tencent's TDW big data platform, a separate HiveSQL job was executed for each activity via pysql. The main drawback was sequential execution that rescanned the same raw logs repeatedly, resulting in very low efficiency.

2. Real-time Calculation + File-based Incremental Deduplication: Using Storm for real-time computation with time-window deduplication. Testing showed that caching 5 minutes of deduplicated data could cut raw log volume by over 90%. However, the approach generated hundreds of thousands of files per day, and the resulting small-file I/O was inefficient.
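The 90% reduction comes from forwarding only the first sighting of each ID per window. A minimal sketch of that time-window deduplication (the window length matches the article's 5 minutes; the stream format is assumed):

```python
# Sketch of time-window deduplication: keep a cache of IDs seen in the
# current 5-minute window and forward only first occurrences downstream.
WINDOW_SECONDS = 300

def window_dedup(stream):
    """stream: iterable of (timestamp_seconds, uid), roughly time-ordered.
    Yields only the first sighting of each uid within each 5-minute window."""
    window_start = None
    seen = set()
    for ts, uid in stream:
        if window_start is None or ts - window_start >= WINDOW_SECONDS:
            window_start = ts - (ts % WINDOW_SECONDS)  # align to window boundary
            seen = set()                               # drop the expired cache
        if uid not in seen:
            seen.add(uid)
            yield (ts, uid)

# A burst of repeated log lines inside one window collapses to two records;
# the same uid reappearing in the next window is emitted again.
raw = [(0, "u1"), (10, "u1"), (20, "u2"), (30, "u1"), (310, "u1")]
print(list(window_dedup(raw)))  # [(0, 'u1'), (20, 'u2'), (310, 'u1')]
```

The downstream incremental files then only need to merge these pre-thinned records, which is where the small-file problem arose.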

3. Real-time Calculation + LevelDB Solution: LevelDB's high random-write and sequential read/write performance suited this write-heavy, query-light deduplication scenario. The solution achieved millisecond-level queries for exact deduplicated counts and could export participant files at the 10-million-record scale within 10 seconds.
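LevelDB keeps keys sorted, so laying keys out as `<activity>:<uid>` makes one activity's participants a contiguous range: idempotent writes handle deduplication, and export becomes a sequential prefix scan. A sketch of that key design with a plain dict standing in for the store (the key layout and function names are illustrative assumptions, not the article's actual schema):

```python
# Dict stand-in for a LevelDB store; real LevelDB would give the same semantics
# with keys persisted in sorted order on disk.
store = {}

def record(activity, uid):
    """Write-heavy path: insert the participant key if absent (idempotent)."""
    store.setdefault(f"{activity}:{uid}", b"1")

def count(activity):
    """Query path: count keys under the activity's prefix."""
    prefix = f"{activity}:"
    return sum(1 for k in store if k.startswith(prefix))

def export(activity):
    """Export path: a prefix range scan maps to LevelDB's fast sequential reads."""
    prefix = f"{activity}:"
    return sorted(k[len(prefix):] for k in store if k.startswith(prefix))

for uid in ["u1", "u2", "u1", "u3"]:
    record("act42", uid)
print(count("act42"), export("act42"))  # 3 ['u1', 'u2', 'u3']
```

The duplicate `u1` write is absorbed by the store itself, which is why this design tolerates the high write rate of raw logs while queries stay rare and cheap.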

ClickHouse-based Solution: While LevelDB met most needs, it scaled poorly and made data backtracking difficult. The team chose ClickHouse, an MPP OLAP system, for its flexibility and strong performance. The system answers queries such as: select uniqExact(uvid) from tbUv where date='2020-09-06' and url='http://lol.qq.com/main.shtml'. Testing showed that deduplicating 1 million distinct participants out of 100 million records took under 0.1 seconds, with file export completing in under 0.2 seconds.
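ClickHouse's `uniqExact` is an exact (not approximate) distinct count under the query's filters. A pure-Python mirror of what that SQL computes, over made-up rows shaped like the article's `tbUv` columns:

```python
# Rows of (date, url, uvid), mimicking the tbUv table from the example query.
rows = [
    ("2020-09-06", "http://lol.qq.com/main.shtml", "u1"),
    ("2020-09-06", "http://lol.qq.com/main.shtml", "u1"),  # duplicate visit
    ("2020-09-06", "http://lol.qq.com/main.shtml", "u2"),
    ("2020-09-05", "http://lol.qq.com/main.shtml", "u3"),  # outside the date filter
]

def uniq_exact(rows, day, url):
    """Exact distinct count of uvid, equivalent in semantics to
    select uniqExact(uvid) from tbUv where date=:day and url=:url."""
    return len({uvid for d, u, uvid in rows if d == day and u == url})

print(uniq_exact(rows, "2020-09-06", "http://lol.qq.com/main.shtml"))  # 2
```

In ClickHouse the same set construction runs distributed across shards, which is what lets 100 million records deduplicate in under 0.1 seconds.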

Conclusion: The deduplication service continues to evolve with business needs and operational environment. Rather than pursuing absolute high performance (which means higher costs), the focus is on finding the most suitable, easily scalable solutions that support long-term stable operations. ClickHouse is now deployed across multiple data systems with over 500 billion records and growing.

Tags: big data, ClickHouse, deduplication, OLAP, LevelDB, real-time computing, MPP, Storm
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
