
How FunData Scaled DOTA2 Esports Data with a Cloud‑Native Big Data Architecture

This article details the evolution of the FunData esports data platform from a simple master‑slave ETL system to a cloud‑native, distributed architecture that leverages Google Cloud Pub/Sub, Dataflow, Bigtable, and a redesigned API layer to handle petabyte‑scale, real‑time DOTA2 match data.

dbaplus Community

1.0 Architecture

The initial FunData system followed an MVP approach with a two‑module master‑slave design. The Master periodically called the Steam API for match IDs, dispatched analysis tasks via an in‑memory message queue, and tracked progress. The Slave listened to the queue, performed replay analysis using the open‑source projects Clarity and Manta, and stored results.

While stable at launch, the system soon faced scalability and maintainability problems: rebuilding DB indexes for new fields took hours, the tightly coupled master‑slave relationship required full restarts, there was no message persistence, scaling slaves required manual VM image creation, and the master‑slave DB schema caused lock contention.

Figure 1: 1.0 ETL architecture

2.0 Architecture

Learning from the 1.0 shortcomings, the 2.0 redesign focuses on three core qualities: fine‑grained, high‑concurrency task processing, distributed storage, and system decoupling.

Task granularity: DOTA2 produces up to 1.2 million matches per day; the work for each match is split into smaller tasks that are published to separate Pub/Sub topics and processed by independent workers.

Distributed storage: Google Cloud Bigtable stores raw and processed data, while MongoDB holds aggregated statistics.

Decoupling: Pub/Sub (Kafka‑like) replaces the in‑memory queue, allowing independent restarts and horizontal scaling.
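The task split described above can be sketched roughly as follows. This is a minimal illustration, not FunData's code: the topic names, the `InMemoryBus` class, and the `dispatch` helper are all hypothetical, and the in-memory bus only stands in for a real Pub/Sub client.

```python
from collections import defaultdict, deque

# Hypothetical topic names; the article does not list the real ones.
TOPICS = {
    "match_detail": "match-detail-tasks",
    "replay": "replay-analysis-tasks",
}


class InMemoryBus:
    """Stand-in for a message bus client: one FIFO queue per topic."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def publish(self, topic, message):
        self.queues[topic].append(message)

    def pull(self, topic):
        queue = self.queues[topic]
        return queue.popleft() if queue else None


def dispatch(bus, match_id):
    """Split one match into fine-grained tasks, one per topic."""
    for kind, topic in TOPICS.items():
        bus.publish(topic, {"match_id": match_id, "kind": kind})


bus = InMemoryBus()
for match_id in (1001, 1002):
    dispatch(bus, match_id)
```

Because each task type has its own topic, the workers for one type can be scaled or restarted without touching the others.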

Figure 3: 2.0 ETL overall architecture

Data Processing Flow

Basic data (match details, KDA, damage, creep score, etc.) and replay analysis results are fetched by a Supervisor, cleaned by workers, and written to Bigtable. High‑level statistics (hero usage, item builds, team fights) are produced by Dataflow pipelines and stored in both MongoDB and Bigtable.
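To make the high‑level statistics step concrete, here is a small sketch of a hero usage aggregation. It is plain Python rather than a Dataflow pipeline, and the record layout and the `hero_pick_rates` function are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical cleaned match records; the field names are assumptions.
matches = [
    {"match_id": 1, "heroes": ["axe", "lina", "pudge"]},
    {"match_id": 2, "heroes": ["axe", "zeus"]},
    {"match_id": 3, "heroes": ["lina", "axe"]},
]


def hero_pick_rates(records):
    """Return the fraction of matches in which each hero was picked."""
    picks = Counter()
    for record in records:
        picks.update(set(record["heroes"]))  # count a hero once per match
    total = len(records)
    return {hero: count / total for hero, count in picks.items()}


rates = hero_pick_rates(matches)
```

In a real pipeline the same count-then-divide logic would run as a distributed group-and-combine step over far more records.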

The original single Slave node is split into four sub‑modules: league data analysis, league replay analysis, a DB proxy for analysis and mining data, and monitoring.

Figure 4: League‑ETL architecture

Distributed Storage Choice

MySQL proved inadequate for the growing data volume and frequent schema changes. The team adopted Google Cloud Bigtable (whose data model HBase shares) for its scalable, low‑latency random reads and writes. The RowKey combines a consistent hash prefix with the match_id, so that sequential match IDs are spread across nodes instead of hotspotting a single one.
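A salted RowKey of this kind might look like the sketch below. The bucket count, the `#` separator, and the `row_key` function are assumptions, and the sketch uses a plain stable hash modulo a bucket count rather than a full consistent-hashing ring, so it only illustrates the general idea.

```python
import hashlib

NUM_BUCKETS = 16  # assumed bucket count, not taken from the article


def row_key(match_id: int) -> str:
    """Build '<bucket>#<match_id>' so sequential IDs land in different key ranges."""
    digest = hashlib.md5(str(match_id).encode("utf-8")).digest()
    bucket = digest[0] % NUM_BUCKETS  # stable across processes and machines
    return f"{bucket:02d}#{match_id}"


keys = [row_key(match_id) for match_id in range(1000, 1005)]
```

The prefix is derived from the match_id itself, so a reader who knows the ID can recompute the full key and fetch the row with a single point lookup.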

Figure 6: Bigtable/HBase data model

A secondary index is kept in MySQL: workers write each record's timestamp and RowKey to an index table, which later serves time‑range queries.
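The timestamp‑to‑RowKey index could be sketched as follows. The example uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs without a server; the table name, column names, and helper functions are hypothetical.

```python
import sqlite3

# sqlite3 stands in for MySQL here; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE match_index (start_time INTEGER NOT NULL, row_key TEXT NOT NULL)"
)
conn.execute("CREATE INDEX idx_start_time ON match_index (start_time)")


def index_match(start_time, row_key):
    """Called by a worker after it writes the row to the wide-column store."""
    conn.execute(
        "INSERT INTO match_index (start_time, row_key) VALUES (?, ?)",
        (start_time, row_key),
    )


def row_keys_between(start, end):
    """Return RowKeys for matches with start <= start_time < end, oldest first."""
    cursor = conn.execute(
        "SELECT row_key FROM match_index "
        "WHERE start_time >= ? AND start_time < ? ORDER BY start_time",
        (start, end),
    )
    return [row[0] for row in cursor]


index_match(100, "03#1001")
index_match(200, "11#1002")
index_match(300, "07#1003")
keys_in_range = row_keys_between(100, 300)
```

The range query returns only RowKeys; the caller then fetches the full rows by key, which is the access pattern a hash-prefixed key still supports well.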

System Decoupling

Replacing the in‑memory queue with Pub/Sub eliminates data loss on Master failures and enables independent version upgrades. The message bus also provides visual monitoring of backlog and supports multi‑cloud resilience.
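The reason a persistent, acknowledgement-based bus avoids the data loss of an in‑memory queue can be shown with a toy model. The `AckQueue` class below is hypothetical and much simpler than a real message service (which typically redelivers after an acknowledgement deadline rather than on an explicit call), but it shows the principle: a message is only discarded once a worker confirms it.

```python
from collections import deque


class AckQueue:
    """Toy queue with acknowledgements, standing in for a persistent message bus."""

    def __init__(self):
        self._pending = deque()
        self._in_flight = {}
        self._next_id = 0

    def publish(self, payload):
        self._pending.append((self._next_id, payload))
        self._next_id += 1

    def pull(self):
        if not self._pending:
            return None
        ack_id, payload = self._pending.popleft()
        self._in_flight[ack_id] = payload  # kept until acknowledged
        return ack_id, payload

    def ack(self, ack_id):
        self._in_flight.pop(ack_id, None)

    def redeliver_unacked(self):
        """Return unacknowledged messages to the queue, e.g. after a worker crash."""
        for ack_id, payload in sorted(self._in_flight.items()):
            self._pending.append((ack_id, payload))
        self._in_flight.clear()


queue = AckQueue()
queue.publish({"match_id": 1001})
queue.publish({"match_id": 1002})

first_id, _ = queue.pull()
queue.ack(first_id)        # worker finished the first task
queue.pull()               # worker crashes before acknowledging the second
queue.redeliver_unacked()
retried = queue.pull()     # the second task is delivered again
```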

Figure 10: Data monitoring

API Layer Redesign

The original API layer used DreamFactory on Alibaba Cloud, exposing full‑table REST endpoints without caching, which caused latency spikes and slow cross‑region access. The new design splits APIs by data domain (matches, league schedules, heroes, items) and introduces CDN acceleration, multi‑cloud failover, and an internal cache that refreshes on data updates.
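One way to picture a cache that refreshes on data updates is the sketch below. The `MatchCache` class and its `on_update` hook are hypothetical; this version drops the stale entry and reloads it lazily on the next read, which is only one of several possible refresh strategies.

```python
class MatchCache:
    """Read-through cache whose entries are invalidated when the data changes."""

    def __init__(self, load_from_store):
        self._load = load_from_store
        self._entries = {}

    def get(self, match_id):
        if match_id not in self._entries:
            self._entries[match_id] = self._load(match_id)
        return self._entries[match_id]

    def on_update(self, match_id):
        """Called when the ETL writes new data; drops the stale entry."""
        self._entries.pop(match_id, None)


# A dict stands in for the backing store; `loads` records each store read.
store = {1001: {"kills": 10}}
loads = []


def load(match_id):
    loads.append(match_id)
    return dict(store[match_id])


cache = MatchCache(load)
cache.get(1001)
cache.get(1001)             # served from the cache, no second store read
store[1001] = {"kills": 12}
cache.on_update(1001)       # the ETL signals that match 1001 changed
fresh = cache.get(1001)
```

Tying invalidation to update events means entries never need a time-based expiry to stay correct, at the cost of requiring the write path to notify the cache.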

Figure 11: New API architecture

Conclusion

The FunData platform evolved from a monolithic master‑slave ETL pipeline to a cloud‑native, distributed system that can ingest, process, and serve petabyte‑scale esports data with low latency and high availability. Since its public launch on April 10, over 300 developers have obtained API keys, and the team continues to add new data points such as league statistics and real‑time match feeds.
