Databases 22 min read

Apache Doris Deployment and Optimization at NetEase Interactive Entertainment

This article details NetEase Interactive Entertainment's adoption of Apache Doris for large‑scale game data analytics, covering background, Doris architecture, cluster governance, tablet and compaction tuning, scaling strategies, monitoring, alerting, and fault‑handling practices to improve performance and stability.

NetEase Game Operations Platform
NetEase Game Operations Platform
NetEase Game Operations Platform
Apache Doris Deployment and Optimization at NetEase Interactive Entertainment

NetEase Interactive Entertainment introduced Apache Doris in April 2021 to meet the growing real‑time analytics demands of its game business, providing an EB‑level storage cluster and supporting Hive, Spark, Presto, Doris, and ClickHouse frameworks.

Background : Rapid growth in game analytics required a high‑performance query engine; Doris was chosen for its MPP architecture, columnar storage, ANSI‑SQL support, high availability, and ecosystem.

Apache Doris Overview : Doris is a Baidu‑originated MPP analytical database offering sub‑second query latency on PB‑scale data, with a simple two‑service architecture (FE and BE) and no external dependencies.

2.1 FE : The Frontend parses SQL, performs analysis, logical planning, optimization, and generates plan fragments for execution on BE nodes.

2.2 BE : Backend nodes store data and execute SQL fragments; they receive data directly, build indexes, and perform distributed execution to avoid data movement.

3. Cluster Governance

Tablet management: Excessive tablet count (≈20 M) caused metadata latency and OOM risks; target reduced to ~1 M tablets.

Compaction tuning: High Compaction Cumulative Score indicated inefficient LSM‑Tree merges; parameters such as cumulative_size_based_promotion_ratio, max_cumulative_compaction_num_singleton_deltas, and max_tablet_version_num were adjusted.

Scaling: After adding BE nodes, imbalance occurred due to load‑score classification; solutions included adjusting balance_load_score_threshold and disk reservation settings.

Tablet Governance Implementation : Scanned all tables, exported statistics, and applied two remediation methods—data re‑insertion with corrected bucket numbers and partition re‑load from Hive—using scripts such as:

con = DriverManager.getConnection()
for db in db_list:
    con.execute("use %s" % db)
    for table in table_list:
        con.execute("show create table %s" % table)
        result.append(regrex_extract_bucket(con.next()))
        con.execute("show partitions from %s" % table)
        result.append(regrex_extract_bucket(con.next()))
write_result_to_excel(result)

Performance tests showed significant reductions in tablet counts and query latency after remediation.

Monitoring & Alerting : NetEase’s internal Monitor and Blackhole systems collect Doris metrics (FE/BE ports, query latency, error rates, storage usage, compaction scores) and trigger alerts, with dashboards built on Grafana.

Fault Handling : Procedures for BE and FE crashes include log inspection, core dump analysis with gdb, thread stack tracing, and correlating query IDs via audit logs to pinpoint problematic SQL.

Overall, the adoption of Doris has improved query speed and operational simplicity, while ongoing challenges such as version management and community support are being addressed.

Monitoringbig datadatabase optimizationcluster managementApache DorisCompaction Tuning
NetEase Game Operations Platform
Written by

NetEase Game Operations Platform

The NetEase Game Automated Operations Platform delivers stable services for thousands of NetEase titles, focusing on efficient ops workflows, intelligent monitoring, and virtualization.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.