
Tencent Cloud Elasticsearch Optimization Practices in Tencent Meeting: High Availability, Performance, and Cost-Effective Solutions

Tencent Meeting migrated its quality-analysis system to Tencent Cloud Elasticsearch to tackle OOM failures, 3M/s write spikes, and scaling limits. The work added multi-AZ deployment, leaky-bucket rate limiting, streaming aggregation checks, optimized merge and translog handling, plus hot-warm storage, ILM, multi-disk support, and off-heap caching, cutting the cluster from 15,000 machines to under 300 while maintaining high availability and performance.

Tencent Cloud Developer

This article introduces the application of Tencent Cloud Elasticsearch in Tencent Meeting's quality analysis system, along with optimizations for high availability, performance, and cost reduction in large-scale scenarios.

1. Application in Tencent Meeting

Tencent Meeting launched in December 2019, reaching 10 million daily active users within two months during the COVID-19 pandemic. The quality analysis system needed to process massive real-time data including network metrics (connection type, bitrate, packet loss rate, network/IP switches) and client metrics (CPU, memory, OS version, product version) to help teams quickly identify issues like video stuttering or audio desynchronization.

2. Pain Points and Challenges

With explosive growth (20+ versions shipped in 100 days, 100k hosts added in 8 days), the original system faced four major challenges: (1) Availability: the self-developed Lucene-based engine suffered OOM issues and cascading cluster failures; (2) Performance: peak write throughput reached 3M/s with data delays exceeding 30 minutes; (3) Scalability: the custom search engine could not be expanded rapidly; (4) Usability: the replacement had to be switched in within one week with minimal code changes. The solution was migrating to the classic ELK architecture.

3. High Availability and Performance Optimizations

High Concurrency Request Optimization: Implemented a memory-based leaky-bucket strategy for rate limiting. Unlike ES's native request-count-based approach, this solution controls memory resource usage at the coordinating node's access layer and throttles in tiers: writes are limited first when usage enters the light-yellow zone, queries when it enters the dark-yellow zone, and the red zone is reserved for in-flight requests and merge operations.
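The tiered admission logic can be sketched as follows. This is a minimal illustration, not Tencent's implementation; the zone thresholds (70% / 85% / 95%) and the `MemoryAdmissionController` class are hypothetical.

```python
class MemoryAdmissionController:
    """Sketch of tiered, memory-based admission control (thresholds are illustrative)."""

    def __init__(self, capacity_bytes, light=0.70, dark=0.85, red=0.95):
        self.capacity = capacity_bytes
        self.light, self.dark, self.red = light, dark, red
        self.used = 0

    def admit(self, kind, size_bytes):
        projected = (self.used + size_bytes) / self.capacity
        if projected >= self.red:
            return False  # red zone: reserved for in-flight requests and merges
        if projected >= self.dark and kind == "query":
            return False  # dark yellow: start shedding queries too
        if projected >= self.light and kind == "write":
            return False  # light yellow: throttle writes first
        self.used += size_bytes
        return True

    def release(self, size_bytes):
        self.used -= size_bytes
```

Throttling writes before queries keeps the cluster readable for diagnosis while under write pressure, which matches the failure mode described above.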

Large Aggregation Query Optimization: Phase 1 uses memory inflation coefficient to estimate memory consumption during deserialization and triggers circuit breaking when exceeding thresholds. Phase 2 performs streaming bucket count checks during reduction, checking memory every 1024 buckets and killing queries that exceed limits. This optimization was contributed to ES 7.7.0.
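The Phase 2 idea can be sketched as a streaming check during reduction. This is a simplified model, not the ES 7.7.0 code: `bytes_per_bucket` stands in for the memory inflation estimate, and the 1024-bucket check interval comes from the description above.

```python
class CircuitBreakingError(RuntimeError):
    """Raised when an aggregation exceeds its memory budget mid-reduce."""


def reduce_buckets(buckets, mem_limit_bytes, bytes_per_bucket=64, check_interval=1024):
    """Sketch: merge buckets, re-checking estimated memory every 1024 buckets."""
    merged = []
    for i, bucket in enumerate(buckets, 1):
        merged.append(bucket)
        if i % check_interval == 0:
            estimated = i * bytes_per_bucket  # hypothetical inflation estimate
            if estimated > mem_limit_bytes:
                raise CircuitBreakingError(f"aggregation killed after {i} buckets")
    return merged
```

Checking periodically rather than per bucket keeps the overhead of the safety check negligible relative to the reduce work itself.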

Multi-AZ Deployment: Provides cluster deployment across 2-3 availability zones with shard allocation awareness, ensuring data integrity and transparent failover when one zone fails.
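Shard allocation awareness of this kind is expressed through standard Elasticsearch cluster settings; the snippet below shows what such a configuration payload typically looks like. The zone names are examples, and this illustrates the general mechanism rather than Tencent Cloud's managed setup.

```python
# Example cluster-settings payload for zone-aware (forced) shard allocation.
# Each node would also carry a matching attribute, e.g. node.attr.zone: az-1.
multi_az_settings = {
    "persistent": {
        "cluster.routing.allocation.awareness.attributes": "zone",
        "cluster.routing.allocation.awareness.force.zone.values": "az-1,az-2,az-3",
    }
}
```

With forced awareness, replicas for a zone's shards stay unassigned rather than piling onto the surviving zones, which is what makes failover transparent without overloading the remaining nodes.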

Merge Strategy Optimization: Implemented time-sequence-based sorting at L0 layer combined with target file size-based merging at L1 layer (e.g., 20MB per file, merging every 20 small files). Added cold shard continuous merging for shards not updated for over 5 minutes.
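The batching rule above can be sketched as a small planner: sort segments by time, then cut a merge batch whenever it reaches 20 files or the target size. This is an illustration of the policy, not the actual merge scheduler; `plan_merges` and the segment dict shape are hypothetical.

```python
def plan_merges(segments, max_batch=20, target_bytes=20 * 1024 * 1024):
    """Sketch: batch time-ordered L0 segments into merges of up to
    `max_batch` files or `target_bytes` of data, whichever comes first."""
    plans, batch, size = [], [], 0
    for seg in sorted(segments, key=lambda s: s["ts"]):  # L0: time-sequence order
        batch.append(seg["name"])
        size += seg["bytes"]
        if len(batch) >= max_batch or size >= target_bytes:
            plans.append(batch)  # L1: emit a target-sized merge
            batch, size = [], 0
    if batch:
        plans.append(batch)  # stragglers: candidates for cold-shard merging
    return plans
```

The leftover batch at the end mirrors the cold-shard case: small tails that never hit the size/count trigger are still merged once the shard goes quiet.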

High-Throughput Write Optimization: Identified translog lock synchronization as the bottleneck. The optimization performs a disk flush before each rollGeneration, eliminating lock synchronization on every write. This improved write performance by over 20% and was contributed back to the ES community.

4. Cost Reduction Solutions

Storage Cost: (1) Hot-Warm separation - uses high-config machines (64+ cores) for recent data and low-config machines (4-12 cores) for cold data; (2) ILM (Index Lifecycle Management) - manages data across Hot/Warm/Cold phases with downsampling and index management; (3) Multi-disk strategy - attaches multiple disks per node to break through single disk IO limits (260MB/s); (4) COS cold backup - backs up data to low-cost Cloud Object Storage.
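The Hot-Warm and ILM pieces fit together through a lifecycle policy. The payload below shows the typical shape of such a policy in Elasticsearch; the phase timings, sizes, and the `box_type` attribute values are illustrative, not Tencent Meeting's actual settings.

```python
# Example ILM policy: roll over hot indices, then relocate them to
# warm and cold hardware tiers via allocation attributes.
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {"rollover": {"max_size": "50gb", "max_age": "1d"}}
            },
            "warm": {
                "min_age": "3d",
                "actions": {
                    "allocate": {"require": {"box_type": "warm"}},
                    "forcemerge": {"max_num_segments": 1},
                },
            },
            "cold": {
                "min_age": "30d",
                "actions": {"allocate": {"require": {"box_type": "cold"}}},
            },
        }
    }
}
```

The `allocate` action is what drives data from the 64+ core hot nodes down to the 4-12 core warm/cold nodes without any application-side changes.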

Compute Cost: Addressed FST (Finite State Transducer) cache consuming 50-70% of heap memory. Implemented two-layer cache using off-heap memory with LRU/LFU eviction policies, reducing heap memory pressure from 70-90% to safe levels.
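The two-layer cache idea can be sketched as a small front tier that demotes evicted entries into a larger back tier instead of dropping them. This is a toy model: both tiers live in ordinary Python memory here, whereas the real second tier sits off-heap, and the class name and sizes are hypothetical.

```python
from collections import OrderedDict


class TwoTierCache:
    """Sketch: small 'on-heap' LRU tier backed by a larger 'off-heap' LRU tier."""

    def __init__(self, l1_size=2, l2_size=8):
        self.l1, self.l2 = OrderedDict(), OrderedDict()
        self.l1_size, self.l2_size = l1_size, l2_size

    def get(self, key):
        if key in self.l1:
            self.l1.move_to_end(key)          # refresh recency in tier 1
            return self.l1[key]
        if key in self.l2:
            return self._promote(key)         # hit in tier 2: promote
        return None                            # miss: caller rebuilds (e.g. FST load)

    def put(self, key, value):
        self.l1[key] = value
        self.l1.move_to_end(key)
        if len(self.l1) > self.l1_size:
            old_key, old_val = self.l1.popitem(last=False)
            self.l2[old_key] = old_val         # demote LRU entry off-heap
            if len(self.l2) > self.l2_size:
                self.l2.popitem(last=False)    # final eviction

    def _promote(self, key):
        value = self.l2.pop(key)
        self.put(key, value)
        return value
```

Because demoted entries survive in the second tier, a re-access is a cheap promotion rather than a full rebuild, which is what relieves heap pressure without paying repeated FST deserialization costs.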

After these optimizations, Tencent Cloud's monitoring cluster shrank from 15,000 machines to just 200-300 while handling the same workload.

Tags: Data Engineering, Performance Optimization, Elasticsearch, Cost Optimization, Distributed Systems, Tencent Cloud, High Availability
Written by Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
