High-Concurrency Practices for Tencent Video Front-End Node.js Services

Tencent Video’s front‑end Node.js services achieve massive concurrency stability through a layered architecture that combines GSLB‑directed CDN, TGW, Nginx, and clustered workers, reinforced by process guardians, three‑tier disaster‑recovery fallbacks, multi‑level caching with lock mechanisms, and comprehensive logging and alerting.

Tencent Cloud Developer

The author, a senior front‑end engineer at Tencent Video, participated in the development of the National Day parade live‑stream page, which attracted 238 million views and endured extreme traffic spikes.

This article shares practical experience on handling high concurrency for video‑side Node.js services, focusing on three dimensions: service availability, caching, and logging.

1. Video Front‑End Network Architecture

The request flow is:

The user's request first reaches GSLB, which returns the optimal IP and directs the client to the nearest CDN node.

If the CDN cache hits, the response is served directly; otherwise the request goes back to origin through the Tencent Gateway (TGW), which handles disaster recovery and load balancing.

TGW forwards the request to the business‑layer Nginx, which applies simple disaster‑recovery settings such as max_fails and fail_timeout, as well as caching.

Nginx routes the request to a Node.js service. The Node process uses the cluster module to dispatch traffic to worker processes that handle the actual business logic.

In this layered design no single node carries reliability alone: each layer adds its own load balancing, caching, or failure handling, and together they keep the service available.
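The last hop of this flow, a master process handing connections to workers through the cluster module, might look like the sketch below. The port number and the function names startMaster, startWorker and main are assumptions for illustration, not taken from Tencent Video's code; one worker per CPU core is a common default rather than something the article states.

```typescript
import cluster from "node:cluster";
import http from "node:http";
import os from "node:os";

// Hypothetical port; in the flow above, Nginx would proxy requests to it.
const PORT = 3000;

// Master: fork the workers and replace any worker that exits.
function startMaster(workerCount: number = os.cpus().length): void {
  for (let i = 0; i < workerCount; i++) {
    cluster.fork();
  }
  cluster.on("exit", (worker, code, signal) => {
    console.error(`worker ${worker.process.pid} exited (${signal || code}); forking a replacement`);
    cluster.fork();
  });
}

// Worker: run the HTTP server that handles the business logic.
function startWorker(port: number = PORT): http.Server {
  return http
    .createServer((req, res) => {
      res.writeHead(200, { "Content-Type": "text/plain" });
      res.end(`handled by worker ${process.pid}\n`);
    })
    .listen(port);
}

// Entry point: the same script runs as master or as worker.
function main(): void {
  if (cluster.isPrimary) {
    startMaster();
  } else {
    startWorker();
  }
}
```

Calling main() at the bottom of the script starts the service. All workers listen on the same port, and the cluster module distributes incoming connections among them.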

2. Availability

Process guardianship is essential. Tencent Video uses a shell script scheduled by crontab to check every minute whether the Node.js master process and its listening ports are alive (using ps and nc). If a problem is detected, the service is restarted.
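The article's guardian is a shell script built on ps and nc. For illustration only, the same check can be sketched in TypeScript: probePort does roughly what an nc port probe does, and guardOnce is one pass of the guardian. Every name here is hypothetical, and the process check and the restart action are injected by the caller because the article does not describe them in detail.

```typescript
import net from "node:net";

// Resolves true if a TCP connection to host:port succeeds within timeoutMs.
function probePort(port: number, host = "127.0.0.1", timeoutMs = 1000): Promise<boolean> {
  return new Promise((resolve) => {
    const socket = net.createConnection({ port, host });
    const finish = (alive: boolean) => {
      socket.destroy();
      resolve(alive);
    };
    socket.setTimeout(timeoutMs);
    socket.once("connect", () => finish(true));
    socket.once("timeout", () => finish(false));
    socket.once("error", () => finish(false));
  });
}

// Restart when the master process is gone or any expected port is closed.
function needsRestart(masterAlive: boolean, portsAlive: boolean[]): boolean {
  return !masterAlive || portsAlive.some((alive) => !alive);
}

// One guardian pass. `isMasterAlive` and `restart` are supplied by the caller,
// for example a ps-based check and a start script (both assumptions).
async function guardOnce(
  ports: number[],
  isMasterAlive: () => Promise<boolean>,
  restart: () => Promise<void>,
): Promise<boolean> {
  const [masterAlive, portsAlive] = await Promise.all([
    isMasterAlive(),
    Promise.all(ports.map((p) => probePort(p))),
  ]);
  const restartNeeded = needsRestart(masterAlive, portsAlive);
  if (restartNeeded) await restart();
  return restartNeeded;
}
```

Scheduling one guardOnce pass per minute from crontab would mirror the setup described above. The guardian has to run outside the Node.js service it watches, otherwise it dies together with the master.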

Within the cluster module, workers may encounter three common issues:

Zombie (unresponsive) processes caused by infinite loops – mitigated by sending heartbeat packets from the master and restarting workers that stop responding.

Memory leaks – each worker's memory usage is monitored, and a worker is killed and restarted once its usage exceeds a threshold.

Unexpected exits – a worker that exits unexpectedly is restarted automatically by the guardian.

Tools such as pm2 can also be used for process management.
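The heartbeat and memory checks for cluster workers come down to a small amount of bookkeeping in the master. The sketch below shows only that decision logic; the class name WorkerWatchdog, its methods, and the idea that workers report their memory usage inside the heartbeat reply are assumptions made for illustration.

```typescript
// Health state the master keeps for each worker (names are illustrative).
interface WorkerHealth {
  lastPongAt: number; // timestamp (ms) of the last heartbeat reply
  rssBytes: number;   // last memory figure the worker reported
}

type RestartReason = "unresponsive" | "memory";

class WorkerWatchdog {
  private health = new Map<number, WorkerHealth>();
  private heartbeatTimeoutMs: number;
  private maxRssBytes: number;

  constructor(heartbeatTimeoutMs: number, maxRssBytes: number) {
    this.heartbeatTimeoutMs = heartbeatTimeoutMs;
    this.maxRssBytes = maxRssBytes;
  }

  // Call when a worker is forked.
  register(workerId: number, now: number): void {
    this.health.set(workerId, { lastPongAt: now, rssBytes: 0 });
  }

  // Call when a worker answers a heartbeat with its current memory usage.
  recordPong(workerId: number, now: number, rssBytes: number): void {
    if (this.health.has(workerId)) {
      this.health.set(workerId, { lastPongAt: now, rssBytes });
    }
  }

  // Call when a worker exits or has been killed.
  unregister(workerId: number): void {
    this.health.delete(workerId);
  }

  // Workers the master should kill and re-fork, with the reason for each.
  findUnhealthy(now: number): Map<number, RestartReason> {
    const unhealthy = new Map<number, RestartReason>();
    this.health.forEach((h, id) => {
      if (now - h.lastPongAt > this.heartbeatTimeoutMs) {
        unhealthy.set(id, "unresponsive");
      } else if (h.rssBytes > this.maxRssBytes) {
        unhealthy.set(id, "memory");
      }
    });
    return unhealthy;
  }
}
```

In a real master, a timer would send each worker a ping with worker.send, the worker would reply through process.send with process.memoryUsage().rss, and the master would kill and re-fork whatever findUnhealthy returns. A worker stuck in an infinite loop never gets to process the ping, so its replies stop and it is flagged as unresponsive.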

3. Three‑Layer Disaster‑Recovery Strategy

If the guardians fail or downstream services become unavailable, the H5 page adopts three fallback layers:

Interface fallback: Successful responses are cached in Redis; when the backend API fails, the cached data is used.

HTML fallback: Middleware detects 5xx responses and serves the last good HTML version stored in Redis.

Node.js fallback: When the Node service returns 5xx, Nginx redirects traffic to a static HTML backup that has been pushed to CDN.

This multi‑level approach keeps core business functional even under failure.
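The first two layers can be sketched in application code as below. This is a minimal illustration, not the team's middleware: the FallbackStore interface and both function names are hypothetical, and an in-memory Map stands in for Redis so the example runs on its own. The third layer is configured in Nginx and is not shown.

```typescript
// Minimal key-value interface; production code would back it with Redis.
interface FallbackStore {
  get(key: string): Promise<string | undefined>;
  set(key: string, value: string): Promise<void>;
}

// In-memory stand-in for Redis (assumption, for a self-contained sketch).
function createMemoryStore(): FallbackStore {
  const data = new Map<string, string>();
  return {
    get: async (key) => data.get(key),
    set: async (key, value) => { data.set(key, value); },
  };
}

// Interface fallback: cache every successful response and serve the last
// good copy when the backend call fails.
async function fetchWithFallback(
  key: string,
  fetchFresh: () => Promise<string>,
  store: FallbackStore,
): Promise<{ body: string; fromFallback: boolean }> {
  try {
    const body = await fetchFresh();
    await store.set(key, body);
    return { body, fromFallback: false };
  } catch (err) {
    const cached = await store.get(key);
    if (cached === undefined) throw err; // nothing to fall back to
    return { body: cached, fromFallback: true };
  }
}

// HTML fallback: a 5xx page (or a render that throws) is replaced by the
// last good HTML; pages rendered with status 200 refresh the stored copy.
async function renderWithHtmlFallback(
  pageKey: string,
  render: () => Promise<{ status: number; html: string }>,
  store: FallbackStore,
): Promise<{ status: number; html: string }> {
  let result: { status: number; html: string };
  try {
    result = await render();
  } catch {
    result = { status: 500, html: "" };
  }
  if (result.status < 500) {
    if (result.status === 200) await store.set(pageKey, result.html);
    return result;
  }
  const lastGood = await store.get(pageKey);
  return lastGood === undefined ? result : { status: 200, html: lastGood };
}
```

Only successful results are written to the store, so a failing backend can never overwrite the last good copy.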

4. Caching

Three‑level caching is employed: CDN cache, Nginx proxy cache, and application‑layer Redis cache. CDN reduces latency and origin load, but two key concerns must be addressed:

Cache freshness: Properly update cache-control and last-modified headers, and monitor the status header to ensure timely revalidation.

Cache penetration and avalanche: Without a cache lock, a brief cache‑miss window can cause a flood of requests to the origin. Nginx’s built‑in cache lock (via proxy_cache_lock) mitigates this risk.

Redis is used for the third‑level HTML cache, providing a fallback when both CDN and Nginx caches miss. Considerations include cache version synchronization and optional cache‑lock configuration.
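At the application layer, a cache lock amounts to letting only one request rebuild a missing entry while the others wait for its result. The sketch below shows that idea inside a single Node.js process; the class name SingleFlightCache is hypothetical, and the in-process Map is an assumption standing in for the Redis HTML cache.

```typescript
// Coalesces concurrent misses for the same key into one load call.
class SingleFlightCache {
  private values = new Map<string, { value: string; expiresAt: number }>();
  private inFlight = new Map<string, Promise<string>>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  async get(key: string, load: () => Promise<string>): Promise<string> {
    // Fresh entry: serve it without touching the origin.
    const hit = this.values.get(key);
    if (hit && hit.expiresAt > this.now()) return hit.value;

    // Another caller is already loading this key: wait for its result.
    const pending = this.inFlight.get(key);
    if (pending) return pending;

    // First caller after a miss: load once and share the promise.
    const promise = load()
      .then((value) => {
        this.values.set(key, { value, expiresAt: this.now() + this.ttlMs });
        return value;
      })
      .finally(() => {
        this.inFlight.delete(key);
      });
    this.inFlight.set(key, promise);
    return promise;
  }
}
```

This lock only covers one process. Across cluster workers or machines the same effect needs a shared lock, for example a Redis key set with the NX option and an expiry, which is outside this sketch.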

5. Logging and Alerting

Comprehensive monitoring spans multiple layers:

Client‑side monitoring reports page quality, CGI quality, and user‑side metrics.

Reverse‑proxy (Nginx) reports traffic spikes, error ratios (4xx/5xx), and latency.

Request logs record total requests, failures, and average response time.

Node.js process logs capture crashes, memory leaks, and zombie processes.

Custom Node request logs help with root‑cause analysis.

Module‑level monitoring tracks error rates and response times of downstream services.

These logs enable rapid detection, diagnosis, and post‑mortem analysis of incidents.
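As a small illustration of the request-level metrics listed above (total requests, failures, 4xx/5xx ratios, average response time), the sketch below aggregates log entries and applies a simple alert rule. The field names, the choice to count only 5xx responses as failures, and the thresholds are all assumptions.

```typescript
// One request log record (field names are illustrative).
interface RequestLogEntry {
  status: number;
  durationMs: number;
}

interface RequestStats {
  total: number;
  failures: number;      // 5xx responses (assumption: 4xx is tracked separately)
  ratio4xx: number;
  ratio5xx: number;
  avgDurationMs: number;
}

// Aggregate a window of request logs into the metrics an alert rule needs.
function summarizeRequests(entries: RequestLogEntry[]): RequestStats {
  const total = entries.length;
  let count4xx = 0;
  let count5xx = 0;
  let durationSum = 0;
  for (const e of entries) {
    if (e.status >= 500) count5xx++;
    else if (e.status >= 400) count4xx++;
    durationSum += e.durationMs;
  }
  return {
    total,
    failures: count5xx,
    ratio4xx: total === 0 ? 0 : count4xx / total,
    ratio5xx: total === 0 ? 0 : count5xx / total,
    avgDurationMs: total === 0 ? 0 : durationSum / total,
  };
}

// Alert when the 5xx ratio or the average latency crosses its threshold.
function shouldAlert(stats: RequestStats, max5xxRatio: number, maxAvgMs: number): boolean {
  return stats.ratio5xx > max5xxRatio || stats.avgDurationMs > maxAvgMs;
}
```

Running summarizeRequests over a sliding window, for instance the last minute of logs, and passing the result to shouldAlert gives a basic per-window alert.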

6. Conclusion

Availability is a critical metric for both framework and business developers. While no system can be 100% reliable, Tencent Video’s architecture, built on process guardians, multi‑layer disaster recovery, and extensive monitoring, shows how to keep high‑traffic services stable and performant under massive concurrency.
