Ensuring Backend Stability: CDN, Source Site Design, and Operations in Dragonfly

This article examines how Dragonfly, a Node.js‑based source site for Taobao's CMS, achieves high stability through thoughtful system design, robust implementation, comprehensive testing, and effective logging and monitoring practices across CDN, Redis, and other dependencies.

Node Underground

Source Site and CDN

In a CDN architecture, the source site (origin) holds the original content; cache servers absorb user traffic but fetch that content from the source site. Dragonfly, the Node.js source site for Taobao's CMS, renders the pages that the cache servers then deliver.

System Design

This section outlines where Dragonfly sits within the Taobao CMS ecosystem and sketches its internal workflow.

The source site's external environment is simple. It relies on the CDN to absorb traffic and to provide disaster recovery for core pages, and on Redis to retrieve resources. Configuration Center and FileSync are weak dependencies that supply configuration and shared template fragments, respectively.

TMS supports multi‑terminal page delivery, which requires the CDN to distinguish between terminals. Dragonfly handles this through User‑Agent (UA) detection and provides a forced‑switch parameter for devices it cannot identify.

A review of this design surfaced three problems:

There is no input filtering module. User query parameters should be uniformly discarded so that every request renders in a consistent environment.

The page entry module depends on Redis, an unstable system, without any disaster‑recovery backup.

There is no unified exception handling module; the disaster‑recovery module only detects exceptions and does not handle them.

Design adjustments were made to address these issues.
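The input filtering idea can be sketched as an allowlist applied to the query string before rendering. This is a minimal illustration, not Dragonfly's actual module: the function name `filterQuery` and the single allowed key `terminal` are assumptions made for the example.

```javascript
// Hypothetical allowlist: the only query keys the renderer is allowed to see.
const ALLOWED_QUERY_KEYS = new Set(['terminal']);

/**
 * Return a copy of `query` that contains only allowlisted string values.
 * Every other parameter is dropped, so the same page always renders
 * from the same inputs regardless of what users append to the URL.
 */
function filterQuery(query, allowed = ALLOWED_QUERY_KEYS) {
  const clean = {};
  for (const [key, value] of Object.entries(query || {})) {
    if (allowed.has(key) && typeof value === 'string') {
      clean[key] = value;
    }
  }
  return clean;
}
```

Dropping unknown parameters, rather than rejecting the request, keeps cacheable pages identical without turning stray query strings into errors.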

Based on the new design, we ensured that modules depending on unstable systems are covered by disaster‑recovery mechanisms and identified verification items for implementation and operation:

Confirm external dependencies have fault‑tolerance strategies.

Ensure internal errors are logged correctly and handled appropriately per scenario.

Verify monitoring scripts are properly configured.

System Implementation

Following the design review, we evaluated the implementation.

External Dependency Disaster Recovery

Minimize the number of critical external dependencies; each must have detailed disaster‑recovery plans. Use the latest stable versions of third‑party modules to avoid bugs, integration issues, and performance degradation.

Specific safeguards:

CDN/Source Site

CDN operates many nodes; occasional node failures do not affect overall availability.

If the source site fails, CDN serves stale copies.

Terminal Detection

Uses User‑Agent to identify terminals; unknown devices may be misidentified, so a forced‑switch parameter is provided.
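A rough sketch of UA detection with a forced override might look like the following. The terminal names (`pc`, `mobile`), the regular expression, and the function name are illustrative assumptions; the article does not list the real patterns or parameter name.

```javascript
// Illustrative pattern only; a production detector needs a far more complete list.
const MOBILE_UA = /Android|iPhone|iPod|Windows Phone|Mobile/i;

// Hypothetical set of terminals the source site knows how to render for.
const KNOWN_TERMINALS = new Set(['pc', 'mobile']);

/**
 * Decide which terminal to render for.
 * `forced` is the value of the forced-switch parameter, if the request carried one.
 */
function detectTerminal(userAgent, forced) {
  // The forced switch wins, but only when it names a terminal we support.
  if (forced && KNOWN_TERMINALS.has(forced)) return forced;
  if (typeof userAgent === 'string' && MOBILE_UA.test(userAgent)) return 'mobile';
  // Unknown or missing User-Agent strings default to the desktop page.
  return 'pc';
}
```

Validating the forced value against a known set means a mistyped parameter falls back to normal detection instead of selecting a nonexistent template.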

Configuration Center

Dragonfly's configuration push system includes multi‑level disaster recovery from server to client.

A local fallback is also implemented in the source code.
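The local fallback can be pictured as defaults compiled into the code that a pushed configuration may override only when it is well formed. The keys `staticSwitch` and `redisTimeoutMs` and the function `resolveConfig` below are hypothetical names chosen for the sketch.

```javascript
// Hypothetical defaults shipped with the source code.
const LOCAL_DEFAULTS = Object.freeze({ staticSwitch: false, redisTimeoutMs: 100 });

/**
 * Merge a pushed configuration over local defaults.
 * A missing or malformed push yields the defaults unchanged.
 */
function resolveConfig(pushed, defaults = LOCAL_DEFAULTS) {
  if (pushed === null || typeof pushed !== 'object' || Array.isArray(pushed)) {
    return { ...defaults };
  }
  const resolved = { ...defaults };
  for (const key of Object.keys(defaults)) {
    // Accept a pushed value only if it has the same type as the default.
    if (typeof pushed[key] === typeof defaults[key]) {
      resolved[key] = pushed[key];
    }
  }
  return resolved;
}
```

Checking each key individually means one bad value in a push does not discard the valid values that arrived with it.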

FileSync

Synchronizes shared front‑end code fragments across CMS and applications.

Local copies enable fallback and manual updates during exceptions.

Redis

Redis itself performs well, but network latency caused frequent timeouts. Tests showed that enabling TCP Keep‑Alive, disabling Nagle's algorithm, and turning off Delayed ACK significantly improved performance.

Redis serves as the primary cache, with Aliyun OSS as a backup data source. If both Redis and OSS fail, the source site falls back to its local disaster‑recovery copies.
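The Redis, then OSS, then local order can be expressed as a generic fallback chain. In this sketch each source is assumed to be an object with a `name` and an async `get(key)` method; those shapes, and the function `readWithFallback`, are illustrative rather than Dragonfly's real client interfaces.

```javascript
/**
 * Try each source in order and return the first usable value.
 * A source that throws or returns null/undefined is logged and skipped.
 */
async function readWithFallback(key, sources, log = console.warn) {
  for (const source of sources) {
    try {
      const value = await source.get(key);
      if (value !== null && value !== undefined) {
        return { value, from: source.name };
      }
      log(`${source.name} returned no data for ${key}`);
    } catch (err) {
      log(`${source.name} failed for ${key}: ${err.message}`);
    }
  }
  // Every source was exhausted; the caller decides how to degrade further.
  throw new Error(`all sources failed for ${key}`);
}
```

Returning the name of the source that answered makes it easy to count how often the backups are actually used, which is useful input for monitoring.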

Internal Fault‑Tolerance

Standardize exception formats and handling as the foundation of disaster recovery. Implement degradation strategies for critical resource bottlenecks.

Dragonfly's approaches include:

Context Exception Handling

Log occurrences and use static fallback copies.
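As a sketch of this pattern, the following Koa-style middleware catches errors raised further down the chain, logs them, and answers with a static copy. Koa is an assumption here (the article body does not name the framework), and `loadStaticCopy` and `logger` are hypothetical dependencies passed in by the caller.

```javascript
/**
 * Build a middleware that converts downstream failures into a static fallback.
 * `loadStaticCopy(path)` resolves to the stored static page for that path.
 */
function fallbackOnError({ loadStaticCopy, logger }) {
  return async function handle(ctx, next) {
    try {
      await next();
    } catch (err) {
      // Record the failure first so it is never silently swallowed.
      logger.error({ path: ctx.path, message: err.message });
      ctx.status = 200;
      ctx.body = await loadStaticCopy(ctx.path);
    }
  };
}
```

Registering this middleware first means every later rendering step is covered by the same handling, which is the point of standardizing exception handling.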

Uncaught Exceptions

Log and trigger alerts, then restart the worker process.
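One way to sketch the log, alert, restart sequence is a handler factory whose dependencies are injected, so it can be exercised without killing a real process. The names `makeCrashHandler` and `notifyOnCall` are hypothetical; `process.on('uncaughtException', ...)` and the cluster `exit` event are standard Node.js APIs.

```javascript
/**
 * Build a handler for uncaught exceptions.
 * It logs, raises an alert, and always exits so a fresh worker can replace this one.
 */
function makeCrashHandler({ logger, alert, exit }) {
  return function onUncaught(err) {
    try {
      logger.error(`uncaught exception: ${err && err.stack ? err.stack : err}`);
      alert(err);
    } finally {
      // Exit non-zero even if logging or alerting itself failed.
      exit(1);
    }
  };
}

// Wiring in a worker process (shown as comments so the sketch has no side effects):
//   process.on('uncaughtException',
//     makeCrashHandler({ logger: console, alert: notifyOnCall, exit: process.exit }));
// Wiring in the cluster master, which forks a replacement when a worker dies:
//   cluster.on('exit', () => cluster.fork());
```

Exiting rather than continuing matters because a process that has hit an uncaught exception may be in an inconsistent state.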

Real‑time Backup

While serving page requests, refresh each page's static disaster‑recovery copy at most once every 10 seconds.
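A minimal sketch of the 10-second rule, assuming the interval is tracked per page and checked on each request. The function names are illustrative, and the actual writing of the copy is left to the caller.

```javascript
const BACKUP_INTERVAL_MS = 10 * 1000;

/**
 * Create a per-page throttle.
 * The returned function answers "should this request refresh the static copy?"
 */
function createBackupThrottle(intervalMs = BACKUP_INTERVAL_MS) {
  const lastBackupAt = new Map(); // page path -> timestamp of the last backup
  return function shouldBackup(pagePath, now = Date.now()) {
    const last = lastBackupAt.get(pagePath);
    if (last !== undefined && now - last < intervalMs) return false;
    lastBackupAt.set(pagePath, now);
    return true;
  };
}
```

A request handler would call `shouldBackup(ctx.path)` after a successful render and write the rendered HTML to the backup store only when it returns `true`, so busy pages do not generate a write per request.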

Memory Monitoring

Rendering creates many temporary strings, stressing garbage collection.

When memory usage is high and cannot be reclaimed promptly, force a worker restart.

Continuous optimization efforts are ongoing.
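The restart rule can be sketched as a watcher that requires several consecutive high readings before it asks for a restart, so a short spike that the garbage collector soon reclaims is ignored. The threshold, the strike count, and the function name are assumptions; `process.memoryUsage().rss` is the standard Node.js reading used as the default input.

```javascript
/**
 * Create a memory check.
 * It returns true once resident memory has exceeded `limitBytes`
 * for `maxStrikes` consecutive checks.
 */
function createMemoryWatch({ limitBytes, maxStrikes = 3 }) {
  let strikes = 0;
  return function check(rssBytes = process.memoryUsage().rss) {
    // Any reading under the limit resets the count: memory was reclaimed in time.
    strikes = rssBytes > limitBytes ? strikes + 1 : 0;
    return strikes >= maxStrikes;
  };
}
```

A worker could run the check on a timer, for example `setInterval(() => { if (check()) process.exit(1); }, 5000).unref()`, and rely on the master process to fork a replacement.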

Overload Degradation

When load is high, Dragonfly receives an Over‑Load header from Nginx and returns static copies directly.
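In Koa-style middleware (an assumption, as is the `loadStaticCopy` helper), the overload path might be sketched like this. The header name comes from the article; treating any non-empty value as the overload signal is a guess, since the article does not say what value Nginx sends.

```javascript
/**
 * Build a middleware that short-circuits rendering when Nginx signals overload.
 */
function overloadDegrade({ loadStaticCopy }) {
  return async function degrade(ctx, next) {
    // In Koa, ctx.get() returns the request header value, or '' when it is absent.
    if (ctx.get('Over-Load')) {
      ctx.body = await loadStaticCopy(ctx.path);
      return; // skip rendering entirely
    }
    await next();
  };
}
```

Placing this check ahead of the rendering middleware keeps the degraded path cheap: no template work and no Redis access happen while the machine is under pressure.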

Static Switch

A manual degradation switch, an idea borrowed from the development team, for handling large‑scale page failures caused by unknown bugs.

Testing (Acceptance)

Design should consider testability; a good design is easy to test.

Unit Testing

Unit tests do not slow development down; they are what makes sustained high‑speed development possible.

Coverage is a useful indicator of code quality and helps uncover hidden defects.

Tests must be thorough, starting from basic units, and updated promptly when requirements change.

Each test should be independent; mocking is a valuable technique.

Functional Testing

Functional tests verify that the system meets user requirements. For disaster‑recovery modules, online drills complement functional testing.

Performance Testing

Load‑testing platforms simulate real user traffic; any change causing noticeable performance degradation is blocked from release.

Continuous Integration

Automate testing by adopting a mature CI solution.

Logging and Monitoring (Maintenance)

Logging

Logs support monitoring and troubleshooting. They should be recorded from the perspective of operators, using a unified format, categorized by module, and centrally managed. Typical categories include diagnostic logs (e.g., config/redis/xtemplate), statistical logs (e.g., QPS/RT), and audit logs (e.g., user actions).

Key logging practices:

Remove useless logs.

Design logs to facilitate fault investigation.

After resolving an issue, review and improve log definitions.

Provide a detailed debug‑log switch on production machines for complex problem analysis.
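A unified, categorized log format could be sketched as one JSON object per line. The three categories mirror those named above; the field names and the function `formatLogLine` are assumptions made for the example.

```javascript
const LOG_CATEGORIES = new Set(['diagnostic', 'statistical', 'audit']);

/**
 * Format one log entry as a single JSON line.
 * Every entry carries the same leading fields so tools can parse all logs uniformly.
 */
function formatLogLine(category, moduleName, message, fields = {}, now = new Date()) {
  if (!LOG_CATEGORIES.has(category)) {
    throw new Error(`unknown log category: ${category}`);
  }
  return JSON.stringify({
    time: now.toISOString(),
    category,
    module: moduleName,
    message,
    ...fields,
  });
}
```

For example, `formatLogLine('diagnostic', 'redis', 'timeout', { key: 'page:1', costMs: 120 })` produces a line that monitoring scripts can filter by `category` and `module` without custom parsing per log file.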

Monitoring

Without monitoring, logs are ineffective. Effective monitoring should enable rapid issue resolution and be tuned based on operational experience to avoid excessive false alarms.

Conclusion

Even a small oversight can cause a massive system failure, but with proper planning, strict acceptance criteria, and continuous monitoring, ensuring system stability is achievable.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Node Underground

No language is immortal—Node.js isn’t either—but thoughtful reflection is priceless. This underground community for Node.js enthusiasts was started by Taobao’s Front‑End Team (FED) to share our original insights and viewpoints from working with Node.js. Follow us. BTW, we’re hiring.
