Ensuring Backend Stability: CDN, Source Site Design, and Operations in Dragonfly
This article examines how Dragonfly, a Node.js‑based source site for Taobao's CMS, achieves high stability through thoughtful system design, robust implementation, comprehensive testing, and effective logging and monitoring practices across CDN, Redis, and other dependencies.
Source Site and CDN
In a CDN architecture, the source site (origin) holds the original content; cache servers absorb the traffic but fetch that content from the source site. Dragonfly, Taobao's CMS source site built with Node.js, renders pages for the cache servers.
System Design
We outline Dragonfly's topology within the Taobao CMS ecosystem and sketch its internal workflow.
The external environment of the source site is simple: it interfaces with CDN for traffic handling and core page disaster recovery, and with Redis for resource retrieval. Configuration Center and FileSync provide configuration and shared template fragments as weak dependencies.
TMS supports multi‑terminal page delivery, requiring CDN to detect terminals. Dragonfly handles this via UA detection and provides a forced switch parameter for unknown devices.
Reviewing the initial design surfaced three problems:
There is no input‑filtering module; user query parameters should be uniformly discarded so that every request renders in a consistent environment.
The page entry module depends on Redis, an unstable system, without any disaster‑recovery backup.
There is no unified exception‑handling module; the disaster‑recovery module only detects exceptions and does not handle them.
Design adjustments were made to address these issues.
Based on the new design, we ensured that modules depending on unstable systems are covered by disaster‑recovery mechanisms and identified verification items for implementation and operation:
Confirm external dependencies have fault‑tolerance strategies.
Ensure internal errors are logged correctly and handled appropriately per scenario.
Verify monitoring scripts are properly configured.
System Implementation
Following the design review, we evaluated the implementation.
External Dependency Disaster Recovery
Minimize the number of critical external dependencies; each must have detailed disaster‑recovery plans. Use the latest stable versions of third‑party modules to avoid bugs, integration issues, and performance degradation.
Specific safeguards:
CDN/Source Site
CDN operates many nodes; occasional node failures do not affect overall availability.
If the source site fails, CDN serves stale copies.
Terminal Detection
Uses User‑Agent to identify terminals; unknown devices may be misidentified, so a forced‑switch parameter is provided.
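A minimal sketch of UA‑based detection with a forced‑switch override. The parameter name `forceTerminal`, the terminal labels, and the regular expression are assumptions for illustration; the article does not give Dragonfly's actual values.

```javascript
// Hypothetical terminal labels; the real set used by Dragonfly is not given in the article.
const TERMINALS = ['pc', 'mobile'];

// Deliberately coarse pattern; production detection would need a maintained UA list.
const MOBILE_UA = /Mobile|Android|iPhone|iPad|iPod/i;

/**
 * Decide which terminal a request should be rendered for.
 * @param {string|undefined} userAgent value of the User-Agent header
 * @param {Object} query parsed query-string parameters
 * @returns {string} one of TERMINALS
 */
function detectTerminal(userAgent, query = {}) {
  // The forced-switch parameter (assumed name: forceTerminal) wins over UA sniffing.
  if (TERMINALS.includes(query.forceTerminal)) {
    return query.forceTerminal;
  }
  // An unknown or missing UA falls back to the desktop page.
  return MOBILE_UA.test(userAgent || '') ? 'mobile' : 'pc';
}
```

Only whitelisted values of the override are honored, so an unexpected parameter cannot select a terminal that has no template.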
Configuration Center
Dragonfly's configuration push system includes multi‑level disaster recovery from server to client.
A local fallback is also implemented in the source code.
FileSync
Synchronizes shared front‑end code fragments across CMS and applications.
Local copies enable fallback and manual updates during exceptions.
Redis
Redis itself performs well, but network latency caused frequent timeouts. Tests showed that enabling TCP Keep‑Alive, disabling Nagle’s algorithm, and turning off Delayed ACK significantly improved performance.
Redis serves as the primary cache; Aliyun OSS is used as a backup data source. If both Redis and OSS fail, the source site falls back to local disaster‑recovery.
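The Redis, OSS, local fallback order can be sketched as an ordered list of data sources tried in turn. This is an illustration under assumptions, not Dragonfly's code: the source objects, their `fetch` methods, and the `withTimeout` helper are hypothetical, and real Redis and OSS clients would be wrapped behind the same interface.

```javascript
/**
 * Try each data source in order and return the first successful result.
 * Each source is { name, fetch(key) => Promise<string> } (hypothetical shape).
 */
async function fetchWithFallback(key, sources, log = console.warn) {
  const errors = [];
  for (const source of sources) {
    try {
      const value = await source.fetch(key);
      // Treat a missing value as a miss so the next source gets a chance.
      if (value !== null && value !== undefined) {
        return { value, from: source.name };
      }
      errors.push(`${source.name}: empty result`);
    } catch (err) {
      log(`[fallback] ${source.name} failed for ${key}: ${err.message}`);
      errors.push(`${source.name}: ${err.message}`);
    }
  }
  throw new Error(`all sources failed for ${key} (${errors.join('; ')})`);
}

/** Reject if the wrapped promise does not settle within ms milliseconds. */
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Wrapping the Redis call in `withTimeout` keeps a slow network round trip from stalling the request: the chain simply moves on to the backup source.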
Internal Fault‑Tolerance
Standardize exception formats and handling as the foundation of disaster recovery. Implement degradation strategies for critical resource bottlenecks.
Dragonfly's approaches include:
Context Exception Handling
Log occurrences and use static fallback copies.
Uncaught Exceptions
Log and trigger alerts, then restart the worker process.
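One way to express "log, alert, restart" with Node's cluster model is sketched below. The `log` and `alert` hooks and the grace period are assumptions; the article does not describe how Dragonfly wires these together.

```javascript
/**
 * Worker side: log, alert, then exit so the master can fork a replacement.
 * `log` and `alert` are hypothetical hooks; `proc` defaults to the real process.
 */
function installCrashHandler({ log, alert, proc = process, graceMs = 1000 }) {
  proc.on('uncaughtException', (err) => {
    log(`uncaughtException: ${err.stack || err}`);
    alert(err);
    // State is unreliable after an uncaught exception, so stop serving.
    // The short delay gives log and alert I/O a chance to flush.
    setTimeout(() => proc.exit(1), graceMs);
  });
}

/** Master side: replace any worker that dies unexpectedly. */
function superviseWorkers(cluster, log) {
  cluster.on('exit', (worker, code, signal) => {
    if (worker.exitedAfterDisconnect) return; // planned shutdown, do not respawn
    log(`worker ${worker.process.pid} died (${signal || code}), forking a new one`);
    cluster.fork();
  });
}
```

In a real application, `superviseWorkers(require('cluster'), console.error)` would run in the master process and `installCrashHandler` in each worker. Both functions take their dependencies as arguments so they can be exercised with fakes in unit tests.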
Real‑time Backup
While serving requests, refresh each page's static disaster‑recovery copy at most once every 10 seconds.
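A throttled snapshot of this kind might look like the sketch below. The `store` hook (disk, OSS, or anything else) and the function names are hypothetical; only the 10‑second interval comes from the article.

```javascript
/**
 * Keep at most one fresh static copy per page per interval.
 * `store(pageId, html)` is a hypothetical persistence hook.
 * `now` is injectable so the throttle can be tested with a fake clock.
 */
function createSnapshotter(store, intervalMs = 10000, now = Date.now) {
  const lastSaved = new Map(); // pageId -> timestamp of the last snapshot

  return function maybeSnapshot(pageId, html) {
    const t = now();
    const last = lastSaved.get(pageId);
    if (last !== undefined && t - last < intervalMs) {
      return false; // a recent copy already exists
    }
    lastSaved.set(pageId, t);
    store(pageId, html);
    return true;
  };
}
```

Calling `maybeSnapshot(pageId, html)` after every successful render keeps the backup close to current while bounding write volume to one copy per page per interval.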
Memory Monitoring
Rendering creates many temporary strings, stressing garbage collection.
When memory usage is high and cannot be reclaimed promptly, force a worker restart.
Continuous optimization efforts are ongoing.
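The restart rule can be sketched as a guard that fires only after memory stays above a limit for several consecutive checks, which approximates "cannot be reclaimed promptly". The threshold, the strike count, and the `onExceeded` hook are assumptions rather than Dragonfly's actual settings.

```javascript
/**
 * Build a check function that calls onExceeded() once memory has stayed
 * above limitBytes for maxStrikes consecutive checks.
 * readRss is injectable for tests; by default it reads the process RSS.
 */
function createMemoryGuard({
  limitBytes,
  maxStrikes = 3,
  readRss = () => process.memoryUsage().rss,
  onExceeded,
}) {
  let strikes = 0;
  return function check() {
    // A single spike is tolerated; GC may still reclaim it before the next check.
    strikes = readRss() > limitBytes ? strikes + 1 : 0;
    if (strikes >= maxStrikes) {
      strikes = 0;
      onExceeded(); // e.g. stop accepting requests, then exit so the master respawns
      return true;
    }
    return false;
  };
}
```

A worker would typically run the returned function on a timer, for example `setInterval(check, 5000)`.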
Overload Degradation
When load is high, Dragonfly receives an Over‑Load header from Nginx and returns static copies directly.
Static Switch
A manual degradation switch, an idea borrowed from the development team, for handling large‑scale page failures caused by unknown bugs.
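The overload header and the manual static switch both lead to the same outcome, serving the static copy, so they can be folded into one decision. In the sketch below the header is read as `over-load` because Node lowercases incoming header names; the accepted header values and the `staticSwitchOn` flag are assumptions.

```javascript
/**
 * Decide whether a request is rendered dynamically or served from the static copy.
 * @param {Object} headers req.headers as provided by Node (names are lowercased)
 * @param {boolean} staticSwitchOn state of the manual static switch (assumed flag)
 * @returns {'static'|'dynamic'}
 */
function chooseRenderMode(headers, staticSwitchOn = false) {
  // The manual switch overrides everything else.
  if (staticSwitchOn) return 'static';

  // Assumption: Nginx sets Over-Load to a non-empty value other than "0" under high load.
  const flag = headers['over-load'];
  if (flag !== undefined && flag !== '' && flag !== '0') return 'static';

  return 'dynamic';
}
```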
Testing (Acceptance)
Design should consider testability; a good design is easy to test.
Unit Testing
Unit tests should not hinder development efficiency; they are essential for sustained high‑speed development.
Coverage validates code quality and helps uncover hidden defects.
Tests must be thorough, starting from basic units, and updated promptly when requirements change.
Each test should be independent; mocking is a valuable technique.
Functional Testing
Functional tests verify that the system meets user requirements. For disaster‑recovery modules, online drills complement functional testing.
Performance Testing
Load‑testing platforms simulate real user traffic; any change causing noticeable performance degradation is blocked from release.
Continuous Integration
Automate testing by adopting a mature CI solution.
Logging and Monitoring (Maintenance)
Logging
Logs support monitoring and troubleshooting. They should be recorded from the perspective of operators, using a unified format, categorized by module, and centrally managed. Typical categories include diagnostic logs (e.g., config/redis/xtemplate), statistical logs (e.g., QPS/RT), and audit logs (e.g., user actions).
Key logging practices:
Remove useless logs.
Design logs to facilitate fault investigation.
After resolving an issue, review and improve log definitions.
Provide a detailed debug‑log switch on production machines for complex problem analysis.
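A unified, per‑module log format with a runtime debug switch could be sketched as follows. The JSON field names and the `DEBUG_LOG` environment variable are hypothetical choices, not the format Dragonfly uses.

```javascript
/**
 * Create a logger bound to one module (e.g. "config", "redis", "xtemplate").
 * Every record shares the same JSON shape so logs can be parsed centrally.
 */
function createLogger(module, {
  write = console.log,
  // Assumed switch: debug records are emitted only while DEBUG_LOG=1.
  debugEnabled = () => process.env.DEBUG_LOG === '1',
  now = () => new Date(),
} = {}) {
  function emit(level, message, fields = {}) {
    write(JSON.stringify({ time: now().toISOString(), level, module, message, ...fields }));
  }
  return {
    info: (message, fields) => emit('info', message, fields),
    error: (message, fields) => emit('error', message, fields),
    debug: (message, fields) => {
      if (debugEnabled()) emit('debug', message, fields);
    },
  };
}
```

Because `debugEnabled` is evaluated on every call, detailed logging can be turned on for a production machine without a redeploy, provided the switch is backed by something changeable at runtime such as a config push.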
Monitoring
Without monitoring, logs are ineffective. Effective monitoring should enable rapid issue resolution and be tuned based on operational experience to avoid excessive false alarms.
Conclusion
Even a small oversight can cause a massive system failure, but with proper planning, strict acceptance criteria, and continuous monitoring, ensuring system stability is achievable.
Node Underground
No language is immortal—Node.js isn’t either—but thoughtful reflection is priceless. This underground community for Node.js enthusiasts was started by Taobao’s Front‑End Team (FED) to share our original insights and viewpoints from working with Node.js. Follow us. BTW, we’re hiring.
