
Performance Optimization and Stability Enhancement of the Continuation Enrollment System

This article covers the background, the performance and stability requirements, the strategy, and the concrete initiatives (full‑chain load testing, chaos engineering, monitoring, and targeted optimization projects) undertaken to raise the performance of the continuation enrollment platform by more than 300% and to improve its availability.

New Oriental Technology

Sep 7, 2020


The continuation enrollment system (续班体系) serves as the primary entry point for students enrolling in New Oriental courses, integrating web‑registration, discount, and qualification services across multiple channels such as the app, WeChat, and mobile site.

Performance and stability requirements: To handle peak enrollment traffic and ensure a smooth user experience, the system needed significant performance optimization and high‑availability capabilities.

Strategic approach: The team defined two core capabilities, full‑chain load testing and production fault injection, and three concrete projects: discount performance optimization, monitoring enhancement, and post‑mortem review.

Capability 1 – Performance Optimization (Full‑Chain Load Testing & Baseline): Established a production‑grade load‑testing pipeline, built a virtual school environment for testing, and regularly measured performance baselines, resulting in a 300%+ overall performance increase.
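As an illustration of what regularly measuring a baseline can look like, here is a minimal Python sketch. It assumes each load-test run yields a list of per-request latencies and a run duration. The names `Baseline`, `summarize`, and `regressed`, and the 10% tolerance, are hypothetical and do not come from the original system.

```python
import math
from dataclasses import dataclass


@dataclass
class Baseline:
    tps: float      # requests completed per second
    p99_ms: float   # 99th-percentile latency in milliseconds


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_ms, duration_s):
    """Reduce one load-test run to the two figures tracked as a baseline."""
    return Baseline(tps=len(latencies_ms) / duration_s,
                    p99_ms=percentile(latencies_ms, 99))


def regressed(current, baseline, tolerance=0.10):
    """True if TPS fell or p99 rose by more than `tolerance` (assumed 10%)."""
    return (current.tps < baseline.tps * (1 - tolerance)
            or current.p99_ms > baseline.p99_ms * (1 + tolerance))
```

Storing one `Baseline` per release and calling `regressed` after each scheduled run is one way to turn the measurements into a pass/fail signal. Which metrics and thresholds matter would depend on the actual service.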

Capability 2 – High‑Availability (Chaos Testing): Introduced industry‑standard chaos engineering practices to inject faults, identify reliability gaps, and validate monitoring and SOP effectiveness.
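The idea of injecting a fault and checking that the system degrades as expected can be sketched in a few lines of Python. This is an illustrative assumption rather than the team's tooling: `with_fault`, `InjectedFault`, and `price_with_fallback` are hypothetical names, and real chaos experiments usually inject faults at the network or process level instead of wrapping a function.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during an experiment."""


def with_fault(call, failure_rate, rng=random.random):
    """Wrap `call` so that roughly `failure_rate` of invocations fail."""
    def wrapped(*args, **kwargs):
        if rng() < failure_rate:
            raise InjectedFault("injected dependency failure")
        return call(*args, **kwargs)
    return wrapped


def price_with_fallback(get_discount, list_price):
    """Return (price, degraded); charge list price if the lookup fails."""
    try:
        return list_price - get_discount(list_price), False
    except Exception:
        # Degraded path: enrollment continues without the discount.
        return list_price, True
```

Running the same request path with `failure_rate` set to 0.0 and then 1.0 shows whether the fallback, and any alert tied to the `degraded` flag, behaves as the SOP expects.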

Project 1 – Discount Performance: Identified the discount service as a bottleneck, used tools such as Arthas to generate CPU flame graphs, and performed code reviews. Optimizations included removing unnecessary deserialization, reducing object storage, optimizing ES reads, caching rules, and minimizing network I/O. Together these achieved a 15× TPS increase and reduced average response time to 8% of its original value.
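Of the optimizations listed above, rule caching is the easiest to show in isolation. The Python sketch below assumes discount rules are fetched and parsed by some loader function and can tolerate being a little stale. `RuleCache`, the `loader` callback, and the 60-second TTL are hypothetical, not details from the original service.

```python
import time


class RuleCache:
    """Keeps parsed discount rules in memory for `ttl_s` seconds."""

    def __init__(self, loader, ttl_s=60.0, clock=time.monotonic):
        self._loader = loader   # hypothetical: fetches and parses rules for a key
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries = {}      # key -> (expires_at, parsed_rules)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]     # hit: no network call, no deserialization
        rules = self._loader(key)   # miss or expired: load once, then reuse
        self._entries[key] = (now + self._ttl_s, rules)
        return rules
```

A cache like this trades freshness for throughput: every hit skips both the remote read and the parsing step, at the cost of rules being up to `ttl_s` seconds out of date. It is not thread-safe as written.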

Project 2 – Monitoring Enhancement: Built a layered monitoring dashboard shared by operations and development, implemented business log aggregation, multi‑dimensional analysis, and traceability to improve issue detection and root‑cause analysis.
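Multi‑dimensional analysis of aggregated business logs can be pictured as grouping records by whichever fields are of interest. The following Python sketch assumes each log record is a dict with `ok` and `latency_ms` fields plus arbitrary dimension fields such as `channel` or `service`. The record shape and the `aggregate` function are assumptions for illustration only.

```python
from collections import defaultdict


def aggregate(records, dimensions):
    """Group records by `dimensions`; report count, error rate, avg latency."""
    buckets = defaultdict(lambda: {"count": 0, "errors": 0, "total_ms": 0.0})
    for rec in records:
        key = tuple(rec[d] for d in dimensions)
        bucket = buckets[key]
        bucket["count"] += 1
        bucket["errors"] += 0 if rec["ok"] else 1
        bucket["total_ms"] += rec["latency_ms"]
    return {
        key: {"count": b["count"],
              "error_rate": b["errors"] / b["count"],
              "avg_ms": b["total_ms"] / b["count"]}
        for key, b in buckets.items()
    }
```

Calling `aggregate(records, ["channel"])` and then `aggregate(records, ["channel", "service"])` on the same data narrows a spike from a channel down to a specific service within it, which is the kind of drill-down a layered dashboard supports.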

Project 3 – Post‑Mortem Review: Conducted systematic retrospectives of production incidents, established communication and response mechanisms, improved release quality, and promoted coding standards and architectural best practices.

Overall impact: The combined efforts delivered substantial performance gains, increased system resilience, and cultivated a team with strong architectural awareness and expertise in performance tuning and reliability engineering.

Future directions: Continue pursuing horizontal scaling, enhance monitoring and degradation strategies, eliminate single points of failure, and adopt unit‑based deployments across multiple data centers to further elevate system performance and availability.
