Tagged articles
68 articles
Digital Planet
Apr 20, 2026 · Industry Insights

Can Wanglaoji’s “Five‑Code One” Digital Strategy Sustain Its Billion‑Yuan Growth?

The article analyzes Wanglaoji’s 2025 financial surge, its three‑pillar strategy anchored by digitalization, the technical mechanics and real‑time benefits of the “five‑code one” system, and the critical concurrency and data‑consistency challenges that could undermine the brand’s ambition to repeatedly break the hundred‑billion‑yuan revenue threshold.

Data Consistency · Digital Transformation · Fast‑moving Consumer Goods
14 min read
Deepin Linux
Dec 30, 2025 · Operations

Detecting and Fixing Linux Interrupt Stack Overflows

This article explains why interrupt stack overflows are dangerous in Linux, outlines their root causes, shows how to locate them using logs and debugging tools, and provides practical strategies to prevent and resolve the issue for stable kernel operation.

Linux · interrupt stack · system stability
41 min read
Su San Talks Tech
Oct 10, 2025 · Operations

How to Boost System Stability: Observability, Resilience, and High‑Availability Strategies

This comprehensive guide explains how to improve system stability and reduce online incidents by building observability, implementing distributed tracing, applying rate‑limiting and circuit‑breaker patterns, adopting blue‑green and gray deployments, managing data consistency with distributed transactions, planning capacity, optimizing performance, and preparing emergency response plans.

Deployment Strategies · Distributed Tracing · Distributed Transactions
19 min read
Architect
Sep 10, 2025 · Operations

Building System Stability: A Backend Engineer’s Guide to Risk Management

This article explores system stability from a backend perspective, defining its academic and engineering meanings, quantifying metrics like SLA, MTBF and MTTR, analyzing why stability matters, outlining the challenges faced, and presenting practical steps—including resource consensus, goal setting, awareness cultivation, production standards, monitoring, emergency response, and regular inspections—to effectively build and maintain stable systems.

Operations · monitoring · risk management
25 min read
dbaplus Community
Sep 3, 2025 · Operations

How to Build System Stability: Definitions, Challenges, and Practical Steps

This article explains what system stability means, why it matters, the difficulties of building it, and provides a detailed, step‑by‑step framework—including risk formulas, resource planning, monitoring, and emergency response—to help backend teams improve reliability and reduce business impact.

incident response · monitoring · risk management
23 min read
Java Architect Essentials
Jul 7, 2025 · Backend Development

Mastering Backpressure in Reactive Java: Prevent OOM and Crashes

During high‑traffic spikes, traditional servlet thread‑pool systems often collapse, but simply switching to Reactor isn’t enough: without proper backpressure control you’ll still face OOM and outages. This article explains why backpressure matters and offers practical dynamic rate‑limiting, bounded buffering, and circuit‑breaker solutions.

Spring Reactor · backpressure · reactive-programming
12 min read
ITPUB
Apr 10, 2025 · Operations

Why the KB5002700 Update Breaks Office 2016 and How to Fix It

Microsoft’s KB5002700 security update causes Office 2016 apps on Windows 10/11 to crash, lose data, and show calendar errors, prompting users to disable plugins, adjust Outlook settings, or completely uninstall the patch to restore stability.

KB5002700 · Office 2016 · Windows Update
5 min read
DeWu Technology
Mar 17, 2025 · Operations

Stability and Its Significance: Challenges and Practices for Building System Reliability

Building system stability requires quantifying risk through formulas, confronting challenges like low short‑term value and resource competition, and implementing a consensus‑driven framework that sets clear goals, cultivates awareness, enforces safety standards, ensures emergency response, conducts routine inspections, and applies sound architecture governance to continuously reduce inherent and change‑related risks.

process improvement · risk management · software reliability
25 min read
dbaplus Community
Mar 5, 2025 · Operations

How to Build a Resilient Content‑Risk System: From Diagnosis to Continuous Improvement

This article shares a step‑by‑step account of taking over a content‑risk stability role in early 2024, defining system stability, diagnosing recurring issues, and implementing a three‑phase framework—pre‑emptive reduction, impact mitigation, and post‑incident improvement—to boost success rates, cut incident response time, and achieve a modular architecture.

JVM Optimization · SRE · incident response
20 min read
JD Retail Technology
Jan 3, 2025 · Backend Development

Improving Software Architecture Efficiency: Stability, Performance, and Code Quality

Improving software architecture efficiency requires stable, orthogonal module design, performance‑focused refactoring that avoids tactical shortcuts, and disciplined layered code that balances business and domain responsibilities, while fostering modularization, decoupling, strict quality standards, and a collaborative culture of continuous improvement.

Software Architecture · backend design · system stability
12 min read
Zhuanzhuan Tech
Nov 29, 2024 · Operations

Why Use Prometheus and How It Guarantees Business System Stability

This article explains the motivations for adopting Prometheus, introduces its core components and metric types, and demonstrates how comprehensive monitoring of business‑critical data, failure events, QPS, latency, and underlying resources can improve system stability and accelerate fault response.

Java · Prometheus · system stability
13 min read
Efficient Ops
Nov 19, 2024 · Operations

Mastering System Stability: Proven SRE Practices for Reliable, High‑Availability Services

This article explains how system stability depends on architecture and code details, defines SLA and the “nines” metric, outlines Google’s SRE hierarchy, and provides practical governance steps—including development and release processes, high‑availability design, capacity planning, monitoring, incident response, and team culture—to achieve reliable, high‑availability services.

SRE · capacity planning · high availability
34 min read
Baidu Tech Salon
Oct 15, 2024 · Industry Insights

How Baidu Revamped Visual Search: From PHP to Golang and Graph Engine

This article details Baidu's visual search architecture evolution, covering the shift from a PHP/HHVM stack to Golang with the GDP framework, the adoption of the ExGraph graph engine, comprehensive system redesign, and stability infrastructure built to support rapid product iteration and AI model integration.

Backend Development · Golang · architecture
14 min read
JD Cloud Developers
Jul 12, 2024 · Operations

How Traffic Replay Safeguards Production Systems: Strategies and Best Practices

This article explores traffic recording and replay techniques, detailing their principles, benefits, risks, and practical guidelines—including filtering, deduplication, special‑scenario handling, real‑time vs offline diff, and mock strategies—to help teams ensure system stability and comprehensive test coverage.

Automation · system stability · traffic replay
12 min read
NetEase Cloud Music Tech Team
Jul 11, 2024 · Operations

Cloud Music Guizhou Data Center Migration: A Large-Scale Infrastructure Migration Case Study

In 2023 NetEase Cloud Music executed its largest‑ever data‑center migration, moving over 20,000 applications and more than one million queries per second to a new Guizhou facility. It met zero‑downtime requirements and strict latency and bandwidth limits through a batch‑wise, cross‑team strategy that incorporated automated upgrade platforms, standardized operations, and extensive risk‑mitigation measures.

Data Center Migration · Distributed Systems · Technical Debt
27 min read
JD Tech
Jul 8, 2024 · Operations

System Stability Practices: From Development to Production

This article outlines comprehensive system stability strategies for backend development, covering technical design reviews, key reliability techniques such as rate limiting, circuit breaking, timeout handling, isolation, and deployment safeguards like monitoring, gray releases, and rollback, aiming to reduce incidents and improve operational resilience.

incident response · monitoring · system stability
26 min read
Efficient Ops
Jul 7, 2024 · Operations

Boost Business Continuity and IT System Stability: Practical Strategies

This article explains business continuity concepts, outlines the risks to IT system stability, and provides actionable steps—such as expanding monitoring coverage, improving fault detection, enhancing architecture resilience, and strengthening emergency coordination—to ensure continuous operation despite inevitable failures.

business continuity · disaster recovery · fault management
7 min read
Huolala Tech
Jun 13, 2024 · Operations

How Huolala Achieved Zero Failures During Business Peaks for 3 Years

Huolala’s engineering team built a systematic, multi‑layered business‑peak assurance process—covering goal definition, project management, technical risk mitigation, cloud‑provider coordination, capacity planning, and post‑mortem analysis—that has kept its platform fault‑free for over three years of high‑traffic events.

peak reliability · risk management · system stability
19 min read
Beijing SF i-TECH City Technology Team
May 30, 2024 · Operations

Design and Practice of a Full-Link Load Testing Platform

This article describes the motivation, core design, technical choices, data and traffic isolation mechanisms, and implementation steps of a self‑developed full‑link load testing platform that enables production‑environment testing, reduces machine costs, and improves system stability and performance monitoring.

Data Isolation · Distributed Testing · Load Testing
11 min read
Qunar Tech Salon
Dec 20, 2023 · R&D Management

Digital Quality Measurement System at Qunar: Building, Implementing, and Operating a Comprehensive R&D Metrics Framework

This article details Qunar's end‑to‑end digital quality measurement system, describing how over 100 indicators were defined, filtered, and organized into a hierarchical model, how the platform ingests and visualizes data, and how continuous governance and PDCA cycles improve system stability and reduce complexity.

R&D metrics · complexity management · digital measurement
21 min read
JD Retail Technology
Nov 8, 2023 · Operations

Technical Strategies for Ensuring System Stability During Large‑Scale Promotional Events

The article analyzes the importance of system stability during major sales promotions, presents data‑driven insights on traffic and revenue, identifies key challenges such as massive traffic, data volume, and complex workflows, and offers comprehensive operational, application, storage, and monitoring measures to guarantee reliable performance under extreme load.

Deployment · database · large‑scale promotion
13 min read
JD Tech
Oct 13, 2023 · Operations

Implementing a Real-Time Pre-Alert Monitoring System to Improve Fund Trading System Stability

This article presents a practical pre‑alert monitoring solution for a high‑volume fund trading system, detailing how simple time‑based key‑point checks and targeted alerts reduce instant and end‑of‑day alarms, detect issues within 15 minutes, and enhance overall system stability and reconciliation efficiency.

fund‑trading · monitoring · pre‑alert
11 min read
Programmer DD
Sep 13, 2023 · Backend Development

When Microservice Refactoring Turns Into a Distributed Monolith: Risks and Remedies

The article examines why many microservice migrations end up as distributed monoliths, highlighting the hidden costs, stability issues, and common pitfalls such as excessive synchronous calls and lack of protection mechanisms, and offers practical guidance to avoid these traps.

Backend Architecture · distributed monolith · service decomposition
7 min read
JD Retail Technology
Jul 11, 2023 · Operations

Technical Strategies for Ensuring System Stability During the 618 Promotion

The article analyzes the importance of the 618 sales event, identifies factors that threaten system stability such as traffic spikes, massive data, complex scenarios, long delivery chains and low tolerance, and proposes comprehensive application, storage, and operational measures—including unitization, monitoring, logging, fast‑fail, rate‑limiting, degradation, database and cache designs, and emergency processes—to guarantee reliable service during the promotion.

Scalability · high availability · large‑scale promotion
14 min read
JD Cloud Developers
Jun 14, 2023 · Operations

How to Ensure System Stability During Mega Sales Events like 618

This article examines the technical and operational challenges of the 618 shopping festival, presenting data‑driven insights and detailed strategies—including modular deployment, monitoring, logging, fast‑failure, rate limiting, database and cache optimizations, and emergency response plans—to help teams maintain system stability under massive traffic spikes.

Operations · Scalability · large‑scale promotion
13 min read
Architecture Digest
May 11, 2023 · Backend Development

Design and Evolution of Vivo's Points Task System

This article details the conception, architectural evolution, and technical implementation of Vivo's points task system, covering its business model, Fogg behavior model, multi‑stage development, behavior SDK, data collection, rule engine, system stability measures, and future enhancements.

Points System · behavior SDK · data pipeline
14 min read
Baidu Geek Talk
Apr 17, 2023 · Operations

Baidu DuoLiXiong Platform Stability Construction: Practices and Insights

Baidu's DuoLiXiong platform, a SaaS suite for local services, achieves stability through comprehensive technical and business specifications, microservice best practices, rigorous code reviews, automated monitoring, eventual consistency, idempotency, and future automated scaling and intelligent fault tolerance for critical operations.

Code review · DevOps · Idempotency
11 min read
dbaplus Community
Mar 20, 2023 · Operations

How Xianyu’s Messaging Team Built a Zero‑Incident System with Gray Releases, Monitoring, and Automated Regression

The article details how Xianyu’s messaging team systematically improved system stability by classifying risks, implementing gray‑release traffic, establishing dedicated monitoring and alerting dashboards, integrating automated regression into CI/CD, and managing strong‑weak dependencies, ultimately reducing online incidents to near zero.

Operations · automated regression · dependency management
10 min read
Programmer DD
Mar 20, 2023 · Backend Development

Is Your ‘Distributed Monolith’ Undermining Microservice Benefits?

The article examines the pitfalls of turning a monolithic application into a ‘distributed monolith’ during microservice migration, highlighting how improper domain splitting, excessive synchronous remote calls, and lack of protective mechanisms can degrade stability and negate expected productivity gains.

Backend Architecture · Microservices · distributed monolith
7 min read
FunTester
Mar 13, 2023 · Operations

How Chaos Engineering Can Strengthen System Reliability: A Practical Guide

This article explains the origins and principles of chaos engineering, illustrates how fault‑injection scenarios expose system weaknesses, outlines step‑by‑step implementation—from tool selection and metric definition to execution and post‑mortem—and highlights its role in achieving high‑availability service level agreements.

DevOps · Distributed Systems · Fault Injection
10 min read
Ctrip Technology
Dec 15, 2022 · Operations

Practical Experience in Microservice Governance at Ctrip: Challenges, Strategies, and Results

This article shares Ctrip's practical experience in microservice governance, detailing the background, common pitfalls such as excessive service granularity and cyclic dependencies, and presenting concrete goals, principles, and strategies that led to significant improvements in stability, performance, and development efficiency.

Microservices · Performance Optimization · service governance
14 min read
Tencent Cloud Developer
Nov 24, 2022 · Backend Development

Kafka Stability Best Practices: Prevention, Monitoring, and Fault Resolution

This guide outlines Kafka stability best practices across three phases—pre‑prevention with tuning, producer/consumer guidelines, and cluster configuration; runtime monitoring using white‑box and black‑box metrics and alerts; and fault resolution strategies for backlogs, consumption blocks, and message loss, plus cost control and idempotence techniques.

Backend Development · Distributed Messaging · Kafka
29 min read
Xiaohe Frontend Team
Nov 14, 2022 · Operations

How to Classify and Prioritize Online Incidents for Better System Stability

Effective incident management begins with clear classification; this guide explains how technical leaders can categorize online failures by nature, severity, and source—distinguishing usability versus financial loss incidents, ranking P0‑P3 levels, and identifying external, operational, product, and system‑quality fault types—to improve stability and learning.

Operations · fault classification · system stability
4 min read
ByteDance Cloud Native
Aug 4, 2022 · Operations

Chaos Engineering Boosts Cloud‑Native Stability: Key Findings from China’s 2022 Survey

As cloud computing becomes essential infrastructure, cloud‑native systems gain flexibility but face stability challenges, prompting China’s Academy of Information and Communications Technology to launch a 2022 chaos engineering survey that uncovers vulnerabilities and promotes practical adoption of reliability techniques across the industry.

China · Cloud Native · chaos engineering
3 min read
Programmer DD
Aug 3, 2022 · Backend Development

Why Your “Distributed Monolith” Is Still a Single Point of Failure

The article explains how many microservice migrations end up as “distributed monoliths,” detailing why splitting a monolith without proper domain design and protective mechanisms can worsen stability, increase latency, and fail to deliver the promised efficiency gains.

Microservices · Software Architecture · distributed monolith
7 min read
Bilibili Tech
Jul 26, 2022 · Operations

Full-Link Pressure Testing Automation Practice for Bilibili's Live Streaming Gifting Business

Bilibili automated full‑link pressure testing for its high‑traffic live‑stream gifting service by adopting traffic co‑location with storage isolation, creating shadow tables, keys and topics, and building a three‑phase, three‑layer framework that analyses links, confirms configurations, and verifies end‑to‑end behavior across all services.

Automated Testing · Bilibili · Performance Testing
14 min read
AntTech
Apr 29, 2022 · Operations

Alipay Double‑11 System Stability Practices: Distributed Architecture, Elastic Scaling, Service Mesh, Full‑Chain Load Testing, Intelligent Monitoring, and OceanBase

The presentation details Alipay's evolution through three stability phases—capacity, elastic cloud‑native architecture, and green computing—covering unit‑based deployment, elastic scaling, ServiceMesh, full‑chain load testing, intelligent monitoring, and the OceanBase distributed database, illustrating how these techniques achieved 99.99% availability during the 2021 Double‑11 peak.

Cloud Native · Load Testing · OceanBase
11 min read
JD Retail Technology
Apr 27, 2022 · Industry Insights

How JD Achieves Seamless Stability During Massive Sales Events

The article reviews the Global Information System Stability Summit and JD technical architect Li Junliang's detailed case study on the engineering practices, observability, chaos engineering, and resource‑scheduling innovations that enable JD’s e‑commerce platform to handle sales‑peak traffic that spikes to hundreds of times the normal load.

Observability · chaos engineering · e‑commerce
7 min read
Bilibili Tech
Mar 4, 2022 · Operations

Stability Engineering Practices for Large-Scale Live Streaming: Bilibili's S11 World Championship Case Study

To deliver a flawless live broadcast of the 2021 League of Legends S11 World Championship to over 100 million viewers, Bilibili mobilized hundreds of engineers for four months, establishing strict standards, modeling dozens of user scenarios, estimating traffic, conducting layered stress and chaos tests, implementing automated and manual degradation, detailed SOPs, rate‑limiting safeguards, and on‑site monitoring, which together ensured system stability throughout the event.

degradation · rate limiting · stress testing
14 min read
DevOps
Dec 27, 2021 · Operations

2021 China Chaos Engineering Survey Report: Findings and Recommendations

Based on 1,016 valid questionnaire responses and 17 enterprise interviews, the 2021 China Chaos Engineering Survey Report reveals low software system stability, limited adoption of chaos engineering, its positive impact on availability, and provides data‑driven recommendations for improving stability through mature tools, metrics, and cultural shifts.

Cloud Native · Operations · chaos engineering
15 min read
Dada Group Technology
Oct 8, 2021 · Industry Insights

How Dada Built an Automated Business Config Center to Boost Stability and Efficiency

This article details Dada's journey from identifying costly business‑configuration pain points to designing and deploying an automated configuration center that isolates business settings, improves system stability, enhances robustness, accelerates development, secures data, and delivers measurable performance gains.

Automation · Backend Architecture · Configuration Management
19 min read
HelloTech
Sep 27, 2021 · Operations

Fault Drills and Chaos Engineering Practices for Enhancing System Stability

The initiative introduces fault‑drill and chaos‑engineering practices—defining steady‑state metrics, injecting real‑world failures in controlled experiments, automating continuous production tests, and limiting blast radius—to detect weaknesses early, accelerate fault location and recovery, boost emergency response metrics, and foster a resilient engineering culture.

Automation · Reliability · chaos engineering
11 min read
vivo Internet Technology
Jun 23, 2021 · Backend Development

Overview of Vivo Mall Promotion System Architecture and Technical Challenges

The article outlines Vivo Mall’s new independent promotion system architecture—introducing a unified discount model, flexible pricing engine, and scalable, high‑concurrency design—while detailing technical solutions such as Redis caching, batching, hot‑cold separation, rate‑limiting, idempotency, circuit‑breaker safeguards, and lessons learned from Redis SCAN and hot‑key issues.

Backend Architecture · Scalability · promotion system
12 min read
Programmer DD
Mar 12, 2021 · Backend Development

Why Your “Distributed Monolith” Undermines Microservice Success

Many companies attempt microservice transformations only to end up with a “distributed monolith” that looks like microservices but retains monolithic flaws, leading to increased complexity, reduced stability, and minimal performance gains. This article explains the root causes and how to avoid them.

Software Architecture · distributed monolith · system stability
8 min read
Alibaba Cloud Developer
Mar 8, 2021 · Operations

How to Ensure System Stability During Massive Sales Events: A Complete SRE Playbook

This article presents a comprehensive, step‑by‑step framework for guaranteeing system reliability during high‑traffic promotional periods, covering SRE hierarchy, stability criteria, profiling, monitoring, capacity planning, incident response, and post‑event analysis to help teams build resilient services.

SRE · capacity planning · incident response
21 min read
Alibaba Cloud Developer
Jan 27, 2021 · Operations

How to Build Sustainable System Stability: Architecture, Ops, and Team Practices

This article shares practical insights from a technical leader on designing robust system architecture, implementing comprehensive capacity planning, establishing reliable operations processes, strengthening security, and cultivating team awareness to achieve long‑term stability for large‑scale internet services.

Operations · Software Engineering · architecture design
24 min read
Youzan Coder
Dec 9, 2020 · Operations

A DevOps Engineer's Journey: From Middleware to Business Operations at YouZan

The article chronicles a YouZan DevOps engineer’s five‑year evolution from Alibaba‑based middleware duties to business‑operation leadership, highlighting the relentless pursuit of system stability through the 1‑minute detection, 5‑minute localization, 10‑minute resolution mantra, complex multi‑datacenter integrations, continuous learning, and a mindset of proactive problem‑solving.

Career Development · DevOps · Operations Engineering
7 min read
Didi Tech
Nov 25, 2020 · Backend Development

Design and Architecture of DiDi's Pricing System

DiDi’s unified, distributed pricing platform delivers accurate pre‑trip estimates, real‑time billing, and detailed invoices across multiple ride‑hailing and bike‑sharing services by leveraging a stateless core engine, flexible Apollo‑based configuration, modular micro‑services, high‑availability data stores, and open‑pricing/price‑rights mechanisms to ensure stability, accuracy, and rapid feature rollout.

Backend Engineering · Price Optimization · Pricing System
18 min read
Qunar Tech Salon
Aug 27, 2020 · Databases

Qunar Technology Carnival Interview Series: Insights on Hotel Flow Optimization, Database Architecture, and System Stability

The article presents a series of interviews from Qunar's Technology Carnival, featuring experts Liang Zhangping, Wang Zhufeng, and Zheng Jimin who discuss hotel booking flow improvements, database architecture comparisons and migration to PXC, and comprehensive system stability governance practices.

Infrastructure · Qunar · Technology Carnival
13 min read
dbaplus Community
Aug 17, 2020 · Operations

Master Server Troubleshooting: Diagnose, Optimize, and Keep Your Backend Stable

This article shares practical experience on backend troubleshooting, outlining common failure types, a step‑by‑step diagnosis workflow, essential tools, and systematic optimization techniques for performance, stability and maintainability, helping engineers quickly stop losses, pinpoint root causes, and implement robust fixes.

Backend · Operations · maintainability
21 min read
Alibaba Cloud Developer
Mar 11, 2020 · Backend Development

How Alibaba Handles Million‑Ticket Flash Sales with Scalable Backend Architecture

This article explains how Alibaba's entertainment cloud platform designs layered backend architecture, hotspot data isolation, flow‑shaping funnels, multi‑level caching, and comprehensive stability measures to support ultra‑high‑concurrency ticket sales while preventing oversell and ensuring system reliability.

Backend Architecture · anti‑oversell · high concurrency
0 likes · 11 min read
Alibaba Cloud Developer
Feb 17, 2020 · Operations

How Hema Achieved Zero‑Failure Smart Scheduling: Lessons in System Stability

This article details Hema's approach to guaranteeing system stability for its offline and delivery operations, covering the complete smart‑dispatch architecture, exhaustive dependency analysis, database and middleware safeguards, monitoring strategies, gray‑release practices, testing methods, and emergency response procedures that together enabled a year of zero failures.

Backend Architecture · Database Optimization · Microservices
0 likes · 24 min read
Architecture Digest
Aug 26, 2019 · Operations

Ensuring System Stability for High‑Scale Services: Full‑Link Load Testing at Gaode

The article describes how Gaode handles the challenges of supporting over 100 million daily active users by applying capacity planning, traffic control, disaster recovery, monitoring, rehearsal, and a self‑built full‑link load‑testing platform that simulates realistic traffic, manages resources, and provides detailed reporting to guarantee system stability.

Gaode · Load Testing · full‑link testing
0 likes · 20 min read
Amap Tech
Aug 20, 2019 · Operations

Full‑Link Load Testing and Stability Assurance at Gaode: Architecture, Practices, and Future Directions

To guarantee stability for over 100 million daily users, Gaode combines capacity planning, traffic control, disaster recovery, monitoring, and pre‑plan drills with a self‑built full‑link load‑testing platform (TestPG) that replays realistic traffic in production‑like environments, isolates test loads, provides rapid configuration, detailed debugging, automated error capture, and comprehensive reporting, while planning future enhancements such as integrated topology monitoring, advanced pressure models, and confidence evaluation.

Distributed Systems · Load Testing · capacity planning
0 likes · 20 min read
DataFunTalk
May 30, 2019 · Artificial Intelligence

Data Annotation, Data‑Driven Development, and Decision‑Making in Autonomous Driving

The talk explains how massive, well‑annotated data fuels autonomous‑driving AI, covering data annotation metrics, team structure, efficiency‑boosting techniques, system stability, and how data‑driven development and decision‑making improve model training, evaluation, and product priorities.

artificial intelligence · autonomous driving · data annotation
0 likes · 9 min read
JD Tech
Dec 17, 2018 · Operations

Improving JD Intelligent Supply Chain Efficiency and System Stability for Major Sales Events

The article details JD's intelligent supply chain enhancements—including machine‑learning demand forecasting, a new "explosive product warehouse" model, non‑stock fulfillment visualization, blockchain‑based product traceability, and comprehensive system‑stability measures such as data‑consistency checkpoints, throughput buffering, and 24/7 incident response—to boost efficiency and reliability during large‑scale promotions.

Big Data · Blockchain · Operations
0 likes · 7 min read
Baidu Intelligent Testing
May 4, 2018 · Operations

Common Architectural Design Risks and Mitigation Strategies for System Stability

This article analyses fifteen typical architectural design risks—such as duplicate interactions, high‑frequency calls, redundant requests, non‑reentrant interfaces, unreasonable timeouts, retry misconfigurations, IP direct‑connect, cross‑datacenter calls, weak/strong dependencies, third‑party reliance, cache penetration, cache avalanche, and coupling issues—explaining their definitions, impacts, detection methods, and concrete mitigation measures with real‑world Baidu cases to help engineers improve system stability.

Backend · architecture · risk management
0 likes · 27 min read
Meitu Technology
Jul 27, 2017 · Backend Development

Architecture Evolution of Meipai Live Streaming Barrage System Supporting Millions of Concurrent Users

The article traces Meipai’s live‑streaming barrage system from its rapid 2016 launch through successive architectural refinements that enabled it to sustain millions of concurrent users, handle extreme read‑write loads during celebrity streams, and achieve stable, high‑performance service at massive scale.

Backend Development · Scalability · architecture evolution
0 likes · 2 min read
Java High-Performance Architecture
Mar 4, 2016 · Backend Development

Why Service Gateways Are Essential for Scalable Microservice Architectures

The article explains how breaking a monolithic website into independent microservices improves stability, resource utilization, and deployment speed, but introduces client‑side complexity that can be solved by introducing a service gateway to aggregate APIs, enhance security, and simplify maintenance.

API Aggregation · Backend Architecture · Microservices
0 likes · 4 min read