Tag

System Stability

DeWu Technology
Mar 17, 2025 · Operations

Stability and Its Significance: Challenges and Practices for Building System Reliability

Building system stability requires quantifying risk through formulas, confronting challenges like low short‑term value and resource competition, and implementing a consensus‑driven framework that sets clear goals, cultivates awareness, enforces safety standards, ensures emergency response, conducts routine inspections, and applies sound architecture governance to continuously reduce inherent and change‑related risks.

System Stability · operations · process improvement
25 min read
JD Retail Technology
Jan 3, 2025 · Backend Development

Improving Software Architecture Efficiency: Stability, Performance, and Code Quality

Improving software architecture efficiency requires stable, orthogonal module design, performance‑focused refactoring that avoids tactical shortcuts, and disciplined layered code that balances business and domain responsibilities, while fostering modularization, decoupling, strict quality standards, and a collaborative culture of continuous improvement.

System Stability · backend design · performance optimization
12 min read
Zhuanzhuan Tech
Nov 29, 2024 · Operations

Why Use Prometheus and How It Guarantees Business System Stability

This article explains the motivations for adopting Prometheus, introduces its core components and metric types, and demonstrates how comprehensive monitoring of business‑critical data, failure events, QPS, latency, and underlying resources can improve system stability and accelerate fault response.

Java · Prometheus · System Stability
13 min read
Efficient Ops
Nov 19, 2024 · Operations

Mastering System Stability: Proven SRE Practices for Reliable, High‑Availability Services

This article explains how system stability depends on architecture and code details, defines SLA and the “nines” metric, outlines Google’s SRE hierarchy, and provides practical governance steps—including development and release processes, high‑availability design, capacity planning, monitoring, incident response, and team culture—to achieve reliable, high‑availability services.

Capacity Planning · SRE · System Stability
34 min read
NetEase Cloud Music Tech Team
Jul 11, 2024 · Operations

Cloud Music Guizhou Data Center Migration: A Large-Scale Infrastructure Migration Case Study

In 2023 NetEase Cloud Music executed its largest‑ever data‑center migration, moving over 20,000 applications and more than one million queries per second to a new Guizhou facility. The migration met zero‑downtime requirements and strict latency and bandwidth limits through a batch‑wise, cross‑team strategy that incorporated automated upgrade platforms, standardized operations, and extensive risk‑mitigation measures.

DevOps · Distributed Systems · System Stability
27 min read
JD Tech
Jul 8, 2024 · Operations

System Stability Practices: From Development to Production

This article outlines comprehensive system stability strategies for backend development, covering technical design reviews, key reliability techniques such as rate limiting, circuit breaking, timeout handling, isolation, and deployment safeguards like monitoring, gray releases, and rollback, aiming to reduce incidents and improve operational resilience.

Deployment · System Stability · backend development
26 min read
Efficient Ops
Jul 7, 2024 · Operations

Boost Business Continuity and IT System Stability: Practical Strategies

This article explains business continuity concepts, outlines the risks to IT system stability, and provides actionable steps—such as expanding monitoring coverage, improving fault detection, enhancing architecture resilience, and strengthening emergency coordination—to ensure continuous operation despite inevitable failures.

Business Continuity · IT Operations · System Stability
7 min read
Beijing SF i-TECH City Technology Team
May 30, 2024 · Operations

Design and Practice of a Full-Link Load Testing Platform

This article describes the motivation, core design, technical choices, data and traffic isolation mechanisms, and implementation steps of a self‑developed full‑link load testing platform that enables production‑environment testing, reduces machine costs, and improves system stability and performance monitoring.

Data Isolation · Distributed Testing · Performance Testing
11 min read
Qunar Tech Salon
Dec 20, 2023 · R&D Management

Digital Quality Measurement System at Qunar: Building, Implementing, and Operating a Comprehensive R&D Metrics Framework

This article details Qunar's end‑to‑end digital quality measurement system, describing how over 100 indicators were defined, filtered, and organized into a hierarchical model, how the platform ingests and visualizes data, and how continuous governance and PDCA cycles improve system stability and reduce complexity.

R&D metrics · System Stability · complexity management
21 min read
JD Retail Technology
Nov 8, 2023 · Operations

Technical Strategies for Ensuring System Stability During Large‑Scale Promotional Events

The article analyzes the importance of system stability during major sales promotions, presents data‑driven insights on traffic and revenue, identifies key challenges such as massive traffic, data volume, and complex workflows, and offers comprehensive operational, application, storage, and monitoring measures to guarantee reliable performance under extreme load.

Caching · Database · Deployment
13 min read
JD Tech
Oct 13, 2023 · Operations

Implementing a Real-Time Pre-Alert Monitoring System to Improve Fund Trading System Stability

This article presents a practical pre‑alert monitoring solution for a high‑volume fund trading system, detailing how simple time‑based key‑point checks and targeted alerts reduce instant and end‑of‑day alarms, improve issue detection within 15 minutes, and enhance overall system stability and reconciliation efficiency.

System Stability · fund‑trading · monitoring
11 min read
macrozheng
Aug 22, 2023 · Backend Development

Why “Distributed Monoliths” Fail: Hidden Pitfalls of Microservice Refactoring

The article examines the concept of a "distributed monolith," explains why superficial microservice migrations often degrade stability and productivity, and outlines the root causes and best‑practice strategies to avoid these common architectural mistakes.

Microservices · System Stability · architecture
8 min read
JD Retail Technology
Jul 11, 2023 · Operations

Technical Strategies for Ensuring System Stability During the 618 Promotion

The article analyzes the importance of the 618 sales event, identifies factors that threaten system stability such as traffic spikes, massive data, complex scenarios, long delivery chains and low tolerance, and proposes comprehensive application, storage, and operational measures—including unitization, monitoring, logging, fast‑fail, rate‑limiting, degradation, database and cache designs, and emergency processes—to guarantee reliable service during the promotion.

System Stability · high availability · large-scale promotion
14 min read
Architecture Digest
May 11, 2023 · Backend Development

Design and Evolution of Vivo's Points Task System

This article details the conception, architectural evolution, and technical implementation of Vivo's points task system, covering its business model, Fogg behavior model, multi‑stage development, behavior SDK, data collection, rule engine, system stability measures, and future enhancements.

Data Pipeline · System Stability · backend architecture
14 min read
Baidu Geek Talk
Apr 17, 2023 · Operations

Baidu DuoLiXiong Platform Stability Construction: Practices and Insights

Baidu's DuoLiXiong platform, a SaaS suite for local services, achieves stability through comprehensive technical and business specifications, microservice best practices, rigorous code reviews, automated monitoring, eventual consistency, idempotency, and future automated scaling and intelligent fault tolerance for critical operations.

Code Review · DevOps · Distributed Systems
11 min read
Ctrip Technology
Dec 15, 2022 · Operations

Practical Experience in Microservice Governance at Ctrip: Challenges, Strategies, and Results

This article shares Ctrip's practical experience in microservice governance, detailing the background, common pitfalls such as excessive service granularity and cyclic dependencies, and presenting concrete goals, principles, and strategies that led to significant improvements in stability, performance, and development efficiency.

Microservices · System Stability · performance optimization
14 min read
Tencent Cloud Developer
Nov 24, 2022 · Backend Development

Kafka Stability Best Practices: Prevention, Monitoring, and Fault Resolution

This guide outlines Kafka stability best practices across three phases: prevention through tuning, producer and consumer guidelines, and cluster configuration; runtime monitoring using white‑box and black‑box metrics and alerts; and fault resolution strategies for backlogs, consumption blocks, and message loss. It also covers cost control and idempotence techniques.

Distributed Messaging · Kafka · Message Queue
29 min read
Ctrip Technology
Sep 1, 2022 · Backend Development

Improving Supplier Integration Efficiency and System Stability in Ctrip's Direct Connection Platform

This article presents Ctrip's backend engineering practices for the Direct Connection Platform, detailing how a sandbox testing environment, automated acceptance, rate‑limiting, and circuit‑breaking mechanisms were introduced to boost supplier onboarding speed and enhance overall system stability.

API Integration · Ctrip · Sandbox
14 min read
ByteDance Cloud Native
Aug 4, 2022 · Operations

Chaos Engineering Boosts Cloud‑Native Stability: Key Findings from China’s 2022 Survey

As cloud computing becomes essential infrastructure, cloud‑native systems gain flexibility but face stability challenges, prompting China’s Academy of Information and Communications Technology to launch a 2022 chaos engineering survey that uncovers vulnerabilities and promotes practical adoption of reliability techniques across the industry.

Chaos Engineering · China · Survey
3 min read
Bilibili Tech
Jul 26, 2022 · Operations

Full-Link Pressure Testing Automation Practice for Bilibili's Live Streaming Gifting Business

Bilibili automated full‑link pressure testing for its high‑traffic live‑stream gifting service by adopting traffic co‑location with storage isolation, creating shadow tables, keys and topics, and building a three‑phase, three‑layer framework that analyses links, confirms configurations, and verifies end‑to‑end behavior across all services.

Bilibili · Live Streaming · Performance Testing
14 min read