Tagged articles
68 articles
Cognitive Technology Team
Apr 3, 2025 · Fundamentals

Understanding CAP Theory and BASE: Data Consistency in Distributed Systems

This article explains the CAP theorem and its practical extension BASE, describing their core concepts, trade‑off combinations, typical components such as Zookeeper, Eureka, and Nacos, and engineering techniques like asynchronous replication, Saga, and idempotent design for building highly available distributed systems.

Availability, BASE, CAP theorem
0 likes · 5 min read
Efficient Ops
Feb 6, 2025 · Operations

Inside Alipay’s Full‑Ecosystem Availability Monitoring: Architecture and Practices

At the 2024 GOPS Global Operations Conference in Shanghai, Alipay’s monitoring lead Tang Liang presented the challenges, architecture, risk‑prevention practices, and implementation details of the company’s full‑ecosystem availability monitoring system, highlighting its role in DevOps, SRE, and AIOps initiatives.

Availability, Cloud Native, DevOps
0 likes · 4 min read
Architecture and Beyond
Feb 6, 2025 · Operations

Analyzing DeepSeek’s Availability Issues and Applying Traditional Internet Reliability Strategies to AIGC

This article examines DeepSeek’s frequent service interruptions, contrasts the inherent reliability challenges of AIGC products with traditional internet applications, and proposes adopting proven isolation, rate‑limiting, and elastic‑scaling techniques to improve AI service availability and user experience.

AIGC, Availability, DeepSeek
0 likes · 12 min read
IT Architects Alliance
Jan 6, 2025 · Fundamentals

Mastering the CAP Theorem: Balancing Consistency, Availability, and Partition Tolerance

An in‑depth guide explains the CAP theorem’s three pillars—Consistency, Availability, Partition Tolerance—illustrates why only two can be achieved simultaneously, and shows real‑world trade‑offs across e‑commerce, finance, and social platforms, while introducing the complementary BASE model for practical system design.

Availability, BASE model, CAP theorem
0 likes · 15 min read
Open Source Linux
Oct 11, 2024 · Operations

Essential IT Operations Metrics: Definitions, Formulas, and Benchmarks

This article explains why operations metrics are vital for businesses, describes how tracking availability, failure rate, MTTR, MTBF, response time, throughput, error rate, capacity utilization, latency, data integrity, backup success, recovery time, security patch time, server and network utilization can improve reliability, reduce costs, and boost competitiveness.

Availability, IT Operations, MTBF
0 likes · 7 min read
JD Cloud Developers
Sep 4, 2024 · Backend Development

Mastering High‑Performance, High‑Concurrency Backend Systems: Methodologies & Practices

This article explores the evolution of software complexity and presents a comprehensive backend development methodology for building high‑performance, high‑concurrency, and highly available systems, covering performance optimization, read/write strategies, scaling techniques, fault isolation, and deployment practices with real‑world examples.

Availability, Backend, System Design
0 likes · 25 min read
Liangxu Linux
Aug 1, 2024 · Operations

Essential Operations Metrics Every IT Team Should Track

This guide outlines key operational metrics—availability, failure rate, MTTR, MTBF, response time, throughput, error rate, capacity utilization, latency, data integrity, and more—explaining their calculations, typical benchmark values, and practical application areas to help organizations monitor and improve IT performance.

Availability, MTTR, Operations
0 likes · 6 min read
Tencent Cloud Developer
Jul 17, 2024 · Operations

Combining FMEA and Chaos Engineering to Improve Software Architecture Availability

By integrating the proactive, static risk assessment of Failure Mode and Effects Analysis with the dynamic fault‑injection validation of chaos engineering, the article demonstrates how cloud‑native architectures—illustrated through a Tencent‑based e‑commerce case—can systematically identify, quantify, and mitigate availability risks, leading to continuous, measurable resilience improvements.

Availability, FMEA, Risk analysis
0 likes · 16 min read
Efficient Ops
May 29, 2024 · Operations

Essential Operations Metrics Every IT Team Should Track

In today’s competitive business landscape, tracking key operations metrics—such as availability, failure rate, MTTR, MTBF, response time, throughput, error rate, and various utilization and data integrity measures—helps organizations monitor performance, reduce costs, ensure reliability, and maintain regulatory compliance.

Availability, IT performance, monitoring
0 likes · 7 min read
Efficient Ops
May 12, 2024 · Operations

From Firefighting to Fire‑Starting: Mastering Operations for System Reliability

The article outlines a three‑stage evolution of operations—from rapid incident response to proactive fault‑injection—while offering practical guidance on improving availability, visualizing changes, and aligning technical metrics with business value to elevate the role of operations engineers.

Availability, Fault Injection, SRE
0 likes · 7 min read
Efficient Ops
Feb 1, 2024 · Operations

How Tencent’s Public Gateway Overcomes Extreme Availability Challenges

The article details the architecture of Tencent's Public Gateway (TGW), including its forwarding and control planes, and presents two real‑world extreme failure cases (a NIC batch bug and a special IPv6 packet that caused core dumps), along with the multi‑level disaster‑recovery design and mitigation strategies employed to ensure high availability.

Availability, Tencent Cloud, disaster recovery
0 likes · 8 min read
Senior Tony
Nov 14, 2023 · Operations

Master Availability, Reliability, and Stability for High‑Availability Systems

Understanding the differences between system availability, reliability, and stability is essential for building resilient services; this guide explains each concept, illustrates their distinctions with examples, and outlines practical strategies such as rate limiting, anti‑scraping, timeout settings, system inspections, and fault post‑mortems to reduce failures and downtime.

Availability, Reliability, high availability
0 likes · 11 min read
Tencent Cloud Developer
Sep 13, 2023 · Cloud Native

Designing and Implementing a Payment Fund Account System

The article details how to design and implement a cloud‑native payment fund account system on Tencent Cloud, covering account definitions, fund flow and multiple account types, TDSQL storage, separated fund and account services, robust security, distributed transactions, auditing, reconciliation, and high‑availability measures for high‑concurrency merchant payments.

Availability, Consistency, Security
0 likes · 35 min read
JD Cloud Developers
Sep 13, 2023 · Operations

Stability Engineering Explained: From Entropy Theory to Practical SRE

The article explores why building system stability is crucial by linking entropy theory to software reliability, introduces the availability formula, discusses common pitfalls and industry practices, and proposes a three‑stage governance framework—prevention, mitigation, and post‑mortem—to systematically improve operational resilience.

Availability, Operations, Reliability
0 likes · 13 min read
MaGe Linux Operations
Aug 26, 2023 · Backend Development

How Tencent’s PC & Mobile Payment Architecture Evolved to Support Billions

This article traces the evolution of Tencent's payment platform from its early PC‑centric design through three mobile payment phases, detailing architectural generations, availability measures, multi‑active strategies, and cloud‑native innovations that enable massive, reliable transaction processing.

Availability, Backend Development, Cloud Native
0 likes · 14 min read
MaGe Linux Operations
Jun 30, 2023 · Operations

What Went Wrong When Vipshop Crashed? Lessons on High‑Concurrency Failures

The article examines the March 29 Vipshop data‑center outage that caused over a billion‑yuan loss, explains the cooling‑system failure that triggered a 12‑hour P0 incident, discusses its impact on Tencent services, and analyzes why high‑concurrency crashes remain common, offering availability tier insights and mitigation strategies.

Availability, Operations, high concurrency
0 likes · 7 min read
政采云技术
Apr 27, 2023 · Backend Development

Understanding CAP Theorem, BASE Theory, and Their Implementation with Zookeeper (CP) and Eureka (AP)

This article explains the CAP theorem and its trade‑offs, introduces the BASE model as a practical compromise, and demonstrates how Zookeeper implements a CP registration center while Eureka adopts an AP approach, illustrating the impact on consistency, availability, and partition tolerance in distributed systems.

Availability, BASE theory, CAP theorem
0 likes · 12 min read
Continuous Delivery 2.0
Mar 9, 2023 · Fundamentals

Ten Essential Software Architecture Quality Attributes

The article explains ten key non‑functional quality attributes of software architecture—such as scalability, availability, consistency, resilience, usability, observability, security, persistence, agility, and maintainability—describing their meanings, typical implementation techniques, and why selecting the right attributes is crucial for any system.

Availability, Non-functional Requirements, Scalability
0 likes · 9 min read
dbaplus Community
Jan 16, 2023 · Operations

Beyond Success‑Ratio: How User‑Uptime Reveals Real Product Availability

The article reviews traditional availability metrics such as Success‑Ratio, Error‑Budget, MTTR/MTTF, SLA/SLO, and highlights their limitations, then introduces Google’s User‑Uptime and Windowed User‑Uptime metrics, explains their definitions, challenges, experimental results, and why they provide a more user‑centric view of service reliability.

Availability, SRE, metrics
0 likes · 27 min read
NetEase Yanxuan Technology Product Team
Nov 14, 2022 · Operations

Quantifying Internet Service Availability: Classic Metrics and the New User‑Uptime Indicator

The article reviews classic availability metrics such as Success‑Ratio, Incident‑Ratio, MTTR/MTTF, Error‑Budget, and SLA/SLO, then introduces User‑Uptime—a per‑user success time proportion that ignores long idle periods—and its windowed variant, showing how it complements existing indicators for more user‑centric reliability insight.

Availability, Reliability, SRE
0 likes · 27 min read
Architects Research Society
May 6, 2022 · Fundamentals

Understanding the CAP Theorem and How PACELC Extends It

The article explains the CAP theorem’s three properties—consistency, availability, and partition tolerance—its implications for distributed systems, highlights its limitations, introduces the PACELC extension that adds latency versus consistency trade‑offs when no partition exists, and provides real‑world database examples.

Availability, CAP theorem, Consistency
0 likes · 7 min read
Architect's Journey
Apr 13, 2022 · Fundamentals

Is Classifying Distributed Systems as CP or AP a False Dichotomy?

The article revisits the CAP theorem, explains linearizable consistency and strict availability, demonstrates with concrete data‑center examples why the CP/AP split is often misleading, and argues that latency concerns and broader failure modes make the binary classification of distributed systems impractical.

Availability, CAP theorem, CP vs AP
0 likes · 10 min read
DevOps Cloud Academy
Aug 24, 2021 · Cloud Computing

Key Considerations for Designing Cloud Applications: Scalability, Availability, Manageability, and Feasibility

The article outlines four essential cloud‑application design dimensions—scalability, availability, manageability, and feasibility—providing discussion points and questions for each to guide stakeholders toward robust, cost‑effective, and secure cloud solutions through comprehensive evaluation of capacity, platform constraints, load handling, SLA commitments, disaster recovery, performance tuning, and security considerations.

Availability, Scalability, feasibility
0 likes · 12 min read
21CTO
Jul 11, 2021 · Operations

How Baidu Achieved 5‑9+ Availability: Inside Its Tracing and Observability Innovations

This article examines Baidu Search's massive micro‑service architecture and reveals the detailed observability, tracing, and metrics techniques—Kepler 1.0, Kepler 2.0, and Prometheus integration—that enable five‑nine‑plus availability, full‑query debugging, and efficient capacity management.

Availability, Microservices, tracing
0 likes · 19 min read
Yanxuan Tech Team
Dec 30, 2020 · Backend Development

How to Design a Scalable Procurement System Architecture for Rapid Business Changes

This article explores how the procurement system at Yanxuan adapts to shifting commercial environments and unexpected events by defining macro-level logic, designing blueprints, implementing the system, and continuously evolving it, emphasizing the importance of top‑down architecture, scalability, availability, accuracy, and the transition toward automation and intelligence.

Availability, Scalability, System Architecture
0 likes · 14 min read
Wukong Talks Architecture
Dec 30, 2020 · Fundamentals

Understanding CAP, ACID, and BASE Theories Through the Metaphor of Tai Chi and Distributed Systems

This article uses the story of Tai Chi from the novel *The Heaven Sword and Dragon Saber* to explain the CAP theorem, ACID properties, BASE theory, and two‑phase commit in distributed systems, illustrating how consistency, availability, and partition tolerance correspond to the hard and soft aspects of Tai Chi.

ACID, Availability, BASE
0 likes · 14 min read
Selected Java Interview Questions
Dec 28, 2020 · Backend Development

Eureka vs Zookeeper: AP vs CP Trade‑offs in Service Registry Design

The article compares Eureka and Zookeeper as service registry solutions, explaining how Eureka follows an AP model with high availability and eventual consistency, while Zookeeper adopts a CP model prioritizing strong consistency, and discusses their suitable scenarios, limitations, and design considerations for distributed systems.

Availability, CAP theorem, Consistency
0 likes · 10 min read
High Availability Architecture
Oct 27, 2020 · Fundamentals

Quorum in Distributed Systems: Concepts, Variants, and Impact on Availability and Latency

Quorum, the core principle behind majority read/write and Paxos, can be defined in various ways—including weighted, hierarchical, and non‑majority quorums—to trade off system availability, latency, and fault tolerance, with examples illustrating how different quorum designs affect performance in distributed storage and coordination services.

Availability, Consensus, Distributed Systems
0 likes · 18 min read
iQIYI Technical Product Team
Oct 16, 2020 · Cloud Native

Service Maturity Model and Optimization Practices for Microservices

The article presents iQIYI’s service‑maturity model for micro‑services, outlines how scores across development, deployment and operation stages reveal common deficiencies such as code style, testing, gray‑release and alert handling, and recommends concrete optimization practices—including unified coding standards, automated testing, robust rollback, circuit‑breaking, monitoring, and emergency procedures—to raise services to mature, high‑scoring levels.

Availability, monitoring, service maturity
0 likes · 15 min read
Efficient Ops
Oct 9, 2020 · Fundamentals

Understanding the CAP Theorem Through a Real‑World Memory Service Story

This article uses a relatable memory‑service scenario to illustrate the CAP theorem, explaining how consistency, availability, and partition tolerance cannot all be achieved simultaneously in distributed systems and exploring practical trade‑offs through successive design attempts.

Availability, CAP theorem, Consistency
0 likes · 9 min read
Efficient Ops
Sep 8, 2020 · Operations

From Firefighting to Arson: Mastering Ops Availability in Three Stages

The article outlines a three‑stage ops maturity model—firefighting, fire prevention, and arson—explains how proactive fault‑injection drills, continuous availability improvements, and aligning technical metrics with business value can transform operations from reactive responders into strategic value creators.

Availability, Fault Injection, Operations
0 likes · 8 min read
Alibaba Cloud Developer
Jul 5, 2020 · Backend Development

How Alibaba Cloud Implements Reliable Distributed Locks for Shared Resources

Distributed locks ensure exclusive access to shared resources across multiple machines, and this article explains their evolution from single-machine locks, classifies system designs, and details Alibaba Cloud Storage’s practical implementation, covering strict mutual exclusion, availability, and lock-switching efficiency with real-world examples.

Alibaba Cloud, Availability, distributed-lock
0 likes · 13 min read
Tencent Cloud Developer
Oct 25, 2019 · Backend Development

High-Concurrency Practices for Tencent Video Front-End Node.js Services

Tencent Video’s front‑end Node.js services achieve massive concurrency stability through a layered architecture that combines GSLB‑directed CDN, TGW, Nginx, and clustered workers, reinforced by process guardians, three‑tier disaster‑recovery fallbacks, multi‑level caching with lock mechanisms, and comprehensive logging and alerting.

Availability, Node.js, high concurrency
0 likes · 11 min read
Architecture Digest
Apr 25, 2019 · Artificial Intelligence

Designing High‑Quality Recommendation Services: Principles and Strategies

This article explains how to build high‑performance, highly‑available, scalable, extensible, and secure recommendation services by outlining background concepts, defining quality criteria, discussing design challenges, and presenting concrete architectural principles and practical strategies.

Availability, Scalability, Security
0 likes · 29 min read
ITPUB
Apr 15, 2019 · Operations

Essential Practices to Prevent Operational Failures and Boost System Availability

This guide outlines six practical strategies—rollback testing, cautious destructive actions, clear command prompts, verified backups, careful handovers, and proactive monitoring—to help operations teams minimize outages and maintain high system availability.

Availability, Operations, backup verification
0 likes · 6 min read
Youzan Coder
Jan 9, 2019 · Big Data

How Youzan Scaled 5,000 Daily SparkSQL Jobs: Migration Lessons from Hive

This article details Youzan's transition from Hive to SparkSQL, covering platform architecture, usability and performance enhancements, migration strategies, automated engine selection, and future plans that together reduced resource consumption by up to 67% while handling thousands of daily jobs.

Availability, Big Data, Data Platform
0 likes · 13 min read
DevOps
Aug 13, 2018 · Cloud Computing

Understanding Cloud Computing SLA, Availability, and Compensation: A Comparative Analysis of Major Providers

The article explains cloud computing fundamentals, details Service Level Agreements (SLAs) and their metrics, compares the availability and compensation policies of major Chinese cloud providers, and concludes with a brief DevOps recruitment notice, highlighting both technical insights and industry context.

Availability, Cloud providers, DevOps
0 likes · 11 min read
Meituan Technology Team
Aug 9, 2018 · Frontend Development

Improving Front-End Service Availability in Meituan Financial Payments

The article outlines Meituan Finance’s front‑end availability challenges in its million‑order payment service and presents a disciplined, end‑to‑end approach—standardized release processes, simple fallback designs, automated testing, robust monitoring, and regular fault‑drill simulations—to ensure stable user experiences across diverse client environments.

Availability, Meituan, best practices
0 likes · 17 min read
ITFLY8 Architecture Home
May 24, 2018 · Fundamentals

Unlock the 5 Key Architecture Metrics for High‑Performance Systems

This guide outlines the five essential architecture metrics—performance, availability, scalability, extensibility, and security—detailing practical optimization techniques for front‑end resources, server‑side caching, database tuning, load balancing, and security layers to build resilient, high‑performing web systems.

Availability, Scalability, architecture
0 likes · 13 min read
dbaplus Community
May 8, 2018 · Operations

How to Build Reliable Operations: From BCM to Google SRE Practices

This article examines the growing challenges of system availability in modern operations, explains the concept of availability and the N‑nine metric, introduces Business Continuity Management and Google SRE approaches, and provides concrete technical and managerial methods—including architecture standardization, scaling strategies, tooling, emergency drills, and incident‑centralized management—to improve operational reliability.

Availability, BCM, Operations
0 likes · 30 min read
Meituan Technology Team
Aug 10, 2017 · Frontend Development

Front-End Service Availability: Definition, Measurement, and Assurance Practices at Meituan-Dianping Checkout

The article outlines Meituan‑Dianping’s approach to front‑end service availability for its checkout system, defining availability across code, static resources, and network links, measuring failure duration, identifying typical bugs, and implementing a three‑stage assurance strategy using people processes, engineering tools, lightweight technology choices, and concrete practices such as TypeScript adoption, automated testing, health‑checks, DNS protection, and post‑incident monitoring.

Availability, SSR, frontend
0 likes · 15 min read
Architecture Digest
Aug 7, 2017 · Operations

Website Availability and High‑Availability Architecture Overview

This article explains website availability metrics, fault‑weight scoring, layered high‑availability architecture, session management strategies, reusable service design, data redundancy, quality assurance processes, and monitoring practices essential for maintaining reliable large‑scale web systems.

Availability, Operations, Session Management
0 likes · 9 min read
Architecture Digest
Aug 5, 2017 · Fundamentals

Key Architectural Concerns for Large-Scale Websites: Performance, Availability, Scalability, Extensibility, and Security

The article explains the fundamental architectural factors of large‑scale web systems—performance, availability, scalability, extensibility, and security—detailing practical optimization techniques, measurement metrics, and design principles that guide robust software architecture decisions.

Availability, Scalability, Security
0 likes · 7 min read
Baidu Waimai Technology Team
Jul 24, 2017 · Backend Development

Transaction System Best Practices: Event‑Driven Architecture, Document Model, and Availability Guarantees

The article recaps a technical talk by Qunar’s accommodation trading system lead, covering event‑driven design, flexible document schemas, and reliability techniques such as circuit breaking, gray releases, and automated testing to improve scalability and maintainability of backend transaction platforms.

Availability, Backend Architecture, Document Model
0 likes · 3 min read
21CTO
Jul 23, 2017 · Backend Development

Comparing Kafka and RocketMQ: Architecture, Availability, and Reliability Insights

This article examines the architectures of Kafka and RocketMQ, analyzes their availability and reliability mechanisms, evaluates their strengths and weaknesses, and proposes a hybrid MQ design that combines the benefits of both systems while simplifying dependencies and improving fault tolerance.

Availability, Kafka, Message Queue
0 likes · 13 min read
Efficient Ops
Jun 22, 2017 · Cloud Computing

How to Choose the Right Cloud Host: Inside Trusted Cloud’s Rating System

This article explains the Trusted Cloud host rating framework, detailing its star‑based levels, evaluation criteria such as availability, security and disaster recovery, and how enterprises can use the standards to select the most suitable cloud host provider.

Availability, Cloud Host, cloud computing
0 likes · 6 min read
Qunar Tech Salon
Jun 13, 2016 · Operations

Evaluating IT Operations Maturity: Core Metrics, Scoring Model, and Best Practices

This article outlines a comprehensive framework for assessing IT operations maturity by defining four core dimensions—availability, cost, efficiency, and technological advancement—along with quantitative metrics, scoring formulas, and practical methods for data collection and continuous performance improvement.

Availability, Cost Management, IT Operations
0 likes · 11 min read
Architect
Jan 22, 2016 · Operations

System Reliability and Availability: Insights from the Alipay Outage and YunOS

The article examines system reliability concepts such as availability, MTBF, MTTR, and outage classifications, analyzes the Alipay service interruption, discusses various redundancy and failover strategies, and explores YunOS reliability testing and design practices to improve overall system robustness.

Availability, MTBF, YunOS
0 likes · 15 min read
Efficient Ops
Oct 7, 2015 · Information Security

Why Information Security Mirrors Protecting Your Money: 4 Core Principles Explained

The article explores the essence of information security by comparing it to safeguarding personal money, detailing the four fundamental attributes—confidentiality, integrity, availability, and controllability—and illustrating how different conditions shape security needs, from personal to enterprise contexts.

Availability, Data Protection, confidentiality
0 likes · 13 min read
Efficient Ops
Sep 24, 2015 · Operations

How to Scientifically Evaluate Whether a Cloud Service Is Truly Reliable

This article explains how to objectively assess cloud service reliability by examining three key aspects—availability, access control, and disaster recovery—and provides practical strategies such as redundancy design, gradual deployment, automation, and robust backup to improve overall cloud service trustworthiness.

Availability, access control, cloud reliability
0 likes · 14 min read
Art of Distributed System Architecture Design
May 10, 2015 · Cloud Computing

Designing Scalable and Highly Available Systems on Azure: Patterns, Anti‑Patterns, and Practical Guidance

This article examines key considerations for building highly scalable and available systems on Azure, outlining four architectural dimensions—scalability, availability, manageability, and feasibility—while discussing patterns, anti‑patterns, measurable resources, queue‑based load balancing, authentication services, and common pitfalls such as configuration errors and SQL injection.

Availability, Azure, Queue
0 likes · 8 min read