Tagged articles
352 articles
ITPUB
Jan 12, 2024 · Databases

What the 2023 Chinese Government Database Procurement Standard Means for Vendors

The 2023 Chinese government database procurement standard defines unified requirements for government agencies, covering scope, procurement principles, mandatory indicators, response and acceptance forms, and detailed technical criteria for centralized and distributed databases, offering clear guidance for vendors on functional, reliability, security, compatibility, service, and safety expectations.

Distributed Systems · Reliability · databases
0 likes · 15 min read

Java High-Performance Architecture
Jan 3, 2024 · Backend Development

How to Prevent Message Loss in Kafka: Proven Strategies and Configurations

This article explains why introducing an MQ middleware helps with system decoupling and traffic control, outlines the data‑consistency challenges it creates, and provides practical methods to detect lost messages, identify loss points in producer, broker, and consumer stages, and configure Kafka to guarantee reliable delivery.

Data Consistency · Message Queue · Reliability
0 likes · 15 min read

Liangxu Linux
Dec 25, 2023 · Fundamentals

7 Proven Techniques to Boost Embedded System Reliability

This article presents seven practical, long‑term strategies—such as ROM filling, CRC checks, RAM validation, stack monitoring, MPU usage, robust watchdog design, and avoiding volatile memory allocation—to help embedded engineers build more reliable firmware and catch abnormal behavior early.

MPU · Memory Management · Reliability
0 likes · 9 min read

dbaplus Community
Dec 18, 2023 · Databases

Is Running Production Databases in Docker a Good Idea? Risks and Realities

The article critically examines the suitability of placing production databases in Docker containers, highlighting reliability, maintainability, performance, and operational complexities, and concludes that while containers offer some benefits for stateless services, they often compromise the essential stability required for stateful database workloads.

Docker · PostgreSQL · Reliability
0 likes · 20 min read

IT Services Circle
Dec 10, 2023 · Databases

Should Production Databases Be Deployed in Docker/Kubernetes? A Critical Analysis

The article critically examines the drawbacks of running production databases inside Docker containers or Kubernetes, arguing that while containers excel for stateless services, they introduce reliability, performance, maintenance, and complexity challenges that make them unsuitable for critical stateful database workloads.

Containers · Docker · Kubernetes
0 likes · 20 min read

Architect
Dec 8, 2023 · Frontend Development

How to Build a Reliable, Low‑Latency IM Chat for Customer Service – Front‑End Techniques Revealed

This article dissects the end‑to‑end technical workflow of sending a customer‑service IM message, covering reliability, real‑time delivery, ordering, idempotency, performance bottlenecks, async handling, requestAnimationFrame, protobuf migration, and user‑experience optimizations, while sharing concrete metrics and real‑world solutions.

IM · Message Ordering · Protobuf
0 likes · 24 min read

Bilibili Tech
Dec 1, 2023 · Operations

Safe Production Practices: Change Management Platform Design and Implementation at Bilibili

After a series of change‑induced outages in early 2023, Bilibili instituted a comprehensive change‑management framework—including a preventive change platform, a central control system, quality and monitoring tools, strict gray‑release policies, observability checks, and rapid rollback mechanisms—to dramatically cut emergency incidents and embed a reliability‑first culture.

Observability · Reliability · SRE
0 likes · 16 min read

Bilibili Tech
Nov 24, 2023 · Cloud Native

Chaos Engineering and Fault Injection Practices at Bilibili: Architecture, Implementation, and Automation

Bilibili built a middleware‑based chaos engineering platform that injects faults into Golang microservices via AOP, supporting server‑ and client‑side, database, cache, and queue components, with fine‑grained instance, request, target, and user controls, automated dependency collection, experiment orchestration, and CI integration to boost system reliability.

Go · Microservices · Reliability
0 likes · 18 min read

NetEase Cloud Music Tech Team
Nov 23, 2023 · Backend Development

How We Built a Rock‑Solid RPC Framework for Cloud‑Native Microservices

This article details the challenges of RPC stability in a large‑scale microservice environment and explains the architectural redesign, SLO implementation, logging governance, exception dashboards, degradation, rate‑limiting, outlier removal, thread‑pool isolation, weak registry dependencies, and post‑incident knowledge‑base practices that together ensure reliable, high‑performance service communication.

Backend Development · Cloud Native · Microservices
0 likes · 15 min read

Top Architecture Tech Stack
Nov 17, 2023 · Fundamentals

Common Misconceptions in Architecture Design and Its Real Purpose, Illustrated with a Simple Complexity Analysis Case

The article explains common misconceptions about architecture design, outlines its true goals such as maintainability, scalability, reliability, and security, and demonstrates these principles through a detailed student‑management system case study that highlights complexity analysis and practical design decisions.

Reliability · Scalability · Security
0 likes · 12 min read

Senior Tony
Nov 14, 2023 · Operations

Master Availability, Reliability, and Stability for High‑Availability Systems

Understanding the differences between system availability, reliability, and stability is essential for building resilient services; this guide explains each concept, illustrates their distinctions with examples, and outlines practical strategies such as rate limiting, anti‑scraping, timeout settings, system inspections, and fault post‑mortems to reduce failures and downtime.

Availability · Reliability · high availability
0 likes · 11 min read

JD Tech
Nov 10, 2023 · Operations

Reducing MTTR: Monitoring, Fast Incident Response, and Team Practices

This article explains the concept and importance of MTTR (Mean Time To Repair), shows how to calculate it, and provides a comprehensive set of monitoring, alerting, rapid mitigation, tool‑assisted analysis, and team coordination techniques to significantly shorten incident resolution time and improve system reliability.

MTTR · Operations · Reliability
0 likes · 26 min read

Efficient Ops
Nov 7, 2023 · Operations

Mastering SRE: How MTBF, MTTR, SLI, SLO & Error Budget Drive Reliability

This article explains Site Reliability Engineering (SRE) as a collaborative methodology, outlines its stability goals measured by MTBF and MTTR, details how SLI/SLO and the VALET selection guide fault detection, and shows how error budgets quantify reliability work and drive precise alerting.

ErrorBudget · MTBF · MTTR
0 likes · 14 min read

Model Perspective
Nov 3, 2023 · Fundamentals

Why Exponential & Weibull Distributions Matter: Key Concepts and Applications

This article introduces the exponential and Weibull distributions, explains their probability density and cumulative functions, highlights key properties such as the memoryless nature of the exponential and the flexibility of Weibull, and demonstrates practical calculations for reliability and survival analysis scenarios.

Reliability · exponential distribution · probability
0 likes · 6 min read

MaGe Linux Operations
Oct 1, 2023 · Fundamentals

Understanding TCP vs UDP: Key Differences, Handshakes, and When to Use Each

TCP and UDP are core transport-layer protocols of the Internet; this article explains their basic concepts, characteristics, reliability mechanisms, flow and congestion control, three-way handshake and four-way termination for TCP, and highlights the strengths and weaknesses of each protocol for various applications.

Networking · Reliability · TCP
0 likes · 8 min read

Sanyou's Java Diary
Sep 21, 2023 · Big Data

Understanding Kafka: Core Concepts, Architecture, and Reliability Explained

This article provides a comprehensive overview of Kafka, covering its overall architecture, key components such as brokers, producers, consumers, topics, partitions, replicas, and ZooKeeper, as well as logical and physical storage mechanisms, producer and consumer workflows, configuration parameters, partition assignment strategies, rebalancing, and the replication model that ensures data reliability.

Data Streaming · Distributed Systems · Kafka
0 likes · 18 min read

Wukong Talks Architecture
Sep 21, 2023 · Backend Development

Detecting and Preventing Message Loss in Kafka Message Queues

This article explains how to detect, diagnose, and prevent message loss in Kafka-based message queue systems by covering system decoupling, traffic control, data consistency challenges, producer, broker, and consumer issues, and offering configuration, monitoring, and operational best‑practice solutions.

Data Consistency · Distributed Systems · Kafka
0 likes · 12 min read

JD Cloud Developers
Sep 13, 2023 · Operations

Stability Engineering Explained: From Entropy Theory to Practical SRE

The article explores why building system stability is crucial by linking entropy theory to software reliability, introduces the availability formula, discusses common pitfalls and industry practices, and proposes a three‑stage governance framework—prevention, mitigation, and post‑mortem—to systematically improve operational resilience.

Availability · Operations · Reliability
0 likes · 13 min read

dbaplus Community
Aug 28, 2023 · Operations

How to Define SLIs, SLOs, SLAs and Build Reliable, Observable Systems

This guide explains how SRE teams should define service level indicators, objectives, and agreements, design reliable and observable architectures, manage error budgets, assess risks, handle incidents, and integrate development practices to improve system stability and performance.

Error Budget · Reliability · SLI
0 likes · 15 min read

Tech Architecture Stories
Aug 8, 2023 · Operations

Mastering Fault Postmortems: Proven Methods to Boost System Reliability

This comprehensive guide explains the origins, methodologies, and practical steps of fault postmortems—including PDCA, GRIA, aviation safety lessons, industrial accident theory, and software reliability metrics—to help teams systematically investigate incidents, derive actionable improvements, and continuously enhance system availability.

GRIA · PDCA · Reliability
0 likes · 22 min read

ITPUB
Aug 7, 2023 · Databases

What Exactly Is a Distributed Database? Definitions, Features, and Architecture Explained

This article defines distributed databases, examines their external traits such as write‑heavy, low‑latency, massive concurrency, massive storage and high reliability, explores internal architectures like client‑side sharding, proxy middleware and unit‑based designs, compares them with Amazon Aurora, and summarizes key takeaways.

OLTP · Reliability · distributed databases
0 likes · 19 min read

Liangxu Linux
Aug 1, 2023 · Fundamentals

Understanding TCP vs UDP: Reliable and Unreliable Transport Explained

This article provides a comprehensive overview of the transport layer, explaining how TCP and UDP differ in reliability, connection management, port usage, handshake processes, sequence numbers, acknowledgments, retransmission, window and flow control, as well as congestion handling.

Connection Management · Network Protocols · Port Numbers
0 likes · 23 min read

Open Source Linux
Jul 24, 2023 · Fundamentals

Why TCP Beats UDP: Ports, Handshakes, and Flow Control Explained

This article explains the transport layer’s core protocols—TCP and UDP—detailing their functions, differences, port numbers, connection establishment, reliability mechanisms, flow and congestion control, and how applications use client‑server models to communicate over networks.

Port Numbers · Reliability · TCP
0 likes · 21 min read

Cloud Native Technology Community
Jul 18, 2023 · Cloud Native

2023 Kubernetes Reliability Benchmark Highlights Common Configuration Gaps

The 2023 Fairwinds Kubernetes benchmark, analyzing over 150,000 workloads, reveals that many organizations still miss critical best‑practice configurations such as memory limits, liveness probes, proper image pull policies, replica counts, and CPU limits or requests, leading to increased security risks, uncontrolled cloud costs, and reduced reliability.

Benchmark · Kubernetes · Reliability
0 likes · 7 min read

Architects Research Society
Jul 7, 2023 · Operations

Design Patterns and Principles for Building Large‑Scale Systems

This article outlines key design patterns and principles—such as scalability, idempotency, asynchronous processing, health checks, circuit breakers, feature flags, bulkheads, service discovery, retries, metrics, rate limiting, back‑pressure, and canary releases—that enable large‑scale, reliable, and resilient distributed systems.

Distributed Systems · Observability · Reliability
0 likes · 16 min read

ITPUB
Jun 30, 2023 · Operations

How Tencent Search Supercharged Reliability: Inside Its Stability Governance Playbook

This article details Tencent Search’s end‑to‑end stability engineering framework, covering a layered reliability architecture, disaster‑recovery mechanisms, fast detection and monitoring, emergency response acceleration, pre‑release interception, automated defense, and collaborative governance that together improve MTTD and MTTR by an order of magnitude.

Automation · Reliability · disaster-recovery
0 likes · 30 min read

Sanyou's Java Diary
Jun 26, 2023 · Big Data

Master Kafka Interview Questions: Architecture, Partitioning, and Reliability Explained

This article provides a comprehensive overview of Kafka, covering its core architecture, message queue models, communication process, partition selection, consumer groups, rebalancing strategies, partition assignment algorithms, reliability guarantees, replica synchronization, and reasons for removing ZooKeeper in newer versions.

Kafka · Partitioning · Reliability
0 likes · 20 min read

vivo Internet Technology
Jun 7, 2023 · Big Data

Erasure Coding Technology in the Evolution of Vivo Storage Systems

Combining academic advances and industry practice, the article surveys erasure‑coding techniques, then details Vivo’s optimized storage stack—enhancing Reed‑Solomon with bit‑matrix scheduling, parallel cross‑AZ repair, LRC and MSR layers, and intermediate‑result optimization—to achieve high reliability while minimizing bandwidth and storage overhead.

Regenerating Codes · Reliability · data redundancy
0 likes · 48 min read

Top Architect
Jun 5, 2023 · Big Data

Deep Dive into Kafka’s High Reliability and High Performance Mechanisms

This article comprehensively explores Kafka’s core concepts, architecture, and the techniques it employs—such as ack strategies, replica synchronization, high‑watermark, leader‑epoch, zero‑copy, batch sending, compression, and reactor‑based networking—to achieve both strong reliability and high throughput in distributed messaging systems.

Distributed Systems · Kafka · Message Queue
0 likes · 31 min read

DevOps
May 12, 2023 · Operations

Evolution of Chaos Engineering at Netflix: From Chaos Monkey to ChAP

This article examines how Netflix has progressively refined its chaos engineering practices—from the early Chaos Monkey tool to the sophisticated Chaos Automation Platform (ChAP)—to improve system resilience, automate experiments, and safely validate changes in large‑scale microservice environments.

Fault Injection · Netflix · Reliability
0 likes · 26 min read

MaGe Linux Operations
May 7, 2023 · Operations

How Meta’s SLICK Transforms SLO Management for Reliable Services

This article explains how Meta built SLICK, a centralized SLO/SLI platform that improves service reliability through discoverability, long‑term insights, integrated workflows, and scalable architecture, and shares real‑world examples and lessons learned from its deployment across thousands of services.

Meta · Observability · Reliability
0 likes · 13 min read

Big Data Technology Architecture
Apr 22, 2023 · Big Data

Deep Dive into Kafka’s High Reliability and High Performance Mechanisms

This article comprehensively explores Kafka’s core architecture, explaining how asynchronous decoupling and traffic shaping are achieved, detailing the roles of producers, brokers, consumers, and ZooKeeper, and analyzing the reliability and performance techniques such as ACK policies, replication, idempotent and transactional producers, page‑cache flushing, zero‑copy, compression, batching, and load‑balancing strategies.

Distributed Systems · Message Queue · Reliability
0 likes · 31 min read

dbaplus Community
Apr 3, 2023 · Operations

How to Guarantee Zero Message Loss in Kafka: Practical Detection and Prevention Strategies

This article explains why MQ middleware like Kafka is introduced for system decoupling and traffic control, outlines the three key challenges of message loss detection, loss points, and prevention, and provides detailed configurations, monitoring tips, and code examples to ensure reliable, loss‑free message delivery.

Configuration · Consumer · Data Consistency
0 likes · 12 min read

Code Ape Tech Column
Mar 30, 2023 · Backend Development

How to Ensure No Message Loss in MQ Systems – Interview Guide and Practical Solutions

This article explains the common interview question of guaranteeing 100% message reliability in MQ middleware such as Kafka or RabbitMQ, outlines the three lifecycle stages of a message, discusses detection mechanisms, id generation, idempotent consumption, and handling message backlog, providing concrete design patterns and practical examples.

Distributed Systems · Idempotency · Kafka
0 likes · 12 min read

dbaplus Community
Mar 15, 2023 · Backend Development

How to Prevent Message Loss in Kafka: Practical Tips and Configurations

This guide explains why message queues are introduced for decoupling and traffic control, identifies three key areas where message loss can occur—in producers, brokers, and consumers—and provides concrete Kafka configurations, monitoring practices, and operational steps to ensure reliable, loss‑free message delivery.

Consumer Monitoring · Kafka · Message Loss
0 likes · 12 min read

FunTester
Mar 13, 2023 · Operations

How Chaos Engineering Can Strengthen System Reliability: A Practical Guide

This article explains the origins and principles of chaos engineering, illustrates how fault‑injection scenarios expose system weaknesses, outlines step‑by‑step implementation—from tool selection and metric definition to execution and post‑mortem—and highlights its role in achieving high‑availability service level agreements.

DevOps · Distributed Systems · Fault Injection
0 likes · 10 min read

dbaplus Community
Feb 28, 2023 · Operations

How Container SRE at DeWu Boosts Reliability: Practices, Metrics, and Incident Playbooks

This article details DeWu's container SRE approach, covering SRE fundamentals, on‑call response, SLO/SLA design, change management, capacity planning, kernel‑parameter monitoring, security safeguards, and a real‑world incident analysis, providing actionable insights for building resilient cloud‑native services.

CapacityPlanning · IncidentResponse · Kubernetes
0 likes · 24 min read

Architecture Digest
Jan 19, 2023 · Backend Development

Designing High‑Availability Backend Interfaces

The article explains why high availability is essential for backend services, defines its core concepts, and outlines key design principles such as minimizing dependencies, avoiding single points of failure, load balancing, resource isolation, rate limiting, circuit breaking, asynchronous processing, degradation strategies, gray releases, and chaos engineering to build resilient APIs.

Reliability · fault tolerance · service design
0 likes · 9 min read

58UXD
Jan 12, 2023 · Product Management

Boosting UX Evaluation Credibility with Rater Reliability in QMD 3.0

This article explains how 58 Tongcheng’s design team upgraded the QMD evaluation framework to QMD 3.0, using reliability testing, ICC analysis, and systematic process controls to make subjective UX scores more trustworthy and actionable across product lines.

ICC · QMD · Reliability
0 likes · 9 min read

Open Source Linux
Jan 9, 2023 · Fundamentals

Why TCP Matters: A Deep Dive into Reliable Transport and Network Layers

This article explains the essential concepts of TCP and UDP, covering the OSI model layers, socket communication, reliable transmission mechanisms such as stop‑and‑wait and sliding‑window, congestion control, connection establishment, and practical differences between TCP and UDP, providing a comprehensive networking fundamentals overview.

Networking · Reliability · TCP
0 likes · 26 min read

DevOps Cloud Academy
Dec 31, 2022 · Operations

Google Site Reliability Engineering (SRE) Principles and Engagement Model

The article explains Google’s Site Reliability Engineering (SRE) team, its mission to balance reliability and velocity through automation, the engagement model with development teams, funding principles, and a set of guiding principles that shape how SRE collaborates, scopes, and delivers value across services.

Engagement Model · Google · Reliability
0 likes · 29 min read

Architecture Digest
Dec 30, 2022 · Operations

Vivo Monitoring Platform: Architecture, Evolution, and Future Directions

The article details the evolution, architecture, capabilities, challenges, and future plans of Vivo's comprehensive monitoring platform, covering its transition from simple Zabbix setups to a cloud‑native, AI‑ops enabled system that ensures service availability across massive infrastructure.

Observability · Reliability · aiops
0 likes · 16 min read

Programmer DD
Dec 26, 2022 · Operations

Inside Alibaba Cloud Hong Kong Region C Outage: Timeline, Impact, and Lessons Learned

On December 18, 2022, Alibaba Cloud's Hong Kong Region Zone C suffered a massive service interruption—the longest in its operational history—prompting a detailed incident response, extensive service impact across compute, storage, and networking, and a thorough analysis that led to concrete infrastructure and communication improvements.

Alibaba Cloud · Incident Report · Infrastructure
0 likes · 13 min read

Java High-Performance Architecture
Dec 6, 2022 · Cloud Native

How to Build Resilient Microservices: Patterns for Fault Tolerance and High Availability

Learn essential techniques for designing fault‑tolerant microservices, including graceful degradation, change management, health checks, self‑healing, failover caching, retry strategies, rate limiting, circuit breakers, and testing failures, to ensure high availability and reliability in distributed cloud‑native systems.

Operations · Reliability · cloud-native
0 likes · 15 min read

DevOps
Dec 5, 2022 · Operations

Key Findings from the 2022 Accelerate State of DevOps Report: Software Delivery, Organizational Performance, and Software Supply Chain Security

The 2022 Accelerate State of DevOps report, based on surveys of 33,000 professionals, reveals that software delivery performance, operational reliability, and organizational culture—especially high‑trust, low‑blame environments—drive organizational outcomes, while secure software supply chain practices such as SLSA and NIST SSDF further boost performance and reduce burnout.

DevOps · Reliability · SLSA
0 likes · 8 min read

21CTO
Nov 15, 2022 · Operations

Mastering SRE: How to Define SLIs, SLOs, SLAs and Build Reliable Systems

This article explains how SRE teams should define Service Level Indicators, Objectives and Agreements, manage reliability, performance, saturation and observability, use proper metrics and tracing, handle error budgets, assess risks, and implement effective incident and project management to create robust, cloud‑native services.

Error Budget · Observability · Reliability
0 likes · 14 min read

NetEase Yanxuan Technology Product Team
Nov 14, 2022 · Operations

Quantifying Internet Service Availability: Classic Metrics and the New User‑Uptime Indicator

The article reviews classic availability metrics such as Success‑Ratio, Incident‑Ratio, MTTR/MTTF, Error‑Budget, and SLA/SLO, then introduces User‑Uptime—a per‑user success time proportion that ignores long idle periods—and its windowed variant, showing how it complements existing indicators for more user‑centric reliability insight.

Availability · Reliability · SRE
0 likes · 27 min read

Su San Talks Tech
Nov 12, 2022 · Fundamentals

When Is UDP Faster Than TCP? Deep Dive into Socket Protocols

While UDP is often assumed to be faster than TCP, this article explains socket basics, the reliability mechanisms of TCP, scenarios where UDP can be slower, and how application-layer solutions like KCP or QUIC add reliability, helping readers understand when each protocol truly excels.

Network Protocols · Packet Loss · Reliability
0 likes · 13 min read

NetEase Yanxuan Technology Product Team
Oct 31, 2022 · Industry Insights

How a Brand Tackles B2B System Architecture: Challenges and Solutions

This article examines the unique challenges of building B2B systems for a brand—covering multi‑role permissions, complex supply‑chain workflows, logistics integration, data modeling, and reliability—while sharing concrete architectural solutions such as cloud‑edge services, decision‑center IPC, end‑to‑end monitoring, and industry‑specific adaptations.

B2B · Logistics · Reliability
0 likes · 15 min read

Top Architect
Oct 15, 2022 · Backend Development

Designing Fault‑Tolerant Microservices: Patterns and Practices

The article explains how microservice architectures can achieve high availability by isolating failures, employing graceful degradation, change‑management strategies, health checks, fallback caching, retry logic, rate limiting, circuit breakers, and chaos testing, while acknowledging the added complexity and cost of such reliability engineering.

Backend · Operations · Reliability
0 likes · 13 min read

Architecture Digest
Sep 25, 2022 · Cloud Native

Designing Microservices Architecture for Failure: Patterns and Practices

This article explains how to build highly available microservices by addressing the inherent risks of distributed systems and presenting fault‑tolerance patterns such as graceful degradation, change management, health checks, self‑healing, failover caching, retries, rate limiting, bulkheads, circuit breakers, and systematic failure testing.

Cloud Native · Microservices · Reliability
0 likes · 14 min read

Architects Research Society
Sep 16, 2022 · Operations

Building a Reliability Culture: Practices, Benefits, and Implementation

This article explains what a reliability culture is, why it matters, how to cultivate it through mission statements, early‑stage reliability testing, chaos‑engineering practices like GameDays and FireDrills, and how organizations can continuously learn from incidents to improve system availability and customer trust.

Culture · Operations · Reliability
0 likes · 18 min read

Alibaba Cloud Developer
Sep 14, 2022 · Operations

Mastering System Stability: From Fault Prevention to Emergency Response

This article outlines a comprehensive safety‑production framework that covers pre‑incident fault prevention, incident response, and post‑mortem improvement, detailing design‑for‑failure principles such as redundancy, isolation, idempotence, monitoring, automation, disaster recovery, scaling, rate‑limiting, and continuous testing to ensure reliable, resilient services.

Reliability · disaster recovery · incident management
0 likes · 16 min read
Architect
Sep 8, 2022 · Backend Development

Ensuring No Message Loss in MQ Systems: Interview Guide and Practical Solutions

This article explains how to answer interview questions about guaranteeing 100% message delivery in MQ middleware such as Kafka, RabbitMQ, or RocketMQ, covering system decoupling, message lifecycle stages, reliability mechanisms, idempotent consumption, unique ID generation, and handling message back‑pressure.

Idempotency · Interview Preparation · Message Queue
0 likes · 12 min read
IT Architects Alliance
Sep 6, 2022 · Operations

How to Guarantee Zero Message Loss with Kafka: Best Practices and Configurations

This article explains why introducing a message queue like Kafka helps decouple systems and control traffic, then dives into the three key questions of detecting, locating, and preventing message loss, offering concrete monitoring methods, configuration settings, and troubleshooting steps for producers, brokers, and consumers.

Broker · Configuration · Consumer
0 likes · 13 min read
Xiao Lou's Tech Notes
Sep 5, 2022 · Backend Development

How to Prevent Payment Order Loss (Drop Orders) in E‑Commerce Systems

This article explains why payment orders can disappear after a successful wallet transaction, outlines the complete payment flow, distinguishes internal and external drop‑order scenarios, and provides practical server‑side and client‑side strategies—including retry mechanisms, reliable async messaging, scheduled queries, and delayed‑message approaches—to reliably prevent such issues.

Backend · Message Queue · Reliability
0 likes · 13 min read
Wukong Talks Architecture
Sep 2, 2022 · Big Data

Preventing Data Loss in Kafka: Message Semantics, Failure Scenarios, and Reliability Solutions

This article explains Kafka's message delivery semantics, analyzes potential data‑loss scenarios across producer, broker, and consumer components, and provides concrete configuration and coding practices—such as idempotent producers, proper ACK settings, replication factors, and manual offset commits—to maximize message durability and reliability.

Broker · Consumer · Data loss
0 likes · 18 min read
Architects Research Society
Sep 1, 2022 · Cloud Computing

Cloud Design Patterns: Challenges and a Comprehensive Catalog for Building Reliable, Scalable, and Secure Applications

This article explains the key challenges of cloud development—data management, design and implementation, and messaging—then presents a detailed catalog of cloud design patterns, each with problem statements, Azure examples, and categories such as reliability, security, and performance efficiency.

Design Patterns · Reliability · Scalability
0 likes · 9 min read
Architects Research Society
Aug 25, 2022 · Operations

Core Reliability Principles in the Google Cloud Architecture Framework

This article outlines the core reliability principles of the Google Cloud Architecture Framework, explaining key terms such as SLI, SLO, error budget, and SLA, and describing design and operational guidelines for defining reliability goals, building observability, ensuring high availability, creating robust processes, effective alerting, and collaborative incident management.

Error Budget · Observability · Operations
0 likes · 12 min read
Mike Chen's Internet Architecture
Aug 21, 2022 · Backend Development

12 Core Principles of Message Queues (MQ) – A Comprehensive Summary

This article provides a detailed overview of message queue fundamentals, covering producers, consumers, brokers, queue and topic models, delivery guarantees, acknowledgment mechanisms, ordering, persistence, high availability, and selection criteria, helping readers quickly grasp essential MQ concepts for distributed systems.

Broker · Consumer · Message Queue
0 likes · 9 min read
Qunar Tech Salon
Aug 17, 2022 · Operations

Design and Optimization of Testing Environment 3.0 at Qunar Travel

This article describes how Qunar Travel has evolved its testing environment governance from a fixed 10‑machine setup to a template‑driven, soft‑routing architecture (Environment 3.0), improving delivery speed, reliability, business connectivity, and reducing operational costs through automated sync, smart recommendations, and continuous business checks.

Environment · Microservices · Operations
0 likes · 22 min read
58 Tech
Aug 11, 2022 · Backend Development

WLock: High‑Reliability, High‑Throughput Distributed Lock Service Based on WPaxos

WLock is an open‑source distributed lock service built on the WPaxos consensus algorithm and RocksDB storage, offering multiple lock types, flexible acquisition modes, high concurrency optimizations, and strong reliability and throughput for coordinating shared resources in distributed systems.

High Throughput · Reliability · RocksDB
0 likes · 12 min read
Sohu Tech Products
Aug 3, 2022 · Fundamentals

Common Message Queues: RabbitMQ, RocketMQ, and Kafka – Components, Features, and Best Practices

This article introduces the core concepts, components, exchange types, reliability mechanisms, ordering, delay, transaction, high‑availability, and load‑balancing strategies of three popular message‑queue systems—RabbitMQ, RocketMQ, and Kafka—while also discussing common challenges such as message ordering, delayed consumption, reliability, idempotence, and backlog handling.

Distributed Systems · Kafka · Message Queue
0 likes · 33 min read
Top Architect
Jul 30, 2022 · Backend Development

Comprehensive Overview of RabbitMQ, RocketMQ, and Kafka: Architecture, Components, and Best Practices

This article provides an in‑depth technical guide to three major message‑queue systems—RabbitMQ, RocketMQ, and Kafka—covering their core components, exchange and routing mechanisms, durability features, consumer acknowledgment, dead‑letter handling, load balancing, and practical strategies for reliability, ordering, and scaling in distributed backend applications.

Distributed Systems · Messaging · RabbitMQ
0 likes · 29 min read
DevOps
Jul 25, 2022 · Operations

Understanding the Role and Responsibilities of Site Reliability Engineering (SRE)

This article provides a comprehensive overview of Site Reliability Engineering, explaining its origins, core responsibilities across infrastructure, platform, and business layers, daily tasks such as deployment, on‑call duties, SLI/SLO management, incident post‑mortems, capacity planning, and user support, as well as career advice for aspiring SREs.

Infrastructure · Oncall · Reliability
0 likes · 21 min read
FunTester
Jul 24, 2022 · Operations

Boost Service Reliability with Chaos Engineering: Practical Steps & Evaluation

Chaos engineering, a discipline for experimenting on distributed systems, helps teams identify hidden weaknesses, improve high‑availability, and build confidence in production by defining stable states, injecting realistic failures, and measuring impact through observability metrics, with practical steps, tool choices, maturity stages, and evaluation methods.

Distributed Systems · Fault Injection · Observability
0 likes · 11 min read
Architects Research Society
Jul 2, 2022 · Operations

Reliability vs Resilience: Understanding the Difference and Its Importance

Reliability and resilience are distinct yet complementary goals for cloud services; reliability is the outcome of consistently meeting performance expectations, while resilience describes a system’s ability to continue operating despite failures, and this article introduces the concepts and outlines a four‑part series exploring related threats and enhancement techniques.

Cloud Services · Operations · Reliability
0 likes · 6 min read
Open Source Linux
Jun 24, 2022 · Fundamentals

Why TCP Matters: A Deep Dive into Transport Layer Fundamentals

This article explains the essential concepts of TCP and UDP, covering the OSI layers, physical and data‑link networking, IP addressing, transport‑layer mechanisms such as sliding windows, congestion control, reliable transmission, connection establishment and termination, and compares the strengths and weaknesses of TCP versus UDP.

Networking · Reliability · TCP
0 likes · 28 min read
IT Architects Alliance
Jun 14, 2022 · Cloud Native

Design and Challenges of Multi‑Active Architecture in Hybrid Cloud Environments

This article examines the design principles, challenges, and implementation details of a multi‑active architecture for hybrid cloud environments, covering stability, cost, efficiency, network topology, container orchestration, service discovery, traffic scheduling, and data storage, and outlines practical solutions used by the Zuoyebang platform.

Cloud Native · Cost Optimization · Kubernetes
0 likes · 13 min read
IT Architects Alliance
Jun 12, 2022 · Fundamentals

TCP Three-Way Handshake and Four-Way Termination Explained

This article explains how TCP ensures reliable connections through the three-way handshake process—including SYN, SYN‑ACK, and ACK exchanges—and describes the four-step termination sequence with FIN and ACK flags, while also clarifying why three handshakes are necessary instead of two.

Four-way termination · Reliability · TCP
0 likes · 7 min read
Architecture Digest
May 27, 2022 · Backend Development

Ensuring Message Reliability in RocketMQ and RabbitMQ

This article explains how RocketMQ and RabbitMQ guarantee message reliability by covering production, storage, and consumption stages, introducing confirm mechanisms, persistence settings, manual acknowledgments, and compensation strategies such as message database storage to achieve near‑zero data loss in distributed systems.

Backend Development · RabbitMQ · Reliability
0 likes · 10 min read
IT Services Circle
May 10, 2022 · Fundamentals

The Drawbacks of TCP: Upgrade Difficulty, Connection Latency, Head‑of‑Line Blocking, and Migration Overhead

This article examines the inherent shortcomings of the TCP protocol, including the difficulty of upgrading the stack, the latency introduced by its three‑way handshake and TLS, head‑of‑line blocking caused by packet loss, and the high cost of connection migration when network conditions change.

Latency · Network Protocols · QUIC
0 likes · 10 min read
DeWu Technology
May 9, 2022 · Backend Development

Common Issues and Solutions for Message Queue Middleware

Message‑queue middleware such as RabbitMQ, RocketMQ, ActiveMQ, and Kafka introduces challenges like ordering, loss, duplication, back‑pressure and delayed delivery, which can be mitigated by using single‑consumer queues or partitioning, enabling acknowledgments and replication, applying idempotent identifiers, scaling consumers, and employing dead‑letter or scheduling mechanisms.

Kafka · MQ · Message Queue
0 likes · 21 min read
dbaplus Community
May 4, 2022 · Operations

How Tencent Game Teams Use Chaos Engineering to Boost Reliability and Reduce Outages

This article explains the concept of chaos engineering, its six key benefits, the design of a full‑lifecycle chaos platform, fault‑atom categories, experiment orchestration, risk control, automation, red‑blue war games, and practical experiments that helped Tencent Games improve system reliability while cutting operational costs.

DevOps · Gaming · Operations
0 likes · 21 min read
High Availability Architecture
Mar 7, 2022 · Operations

Understanding High Concurrency, High Availability, Performance, and Scalability: Concepts and Metrics

This article systematically explains the relationships among high concurrency, high availability, performance, and scalability, defines their quantitative metrics, categorizes sources of change that affect system reliability, and outlines strategies for fault prediction, impact reduction, and rapid recovery in large‑scale services.

Operations · Reliability · Scalability
0 likes · 11 min read
Architect's Journey
Mar 2, 2022 · Backend Development

Interview Basics: How to Guarantee Message Reliability in MQ

The article explains how to achieve 100% message delivery and consumption in MQ systems by covering producer acknowledgments, broker persistence mechanisms, and consumer idempotency, with detailed comparisons of RabbitMQ and Kafka implementations and configuration tips.

ACK · Kafka · Message Queue
0 likes · 13 min read
Top Architect
Feb 26, 2022 · Backend Development

Why Use Message Queues (MQ) and How to Handle Common MQ Problems

This article explains the motivations for introducing message queues—such as decoupling, asynchronous processing, and traffic shaping—then details the typical challenges like availability, complexity, duplicate consumption, data consistency, message loss, ordering, and backlog, and provides practical solutions for each issue.

Decoupling · Reliability · Traffic Shaping
0 likes · 14 min read
IT Architects Alliance
Feb 13, 2022 · Big Data

Comprehensive Overview of Apache Kafka Architecture and Core Concepts

This article provides an in‑depth introduction to Apache Kafka, covering its distributed streaming platform fundamentals, message‑queue models, topic and partition design, broker and cluster roles, producer partitioning logic, reliability guarantees, consumer group assignors, offset management, and performance optimizations such as sequential disk writes and zero‑copy techniques.

Apache Kafka · Distributed Streaming · Reliability
0 likes · 25 min read
Sanyou's Java Diary
Feb 11, 2022 · Fundamentals

Mastering TCP: Sliding Window, Flow & Congestion Control Explained

This article continues the previous discussion on TCP handshakes and termination, then thoroughly explains nine essential TCP mechanisms—including sliding window, flow control, congestion control, delayed and piggyback ACKs, sticky packet handling, and keep‑alive—illustrated with diagrams and practical examples.

Flow Control · Network Protocols · Reliability
0 likes · 15 min read
Selected Java Interview Questions
Feb 5, 2022 · Backend Development

Message Queue Fundamentals: Use Cases, Product Comparison, High Availability, and Reliability Strategies

This article explains why message queues are used, outlines common scenarios such as decoupling, asynchronous processing and traffic shaping, compares major MQ products, and provides practical guidance on high availability, preventing loss, duplicate consumption, ordering, backlog handling, and expiration.

Kafka · Message Queue · RabbitMQ
0 likes · 8 min read
Code Ape Tech Column
Feb 4, 2022 · Backend Development

Ensuring Zero Message Loss in MQ Systems: Interview Strategies and Solutions

This article explains how to guarantee that messages are never lost when using MQ middleware such as Kafka, RabbitMQ, or RocketMQ, outlines the key interview points, and provides practical design patterns, detection mechanisms, idempotency, and scaling strategies for reliable message delivery.

Distributed Systems · Kafka · Message Queue
0 likes · 13 min read
IT Architects Alliance
Feb 3, 2022 · Backend Development

Common Issues in Message Queues and Distributed Transaction Solutions

This article explains the typical problems encountered with message queues, such as message loss, duplicate delivery, and distributed transaction handling, and details various solutions including local message tables, MQ‑based transactions, and the specific mechanisms used by RocketMQ, Kafka, and RabbitMQ to ensure reliability and consistency.

Kafka · MQ · Message Queue
0 likes · 20 min read
AntTech
Jan 24, 2022 · Operations

Ant Group's Chaos Engineering System: Evolution, Business Features, Key Technologies, and Future Directions

This article outlines Ant Group's six‑year journey in chaos engineering, describing its three generational evolutions, business‑oriented fault injection, risk‑mining, full‑lifecycle coverage, massive scale, root‑data protection, core technologies such as Awatch, simulation environments, and plans for intelligent, open‑source future development.

Ant Group · Awatch · Fault Injection
0 likes · 23 min read
dbaplus Community
Dec 15, 2021 · Operations

How Chaos Engineering Guarantees Stability for Distributed Data Systems

This article examines the stability challenges of selecting distributed data products, introduces chaos‑engineering‑based testing methods, outlines practical test scenarios, fault injection techniques, toolchains, and quantitative analysis metrics, and presents a capability assessment standard for ensuring system reliability.

Data Platforms · Reliability · chaos engineering
0 likes · 11 min read