Tagged articles
352 articles
Wukong Talks Architecture
Dec 8, 2021 · Big Data

Understanding Kafka Core Concepts: Architecture, Messaging Models, Partitioning, Consumer Groups, and Reliability

This article provides a comprehensive overview of Kafka, covering its layered architecture with Zookeeper, core concepts such as topics, partitions and consumer groups, communication workflow, partition selection strategies, rebalancing mechanisms, reliability configurations, replica synchronization, and reasons for moving away from Zookeeper.

Distributed Systems, Kafka, Reliability
0 likes · 19 min read
IT Architects Alliance
Dec 3, 2021 · Big Data

Comprehensive Overview of Apache Kafka Architecture and Core Concepts

This article provides an in‑depth technical guide to Apache Kafka, covering its distributed streaming architecture, core concepts such as topics, partitions, brokers, producers and consumers, reliability guarantees, storage mechanisms, configuration parameters, and consumer assignment strategies, supplemented with Java code examples.

Apache Kafka, Consumer, Distributed Streaming
0 likes · 24 min read
Architecture Digest
Nov 19, 2021 · Backend Development

Ensuring Reliable Message Delivery with RabbitMQ: Producer Confirmation, Persistence, and Consumer Acknowledgment

The article explains how to achieve high‑reliability message delivery in RabbitMQ by using producer confirm mode, persisting exchanges, queues and messages, storing outbound messages in a database, and switching consumers to manual acknowledgments to prevent loss in network or broker failures.

Consumer Acknowledgment, Java, Message Queue
0 likes · 9 min read
ITFLY8 Architecture Home
Nov 1, 2021 · Operations

Mastering Service Degradation: Strategies to Keep Your System Available Under Load

Service degradation, a crucial reliability technique, involves selectively disabling non-essential features, applying rate limiting, timeout handling, fallback data, and tiered switches across front‑end, back‑end, and infrastructure layers to maintain core functionality during traffic spikes or component failures, ensuring high availability and meeting SLA targets.

Fallback, Operations, Reliability
0 likes · 13 min read
Open Source Linux
Oct 11, 2021 · Operations

10 Essential Ops Principles Every Engineer Should Follow

This article shares ten practical operations guidelines—from avoiding duplicated work and embracing mistakes to emphasizing monitoring, backup roles, clear division of labor, and continuous improvement—aimed at boosting reliability, efficiency, and team cohesion for both engineers and managers.

Operations, Reliability, best practices
0 likes · 10 min read
Selected Java Interview Questions
Oct 9, 2021 · Backend Development

RocketMQ vs Kafka: Detailed Feature, Performance, and Reliability Comparison

This article provides a comprehensive comparison between RocketMQ and Kafka, covering data reliability, performance, queue capacity, real‑time delivery, retry mechanisms, ordering guarantees, scheduled messages, transactional support, query capabilities, message tracing, consumer parallelism, filtering, and commercial backing, helping engineers choose the right messaging middleware for their workloads.

Distributed Systems, Kafka, Message Queue
0 likes · 11 min read
HaoDF Tech Team
Oct 8, 2021 · Operations

Understanding SRE: Foundations, Metrics, and Tackling Technical Debt

This article introduces the fundamentals of Site Reliability Engineering (SRE), explains how to measure service stability with metrics like MTTR, MTBF, and availability, outlines the SRE workflow from prevention to post‑mortem, and discusses how to identify and reduce technical debt to improve system health.

Operations, Reliability, SRE
0 likes · 18 min read
HelloTech
Sep 27, 2021 · Operations

Fault Drills and Chaos Engineering Practices for Enhancing System Stability

The initiative introduces fault‑drill and chaos‑engineering practices—defining steady‑state metrics, injecting real‑world failures in controlled experiments, automating continuous production tests, and limiting blast radius—to detect weaknesses early, accelerate fault location and recovery, boost emergency response metrics, and foster a resilient engineering culture.

Automation, Reliability, chaos engineering
0 likes · 11 min read
Continuous Delivery 2.0
Sep 26, 2021 · Operations

Key Findings from Google DORA’s 2021 Accelerate State of DevOps Report

The 2021 Accelerate State of DevOps report by Google DORA, based on over 32,000 professionals, reveals that elite teams dramatically outperform low‑performing teams across four classic delivery metrics, introduces a new reliability metric, and highlights the impact of team culture, SRE practices, cloud adoption, secure software supply chains, and high‑quality documentation on software delivery and organizational performance.

Reliability, SRE, cloud
0 likes · 7 min read
Top Architect
Sep 17, 2021 · Backend Development

Kafka Storage Mechanism and Reliability Guarantees

This article explains Kafka's storage architecture, including segment files and indexing, and details the reliability mechanisms such as ISR, OSR, LEO, HW, producer acknowledgment levels, and leader election strategies to ensure data consistency and availability.

Kafka, Message Queue, Reliability
0 likes · 9 min read
Big Data Technology & Architecture
Sep 17, 2021 · Big Data

Key Reliability Mechanisms of HDFS, YARN Failover Strategies, and Hadoop Shuffle Process

This article explains HDFS reliability features such as replica policies, rack awareness, heartbeat, safe mode, checksums, trash, metadata protection and snapshots, then details YARN failover handling for ApplicationMaster, NodeManager and ResourceManager, and finally describes the Hadoop MapReduce shuffle workflow and tuning tips.

HDFS, MapReduce, Reliability
0 likes · 13 min read
Programmer DD
Sep 16, 2021 · Backend Development

Why RocketMQ Solves Core Messaging Challenges – Architecture and Features Explained

This article examines the key problems message middleware must address—such as publish/subscribe, ordering, filtering, persistence, reliability, latency, and transaction support—and explains how Apache RocketMQ’s architecture and design choices provide high‑performance, high‑throughput solutions to each of these challenges.

Distributed Systems, Message Queue, Reliability
0 likes · 17 min read
IT Architects Alliance
Aug 28, 2021 · Backend Development

Understanding Message Queues: Sync vs Async, Decoupling, Performance, and Reliability

This article explains the fundamentals of message queues, compares synchronous and asynchronous communication, discusses the benefits of sender‑receiver decoupling, outlines performance and reliability considerations, and provides practical guidance for designing robust distributed messaging architectures.

Decoupling, Message Queue, Reliability
0 likes · 9 min read
Efficient Ops
Aug 9, 2021 · Operations

Why “High Availability” Often Fails: Lessons from a Messaging System Disaster

A real‑world incident with ActiveMQ’s high‑availability setup shows that focusing on component reliability without business‑level capacity planning, monitoring, and graceful degradation can cripple services, highlighting that true high availability must prioritize overall system and user experience.

Distributed Systems, Reliability, messaging queues
0 likes · 7 min read
Alibaba Cloud Native
Aug 6, 2021 · Operations

Scaling Chaos Engineering at Qunar: Lessons from Thousands of Microservices

Qunar shares how it built a large‑scale chaos engineering platform for thousands of microservices, detailing tool selection, architecture, evolution stages, fault‑injection scenarios, strong/weak dependency automation, open‑source contributions, and future plans for automated random drills.

Cloud Native, Fault Injection, Operations
0 likes · 9 min read
Baidu Intelligent Testing
Aug 5, 2021 · Operations

Baidu Search Stability Issue Analysis: Automated Fault Detection and Resolution Techniques

This article details Baidu Search's high‑availability engineering, describing eight major challenges in fault analysis and the corresponding innovations—index mirroring, streaming analysis, comprehensive label sets, feature engineering, query reconstruction, intelligent ranking, timeline analysis, and chaos engineering—that together enable near‑real‑time, 99% accurate detection and mitigation of search service failures.

Big Data, Reliability, Search
0 likes · 13 min read
Full-Stack Internet Architecture
Aug 4, 2021 · Backend Development

Ensuring Reliable Kafka Messaging: Handling Message Loss, Producer, Broker, and Consumer Configurations

This article explains how Kafka can suffer message loss and duplication, and provides practical configurations for the producer, broker, and consumer—including acks, retries, replication factor, min.insync.replicas, and manual offset commits—to achieve reliable, idempotent message processing.

Idempotency, Reliability, message-queue
0 likes · 9 min read
New Oriental Technology
Jul 28, 2021 · Backend Development

Introduction to Message Middleware and RabbitMQ

This article provides a comprehensive overview of message middleware concepts, the role and benefits of message queues, and an in‑depth look at RabbitMQ architecture, components, reliability mechanisms, advanced features, and internal storage and flow‑control mechanisms.

Backend, Message Queue, RabbitMQ
0 likes · 22 min read
DevOps
Jul 28, 2021 · Operations

Improving System Availability: Stages, Influencing Factors, and Practical Measures

This article explains system availability, outlines three stages of incident handling, identifies key factors that degrade availability such as human error, avalanche effects, untested releases and infrastructure failures, and proposes technical and team‑oriented practices to enhance reliability and achieve higher "nines" of uptime.

Operations, Reliability, incident management
0 likes · 11 min read
Selected Java Interview Questions
Jul 19, 2021 · Backend Development

Message Queue Concepts and Consumption Scenarios

This article explains the core concepts of message queues—including producers, consumers, messages, brokers, and push/pull delivery—and analyzes three consumption scenarios: at‑most‑once, at‑least‑once, and exactly‑once, detailing the required producer, broker, and consumer behaviors for each.

Consumer, Exactly-Once, Message Queue
0 likes · 6 min read
Wukong Talks Architecture
Jul 12, 2021 · Backend Development

Why Use Message Queues? Pain Points, Challenges, and Practical Solutions

This article explains the drawbacks of traditional synchronous architectures, outlines why adopting message queues improves latency, coupling, and peak‑handling, and then details common MQ problems such as duplicate messages, data inconsistency, loss, ordering, backlog, and increased complexity along with concrete mitigation strategies.

Asynchronous, Backend Development, Decoupling
0 likes · 13 min read
DevOps
Jul 12, 2021 · Operations

The First Four Chaos Experiments to Run on Apache Kafka

This article explains how to use chaos engineering with Gremlin to design, execute, and analyze four experiments that test Kafka broker load, message loss, split‑brain scenarios, and ZooKeeper outages, helping improve the reliability and resilience of Kafka deployments.

Distributed Systems, Gremlin, Kafka
0 likes · 18 min read
iQIYI Technical Product Team
Jul 2, 2021 · Cloud Native

Chaos Engineering Practices at iQIYI: Building Resilient Cloud‑Native Systems

iQIYI’s Little Deer Chaos Platform injects faults and runs red‑blue attacks across production services, enabling teams to validate alerts, circuit‑breakers, and fail‑over mechanisms—demonstrated by video playback and membership service case studies—thereby fostering zero‑trust design, faster skill growth, and resilient cloud‑native operations.

DevOps, Fault Injection, Reliability
0 likes · 10 min read
ITFLY8 Architecture Home
Jun 20, 2021 · Backend Development

How to Prevent Message Loss and Ensure Reliable Delivery in Distributed Systems

This article explains practical techniques for detecting lost messages, guaranteeing reliable production, storage, and consumption stages, handling duplicate deliveries with idempotent designs, managing message backlogs, and implementing distributed transactions using transactional messages in modern message queue systems.

Idempotency, Kafka, Reliability
0 likes · 18 min read
DevOps
Jun 2, 2021 · Operations

Disasterpiece Theater: Slack’s Process for Approachable Chaos Engineering

Slack’s resilience engineering team outlines a structured chaos‑engineering workflow—identifying potential failures, ensuring fault tolerance, and deliberately injecting faults in development and production—to safely test system robustness, validate hypotheses, and continuously improve reliability through regular disaster‑theater exercises.

Fault Injection, Operations, Reliability
0 likes · 11 min read
Architects' Tech Alliance
May 19, 2021 · Operations

Designing Microservices Architecture for Failure: Patterns and Practices

Microservice architectures must handle inevitable network, hardware, and application errors by employing fault‑tolerant patterns such as graceful degradation, change management, health checks, fail‑over caches, retry logic, rate limiting, circuit breakers, and testing strategies to maintain service reliability and user experience.

Microservices, Operations, Reliability
0 likes · 15 min read
High Availability Architecture
May 13, 2021 · Operations

Understanding High Availability: Concepts, Metrics, and Design Practices

This article explains high availability in distributed systems, covering its definition, design objectives, key metrics such as MTBF, MTTR, SLA, and practical design elements like redundancy, monitoring, failover, as well as common Q&A on cost, relationship with other architecture attributes, and implementation considerations.

Distributed Systems, Reliability, SLA
0 likes · 13 min read
Dada Group Technology
Apr 9, 2021 · Operations

Design and Implementation of JD Daojia Open Platform Message System (BMQ)

This article explains the architecture, reliability mechanisms, dynamic configuration, monitoring, and alerting strategies of JD Daojia Open Platform's Business Message Queue (BMQ), illustrating how the system handles bidirectional communication, fault isolation, and scalable message processing for merchants.

Backend, Dynamic Configuration, Message Queue
0 likes · 13 min read
ITPUB
Mar 23, 2021 · Databases

13 Redis Best Practices to Save Memory, Boost Performance, Ensure Reliability

This guide presents thirteen practical Redis best‑practice recommendations covering memory optimization, performance tuning, high reliability, routine operations, resource planning, monitoring, and security, offering concrete steps such as key length control, maxmemory configuration, lazy‑free, command selection, connection pooling, replication settings, and safe deployment to help developers and DBA operators run Redis efficiently and safely.

Memory Optimization, Reliability, performance
0 likes · 24 min read
Java Backend Technology
Mar 4, 2021 · Backend Development

How to Guarantee RabbitMQ Message Delivery: Persistence, Confirm, and Idempotency

This article explores common pitfalls in RabbitMQ message delivery, explains persistence settings and the confirm mechanism along with why they alone cannot ensure 100% reliability, and then proposes a robust solution that combines pre‑persisting messages to Redis, confirm callbacks, scheduled retries, and idempotent consumer design to achieve near‑zero loss.

Confirm Mechanism, Message Queue, RabbitMQ
0 likes · 11 min read
Top Architect
Feb 17, 2021 · Backend Development

Integrating RabbitMQ with Spring Boot: Configuration, Message Sending, and Reliability

This article explains how to integrate RabbitMQ into a Spring Boot application, covering dependency setup, connection configuration, message production and consumption, handling complex JSON messages, and ensuring both sending and receiving reliability through publisher confirms, return callbacks, and consumer acknowledgements.

Java, Message Queue, RabbitMQ
0 likes · 12 min read
Xianyu Technology
Feb 5, 2021 · Backend Development

Improving Xianyu Messaging Reliability: Architecture, Issues, and Solutions

The article details how Xianyu’s 2020 messaging failures—lost messages, wrong avatars, and order status errors—were traced to duplicate IDs, push‑logic mismatches, and client bugs, and solved by introducing global UUIDs, ACK‑based retries, hierarchical conversation models, hybrid storage caching, and real‑time monitoring, boosting delivery reliability above 99.9%.

Backend, Messaging, Mobile
0 likes · 12 min read
Full-Stack Internet Architecture
Feb 1, 2021 · Big Data

Kafka Overview: Architecture, Advantages, Disadvantages, and Core Concepts

This article provides a comprehensive introduction to Apache Kafka, covering its distributed publish‑subscribe architecture, its key components such as brokers, topics, partitions, producers, consumers, and ZooKeeper, as well as its advantages, drawbacks, storage mechanisms, partition assignment strategies, and reliability guarantees for high‑throughput big‑data streaming.

Big Data, Distributed Systems, Message Queue
0 likes · 20 min read
21CTO
Jan 26, 2021 · Fundamentals

Mastering the 21 Essential Software Architecture Characteristics

This article explains the twenty‑one key non‑functional characteristics of software architecture—such as performance, reliability, scalability, security and maintainability—detailing their definitions, typical metrics, and practical techniques for improvement, while linking each trait to ISO‑25010 and real‑world engineering practices.

DevOps, Non-functional Requirements, Reliability
0 likes · 20 min read
Code Ape Tech Column
Jan 25, 2021 · Backend Development

Ensuring Zero Message Loss in RabbitMQ: Persistence, Confirm, and Idempotent Strategies

This article examines how to guarantee reliable message delivery in RabbitMQ by using durable queues, the confirm callback mechanism, pre‑persisting messages with Redis or a database, scheduled compensation tasks, and idempotent processing techniques such as optimistic locking and unique‑ID fingerprinting.

Backend, Confirm Callback, Idempotency
0 likes · 11 min read
Code Ape Tech Column
Jan 15, 2021 · Backend Development

Mastering RabbitMQ: From AMQP Basics to High‑Availability Clusters

This article explains RabbitMQ's AMQP fundamentals, exchange types, reliable delivery mechanisms, idempotency strategies, consumer flow control, TTL and dead‑letter handling, as well as clustering, federation, HAProxy and Keepalived solutions for building a resilient messaging infrastructure.

AMQP, Consumer Flow Control, Message Queue
0 likes · 16 min read
Architect
Jan 8, 2021 · Backend Development

Understanding RabbitMQ: AMQP Fundamentals, Exchange Types, Reliability Mechanisms, and High‑Availability Deployment

This article provides a comprehensive overview of RabbitMQ, covering AMQP core concepts, exchange and queue types, message reliability techniques such as confirms and returns, consumer flow‑control, TTL and dead‑letter handling, as well as clustering, federation, and HAProxy/Keepalived high‑availability solutions.

AMQP, Backend, RabbitMQ
0 likes · 16 min read
Alibaba Terminal Technology
Jan 7, 2021 · Frontend Development

How Frontend Chaos Engineering Boosts Reliability: Lessons from Alibaba

This article explores the challenges of frontend availability, introduces chaos engineering concepts, and details Alibaba's practical approach to frontend fault injection—including static resource hijacking, a safe isolated environment, monitoring integration, and a real‑world drill that demonstrates how to measure and improve detection and response capabilities.

Fault Injection, Reliability, chaos engineering
0 likes · 18 min read
Continuous Delivery 2.0
Jan 6, 2021 · Operations

Test In Production (TIP): Microsoft’s Shift‑Right Testing, Fault Injection, and Chaos Engineering Practices

The article explains Microsoft’s Test‑In‑Production (TIP) approach, describing why production is the only true environment, how they use gradual releases, feature flags, telemetry, fault injection, circuit‑breaker testing, and chaos engineering to improve reliability, micro‑service compatibility, and business continuity.

ChaosEngineering, FaultInjection, Microsoft
0 likes · 11 min read
Efficient Ops
Jan 5, 2021 · Operations

Master Site Reliability Engineering: Inside the SRE Foundation Course

The SRE Foundation course introduces site reliability engineering principles, practices, and tools, explaining why perfect reliability is impractical, outlining SRE responsibilities, detailing the curriculum across eight modules, and identifying the diverse professionals—from engineers to managers—who can benefit from mastering reliability, scalability, and automation.

Course, Reliability, SRE
0 likes · 7 min read
Continuous Delivery 2.0
Dec 18, 2020 · Operations

Applying the VALET Model for SRE Transformation at Home Depot (THD)

The article explains how Home Depot (THD) adopted the VALET model—a five‑dimensional SLO language covering Volume, Availability, Latency, Error, and Ticket—to unify communication, automate data collection, and improve reliability across its massive retail and e‑commerce infrastructure.

Operations, Reliability, SLO
0 likes · 9 min read
Wukong Talks Architecture
Dec 3, 2020 · Fundamentals

Explaining UDP vs TCP in Simple Terms

This article uses everyday analogies to compare UDP and TCP, outlines their core characteristics, typical use cases, and describes how to add TCP-like reliability features to UDP such as connection establishment, ordering, loss recovery, flow control, and congestion control.

Flow Control, Networking, Protocols
0 likes · 7 min read
Tencent Cloud Developer
Nov 19, 2020 · Backend Development

Kafka Message Queue Reliability Design and Implementation

The article thoroughly explains Kafka’s message‑queue reliability design and implementation, covering use‑case scenarios, core concepts, storage format, producer acknowledgment settings, broker replication mechanisms (ISR, HW, LEO), consumer delivery semantics, the epoch solution for synchronization, and practical configuration guidelines for various consistency and availability requirements.

Broker, Consistency, Consumer
0 likes · 15 min read
Efficient Ops
Nov 4, 2020 · Operations

Unlocking SRE: Foundations, Principles, and Career Paths Explained

This article clarifies common misconceptions about Site Reliability Engineering, outlines the role’s responsibilities, presents the SRE Foundation course syllabus and target audience, and highlights the GOPS 2020 Global Operations Conference where the training is offered.

DevOps, Reliability, SRE
0 likes · 7 min read
IT Architects Alliance
Oct 22, 2020 · Fundamentals

15 Universal Software Architecture Principles and Key Design Guidelines

The article presents a comprehensive set of fifteen universal software architecture principles—ranging from redundancy and rollback to automation and non‑intrusive design—along with essential design guidelines such as separation of concerns, single responsibility, and low coupling to help architects build scalable, reliable, and maintainable systems.

Reliability, Scalability, Software Architecture
0 likes · 12 min read
JD Cloud Developers
Sep 22, 2020 · Fundamentals

Designing High‑Reliability Storage Systems: Strategies from JD Cloud & Intel

An in‑depth look at how JD Cloud’s high‑reliability storage architecture tackles data reliability challenges—covering replica management, redundancy, detection and repair mechanisms, tiered storage designs, and Intel Optane’s role in boosting performance—offering practical strategies for balancing cost and resilience.

Intel Optane, Reliability, cloud
0 likes · 17 min read
iQIYI Technical Product Team
Sep 11, 2020 · Cloud Native

Chaos Engineering Framework and Practices in iQIYI FinTech Team

The iQIYI FinTech team implemented a Chaos Engineering framework, using a purpose‑driven Chaos Monkey to inject controlled failures, validate high‑availability, isolation, and self‑healing of payment services, derive architectural improvements, build a fault‑case library, and transition from fault detection to proactive system robustness.

Chaos Monkey, Distributed Systems, FinTech
0 likes · 9 min read
DevOps
Aug 13, 2020 · Operations

ByteDance’s Chaos Engineering Journey: Practices, Architecture, and Future Directions

This article outlines ByteDance’s adoption of chaos engineering, describing its background, industry examples, the evolution of internal fault‑injection platforms across three generations, the fault model and center design, experiment principles, and future plans for infrastructure‑level chaos and automated diagnostics.

Distributed Systems, Fault Injection, Observability
0 likes · 21 min read
Architecture Digest
Aug 5, 2020 · Backend Development

Monolith vs Microservices: A Comparative Study of Performance, Complexity, Reliability, and Scalability

This article compares monolithic applications and microservice architectures across dimensions such as network latency, development and operational complexity, reliability, resource consumption, scaling precision, throughput, deployment speed, and team communication, highlighting where each approach wins and offering guidance on when to adopt microservices.

Latency, Microservices, Reliability
0 likes · 12 min read
Zhengtong Technical Team
Jul 14, 2020 · Cloud Computing

Stability Assurance Solutions for an Unattended Parking Cloud SaaS Platform

This article outlines the challenges of scaling parking services to cloud SaaS—including network latency, real‑time processing, and data integrity—and presents a comprehensive stability strategy using MQTT, dual‑network backup, Go‑based process supervision, and edge‑cloud collaboration to achieve high‑availability unattended parking operations.

Go, IoT, MQTT
0 likes · 11 min read
Big Data Technology Architecture
Jun 29, 2020 · Fundamentals

Kafka Storage Mechanism and Reliability Guarantees

This article explains Kafka's internal storage architecture—including topics, partitions, segments, .log and .index files—how data is read, and the various reliability mechanisms such as ISR/OSR, LEO/HW, producer acknowledgment levels, leader election strategies, and delivery semantics.

Kafka, Producer Acks, Reliability
0 likes · 9 min read
Full-Stack Internet Architecture
Jun 18, 2020 · Big Data

Kafka Interview Questions: High Availability, Reliability, Consistency, Performance, and Usage Rationale

This article explains common Kafka interview questions by analyzing the system's high‑availability design, reliability mechanisms, consistency model, performance tricks such as sequential writes and zero‑copy, and the reasons for using Kafka and message queues, providing both conceptual insight and practical details.

Consistency, Distributed Systems, Kafka
0 likes · 12 min read
Programmer DD
May 12, 2020 · Operations

Boost RabbitMQ Reliability: Proven Strategies for Producers, Consumers, and Ops

This comprehensive guide explains how to enhance RabbitMQ reliability by covering confirmation mechanisms, producer and consumer configurations, queue mirroring, alerting, monitoring metrics, and health‑check commands, providing actionable steps for developers and operations teams to ensure stable message delivery.

Message Queue · Operations · RabbitMQ
0 likes · 22 min read
DataFunTalk
Apr 27, 2020 · Operations

ByteDance’s Chaos Engineering Practice and Platform Evolution

This article describes ByteDance’s multi‑generation chaos engineering practice, covering industry background, fault‑injection models, the design of a declarative fault‑center, experiment selection principles, detailed experiment processes, metric classifications, red‑blue war‑game workflows, strong/weak dependency analysis, and future directions for infrastructure‑level chaos engineering.

Fault Injection · Observability · Reliability
0 likes · 21 min read
Liangxu Linux
Apr 17, 2020 · Operations

Essential Bash Scripting Tips for Building Reliable Shell Scripts

This guide presents practical Bash scripting techniques—including strict mode, file locking, graceful termination, timeout handling, and pipeline debugging—to help developers write more robust, maintainable, and error‑resilient shell scripts for automation and system tasks.

Bash · DevOps · Linux
0 likes · 7 min read
High Availability Architecture
Apr 8, 2020 · Operations

Slack's Deployment Process: Balancing Speed and Reliability

This article explains how Slack’s engineering team designs a multi‑stage deployment pipeline—including release branches, staging, dogfood, canary, and percentage rollouts—while emphasizing rapid iteration, visibility, and reliability through fast and atomic deployment mechanisms.

Deployment · Operations · Reliability
0 likes · 8 min read
Architects Research Society
Apr 4, 2020 · Cloud Computing

Immutable Infrastructure: Concepts, Benefits, and Implementation Details

This article explains the difference between mutable and immutable infrastructure, outlines the advantages of immutable architectures such as consistency, reliability, and simplified deployments, and provides practical guidance on implementing immutable infrastructure using cloud environments, automation pipelines, and supporting components.

Deployment Automation · Reliability · Scalability
0 likes · 13 min read
Continuous Delivery 2.0
Apr 3, 2020 · Operations

Scalable and Reliable Configuration Distribution at Facebook

This article explains how Facebook’s Configerator system achieves scalable, reliable configuration distribution using a push model, a hierarchical Zeus tree, Package Vessel for large data, and multi‑repo Git strategies to improve commit throughput and fault tolerance.

Configuration Management · Distributed Systems · Reliability
0 likes · 11 min read
Programmer DD
Mar 23, 2020 · Operations

Mastering Chaos Engineering: Boost Confidence in Distributed Systems

This article explains chaos engineering as a systematic approach to experiment on distributed systems, identifies common failure modes, outlines a four‑step experimentation process, and presents advanced principles to help teams increase reliability and confidence in production environments.

Distributed Systems · Reliability · chaos engineering
0 likes · 7 min read
360 Tech Engineering
Feb 19, 2020 · Operations

Best Practices for Writing Reliable Bash Shell Scripts

This article presents a comprehensive guide to writing reliable Bash shell scripts, covering shebang selection, quoting variables, function encapsulation, readonly constants, variable scope, uninitialized variable protection, tracing, error handling, path management, argument parsing with shift, trap usage, and numerous practical tips for robust script development.

Bash · Reliability · Shell scripting
0 likes · 9 min read
Tencent Cloud Developer
Feb 18, 2020 · Backend Development

Technical Overview of Tencent Cloud CKafka for High-Scale Online Classroom Messaging

Tencent Cloud CKafka powers Tencent Classroom’s pandemic‑era online teaching by replacing a custom queue with a high‑performance, highly available, partition‑based message bus that scales to millions of real‑time interactions, offers configurable replication and tuning for reliability, and integrates with big‑data and streaming tools for analytics.

Backend Development · CKafka · Kafka
0 likes · 15 min read
ITPUB
Jan 10, 2020 · Databases

How One MySQL Instance Impacts Millions: The Hidden Life‑and‑Death Stakes of Open‑Source Software

This article reveals how open‑source software like MySQL powers critical systems—from hospital patient records to massive retail operations—illustrating that a single code error or missing backup can affect tens of millions of people worldwide, underscoring the profound responsibility of open‑source contributors.

Reliability · Software Impact · databases
0 likes · 7 min read
Architecture Digest
Jan 7, 2020 · Backend Development

Ensuring 100% Message Delivery with RabbitMQ: Reliability Steps and Idempotent Design

This article explains how to achieve guaranteed 100% message delivery in RabbitMQ by leveraging its acknowledgment mechanisms, implementing producer‑side confirmation steps, designing compensation and retry logic, and ensuring consumer‑side idempotency through unique identifiers and various ID‑generation strategies.

Distributed Systems · Idempotency · Message Delivery
0 likes · 7 min read
ITPUB
Dec 27, 2019 · Big Data

How Facebook Scaled Entity Ranking from Hive to Spark: Lessons and Performance Gains

Facebook replaced a multi‑stage Hive pipeline for real‑time entity ranking with a single Spark job, applying extensive reliability fixes and performance tweaks that reduced CPU usage by up to six times, cut latency fivefold, and demonstrated the feasibility of shuffling over 90 TB of data in production.

Big Data · Hive · Performance Optimization
0 likes · 16 min read
Alibaba Cloud Native
Dec 4, 2019 · Cloud Native

How Alibaba Supercharged etcd for Double‑11: Performance, Stability, and Management Secrets

Alibaba’s three‑year etcd journey reveals how hardware upgrades, software patches, a custom storage freelist algorithm, client best practices, and an enhanced operator platform collectively boosted etcd’s performance 24‑fold, expanded storage 50‑fold, and hardened stability for massive Double‑11 workloads.

Reliability · cloud-native · etcd
0 likes · 11 min read
Big Data Technology & Architecture
Nov 20, 2019 · Operations

Ensuring Message Reliability, Duplicate Handling, and Transactional Guarantees in Distributed Message Queues

This article explains how to detect and prevent message loss, guarantee reliable delivery across production, storage, and consumption stages, handle duplicate messages with idempotent designs, resolve message backlogs, and implement distributed transactions using transactional messages in systems like Kafka, RocketMQ, and RabbitMQ.

Reliability · Transactional Messaging · duplicate handling
0 likes · 19 min read
Ctrip Technology
Nov 14, 2019 · Operations

Chaos Engineering: Principles, Practices, and Lessons from Ctrip

The article explains Chaos Engineering as a discipline for deliberately injecting failures into distributed systems to uncover hidden weaknesses, outlines its five core principles, describes practical implementation steps and real‑world examples from Ctrip, and discusses future directions for reliability engineering.

Distributed Systems · Fault Injection · Operations
0 likes · 9 min read
Programmer DD
Oct 2, 2019 · Backend Development

How to Build a Reliable, Secure, and Scalable IM Server from Scratch

This article walks through constructing a lightweight instant‑messaging backend, covering version 1.0.0 features, reliability guarantees, application‑level ACK handling, security encryption, database schema for users, relations and offline messages, and storage strategies to prevent duplicate delivery.

Backend Architecture · Instant Messaging · Java
0 likes · 12 min read
DevOps
Sep 16, 2019 · Operations

Netflix Chaos Engineering: Background, Evolution, Tools, Principles, and Practice

This article presents a comprehensive overview of Netflix's chaos engineering journey, detailing its origins, the development of the Simian Army tools, core principles, practical steps, and applications in Kubernetes environments, offering valuable insights for reliable DevOps practices.

DevOps · Kubernetes · Netflix
0 likes · 10 min read
Efficient Ops
Jun 13, 2019 · Operations

How to Build a Future‑Proof Operations Platform with End‑State Architecture

This article explains the challenges of modern large‑scale operations, introduces the end‑state architectural principle, details the system components and safety model, discusses real‑world deployment issues, and looks ahead to future AIOps possibilities, offering practical guidance for building resilient operation platforms.

Reliability · end-state · platform architecture
0 likes · 24 min read
Programmer DD
Jun 7, 2019 · Operations

Why Most Alerts Fail and How to Build Actionable Monitoring

This article explains the fundamental flaws of typical alert systems, distinguishes between business rule and reliability monitoring, outlines essential metrics and strategies for effective alerts, and presents simple yet powerful anomaly‑detection algorithms to ensure alerts are actionable and reduce noise.

Alerting · Operations · Reliability
0 likes · 21 min read
21CTO
Jun 3, 2019 · Backend Development

How Didi Engineered a Scalable Large‑Scale Microservice Framework with Go

In this detailed talk, Didi senior engineer Du Huan explains the challenges of building large microservice frameworks, outlines design principles such as the Rule of Least Power, describes the evolution of service frameworks, and shares concrete implementation techniques and business benefits of Didi's Go‑based platform.

Microservices · Reliability · Service Architecture
0 likes · 29 min read
Didi Tech
May 23, 2019 · Cloud Native

Design Practices for Large‑Scale Microservice Frameworks

In his Go China talk, senior Didi engineer Du Huan outlined the design and implementation of a large‑scale microservice framework that abstracts I/O, injects tracing via protocol hijacking, optimizes timers, and enforces fail‑fast circuit breaking, delivering faster development, higher stability, seamless upgrades, and a unified operating‑system‑like layer for thousands of services.

Go · Reliability · Service Architecture
0 likes · 29 min read
High Availability Architecture
May 21, 2019 · Cloud Native

Integrating Contract Testing and Chaos Engineering for Reliable Microservice Architectures

The article explains how contract testing and chaos engineering can be combined to improve the quality and resilience of microservice systems, describing their principles, practical tools such as Chaos Monkey and ChaosBlade, and detailed experiment steps for validating service reliability in cloud‑native environments.

Microservices · Reliability · contract testing
0 likes · 11 min read
Efficient Ops
Apr 18, 2019 · Fundamentals

What Makes Software Trustworthy? Insights from Huawei Cloud DevCloud

The article explores the concept of trustworthy software, outlines its five key dimensions—safety, reliability, availability, security, and resilience—and describes how Huawei Cloud DevCloud is applying these principles through its open‑source mirror services and secure development practices.

Reliability · Software Engineering · cloud computing
0 likes · 6 min read
JD Tech Talk
Mar 6, 2019 · Fundamentals

Understanding TCP Three‑Way Handshake and Four‑Way Termination

This article explains the essential conditions for TCP communication, details the three‑step handshake and four‑step termination processes with packet‑capture illustrations, and discusses why these sequences ensure reliable connections between a client and a server.

Four-way termination · Network Protocols · Reliability
0 likes · 8 min read
Java Backend Technology
Mar 2, 2019 · Operations

How Alibaba’s ‘MonkeyKing’ Uses Chaos Engineering to Strengthen System Reliability

Alibaba’s MonkeyKing, inspired by Netflix’s Chaos Monkey, employs intentional fault injection—from random node kills to simulated network outages—to test and improve system robustness across IaaS, PaaS, and SaaS layers, offering a comprehensive model for reliability engineering in complex distributed environments.

Alibaba · Distributed Systems · Fault Injection
0 likes · 8 min read
dbaplus Community
Dec 12, 2018 · Backend Development

How to Choose the Right Message Queue: RabbitMQ vs Kafka

This article examines the role of message‑queue middleware in high‑concurrency IM systems, compares popular open‑source options such as ActiveMQ, RabbitMQ, Kafka, RocketMQ and ZeroMQ, and provides a detailed multi‑dimensional framework—including functionality, performance, reliability, operational management, and ecosystem factors—to help engineers select the most suitable queue for their specific business needs.

Kafka · Message Queue · Middleware Selection
0 likes · 28 min read