Tagged articles
352 articles
360 Tech Engineering
Aug 2, 2018 · Operations

Online Monitoring Practices for DSP Advertising: Shifting Testing to Production

This article discusses the concept of test right‑shift—moving testing to post‑release production—by detailing a four‑layer online monitoring system for a DSP advertising platform, including interface‑level, UI‑level, revenue, and daily key‑metric monitoring, and shares real‑world incident examples.

DSP advertising · Reliability · Test Right-Shift
0 likes · 8 min read
Meituan Technology Team
Jul 26, 2018 · Backend Development

Evolution of Meituan Delivery System Architecture and Practices

Meituan Delivery’s architecture has progressed from a rapid MVP with coarse services to a scalable, fine‑grained platform comprising fulfillment, operation, and master‑data subsystems, employing reliability engineering, capacity planning, AI‑driven simulation, and location services to ensure high availability, efficiency, and future‑ready scalability.

AI · Big Data · Microservices
0 likes · 16 min read
Architects' Tech Alliance
Jul 22, 2018 · Industry Insights

Why Do SSDs Wear Out? Understanding Flash Memory Lifespan and Reliability

Flash memory’s high performance comes with concerns about reliability, write endurance, and failure rates, so this article explains why flash cells have limited erase cycles, how ECC and LDPC error‑correction extend SSD lifespan, compares SSD vs HDD performance, and outlines factors influencing SSD durability and data recovery.

ECC · LDPC · Reliability
0 likes · 9 min read
Architecture Digest
Jul 19, 2018 · Operations

How to Prevent System Failures: Suspect Third‑Party Services, Guard Consumers, and Strengthen Your Own Service

The article presents practical strategies for avoiding service failures by treating third‑party dependencies as unreliable, designing robust APIs for consumers, and applying solid engineering principles such as degradation plans, timeout settings, traffic control, and resource‑limiting techniques.

Reliability · Resource Management · api-design
0 likes · 16 min read
360 Quality & Efficiency
Jul 9, 2018 · Fundamentals

Reliability Redundancy, Gödel’s Incompleteness, and the Halting Problem: Foundations of Program Analysis

The article explores reliability engineering with redundant systems, explains Gödel’s incompleteness theorem and the halting problem, and introduces program static analysis techniques, illustrating how theoretical foundations guide practical approaches to detecting software defects through approximations and abstract interpretation.

Computability · Gödel · Reliability
0 likes · 8 min read
AntTech
Jul 3, 2018 · Backend Development

Evolution of Financial‑Grade Message Queues at Ant Financial

The article reviews the ten‑year evolution of Ant Financial's message queue, detailing its core reliability, consistency, availability and performance requirements, the architectural mechanisms built to meet them, the shift to pull‑mode and API‑mode designs, and the recent integration of compute capabilities to create a smart data transmission platform.

Big Data · Distributed Systems · Message Queue
0 likes · 13 min read
21CTO
Jun 26, 2018 · Backend Development

Mastering Message Queues: Why Use Them, Pitfalls, Selection, and High‑Availability

This article reviews the essential concepts of message‑queue middleware, covering why they are needed, their drawbacks, how to choose among popular solutions, and practical techniques for ensuring high availability, avoiding duplicate consumption, guaranteeing reliable delivery, and preserving message order.

Reliability · System Design · middleware
0 likes · 19 min read
Youzan Coder
Jun 22, 2018 · Operations

Chaos Engineering: Definition, Principles, and Implementation Steps

Chaos engineering is a disciplined practice that injects controlled faults into distributed systems—often in production—to validate steady-state hypotheses, uncover hidden reliability weaknesses, and continuously improve resilience, as illustrated by the staged implementations and fault-injection techniques used by companies such as JD.com, Youzan, and Netflix.

Fault Injection · Reliability · chaos engineering
0 likes · 11 min read
ITFLY8 Architecture Home
Jun 18, 2018 · Backend Development

Unlocking High-Performance MQ: Lessons from Alibaba, Tencent, and Ctrip

To support Meituan’s rapid growth, this article examines the design and evolution of several industry-leading message-queue solutions, including Alibaba’s Notify and RocketMQ, Tencent’s Tube and Hippo, and Ctrip’s Hermes, highlighting their reliability, scalability, and decoupling features, and extracting key insights for building robust MQ systems.

Message Queue · Reliability · Scalability
0 likes · 7 min read
Tencent Cloud Developer
May 31, 2018 · Backend Development

Tencent Billing System (Mi Master): Architecture, Reliability, Security, and Global Capabilities

The Mi Master billing platform, a SaaS service from Tencent that processes over 735 billion RMB quarterly, provides a unified, modular architecture with distributed multi‑master databases, high‑availability cross‑region deployment, multi‑stage fraud detection, and global support for 80+ payment channels across 180+ countries, delivering seamless APIs, automated reconciliation, and extensive revenue‑sharing tools for products such as Honor of Kings, PUBG, WeChat Pay, and QQ Wallet.

Distributed Systems · Reliability · billing architecture
0 likes · 19 min read
Java Captain
May 30, 2018 · Backend Development

Key Points for Revisiting Message Queue Middleware: Usage, Drawbacks, Selection, High Availability, Idempotency, Reliability, and Ordering

This article reviews essential concepts of message queue middleware, covering why to use it, its drawbacks, selection criteria, high‑availability designs, preventing duplicate consumption, ensuring reliable transmission, and maintaining message order, providing a concise study guide for developers and architects.

Backend · Idempotency · MQ
0 likes · 21 min read
MaGe Linux Operations
Apr 15, 2018 · Fundamentals

Why Does TCP Need a Three‑Way Handshake and a Four‑Way Teardown?

This article explains why TCP requires a three‑way handshake to establish connections and a four‑step termination process, using a relatable video‑call scenario and detailed protocol diagrams to illustrate the underlying mechanisms and practical rules for confirming audio transmission.

Four-way Handshake · Reliability · TCP
0 likes · 10 min read
Didi Tech
Apr 11, 2018 · Backend Development

How to Turn Synchronous RPC into Asynchronous Queues for Reliable Microservices

The article examines the reliability challenges of microservice architectures that rely heavily on synchronous RPC calls, and proposes a comprehensive solution that converts failing RPCs to asynchronous message‑queue workflows, introduces a write‑ahead‑queue for transactional consistency between databases and queues, and outlines offset management to ensure end‑to‑end fault tolerance.

Kafka · Message Queue · Microservices
0 likes · 12 min read
Programmer DD
Apr 6, 2018 · Backend Development

How to Choose the Right Message Middleware: Kafka vs RabbitMQ Deep Dive

This comprehensive guide compares Kafka and RabbitMQ across functional features, performance, reliability, operational management, and ecosystem support, offering practical criteria and pitfalls to help engineers select the most suitable message middleware for their distributed systems.

Message Queue · Middleware Selection · RabbitMQ
0 likes · 30 min read
Efficient Ops
Mar 15, 2018 · Operations

How Baidu’s CCS System Scales Command Execution Across Millions of Servers

This article examines Baidu’s Cluster Control System (CCS), detailing its two‑level data model, four‑tier scheduling architecture, and three‑layer execution agents, and explains how control and execution information, redundancy, and fault‑tolerant designs enable reliable large‑scale command execution across thousands of servers.

Command Execution · Distributed Systems · Operations
0 likes · 12 min read
ITFLY8 Architecture Home
Nov 20, 2017 · Backend Development

Synchronous vs Asynchronous Messaging: When to Use Queues and How They Impact Performance

This article explains the fundamental differences between synchronous and asynchronous communication, explores how message queues decouple producers and consumers, and discusses key considerations such as persistence, performance, reliability, and language support for building robust backend systems.

Message Queue · Reliability · Synchronous
0 likes · 9 min read
21CTO
Sep 26, 2017 · Operations

Why You Should Never Trust Any Component in Your System—and How to Protect It

In programming and operations, every element—from services and dependencies to requests, machines, data centers, power, networks, and humans—can fail unexpectedly, so you must assume distrust and implement defensive measures such as monitoring, redundancy, rate limiting, fallback strategies, backups, and automated deployment.

Operations · Reliability · Security
0 likes · 9 min read
iQIYI Technical Product Team
Sep 15, 2017 · Product Management

What Core Skills Make a Great Product Manager? A Deep Dive

This article offers a systematic analysis of product manager core competencies—including product design, requirement documentation, accessibility, reliability, globalization, positioning, and user feedback—while sharing practical insights drawn from the author's ten‑year industry experience and a recent forced break for reflection.

Product Design · Reliability · accessibility
0 likes · 11 min read
21CTO
Sep 6, 2017 · Operations

How JD’s Self‑Built Data Center Achieves Ultra‑Low Energy Use and High Reliability

JD’s self‑constructed data center in Suqian, Jiangsu, combines innovative free‑cooling, high‑efficiency chillers, redundant power architecture, and advanced monitoring to deliver over 180 days of natural cooling, reduce PUE to ≤1.3, and ensure continuous operation with robust backup systems.

Cooling · Data center · Infrastructure
0 likes · 12 min read
Qunar Tech Salon
Jul 24, 2017 · Backend Development

Addressing Service Decomposition Challenges with Event‑Driven Architecture in Transaction Systems

The article explains how a transaction system evolved from a monolithic application to a service‑oriented design, tackling issues such as RPC‑induced coupling, state explosion, and distributed consistency by introducing reliable event‑driven mechanisms, a core StoreEngine, and an ActorEngine framework.

Event-driven · Reliability · actor-model
0 likes · 8 min read
Architecture Digest
Jun 11, 2017 · Big Data

Kafka High‑Reliability Architecture, Storage Mechanisms, Replication, and Benchmark Analysis

This article explains Kafka's distributed architecture, its topic‑partition storage model, replication and synchronization mechanisms, reliability guarantees such as ISR and high‑watermark, and presents benchmark results that illustrate how replication factor, acks settings, and partition count affect throughput and latency.

Benchmark · Kafka · Reliability
0 likes · 34 min read
360 Zhihui Cloud Developer
Apr 6, 2017 · Operations

How SRE’s Dialectical Thinking Redefines Modern Operations

An insightful reflection on Google’s SRE philosophy shows how dialectical thinking—questioning absolute stability, embracing limited toil, prioritizing simple monitoring, recognizing automation’s hidden risks, and practicing real‑world failure drills—can reshape operations, encouraging smarter, more resilient system design.

Automation · Reliability · SRE
0 likes · 7 min read
Tencent Cloud Developer
Mar 13, 2017 · Cloud Computing

Technical Overview of Tencent Cloud CMQ: Architecture, Reliability, Consistency, and Scalability

Tencent Cloud CMQ is a Raft‑based distributed message queue that delivers high reliability, strong consistency, and horizontal scalability for finance‑grade workloads, using multi‑node broker sets with majority‑acknowledged writes, automatic leader election, unlimited buffering, full‑path tracing, while requiring application‑level idempotency and offering limited strict ordering.

CMQ · Distributed Messaging · Raft
0 likes · 11 min read
Efficient Ops
Feb 14, 2017 · Databases

What Real‑World DBA Lessons Reveal About Database Reliability

The article shares a DBA’s three‑year journey at Ganji, detailing core responsibilities, painful incidents like accidental table deletions and massive Redis growth, and practical lessons on stability, backup, hardware prioritization, business alignment, and improving communication between operations and development teams.

Database Administration · DevOps · Reliability
0 likes · 9 min read
ITPUB
Feb 13, 2017 · Operations

What the GitLab Deletion Teaches About Boosting System Reliability

The article reflects on the GitLab database deletion incident, analyzing how human error, decision fatigue, inadequate backup strategies, and insufficient safeguards exposed reliability gaps, and proposes practical DevOps practices—such as pair operations, diversified redundancy, strict command restrictions, and continuous feedback—to strengthen complex software systems.

Backup · GitLab · HumanFactors
0 likes · 10 min read
21CTO
Jan 4, 2017 · Operations

How to Build Truly High‑Availability Systems: Principles and Practices

This article explains what high availability means for distributed systems, outlines common availability tiers, and describes how redundancy, load balancing, and automatic failover across a typical Internet architecture can achieve reliable, scalable services.

Distributed Systems · Operations · Reliability
0 likes · 6 min read
Qunar Tech Salon
Dec 1, 2016 · Backend Development

How to Prevent Service Failures: Suspect Third‑Party, Guard Users, and Perfect Your Own Service

The article shares practical strategies for preventing service failures by doubting third‑party services, protecting against misuse by consumers, and improving one’s own code and architecture, covering fallback plans, timeout settings, retry policies, API design, traffic control, and resource limits.

API-design · Operations · Reliability
0 likes · 16 min read
Architects' Tech Alliance
Aug 4, 2016 · Information Security

Reliability and High Availability of Backup Software Systems

This article examines how backup software ensures enterprise data reliability through media redundancy, server failover, load balancing, and both cold and high‑availability solutions for the management server, highlighting technologies such as GridStor, dual‑array clustering, and deduplication.

Backup · Data Protection · Reliability
0 likes · 11 min read
Architecture Digest
Jul 6, 2016 · Backend Development

Designing a Message Queue: Key Considerations and Architecture

The article explains why and when to use message queues, then walks through designing one from scratch, covering decoupling, eventual consistency, broadcast, flow control, RPC protocols, high availability, storage choices, consumer relationships, reliable delivery, transactions, performance optimizations, and push versus pull models.

Asynchronous · Backend Architecture · Distributed Systems
0 likes · 35 min read
Meituan Technology Team
Jul 1, 2016 · Backend Development

Designing a Custom Message Queue: Principles and Practices

The article outlines how to design a custom message queue by first identifying appropriate use‑cases such as decoupling, eventual consistency, broadcasting and peak‑shaving, then examining push versus pull models, high‑availability, ordering, duplicate handling, storage choices, batch processing, flow‑control and performance optimizations, with advanced topics reserved for a follow‑up.

Design · Message Queue · Reliability
0 likes · 34 min read

LinkedIn’s Kafka at Scale: Architecture, Optimizations, and Operational Practices

The article details how LinkedIn has scaled Kafka from handling billions to trillions of messages daily, describing quota enforcement, a ZooKeeper‑free consumer, reliability enhancements, security plans, monitoring frameworks, fault‑injection testing, cluster balancing, and integration with other internal data systems.

Big Data · Kafka · LinkedIn
0 likes · 12 min read
ITPUB
Nov 25, 2015 · Operations

Why Meizu Adopted Multi‑Data‑Center Deployment and How It Works

Meizu moved from a single‑datacenter to a multi‑datacenter architecture to improve reliability, reduce latency, and meet user proximity demands, detailing technical challenges, traffic scheduling, read‑heavy and read‑write balanced services, and GSLB‑based routing solutions.

GSLB · Reliability · Traffic Scheduling
0 likes · 10 min read
21CTO
Sep 30, 2015 · Operations

How LinkedIn Scaled Kafka to Process Over 1 Trillion Messages Daily

Since 2011, LinkedIn has expanded its Kafka deployment from handling billions to over a trillion messages per day, focusing on quotas, a new ZooKeeper‑free consumer, reliability enhancements, security, monitoring frameworks, fault‑injection testing, cluster balancing, and ecosystem integrations, offering valuable lessons for large‑scale streaming systems.

Kafka · LinkedIn · Reliability
0 likes · 12 min read
MaGe Linux Operations
Jul 7, 2015 · Operations

Why Ops Tools Are Far More Complex Than You Think

The article reveals how operating‑tool systems, often underestimated, demand high technical rigor to ensure automation success rates and absolute reliability for emergency actions, requiring sophisticated failure handling, capacity awareness, and scalable design—challenges comparable to core online services.

Automation · Operations · Reliability
0 likes · 7 min read
Qunar Tech Salon
Oct 31, 2014 · Operations

Simple Testing Can Prevent Most Critical Failures: Findings from an Analysis of Five Open‑Source Distributed Systems

A recent study of five major open‑source distributed systems reveals that most failures can be triggered and reproduced with simple, multi‑event tests, highlighting the importance of systematic testing, deterministic error handling, and concise logging for reliable system operation.

Bug Analysis · Distributed Systems · Reliability
0 likes · 6 min read
MaGe Linux Operations
Jul 16, 2014 · Cloud Computing

Why Modern Cloud Storage Is Getting So Complex—and How Qiniu Solved It

From the evolution of single‑machine file systems to today’s distributed, erasure‑coded cloud storage, this article examines why storage has become increasingly complex, the limitations of traditional replication, and how Qiniu’s next‑gen architecture leverages EC, faster repairs, and cost reductions to meet scalability, reliability, and availability demands.

Reliability · Scalability · cloud storage
0 likes · 15 min read