Tag: reliability


Java Captain
May 23, 2025 · Backend Development

Common Causes of Kafka Message Loss and Mitigation Strategies

This article examines the typical reasons Kafka messages are lost across producers, brokers, and consumers, and provides detailed configuration recommendations and best‑practice solutions to significantly reduce the risk of data loss in distributed streaming systems.

Broker · Configuration · Consumer
0 likes · 15 min read
FunTester
May 19, 2025 · Operations

Chaos Engineering Tools, Theory, and Practices

Chaos engineering, a scientific method for improving system resilience, is explored through an overview of leading tools such as Gremlin, ChaosBlade, Chaos Mesh, Chaos Toolkit, and ChaosMeta, alongside core concepts, real-world case studies, common misconceptions, and the practical value of controlled fault injection in distributed systems.

Chaos Engineering · Distributed Systems · fault injection
0 likes · 12 min read
FunTester
May 16, 2025 · Operations

Chaos Engineering: Evolution, Workflow, Advantages, and Practice Principles

Chaos engineering is a discipline that deliberately injects faults into distributed systems to test and improve resilience, tracing its evolution from Netflix's Chaos Monkey to modern platforms, outlining its operational workflow, benefits, and core principles for reliable system design.

Chaos Engineering · Distributed Systems · SRE
0 likes · 9 min read
Cognitive Technology Team
Apr 30, 2025 · Backend Development

Preventing Message Loss, Duplicate Consumption, and Backlog in RocketMQ: Best Practices and Strategies

This article examines the three major reliability challenges of message queues—loss, duplicate consumption, and backlog—and provides detailed RocketMQ‑specific strategies, including producer acknowledgment, broker replication, idempotent consumer design, monitoring, scaling, and parameter tuning to ensure high‑availability distributed systems.

Backend · Distributed Systems · Message Queue
0 likes · 17 min read
Deepin Linux
Apr 21, 2025 · Fundamentals

TCP Protocol: Overview, Mechanisms, and Practical Usage

This article provides a comprehensive guide to the Transmission Control Protocol (TCP), covering its connection‑oriented design, reliability features, segment structure, three‑way handshake, data transfer process, flow and congestion control, four‑way connection termination, and example C++ socket code for establishing a connection, sending and receiving data, and closing the connection.

C++ Socket · TCP · congestion control
0 likes · 42 min read
FunTester
Mar 23, 2025 · Operations

The Origin, Development, and Future of Chaos Engineering

Chaos engineering, introduced by Netflix in 2011 to proactively inject failures and test system resilience, has evolved over the past decade into a widely adopted practice integrated with SRE, automation, AI, and Kubernetes, offering best‑practice guidelines and future trends for improving distributed system reliability.

Chaos Engineering · Cloud Native · Kubernetes
0 likes · 8 min read
Xiaokun's Architecture Exploration Notes
Mar 9, 2025 · Fundamentals

Unveiling Complete Data Flow Systems: Architecture, Reliability, and Scalability

This article explains how modern data‑intensive applications are built, detailing a complete data‑flow architecture—from API requests, caching, database queries, change capture, search indexing, and message queues—to core system concerns such as reliability, scalability, and maintainability, offering practical insights for architects.

Data Flow · System Architecture · maintainability
0 likes · 10 min read
AntTech
Feb 25, 2025 · Artificial Intelligence

Call for Papers: ISSTA 2025 Workshop on Reliable and Trustworthy Software Systems (Ant Group)

Ant Group invites submissions to its inaugural ISSTA 2025 workshop on reliable, secure, and trustworthy software systems, covering topics such as software reliability, AI-driven testing, model interpretability, LLM verification, runtime analysis, and visualization, with deadlines from March to June 2025.

AI · Call for Papers · ISSTA
0 likes · 5 min read
Architecture Digest
Feb 20, 2025 · Backend Development

Key Considerations and Best Practices for Using Spring Event in Backend Systems

This article explains critical pitfalls, graceful shutdown requirements, startup timing issues, suitable business scenarios, and reliability techniques—including retries, idempotency, and integration with Kafka and MQ—when applying Spring Event for publish‑subscribe patterns in high‑traffic backend services.

Backend · Java · Spring
0 likes · 11 min read
Architecture and Beyond
Feb 6, 2025 · Operations

Analyzing DeepSeek’s Availability Issues and Applying Traditional Internet Reliability Strategies to AIGC

This article examines DeepSeek’s frequent service interruptions, contrasts the inherent reliability challenges of AIGC products with traditional internet applications, and proposes adopting proven isolation, rate‑limiting, and elastic‑scaling techniques to improve AI service availability and user experience.

AIGC · DeepSeek · availability
0 likes · 12 min read
Efficient Ops
Jan 22, 2025 · Operations

Essential Ops Metrics Every Engineer Should Monitor

Operations engineers need to track a comprehensive set of system, application, fault, security, and backup metrics—such as CPU and memory usage, response time, alert counts, incident rates, and recovery objectives—to quickly assess health, anticipate problems, and ensure reliable performance.

Backup and Recovery · Performance Metrics · operations
0 likes · 5 min read
macrozheng
Jan 17, 2025 · Backend Development

Mastering Spring Event: Avoid Pitfalls and Ensure Reliable Publish‑Subscribe

This article shares hard‑won lessons from production incidents and provides practical guidelines—graceful shutdown, proper startup timing, suitable business scenarios, reliability patterns, and idempotent handling—to use Spring Event safely and effectively in Java backend systems.

Backend · Java · Publish-Subscribe
0 likes · 12 min read
IT Architects Alliance
Jan 6, 2025 · Operations

Ensuring High Reliability in Distributed Systems: Redundancy, Fault Detection, Replication, and Resilience Strategies

The article explores how distributed systems achieve high reliability through redundant design, precise fault detection and recovery, data replication and synchronization, coordinated fault tolerance and load balancing, distributed transaction handling, comprehensive monitoring, elastic scaling, security safeguards, and robust disaster‑recovery planning.

Distributed Systems · Monitoring · fault tolerance
0 likes · 18 min read
DaTaobao Tech
Dec 25, 2024 · Operations

Fundamentals of Service Level Agreements (SLA) for Messaging Middleware

The article explains SLA fundamentals for messaging middleware, defining contracts, SLI/SLO relationships, key metrics such as availability, latency and error‑rate, dynamic lifecycle processes, template components, error‑budget calculations, industry benchmarks, internal monitoring practices, a sample SLA draft, and best‑practice recommendations for continuous improvement.

Messaging Middleware · SLA · Service Level Agreement
0 likes · 41 min read
JD Tech Talk
Dec 11, 2024 · Backend Development

Analysis of Message Queue Disorder Issues and Practical Solutions

This article examines the root causes of message queue disorder in distributed systems, illustrates real‑world impacts such as data loss during migration, and presents concrete mitigation strategies including ordered messaging, pre‑processing checks, state‑machine handling, and monitoring to improve system reliability.

Message Queue · backend development · ordering
0 likes · 9 min read
Sanyou's Java Diary
Dec 2, 2024 · Big Data

Understanding Kafka: Core Architecture, Storage, and Reliability Explained

This article provides a comprehensive overview of Kafka, covering its overall structure, key components such as brokers, producers, consumers, topics, partitions, replicas, leader‑follower mechanics, logical and physical storage models, producer and consumer workflows, configuration parameters, partition assignment strategies, rebalancing, log retention and compaction, indexing, zero‑copy transmission, and the reliability concepts that ensure data durability.

Data Streaming · Distributed Systems · Kafka
0 likes · 18 min read
Selected Java Interview Questions
Nov 28, 2024 · Backend Development

Key Considerations and Best Practices for Using Spring Event in Backend Systems

This article explains critical pitfalls and best‑practice guidelines for employing Spring Event in Java backend applications, covering graceful shutdown requirements, event loss during startup, suitable business scenarios, reliability enhancements, retry mechanisms, idempotency, and the relationship between Spring Event and message queues.

Backend · Java · Retry
0 likes · 12 min read
Architecture & Thinking
Nov 28, 2024 · Cloud Native

How to Scale Istio Across Hundreds of Services: Real‑World Strategies & Performance Insights

This article shares practical guidance on rolling out Istio service mesh to over ten business lines, covering selection of pilot projects, benefit analysis using access logs, sidecar injection, performance and resource impact, multi‑region active‑active architecture benefits, and rapid fault‑recovery tactics.

Cloud Native · Istio · Service Mesh
0 likes · 9 min read
DataFunTalk
Nov 25, 2024 · Artificial Intelligence

2024 AI Development Report Summary by Fei‑Fei Li’s Team

The 2024 AI Development Report by Fei‑Fei Li’s team highlights rapid progress in model capabilities, rising training costs, dominant contributions from the US, China and Europe, emerging reliability challenges, and the broad economic, medical, and educational impacts of artificial intelligence.

2024 · AI · Model Training
0 likes · 12 min read
Architecture & Thinking
Oct 10, 2024 · Mobile Development

How Baidu Built a Scalable Android IM SDK for Real‑Time Messaging

This article explains the background, architecture, core processes, and engineering challenges of Baidu's Android instant‑messaging SDK, detailing how the public IM system, long‑connection layer, and modular components enable reliable, real‑time communication across multiple devices.

Android SDK · Instant Messaging · Mobile Development
0 likes · 21 min read