Tagged articles
352 articles
Su San Talks Tech
May 18, 2026 · Artificial Intelligence

How to Guarantee Reliable Function Calling in LLM Agents

The article breaks down the reliability challenges of LLM Function Calling, categorizes five failure modes, and presents concrete engineering safeguards such as precise schema design, tool description, constraint enforcement, few‑shot calibration, structured output, validation‑feedback loops, monitoring, and risk‑aware trade‑offs.

Function Calling · JSON Schema · LLM
0 likes · 17 min read
ZhiKe AI
May 17, 2026 · Artificial Intelligence

The Harsh Truth About AI Agents: 80% Show ROI, Yet 88% Never Reach Production

While 80% of enterprises report measurable ROI from AI Agents, 88% of projects never leave the lab; the article examines real‑world case studies, reliability gaps, cost overruns, and emerging tooling that together define the current promise and pitfalls of production‑grade AI Agents.

AI Agents · Claude Code · Cost Overrun
0 likes · 10 min read
21CTO
May 10, 2026 · Industry Insights

Why GitHub’s Reliability Issues Are Driving Users Away

GitHub’s uptime has fallen sharply, with hundreds of incidents—including dozens of major outages—largely fueled by AI‑driven code generation, prompting high‑profile users to migrate, leadership to prioritize availability, and a costly overhaul of capacity and architecture.

AI-driven development · GitHub · GitHub Actions
0 likes · 11 min read
FunTester
Apr 28, 2026 · Operations

How Self‑Healing Automation Platforms Transform SRE Practices

The article explains how a self‑healing platform improves SRE reliability by reducing MTTR, preserving the error budget, automating remediation of high‑impact incidents, enforcing safety guardrails, and shifting team focus from firefighting to sustainable reliability engineering.

Automation · Error Budget · MTTR
0 likes · 10 min read
AI Tech Publishing
Apr 25, 2026 · Artificial Intelligence

A Comprehensive Guide to Harness Engineering for Reliable AI Agents

This article systematically breaks down Harness Engineering—a framework that organizes large models, context, tools, state, sandboxing, security, and evaluation into a reliable AI agent engineering system, showing how to move agents from demo to production.

AI Agents · Context management · Harness Engineering
0 likes · 21 min read
ZhiKe AI
Apr 17, 2026 · Artificial Intelligence

Claude Opus 4.7 Boosts Programming Performance by 11% – Why Its ‘No’ Makes It More Reliable

Claude Opus 4.7 raises SWE‑bench Pro accuracy from 53.4% to 64.3% (a +11 pp jump), triples visual resolution, can refuse or verify dubious instructions, and keeps pricing unchanged while increasing token consumption, positioning it as a more reliable AI colleague despite a slight dip in long‑document search.

AI benchmarking · Claude Opus · Reliability
0 likes · 8 min read
AI Waka
Apr 14, 2026 · Artificial Intelligence

From Prompt Chains to Python State Machines: Evolving Production‑Grade AI Orchestration

This article chronicles three generations of production‑grade AI orchestration—from fragile Claude Code skill chains, through adversarial sub‑agent pipelines with explicit judges, to a deterministic Python state‑machine built on the Claude Agent SDK—highlighting how structured control flow, task splitting, and budget enforcement dramatically improve reliability over raw prompt‑driven workflows.

AI orchestration · Claude Agent SDK · LLM
0 likes · 19 min read
AI Explorer
Apr 4, 2026 · Artificial Intelligence

Can GPT-3-Powered Robots Achieve 99% Success? Inside Sia’s GEN-1 Breakthrough

Sia’s GEN-1 robot, powered by a GPT-3-style large language model, claims a jump in task-success rate from 64% to 99%, signaling a shift from simple perception-execution to cognitive decision-making, while the article scrutinizes the definition of success, cost, safety, and industry impact.

AI integration · GPT-3 · Reliability
0 likes · 6 min read
DevOps Coach
Mar 26, 2026 · Industry Insights

Which DevOps Metrics Will Drive Business Success by 2026?

The article analyzes how traditional DevOps activity metrics are being replaced by outcome‑focused indicators that directly affect cost, delivery speed, reliability and overall business performance, citing New Relic and Flexera forecasts and outlining the metrics teams should adopt or discard by 2026.

DevOps · DoRA · FinOps
0 likes · 13 min read
Architect
Feb 1, 2026 · Artificial Intelligence

How OpenClaw Makes AI Agents Reliable: Inside Its Architecture and Engineering Secrets

This article dissects OpenClaw’s architecture, revealing how a TypeScript CLI process, a gateway server, lane‑queue concurrency, structured memory, tool‑execution allowlists, and semantic browser snapshots combine to turn fragile AI agents into stable, observable, and controllable systems.

AI Agents · Memory Management · Reliability
0 likes · 20 min read
DevOps Coach
Jan 27, 2026 · Backend Development

7 Essential Kafka Design Patterns Every Engineer Should Master

This guide presents seven practical Kafka design patterns—single‑key single‑write, log compaction, multi‑consumer‑group fan‑out, retry and dead‑letter topics, exactly‑once processing with Streams, schema evolution with Avro, and choreography vs orchestration—detailing when to use each, core principles, code examples, tips, common pitfalls, and final recommendations for building reliable, observable, and maintainable event‑driven systems.

Design Patterns · Event Streaming · Kafka
0 likes · 9 min read
Tech Freedom Circle
Jan 18, 2026 · Interview Experience

How to Achieve Zero P4 Incidents for a Year – A Complete Interview Framework

The article presents a systematic BAR (Background‑Action‑Result) framework for answering the interview question about maintaining a full year of zero P4‑level faults, covering fault‑grade definitions, a three‑layer protection strategy, concrete tooling (Sentinel, SkyWalking, ChaosBlade, etc.), quantitative results, and a set of high‑frequency follow‑up questions to showcase deep technical expertise.

Kubernetes · Microservices · Reliability
0 likes · 23 min read
Tencent Cloud Developer
Jan 7, 2026 · Artificial Intelligence

How Context Engineering Powers the Next Generation of AI Agents

As AI systems move from simple chatbots to sophisticated agents, expanding context becomes a core variable. This article details the evolution from prompt engineering to context engineering, the challenges of managing growing context, and practical solutions such as structured context, tool integration, and the MCP framework for reliable AI systems.

Agent · LLM · Reliability
0 likes · 20 min read
Ops Development Stories
Dec 31, 2025 · Operations

12 Major 2025 Internet Outages: What Every Ops Team Can Learn

This article compiles twelve high‑profile internet service failures from 2025, detailing each incident’s description, micro‑scenario, technical root cause, and risk perspective, and extracts actionable lessons on infrastructure resilience, change management, and security‑aware operations.

Internet Outages · Operations · Reliability
0 likes · 20 min read
IT Services Circle
Dec 25, 2025 · Fundamentals

Why Does TCP Need a Three‑Way Handshake? Unpacking the Connection Ritual

This article explains the three‑step TCP handshake, detailing how SYN, SYN‑ACK, and ACK packets establish a reliable connection, why two‑step handshakes are unsafe, why a four‑step process is unnecessary, and how the protocol ensures ordered, secure data transmission.

Connection establishment · Network Protocols · Reliability
0 likes · 9 min read
NiuNiu MaTe
Dec 23, 2025 · Fundamentals

Why TCP Needs a Three‑Way Handshake: The Secret Behind Reliable Connections

TCP’s three‑way handshake is a carefully designed three‑step “social ritual” that establishes a reliable connection by exchanging SYN, SYN‑ACK, and ACK packets, each carrying sequence numbers and flags to confirm readiness, prevent “ghost” connections, and ensure ordered, secure data transmission.

Connection establishment · Reliability · Sequence numbers
0 likes · 10 min read
Architect
Dec 18, 2025 · Backend Development

Why Graceful Shutdown Is Essential for Spring Event and How to Avoid Common Pitfalls

This article shares hard‑learned production experience on using Spring Event, explaining why services must shut down gracefully before publishing events, how startup timing can cause event loss, which business scenarios fit the publish‑subscribe model, and practical reliability techniques such as retries and idempotency.

Event · Java · PubSub
0 likes · 11 min read
DevOps Coach
Dec 8, 2025 · Operations

How to Quantify SRE ROI: Turning Reliability Metrics into Business Value

This article explains how SRE leaders can bridge the gap between technical reliability metrics and business outcomes by defining core SRE concepts, applying a step‑by‑step ROI formula, illustrating code‑level impact, avoiding common pitfalls, and looking ahead to AI‑driven reliability forecasting.

BusinessValue · Operations · ROI
0 likes · 10 min read
dbaplus Community
Nov 18, 2025 · Backend Development

How to Guarantee 100% No Message Loss in Distributed MQ Systems

Ensuring that messages never disappear in a distributed MQ system requires a three‑pronged strategy covering production, storage, and consumption, with proper ACK configurations, local message tables, replication settings, and manual offset commits to achieve reliable, at‑least‑once processing without data loss.

Backend · Kafka · MQ
0 likes · 11 min read
Ray's Galactic Tech
Nov 17, 2025 · Backend Development

How to Reach Millisecond Consistency for Million‑Scale Transactions with RocketMQ

This article explains how to use RocketMQ's transactional messages and an atomic‑level wrapper to achieve sub‑second eventual consistency for million‑scale transaction systems, detailing the two‑phase commit workflow, annotation‑driven implementation, performance optimizations, failure handling, monitoring, and suitable use cases.

Distributed Transactions · Java · Reliability
0 likes · 11 min read
DevOps Coach
Nov 10, 2025 · Operations

How to Use SRE Metrics for Data‑Driven Reliability and Faster Releases

This guide explains the SRE framework—SLA, SLO, SLI hierarchy, golden signals, error budgets, and DORA metrics—showing how to instrument a Python app with OpenTelemetry, query Prometheus, avoid common pitfalls, and adopt a cultural and technical process that balances feature velocity with system stability.

DoRA · Error Budget · Golden Signals
0 likes · 18 min read
Model Perspective
Nov 8, 2025 · Operations

How Mathematical Modeling Powers China’s New Fujian Aircraft Carrier

From its 2018 construction kickoff to its 2025 commissioning, the Fujian aircraft carrier’s development showcases a timeline of milestones, while the article delves into the critical mathematical models—covering electromagnetic launch, energy storage, fluid dynamics, stability, scheduling, radar, and reliability—that underpin its design and operation.

Aircraft Carrier · Electromagnetic Launch · Hydrodynamics
0 likes · 18 min read
Continuous Delivery 2.0
Nov 4, 2025 · Operations

Google's STAMP Framework: Redefining SRE for AI‑Driven Systems

Google’s SRE team is shifting from traditional error‑budget approaches to the STAMP (Systems-Theoretic Accident Model and Processes) framework, applying control theory and system‑level analysis to manage the growing complexity of AI‑powered services, improve safety, and proactively prevent hazardous states.

AI · Reliability · SRE
0 likes · 12 min read
DevOps Coach
Oct 28, 2025 · Cloud Native

20 Essential Kubernetes Tips to Boost Security, Reliability, and Manageability

This guide presents twenty practical Kubernetes best‑practice tips covering productivity shortcuts, resource limits, health probes, node draining, PodDisruptionBudgets, RBAC hardening, read‑only ConfigMaps/Secrets, non‑root containers, network policies, image version pinning, secret rotation, centralized logging, etcd backups, resource cleanup, and secure access methods.

Cluster Management · DevOps · Kubernetes
0 likes · 8 min read
Su San Talks Tech
Oct 28, 2025 · Backend Development

How to Prevent MQ Message Loss: 5 Proven Strategies for Reliable Messaging

Discover the three stages where MQ messages can be lost and explore five practical solutions—including producer confirmations, message persistence, consumer acknowledgments, transactional messaging, and retry with dead‑letter queues—complete with code examples and guidance on selecting the right approach for different scenarios.

Dead Letter Queue · Kafka · Message Queue
0 likes · 14 min read
Architecture Digest
Oct 12, 2025 · Backend Development

Zero‑Loss RabbitMQ: Publisher Confirms, Persistence & Manual ACK

Learn how to prevent message loss in RabbitMQ by addressing three critical failure points—producer‑to‑broker, broker storage, and broker‑to‑consumer—using publisher confirms, durable queues with persistent messages, cluster mirroring, and manual consumer acknowledgments, complete with Java code examples.

Java · Message Queue · Persistence
0 likes · 11 min read
DevOps Coach
Oct 2, 2025 · Interview Experience

Top 10 SRE Interview Questions & Answers to Ace Your Next Interview

This article compiles ten essential Site Reliability Engineering interview questions covering incident command systems, shell types, browser request flow, SSH, error budgets, toil reduction, Linux boot process, QUIC benefits, UDP VPN usage, and common enterprise network protocols, providing concise answers to help you prepare effectively.

DevOps · Operations · Reliability
0 likes · 10 min read
Su San Talks Tech
Sep 23, 2025 · Backend Development

How to Guarantee 100% Message Delivery with Kafka: Interview‑Ready Strategies

This article dissects Kafka’s storage architecture, identifies loss points in production, storage, and consumption phases, and presents interview‑ready strategies—including acks settings, flush tuning, consumer batch commits, detection via sequence numbers, and transactional messaging—to guarantee virtually 100 % message durability.

Consumer Commit · Kafka · Reliability
0 likes · 20 min read
Raymond Ops
Sep 17, 2025 · Fundamentals

Why UDP Is the Wild West of Internet Protocols and How TCP Tames It

This article compares UDP and TCP by using vivid analogies, explaining UDP's connectionless, fast but unreliable nature and TCP's reliable, connection‑oriented handshake and termination processes, while highlighting their respective advantages, drawbacks, and typical real‑time application scenarios.

Connectionless · Handshake · Network Protocols
0 likes · 10 min read
AI Large Model Application Practice
Sep 8, 2025 · Artificial Intelligence

How to Build Reliable, High‑Performance AI Services in Enterprise Applications

When integrating generative AI into existing enterprise systems, architects must address reliability, performance, and security by applying patterns such as circuit breakers, retries with exponential backoff, asynchronous processing, caching, request hedging, input/output guards, sandboxes, and security proxies to ensure continuous, fast, and safe AI‑driven functionality.

AI integration · Asynchronous · Reliability
0 likes · 18 min read
Cognitive Technology Team
Aug 24, 2025 · Fundamentals

Why TCP’s Three‑Way Handshake and Four‑Way Teardown Matter for Reliable Networks

Understanding TCP’s three‑way handshake and four‑way termination reveals how reliable connections are established and gracefully closed, highlighting the protocol’s core mechanisms—sequence numbers, acknowledgments, flow control, and TIME‑WAIT—while also addressing performance considerations, optimization techniques, and the future impact of emerging protocols like QUIC.

Handshake · Networking · Reliability
0 likes · 12 min read
Big Data Technology Tribe
Jul 9, 2025 · Backend Development

Mastering Idempotency: Design Patterns & Best Practices for Reliable Distributed Systems

This comprehensive guide explains the concept of idempotency, why it is essential in distributed and micro‑service architectures, and provides practical patterns, code examples, and best‑practice recommendations for HTTP, databases, messaging, caching, and service‑mesh implementations.

Backend · Design Patterns · Distributed Systems
0 likes · 21 min read
FunTester
Jun 27, 2025 · Fundamentals

Mastering TCP Retransmission: Boost Your Testing Efficiency

This article explains the core principles of TCP's retransmission mechanisms, outlines four common strategies, discusses how high retransmission rates indicate network or server issues, and provides practical methods for test engineers to diagnose, monitor, and optimize TCP reliability in performance testing scenarios.

Performance Optimization · Reliability · TCP
0 likes · 11 min read
TAL Education Technology
Jun 23, 2025 · Operations

How Chaos Engineering Boosts System Resilience: A Practical Guide

This article explains what Chaos Engineering is, why it matters for modern distributed systems, outlines a step‑by‑step approach to designing and running effective chaos experiments, describes platform features, and shares a real‑world case study of a pre‑launch blind test.

Distributed Systems · Reliability · Resilience Testing
0 likes · 9 min read
Architecture & Thinking
Jun 18, 2025 · Cloud Native

How Outlier Detection in Service Mesh Boosts Service Reliability

This article explains the concept, implementation principles, configuration details, and common use cases of Outlier Detection in Service Meshes, showing how it isolates faulty instances to improve stability, performance, and automated operations in cloud‑native environments.

Cloud Native · Microservices · Reliability
0 likes · 6 min read
Pan Zhi's Tech Notes
Jun 16, 2025 · Backend Development

How RocketMQ Guarantees No Message Loss, Duplication, or Disorder

This article explains RocketMQ’s architecture, the roles of NameServer, Broker, Producer, Consumer, and how each component ensures reliable message delivery—covering synchronous, asynchronous, and one‑way sending, storage mechanisms, consumer retries, dead‑letter queues, installation steps, and Java client integration with code examples.

Distributed Systems · Installation · Java
0 likes · 20 min read
Instant Consumer Technology Team
Jun 5, 2025 · Big Data

Mastering Kafka in Production: Boost Throughput, Ensure Reliability, and Avoid Data Loss

This article shares practical Kafka production insights, covering architecture overview, producer throughput tuning, message loss prevention, broker and consumer configurations, duplicate consumption avoidance, backlog mitigation, ordering guarantees, and the mechanics of consumer group rebalancing, helping engineers build stable, high‑performance streaming pipelines.

Big Data · Kafka · Message Queue
0 likes · 15 min read
Java Captain
May 23, 2025 · Backend Development

Common Causes of Kafka Message Loss and Mitigation Strategies

This article examines the typical reasons Kafka messages are lost across producers, brokers, and consumers, and provides detailed configuration recommendations and best‑practice solutions to significantly reduce the risk of data loss in distributed streaming systems.

Broker · Configuration · Consumer
0 likes · 15 min read
FunTester
May 19, 2025 · Operations

Chaos Engineering Tools, Theory, and Practices

Chaos engineering, a scientific method for improving system resilience, is explored through an overview of leading tools such as Gremlin, ChaosBlade, Chaos Mesh, Chaos Toolkit, and ChaosMeta, alongside core concepts, real-world case studies, common misconceptions, and the practical value of controlled fault injection in distributed systems.

Distributed Systems · Fault Injection · Reliability
0 likes · 12 min read
FunTester
May 16, 2025 · Operations

Chaos Engineering: Evolution, Workflow, Advantages, and Practice Principles

Chaos engineering is a discipline that deliberately injects faults into distributed systems to test and improve resilience, tracing its evolution from Netflix's Chaos Monkey to modern platforms, outlining its operational workflow, benefits, and core principles for reliable system design.

Distributed Systems · Fault Injection · Operations
0 likes · 9 min read
Liangxu Linux
May 5, 2025 · Fundamentals

Why UDP Is the Speedy Rogue and TCP the Polite Gentleman

This article uses vivid analogies to compare UDP’s fast, connection‑less, unordered packet delivery with TCP’s reliable, connection‑oriented handshake, flow control, and ordered transmission, outlining each protocol’s characteristics, advantages, drawbacks, and typical real‑time applications such as live streaming, gaming, and video calls.

Internet · Networking · Protocols
0 likes · 10 min read
FunTester
Mar 23, 2025 · Operations

The Origin, Development, and Future of Chaos Engineering

Chaos engineering, introduced by Netflix in 2011 to proactively inject failures and test system resilience, has evolved over the past decade into a widely adopted practice integrated with SRE, automation, AI, and Kubernetes, offering best‑practice guidelines and future trends for improving distributed system reliability.

Kubernetes · Reliability · SRE
0 likes · 8 min read
Xiaokun's Architecture Exploration Notes
Mar 9, 2025 · Fundamentals

Unveiling Complete Data Flow Systems: Architecture, Reliability, and Scalability

This article explains how modern data‑intensive applications are built, detailing a complete data‑flow architecture—from API requests, caching, database queries, change capture, search indexing, and message queues—to core system concerns such as reliability, scalability, and maintainability, offering practical insights for architects.

Data Flow · Reliability · Scalability
0 likes · 10 min read
Architecture and Beyond
Feb 6, 2025 · Operations

Analyzing DeepSeek’s Availability Issues and Applying Traditional Internet Reliability Strategies to AIGC

This article examines DeepSeek’s frequent service interruptions, contrasts the inherent reliability challenges of AIGC products with traditional internet applications, and proposes adopting proven isolation, rate‑limiting, and elastic‑scaling techniques to improve AI service availability and user experience.

AIGC · Availability · DeepSeek
0 likes · 12 min read
Efficient Ops
Jan 22, 2025 · Operations

Essential Ops Metrics Every Engineer Should Monitor

Operations engineers need to track a comprehensive set of system, application, fault, security, and backup metrics—such as CPU and memory usage, response time, alert counts, incident rates, and recovery objectives—to quickly assess health, anticipate problems, and ensure reliable performance.

Reliability · backup and recovery · performance metrics
0 likes · 5 min read
macrozheng
Jan 17, 2025 · Backend Development

Mastering Spring Event: Avoid Pitfalls and Ensure Reliable Publish‑Subscribe

This article shares hard‑won lessons from production incidents and provides practical guidelines—graceful shutdown, proper startup timing, suitable business scenarios, reliability patterns, and idempotent handling—to use Spring Event safely and effectively in Java backend systems.

Backend · Event · Idempotence
0 likes · 12 min read
Ops Development Stories
Jan 16, 2025 · Operations

How AI is Transforming Site Reliability Engineering (SRE)

This article examines how the rapid rise of AI reshapes Site Reliability Engineering by enhancing monitoring, automating operations, improving fault diagnosis, and presenting new challenges such as data security and model explainability, ultimately driving more efficient and reliable system management.

AI · Reliability · SRE
0 likes · 21 min read
Alibaba Cloud Developer
Jan 8, 2025 · Cloud Native

Ensuring Massive Kubernetes Cluster Stability: Lessons from the OpenAI Outage

Using the recent OpenAI service disruption as a case study, this article examines the stability challenges of large‑scale Kubernetes deployments and details how Alibaba Cloud Container Service and its Prometheus‑based observability solutions enhance reliability through high‑availability architecture, optimized exporters, out‑of‑band data links, and best‑practice guidelines.

Alibaba Cloud · Large-Scale Clusters · Observability
0 likes · 22 min read
IT Architects Alliance
Jan 6, 2025 · Operations

Ensuring High Reliability in Distributed Systems: Redundancy, Fault Detection, Replication, and Resilience Strategies

The article explores how distributed systems achieve high reliability through redundant design, precise fault detection and recovery, data replication and synchronization, coordinated fault tolerance and load balancing, distributed transaction handling, comprehensive monitoring, elastic scaling, security safeguards, and robust disaster‑recovery planning.

Reliability · fault tolerance · monitoring
0 likes · 18 min read
DaTaobao Tech
Dec 25, 2024 · Operations

Fundamentals of Service Level Agreements (SLA) for Messaging Middleware

The article explains SLA fundamentals for messaging middleware, defining contracts, SLI/SLO relationships, key metrics such as availability, latency and error‑rate, dynamic lifecycle processes, template components, error‑budget calculations, industry benchmarks, internal monitoring practices, a sample SLA draft, and best‑practice recommendations for continuous improvement.

Messaging Middleware · Operations · Reliability
0 likes · 41 min read
dbaplus Community
Dec 16, 2024 · Operations

How Qunar Built a 5‑Million‑Metric Radar System to Cut Ticket Failures by 87%

This article details the design, implementation, and results of Qunar's intelligent ticket‑monitoring Radar system, covering the business need, architecture, anomaly‑detection algorithms, test‑set construction, parameter tuning, and the achieved 87% detection accuracy with future plans for large‑model integration.

Operations · Reliability · anomaly detection
0 likes · 17 min read
JD Tech Talk
Dec 11, 2024 · Backend Development

Analysis of Message Queue Disorder Issues and Practical Solutions

This article examines the root causes of message queue disorder in distributed systems, illustrates real‑world impacts such as data loss during migration, and presents concrete mitigation strategies including ordered messaging, pre‑processing checks, state‑machine handling, and monitoring to improve system reliability.

Distributed Systems · Message Queue · Reliability
0 likes · 9 min read
Sanyou's Java Diary
Dec 2, 2024 · Big Data

Understanding Kafka: Core Architecture, Storage, and Reliability Explained

This article provides a comprehensive overview of Kafka, covering its overall structure, key components such as brokers, producers, consumers, topics, partitions, replicas, leader‑follower mechanics, logical and physical storage models, producer and consumer workflows, configuration parameters, partition assignment strategies, rebalancing, log retention and compaction, indexing, zero‑copy transmission, and the reliability concepts that ensure data durability.

Data Streaming · Distributed Systems · Kafka
0 likes · 18 min read
Selected Java Interview Questions
Nov 28, 2024 · Backend Development

Key Considerations and Best Practices for Using Spring Event in Backend Systems

This article explains critical pitfalls and best‑practice guidelines for employing Spring Event in Java backend applications, covering graceful shutdown requirements, event loss during startup, suitable business scenarios, reliability enhancements, retry mechanisms, idempotency, and the relationship between Spring Event and message queues.

Backend · Event-Driven Architecture · Java
0 likes · 12 min read
Architecture & Thinking
Nov 28, 2024 · Cloud Native

How to Scale Istio Across Hundreds of Services: Real‑World Strategies & Performance Insights

This article shares practical guidance on rolling out Istio service mesh to over ten business lines, covering selection of pilot projects, benefit analysis using access logs, sidecar injection, performance and resource impact, multi‑region active‑active architecture benefits, and rapid fault‑recovery tactics.

Cloud Native · Istio · Microservices
0 likes · 9 min read
DataFunTalk
Nov 25, 2024 · Artificial Intelligence

2024 AI Development Report Summary by Fei‑Fei Li’s Team

The 2024 AI Development Report by Fei‑Fei Li’s team highlights rapid progress in model capabilities, rising training costs, dominant contributions from the US, China and Europe, emerging reliability challenges, and the broad economic, medical, and educational impacts of artificial intelligence.

2024 · AI · Economic Impact
0 likes · 12 min read
JD Cloud Developers
Nov 14, 2024 · Artificial Intelligence

Boosting Advertising Image Generation Reliability with Human Feedback

This paper presents a multimodal Reliable Feedback Network (RFNet) and a consistency regularization method that use human feedback to dramatically improve the usability and visual quality of automatically generated e‑commerce advertising images while reducing manual inspection costs.

AI · Human Feedback · Reliability
0 likes · 9 min read
Tencent Cloud Middleware
Oct 30, 2024 · Backend Development

How Kafka Guarantees High Reliability and Performance – A Deep Technical Dive

This article thoroughly examines Apache Kafka’s architecture, covering its macro components, ack strategies, replication mechanisms, high‑watermark handling, leader election, and performance optimizations such as batch sending, compression, PageCache, zero‑copy, mmap and sendfile, while also explaining common pitfalls like data loss and log corruption.

Distributed Systems · Kafka · Message Queue
0 likes · 31 min read
MaGe Linux Operations
Oct 7, 2024 · Operations

Why Choose RocketMQ? Features, Comparisons, and Reliability Explained

This article provides a comprehensive overview of RocketMQ, covering its architecture, key features such as high reliability, low latency and high throughput, comparisons with Kafka, RabbitMQ and ActiveMQ, and detailed mechanisms that ensure message durability, performance, and ordered consumption.

Distributed Systems · Low latency · Message Queue
0 likes · 12 min read
Architect
Sep 30, 2024 · Operations

Automated Resource Balancing and Migration for Redis Clusters

The article describes how an automated resource‑balancing system continuously monitors Redis host memory usage, selects optimal nodes, safely migrates them through a multi‑step process (adding slaves, verifying replication, promoting masters, deleting old nodes), and provides task management and notification features to maintain high availability and reduce manual DBA effort.

Automation · Cluster Migration · Operations
0 likes · 13 min read
FunTester
Sep 18, 2024 · Operations

Overview and Practice of Chaos Engineering

Chaos Engineering introduces controlled failures to test system resilience, covering its history, practical benefits, experiment design, and a comparison of popular open‑source and commercial tools for improving reliability in distributed and cloud‑native environments.

Distributed Systems · Reliability
0 likes · 13 min read
Baidu Geek Talk
Sep 16, 2024 · Mobile Development

Design and Implementation of Baidu Android IM SDK and Public IM System

Baidu built a unified Android IM SDK and public instant‑messaging system that consolidates login, message routing, synchronization, notifications, and group chat into reusable client and server components, using a hybrid push‑pull model to deliver real‑time, secure communication while dramatically lowering development and maintenance costs across its product portfolio.

Android SDK · Instant Messaging · Mobile Development
0 likes · 22 min read
Java High-Performance Architecture
Sep 9, 2024 · Backend Development

Designing a Scalable WebSocket Messaging Service with Reliable RabbitMQ Integration

This article outlines a comprehensive backend design that abstracts WebSocket into a reusable communication service, details project structure, business processes, reliability mechanisms for RabbitMQ, message classification, API design, and a unified message format to enable plug‑and‑play real‑time messaging across various Java applications.

Backend Architecture · Java · Message Queue
0 likes · 9 min read
DevOps
Aug 20, 2024 · Operations

CI/CD Maturity Levels and DevOps Practices in Chinese Enterprises

This article analyzes the current state of DevOps adoption in China, presents detailed CI/CD capability levels with a maturity model table, and discusses future operational trends such as automation, AIOps, security integration, observability, and reliability engineering to guide enterprises toward more efficient software delivery.

Automation · DevOps · Observability
0 likes · 20 min read
Architect's Guide
Aug 14, 2024 · Backend Development

Key Considerations and Best Practices for Using Spring Event in Production

This article explains critical pitfalls, proper shutdown handling, event loss during startup, suitable business scenarios, reliability guarantees, and best‑practice patterns for employing Spring Event in high‑traffic backend systems, providing concrete code examples and operational recommendations.

Backend · Event · Java
0 likes · 11 min read
Top Architect
Aug 10, 2024 · Backend Development

Handling Interface-Level Failures: Degradation, Circuit Breaking, Rate Limiting, and Queuing

The article explains interface‑level failures in business systems and presents four mitigation strategies—degradation, circuit breaking, rate limiting, and queuing—detailing their principles, implementation methods, and algorithmic choices such as fixed and sliding windows, token bucket and leaky bucket.

Backend · Circuit Breaking · Reliability
0 likes · 18 min read
Bilibili Tech
Aug 9, 2024 · Operations

Design and Implementation of Bilibili's Change Control Platform

Bilibili’s Change Prevention Platform consolidates data from over 60 systems to proactively detect and block more than 100 risky changes daily, reducing change‑related incidents by applying a four‑pillar framework of technical support, landing, cross‑domain enablement, and cultural safeguards, while evolving toward AI‑driven, end‑to‑end change defense.

Bilibili · DevOps · Reliability
0 likes · 20 min read
dbaplus Community
Aug 6, 2024 · Operations

How to Slash MTTR: Proven Strategies for Faster Incident Recovery

This article explains what MTTR is, why it matters for system stability, and provides a step‑by‑step framework—including monitoring, alert tuning, rapid mitigation, clear role assignments, and post‑mortem practices—to dramatically shorten repair times and improve overall reliability.

Alerting · MTTR · Ops
0 likes · 24 min read
Architect
Aug 5, 2024 · Backend Development

How to Build a Scalable WebSocket Communication Service with Reliable Messaging

This article outlines the design of an internal WebSocket communication service that abstracts real‑time messaging, reduces code coupling, supports various business scenarios, ensures reliable delivery with RabbitMQ, defines unified APIs and message formats, and demonstrates a DDD‑based project structure for easy integration.

Backend Architecture · DDD · Message Queue
0 likes · 10 min read
Code Ape Tech Column
Jul 30, 2024 · Backend Development

Design and Implementation of a Unified WebSocket Communication Service for Backend Systems

This article describes the background, objectives, core design, reliability mechanisms, message classification, API design, and unified calling approach of a company‑wide WebSocket abstraction layer that replaces polling, supports asynchronous communication, and ensures reliable message delivery using RabbitMQ and confirm mechanisms.

Backend Architecture · Message Queue · RabbitMQ
0 likes · 10 min read
Software Development Quality
Jul 24, 2024 · Fundamentals

How to Measure Hardware Development Efficiency: 20 Key Performance Indicators

This guide outlines twenty essential hardware development, security, and reliability performance indicators—such as development cycle, defect density, security certification rate, MTBF, and supply‑chain safety—and provides practical measurement methods to help engineers quantify and improve product quality and safety.

Hardware · R&D · Reliability
0 likes · 21 min read
Tencent Cloud Developer
Jul 16, 2024 · Big Data

In‑Depth Exploration of Apache Kafka: Architecture, High Reliability, and High Performance

Apache Kafka achieves high‑throughput, fault‑tolerant messaging by combining a partitioned log architecture with leader‑follower replication, asynchronous producer pipelines, configurable acknowledgments, page‑cache‑based sequential writes, zero‑copy transfers, batching, compression, and a multi‑reactor network model that together ensure scalability, reliability, and performance.

Apache Kafka · Reliability · Streaming
0 likes · 30 min read
Alibaba Cloud Developer
Jul 2, 2024 · Backend Development

When Should You Send Messages in a Transaction? Real‑World Order Processing Insights

This article examines the trade‑offs of sending messages before, during, or after database persistence in order‑creation workflows, explores transaction‑message patterns, half‑message checks, and message‑table strategies, and offers practical guidance for building reliable backend messaging systems.

Backend · Message Queue · Messaging
0 likes · 10 min read
DevOps Coach
Jun 30, 2024 · Operations

Effective Incident Mitigation and Recovery: Practical SRE Strategies

The article outlines SRE‑based incident mitigation and recovery practices, covering urgent mitigations, impact reduction, key metrics such as TTD, TTR, TBF, and detailed strategies for shortening detection and repair times, preventing fatigue, improving observability, and designing resilient systems.

Mitigation · Operations · Reliability
0 likes · 23 min read
Selected Java Interview Questions
Jun 30, 2024 · Backend Development

Design and Implementation of a Unified WebSocket Communication Service for Backend Systems

This article presents a comprehensive design of a unified WebSocket communication service that abstracts messaging, improves reliability with RabbitMQ, replaces polling with push, and provides standardized APIs and message formats for backend developers to quickly integrate real‑time features.

Backend · RabbitMQ · Reliability
0 likes · 9 min read
Qunar Tech Salon
Jun 21, 2024 · Cloud Native

Redesigning Kubernetes DNS Architecture with q-dnsmasq for Improved Reliability and Performance

This article details the motivation, design, implementation, testing, and rollout of a refactored Kubernetes DNS solution that replaces the default kube-dns → CoreDNS chain with a node‑local q‑dnsmasq cache and parallel upstream queries to achieve higher availability, faster resolution, and better cache hit rates in large‑scale clusters.

CoreDNS · DNS · Kubernetes
0 likes · 18 min read
Ctrip Technology
May 23, 2024 · Backend Development

Evolution of Ctrip Account System: Domain‑Driven, Middle‑Platform, and Multi‑Region Architecture

This article details Ctrip's account system evolution, covering its transition from monolithic to domain‑driven microservices, middle‑platform consolidation, and multi‑region deployment, including design goals, read/write comparison processes, configuration‑driven capabilities, and routing strategies to improve scalability, reliability, and operational efficiency.

Domain-Driven Design · Reliability · account system
0 likes · 13 min read
Cognitive Technology Team
May 16, 2024 · Operations

Core Principles of High‑Availability Architecture Design

These core principles—minimal dependency, weak dependency, distribution, rate limiting, degradable design, balanced risk, fault prevention and isolation, no single point of failure, self‑protection, automatic failover, and retry/idempotency/compensation—guide the design of highly available systems by reducing risk, ensuring redundancy, and protecting services at all layers.

Operations · Reliability · System Design
0 likes · 3 min read
vivo Internet Technology
May 15, 2024 · Databases

Challenges and New Technology Exploration in Vivo Database Operations Platform

At the 2024 XCOPS Intelligent Operations Management Annual Meeting in Guangzhou, Vivo’s Deng Song will discuss building a robust database operations platform, addressing availability threats, efficiency levers, 0‑to‑1 development strategies, and considerations of reliability, cost, and data privacy amid emerging AI and large‑model technologies.

Reliability · Tech Talk · aiops
0 likes · 3 min read
Architect
Apr 11, 2024 · Backend Development

How WeChat Achieves Real‑Time, Lossless Messaging: Architecture Deep Dive

This article dissects WeChat's early message‑sending and receiving architecture, explaining how the system meets real‑time delivery and no‑loss guarantees through a multi‑stage server pipeline, push notifications, and a sequence‑based acknowledgment mechanism, illustrated with concrete flow diagrams and numeric examples.

Backend Architecture · Messaging · Real-time Delivery
0 likes · 11 min read
Bilibili Tech
Apr 9, 2024 · Operations

BCM – Building and Deploying Bilibili’s Chaos Engineering Platform

At the 2024 GOPS Global Operations Conference, Bilibili senior R&D engineer Gu Lintao will present BCM—Bilibili’s Chaos Engineering Platform—showcasing how its design and capabilities let developers, testers, and SREs safely inject faults, uncover hidden architectural risks, and improve service stability through real‑world drills and systematic reliability engineering.

Bilibili · DevOps · Reliability
0 likes · 3 min read
Deepin Linux
Apr 2, 2024 · Fundamentals

Understanding TCP: Fundamentals, Handshake, Data Transfer, and Optimization

This article provides a comprehensive overview of the Transmission Control Protocol (TCP), covering its connection‑oriented design, reliability mechanisms, three‑way handshake, four‑step termination, packet structure, flow and congestion control, and practical C++ socket examples for establishing, sending, receiving, and closing connections.

Networking · Reliability · TCP
0 likes · 35 min read
Efficient Ops
Mar 25, 2024 · Operations

Why SRE Exists and How It Solves Modern Reliability Challenges

This article explains why Site Reliability Engineering (SRE) emerged, outlines its core responsibilities, required skill set, and how SRE teams use SLOs, monitoring, and scenario drills to improve system reliability, performance, and observability in complex production environments.

DevOps · Operations · Reliability
0 likes · 12 min read
Selected Java Interview Questions
Mar 25, 2024 · Databases

Redis Best Practices: Memory Management, Performance Tuning, Reliability, Operations, and Security

This comprehensive guide outlines practical Redis best practices covering memory optimization, key design, data type selection, performance enhancements, high‑availability deployment, operational safeguards, security hardening, and monitoring to help engineers build stable, efficient caching solutions.

Reliability · Security · best practices
0 likes · 15 min read
High Availability Architecture
Mar 21, 2024 · Operations

Applying Chaos Engineering to WeChat Pay: Design, Implementation, and Outcomes

To improve the ultra‑high availability of WeChat Pay, the team introduced chaos engineering using multi‑partition isolation, controlled blast radius, automated fault injection, and systematic risk discovery, detailing the design, execution, automation, and results of this reliability‑focused initiative.

Fault Injection · Reliability · WeChat Pay
0 likes · 18 min read
Tencent Cloud Developer
Mar 19, 2024 · Operations

Chaos Engineering in WeChat Pay: Design, Implementation, and Results

WeChat Pay’s team adopted Netflix‑style chaos engineering, building an automated, YAML‑driven fault‑injection platform that isolates experiments in multi‑zone partitions, enabling over 500 safe experiments in 2021‑2022, uncovering critical bugs across core services while maintaining five‑nine availability and zero production incidents.

Automation · Fault Injection · Reliability
0 likes · 18 min read
JavaEdge
Mar 2, 2024 · Backend Development

How We Boosted Twitter’s Recommendation Engine Reliability from 2‑9 to 3‑9

This article details how a Twitter recommendation engine was refactored over three months to improve stability, introduce scalable tooling, redesign material storage and read‑status services, and ultimately raise availability from under 99% to over 99.9% while cutting latency and resource usage.

Reliability · Scalability · architecture
0 likes · 13 min read
Refining Core Development Skills
Feb 29, 2024 · Fundamentals

Understanding ECC Memory and Hamming Code Error‑Correction

This article explains why ECC memory modules use an extra chip, how bit‑flip errors occur in 64‑bit CPU‑memory transfers, and how simple parity and Hamming‑code algorithms detect and correct single‑bit errors while only detecting double‑bit errors, illustrating the principles with diagrams and examples.

ECC · Error Correction · Hamming Code
0 likes · 13 min read
Java Captain
Feb 16, 2024 · Fundamentals

Understanding TCP: Concepts, Operation, and Key Features

TCP (Transmission Control Protocol) is a connection-oriented, reliable, byte-stream transport layer protocol that ensures ordered, error-free data delivery through mechanisms such as three-way handshake, four-way termination, flow control, and congestion control, making it essential for web browsing, email, and online gaming.

Connection Management · Networking · Reliability
0 likes · 4 min read
Alibaba Cloud Infrastructure
Feb 5, 2024 · Cloud Computing

Alibaba Cloud Server R&D Papers Accepted at DesignCon 2024 and ECTC 2024: Immersion Cooling Impact on PCIe 5.0/6.0 and Long‑Term Reliability of Crystal Oscillators

Alibaba Cloud announced that two of its server‑R&D papers were selected for DesignCon and ECTC 2024, presenting measurement‑based studies on PCIe 5.0/6.0 link performance under air and immersion cooling and a long‑term reliability analysis of crystal oscillators in various immersion‑cooling fluids, insights that guide next‑generation server architecture and large‑scale liquid‑cool deployment.

High-speed interconnect · Immersion Cooling · PCIe
0 likes · 11 min read
Java Captain
Feb 1, 2024 · Fundamentals

Understanding TCP: Reliable Data Transmission in the Internet

TCP (Transmission Control Protocol) is a core Internet protocol that ensures reliable data transmission through mechanisms such as three‑way handshake, flow and congestion control, segmentation and reassembly, and error detection, while also facing challenges like latency, packet loss, and emerging real‑time application demands.

Internet Protocol · Networking · Reliability
0 likes · 4 min read
Mike Chen's Internet Architecture
Jan 27, 2024 · Backend Development

Understanding Message Queues: Concepts, Use Cases, Types, and Common Implementations

This article introduces message queues, explaining their role in high‑traffic systems, key characteristics such as asynchronous communication, decoupling, reliability and buffering, compares point‑to‑point and publish‑subscribe models, and reviews popular implementations like RabbitMQ, Kafka, ActiveMQ, RocketMQ and Pulsar.

Backend Development · Decoupling · Kafka
0 likes · 8 min read
Architecture Digest
Jan 15, 2024 · Databases

Understanding Redis Persistence: AOF vs RDB Mechanisms

This article explains Redis's two persistence mechanisms—Append Only File (AOF) and RDB snapshots—detailing their operation, advantages, risks, write‑back strategies, rewrite process, and how to choose the appropriate method for performance and reliability requirements.

AOF · Persistence · RDB
0 likes · 11 min read