Tagged articles
15 articles
Baidu Geek Talk
Apr 8, 2026 · Artificial Intelligence

How to Engineer Reliable Long‑Running AI Coding Tasks: Harnessing Agents for Scale

This article analyzes the challenges of using AI coding agents for large‑scale, long‑running tasks such as bulk file migration or code review, and presents a systematic engineering approach—including task decomposition, parallel execution, persistent progress files, resumable workflows, and multi‑level retry strategies—backed by concrete script examples and real‑world case studies.

AI agents · Meta Skill · Parallel Execution
0 likes · 31 min read
Ops Community
Sep 17, 2025 · Operations

Mastering System Fault Tolerance: From Theory to Production‑Ready High‑Availability

This comprehensive guide explores the philosophy, core patterns, and practical techniques for designing fault‑tolerant, highly available systems, covering circuit breakers, retries, rate limiting, monitoring, cloud‑native deployment, and real‑world case studies to help engineers build resilient production architectures.

Cloud Native · circuit breaker · fault tolerance
0 likes · 24 min read
Architect
Apr 25, 2024 · Backend Development

Design and Implementation of an Elasticsearch Data Synchronization Service (ECP)

This article describes the challenges of synchronizing billions of order records to Elasticsearch and presents the design, architecture, and key technical details of a generic data‑sync service (ECP) that supports multiple data sources, dynamic rate limiting, retry strategies, SPI‑based extensibility, environment isolation, health‑check, fault recovery, smooth migration, and elegant logging.

Backend Development · Dynamic Rate Limiting · Elasticsearch
0 likes · 22 min read
Zhuanzhuan Tech
Apr 3, 2024 · Backend Development

Design and Implementation of an Elasticsearch Data Synchronization Service (ECP) for Large‑Scale Order Data

This article describes the challenges and technical solutions for synchronizing billions of order records from a relational database to Elasticsearch, including multi‑source data reading, dynamic rate limiting, retry strategies, SPI‑based service integration, environment isolation, health‑checking, smooth migration, and structured logging, all implemented in a backend service called ECP.

Java · SPI · backend service
0 likes · 21 min read
Selected Java Interview Questions
Feb 2, 2024 · Backend Development

Comprehensive Guide to API Request Retry Mechanisms and Spring Boot Implementation

This article examines why API requests fail, explains the importance of retry mechanisms, compares linear, exponential and randomized back‑off strategies, discusses maximum attempt considerations and idempotency, and provides a detailed Spring Boot implementation using Spring Retry along with alternative approaches.

API Retry · Backend · Idempotency
0 likes · 21 min read
Sanyou's Java Diary
Dec 28, 2023 · Operations

Mastering High Availability: Traffic Governance, Circuit Breakers, Isolation, Retries, Timeouts and Rate Limiting

This article explains how to achieve the three‑high goals of high performance, high availability and easy scalability in microservice systems by using traffic governance techniques such as circuit breaking, various isolation strategies, retry mechanisms, timeout controls, degradation tactics and rate‑limiting, illustrated with practical examples and diagrams.

Microservices · Timeout · circuit breaker
0 likes · 32 min read
IT Architects Alliance
May 31, 2022 · Backend Development

Optimizing High‑Concurrency Services: Practical Strategies for QPS Over 200K

This article presents practical techniques for optimizing high‑concurrency online services—such as avoiding relational databases, employing multi‑level caching, leveraging multithreading, implementing circuit‑breaker patterns, reducing I/O, managing retries, handling edge cases, and logging efficiently—to maintain sub‑300 ms response times under massive load.

IO optimization · caching · circuit breaker
0 likes · 10 min read
Architecture Digest
May 17, 2022 · Backend Development

Optimizing High‑Concurrency Services: Practical Strategies for QPS Over 200k

This article outlines practical techniques for handling online services with QPS exceeding 200,000, including avoiding relational databases, employing multi‑level caching, leveraging multithreading, implementing degradation and circuit‑breaker patterns, optimizing I/O, using controlled retries, handling edge cases, and logging efficiently.

IO optimization · caching · circuit breaker
0 likes · 9 min read
NiuNiu MaTe
Mar 31, 2021 · Backend Development

How to Ensure Reliable Service‑to‑Service Messaging: 5 Proven Retry Strategies

This article explores why reliable inter‑service communication is essential in microservice architectures, illustrates common pitfalls with real‑world examples, and presents five practical retry and persistence solutions—including fast retry, in‑memory queues, persistent queues, retry services, and pre‑notification—to improve message delivery reliability.

Distributed Systems · Message Queue · Microservices
0 likes · 11 min read
dbaplus Community
Mar 25, 2021 · Operations

Mastering High‑Quality Service Architecture: Load Balancing, Rate Limiting, Retries & Timeouts

This article distills insights from Bilibili's technical director on building high‑quality service architectures, covering systematic load‑balancing strategies, sophisticated rate‑limiting mechanisms, robust retry policies, precise timeout controls, and comprehensive approaches to preventing cascading failures in large‑scale systems.

Backend Architecture · SRE · load balancing
0 likes · 14 min read
iQIYI Technical Product Team
Oct 30, 2020 · Mobile Development

Design and Optimization of iQIYI Mobile APM Network Monitoring System

The iQIYI mobile APM system provides real‑time, user‑level network monitoring with classified error detection, cloud‑controlled SDK sampling, second‑level backend storage, and web dashboards, while employing DNS three‑layer caching, weak‑network grading, gateway multiplexing, super‑pipeline proxies and layered retry strategies, reducing Android error rates from 5.3 % to 0.48 % and iOS from 4.63 % to 0.35 %.

APM · DNS · Mobile Development
0 likes · 11 min read
ITPUB
Jul 28, 2020 · Databases

Why MySQL Binlog Can Cause Order Fulfillment Delays and How to Fix It

This article explains MySQL Binlog’s role in event‑driven order processing, analyzes hidden pitfalls such as premature Binlog reads and two‑phase commit issues, and offers practical solutions like retry strategies and direct Binlog consumption to ensure data consistency.

Database Logs · Event-Driven Architecture · mysql
0 likes · 14 min read
Tencent Cloud Developer
Apr 22, 2020 · Cloud Native

Designing High‑Quality Service Architecture Under Traffic Peaks: Load Balancing, Rate Limiting, Retries, Timeouts, and Failure Mitigation

Drawing on Google SRE principles, Bilibili’s technical director outlines a systematic, cloud‑native framework for high‑quality service architecture during traffic peaks, covering frontend and internal load balancing, distributed rate limiting, controlled retries, fail‑fast timeouts, and comprehensive failure‑mitigation strategies.

SRE · cloud-native · load balancing
0 likes · 13 min read