Tagged articles

software reliability

37 articles · Page 1 of 1

Jun 29, 2026 · Industry Insights

The Nameless Engineers Behind Decades-Long, Error‑Free Code Powering Our Physical World

A handful of largely unknown engineers have written tiny, fast, deterministic and rigorously tested code—such as μC/OS, FreeRTOS, MQTT, SQLite and Linux kernel components—that has quietly powered everything from aircraft cockpits to household routers for decades without failure.

embedded systems · industry history · low-level programming

0 likes · 19 min read

Frontend AI Walk

Jun 21, 2026 · Artificial Intelligence

From Simple Prompts to Closed-Loop SOPs: Loop Engineering for Reliable AI Code

The article demonstrates how adding a structured Loop Engineering prompt—anchoring, execution, verification, correction, and exit—transforms ordinary AI code‑generation prompts into a closed‑loop SOP, reducing errors, enforcing self‑checks, and delivering more reliable, maintainable code for complex multi‑file projects.

AI prompting · Loop Engineering · Prompt engineering

0 likes · 13 min read

AI Engineering

Jun 7, 2026 · Artificial Intelligence

How a Four-Layer Configuration Stops Claude Code from Fabricating Answers

Claude Code often fabricates functions, imports, and test results, but by adding a four‑layer system—honesty rules in CLAUDE.md, a verification protocol, post‑write hooks, and a fact‑checking sub‑agent—developers can force the model to provide evidence, avoid false claims, and improve reliability in production.

Claude · Hooks · LLM

0 likes · 12 min read

DevOps Coach

Apr 20, 2026 · Industry Insights

Why Senior Developers Still Matter When AI Does the Coding

The article argues that despite junior developers completing tasks quickly with AI assistants, senior engineers add lasting value through rigorous testing, system reliability, deep architectural insight, and mentorship, illustrating the complementary roles of experience and generative AI in modern software teams.

AI coding tools · senior developers · software engineering

0 likes · 13 min read

FunTester

Apr 19, 2026 · Artificial Intelligence

How AI Can Reduce Deployment Failures by Up to 50% and Boost Team Efficiency

This article analyzes why software deployment failures pose systemic risks, enumerates the most common root causes, and explains how AI‑driven automation—covering intelligent version control, automatic rollback, test optimization, dependency management, database migration, observability, security checks, self‑documenting pipelines, backup verification, and predictive scaling—can transform DevOps from reactive firefighting to proactive, self‑healing delivery.

AI · Continuous Integration · Deployment Automation

0 likes · 15 min read

AgentGuide

Mar 24, 2026 · Artificial Intelligence

What I Learned Moving from Backend Engineering to AI Agent Development

The author, a former backend engineer turned AI Agent developer, explains how LLM uncertainty, context engineering, shifting code responsibilities, workflow standards, new failure modes, and the ReAct paradigm shape modern Agent development, and outlines tasks best suited—or unsuited—for LLMs.

AI Agent · LLM · Prompt engineering

0 likes · 6 min read

Java Web Project

Mar 10, 2026 · Industry Insights

Why AI‑Generated Code Still Needs a Post‑Processing Engineer

The article analyzes how large‑model code generators can quickly produce 80‑point prototypes but still require skilled engineers to fix missing logic, boundary cases, security flaws, and performance issues, turning shaky AI output into reliable, production‑ready software.

AI code generation · Autonomous Agents · Industry insight

0 likes · 9 min read

java1234

Feb 28, 2026 · Artificial Intelligence

The Ironic New Role in the Large‑Model Era: The “Large‑Model Post‑Processing Engineer”

In the age of large‑model AI, code can be generated up to an 80‑point prototype with a single prompt, but turning that prototype into a reliable, secure, high‑performance product still requires engineers to perform the painstaking 20‑point post‑processing work.

AI code generation · agent architecture · large language models

0 likes · 9 min read

FunTester

Oct 31, 2025 · Fundamentals

Master Defensive Programming: Turn Failures into Manageable Events

This article explains why defensive programming is essential, outlines its core principles, presents common failure scenarios and practical guidelines, and shows how testing and observability can turn inevitable errors into controlled, recoverable events that keep systems stable and maintainable.

Defensive Programming · Error handling · Observability

0 likes · 9 min read

FunTester

Sep 14, 2025 · Operations

Essential Fault Testing & Chaos Engineering Resources: Articles, Guides, and Byteman Tutorials

This curated collection presents dozens of Chinese articles and guides on fault testing, chaos engineering, and Byteman usage, covering topics such as SACK, delayed ACK, RTT, socket buffers, HTTP timeouts, and practical Byteman techniques, each with publication dates for quick reference.

Byteman · Resilience · chaos engineering

0 likes · 9 min read

AntTech

Jun 23, 2025 · Artificial Intelligence

Can AI Auditors Ensure Reliable Software? Highlights from EXPRESS 2025 at ISSTA

The EXPRESS 2025 workshop at ISSTA in Norway will showcase AI‑driven code auditing, present cutting‑edge research on trustworthy software systems, and invite researchers and practitioners to discuss transparency, reliability, and security challenges in modern software engineering.

AI auditing · ISSTA 2025 · LLM

0 likes · 5 min read

DeWu Technology

Mar 17, 2025 · Operations

Stability and Its Significance: Challenges and Practices for Building System Reliability

Building system stability requires quantifying risk through formulas, confronting challenges like low short‑term value and resource competition, and implementing a consensus‑driven framework that sets clear goals, cultivates awareness, enforces safety standards, ensures emergency response, conducts routine inspections, and applies sound architecture governance to continuously reduce inherent and change‑related risks.

Risk Management · process improvement · software reliability

0 likes · 25 min read

JD Cloud Developers

Dec 4, 2024 · Operations

Mastering Gray Releases: Safe Deployment, Validation, and Rollback Strategies

This guide explains how to design and execute gray releases with patience, detailed planning, monitoring, and effective rollback techniques to minimize risk and ensure system stability during high‑risk deployment phases.

Operations · Rollback · deployment

0 likes · 13 min read

JD Cloud Developers

Oct 21, 2024 · Operations

How Test Teams Can Build Observability Beyond Traditional Monitoring

This article examines how quality assurance engineers can adopt observability principles—distinct from conventional monitoring—to enhance system health detection, root‑cause analysis, and proactive risk mitigation across resources, services, business functions, data, and logs.

Monitoring · Observability · Operations

0 likes · 17 min read

FunTester

Sep 19, 2024 · Fundamentals

Software Antifragility: Rethinking Error Handling and Reliability

This paper introduces the concept of software antifragility, drawing on Taleb’s theory to argue that embracing errors through fault tolerance, automatic runtime repair, and fault injection can transform software systems into self‑improving, more robust entities, and discusses implications for development processes and product reliability.

Antifragility · chaos engineering · fault tolerance

0 likes · 13 min read

Software Development Quality

Aug 12, 2024 · Information Security

How to Detect and Prevent Financial Losses in Banking Systems

This guide explains what capital loss means, outlines common financial loss scenarios, details a comprehensive testing methodology, presents real-world banking and insurance loss cases, and offers practical prevention measures to safeguard financial operations.

Fraud Prevention · banking systems · financial loss

0 likes · 9 min read

Ele.me Technology

May 28, 2024 · Operations

Automated Mock for E2E Testing: Design and Implementation of Unmanned MOCK

Unmanned MOCK automatically generates context‑aware mock responses for downstream services in end‑to‑end tests by collecting sub‑call data, extracting knowledge from it, and applying dynamic rules. Failures in downstream systems are thereby isolated, raising test success rates to nearly 100% without manual mock configuration.

automated testing · e2e · service isolation

0 likes · 12 min read

Efficient Ops

Mar 25, 2024 · Operations

How CAICT’s SRE Standards Strengthen System Reliability and Continuity

This article outlines the rising frequency of system outages, explains the key characteristics and challenges of modern large‑scale distributed systems, introduces China’s CAICT SRE framework and its two‑part reliability model, showcases a successful SRE case, and announces the 2024 SRE maturity assessment program.

Digital Governance · SRE · software reliability

0 likes · 12 min read

Tencent Cloud Developer

Jan 10, 2024 · Operations

The Challenges of Building Continuously Available Systems: Entropy, Murphy's Law, and the 'Divine Doctor Paradox'

Building continuously available systems in 2023 is hampered by entropy‑driven technical debt and Murphy’s Law failures, and the “Divine Doctor Paradox” shows that successful availability work goes unnoticed while blame follows any outage, making cultural commitment—not just technology—the essential solution.

High Availability · Murphy's Law · SRE

0 likes · 14 min read

Bilibili Tech

Jan 5, 2024 · Cloud Native

ChangePilot: Bilibili’s Unified Change Management Platform and Practices

ChangePilot is Bilibili’s unified change‑management platform that standardizes change definition, lifecycle, and risk governance through a platform‑scenario model and five control levels (G0‑G4), offering built‑in checks, searchable records, subscription alerts, intelligent correlation, and emergency channels to boost production stability while maintaining operational efficiency.

Change Management · SRE · risk control

0 likes · 29 min read

Advanced AI Application Practice

Nov 28, 2023 · Operations

Is a Didi Outage a P0‑Level Incident? Understanding Severity Classifications

The article explains the common P0‑to‑PX incident severity hierarchy used in software development, detailing what constitutes a P0 crash versus lower‑level issues, notes that definitions can vary across organizations, and adds a personal perspective on Didi’s service reliability.

Didi · Incident Management · Operations

0 likes · 3 min read

FunTester

Oct 12, 2023 · Interview Experience

Master Performance Testing: Key Interview Questions & 12306 Crash Lessons

This article compiles essential performance testing interview questions, outlines a complete testing process with metrics and types, analyzes the 12306 ticketing system crash causes—including overload, bugs, security and network issues—and offers practical mitigation strategies for engineers.

12306 crash · Interview Questions · load testing

0 likes · 8 min read

DevOps Coach

Sep 21, 2023 · Operations

Why Observability Engineering Is Essential for Modern Software Systems

The article examines the concept of observability engineering, highlighting its importance for complex distributed systems, the cultural shift toward DevOps collaboration, key principles from the book “Observability Engineering,” and practical guidance for developers, SREs, managers, and executives to improve reliability, performance, and security.

distributed systems · software reliability

0 likes · 14 min read

FunTester

Aug 11, 2023 · Operations

Essential Performance Testing Best Practices Every Engineer Should Follow

Performance testing is crucial for ensuring software reliability, and this guide outlines essential best practices—including setting clear goals, selecting appropriate tools, crafting maintainable scripts, using realistic data, running long‑duration loads, and scheduling regular tests—to help engineers achieve stable, high‑performing applications.

Operations · best practices · load testing

0 likes · 8 min read

JD Tech

Jun 7, 2023 · Operations

Practical Guide to Achieving High Availability in Software Delivery

This article explains the concept of high availability, outlines the challenges of collaborative delivery, architectural design, coding practices, secure release, and deployment operations, and provides concrete steps, process standards, emergency plans, and self‑check tools to ensure reliable, fault‑tolerant software systems.

High Availability · Monitoring · architecture

0 likes · 13 min read

JD Retail Technology

Mar 16, 2023 · Operations

Ensuring High Availability in Software: Collaboration, Architecture, Implementation, and Operational Practices

This article explains the concept of high availability, outlines the challenges of achieving it in complex software delivery chains, and provides practical guidance on improving collaboration efficiency, establishing process standards, designing robust architecture, implementing disciplined coding, executing safe releases, and maintaining operational safeguards.

High Availability · architecture · collaboration

0 likes · 11 min read

NetEase Smart Enterprise Tech+

Mar 1, 2023 · Operations

Stability Quality Assurance: Definitions, Metrics, and Implementation Guide

This article explains the origins and meaning of software stability and stability testing, outlines key standards such as GB/T 16260 and industry definitions, and presents a comprehensive framework for stability quality assurance covering system elements, external disturbances, baseline setting, robust design, monitoring, and rapid incident response.

Operations · SRE · Stability

0 likes · 17 min read

dbaplus Community

May 11, 2022 · Backend Development

Mastering Failure‑Oriented Design: Mindset, Process, and Distributed Locks

This article explores the philosophy and practical techniques of failure‑oriented design, covering why anticipating failures is crucial for developers, the organizational and process changes needed, core design principles, and concrete implementations such as multi‑level Redis distributed locks with code examples.

Backend Engineering · Distributed Lock · Failure Design

0 likes · 23 min read

DeWu Technology

Feb 28, 2022 · Operations

DeWu Tech Salon – Quality Assurance Sessions Summary

The DeWu Tech Salon, co‑hosted by DeWu App Quality Platform and TesterHome, brought senior engineers from Alibaba Cloud, ByteDance, Lagou and DeWu together to share practical QA insights on end‑side monitoring, traffic replay, full‑link stress testing, and industry‑scale chaos engineering, while announcing a PPT collection, a testing‑expert recruitment drive, and a preview of the next wireless‑technology salon.

chaos engineering · performance monitoring · software reliability

0 likes · 6 min read

DevOps

May 10, 2021 · Backend Development

Automated Unit Test Generation for Exception Recall in C/C++ Services

This article presents a white‑box, unit‑test‑driven approach for automatically generating C/C++ test cases that detect and recall runtime stability issues, detailing problem analysis, solution design, code‑analysis, test‑data generation, code generation, failure analysis, and deployment results across large‑scale backend modules.

C# · Fuzzing · Test Generation

0 likes · 19 min read

Alibaba Cloud Developer

Apr 1, 2021 · Cloud Native

From Google to Ant: How He Zhengyu Built Ant’s Trusted Native Cloud Platform

This interview chronicles He Zhengyu’s journey from a prodigious student to a Google engineer and Ant Group leader, highlighting his role in shaping the Trusted Native initiative that combines cloud‑native, secure containers, confidential computing, and open‑source contributions to boost reliability and security for large‑scale financial services.

Career Advice · open source · software reliability

0 likes · 15 min read

Baidu Intelligent Testing

Jan 6, 2021 · Backend Development

Automated Unit Test Generation for Exception Recall in C/C++ Backend Systems

This article presents a comprehensive approach to automatically generate unit tests for C/C++ backend services, leveraging static code analysis, white‑box techniques, and fuzzing to create high‑coverage test cases that proactively detect stability issues without manual effort.

Fuzzing · Test Generation · Unit Testing

0 likes · 20 min read

DevOps Cloud Academy

Aug 27, 2020 · Cloud Native

Step-by-Step Guide to Building More Reliable Software with Kubernetes and DevOps

This article presents a practical, multi‑stage approach for improving software reliability in Kubernetes‑based microservice environments, covering static analysis, testing pyramids, CI/CD observability, performance testing, deployment strategies, and feedback loops to help engineering teams deliver faster, higher‑quality releases.

CI/CD · Cloud Native · devops

0 likes · 11 min read

21CTO

Jun 18, 2019 · Operations

Why Embracing Failure Accelerates Growth: Lessons from Intuit and PayPal

The article explains how organizations can achieve rapid growth by openly acknowledging failures, creating lightweight post‑mortem processes, and continuously learning from mistakes, illustrated through Intuit’s SaaS transition, PayPal’s rollback challenges, and practical rules for QA and architecture.

QA · Rollback · SaaS

0 likes · 31 min read

360 Tech Engineering

Jul 11, 2018 · Fundamentals

Static Program Analysis, Gödel’s Incompleteness, and the Halting Problem: Foundations of Software Reliability

This article explains how redundancy and voting schemes improve system reliability, introduces Gödel’s incompleteness and consistency concepts, describes the undecidable halting problem, and outlines static program analysis techniques—including data‑flow, inter‑procedural, pointer analysis, and constraint solving—while discussing practical heuristic rules and tools.

Gödel · decision problems · halting problem

0 likes · 8 min read

UCloud Tech

Mar 23, 2018 · Operations

How UCloud’s Application Hot‑Patch Framework Enables Zero‑Downtime Fixes

This article explains the design, components, and implementation of UCloud's application hot‑patch framework, covering its motivation, safety checks, multi‑thread support, and how the Creator, Loader, and Core Runtime work together to apply, manage, and roll back patches without restarting services.

ELF · Linux · UCloud

0 likes · 13 min read

Art of Distributed System Architecture Design

May 22, 2015 · Industry Insights

How Facebook Cuts Power Use with Cold Storage: Inside Their Low‑Energy Data Center Design

This article examines Facebook's cold storage system, detailing how the company redesigned hardware and software to slash power consumption, improve reliability with Reed‑Solomon coding, mitigate bit‑rot, and balance loads while supporting massive photo archives in energy‑constrained data centers.

Data Center · Facebook · Hardware Design

0 likes · 8 min read