Tagged articles
3281 articles
Efficient Ops
Jul 27, 2021 · Operations

What Does China’s 2021 DevOps Survey Reveal About Industry Trends?

On July 15, 2021, the China Academy of Information and Communications Technology unveiled the 2021 China DevOps Status Survey Report, detailing the nation’s digital transformation, the growing demand for rapid software delivery, the extensive multi‑company survey methodology, and key findings on DevOps adoption and future trends.

2021 · China · DevOps
0 likes · 5 min read
Java Architect Essentials
Jul 25, 2021 · Backend Development

How I Cut Full GC Frequency by 80%: A JVM Tuning Case Study

Over a month of systematic JVM tuning reduced Full GC from 40 times per day to once every ten days and halved Young GC duration by adjusting heap sizes, survivor ratios, and metaspace settings while investigating and fixing a memory leak caused by an anonymous inner class listener.

Backend · Garbage Collection · JVM
0 likes · 10 min read
Architects' Tech Alliance
Jul 24, 2021 · Backend Development

How to Build a Scalable Backend Stack for Startups: Languages, Components, and Best Practices

This guide outlines a comprehensive backend technology stack for startups, covering language choices, core components, development processes, infrastructure services, database options, monitoring, CI/CD, and operational best practices to help teams design, select, and implement a reliable server-side architecture.

Backend · Operations · Technology Stack
0 likes · 31 min read
Efficient Ops
Jul 20, 2021 · Databases

Master Redis: 13 Proven Practices to Boost Memory, Performance & Reliability

Discover a comprehensive Redis best‑practice guide covering memory optimization, performance tuning, high reliability, daily operations, resource planning, monitoring, and security, with actionable tips such as key length control, maxmemory settings, lazy‑free, connection pooling, replication strategies, and safe deployment practices.

Database Management · Operations · performance optimization
0 likes · 23 min read
Ops Development Stories
Jul 20, 2021 · Cloud Native

How to Build a Production‑Ready ELK Logging Stack on Kubernetes

This guide walks through the concepts of ELK, why log management is essential for Kubernetes, three collection strategies, required log fields, and step‑by‑step deployment of Elasticsearch, Kibana, Filebeat, and Logstash—including YAML manifests, configuration snippets, and Kibana UI setup—for a fully operational, cloud‑native logging solution.

Cloud Native · ELK · Filebeat
0 likes · 26 min read
Youzan Coder
Jul 19, 2021 · Operations

How We Built a Robust Search Middle Platform: From Pain Points to Full‑Scale Quality Assurance

This article examines the challenges faced by a search middle platform—such as inaccurate impact assessment, unstable underlying clusters, and missing process standards—and details a comprehensive quality‑assurance strategy that includes baseline test suites, stability practices, performance testing, emergency drills, and systematic monitoring to ensure reliable search services.

Backend · Operations · Performance Testing
0 likes · 13 min read
Efficient Ops
Jul 18, 2021 · Operations

Master Ansible in 16 Visual Steps

Ansible, a rapidly popular open‑source automation tool built on Python, simplifies batch system configuration, program deployment, and command execution with thousands of built‑in modules, offering a beginner‑friendly yet powerful solution for modern operations teams.

Ansible · Configuration Management · Operations
0 likes · 3 min read
macrozheng
Jul 18, 2021 · Operations

Why Did Bilibili Crash? A Developer’s Deep Dive into High‑Availability Failures

In this article, a programmer recounts the recent Bilibili outage, analyzes its timeline, proposes technical root‑cause hypotheses such as CDN failure and service‑chain avalanche, shares insights from the platform’s high‑availability architecture, and outlines preventive techniques for building more resilient backend systems.

Bilibili · CDN · Operations
0 likes · 10 min read
IT Architects Alliance
Jul 18, 2021 · Operations

How to Achieve Smooth Releases and AB Testing with Nginx: A Step‑by‑Step Guide

This article details a practical smooth‑release process for a cloud‑office system, explains how to use Nginx health‑check endpoints to take instances offline, and presents three AB‑testing routing strategies—IP‑based, cookie‑based, and AB‑cluster proxy—complete with configuration examples, pros and cons, and deployment steps.

AB testing · Blue‑Green deployment · Cloud Native
0 likes · 9 min read
21CTO
Jul 16, 2021 · Operations

What Bilibili’s Outage Teaches About Achieving True High Availability

The article analyzes Bilibili’s recent service outage, explains why high availability matters, introduces key metrics like MTBF and MTTR, and outlines practical strategies such as redundancy, rate limiting, isolation, failover, timeout control, circuit breaking, degradation, and multi‑region deployment to build resilient systems.

MTBF · MTTR · Operations
0 likes · 18 min read
High Availability Architecture
Jul 15, 2021 · Operations

Baidu Game Microservice Monitoring Practice and System Design

This article describes Baidu's comprehensive approach to monitoring game microservices, covering the background, initial monitoring tools, evolution of the monitoring system, systematic design for risk control, intelligent detection, alarm optimization, efficient fault localization, and future outlook for high‑availability architecture.

Baidu · Game Development · Microservices
0 likes · 13 min read
Code Ape Tech Column
Jul 15, 2021 · Operations

What Really Caused Bilibili’s Sudden Outage? A Deep Dive into the Technical Failure

The article analyzes Bilibili's recent half‑hour service disruption, explores technical rumors such as an etcd crash, examines Kubernetes‑based cloud‑native infrastructure, reviews similar historic outages, and offers expert recommendations for improving high‑availability and disaster‑recovery in large‑scale internet services.

Bilibili · Cloud Native · Kubernetes
0 likes · 8 min read
dbaplus Community
Jul 14, 2021 · Operations

How to Rapidly Diagnose and Resolve Common Online Service Failures

This guide walks through practical troubleshooting steps for typical production incidents—including disk exhaustion, high CPU, Java OOM, MySQL deadlocks and slow queries, Redis memory alerts, network TCP issues, and business‑log analysis—providing concrete commands, diagrams and mitigation strategies for each layer.

Operations · network
0 likes · 32 min read
Baidu Geek Talk
Jul 14, 2021 · Operations

How Baidu Built a Robust Microservice Monitoring System for Game Services

This article details Baidu's comprehensive microservice monitoring practice for its game platform, covering the initial fragmented setup, systematic redesign across risk control, intelligent monitoring, smart alerting, and rapid fault localization, and presents the resulting monitoring architecture, visualizations, and future improvement goals.

Alerting · Baidu · Microservices
0 likes · 14 min read
Open Source Linux
Jul 11, 2021 · Operations

Mastering Shell Script Best Practices for Reliable Automation

This article outlines practical shell‑script guidelines for automating system and application operations, covering script header conventions, formatting, error handling, safe use of commands, variable handling, file packaging, pipeline restrictions, concurrency locks, logging, and risk‑mitigation strategies to make automation both efficient and secure.

DevOps · Linux · Operations
0 likes · 10 min read
Selected Java Interview Questions
Jul 7, 2021 · Operations

Redis Monitoring Metrics and Commands Guide

This article provides a comprehensive overview of Redis monitoring metrics—including performance, memory, basic activity, persistence, and error indicators—along with recommended monitoring tools, configuration settings, and command-line examples for gathering and interpreting these metrics in production environments.

Metrics · Operations · database
0 likes · 7 min read
Alibaba Cloud Developer
Jul 6, 2021 · Operations

Mastering Release Strategies: Alibaba’s DevOps Playbook for Faster, Safer Deployments

This article surveys common software release strategies—stop‑the‑world, canary, gray/rolling, blue‑green, A/B testing, and traffic‑isolation—detailing their advantages, disadvantages, and ideal scenarios, and then presents Alibaba’s practical best‑practice guide for planning, monitoring, and continuously delivering high‑quality releases.

Blue‑Green deployment · Continuous Deployment · Operations
0 likes · 16 min read
Efficient Ops
Jul 5, 2021 · Operations

10 Essential Practices to Prevent DBA and Ops Disasters

Learn ten practical strategies—from safe change rollbacks and cautious destructive commands to robust backups, clear prompts, vigilant monitoring, and disciplined handovers—that help DBAs and operations engineers avoid costly system failures and maintain reliable production environments.

Backup · Operations · Oracle
0 likes · 6 min read
Top Architect
Jul 4, 2021 · Operations

Design and Implementation of a Simple Gray Release System

The article explains the concept of gray release, outlines a basic architecture with strategy configuration, execution, and service registry components, describes common traffic-splitting strategies, and details practical implementations using Nginx, gateway services, and complex scenarios involving data synchronization and message queues.

A/B testing · Backend · Deployment
0 likes · 7 min read
Alibaba Cloud Native
Jun 30, 2021 · Operations

How We Built a Dual‑Center, High‑Availability RocketMQ Platform

This article explains why RocketMQ was chosen, describes its large‑scale usage, details the design and implementation of a same‑city dual‑center architecture with near‑by production and consumption, outlines failover mechanisms, governance practices, lessons learned, and future plans for the messaging platform.

Dual Center · Message Queue · Operations
0 likes · 15 min read
Architects Research Society
Jun 29, 2021 · Operations

Understanding the Differences Between SCADA and DCS Systems

SCADA and DCS originated as separate control systems but have converged over time; SCADA focuses on distributed monitoring and data acquisition across wide geographic areas, while DCS emphasizes centralized control, and modern high‑bandwidth networks now allow them to operate together as a unified monitoring solution.

DCS · Operations · SCADA
0 likes · 6 min read
DevOps
Jun 29, 2021 · Operations

Why Traditional Enterprise IT Departments Are Marginalized and How Digital Transformation Can Create a New IT

The article analyzes the current marginalization of IT departments in traditional enterprises due to limited value, hierarchical organization, and misaligned assessment, and proposes that digital transformation—redefining IT roles, aligning technology with business goals, and building a digital foundation—can turn IT into a profit‑center and strategic enabler.

Digitalization · IT transformation · Operations
0 likes · 12 min read
Tencent Cloud Developer
Jun 28, 2021 · Cloud Native

Effective Service Governance for Serverless: Challenges and Solutions

Effective serverless governance requires comprehensive observability, traffic management, and service registration built on Kubernetes, using either a mesh sidecar with Istio or an embedded SDK, to simplify complex operational tasks such as discovery, fault tolerance, gray releases, and metric correlation for large‑scale function deployments.

Cloud Native · Operations · Serverless
0 likes · 17 min read
Programmer DD
Jun 27, 2021 · Operations

How ByteDance Powers Billions with Multi‑Terabit Data Center Bandwidth

The article examines how ByteDance, the company behind Douyin and TikTok, and other Chinese tech giants operate massive data centers with terabit‑level outbound bandwidth, millions of servers, and extensive CDN and load‑balancing architectures to support hundreds of millions of concurrent users.

ByteDance · CDN · Operations
0 likes · 9 min read
Java Architect Essentials
Jun 24, 2021 · Operations

Scaling WeChat Moments: Architecture, Capacity Planning, and Flexible Strategies for High Traffic

This article analyzes the large‑scale architecture of WeChat Moments, detailing image and video traffic characteristics, hardware and software safeguards, disaster‑recovery mechanisms, capacity assessment, and a series of flexible strategies such as compression format changes, bitrate reduction, buffer pools, and timeline throttling to handle holiday spikes.

Backend Architecture · Flexible Strategies · Moments
0 likes · 10 min read
Efficient Ops
Jun 23, 2021 · Operations

Agent vs Network Data: Choosing the Right Cloud Performance Monitoring Approach

This article compares agent‑based and network‑data approaches to cloud‑native application performance monitoring, discussing their architectures, advantages, challenges, and how combining white‑box and black‑box techniques can improve fault detection, scalability, and operational efficiency in complex cloud environments.

Agent · Operations · White-box
0 likes · 10 min read
DevOps
Jun 22, 2021 · Operations

Building Digital Champion Capabilities: Integrating Customer Solutions, Operations, Technology, and Talent Ecosystems

The article outlines how digital‑champion enterprises achieve superior performance by integrating four core ecosystems—customer solutions, operations, technology, and talent—through strategic planning, partnership, and advanced technologies such as AI, big data, and industrial IoT, while highlighting maturity stages and practical implementation steps.

Artificial Intelligence · Big Data · Digital Transformation
0 likes · 28 min read
dbaplus Community
Jun 17, 2021 · Cloud Native

How Dada Achieved Seamless Elastic Scaling for Massive Delivery Peaks

Facing surges during holidays and major shopping events, Dada’s DevOps team built a cloud‑native elastic scaling system that combines fine‑grained capacity management, multi‑cloud support, metric‑driven auto‑scaling, and extreme scale‑down strategies, keeping delivery performance stable while cutting costs.

Auto Scaling · Operations · capacity management
0 likes · 17 min read
HomeTech
Jun 16, 2021 · R&D Management

Technical Debt Governance in Autohome's Cloud Platform: Theory and Practice

This article presents Autohome's Cloud Platform (Home Cloud) technical debt governance framework, defining ideal technical states, outlining five systematic steps—from factor collection to project execution—and sharing practical outcomes that have enhanced the competitiveness of its applications and development teams.

Operations · R&D management · Technical Debt
0 likes · 7 min read
DevOps
Jun 16, 2021 · Operations

Understanding Digital Transformation: Definitions, Strategic Questions, Drivers, Frameworks, Roadmaps, Benefits and Pitfalls

The article provides a comprehensive overview of digital transformation, covering its definition, essential strategic questions, key drivers such as customer expectations, cloud and AI, priority areas in the value chain, practical frameworks, roadmap steps, expected benefits and common reasons for failure.

Artificial Intelligence · Big Data · Business strategy
0 likes · 20 min read
Efficient Ops
Jun 15, 2021 · Operations

Mastering IT Monitoring: Strategies, Challenges, and Best Practices

This article explores the fundamentals of IT monitoring, examines common challenges such as scalability, reliability, and alert fatigue, compares four implementation approaches—from open‑source to fully custom solutions—and presents practical techniques like alert convergence, suppression, and automation to build a robust, adaptable monitoring platform.

Alert Management · Operations · Scalability
0 likes · 19 min read
Efficient Ops
Jun 8, 2021 · Operations

How Red‑Blue Drills Boost Securities Ops: From Capacity Testing to Full‑Scale Automation

Lin Ying, a senior test manager at Guoxin Securities, shares insights from his GOPS 2021 talk on the securities industry's digital transformation, current IT challenges, and a comprehensive red‑blue exercise strategy that combines full‑link load testing, automated workflows, and proactive monitoring to ensure system stability during market peaks.

DevOps · Operations · capacity testing
0 likes · 13 min read
IT Architects Alliance
Jun 5, 2021 · Operations

Top 20 DevOps Interview Questions with Expert Answers

This article compiles the 20 most common DevOps interview questions, providing detailed explanations of concepts such as the DevOps‑Agile distinction, core benefits, key tools, anti‑patterns, KPI metrics, automation advantages, containers, microservice frameworks, version control practices, Git revert techniques, post‑mortem meetings, asset vs configuration management, continuous testing elements, and essential development and infrastructure operations.

DevOps · KPI · Operations
0 likes · 16 min read
Youzan Coder
Jun 4, 2021 · Operations

How to Tame Massive Product‑Sync Traffic in a Multi‑Store Chain System

This article analyzes the stability challenges of a multi‑store chain’s product‑copy mechanism, outlines design goals for isolation and scalability, and presents short‑ and long‑term monitoring, flow‑control, and emergency‑response strategies to ensure reliable large‑scale operations.

Flow Control · Operations · Scalability
0 likes · 12 min read
Ops Development Stories
Jun 4, 2021 · Operations

Step-by-Step Guide to Monitoring MongoDB with Zabbix Agent 2

This tutorial explains how to use Zabbix Agent 2 to monitor MongoDB databases and clusters, covering the required read‑only user setup, relevant Zabbix templates, key metrics such as jumbo chunks, connection pool stats, server status, collection and replSet information, and practical configuration examples.

Agent2 · MongoDB · Operations
0 likes · 6 min read
DevOps
Jun 2, 2021 · Operations

Disasterpiece Theater: Slack’s Process for Approachable Chaos Engineering

Slack’s resilience engineering team outlines a structured chaos‑engineering workflow—identifying potential failures, ensuring fault tolerance, and deliberately injecting faults in development and production—to safely test system robustness, validate hypotheses, and continuously improve reliability through regular disaster‑theater exercises.

Fault Injection · Operations · Reliability
0 likes · 11 min read
Efficient Ops
Jun 1, 2021 · Operations

Mastering System Stability: Building a Chaos‑Driven Platform for Financial Ops

This article details how a major securities firm analyzed business stability, built a comprehensive stability engineering platform using chaos engineering, practiced extensive fault‑injection drills, and outlines future directions such as random‑scenario exercises, red‑blue battles, and AI‑driven risk detection.

Operations · chaos engineering · financial systems
0 likes · 11 min read
Efficient Ops
Jun 1, 2021 · Artificial Intelligence

How Time‑Series Analysis Powers AIOps: Overcoming Real‑World Challenges

At the 16th GOPS Global Operations Conference, Shen Hui of DingMao Technology explained how time‑series data analysis underpins AIOps, outlining its four‑step workflow, key challenges, and the company’s three‑pipeline solution that enables trend forecasting, fault prediction, and a robust AI‑driven operational platform.

AI · Operations · Time Series Analysis
0 likes · 7 min read
Practical DevOps Architecture
May 31, 2021 · Cloud Native

Dockerfile Instruction Writing Recommendations

This article provides practical guidelines for writing Dockerfile instructions—including RUN, CMD, ENTRYPOINT, ADD, COPY, and WORKDIR—offering syntax recommendations, best‑practice examples, and advice on when to use each command to create efficient, maintainable container images.

Docker · Operations · cloud-native
0 likes · 6 min read
DataFunTalk
May 29, 2021 · Databases

Evaluation and Deployment of DorisDB for Analytical Workloads at 58 Group

This article details 58 Group's comprehensive evaluation of DorisDB, TiFlash, and ClickHouse for large‑scale analytical workloads, covering functional and performance benchmarks, real‑world use cases such as security analysis and DBA operations, data ingestion methods, cluster architecture, automation practices, and lessons learned.

Analytical Database · DorisDB · Operations
0 likes · 10 min read
DevOps Cloud Academy
May 28, 2021 · Operations

Common Mistakes in DevOps Implementation and How to Avoid Them

The article outlines ten frequent DevOps pitfalls—from out‑of‑order delivery and role misunderstandings to neglecting security and team fatigue—and provides practical guidance on planning, automation, quality, and cultural practices to achieve successful continuous delivery.

DevOps · Operations · ci/cd
0 likes · 11 min read
Amap Tech
May 28, 2021 · Cloud Native

Gaode's Serverless/FaaS Platform: Architecture, Implementation, and Business Impact

Gaode’s new serverless/FaaS platform, built on Alibaba Cloud Function Compute with custom C++, Go, and Node.js runtimes, now processes over 100,000 QPS, enabling a unified client‑cloud codebase, rapid feature iteration, automatic scaling and cost savings, while supporting extensive monitoring, Dapr integration, and future edge‑computing enhancements.

Backend · Cloud Native · FaaS
0 likes · 20 min read
TAL Education Technology
May 27, 2021 · Big Data

Big Data Monitoring System: Architecture, Basic and Advanced Monitoring, and Alert Convergence & Grading

This article outlines the challenges of operating petabyte‑scale big‑data clusters and presents a comprehensive monitoring framework—including basic and upgraded monitoring layers, metric collection, alerting pipelines, and strategies for alarm convergence and grading—to ensure reliable, proactive SRE operations.

Alerting · Grafana · Operations
0 likes · 12 min read
Liangxu Linux
May 27, 2021 · Operations

How I Built an Automated Redis Sentinel to Seamlessly Handle Failover

A sysadmin narrates how he monitors four Redis nodes, detects master failure with PING, promotes a slave using SLAVEOF, reconfigures the remaining replicas, and ultimately automates the entire process with a custom Sentinel program and a multi‑node Sentinel cluster for high availability.

Operations · automation · c++
0 likes · 11 min read
Python Crawling & Data Mining
May 22, 2021 · Fundamentals

Master Python Lists: From Basics to Advanced Operations

This tutorial walks you through Python list syntax, common operations such as adding, modifying, searching, deleting, sorting, and nesting, complete with code examples and output screenshots, helping beginners and intermediate programmers deepen their understanding of list handling.

Data Structures · List · Operations
0 likes · 8 min read
Architects' Tech Alliance
May 19, 2021 · Operations

Designing Microservices Architecture for Failure: Patterns and Practices

Microservice architectures must handle inevitable network, hardware, and application errors by employing fault‑tolerant patterns such as graceful degradation, change management, health checks, fail‑over caches, retry logic, rate limiting, circuit breakers, and testing strategies to maintain service reliability and user experience.

Microservices · Operations · Reliability
0 likes · 15 min read
IT Architects Alliance
May 19, 2021 · Backend Development

Backend Technology Stack Selection for Startup Companies

This article provides a comprehensive guide for startups on choosing and assembling a backend technology stack, covering language choices, core components such as project management, DNS, load balancing, CDN, RPC frameworks, service discovery, databases, NoSQL, messaging, logging, monitoring, configuration, deployment, and operational best‑practice recommendations.

Backend · Operations · Technology Stack
0 likes · 29 min read
Alibaba Cloud Developer
May 18, 2021 · Operations

Mastering Incident Response: Structured Problem Solving and Key Roles

This guide outlines a structured approach to incident response, detailing problem definition, temporary fixes, root‑cause analysis, solution design, implementation, and standardization, while highlighting four critical roles—commander, communicator, rapid‑recovery lead, and diagnosis lead—to ensure swift, coordinated recovery of production services.

Operations · SRE · Team Roles
0 likes · 10 min read
dbaplus Community
May 18, 2021 · Operations

Mastering End‑to‑End Monitoring: From Purpose to Prometheus Implementation

This guide explains why monitoring is essential throughout a product lifecycle, outlines monitoring modes and methods, compares health checks, logs, tracing and metric solutions, and provides a detailed Prometheus‑based monitoring architecture with concrete metric definitions, alerting rules, and incident‑response procedures.

Alerting · Metrics · Operations
0 likes · 25 min read
FunTester
May 18, 2021 · Operations

Performance Testing: Measuring QPS with JsonPath, Regex, and Exception Handling in Java

This article explores how to accurately measure QPS in Java performance‑testing scripts by using JsonPath and regular‑expression validation, analyzes error margins under different thread and iteration configurations, and demonstrates that exception handling has minimal impact on overall throughput.

JsonPath · Operations · Performance Testing
0 likes · 8 min read
ITPUB
May 17, 2021 · Operations

How DBSCAN Clustering and Bayesian Inference Boost Root‑Cause Detection in Securities Trading Systems

This article describes how a Chinese securities firm applied big‑data‑driven clustering and Bayesian methods to automate root‑cause analysis of trading‑system anomalies, detailing the challenges, algorithmic designs, practical implementations, and evaluation results that demonstrate significant reductions in false alarms and faster recovery.

Bayesian inference · Operations · Root Cause Analysis
0 likes · 17 min read
JD Cloud Developers
May 11, 2021 · Operations

How JD.com’s AIDCTwins Digital Twin Transforms Data Center Operations

JD.com’s AIDCTwins platform leverages low‑code modeling, IoT sensing and cross‑platform 3D visualization to create a digital twin of its massive data‑center infrastructure, dramatically cutting labor costs, enabling real‑time updates, and boosting intelligent, green operation across thousands of servers and racks.

Digital Twin · Infrastructure · IoT
0 likes · 6 min read
JD Cloud Developers
May 11, 2021 · Cloud Native

How JD Built a Cloud‑Native Monitoring & Logging System for Massive Scale

This article explains the fundamental differences between traditional and cloud‑native monitoring systems, outlines the challenges each faces, and details JD.com's evolution from physical servers to JDOS 2.0, describing its modular architecture, deployment model, and ongoing optimization efforts.

JD.com · Operations · architecture
0 likes · 10 min read
DevOps Cloud Academy
May 7, 2021 · Operations

Understanding DevOps as an Interface, Not a Job Role

The article explains that DevOps should be viewed as an interdisciplinary interface rather than a specific job title, contrasts it with traditional roles like software developer or system administrator, and illustrates the concept with a Java‑style code example, while also including a brief promotional note about a DevOps training course.

Code Example · DevOps · Engineering Roles
0 likes · 3 min read
Open Source Linux
May 6, 2021 · Operations

How to Build a Scalable Container Log Collection System with S6 and Filebeat

This article explains Docker and Kubernetes logging challenges, compares logging drivers, introduces S6‑based container logging, and presents a node‑level log‑agent architecture using Filebeat, Logrotate, Kafka, and Elasticsearch to achieve reliable, auto‑rotating log collection in production environments.

Docker · Filebeat · Kubernetes
0 likes · 9 min read
High Availability Architecture
May 3, 2021 · Operations

Meituan Elastic Scaling System: Evolution, Challenges, and Business Enablement

This article introduces Meituan's elastic scaling platform, detailing its evolution from version 1.0 to 2.0, the technical and operational challenges faced, the strategies adopted for promotion and resource management, and several real‑world business scenarios where elastic scaling reduces cost and improves reliability.

Meituan · Operations · Resource Management
0 likes · 24 min read
Java Interview Crash Guide
Apr 30, 2021 · Operations

How Do Large Internet Companies Achieve Cross‑Region Multi‑Active High Availability?

The article explains why large internet firms adopt cross‑region multi‑active architectures for high availability, compares cold backup, hot standby, same‑city active‑active, and cross‑region active‑active solutions, discusses their trade‑offs, and presents practical design patterns and questions for implementing such systems.

Distributed Systems · Operations · disaster recovery
0 likes · 15 min read
MaGe Linux Operations
Apr 29, 2021 · Operations

Essential Kubernetes Production Best Practices for Reliable Ops

This article outlines essential Kubernetes best‑practice guidelines for production environments, covering health probes, resource allocation, RBAC, cluster configuration, networking policies, monitoring, logging, stateless design, autoscaling, runtime security, and strategies for zero‑downtime and failure recovery.

Kubernetes · Operations · monitoring
0 likes · 12 min read
Top Architect
Apr 27, 2021 · Operations

Understanding Flame Graphs for Performance Analysis in Java Applications

This article explains the concept, features, and practical usage of flame graphs—including how to generate them from Java thread dumps with Perl scripts—to help developers visualize call‑stack frequencies and quickly identify performance bottlenecks in backend services.

Operations · flamegraph · jstack
0 likes · 11 min read
Architecture Digest
Apr 26, 2021 · Backend Development

How to Write Effective Error Logs for Better Debugging

This article explains why well‑structured error logs are essential for troubleshooting, analyzes common sources of errors, and provides concrete guidelines and code examples to make logs complete, specific, and actionable for developers and operations teams.

Error Logging · Operations · debugging
0 likes · 18 min read
Youzan Coder
Apr 23, 2021 · Operations

How to Build a Generic API Robustness Scanning System for Automated Test Case Generation

This article presents a comprehensive, automated solution for API robustness testing that extracts baseline cases, generates exhaustive parameter‑level test data, executes them at scale, and analyzes results to identify abnormal responses without manual effort, thereby improving testing efficiency and software quality.

API testing · Operations · Software Testing
0 likes · 13 min read
Baidu Intelligent Testing
Apr 21, 2021 · Operations

Intelligent Delivery System for Baidu's Large‑Scale Information Flow Recommendation: Practices and Solutions

This article presents Baidu's end‑to‑end intelligent delivery system for its massive information‑flow recommendation platform, detailing challenges in continuous integration, testing, deployment, and operations, and describing the architectural, algorithmic, and process innovations that enable high‑speed, low‑cost, and largely unmanned releases.

Intelligent Delivery · Operations · ci/cd
0 likes · 15 min read
Efficient Ops
Apr 20, 2021 · Operations

How Dada’s Intelligent Elastic Scaling Cuts Costs and Boosts Delivery Performance

This article details Dada Group’s implementation of an intelligent elastic scaling architecture that automatically adjusts capacity during peak promotions and low‑traffic periods, improving delivery reliability, reducing costs, and supporting multi‑cloud and multi‑runtime environments through sophisticated monitoring and auto‑scaler mechanisms.

Auto Scaling · Operations · capacity management
0 likes · 17 min read
IT Architects Alliance
Apr 15, 2021 · Operations

Design and Implementation of a Simple Gray Release System

This article explains the concept of gray (canary) release, outlines a basic architecture with its essential components, describes common gray release strategies such as header‑, cookie‑, and parameter‑based routing, and provides practical guidance for implementing gray releases with Nginx and gateway services and for handling complex scenarios like multi‑service releases and database migrations.

A/B testing · Microservices · Nginx
0 likes · 7 min read
High Availability Architecture
Apr 15, 2021 · Cloud Native

Meituan Elastic Scaling System: Architecture, Challenges, and Business Enablement

This article presents Meituan's elastic scaling platform, detailing its evolution from Hulk 1.0 to Hulk 2.0, the technical and operational challenges faced, the solutions implemented for resource management and multi‑tenant scaling, and real‑world business scenarios such as holiday, peak‑hour, and emergency capacity provisioning.

Meituan · Operations · Resource Management
0 likes · 22 min read
Efficient Ops
Apr 11, 2021 · Operations

Essential Safety Checklist for Dangerous Linux Commands

This guide outlines critical precautions and best‑practice tips for executing risky Linux commands—such as rm, chmod, cat, dd, tar, and MySQL—by verifying environments, backing up data, using safe aliases, and avoiding common pitfalls that can cause catastrophic data loss.

Backup · Linux · Operations
0 likes · 8 min read
ITPUB
Apr 7, 2021 · Operations

8 Real-World Production Failures and How to Diagnose Them Quickly

The article shares eight authentic production incident cases—from frequent JVM Full GC and memory leaks to cache avalanches, DNS hijacking, and database deadlocks—detailing their root causes, diagnostic steps, code snippets, and practical remediation strategies for engineers facing similar challenges.

Cache · JVM · Operations
0 likes · 17 min read
Programmer DD
Apr 2, 2021 · Operations

Why a Data Center Fire Can Sink Your Startup: Disaster Recovery Lessons

The article uses the OVH data‑center fire as a stark reminder that startups must design robust data disaster‑recovery strategies, explaining why backups, off‑site storage, and proper architectural planning are essential to prevent catastrophic data loss and potential business collapse.

Operations · System Architecture · data backup
0 likes · 8 min read
Sohu Tech Products
Mar 31, 2021 · Operations

Improving Cluster Stability: CI/CD, Monitoring, Logging, Documentation, and Traffic Management Solutions

The article analyzes the instability of a company's Kubernetes clusters, identifies root causes such as unstable release processes, lack of monitoring, logging, and documentation, and proposes comprehensive solutions including a Kubernetes‑centric CI/CD pipeline, federated Prometheus monitoring, Elasticsearch logging, centralized documentation, and integrated traffic management with Kong and Istio.

DevOps · Kubernetes · Operations
0 likes · 10 min read