Tagged articles

Operations

3329 articles · Page 13 of 34

Mar 15, 2023 · Industry Insights

How Baidu Feed Scales Millions of Users with Serverless: A Multi‑Dimensional Elasticity Blueprint

This article details Baidu Feed's serverless transformation, describing how multi‑dimensional service profiling (elasticity, traffic, capacity) and three elastic strategies—predictive, load‑feedback, and timed—enable automatic scaling that reduces resource waste while maintaining 24/7 stability for billions of users.

Baidu Feed · Elastic Scaling · Operations

0 likes · 19 min read

DeWu Technology

Mar 15, 2023 · Operations

Blue-Green Deployment: Process, Traffic Scheduling, and Component Support

The article explains blue‑green deployment as a release strategy that improves large‑scale microservice rollouts by extracting traffic from a blue cluster, incrementally shifting it to a green environment, using global and local traffic scheduling, central metadata, compatible components, and careful considerations such as idempotent consumption and version compatibility.

Blue-Green Deployment · Operations · Traffic Scheduling

0 likes · 12 min read

JD Cloud Developers

Mar 15, 2023 · Operations

Designing Seamless Offline Delivery for Private Cloud Environments

This article outlines a general, process‑focused approach to offline delivery in private or dedicated cloud environments, covering the need for internal mirrors, plug‑in architecture, dependency awareness, full automation, and best‑practice process design to reduce SRE effort and ensure consistent production.

Automation · Kubernetes · Operations

0 likes · 5 min read

NetEase Smart Enterprise Tech+

Mar 15, 2023 · Operations

How Yidun Automates Performance Testing to Overcome Real‑World Pain Points

This article explains performance testing fundamentals, why it matters, the specific challenges Yidun faced such as complex execution, human‑dependent monitoring, data isolation, and cost loss, and describes their automated, gradient‑based testing platform with quantified monitoring and future visualisation plans.

Automation · Data Isolation · Operations

0 likes · 8 min read

IT Architects Alliance

Mar 14, 2023 · Operations

Key Practices for Achieving High Availability in Internet Services

The article outlines essential high‑availability techniques for internet‑scale systems, covering availability metrics, microservice modularization, database redundancy, load balancing, rate limiting, circuit breaking, isolation, retry strategies, rollback plans, stress testing, monitoring, and on‑call procedures.

High Availability · Operations · System Design

0 likes · 10 min read

dbaplus Community

Mar 13, 2023 · Cloud Native

From Bare Metal to Cloud‑Native: How Zhuanzhuan Reinvented Log Collection

This article traces Zhuanzhuan's evolution of log collection—from a bare‑metal scribe + flume pipeline, through a container‑aware log‑pilot solution, to a cloud‑native filebeat and fb‑advisor architecture—detailing the motivations, technical designs, performance gains, and trade‑offs of each stage.

Operations · container · filebeat

0 likes · 12 min read

FunTester

Mar 13, 2023 · Operations

How Chaos Engineering Can Strengthen System Reliability: A Practical Guide

This article explains the origins and principles of chaos engineering, illustrates how fault‑injection scenarios expose system weaknesses, outlines step‑by‑step implementation—from tool selection and metric definition to execution and post‑mortem—and highlights its role in achieving high‑availability service level agreements.

Fault Injection · Operations · Reliability

0 likes · 10 min read

MaGe Linux Operations

Mar 10, 2023 · Operations

249 Ready-to-Use Shell Scripts to Boost Your Linux Ops Skills

Discover a curated collection of 249 practical shell script examples, complete with clear documentation and usage guidelines, designed to help Linux operations engineers improve efficiency, master scripting conventions, and quickly solve common admin tasks, all available for free download via the provided QR code.

Automation · Operations · Shell

0 likes · 7 min read

Alimama Tech

Mar 8, 2023 · Industry Insights

How Alibaba’s Dynamic Compute Transforms Ad Engine Efficiency

This article details Alibaba Mama’s dynamic compute system—its architecture, offline and online tidal‑compute mechanisms, city‑level mutual backup, RT control, large‑scale promotion handling, metric integration, and recent infrastructure upgrades—showcasing concrete performance gains and future challenges in green, intelligent ad‑engine resource management.

Alibaba · Operations · Performance Optimization

0 likes · 16 min read

Python Programming Learning Circle

Mar 6, 2023 · Operations

Intelligent Operations: AI‑Driven Anomaly Detection, Alarm Compression, and Log Analysis Techniques

This article presents an AI‑enhanced operations framework that combines metric anomaly detection, alarm compression, log anomaly detection, and intelligent analysis using machine learning methods such as DBSCAN clustering, SARIMAX modeling, Apriori association rules, and LSTM‑based log parsing to improve fault detection and reduce operational costs.

AIOps · Anomaly Detection · Machine Learning

0 likes · 15 min read

Efficient Ops

Mar 1, 2023 · Operations

How China Galaxy Securities Achieved Leading‑Edge DevOps Maturity with CMDB Platform

China Galaxy Securities’ CMDB platform recently earned an excellent rating in the China Academy of Information and Communications Technology’s DevOps system and tool standards, showcasing how standardized, tool‑enabled DevOps practices can boost efficiency, safety, and digital transformation for large financial enterprises.

CMDB · Operations · devops

0 likes · 11 min read

NetEase Smart Enterprise Tech+

Mar 1, 2023 · Operations

Stability Quality Assurance: Definitions, Metrics, and Implementation Guide

This article explains the origins and meaning of software stability and stability testing, outlines key standards such as GB/T 16260 and industry definitions, and presents a comprehensive framework for stability quality assurance covering system elements, external disturbances, baseline setting, robust design, monitoring, and rapid incident response.

Operations · SRE · Stability

0 likes · 17 min read

Efficient Ops

Feb 28, 2023 · Operations

How a Chinese Bank’s Wealth Management System Mastered DevOps Level‑3 Continuous Delivery

The Agricultural Bank of China's Wealth Management Customer Share Management System passed the CAICT DevOps Level‑3 Continuous Delivery assessment, showcasing a comprehensive DevOps transformation that improved code quality, automated testing, and deployment efficiency while delivering measurable performance gains across the organization.

Automation · Operations · banking

0 likes · 10 min read

DeWu Technology

Feb 27, 2023 · Operations

Message Push Monitoring and SLA Practices

The team implemented SLA‑based, node‑level monitoring for mobile push messages—splitting the workflow, measuring latency, blocking volume, and success rates, isolating metrics with Spring AOP, and tracking third‑party vendors—resulting in clear latency standards, doubled peak throughput, faster issue resolution, and improved overall reliability.

Message Push · Operations · SLA

0 likes · 11 min read

Continuous Delivery 2.0

Feb 27, 2023 · Operations

What Is Infrastructure as Code (IaC) and Its Benefits and Drawbacks

Infrastructure as Code (IaC) is a DevOps practice that defines, creates, and manages infrastructure through machine‑readable code, offering reproducibility, efficiency, collaboration, cost savings, and flexibility, while also presenting challenges such as a steep learning curve, dependency management, potential code errors, drift, and initial costs.

Automation · Cloud · IaC

0 likes · 5 min read

Architects Research Society

Feb 25, 2023 · Fundamentals

Common Business Capabilities: A Guide to Enterprise Capability Modeling

This article explains how a customizable generic list of business capabilities can serve as a starting point for enterprise capability modeling, accelerating value delivery while outlining the pros and cons of using pre‑built capability models across multiple levels of detail.

Operations · business capability · capability modeling

0 likes · 8 min read

Architecture Digest

Feb 24, 2023 · Operations

Understanding Prometheus Alerting: When Alerts Fire and Why They May Not

This article explains the principles behind Prometheus alerts, when they trigger, why they sometimes stay silent, and how Alertmanager’s routing tree and notification pipeline work together to manage alert noise, grouping, silencing, and deduplication.

Alerting · Alertmanager · Golang

0 likes · 18 min read

Su San Talks Tech

Feb 24, 2023 · Backend Development

Why We’re Dropping RabbitMQ for Kafka: A Complete Migration Blueprint

Facing chaotic usage, maintenance challenges, partition tolerance issues, and performance bottlenecks with RabbitMQ, our middleware team decided to fully migrate to Kafka, outlining reasons, comparative models, migration strategies, and verification steps to ensure a smooth, high‑availability, high‑performance transition.

Kafka · Message Queue · Operations

0 likes · 13 min read

ITPUB

Feb 23, 2023 · Operations

Why Did Microservices Drop After Zookeeper Restart? Session Mechanics & Fixes

A mistaken Zookeeper restart caused a 30‑minute outage of all microservices; this article analyzes the ZK session mechanism, explains why the ephemeral (temporary) nodes were not recreated, and presents two concrete solutions and best‑practice recommendations to prevent similar failures.

Operations · RPC · Zookeeper

0 likes · 11 min read

dbaplus Community

Feb 21, 2023 · Operations

How Standardized Application Monitoring Boosts Operational Efficiency

This article reviews G Bank's multi‑year journey to standardize application monitoring, detailing the methodology, models, metrics, automation mechanisms, and quantitative evaluation that together improve visibility, early fault detection, and overall operations management for both traditional and distributed systems.

AIOps · Operations · Standardization

0 likes · 18 min read

Zhuanzhuan Tech

Feb 21, 2023 · Databases

Fast and Stable MySQL Data Center Migration: Choosing and Implementing the Optimal Strategy

This article details the background, migration plan selection, and step‑by‑step procedures—including pre‑building cascades, service pause, automated batch operations, cluster tiering, pre‑ and post‑checks, and gray‑scale validation—to achieve a fast, stable MySQL data‑center migration for a large‑scale production environment.

Automation · Cloud · MySQL

0 likes · 11 min read

Efficient Ops

Feb 20, 2023 · Operations

How a Major Bank Accelerated Digital Transformation with a Unified DevOps Platform

In 2022, China’s leading telecom standards bodies recognized Bank of Communications for achieving advanced DevOps maturity, highlighting its unified engineering platform that streamlines end‑to‑end continuous delivery across over 200 core systems and boosts development‑operations efficiency.

Operations · Platform · banking

0 likes · 5 min read

21CTO

Feb 16, 2023 · Operations

Which Log Management Tool Is Right for You? A Comprehensive Comparison of 9 Solutions

This article provides a detailed comparison of nine popular log management tools—including Filebeat, Graylog, LogDNA, ELK, Grafana Loki, Datadog, Logstash, Fluentd, and Splunk—covering their main features, pricing, advantages, and disadvantages to guide readers in selecting the most suitable solution for their needs.

ELK · Operations · log management

0 likes · 16 min read

Code Ape Tech Column

Feb 16, 2023 · Databases

Understanding and Solving BigKey and HotKey Issues in Redis Clusters

BigKey and HotKey are common Redis cluster problems that can degrade performance, cause timeouts, network congestion, and even system-wide failures; this article explains their definitions, impacts, detection methods, and practical mitigation strategies—including key splitting, local caching, and migration optimizations—based on real-world production cases.

HotKey · Operations · Performance

0 likes · 22 min read

Efficient Ops

Feb 15, 2023 · Operations

How China Agricultural Bank’s ARROW Platform Mastered DevOps Continuous Delivery

The article details China Agricultural Bank’s ARROW platform achieving third‑level DevOps continuous delivery certification, outlining its end‑to‑end pipeline, quality gates, metric‑driven improvements, and how these practices boost code quality, delivery speed, and support the bank’s digital transformation.

Arrow · Operations · continuous delivery

0 likes · 8 min read

Cloud Native Technology Community

Feb 15, 2023 · Industry Insights

DevOps vs FinOps: Key Differences and How They Combine for Cost‑Effective Delivery

This article compares DevOps and FinOps, outlines their ten major differences, and explains how integrating the two practices can create a more efficient and financially optimized software development lifecycle.

Cost Optimization · FinOps · Operations

0 likes · 6 min read

Zhuanzhuan Tech

Feb 15, 2023 · Operations

Automating TiDB Operations at Zhuanzhuan: From Manual Management to Platform‑Based Automation

This article details Zhuanzhuan's journey of automating TiDB operations, covering the initial operational pain points, the implementation of metadata and resource management, comprehensive upgrades, alarm redesign, and the development of a work‑order‑driven platform that streamlines node, scaling, decommission, and monitoring tasks while significantly reducing manual effort and costs.

Automation · Database Management · Operations

0 likes · 18 min read

Hulu Beijing

Feb 14, 2023 · Operations

How Hulu Scaled Its Live Streaming for the Super Bowl: Inside the War Room

This article details how Hulu's Beijing engineering teams prepared, scaled, and operated the live streaming infrastructure for the 2023 Super Bowl, handling a 20% traffic surge with advanced load‑testing, auto‑scaling, and coordinated on‑call support to ensure a flawless broadcast.

Cloud · Hulu · Operations

0 likes · 3 min read

Code Ape Tech Column

Feb 14, 2023 · Backend Development

High‑Availability Architecture for a Billion‑Scale Membership System: Elasticsearch Dual‑Center Cluster, Redis Caching, and MySQL Migration

This article describes how a membership platform serving over one billion users achieves high performance and fault tolerance through a dual‑center Elasticsearch cluster, traffic‑isolated three‑cluster ES design, Redis multi‑center caching, and a seamless migration from SQL Server to a partitioned MySQL architecture, while detailing operational safeguards and fine‑grained flow‑control strategies.

Elasticsearch · MySQL · Operations

0 likes · 23 min read

DataFunSummit

Feb 8, 2023 · Product Management

Content‑Driven Data Product Management: Challenges, Governance Frameworks, and Implementation Strategies

This article shares practical insights from a data product expert on the problems faced by content‑oriented data products, outlines a comprehensive governance methodology—including DAMA, Huawei, and Alibaba frameworks—and demonstrates how to operationalize these ideas through concrete examples such as event‑tracking and metric governance.

Big Data · Data Governance · Data Product Management

0 likes · 16 min read

JD Cloud Developers

Feb 8, 2023 · Operations

Boosting Log Anomaly Detection with NLP and Deep Learning

This article presents a log anomaly detection approach that leverages NLP techniques such as Part‑of‑Speech tagging and Named Entity Recognition combined with deep neural networks, detailing a six‑step model, experimental validation on three datasets, and superior performance compared with existing DeepLog and LogClass methods.

DNN · NER · NLP

0 likes · 13 min read

Efficient Ops

Feb 7, 2023 · Operations

Why SRE Is Essential for Reliable Internet Services – Chinese Experts Share Insights

Site Reliability Engineering (SRE), introduced by Google in 2003, has become a cornerstone for ensuring the reliability and stability of large‑scale internet platforms, and Chinese experts now share home‑grown practices and a new book that distills two decades of SRE experience for building high‑availability applications.

Book · Operations · Reliability

0 likes · 3 min read

DataFunSummit

Feb 7, 2023 · Operations

Understanding RPA: Concepts, Core Modules, Element Analyzer, and Development Stages

This article provides a comprehensive overview of Robotic Process Automation (RPA), covering its definition, integration with AI (IPA), common AI techniques, value propositions, evolution from RPA 1.0 to 4.0, core platform and control‑center modules, element analyzer fundamentals, automation technology classifications, and a brief Q&A session.

AI · Automation · Operations

0 likes · 16 min read

dbaplus Community

Feb 6, 2023 · Operations

How Vivo Built a Scalable, Cloud‑Native Monitoring Platform for Millions of Services

This article outlines Vivo's multi‑year journey of designing, evolving, and operating a cloud‑native, AIOps‑enabled monitoring platform that supports tens of thousands of hosts, databases, containers, and services, detailing its architecture, challenges, and future directions for observability and reliability.

AIOps · Observability · Operations

0 likes · 18 min read

Efficient Ops

Feb 6, 2023 · Operations

Agricultural Bank of China's DevOps Journey: Building an Integrated Development System

Facing rapid digital transformation demands, Agricultural Bank of China launched a comprehensive DevOps initiative in 2019, establishing an integrated development lifecycle that combines CMMI, TMMi, ITIL, and automated pipelines across five key streams—process, tools, data, standards, and culture—to boost delivery speed, quality, and operational efficiency.

Banking Technology · Integrated Development · Operations

0 likes · 14 min read

Efficient Ops

Feb 5, 2023 · Operations

How China’s Telecom Giants Accelerate IT Efficiency with DevOps Maturity Models

Amid digital transformation, six leading Chinese telecom operators adopted the CAICT‑led DevOps Capability Maturity Model, completing 31 assessments that showcase improved IT efficiency, integrated team resources, and accelerated business support across continuous delivery, technical operation, security, and system tooling.

Capability Maturity Model · IT efficiency · Operations

0 likes · 14 min read

Efficient Ops

Feb 2, 2023 · R&D Management

How China’s Leading Banks Boost IT Efficiency with DevOps Maturity Models

This article reviews how six major state‑owned Chinese banks and their subsidiaries applied the China Academy of Information and Communications Technology's (CAICT) DevOps Capability Maturity Model, detailing assessment numbers, project case studies, implementation challenges, and measurable improvements in continuous delivery, cloud architecture, security, and overall IT performance.

BankingIT · CloudComputing · ContinuousDelivery

0 likes · 20 min read

ITPUB

Feb 2, 2023 · Operations

Why 80% of Digital Transformations Fail and How to Ensure Success

This article explains why digital transformation is now a must for enterprises, outlines its core purpose of boosting efficiency and revenue, describes the three progressive stages—digitization, data-driven, and intelligent automation—and highlights the strategic, organizational, cultural, and technological factors that determine success.

AI · Data-Driven · Operations

0 likes · 11 min read

Laravel Tech Community

Feb 1, 2023 · Operations

RabbitMQ 3.11.8 Release Highlights: Core Server Enhancements, CLI Updates, Plugin Fixes, and Dependency Upgrades

RabbitMQ 3.11.8, a maintenance release of the Erlang‑based AMQP broker, introduces streaming throughput improvements for tiny messages, new CLI commands, several plugin bug fixes, and upgrades to internal dependencies, providing enhanced performance and stability for messaging workloads.

CLI · Operations · Plugins

0 likes · 3 min read

Architecture Breakthrough

Jan 30, 2023 · Operations

How to Turn Invisible Dev Work into Self‑Service with the RQN Error‑Message Pattern

The article examines why developers spend excessive time on low‑value, invisible tasks such as answering integration tests and production issues, and proposes the RQN error‑message format plus supporting query interfaces to automate responses, reduce manual effort, and improve operational efficiency.

API design · Error handling · Operations

0 likes · 3 min read

Efficient Ops

Jan 29, 2023 · Operations

How Linux Kernel Handles TCP Connections: Deep Dive into sock_common and Lookup

This article explores Linux kernel TCP connection handling by examining socket data structures, port range and file descriptor tuning, core functions like tcp_v4_rcv, and lookup mechanisms, while offering practical tips to boost client-side concurrent connections beyond traditional limits.

Linux kernel · Operations · Performance Tuning

0 likes · 9 min read

Practical DevOps Architecture

Jan 29, 2023 · Cloud Native

Docker, Jenkins, and Kubernetes: From Basics to Practice – Course Outline and Materials

This article presents a comprehensive course syllabus covering Docker, Jenkins, and Kubernetes fundamentals, advanced operations topics, IT career development, and provides downloadable teaching materials such as Dockerfiles, Helm charts, and deployment scripts for hands‑on practice.

Docker · Kubernetes · Operations

0 likes · 6 min read

Alibaba Cloud Native

Jan 19, 2023 · Cloud Native

How Java Evolved for Cloud‑Native Operations: Key Features from JDK 9‑19

Since JDK 9, Java has accelerated its release cadence and added a suite of cloud‑native capabilities—such as container‑aware metrics, single‑file execution, refined JVM options, fast‑fail memory controls, class‑data sharing, compact strings, active‑processor detection, and Unix‑domain sockets—to better serve modern containerized workloads.

JDK · Java · Operations

0 likes · 17 min read

Efficient Ops

Jan 18, 2023 · Operations

How Zhongyuan Bank Accelerated Digital Transformation with DevOps: A Case Study

This article details Zhongyuan Bank's award-winning DevOps implementation and digital transformation journey, highlighting its rapid delivery improvements, security enhancements, pandemic response initiatives, and numerous industry recognitions that showcase the bank's operational excellence.

Case Study · Operations · banking

0 likes · 8 min read

MaGe Linux Operations

Jan 18, 2023 · Operations

How Many Files and TCP Connections Can a Linux Server Actually Handle?

This article explains the Linux kernel parameters that limit open files and TCP connections, shows how to increase those limits with sysctl and limits.conf, and estimates the maximum number of concurrent connections a server or client can support based on memory and port constraints.

Linux · Operations · TCP connections

0 likes · 14 min read

DevOps

Jan 18, 2023 · Operations

Qualitative Analysis as a Metric for Software Quality Measurement

The article explains how qualitative analysis serves as a measurable metric throughout the software lifecycle, outlines five key qualitative methods—interviews, root‑cause analysis, maturity assessment, reviews, and post‑mortems—and demonstrates their practical application for continuous quality improvement.

Maturity Assessment · Operations · Root Cause Analysis

0 likes · 8 min read

MaGe Linux Operations

Jan 15, 2023 · Operations

How to Slim Down Your Application Logs by Up to 80%

This article explains why oversized logs hurt system performance, then presents a step‑by‑step methodology—including printing only necessary logs, merging duplicate entries, and simplifying payloads—illustrated with real Java code and a concrete case study that reduces daily log volume from 5 GB to under 1 GB.

Java · Logging · Operations

0 likes · 8 min read

ITPUB

Jan 12, 2023 · Operations

How to Build a Truly High‑Availability System: 6 Essential Design Layers

This article breaks down the essential design and operational considerations for achieving high availability across six layers—development standards, application services, storage, product strategy, operations deployment, and incident response—providing concrete practices, metrics, and safeguards to reach four‑nine (99.99%) uptime.

Disaster Recovery · Operations · System Design

0 likes · 25 min read

Efficient Ops

Jan 11, 2023 · Operations

How Guangdong Mobile’s CRM Achieved Leading DevOps Operational Maturity

Guangdong Mobile’s CRM system, supporting over 130 million users, passed the China Academy of Information and Communications Technology’s (CAICT) DevOps technical‑operation 2+ level assessment, showcasing a landmark achievement in standardized, tool‑enabled DevOps practices that boost quality, safety, and market competitiveness.

CRM · Case Study · Operations

0 likes · 11 min read

Efficient Ops

Jan 11, 2023 · Operations

How a Securities Firm Achieved DevSecOps Maturity to Boost Transformation

The article details how China’s CITIC Securities leveraged the national DevOps and DevSecOps maturity models, passed Level 2 security assessments, and integrated cultural, procedural, and technical practices to enhance its institutional business service platform, improve security, and accelerate its digital transformation.

Case Study · DevSecOps · Operations

0 likes · 11 min read

Efficient Ops

Jan 11, 2023 · Operations

How to Effectively Monitor and Operate a DevOps System: From Metrics to NOC/MSP

This article explains how to maintain a DevOps environment by implementing comprehensive monitoring, handling fault detection and performance metrics, automating alerts in a continuously changing cloud landscape, and integrating NOC and MSP practices for 24/7 reliability and efficient incident response.

Automation · Cloud · MSP

0 likes · 17 min read

Efficient Ops

Jan 10, 2023 · Operations

How a New Distributed Core Trading System Earned Top DevOps Ratings at China Merchants Securities

In a recent interview, the head of the System Operations Department at China Merchants Securities explains how their next‑generation core trading system, built on a distributed micro‑service architecture with open‑source components and cloud‑native tools, achieved Level 2 technical‑operation DevOps certification, detailing the challenges, improvements, and future plans for digital transformation.

Case Study · Operations · cloud-native

0 likes · 15 min read

Alibaba Cloud Developer

Jan 10, 2023 · Operations

How to Master System Stability: A Step‑by‑Step Guide for Reliable Operations

This article explains what stability assurance is, outlines a systematic workflow—including anomaly identification, monitoring configuration, impact assessment, and solution planning—and provides practical methods such as capacity estimation, traffic limiting, load testing, scaling, and pre‑heating to ensure services remain stable during both daily operations and high‑traffic events.

Operations · Stability · capacity planning

0 likes · 25 min read

NetEase Yanxuan Technology Product Team

Jan 9, 2023 · Operations

Loggie: A High-Performance Log Collection Agent System Design and Implementation

Loggie is a cloud-native, Go-based log-collection agent that replaces Filebeat and Flume by using a micro-kernel producer-consumer architecture with hot-swappable pipelines, achieving 2 GB/s read speeds, 1.6‑2.6× higher throughput while using only a quarter of the CPU, and providing built-in observability, reliability, and latency monitoring for large-scale enterprise deployments.

Operations · go · log agent

0 likes · 16 min read

Ctrip Technology

Jan 6, 2023 · Operations

iDesk Service Platform: Architecture, Development Stages, Core Features, and Operational Insights

The iDesk service platform is a comprehensive internal tool that evolved through three development phases, adopts a BS+Service architecture with modular local services, offers extensive software management and self‑service utilities, integrates tightly with TripPal and service accounts, and implements robust operational monitoring to achieve high availability and user satisfaction.

Operations · Software Management · backend-architecture

0 likes · 15 min read

Alibaba Cloud Native

Jan 5, 2023 · Operations

Build Real‑Time MySQL Monitoring & Alerting with Prometheus on Alibaba Cloud

This guide explains why MySQL monitoring is critical, defines five key metric dimensions, shows how to collect them with Prometheus and the MySQL Exporter, provides ready‑to‑use alert rules, and walks through the full setup and dashboard creation on Alibaba Cloud.

Alerting · Alibaba Cloud · MySQL

0 likes · 7 min read

Zhuanzhuan Tech

Jan 4, 2023 · Operations

Evolution of Zhuanzhuan Test Environment Governance: From Physical Isolation to Tag‑Based Traffic Routing

This article describes how Zhuanzhuan's testing environment evolved through three versions (physical isolation, automatic IP‑tag routing, and manual tag routing), detailing the architectural background, implementation principles, advantages, drawbacks, and supporting tools that dramatically reduced deployment time and resource consumption while introducing new operational challenges.

Operations · Service Governance · cloud-native

0 likes · 23 min read

MaGe Linux Operations

Jan 3, 2023 · Operations

Quickly Package Go Binaries into RPMs with go-bin-rpm and Makefile

This guide walks you through installing the required build tools, configuring go-bin-rpm with a concise rpm.json file, generating RPM packages, and automating the process with a Makefile for seamless Go binary deployment on RPM‑based systems.

Operations · go-bin-rpm · makefile

0 likes · 5 min read

Efficient Ops

Jan 2, 2023 · Operations

How China’s Bank of Communications Achieved Leading DevOps Maturity

In this interview, Liu Lei, General Manager of the Bank of Communications Software Development Center, explains how three flagship projects passed the DevOps Continuous Delivery Level‑3 assessment, detailing the standards, metrics, tooling improvements and the broader impact on the bank’s digital transformation.

Bank of Communications · Maturity Assessment · Operations

0 likes · 14 min read

Java High-Performance Architecture

Jan 2, 2023 · Backend Development

How to Build a High‑Availability Payment System with Smart Routing

This article explains how a fintech payment platform achieves high availability and optimal channel selection by using decision‑tree routing, sliding‑window negative‑feedback, pressure‑detection services, and component fallback strategies such as RabbitMQ with Redis, supporting millions of daily transactions.

High Availability · Operations · Routing Algorithm

0 likes · 13 min read

Architecture and Beyond

Jan 1, 2023 · Operations

How to Build Reliable Enterprise SaaS: Stability, Risk Modeling, and Governance

This article defines enterprise‑grade SaaS, contrasts it with consumer products, and presents a comprehensive framework for product, data, and system stability—including isolation requirements, SLA metrics, risk modeling, mitigation plans, and continuous review—to help SaaS teams deliver dependable services.

Operations · Product Management · Reliability

0 likes · 23 min read

Top Architect

Dec 31, 2022 · Operations

Optimizing System Performance and Workflow: From Technical Metrics to DevOps Process Improvement

The article illustrates how to improve the efficiency of an image‑recognition service by measuring performance, redesigning architecture with parallel processing and message queues, and then extends the analogy to enterprise workflow optimization, emphasizing the need to quantify, visualize, and continuously refine DevOps processes.

Operations · devops · system architecture

0 likes · 11 min read

Efficient Ops

Dec 31, 2022 · Operations

How Nanjing Bank Achieved Leading DevOps Maturity with Its Digital Credit Card

This article details Nanjing Bank's successful passage of the CAICT DevOps Continuous Delivery Level 3 assessment for its digital credit card project, highlighting the bank's DevOps practices, challenges, benefits, and future plans within a broader digital transformation context.

Case Study · Operations · banking

0 likes · 10 min read

Architecture Digest

Dec 31, 2022 · Operations

Log Size Reduction Techniques: Methodology and Case Study

This article explains why excessive INFO‑level logs can cause performance problems, presents three practical strategies—printing only necessary logs, merging log entries, and simplifying log content with code examples—and demonstrates their impact through a real‑world Java bean pipeline case that cuts daily log volume from about 5 GB to under 1 GB.

Java · Operations · log optimization

0 likes · 7 min read

Open Source Linux

Dec 30, 2022 · Operations

Top 7 Kubernetes Management Tools to Simplify Cluster Operations

This article introduces seven popular Kubernetes management solutions—including K9s, Rancher, the native Dashboard with Kubectl and Kubeadm, Helm, KubeSpray, Kontena Lens, and WKSctl—detailing their key features, usage scenarios, and how they help streamline cluster monitoring, deployment, scaling, and security across cloud‑native environments.

Kubernetes · Operations · Tools

0 likes · 9 min read

Efficient Ops

Dec 29, 2022 · Operations

How China Agricultural Bank Earned Level‑3 DevOps Application Design Certification

China Agricultural Bank’s distributed core customer information project passed the Level‑3 DevOps Application Design assessment, showcasing a cloud‑native micro‑service architecture, comprehensive DevOps practices, and measurable improvements in scalability, observability, and security that set a new industry benchmark.

Case Study · Operations · application design

0 likes · 13 min read

MaGe Linux Operations

Dec 28, 2022 · Cloud Native

Master Essential kubectl Commands: A Practical Guide for Kubernetes Ops

This comprehensive guide covers kubectl autocomplete, context configuration, object creation, resource viewing, updating, patching, editing, scaling, deletion, pod and node interaction, as well as the versatile kubectl set commands, formatted output options, and visual references for effective Kubernetes cluster management.

Kubernetes · Operations · cloud-native

0 likes · 15 min read

Tencent Cloud Developer

Dec 28, 2022 · Operations

Technical Architecture, Observability, and Operational Practices of Tencent Health Code System

The article details how Tencent’s health‑code platform leveraged a cloud‑native, serverless architecture, extensive observability (Prometheus, Grafana, RUM), rigorous capacity testing, chaos engineering, and ITIL‑based change management to sustain billions of page views, support massive concurrency, and ensure reliable, scalable epidemic‑control services.

Health Code · Observability · Operations

0 likes · 16 min read

Efficient Ops

Dec 28, 2022 · Operations

How Guojin Securities Reached Level‑3 DevOps Continuous Delivery – A Success Story

Guojin Securities' Commission Treasure App achieved Level‑3 continuous delivery in the CAICT DevOps assessment, showcasing how standardized DevOps practices, tool integration, and a unified platform boosted development efficiency, security, and digital transformation across the financial services sector.

Case Study · Operations · continuous delivery

0 likes · 15 min read

Efficient Ops

Dec 28, 2022 · Operations

Mastering Ansible: 16 Visual Guides to Automate Your Operations

Ansible, a widely adopted open‑source automation tool built on Python, enables batch system configuration, program deployment, and command execution through thousands of built‑in modules, offering a simple yet powerful solution for operations engineers. The article illustrates it with 16 comprehensive images.

Ansible · Automation · Operations

0 likes · 3 min read

MaGe Linux Operations

Dec 27, 2022 · Operations

Master Essential Linux Commands for Efficient System Operations

This article shares practical Linux command techniques—including xargs, background execution, process monitoring, multitail, continuous ping logging, TCP state inspection, and SSH port forwarding—to help system administrators streamline tasks, improve script efficiency, and troubleshoot performance issues.

Linux · Operations · Shell Scripting

0 likes · 10 min read

Efficient Ops

Dec 27, 2022 · Operations

How China’s Bank Achieved Industry‑Leading DevOps Maturity: A Deep Dive

An in‑depth interview with Liu Lei, General Manager of Bank of Communications' Software Development Center, reveals how three flagship projects passed the Level‑3 Continuous Delivery assessment, illustrating the bank's DevOps transformation, metric improvements, and future roadmap within China's digital banking landscape.

Operations · banking · continuous delivery

0 likes · 17 min read

Efficient Ops

Dec 26, 2022 · Operations

How China’s Bank of Communications Achieved Industry‑Leading DevOps Maturity

An in‑depth interview with Liu Lei, GM of Bank of Communications' Software Development Center, reveals how the bank’s three flagship projects passed the DevOps Continuous Delivery Level‑3 assessment, boosting automation, efficiency, and digital transformation across its financial services.

Maturity Model · Operations · banking

0 likes · 15 min read

Efficient Ops

Dec 26, 2022 · Operations

What Do China’s Latest DevOps Maturity Assessments Reveal About Enterprise Success?

The China Academy of Information and Communications Technology released the latest results of its DevOps Capability Maturity Model assessments, showing how standardization, tool empowerment and continuous delivery pipelines boost quality, efficiency, security and competitiveness across banks, telecom, finance and internet enterprises.

CAICT · Enterprise Standards · Maturity Model

0 likes · 6 min read

Efficient Ops

Dec 26, 2022 · Operations

What Is AIOps? Exploring China’s New AI‑Driven Operations Maturity Model

The article introduces the AIOps (Artificial Intelligence for IT Operations) capability maturity model developed by the China Academy of Information and Communications Technology (CAICT), explains its two parts (general capabilities, and technical requirements for systems and tools), lists the evaluated modules, and announces the upcoming certification ceremony and contact details for participation.

AIOps · Artificial Intelligence · IT Operations

0 likes · 5 min read

Efficient Ops

Dec 26, 2022 · Operations

China Agricultural Bank’s DevOps & AIOps Success: Key Lessons for Enterprises

China Agricultural Bank’s recent DevOps and AIOps assessments, covering 17 projects across continuous delivery, security, application design, and intelligent operations, showcase how standardized processes, tool empowerment, and rigorous evaluation boosted efficiency, safety, and digital transformation, offering actionable insights for large enterprises seeking similar maturity.

AIOps · Enterprise Standards · Operations

0 likes · 16 min read

Programmer DD

Dec 26, 2022 · Operations

Inside Alibaba Cloud Hong Kong Region C Outage: Timeline, Impact, and Lessons Learned

On December 18, 2022, Alibaba Cloud's Hong Kong Region Zone C suffered a massive service interruption—the longest in its operational history—prompting a detailed incident response, extensive service impact across compute, storage, and networking, and a thorough analysis that led to concrete infrastructure and communication improvements.

Alibaba Cloud · Incident Report · Operations

0 likes · 13 min read

Architecture Digest

Dec 23, 2022 · Backend Development

Case Study: Microservice Migration Challenges and Lessons Learned

This case study examines a data‑service company's transition to a microservice architecture, detailing the initial benefits such as improved visibility and reduced deployment cost, the subsequent explosion of complexity, head‑of‑line blocking in queues, shared‑library versioning issues, and the trade‑offs that led the team to partially revert to a monolithic design.

Deployment · Operations · architecture

0 likes · 11 min read

Architecture Digest

Dec 21, 2022 · Operations

Designing High‑Availability Systems: Principles and Practices Across Six Layers

This article systematically explores high‑availability system design from development standards, capacity planning, application services, storage, product strategies, operations deployment, to incident response, presenting key concepts, architectural patterns, and practical guidelines for building resilient services.

Deployment · High Availability · Operations

0 likes · 27 min read

Baidu Geek Talk

Dec 20, 2022 · Industry Insights

How AI‑Powered Fault Localization Transforms Automated Testing at Scale

This article explores Baidu's intelligent testing practices, covering spectrum‑based root‑cause localization, error‑code driven build‑system diagnostics, revenue‑change stop‑loss decision workflows, and search UI case‑level tracing, illustrating how data, algorithms, and engineering combine to reduce manual effort and accelerate issue resolution.

Fault Localization · Operations · automated testing

0 likes · 10 min read

Zhuanzhuan Tech

Dec 20, 2022 · Operations

Alertmanager Alert System Refactoring: Issues, Solutions, and Implementation Details

This article analyzes common problems in a Prometheus‑Alertmanager monitoring setup—such as alert noise, lack of escalation, suppression and silence management—and presents a comprehensive refactor that introduces per‑cluster Alertmanager instances, custom escalation logic, suppression tables, and Python scripts to handle alert routing, silencing, and recovery.

Alert Suppression · Alertmanager · Operations

0 likes · 18 min read

Cloud Native Technology Community

Dec 20, 2022 · Operations

Platform Engineering: The Evolution from DevOps to Internal Developer Platforms

The article explains how platform engineering, emerging from DevOps fatigue, unifies development and operations by providing internal developer platforms that reduce cognitive load, improve self‑service, and enable teams to focus on core product work, especially as organizations grow beyond twenty developers.

Internal Developer Platform · Operations · devops

0 likes · 11 min read

Efficient Ops

Dec 19, 2022 · Operations

How Tencent CDN Achieves Seamless Business Continuity with AI‑Powered SRE

This article details Tencent CDN's challenges and solutions for business continuity, covering bandwidth and device resource constraints, massive request handling, fault‑management lifecycle, automation bottlenecks, and the implementation of AIOps, intelligent alerts, capacity planning, and root‑cause analysis to ensure reliable service.

AIOps · Automation · CDN

0 likes · 21 min read

Ops Development Stories

Dec 19, 2022 · Operations

Master Shell Scripting: Practical Handbook and 100 Ready-to-Use Scripts

This article introduces the evolution of operations automation, highlights the power of concise shell scripts, and presents a 70‑page handbook covering fundamentals to advanced topics together with 100 ready‑to‑run script examples for Linux system administration and DevOps tasks.

Automation · Linux · Operations

0 likes · 5 min read

Efficient Ops

Dec 18, 2022 · Operations

Mastering Application Monitoring with Prometheus: Practical Tips and Best Practices

This article explains how to design effective Prometheus metrics, choose appropriate vectors, labels, buckets, and naming conventions, and offers Grafana usage tricks to help engineers monitor online services, batch jobs, and offline processing systems with clear, actionable insights.

Grafana · Observability · Operations

0 likes · 9 min read

Alibaba Cloud Native

Dec 15, 2022 · Operations

How to Prevent ZooKeeper Disk Exhaustion: Snapshots, Logs, and Tuning Tips

This article explains why ZooKeeper can run out of disk space due to excessive snapshots and transaction logs, describes the underlying file‑generation mechanism, and provides concrete configuration parameters and best‑practice recommendations to control file growth and keep the cluster stable.

Operations · Snapshots · Transaction Log

0 likes · 9 min read

Efficient Ops

Dec 12, 2022 · Operations

How Bilibili Built a 5‑Year SRE Journey: High‑Availability, Multi‑Active, and Capacity Management

This article chronicles Bilibili's five‑year evolution of Site Reliability Engineering, detailing the introduction of SRE culture, the construction of high‑availability and multi‑active architectures, capacity management with Kubernetes, VPA/HPA, incident case studies, and the ongoing transformation of SRE practices across the organization.

High Availability · Kubernetes · Operations

0 likes · 24 min read

Alibaba Cloud Big Data AI Platform

Dec 9, 2022 · Operations

How Alibaba’s Flink Cluster Inspector Eliminates Hotspot Machines in Real‑Time Streaming

This article details Alibaba Cloud's Flink Cluster Inspector, explaining the business challenges of hotspot machines, the analysis of resource over‑use, and the four‑stage solution—pre‑profiling, in‑process self‑healing, post‑recovery, and observability—that reduces latency, cuts costs, and improves operational efficiency.

Flink · HotSpot · Operations

0 likes · 19 min read

37 Interactive Technology Team

Dec 8, 2022 · Operations

Log Alarm Optimization and Grafana Chart Integration Guide

This guide details how to configure Alibaba Cloud Log Service alarms—setting one‑day tokens, handling 1024‑byte truncation, removing record limits with analysis statements, adding a 10‑second query offset for timeliness—and shows how to visualize the data in Grafana using SQL queries for multi‑line and pie charts with timestamp conversion and time‑series filling.

Cloud Logging · Grafana · Log Monitoring

0 likes · 6 min read

vivo Internet Technology

Dec 7, 2022 · Databases

vivo's Database Operations Platform: Challenges and Solutions in the Cloud-Native Era

Vivo’s Database‑as‑a‑Service platform tackles cloud‑native challenges by automating massive instance management with self‑service work orders and self‑healing, enabling elastic scaling through mixed‑deployment and multi‑threaded Redis tools, optimizing costs via automatic package shrinkage, and safeguarding personal data with full‑chain encryption, while outlining a roadmap toward AI‑driven fault handling, container‑based resources, and advanced privacy governance.

DaaS · Encryption · MySQL

0 likes · 14 min read

Full-Stack DevOps & Kubernetes

Dec 7, 2022 · Cloud Native

How to Scale Kubernetes to 5,000 Nodes: Master, API Server, and Component Tuning

This guide explains how to push a Kubernetes cluster toward its theoretical limit of 5,000 nodes by detailing official limits, master node sizing for GCE and AWS, kube‑apiserver high‑availability and connection‑count tuning, scheduler and controller‑manager leader election settings, kubelet optimizations, and DNS anti‑affinity configuration.

Kubernetes · Operations · Performance Tuning

0 likes · 6 min read

Alibaba Cloud Native

Dec 6, 2022 · Operations

How to Monitor Windows Servers with Prometheus: Metrics, Dashboards, and Alerts

This guide explains how to collect essential Windows metrics with Prometheus, set up Grafana dashboards for CPU, memory, disk, network, and process monitoring, and configure alert rules, while also comparing self‑hosted and Alibaba Cloud Prometheus solutions for seamless Windows observability.

Alerting · Grafana · Operations

0 likes · 12 min read

Java High-Performance Architecture

Dec 6, 2022 · Cloud Native

How to Build Resilient Microservices: Patterns for Fault Tolerance and High Availability

Learn essential techniques for designing fault‑tolerant microservices, including graceful degradation, change management, health checks, self‑healing, failover caching, retry strategies, rate limiting, circuit breakers, and testing failures, to ensure high availability and reliability in distributed cloud‑native systems.

Operations · Reliability · cloud-native

0 likes · 15 min read

Bilibili Tech

Dec 2, 2022 · Big Data

Data Quality Management: Expectations, Measurement, Assurance, and Operation

The article outlines a complete data‑quality‑management framework that first captures business expectations, then translates them into basic and personalized measurement rules, defines four assurance approaches for handling violations, and scales operation with indicators, tooling, and metrics to continuously improve data quality across the lifecycle.

Data Governance · Data Quality · Operations

0 likes · 19 min read

Qunar Tech Salon

Dec 2, 2022 · Product Management

Building a User Experience Digital Platform: Metrics, Data Collection, and Operational Practices at Qunar

The article details Qunar's user‑experience digital platform, explaining its background, measurement model, metric scoring, data‑collection mechanisms for Android and iOS, and the operational plan that drives continuous improvement across teams and products.

Digital Platform · Operations · user experience

0 likes · 12 min read

Efficient Ops

Dec 1, 2022 · Operations

Why Choose Loki Over ELK? A Hands‑On Guide to Deploying and Using Grafana Loki

This article explains the motivations for selecting Grafana Loki instead of ELK/EFK, introduces its core concepts and features, provides step‑by‑step deployment instructions for Promtail and Loki, and demonstrates how to configure Grafana, query logs, and handle label indexing, dynamic tags, and high‑cardinality challenges.

Grafana · Kubernetes · Observability

0 likes · 15 min read

DevOps

Dec 1, 2022 · Cloud Native

Why Dapr Is a 10× Better Cloud‑Native Runtime: Benefits for Developers, Operators, and Architects

The article explains the 10×‑better theory, introduces Dapr as a cloud‑native sidecar framework, and details how it improves productivity for developers, enhances security, resilience and observability for operators, and offers multi‑language, multi‑environment flexibility for architects, while also acknowledging its drawbacks.

10x · Dapr · Operations

0 likes · 22 min read

Liangxu Linux

Nov 30, 2022 · Operations

Essential Ops Checklist: Prevent Data Loss, Secure Servers, and Optimize Performance

This guide shares practical operations best practices, covering safe online procedures, data protection, security hardening, daily monitoring, performance tuning, and the right mindset to avoid costly mistakes and keep production environments stable and secure.

Operations · backup · monitoring

0 likes · 11 min read

Alibaba Cloud Native

Nov 30, 2022 · Operations

How to Observe RocketMQ Message Lifecycle with OpenTelemetry Metrics

This article explains how RocketMQ's message lifecycle can be fully observed using OpenTelemetry‑based metrics, covering producer, broker, and consumer stages, and shows practical monitoring, alerting, and troubleshooting practices for cloud‑native deployments.

Observability · OpenTelemetry · Operations

0 likes · 12 min read

Data Thinking Notes

Nov 28, 2022 · Big Data

Unlocking Data Value: How Metadata Drives Efficient Data Management and Quality

This comprehensive guide explains how metadata connects source data, warehouses, and applications, outlines its technical and business classifications, demonstrates its value for data management, profiling, portals, and ETL development, and details optimization, storage, lifecycle, and quality practices essential for robust big‑data operations.

Big Data · Data Quality · Data Warehouse

0 likes · 35 min read
