Tagged articles

3281 articles

Mar 18, 2024 · Cloud Native

Is Your Kubernetes Setup Secure? A Complete Best‑Practice Checklist

This article provides a thorough checklist covering application deployment, service governance, and cluster configuration in Kubernetes, including health probes, graceful shutdown, fault tolerance, resource limits, labeling, logging, scaling, RBAC, network policies, and compliance with CIS benchmarks.

Cloud Native · Kubernetes · Operations

27 min read

Architecture Development Notes

Mar 18, 2024 · Operations

Designing an Operations Platform: Architecture, Core Components, and Extensions

This article explains how an operations platform can automate and streamline IT management by detailing its core value, essential components such as CMDB, monitoring, automation tools, ticketing, and analytics, and outlining implementation steps, technology choices, and advanced extensions like AI and DevOps integration.

CMDB · DevOps · Operations

7 min read

php Courses

Mar 18, 2024 · Operations

Understanding Load Balancing and Its Implementation with Docker and Nginx

This article explains the concept and importance of load balancing, then demonstrates a practical Docker‑Compose setup with multiple PHP containers and an Nginx reverse proxy, including configuration files and test results that show how traffic is distributed to improve system reliability and performance.

Docker · Nginx · Operations

5 min read

Architect

Mar 16, 2024 · Operations

How Unified Alert Convergence Can Slash Monitoring Noise and Boost MTTA/MTTR

This article analyzes the shortcomings of fragmented monitoring systems, defines key metrics such as MTTA and MTTR, proposes a unified alert convergence architecture using Redis delayed queues, and details design, implementation, and future AI‑enhanced improvements to reduce alert fatigue and accelerate incident response.

MTTA · MTTR · Operations

22 min read

Alibaba Cloud Native

Mar 15, 2024 · Operations

How Cloud Disk Types Affect Kafka Instance Performance: A Hands‑On Test

This guide demonstrates how cloud disk type influences the performance of Alibaba Cloud's Kafka instances, detailing a CADT‑driven deployment, step‑by‑step load‑testing procedure, required prerequisites, and architecture overview to help users select optimal specifications.

Alibaba Cloud · CADT · Cloud Disk

5 min read

Practical DevOps Architecture

Mar 15, 2024 · Operations

Comprehensive Practical Guide to Prometheus Configuration, Optimization, and Source Code Development

This multi‑chapter guide provides in‑depth, hands‑on instruction for configuring and optimizing all Prometheus components, exploring Kubernetes monitoring, source‑code analysis, custom exporter development, high‑availability setups, service discovery, resource‑efficient scraping, and integrating Thanos for long‑term storage.

Kubernetes · Operations · Prometheus

4 min read

Efficient Ops

Mar 13, 2024 · Operations

What Does an Operations Engineer Do? Skills, Tools, and Career Path

This article explains the role of an operations and maintenance (O&M) engineer, covering daily responsibilities, essential knowledge such as Linux and networking, common monitoring tools, and emerging career paths like DevOps, AIOps, and SRE, helping newcomers understand how to start and grow in the field.

DevOps · Linux · Operations

6 min read

Model Perspective

Mar 13, 2024 · Operations

Evaluating City Efficiency with DEA’s CCR and BCC Models

This article introduces Data Envelopment Analysis (DEA) as a non‑parametric method for assessing relative efficiency of decision‑making units, explains the CCR and BCC models, and demonstrates their application in evaluating and comparing the efficiency of various U.S. cities using real‑world data.

BCC · CCR · DEA

9 min read

Linux Cloud Computing Practice

Mar 13, 2024 · Operations

Top 10 Essential Tools Every Operations Engineer Should Master

This article introduces ten indispensable tools for operations engineers, detailing each tool's functionality, typical use cases, key advantages, and real‑world examples, helping professionals streamline automation, monitoring, configuration, and deployment tasks and improve overall system reliability.

Infrastructure · Operations · monitoring

6 min read

macrozheng

Mar 12, 2024 · Operations

Why HertzBeat Could Be Your Next Agentless Monitoring Solution

This article introduces HertzBeat, an open‑source real‑time monitoring and alerting system that offers powerful template‑based monitoring without agents, explains its Docker‑quick start, demonstrates how to monitor Redis and SpringBoot services, and walks through email alarm configuration.

Operations · agentless · redis

7 min read

Efficient Ops

Mar 11, 2024 · Operations

Essential Linux Ops: Proven Troubleshooting Steps for Common Failures

This guide outlines a systematic Linux operations troubleshooting framework—emphasizing error messages, log analysis, root‑cause isolation, and step‑by‑step solutions for six real‑world scenarios ranging from filesystem corruption to inode exhaustion and read‑only file‑system errors.

Linux · Operations · Shell Commands

7 min read

21CTO

Mar 11, 2024 · Operations

How Netlify’s AI Debugger Turns Failed Deploys into Quick Fixes

Netlify’s new AI‑assisted deployment feature automatically analyzes build failures, offers diagnostic suggestions, and helps developers resolve issues faster, though its recommendations are best‑effort and may require manual verification.

AI debugging · Deployment · Netlify

5 min read

DevOps Operations Practice

Mar 10, 2024 · Operations

Key Competencies for an Excellent Operations Director

The article outlines the essential technical knowledge, team management, project management, cross‑department coordination, strategic planning, and leadership abilities required for a senior operations director to succeed and advance toward executive roles.

Leadership · Operations · Project Management

5 min read

Open Source Linux

Mar 7, 2024 · Operations

How to Fix Disk‑Full Issues in Legacy Kubernetes Clusters Using Docker

This guide explains why old Kubernetes clusters that use Docker can run out of disk space, describes the symptoms such as pods stuck in ContainerCreating, and provides step‑by‑step commands to clean Docker files, prune images, adjust kubelet settings, and prevent future disk‑full problems.

Disk Cleanup · Garbage Collection · Operations

11 min read

dbaplus Community

Mar 5, 2024 · Operations

How to Recover a Failing Elasticsearch Cluster: Master Loss, Shard Corruption, and More

This guide explains Elasticsearch cluster architecture, node roles, and metadata storage, then details step‑by‑step recovery procedures for master‑node loss, complete master outage, data‑node failures, shard allocation problems, corrupted shards, translog issues, and missing segment files, including relevant API commands and tool usage.

Cluster Recovery · Data Node · Elasticsearch

17 min read

JD Retail Technology

Mar 5, 2024 · Operations

Rethinking DevOps: The Rise of Platform Engineering and Its Impact on Software Delivery

This article examines the growing tension between traditional DevOps practices and the emerging concept of platform engineering, exploring why developers resist operational duties, the core principles of platform engineering, success factors, metrics, and future trends shaping software delivery in modern organizations.

Operations · internal platforms · platform engineering

14 min read

Open Source Tech Hub

Mar 5, 2024 · Operations

How to Expose Intranet Web Services with Custom Domains Using frp

This guide explains what frp is, why it’s a strong reverse‑proxy choice, and provides step‑by‑step instructions—including configuration files, port opening, and domain setup—to expose internal web services through custom domains securely.

Custom Domain · Network Configuration · Operations

7 min read

Architecture Digest

Mar 3, 2024 · Operations

Graceful Shutdown in Microservices: Concepts, Kubernetes Example, and Optimizations

This article explains the concept of graceful shutdown, outlines general steps, presents a detailed Kubernetes‑SpringBoot‑Nacos case study, discusses common pitfalls, and provides practical optimization techniques for reliable service termination in cloud‑native environments.

Graceful Shutdown · Nacos · Operations

10 min read

Open Source Linux

Mar 1, 2024 · Operations

How Two‑Site Three‑Center Disaster Recovery Boosts Business Continuity with Oracle Data Guard

The two‑site three‑center disaster recovery model combines a production site, a same‑city backup, and a remote backup to ensure data integrity and rapid recovery, leveraging Oracle Data Guard for synchronized and asynchronous replication, thereby improving RPO and RTO across various disaster scenarios.

Operations · Oracle Data Guard · business continuity

4 min read

Efficient Ops

Feb 27, 2024 · Operations

Master Docker Logging and Graylog Integration: A Step‑by‑Step Guide

This guide explains how Docker captures container output, stores it as JSON logs, configures various log drivers, and integrates with Graylog for centralized log management, including deployment, input setup, and sending logs from containers via Docker run or docker‑compose.

Container · Docker · Docker Compose

8 min read

Volcano Engine Developer Services

Feb 22, 2024 · Cloud Native

How BMQ’s Cloud‑Native Compute‑Storage Separation Revolutionizes Message Queues

This article explains how ByteDance’s BMQ, a cloud‑native message engine with a compute‑storage separated architecture, overcomes Kafka’s scalability and operational limits by using Proxy, Broker, Coordinator, and Controller modules, a distributed storage model, and advanced caching to achieve rapid scaling, high throughput, and resilient operations.

Cloud Native · Message Queue · Operations

15 min read

Efficient Ops

Feb 21, 2024 · Operations

Why Organizational DevOps Assessments Are Critical for 2024‑2027 Tech Maturity

The article explains how Gartner predicts DevOps will reach production maturity by 2024‑2027, describes the organization‑level DevOps assessment framework of China's CAICT, its standards, classification rules, statistical results across industries, and the tangible benefits reported by participating enterprises.

Capability Maturity · DevOps · Operations

8 min read

Efficient Ops

Feb 19, 2024 · Operations

Mastering Prometheus: Practical Tips for Effective Application Monitoring

This article explains how to design and implement Prometheus metrics for application monitoring, covering the selection of monitoring targets, golden metrics, label conventions, naming rules, histogram bucket choices, and Grafana visualization tricks to help engineers build reliable observability pipelines.

Grafana · Metrics · Operations

10 min read

Alibaba Cloud Developer

Feb 18, 2024 · Operations

Why Software Supply Chain Consistency Is the Hidden Cost of Scaling

Software development involves both value‑creating features and unavoidable maintenance costs; this article explains how the hidden software supply chain—frameworks, libraries, runtime, cloud services, and configurations—creates consistency challenges, and proposes strategies such as explicit declarations, IaC, serverless, and mono‑repo to reduce scaling costs.

Operations · Scalability · Serverless

21 min read

ITPUB

Feb 17, 2024 · Operations

Why Ops Professionals Must Look Up: The 4+1+1+1 Framework Explained

The article reflects on the relentless challenges of IT operations, outlines the never‑ending skill gaps, standards, trends, and blame that ops teams face, and introduces a 4+1+1+1 model that separates developers, testers, and security staff from four core ops responsibilities to guide the systematic construction of an ops system.

4+1+1+1 model · IT ops · Infrastructure Management

6 min read

Architects' Tech Alliance

Feb 17, 2024 · Operations

How to Design Highly Reliable Servers: Principles, Methods, and Testing

This article explains why server reliability matters, clarifies core reliability concepts, outlines key analysis techniques, and presents practical testing and verification methods to help engineers build more dependable server systems.

Operations · Performance Testing · System Design

3 min read

Open Source Linux

Feb 17, 2024 · Operations

Master Linux System Logs: Command-Line Tools, Files, and GUI Utilities

Learn how to view and analyze Linux system logs using command-line utilities like journalctl and dmesg, explore key log files in /var/log, and leverage graphical tools such as GNOME Logs, KSystemLog, and Logwatch for effective troubleshooting and performance monitoring.

Linux · Log Management · Operations

5 min read

Code Ape Tech Column

Feb 16, 2024 · Operations

Building a Linux Host Monitoring System with Prometheus, Grafana, and Node Exporter

This guide walks through installing and configuring Prometheus, Grafana, and Node Exporter on a Linux server using Docker, shows how to set up monitoring dashboards, customize PromQL queries, and verify the monitoring system, providing complete steps and code snippets for a functional host monitoring solution.

Grafana · Linux monitoring · Operations

9 min read

Ops Development & AI Practice

Feb 16, 2024 · Operations

How to Make systemd Services Recognize the Correct PATH Variable

This guide explains why systemd services often miss user environment variables like PATH and provides three practical solutions—including setting Environment= in the unit file, using a wrapper script, and configuring a global PATH—to ensure services locate commands reliably.

Environment Variables · Linux · Operations

6 min read

MaGe Linux Operations

Feb 15, 2024 · Operations

How to Fix Linux Memory and Disk Space Problems with Swap and File Management

This guide explains why Linux servers run out of memory or disk space, how to create and enable swap files, locate and remove large or numerous small files, use soft links to expand storage, and release space held by deleted files still opened by processes.

Linux · Memory Management · Operations

7 min read

Architect's Guide

Feb 15, 2024 · Operations

Common ELK Deployment Architectures and Practical Solutions for Log Management

This article introduces the core components of the ELK stack, compares three typical deployment architectures—including Logstash‑only, Filebeat‑assisted, and Kafka‑backed designs—and provides concrete configuration examples and troubleshooting tips for multiline merging, timestamp handling, and module‑level log filtering.

ELK · Elasticsearch · Filebeat

11 min read

21CTO

Feb 7, 2024 · Operations

Master Your Developer Workflow: Proven Time‑Management Techniques

This article explains why effective time management is essential for developers, explores psychological, physiological, and technical dimensions, and presents practical techniques such as weekly planning, the Pomodoro method, goal‑based planning, and the Eisenhower matrix to boost productivity and work‑life balance.

Developer Workflow · Operations · pomodoro

13 min read

JD Cloud Developers

Feb 6, 2024 · Operations

How We Boosted Nginx Performance 50× by Tuning Gzip Settings

This article documents a real‑world Nginx optimization case where adjusting gzip compression levels and switching to static gzip reduced CPU usage dramatically, enabling a load of 90,000 QPS to be handled with only 7% CPU and achieving over a 50‑fold performance gain.

Backend · Gzip · Nginx

8 min read

Rare Earth Juejin Tech Community

Feb 5, 2024 · R&D Management

Comprehensive Guide to the Workflow Management System: Framework, Features, Process Design, and Operations

This document provides a detailed English guide to a workflow management system, covering its underlying frameworks, feature list, process design operations, UI components, form design, deployment steps, and user interactions such as initiating, reviewing, and handling tasks.

Operations · Process Design · R&D management

17 min read

dbaplus Community

Feb 4, 2024 · Operations

How Ant Group Leverages SLO and AIOps for Fine‑Grained Operations

This article details Ant Group's practical implementation of Service Level Objectives (SLO) and AIOps to achieve fine‑grained operations, covering SLO fundamentals, health‑score architecture, GitOps‑based data pipelines, error‑budget alerting, AI‑driven anomaly detection, fault localization techniques, and real‑world case studies on dashboards, Kubernetes SLOs, and emergency response workflows.

Error Budget · Fault Localization · Kubernetes

38 min read

Liangxu Linux

Feb 3, 2024 · Operations

Why Does My Kubernetes Pod Get OOMKilled Before Reaching Its Memory Limit?

A pod in a Kubernetes cluster repeatedly restarted with exit code 137 despite staying well below its 6 Gi memory limit, prompting an investigation that uncovered the role of QoS classes, oom_score calculations, and node‑level memory pressure in the eviction process.

Kubernetes · OOMKill · Operations

9 min read

MaGe Linux Operations

Jan 31, 2024 · Operations

Master Zabbix Monitoring: SQL Queries, Binlog Status, Replication & Cleanup Scripts

This guide shows how to use Zabbix to monitor database SQL results, binlog health, master‑slave replication, and how to clean up old Zabbix history with practical Bash scripts for MySQL/GreatDB and ClickHouse.

Database Monitoring · Operations · Zabbix

9 min read

Data Thinking Notes

Jan 30, 2024 · Operations

How Banks Can Build an Effective Data Governance Framework

This article outlines a two‑step approach for banks to design a data governance system—clarifying organizational responsibilities and constructing a layered institutional framework—while detailing cross‑department collaboration, head‑office and branch coordination, and practical policy, procedure, and work‑detail levels to sustain continuous improvement and support digital transformation.

Banking · Data Governance · Data Management

10 min read

dbaplus Community

Jan 29, 2024 · Artificial Intelligence

How Meituan Uses AIOps to Revolutionize Incident Management

This article details Meituan's two‑year exploration of AIOps for incident management, covering the challenges of massive, real‑time operational data, the AI‑driven modules for risk prevention, fault detection, diagnosis, and similar‑incident recommendation, and future directions such as intelligent log detection and change recognition.

Operations · Root Cause Analysis · aiops

22 min read

21CTO

Jan 28, 2024 · Operations

Why IPv4 Is Getting Expensive and How to Overcome IPv6 Migration Challenges

The article explains IPv4 address exhaustion, the emerging fees for public IPv4, and the technical, operational, and tooling hurdles that organizations face when transitioning to IPv6, while outlining three strategic options and real‑world migration experiences.

IPv4 · IPv6 · Network Migration

13 min read

Architect

Jan 28, 2024 · Operations

How We Built Real‑Time SLA Monitoring for Message Push and Doubled Throughput

This article details the end‑to‑end design, node‑level splitting, metric definition, and Spring‑based implementation of SLA monitoring for a high‑volume message‑push system, showing how precise latency and vendor‑stability metrics uncovered bottlenecks, enabled rapid issue detection, and ultimately doubled overall throughput.

Message Push · Microservices · Operations

14 min read

Architect

Jan 27, 2024 · Industry Insights

How We Built a Scalable Smart Customer Service System for an Activity Platform

This article details the end‑to‑end design, implementation, and operational results of a smart customer‑service platform that automates FAQ capture, leverages both Elasticsearch and LLM‑based models, and provides a low‑code, multi‑team backend for rapid issue resolution.

Elasticsearch · Microservices · Operations

13 min read

Python Programming Learning Circle

Jan 27, 2024 · Operations

Automating Log Monitoring, Email Reporting, and DingTalk Alerts with Python

This article presents a Python‑based solution that queries LogEasy data, calculates key metrics such as total requests, 5xx errors, average response time, and unique visitors, formats the results into Excel and HTML reports, and automatically sends them via email and DingTalk alerts for operational monitoring.

DingTalk · Log Monitoring · Operations

30 min read

IT Services Circle

Jan 25, 2024 · Operations

How to Resolve Online Message Queue Backlog Issues

This article explains why message queues can become backlogged, identifies producer and consumer causes, and provides practical strategies—including adding consumers, increasing queue capacity, optimizing consumption logic, implementing failure handling, and rapid remediation steps—to quickly resolve backlog in production environments.

Backlog · Message Queue · Operations

7 min read

Efficient Ops

Jan 24, 2024 · Backend Development

Mastering Nginx: Reverse Proxy, Load Balancing, and High Availability Explained

This comprehensive guide introduces Nginx’s high‑performance architecture, explains forward and reverse proxy concepts, demonstrates load‑balancing and static‑dynamic content separation, provides practical configuration commands, and walks through real‑world setups for reverse proxy, load‑balancing, static‑dynamic separation, and high‑availability using Keepalived.

Nginx · Operations · Server Configuration

16 min read

DevOps

Jan 23, 2024 · Operations

Collection of Bash Scripts for Server Monitoring, Automation, and Deployment

This article provides a curated set of Bash scripts covering MySQL replication monitoring, directory change detection, bulk user creation, website health checks, remote command execution, LNMP stack deployment, server resource reporting, high‑resource process identification, and automated deployment of Java and PHP projects, offering practical automation tools for system administrators.

Bash · Deployment · Operations

12 min read

Efficient Ops

Jan 23, 2024 · Operations

How Shenwan Hongyuan Securities Automated Operations: Key Takeaways from GOPS 2023

The 21st GOPS Global Operations Conference in Shanghai featured Shenwan Hongyuan Securities' Yusi Song presenting an in‑depth look at automated operations, covering achievements, experience summaries, and future plans, with slide images and a downloadable PPT for attendees.

DevOps · Operations · SRE

2 min read

dbaplus Community

Jan 22, 2024 · Operations

How NetEase Cloud Music Built a Resilient RPC Framework for Microservices

This article details the practical steps and architectural choices NetEase Cloud Music took to improve RPC stability in a micro‑service environment, covering service discovery, connection management, cloud‑native challenges, SLO design, log governance, degradation, rate limiting, outlier detection, thread‑pool isolation, fast‑failure handling, registry optimizations, multi‑registry support, and post‑incident knowledge‑base building.

Cloud Native · Operations · RPC

14 min read

MaGe Linux Operations

Jan 22, 2024 · Operations

How to Simulate CPU, Memory, and Disk Load on Linux with Stress and dd

This guide explains how to use the stress and dd utilities together with a custom shell script to artificially consume CPU, memory, and disk resources on idle cloud servers, helping avoid budget cuts by keeping resource utilization high for a configurable period.

Linux · Operations · Shell

21 min read

Data Thinking Notes

Jan 18, 2024 · Product Management

How Management Dashboards Transform E‑Commerce Data Operations: A Practical Guide

This article explores the design, implementation, and iterative improvement of management dashboards in fast‑moving e‑commerce data operations, covering metric system construction, product interaction, data accuracy, user experience, and common challenges with actionable solutions.

Dashboard · Data Analytics · Metrics

13 min read

IT Services Circle

Jan 17, 2024 · Operations

How to Disable PCDN in QQ Music Desktop Client to Prevent Upstream Bandwidth Consumption

The article explains that QQ Music desktop client can unexpectedly upload large amounts of data by acting as a PCDN node, but users can easily stop this behavior by disabling the playback acceleration service in the client settings, preserving their upstream bandwidth on both Windows and Mac.

Desktop Client · Operations · PCDN

3 min read

Architecture Digest

Jan 17, 2024 · Operations

Comprehensive Guide to Workflow Process Design, Deployment, and Management

This guide explains how to create, view, edit, and design workflow processes, describes the components of the process designer—including drag‑panel, canvas, property and control panels—covers form design, deployment, process definition, request initiation, task handling, approval actions, delegation, and related source code references.

Operations · Process Design · form design

10 min read

Efficient Ops

Jan 16, 2024 · Operations

How Top Chinese Exchanges Accelerated DevOps Maturity: Insights from CAICT Assessments

Amid a nationwide digital transformation push, four leading Chinese exchanges adopted the CAICT DevOps Capability Maturity Model, achieving multiple level‑3 and level‑2 assessments that boosted IT efficiency, integrated resources, and better supported business systems, offering valuable lessons for the industry.

Continuous Delivery · DevOps · Digital Transformation

8 min read

Open Source Linux

Jan 16, 2024 · Operations

Essential Linux Command Cheat Sheet: Master Files, Processes, and Shell Basics

This comprehensive guide covers essential Linux commands for navigating directories, managing files, controlling processes, setting permissions, using search utilities, customizing the shell, and performing common administrative tasks, providing clear examples and syntax for each operation.

Bash · Operations · Shell

19 min read

Efficient Ops

Jan 15, 2024 · Operations

How Chinese City Banks Boost IT Efficiency with the DevOps Maturity Model

Amid a nationwide digital transformation push, twelve Chinese city commercial banks adopted the CAICT‑led DevOps Capability Maturity Model, achieving higher IT efficiency, integrated resources, and faster, higher‑quality service delivery across continuous delivery, technical operations, security, and performance measurement standards.

Continuous Delivery · DevOps · Digital Transformation

18 min read

Efficient Ops

Jan 15, 2024 · Operations

How China’s Top Banks Accelerate IT Efficiency with DevOps Maturity Assessments

Seven leading Chinese joint‑stock banks have evaluated a total of 62 projects against the CAICT DevOps Capability Maturity Model, revealing how continuous delivery, technical operation, security, and performance measurement standards are driving IT efficiency, cultural change, and faster value delivery across the financial sector.

DevOps · IT efficiency · Maturity Model

18 min read

Liangxu Linux

Jan 14, 2024 · Operations

Deploy and Manage Linux Servers Easily with the Open‑Source 1Panel Dashboard

This guide introduces the free, secure, and continuously updated 1Panel visual management panel for Linux servers, explains its key features, shows one‑line installation commands for CentOS and Ubuntu, and details access, security, backup, and upgrade procedures.

1Panel · Linux · Operations

5 min read

DevOps

Jan 12, 2024 · Operations

Why Building a Never‑Failing System Is Impossible and How to Pursue Continuous High Availability

The article analyses why truly never‑failing systems cannot exist, citing entropy and Murphy’s Law, examines the organizational and technical obstacles to continuous high availability, and offers practical cultural and engineering practices such as testing, code review, monitoring, and regular system health checks to mitigate risk.

Murphy's Law · Operations · SRE

0 likes · 14 min read

Liangxu Linux

Jan 10, 2024 · Operations

Top 10 Essential Tools Every Operations Engineer Should Master

This guide introduces ten widely used operations engineering tools—Shell scripts, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing their functions, typical scenarios, advantages, and practical examples to help engineers choose the right solution for automation, monitoring, and management tasks.

Configuration Management · Operations · devops tools

0 likes · 8 min read

Efficient Ops

Jan 9, 2024 · Operations

35 Must‑Know Linux Operations Interview Questions & Answers

This comprehensive guide compiles 35 essential Linux operations interview questions covering server management, RAID configurations, load balancing with LVS/Nginx/HAProxy, proxy choices, middleware, MySQL troubleshooting, networking tools, security practices, and practical scripts, providing concise answers to help candidates ace DevOps and sysadmin roles.

Linux · Operations · interview

0 likes · 34 min read

Efficient Ops

Jan 9, 2024 · Operations

What Do 2023 DevOps & AIOps Assessments Reveal About China’s Digital Transformation?

Amid China's sweeping digital, networked, and intelligent transformation, over 100 leading enterprises across banking, finance, communications, manufacturing, and other sectors have participated in DevOps and AIOps maturity model evaluations, providing a comprehensive view of industry adoption, capability levels, and emerging best practices for 2023.

DevOps · Digital Transformation · Operations

0 likes · 15 min read

High Availability Architecture

Jan 9, 2024 · Operations

AIOps Practices for Incident Management at Meituan: From Risk Prevention to Post‑Operation

This article presents Meituan's two‑year exploration of AIOps in incident management, detailing risk‑prevention change detection, real‑time anomaly discovery, automated root‑cause diagnosis, multi‑dimensional KPI analysis, and similar‑event recommendation, while sharing architectural designs, algorithmic techniques, performance results, and future directions.

NLP · Operations · Root Cause Analysis

0 likes · 24 min read

dbaplus Community

Jan 8, 2024 · Operations

How a Simple Time Adjustment Sparked a Massive Outage: Real Ops Incident Stories

Three real-world operations mishaps are recounted—a mistaken system‑time change that logged out thousands of users, an accidental bulk delete of database accounts, and a failed glibc downgrade that stalled a software release—illustrating the cascading impact of small errors and the urgent remediation steps taken.

Linux · Operations · Sysadmin

0 likes · 8 min read

Efficient Ops

Jan 8, 2024 · Operations

What Do 2023 DevOps & AIOps Assessments Reveal About China’s Digital Transformation?

Amid China's sweeping digital transformation, the China Academy of Information and Communications Technology (CAICT) reports that in 2023, 104 leading enterprises across banking, securities, insurance, telecom, manufacturing, and other sectors completed 336 DevOps maturity assessments, and 23 enterprises completed 45 AIOps assessments. The results highlight industry‑wide adoption of DevOps and AIOps standards, and the article offers breakdowns by sector and evaluation level along with guidance for the future.

DevOps · Digital Transformation · Maturity Model

0 likes · 16 min read

Efficient Ops

Jan 8, 2024 · Information Security

How a Securities Firm Built a 100‑Day DevSecOps Prototype

At the 21st GOPS Global Operations Conference in Shanghai, Shenwan Hongyuan Securities' application security lead Wang Biansi detailed a step‑by‑step 100‑day journey to build a DevSecOps prototype, covering goal setting, research, platform design, tool integration, and security training.

Application Security · DevSecOps · Information Security

0 likes · 5 min read

FunTester

Jan 7, 2024 · Operations

Integrating Monitoring and Observability for Effective Application Performance Management

The article explains how combining traditional monitoring with modern observability, supported by data quality practices and unified workflows, enables more reliable, scalable, and insightful application performance management in agile and cloud‑native environments.

APM · Data Quality · Operations

0 likes · 18 min read

Zhuanzhuan Tech

Jan 5, 2024 · Operations

Building an Integrated Monitoring Platform: Architecture, Implementation, and Lessons from ZhaiZhai

This article presents a detailed case study of how ZhaiZhai designed and implemented a unified monitoring platform—combining business services, middleware, and operations resources—by selecting Prometheus and M3DB, automating Grafana dashboards, creating a low‑noise alerting system, and achieving large‑scale observability with significant cost and efficiency gains.

Alerting · M3DB · Operations

0 likes · 21 min read

MaGe Linux Operations

Jan 3, 2024 · Operations

Master Apache Access & Error Logs: Formats, Analysis, and Monitoring Tips

This article explains what Apache access and error logs are, details the information they record, describes common log formats, shows where logs are stored on different operating systems, and offers guidance on analyzing and monitoring these logs for performance, security, and troubleshooting.

Access Log · Apache · Operations

0 likes · 15 min read

DevOps Engineer

Dec 31, 2023 · Operations

Automating GitHub Release Notes Classification with Release.yml and Release Drafter

This article explains two practical methods—using GitHub's native release.yml configuration and the third‑party Release Drafter tool—to automatically categorize GitHub Release Notes by title, complete with example configurations, code snippets, and a comparison of their features and limitations.

GitHub · Operations · Release Drafter

0 likes · 9 min read

21CTO

Dec 30, 2023 · Operations

How G Bank Turns Application Monitoring into Business‑Driven Visual Operations

This article examines how G Bank builds an application monitoring system based on ITIL and Google SRE principles, identifies its shortcomings, and evolves the platform into a visualized operations solution that aligns technical and business perspectives for faster incident resolution and improved customer experience.

Banking · ITIL · Operations

0 likes · 11 min read

Architect

Dec 29, 2023 · Industry Insights

How Bilibili Built a Scalable Anti‑Crawling System: Architecture, Data Flow, and Real‑World Impact

The article details Bilibili's comprehensive anti‑crawling solution, covering the problem background, a two‑layer detection framework integrated with APIGW and GAIA, risk perception, strategy iteration, verification mechanisms, quantitative results, and future improvement directions, all illustrated with concrete examples and performance numbers.

API Security · Bilibili · Operations

0 likes · 23 min read

JD Retail Technology

Dec 29, 2023 · Operations

Bug Bash Practice Guide for Big Data Real‑Time Platform Teams

This guide details how the Big Data Real‑Time Platform department organized a Bug Bash activity to train new staff, enhance cross‑product knowledge, improve product quality, and strengthen team collaboration through structured preparation, execution, and post‑event analysis.

Big Data · Bug Bash · Operations

0 likes · 8 min read

WeiLi Technology Team

Dec 28, 2023 · Operations

Why Pods Get Evicted: Diagnosing DiskPressure in Kubernetes Nodes

This article walks through a real‑world Kubernetes incident where a node’s disk usage exceeded the eviction threshold, causing pods to enter the Evicted state, and details the investigation steps, root‑cause analysis, and practical remediation actions.

AWS · DiskPressure · Karpenter

0 likes · 6 min read

ITPUB

Dec 27, 2023 · Operations

When a Snapshot Share Became a Data Leak: Lessons from a Cloud Ops Failure

A developer mistakenly set a cloud disk snapshot to public, exposing a major client’s data, and recounts the frantic rollback, the ensuing panic among teammates, and the hard‑won operational lessons about high‑risk manual tasks, proper safeguards, and the need for visualized tooling.

Operations · data security · incident response

0 likes · 10 min read

Selected Java Interview Questions

Dec 25, 2023 · Operations

Understanding ByteDance (Douyin) Data Center Bandwidth and Server Scale

This article explains how ByteDance's Douyin platform achieves massive concurrent user capacity by operating data centers with multi‑terabit outbound bandwidth, extensive server fleets, CDN acceleration, and dual‑link designs, providing a technical overview of its infrastructure and bandwidth estimates.

ByteDance · CDN · Operations

0 likes · 10 min read

Su San Talks Tech

Dec 25, 2023 · Operations

Why Our E‑commerce Home Page Slowed to 20 seconds and How We Fixed It

A recent e‑commerce incident caused the home page to take 20 seconds to load due to a Redis memory overload, and the team resolved it by expanding memory, redesigning data structures, and implementing a layered caching strategy with local cache, MongoDB, and fallback mechanisms.

MongoDB · Operations · cloud

0 likes · 9 min read

Zhuanzhuan Tech

Dec 23, 2023 · Operations

Investigation of Zookeeper 3.4.6 Election Port (3888) Failure Caused by Malformed Packets

This article investigates a Zookeeper 3.4.6 cluster whose election port 3888 became unresponsive because malformed packets triggered a NegativeArraySizeException, walks through the diagnostic steps and root‑cause analysis, and recommends upgrading to a newer version to fix the issue.

ApacheZookeeper · ClusterTroubleshooting · ElectionPort

0 likes · 11 min read

Efficient Ops

Dec 21, 2023 · Operations

How China Galaxy Securities Achieved Level 3 DevOps Continuous Delivery – A Success Story

China Galaxy Securities detailed how three core projects passed the DevOps Continuous Delivery Level‑3 assessment, highlighting tool upgrades, process improvements, metric gains, cultural shifts, and future plans that illustrate the tangible benefits of standardized DevOps practices in a financial institution.

Jinzhou Bank’s mobile banking investment service microservice transformation project passed the CAICT DevOps Continuous Delivery Level 3 assessment, showcasing how standardized DevOps practices, tool empowerment, and agile adoption dramatically improved delivery speed, quality, and competitive advantage in the financial sector.

At the 21st GOPS Global Operations Conference in Shanghai, Qunar Travel’s tech expert Zou Sheng shared a detailed hybrid‑cloud container stability practice covering IDC‑first deployment, resource utilization over 60%, phased migration, reliability improvements, AZ monitoring, and cost‑saving strategies.

Container Stability · DevOps · Operations

0 likes · 3 min read

Hybrid Cloud Container Stability: Qunar Travel’s Proven Practices from GOPS 2023

Ctrip Technology

Dec 14, 2023 · Operations

Improving Optical Transport Network Reliability at Ctrip: Architecture, Issue Analysis, and Optimization Strategies

This article describes Ctrip's optical transport network (TOTN) architecture, analyzes frequent fiber‑cut incidents and resulting device port flapping, presents technical research on fast optical switching and alarm delay, and details an optimization plan that achieved sub‑100 ms fault‑free switchover and stable Redis performance.

DCI · Link Delay · Network Reliability

0 likes · 11 min read

Alibaba Cloud Big Data AI Platform

Dec 14, 2023 · Operations

How GitOps Transforms Change Management: Automation, Code, and Transparency

GitOps leverages Git's version‑control strengths to automate, codify, and make transparent infrastructure changes, combining IaC, merge requests, and CI/CD, while exploring its principles, toolchains like FluxCD, ArgoCD, Jenkins X, and practical implementations such as SRE Stack for end‑to‑end change management.

Cloud Native · GitOps · Infrastructure as Code

0 likes · 17 min read

dbaplus Community

Dec 13, 2023 · Databases

Tackling the Top 8 Challenges of Domestic Databases in Banking and Proven Strategies

The article examines the rapid growth of domestic databases in China’s banking sector, identifies eight critical pain points—from product stability and resource consumption to tooling gaps and migration difficulties—and offers detailed countermeasures covering version upgrade planning, resource optimization, functional testing, skill development, monitoring, ecosystem building, data migration, and backup‑recovery improvements.

Operations · databases · domestic

0 likes · 16 min read

Qunhe Technology Quality Tech

Dec 12, 2023 · Operations

How We Built a Stable Offline Testing Environment with Cloud‑Native Practices

This article details the challenges of managing a complex, multi‑layered offline testing environment at KuJiaLe, outlines the standardization of baseline, functional, and integration environments, and explains the comprehensive stability measures—including infrastructure upgrades, automated checks, emergency response, and daily operations—that dramatically improved reliability.

Cloud Native · Operations · environment management

0 likes · 14 min read
