Tagged articles

Operations

3329 articles

Feb 14, 2019 · Operations

Scaling a 10,000‑Node Container Cloud: Ctrip’s Ops Practices and Lessons

This article details Ctrip's journey of building and operating a massive container cloud platform, covering its architectural evolution, operational challenges, tooling, capacity management, and future directions, offering practical insights for large‑scale cloud‑native environments.

Kubernetes · Operations · capacity management

0 likes · 17 min read

ITPUB

Feb 12, 2019 · Operations

Why Docker Might Be a Dangerous Gamble: Uncovering Its Design Flaws

The article presents a detailed critique of Docker, arguing that despite its marketed benefits of portability, security, and resource management, its design introduces significant complexity, hidden costs, and operational risks that many organizations overlook when adopting it for production workloads.

Docker · Operations · Software Architecture

0 likes · 29 min read

ITPUB

Feb 11, 2019 · Operations

How to Make Enterprise Networks Transparent and Efficient with Simple Monitoring Tools

This article explains how network engineers can use lightweight monitoring solutions, log analysis, traffic and error tracking, and custom automation scripts to gain visibility, reduce troubleshooting time, and safely automate routine network tasks in enterprise environments.

Automation · Network Monitoring · Operations

0 likes · 10 min read

21CTO

Feb 8, 2019 · Operations

Baidu’s Secret to Handling 9 Billion Spring Festival Red Envelope Interactions

During the 2019 Chinese New Year Gala, Baidu mobilized a massive technical operation—scaling cloud resources, isolating traffic, and deploying AI‑driven security—to flawlessly process over 9 billion red‑packet interactions despite unprecedented traffic spikes and login surges.

Operations · large-scale traffic · security

0 likes · 9 min read

21CTO

Feb 1, 2019 · Cloud Native

Is Docker a Hidden Trap? Uncovering the Real Costs Behind Container Hype

The article critically examines Docker’s promised benefits—portability, security, and orchestration—highlighting its design shortcomings, hidden complexities, lock‑in risks, and the often‑overlooked alternatives that can deliver the same goals with far less overhead.

Containers · Operations · cloud-native

0 likes · 28 min read

ITPUB

Jan 31, 2019 · Operations

Master Monitoring: Collect Metrics for New Systems Using White‑Box Techniques & the Four Golden SRE Indicators

This article explains how to approach monitoring for a newly introduced system by focusing on white‑box metric collection, distinguishing basic and business metrics, outlining common collection methods, and detailing Google SRE's four golden indicators—error, latency, traffic, and saturation—to guide effective observability.

Observability · Operations · SRE

0 likes · 10 min read

JD Tech

Jan 31, 2019 · Operations

Understanding White‑Box and Black‑Box Monitoring: Data Collection Methods and the Four Golden Metrics

This article explains the differences between white‑box and black‑box monitoring, outlines common data‑collection techniques for both basic and business metrics, and details Google SRE’s four golden indicators—error, latency, traffic, and saturation—to help engineers design effective monitoring solutions.

Operations · SRE · black-box

0 likes · 9 min read

Efficient Ops

Jan 30, 2019 · Operations

From Rookie to Ops Manager: Key Lessons on Linux, Infrastructure, and Career Growth

The author shares a journey from a college Linux basics class to becoming an operations manager, detailing early hands‑on tasks, challenges in chaotic server environments, the creation of monitoring systems, and three key career lessons about learning, deepening technical understanding, and evaluating workplace fit.

Career Advice · Linux · Operations

0 likes · 6 min read

Alibaba Cloud Developer

Jan 30, 2019 · Operations

How Youku Scaled IPv6 from Zero to 500K Users in Days

This article details Youku's rapid, large‑scale IPv6 rollout—from initial pilot to half‑million users—covering the motivations, phased migration plan, technical challenges, implementation steps across client and server, gray‑release strategies, monitoring, and future outlook.

IPv6 · Large-Scale Deployment · Network Migration

0 likes · 16 min read

Efficient Ops

Jan 24, 2019 · Operations

Why Goal‑Oriented Operations Platforms Are the Future of Infrastructure Management

As internet traffic surges, organizations must juggle legacy and new architectures while balancing stability and cost, prompting a shift toward goal‑oriented operations platforms that let systems automatically determine optimal actions based on real‑time conditions and knowledge bases.

Automation · Operations · Platform

0 likes · 5 min read

Efficient Ops

Jan 24, 2019 · Information Security

How Alibaba Scales Host Security Across Its Global Economic Ecosystem

This talk outlines Alibaba’s massive global host infrastructure, the evolving security governance from manual controls to data‑driven, automated systems, the challenges of compliance and operational efficiency, and future directions such as zero‑trust and invisible security.

Compliance · Host Security · Information Security

0 likes · 16 min read

UCloud Tech

Jan 24, 2019 · Operations

How UCloud Executed a Seamless Hot Migration of Its Seoul Data Center

This article details UCloud's five‑month, multi‑department hot migration of its Seoul data center, covering planning, ZooKeeper scaling, udatabase and MySQL migration strategies, deployment platforms, and the final cut‑over steps that ensured zero user impact.

Data Center Migration · Hot Migration · MySQL

0 likes · 14 min read

Efficient Ops

Jan 23, 2019 · Operations

Designing an Operations Monitoring Platform: Tools & Best Practices

This article explores the essential concepts for selecting and building an operations monitoring platform, reviewing popular tools such as Cacti, Nagios, Zabbix, Ganglia, Centreon, Prometheus, and Grafana, and outlines a six‑layer architecture and practical strategies for scaling, alerting, and high‑availability in diverse environments.

Alerting · Operations · devops

0 likes · 19 min read

Efficient Ops

Jan 10, 2019 · Operations

Essential DBA & Ops Practices to Prevent System Failures

This article outlines ten practical guidelines for DBAs and system administrators—including rollback‑ready changes, cautious use of destructive commands, prompt customization, reliable backups, production respect, thorough handovers, alerting, monitoring, careful failover, meticulous checks, and the virtue of simplicity—to minimize costly system outages.

Linux · Operations · Oracle

0 likes · 7 min read

Ctrip Technology

Jan 7, 2019 · Artificial Intelligence

AIOps Practices and Exploration at Ctrip: Challenges, Solutions, and Future Outlook

This article presents Ctrip's extensive AIOps exploration, detailing operational challenges caused by massive monitoring data, the evolution of DevOps practices, the design of intelligent anomaly detection and diagnosis systems, practical use cases, and a forward‑looking perspective on the future of AI‑driven operations.

AIOps · Fourier Transform · Machine Learning

0 likes · 20 min read

JD Tech

Jan 3, 2019 · Operations

Comprehensive Monitoring Strategies for E‑commerce Platforms: Black‑Box and White‑Box Approaches

This article systematically explains how to enhance e‑commerce platform availability by implementing both black‑box monitoring to detect functional failures and white‑box monitoring to pinpoint root causes, detailing core order‑process metrics, common issues, mitigation strategies, and illustrative Grafana dashboards.

Grafana · Operations · SRE

0 likes · 9 min read

Efficient Ops

Jan 2, 2019 · Operations

Essential Ops Practices: Prevent Disasters with Backups, Security, and Monitoring

This guide outlines critical operational practices for Linux server management, emphasizing thorough testing, cautious command execution, regular backups, strict access controls, comprehensive monitoring, performance tuning, and a disciplined mindset to avoid costly incidents and ensure system stability.

Operations · Server Management · monitoring

0 likes · 12 min read

Ops Development Stories

Dec 26, 2018 · Operations

How to Set Up Mailx and Zabbix Alerts with Email and WeChat Integration

This guide walks you through installing and configuring mailx, creating Zabbix email alert scripts, and integrating WeChat notifications by setting up the necessary scripts, parameters, and media actions within the Zabbix web interface.

Operations · WeChat integration · alert script

0 likes · 8 min read

58 Tech

Dec 26, 2018 · Operations

Overview of the 58 Intelligent Monitoring System and Its Multi‑Dimensional Architecture

The 58 Intelligent Monitoring System provides a flexible, 24/7, multi‑dimensional monitoring solution that covers network, server, system, application and business layers, incorporates AI‑driven prediction, anomaly detection, alarm merging, root‑cause analysis and self‑healing, and offers both PC and WeChat interfaces for operators.

Alerting · Automation · Machine Learning

0 likes · 16 min read

Efficient Ops

Dec 24, 2018 · Operations

How Baidu’s Noah Platform Unifies Ops Data with Pull, Push, and Lazy ETL

This article explains how Baidu Cloud's Noah intelligent operations product builds a unified operations knowledge base by categorizing metadata, status, and event data and applying three ETL approaches—Pull, Push, and Lazy—to handle offline, near‑line, and real‑time data integration.

Cloud Computing · Data Integration · ETL

0 likes · 8 min read

MaGe Linux Operations

Dec 24, 2018 · Operations

How to Quickly Diagnose and Fix High CPU Usage on a Data Platform Server

This guide walks through a step‑by‑step investigation of a sudden 98% CPU spike on a data‑platform server, showing how to pinpoint the offending process, trace the problematic Java thread, analyze the root cause in a time‑utility method, and apply an optimized solution that reduces CPU load by thirtyfold.

Backend Development · CPU troubleshooting · Java

0 likes · 7 min read

Programmer DD

Dec 23, 2018 · Operations

How to Implement Service Degradation for High Availability

This article explains the concept of service degradation, why it is needed to maximize limited resources during traffic spikes, outlines common degradation strategies, and provides practical steps and code examples for ranking, sequencing, and implementing degradation in both front‑end and back‑end systems.

High Availability · Operations · System Design

0 likes · 11 min read

DevOps

Dec 20, 2018 · Operations

What Is Kanban? Ten Things You Need to Know

The article introduces the Kanban method as a lean approach to managing professional services, outlines ten essential principles such as focusing on flow, incremental change, risk management, and scalability, and concludes with a recruitment announcement seeking DevOps engineers in Beijing.

Kanban · Operations · devops

0 likes · 8 min read

Youku Technology

Dec 20, 2018 · Operations

Youku IPv6 Migration: Planning, Implementation, and Lessons Learned

Youku’s pioneering IPv6 migration, launched in early 2018 and completed by Double 11, progressed through external, dual‑stack internal, and IPv6‑only phases, tackled test‑environment, MTU, and library issues, employed sophisticated gray‑release and monitoring, and ultimately unlocked unlimited address space, enhanced security, and faster, scalable video delivery.

Cloud · IPv6 · Network Migration

0 likes · 15 min read

Efficient Ops

Dec 19, 2018 · Cloud Computing

How to Build and Operate a National-Scale Private Cloud: Lessons and Trends

This talk outlines why organizations pursue cloud adoption, defines cloud‑native goals, reviews emerging trends such as bare‑metal and hyper‑convergence, and shares practical private‑cloud operation experiences, including ITIL processes, project management, and tooling, offering a comprehensive view of national‑level private‑cloud practice.

Bare Metal · ITIL · Operations

0 likes · 12 min read

AntTech

Dec 19, 2018 · Information Security

Red‑Blue Technical Attack‑Defense Exercises and SRE Practices at Ant Financial

Ant Financial’s internal red‑blue technical attack‑defense program, driven by a dedicated blue team and SRE‑based red team, continuously probes system weaknesses, refines fault‑injection tools like Awatch, and evolves high‑availability and self‑healing mechanisms to strengthen risk control and operational reliability.

Fault Injection · Information Security · Operations

0 likes · 10 min read

JD Tech

Dec 17, 2018 · Operations

Improving JD Intelligent Supply Chain Efficiency and System Stability for Major Sales Events

The article details JD's intelligent supply chain enhancements—including machine‑learning demand forecasting, a new "explosive product warehouse" model, non‑stock fulfillment visualization, blockchain‑based product traceability, and comprehensive system‑stability measures such as data‑consistency checkpoints, throughput buffering, and 24/7 incident response—to boost efficiency and reliability during large‑scale promotions.

Big Data · Machine Learning · Operations

0 likes · 7 min read

21CTO

Dec 15, 2018 · Information Security

When Deleting Databases Becomes Revenge: Real‑World Cases and What You Must Do

This article recounts several real incidents where disgruntled engineers or admins deleted critical databases as retaliation, highlighting the severe consequences and stressing that proper backups and cautious use of destructive commands are essential for any organization.

Operations · incident · rm

0 likes · 5 min read

Java Captain

Dec 15, 2018 · Fundamentals

Understanding Distributed and Cluster Deployments: A Restaurant Analogy

The article uses a restaurant scenario to explain the differences between centralized, cluster, and distributed system deployments, illustrating how performance, security, scalability, and availability map to user requirements and why scaling from a single server to clusters and distributed architectures is essential as demand grows.

Operations · Performance · scalability

0 likes · 7 min read

JD Tech

Dec 13, 2018 · Operations

Monitoring Puppet Configuration Management: Workflow, Metrics, and Troubleshooting

This article explains how to monitor the Puppet configuration management system, covering its request‑response‑execution‑report workflow, key monitoring metrics, black‑box and white‑box monitoring approaches, common issues, and practical solutions for ensuring large‑scale cluster consistency.

Operations · Puppet · Troubleshooting

0 likes · 8 min read

High Availability Architecture

Dec 13, 2018 · Operations

Microservice Architecture Visualization: Practices and Benefits at Alibaba

The article explains why visualizing microservice architectures is essential for high availability, describes common and advanced visualization methods, discusses how to make visualization effective, handle architectural changes, identify key components, and leverage visual data for operations and reliability improvements.

Alibaba · Operations · architecture visualization

0 likes · 14 min read

Efficient Ops

Dec 11, 2018 · Operations

How Alibaba’s AI‑Powered Monitoring Tackles Complex Business Anomalies

In this talk, Alibaba senior tech expert Wang Zhaogang explains how intelligent monitoring, powered by machine‑learning algorithms and multi‑metric analysis, addresses the challenges of diverse business scenarios, enhances anomaly detection, improves root‑cause analysis, and shapes the future of smart operations.

Anomaly Detection · Machine Learning · Operations

0 likes · 23 min read

DevOps

Dec 9, 2018 · Cloud Native

Understanding Docker: What It Is, How Containers Differ, and Their Role in Modern Operations

The article explains Docker’s rapid rise, clarifies the distinction between Docker and containers, compares containers with virtual machines, and describes why Docker simplifies application deployment, while also noting a related DevOps live‑stream event and promotional details.

Docker · Operations · cloud-native

0 likes · 8 min read

Efficient Ops

Dec 9, 2018 · Operations

Recover Accidentally Deleted Linux Files with extundelete – Step‑by‑Step Guide

This guide walks you through preparing a Linux disk, safely protecting the partition, installing extundelete, and using it to restore both files and directories that were mistakenly removed, providing practical commands and screenshots for each stage.

Linux · Operations · extundelete

0 likes · 6 min read

Programmer DD

Dec 9, 2018 · Operations

What Can Nginx Do Without Third‑Party Modules? A Practical Guide

This article details the core capabilities of Nginx without third‑party modules, including reverse proxy, various load‑balancing strategies, static and dynamic HTTP serving, forward proxy setup, and hot‑reload commands, providing clear configuration examples for each feature.

HTTP server · Nginx · Operations

0 likes · 10 min read

Python Crawling & Data Mining

Dec 7, 2018 · Operations

Step‑by‑Step Guide to Configuring Cluster VM Network Settings on Linux

This guide walks through configuring static IP, netmask, gateway, and DNS for each node in a virtual‑machine cluster on Linux, showing exact file edits for the master and slave machines, testing connectivity with ping commands, and confirming the network setup is complete.

Linux · Network Configuration · Operations

0 likes · 3 min read

JD Tech

Dec 6, 2018 · Operations

Shortening Decision Chains: End-to-End Inventory Management and Intelligent Replenishment in JD's Supply Chain

JD's chief scientist Shen Zuo‑jun explains how shortening the decision chain with end‑to‑end algorithms and intelligent multi‑level replenishment dramatically improves inventory turnover, stock availability, and forecasting accuracy, showcasing a novel supply‑chain research direction that integrates AI, big data, and human expertise.

End-to-End · Machine Learning · Operations

0 likes · 9 min read

Architect's Tech Stack

Dec 5, 2018 · Operations

Practical Fault‑Tolerance Practices in a Large‑Scale Activity Operations Platform

The article shares a comprehensive, experience‑driven guide on building fault‑tolerant systems—covering retry mechanisms, dynamic node removal, timeout settings, service degradation, decoupling, and business‑level safeguards—to enable a platform that scales from millions to billions of daily requests without relying on manual fire‑fighting.

Operations · System Design · fault tolerance

0 likes · 21 min read

MaGe Linux Operations

Dec 4, 2018 · Operations

Essential Linux Skills Every Beginner Must Master

This guide outlines why Linux dominates the internet, recommends starting with CentOS or RHEL, suggests effective learning resources, and lists the core knowledge, tools, and advanced topics every aspiring Linux operations engineer should master.

Beginner Guide · Linux · Operations

0 likes · 6 min read

Alibaba Cloud Infrastructure

Nov 28, 2018 · Operations

Lingjing System: Alibaba's Integrated Hardware‑Software Performance Diagnosis Platform

The Lingjing system, built by Alibaba Infrastructure, provides an end‑to‑end hardware‑software performance diagnosis platform that collects fine‑grained metrics, visualizes data, automatically detects anomalies, and helps optimize resource utilization across complex data‑center stacks.

Alibaba · Operations · diagnostics

0 likes · 8 min read

JD Tech

Nov 28, 2018 · Operations

Technical Systems Behind JD Logistics for the 11.11 Global Shopping Festival

The article details how JD Logistics’ extensive warehouse, routing, distribution, and fulfillment systems—leveraging big data, AI, GIS, IoT, and distributed architectures—were engineered and optimized to handle the massive order surge during the 11.11 Global Shopping Festival with high throughput, low latency, and zero incidents.

AI · Big Data · GIS

0 likes · 8 min read

Efficient Ops

Nov 27, 2018 · Operations

How Alibaba Automates Server Fault Detection and Self‑Healing at Scale

Alibaba’s massive data‑center operations face growing hardware failures, so they built the DAM (Dammo) platform that integrates Tianji management, predictive fault detection, automated remediation, and self‑balancing cluster reconstruction, achieving near‑complete hardware issue coverage and reducing manual intervention across hundreds of thousands of servers.

AIOps · Cloud Computing · Operations

0 likes · 17 min read

Efficient Ops

Nov 25, 2018 · Operations

Top 13 Essential Linux Tools for System Monitoring and Security

This article introduces thirteen practical Linux operation tools—including Nethogs, IOZone, IOTop, IPtraf, IFTop, Fail2ban, and more—providing concise descriptions, download links, and step‑by‑step installation commands to help system administrators monitor performance, network traffic, and protect against attacks.

Command-line Tools · Linux · Operations

0 likes · 11 min read

Architects Research Society

Nov 25, 2018 · Operations

eBay Scalability Best Practices: Functional Partitioning, Horizontal Sharding, Asynchronous Decoupling, and More

The article outlines eBay's key scalability best practices—including functional partitioning, horizontal sharding, avoiding distributed transactions, aggressive asynchronous decoupling, moving work to async pipelines, pervasive virtualization, and intelligent caching—to achieve linear or better resource usage as load grows.

Caching · Operations · Sharding

0 likes · 14 min read

360 Tech Engineering

Nov 22, 2018 · Artificial Intelligence

AIOps Practices at 360: Cost Reduction, Efficiency Gains, and Intelligent Operations

This article presents 360's AIOps project, detailing how AI-driven capacity forecasting, host classification, resource recycling, intelligent MySQL scheduling, anomaly detection, alarm convergence, and root‑cause analysis have saved millions, improved efficiency, and paved the way for a fully automated operations workflow.

AIOps · Anomaly Detection · Capacity Forecasting

0 likes · 14 min read

AntTech

Nov 21, 2018 · Operations

Building a High‑Availability Wireless Test Cluster for Mobile Apps at Ant Financial

The article details Ant Financial's development of a highly available wireless test cluster that supports automated testing for its massive mobile app ecosystem, describing its architecture, data‑driven monitoring, full integration, and the All‑in‑One solution that enables rapid, cost‑effective iteration across dozens of services and IoT scenarios.

Device Farm · Operations · automated testing

0 likes · 9 min read

Didi Tech

Nov 20, 2018 · Operations

Didi's Message Queue Architecture, Migration Strategies, and RocketMQ Operational Practices

At Didi, the team replaced a chaotic mix of Kafka, Redis, and other queues with a custom, RocketMQ‑based service, using dual‑write and dual‑read migration, extensive performance testing, custom failover, batch extensions, and operational tweaks to achieve stable high‑throughput, low‑latency messaging at massive scale.

Message Queue · Operations · RocketMQ

0 likes · 17 min read

Alibaba Cloud Developer

Nov 19, 2018 · Operations

How Alibaba Automates Hardware Fault Detection and Self‑Healing at Scale

This article explains how Alibaba’s massive data‑center operations detect hardware failures early, automatically isolate faulty servers, and execute self‑healing workflows through a centralized, cloud‑native platform, detailing detection methods, convergence rules, architecture evolution, and the benefits of a closed‑loop AIOps system.

AIOps · Operations · cloud-native

0 likes · 15 min read

Python Crawling & Data Mining

Nov 18, 2018 · Operations

Step-by-Step Guide to Expanding a VMware Virtual Machine Disk

This tutorial walks you through the complete process of safely expanding a VMware virtual machine's disk, from powering off the VM and logging into vSphere Client to adding a new hard disk, configuring its size, and confirming the successful expansion with screenshots.

Disk Expansion · Operations · VMware

0 likes · 4 min read

HomeTech

Nov 16, 2018 · Operations

Open-Sourcing Windows Agent for Open-Falcon Monitoring

The article announces the open-source release of the Windows Agent component under the Apache license, its integration into the Open-Falcon community, future feature enhancements, and gratitude to contributors, while providing links to the source code and related documentation.

Apache License · Operations · Windows Agent

0 likes · 5 min read

Efficient Ops

Nov 14, 2018 · Operations

How Zabbix Tackles FinTech Monitoring Challenges in the VUCA Era

This article explores how the VUCA-driven volatility of modern FinTech demands robust, multi‑layered monitoring solutions and explains why Zabbix, with its open‑source flexibility, automated discovery, and deep integration capabilities, is a compelling choice for achieving resilient, automated operations.

Automation · FinTech · Operations

0 likes · 19 min read

Alibaba Cloud Developer

Nov 14, 2018 · Operations

How Alibaba Evolved Double 11 Capacity Planning in Five Key Stages

This article chronicles Alibaba's decade‑long journey of capacity planning for Double 11, detailing five evolutionary phases—from manual estimates to full‑link testing ecosystems—while balancing cost, stability, and efficiency in massive distributed systems.

Alibaba · Double 11 · Operations

0 likes · 12 min read

DevOps

Nov 13, 2018 · Operations

Reflections on DevOps Organizational Transformation: Lessons from Development‑Operations Integration, Product Teams, and IT Ops Decentralization

The article shares practical reflections on a two‑year DevOps transformation, examining the integration of development and operations, the shift to product‑oriented teams, and the decentralization of the IT operations department, while highlighting emerging challenges and key lessons for supporting global business.

IT ops · Operations · Product Management

0 likes · 11 min read

58 Tech

Nov 12, 2018 · Operations

Key Takeaways from the 58 Group Technical Salon on Monitoring Platforms

The article summarizes the 58 Group technical salon where experts from Momo and 58 shared practical experiences on monitoring platform architectures, coverage, alarm configurations, convergence techniques, custom dimensions, multi‑view dashboards, and future directions for intelligent and automated monitoring across the company.

Alerting · Observability · Operations

0 likes · 9 min read

Alibaba Cloud Infrastructure

Nov 11, 2018 · Operations

A Decade of Double 11: Technical Evolution and Operational Lessons from Alibaba

Over ten years of Alibaba's Double 11, the company transformed a modest marketing event into a global e‑commerce platform by continuously improving backend architecture, scaling strategies, full‑link stress testing, multi‑active data centers, cloud migration, and real‑time incident response, offering valuable operational insights.

Alibaba · Operations · backend

0 likes · 15 min read

MaGe Linux Operations

Nov 9, 2018 · Information Security

Essential Linux Security Practices Every Ops Engineer Should Know

This article outlines comprehensive Linux security measures—including account hardening, remote access protection, file system safeguards, rootkit detection tools, and step‑by‑step post‑attack response—to help system administrators strengthen server defenses and quickly recover from compromises.

Linux · Operations · Rootkit

0 likes · 23 min read

Zhongtong Tech

Nov 9, 2018 · Operations

How ZTO Technology Scales Logistics Systems for Double 11: From Smart Sorting to Private Cloud

Marking the 10th anniversary of Double 11, ZTO Technology details how it tackles massive traffic spikes with an automatic sorting management platform, a high‑availability IDC and private cloud, smart voice and face‑recognition services, real‑time data dashboards, and extensive performance testing to ensure stable, fast, and accurate order fulfillment.

Cloud Computing · Operations · data dashboard

0 likes · 6 min read

Alibaba Cloud Developer

Nov 6, 2018 · Operations

How Alibaba Scaled Double 11: Lessons from a Decade of E‑commerce Mega‑Events

From its humble 2009 launch to the 2018 tenth anniversary, Alibaba’s Double 11 shopping festival evolved through relentless technical challenges—system crashes, CDN bottlenecks, over‑selling bugs, and massive load‑testing innovations—offering a decade‑long case study in operations, scalability, and resilience for large‑scale e‑commerce platforms.

Operations · e-commerce · load testing

0 likes · 16 min read

Alibaba Cloud Developer

Nov 5, 2018 · Operations

How Alibaba Conquered Double 11: A Decade of Scaling, Crises, and Lessons

From the humble 2009 launch of Double 11 to the massive, cloud-native, multi-region architecture of 2018, Alibaba’s engineers chronicle yearly technical hurdles—traffic spikes, system crashes, CDN limits, over-selling, and the evolution of stress-testing, capacity planning, and operational safeguards that turned the shopping festival into a global engineering showcase.

Cloud Computing · Operations · e-commerce

0 likes · 17 min read

MaGe Linux Operations

Nov 2, 2018 · Information Security

Detecting and Recovering Linux Server Intrusions: Essential Commands

This guide walks Linux administrators through common signs of server compromise, shows how to examine logs, user files, active processes, and network traffic, and demonstrates how to recover deleted log files using lsof and /proc, all with concrete command examples.

Intrusion Detection · Linux · Operations

0 likes · 7 min read

Tencent Cloud Developer

Nov 2, 2018 · Operations

Mastering Elasticsearch: Practical Tuning Strategies for Performance and Cost

This article shares a detailed, experience‑driven guide on Elasticsearch tuning, covering data model fundamentals, storage cost reductions, cluster stability tricks, performance‑boosting settings, and custom kernel improvements, all illustrated with real‑world diagrams and Q&A insights.

Cluster stability · Operations · Performance

0 likes · 15 min read

Tencent Cloud Developer

Nov 1, 2018 · Databases

Experience and Optimization of MongoDB for Mini‑Game Operations and Cloud Integration

Li Xiaohui shares Tencent Cloud MongoDB’s real‑world mini‑game operations, detailing schema‑free design, sharding, thread‑per‑connection tuning, snapshot‑based read fixes, and table‑level rollback, then demonstrates a one‑click cloud stack that provisions MongoDB, serverless functions, storage, monitoring and security for mini‑program developers.

Cloud Services · Game Development · MongoDB

0 likes · 12 min read

Alibaba Cloud Infrastructure

Nov 1, 2018 · Operations

Accurate Real-Time Server Downtime Detection and False‑Positive Reduction

The article explains how to achieve precise, real‑time detection of physical server outages, reduce false alarms through heartbeat monitoring, network and special‑case interference filtering, and detailed analysis, ultimately improving detection accuracy and coverage for reliable operations.

Operations · Reliability · Server monitoring

0 likes · 7 min read

Efficient Ops

Oct 31, 2018 · Operations

How to Build an Automated Operations System for Game Companies

This article examines why automated operations are essential for growing game businesses, outlines the goals of a complete, simple, efficient, and secure system, and details the architecture and individual subsystems—including installation, platform, security, client updates, backup, and monitoring—that together form a robust DevOps solution.

Automation · Operations · System Design

0 likes · 19 min read

Programmer DD

Oct 31, 2018 · Operations

Prevent Service Failures: Question Third Parties, Guard Users, Perfect Your Code

This article shares practical strategies for avoiding system failures by doubting third‑party services, protecting against misuse by consumers, and strengthening internal design through solid API practices, resource limits, and disciplined coding principles.

API design · Operations · Resource Management

0 likes · 16 min read

Architects' Tech Alliance

Oct 30, 2018 · Operations

IO Performance Evaluation, Monitoring Metrics, Tools, and Optimization Strategies

This article explains how to assess and model system I/O capabilities, presents common disk and network I/O benchmarking tools, outlines key performance metrics and monitoring utilities, and offers detailed optimization approaches for storage, network, and low‑latency transaction scenarios.

IO performance · Network · Operations

0 likes · 16 min read

Efficient Ops

Oct 29, 2018 · Operations

How Youzan Manages Online Incidents: A Step‑by‑Step Guide

This article outlines Youzan's end‑to‑end online incident management process—from fault detection and coordination through root‑cause analysis, recovery, review, and actionable JIRA tracking—highlighting practical workflows, data analysis, and continuous improvement practices for reliable service delivery.

Incident Management · JIRA workflow · Operations

0 likes · 10 min read

JD Tech

Oct 29, 2018 · Operations

SGM Service Governance Monitoring Platform: Design, Features, and Use Cases

The article introduces SGM, a comprehensive service governance and monitoring solution that addresses scaling, dependency complexity, and operational challenges by providing automated topology, real‑time tracing, capacity planning, root‑cause analysis, and extensive monitoring features such as performance metrics, JVM stats, call‑chain visualization, business dashboards, and intelligent alerting.

Alerting · Operations · Performance

0 likes · 13 min read

Architects' Tech Alliance

Oct 24, 2018 · Operations

Data Center Facility Construction Standards and Classification Guidelines

This article outlines the scope, terminology, classification levels, site selection principles, equipment layout, and subsystem requirements—including lighting, grounding, lightning protection, HVAC, monitoring, and cabling—for building and operating data center facilities in accordance with industry standards.

Operations · classification · construction standards

0 likes · 9 min read

UC Tech Team

Oct 23, 2018 · Operations

Understanding Faults and Fault Isolation Strategies in Distributed Systems

The article explains what constitutes a fault, introduces key metrics such as RPO and RTO, and describes various fault isolation principles, patterns, and practical examples—including dependency degradation, failover, dynamic adjustment, fast‑fail, caching, rate limiting, and resource isolation—to improve system reliability.

Operations · RPO · RTO

0 likes · 14 min read

Java Backend Technology

Oct 23, 2018 · Operations

Mastering Load Balancing: 5 Core Strategies and How to Choose the Right One

This article explains what load balancing is, compares it to navigation routing, and details five common strategies—round‑robin, weighted round‑robin, least connections, fastest response, and hash‑based—along with their pros, use‑cases, and health‑check mechanisms for achieving high availability.

Operations · algorithm · backend

0 likes · 8 min read

Alibaba Cloud Developer

Oct 23, 2018 · Operations

Unlocking Resource Efficiency: Alibaba’s Mixed‑Deployment (Co‑location) Strategy

This article explains how Alibaba’s mixed‑deployment (co‑location) technology combines online transaction services and offline compute workloads on shared physical servers, detailing its architecture, scheduling mechanisms, resource‑concession strategies, achieved performance gains, and future directions for large‑scale e‑commerce infrastructure.

Alibaba · Co-location · Operations

0 likes · 23 min read

Efficient Ops

Oct 22, 2018 · Operations

How Ops Teams Can Find Happiness and Deliver Real Business Value

The article explores why many operations engineers feel unhappy, identifies achievement and compensation as key to happiness, explains the internal and external value of ops work, and outlines how a dedicated ops team can improve product speed, stability, cost efficiency, and overall business outcomes.

Operations · business efficiency · career satisfaction

0 likes · 6 min read

Alibaba Cloud Infrastructure

Oct 22, 2018 · Operations

Server Downtime Diagnosis System: Architecture, Implementation, and Results

The article explains why a downtime diagnosis system is needed, outlines its architecture and implementation methods—including log sources, feature extraction, and API integration—and presents early results showing high automation coverage and significant operational cost savings.

Automation · Operations · diagnosis

0 likes · 7 min read

vivo Internet Technology

Oct 22, 2018 · Operations

Jenkins Area Meetup 2018 Shenzhen: DevOps Practices and CI/CD Solutions

The Jenkins Area Meetup 2018 in Shenzhen, co‑hosted by the DevOps Times Community (DevOps时代社区) and vivo Mobile Internet, gathered experts who presented on hybrid‑cloud DevOps, large‑scale CI/CD with Jenkins at Tencent, DevOps‑based R&D and operations standards, and an automated CMDB‑driven operations platform. The event drew strong community engagement, and the presentation materials are available.

CI/CD · Cloud · Jenkins

0 likes · 3 min read

dbaplus Community

Oct 21, 2018 · Artificial Intelligence

How Weibo’s Hubble Platform Uses AI for Real‑Time Monitoring and Trend Forecasting

The article details Weibo Advertising's Hubble monitoring system, describing its three‑layer architecture, metric taxonomy, AI‑driven trend prediction with LSTM models, dynamic alert thresholds, and performance testing using GoReplay, illustrating how large‑scale data and machine learning enable proactive operations.

AI · LSTM · Operations

0 likes · 22 min read

Test Development Learning Exchange

Oct 19, 2018 · Operations

Introduction to Fabric: Python Library for Remote SSH Task Automation

Fabric is a Python library that lets you write task scripts and execute them over SSH on multiple hosts, providing a powerful way to automate deployments, system administration, and other large‑scale remote operations.

Automation · Operations · Python

0 likes · 8 min read

Architecture Talk

Oct 15, 2018 · Operations

Master Nginx Rate Limiting: Request & Connection Control with Practical Configs

This article explains how to use Nginx’s built‑in limit_req and limit_conn modules to implement request‑rate and connection‑based throttling, covering configuration directives, execution flow, burst handling, delay modes, whitelist setup with geo and map modules, and practical examples for IP and domain limits.

Nginx · Operations · Web Server

0 likes · 9 min read

Efficient Ops

Oct 10, 2018 · Operations

How Alibaba’s Mixed‑Deployment Cuts Costs and Boosts Resource Utilization

This article explains Alibaba's mixed‑deployment (co‑location) technique, detailing its motivation, architecture, resource‑sharing mechanisms, scheduling strategies, performance results, and future directions for scaling and refining resource utilization across online and offline workloads.

Alibaba · Co-location · Operations

0 likes · 22 min read

ITFLY8 Architecture Home

Oct 10, 2018 · Operations

How to Build a Highly Available Redis Service with Sentinel and Virtual IP

This article explains why Redis is a popular in‑memory key‑value store, defines high availability, enumerates failure scenarios, and walks through four incremental architectures—single instance, master‑slave with one Sentinel, dual Sentinel, and three‑Sentinel with VIP—to achieve a robust, production‑grade Redis deployment.

Operations · Redis · Sentinel

0 likes · 12 min read

Java Captain

Oct 10, 2018 · Operations

Linux Command Cheatsheet and Java Diagnostic Tools for System Operations

This article compiles essential Linux commands and a suite of Java diagnostic utilities—including tail, grep, awk, find, tsar, btrace, Greys, JProfiler, and others—providing concise examples and code snippets to help engineers troubleshoot and monitor production systems efficiently.

Java · Linux · Operations

0 likes · 13 min read

Efficient Ops

Oct 9, 2018 · Operations

How Tencent Scales Automated Operations for Massive Services

Tencent’s architecture platform team explains how they monitor, automate, and secure billions of daily operations across storage, CDN, and live services, using multi‑dimensional metrics, real‑time and instant computation, AI‑driven anomaly detection, and a custom control platform for safe changes.

AIOps · Automation · Operations

0 likes · 23 min read

MaGe Linux Operations

Oct 5, 2018 · Operations

How to Auto‑Copy USB Files with Python When the Drive Is Inserted

This guide explains how to silently monitor USB insertions on a computer, automatically copy the drive’s contents to a local folder or server using a Python script that periodically scans the /Volumes directory and triggers file transfer without the user noticing.

Automation · Operations · Scripting

0 likes · 3 min read

MaGe Linux Operations

Oct 4, 2018 · Operations

Why DevOps Is More Than Chef or Puppet: Principles and Full‑Stack Automation Explained

This article clarifies that DevOps extends beyond tools like Chef or Puppet, emphasizing people, processes, and culture, and outlines the comprehensive toolchain and steps needed for full‑stack automation in modern cloud‑native environments.

Automation · CI/CD · Operations

0 likes · 10 min read

Architects' Tech Alliance

Sep 30, 2018 · Industry Insights

What Every Data Center Engineer Must Know About Rack Cabinet Standards and Design

This article provides a comprehensive overview of data‑center rack cabinets, covering size specifications, power and cooling requirements, key industry standards such as IEC 60297‑1 and EIA‑310‑D, structural components, environmental considerations, load capacity, and practical design guidelines for safe and efficient deployment.

Data Center · Operations · Rack Cabinet

0 likes · 10 min read

Youzan Coder

Sep 28, 2018 · Industry Insights

How Youzan Scaled Development with Containerization: Challenges and Solutions

This article examines Youzan's journey to containerize its development and testing environments using Kubernetes and Docker, detailing the motivations, architectural decisions, network and isolation challenges, image integration, logging, load balancing, debugging, and the ongoing rollout to standard production environments.

Docker · Environment provisioning · Kubernetes

0 likes · 12 min read

Alibaba Cloud Infrastructure

Sep 28, 2018 · Operations

8 Practical Tips for Operations Teams to Manage the Golden Week Holiday

This article offers eight practical operations‑team strategies—inspection, monitoring alerts, capacity planning, network restrictions, risk pre‑plans, data backup, on‑call mechanisms, and staying connected—to ensure system stability and enjoy the Golden Week holiday without incidents.

Holiday · On-Call · Operations

0 likes · 4 min read

Java Backend Technology

Sep 28, 2018 · Operations

Why Your Microservices Need a Distributed Configuration Center (and How to Build One)

This article explains the shortcomings of traditional configuration files, describes why distributed configuration centers are essential for dynamic, multi‑environment microservice deployments, outlines their evolution, presents a simple design with caching and consistency improvements, and reviews popular open‑source solutions.

Operations · configuration management · microservices

0 likes · 11 min read

Efficient Ops

Sep 27, 2018 · Operations

Tencent Billing’s Secret to Managing Massive Promo Spikes

Tencent’s billing platform powers billions of daily transactions across 180+ countries, supporting both consumer and business payments, and employs sophisticated capacity testing, dynamic auto‑scaling, resource sharing, and change‑control mechanisms to ensure reliable large‑scale promotional events without service disruptions.

Auto Scaling · Operations · Tencent Billing

0 likes · 15 min read

JD Tech

Sep 27, 2018 · Operations

Overview of JD Invoice System Architecture and Business Processes

The article provides a comprehensive overview of JD's invoice system, detailing its business lines, core modules, data sources, invoicing workflows—including forward and reverse invoicing—and the system's role in automating tax management and reducing operational risk.

JD · Operations · business process

0 likes · 9 min read

Architects' Tech Alliance

Sep 26, 2018 · Operations

How Goldeneye Enables Adaptive, Intelligent Business Monitoring at Scale

Goldeneye, Alimama's monitoring platform, uses big‑data pipelines, dynamic threshold prediction, mean‑shift change‑point detection, and automated metric discovery to replace manual alarm settings, reduce false alerts, and provide intelligent, scalable business monitoring across hundreds of services.

Big Data · Operations · business monitoring

0 likes · 19 min read

Efficient Ops

Sep 24, 2018 · Operations

How Checklist Thinking Fuels Ops Professionals' Lifelong Growth

This talk explores how ops engineers can achieve continuous professional development by adopting checklist thinking, covering growth drivers, error classification, practical checklist applications, cognitive models, and design principles that turn complex incidents into systematic, repeatable processes.

Growth · Incident Management · Operations

0 likes · 34 min read

MaGe Linux Operations

Sep 20, 2018 · Operations

Essential Linux Command Cheat Sheet for System Administrators

A comprehensive visual guide lists essential Linux commands covering file handling, network operations, user management, system monitoring, and more, helping administrators quickly reference and master command-line tasks.

Linux · Operations · system-administration

0 likes · 1 min read

UCloud Tech

Sep 20, 2018 · Operations

Why CPU Monitoring Shows 0% or 100% Spikes and How Hot Patches Fixed It

The article investigates intermittent CPU usage spikes on Linux servers caused by a kernel cputime bug, explains the root‑cause analysis, describes a cold patch applied to newer kernels, and details a hot‑patch solution that safely resolves the issue across thousands of production machines.

CPU Monitoring · Linux · Operations

0 likes · 9 min read

Efficient Ops

Sep 18, 2018 · Operations

Mastering Internet Operations: Roles, Responsibilities, and Evolution

This article provides a comprehensive overview of internet operations, detailing how service‑centric stability, security, and efficiency are achieved through infrastructure management, monitoring, risk mitigation, and continuous optimization, while outlining the various operational roles, their duties, and the evolution of ops practices.

Operations · devops · infrastructure

0 likes · 21 min read

Efficient Ops

Sep 17, 2018 · Operations

How Alibaba Scales Monitoring: From CMDB to AI‑Driven Full‑Link Observability

Alibaba’s monitoring evolution—from fragmented early tools to the standardized Sunfire platform and now AI‑powered full‑link observability—addresses scaling challenges, introduces business‑centric metrics, automated traceability, and intelligent anomaly detection, illustrating how massive, multi‑tenant infrastructures achieve unified, proactive operations at scale.

AIOps · Alibaba · Observability

0 likes · 19 min read

DevOps

Sep 17, 2018 · Operations

Key Insights from the 2018 Global DevOps State of the World Report

The 2018 Global DevOps State of the World Report, compiled by DORA with contributions from leading experts, presents extensive data from over 30,000 professionals, highlights new trends such as accelerated practices, cloud infrastructure, elite high‑performance organizations, and offers a live online session to help practitioners quickly grasp its valuable findings.

Operations · Performance · Report

0 likes · 6 min read

Youzan Coder

Sep 15, 2018 · Big Data

How Data Empowers Operations: Insights from Youzan & NetEase’s Big Data Summit

On September 15, Youzan’s big-data team and NetEase YouShu hosted a technical sharing titled “The Road to Data-Driven Operations,” where speakers explored the evolution of Youzan’s data warehouse metadata system, the architecture of its big-data development platform, and the application of functional programming in visual data analysis, highlighting current trends and future directions.

Data Visualization · Data Warehouse · Functional Programming

0 likes · 4 min read

JD Tech

Sep 14, 2018 · Operations

Joint‑Venture Settlement Platform Overview and Billing Architecture

This document presents a comprehensive solution for merchant settlement in joint‑venture (co‑operated) offline stores, describing business models, settlement subject abstraction, billing engine components, settlement workflow, payment collection, and reconciliation architecture with detailed tables and diagrams.

Financial · Operations · billing

0 likes · 18 min read

Mike Chen's Internet Architecture

Sep 13, 2018 · Operations

Common Open‑Source Monitoring Systems and Zabbix Monitoring Process

The article introduces common open‑source monitoring tools such as Zabbix and Nagios, explains why distributed systems need proactive health checks, compares features, and provides a detailed Zabbix monitoring workflow including data collection, storage, visualization, alerting, and specific metrics for servers, networks, JVM and MySQL.

Operations · Zabbix · distributed systems

0 likes · 8 min read
