Tagged articles

Operations

3329 articles · Page 30 of 34

Jun 14, 2017 · Operations

Scaling Alibaba's Operations: Inside StarAgent, Qingteng & Normandy

This article details Alibaba's evolution of its operations platform, describing the design, features, and performance of StarAgent, the Qingteng P2P file distribution system, and the Normandy application‑deployment platform, highlighting how these tools enable high‑availability, automation, and massive scalability across global data centers.

Alibaba · Automation · Operations

0 likes · 13 min read

Ctrip Technology

Jun 13, 2017 · Operations

Evolution and Architecture of Ctrip's System: Operations, Frameworks, and Big Data

This article presents a comprehensive overview of Ctrip's evolving system architecture, detailing its operational strategies, framework components such as SOA and release systems, and the large‑scale UserProfile big‑data platform, illustrating how each iteration addressed prior challenges while introducing new capabilities.

Big Data · Ctrip · Operations

0 likes · 13 min read

360 Zhihui Cloud Developer

Jun 13, 2017 · Operations

Master Command-Line Deployments with PSSH & PSCP: A Step-by-Step Guide

This tutorial walks you through using PSSH and PSCP for parallel command‑line deployments, establishing SSH trust, packaging and uploading files, handling branch selection, environment targeting, and safe rollback, all illustrated with real‑world examples and code snippets.

Operations · Rollback · command-line

0 likes · 7 min read

MaGe Linux Operations

Jun 11, 2017 · Operations

Essential Linux Ops Tools Every Sysadmin Should Master

This guide outlines the ten core toolsets—ranging from Linux basics and network services to scripting, firewalls, monitoring, clustering, and backup—that aspiring Linux operations engineers need to master for effective system administration.

Linux · Operations · Scripting

0 likes · 7 min read

Efficient Ops

Jun 11, 2017 · Operations

How Bilibili Scaled Its Ops: From DIY Deployments to Prometheus Monitoring

Bilibili’s ops team traces its path from early manual deployments to a multi-layered monitoring stack (ELK, Zabbix, StatsD, Grafana, and Prometheus), sharing the challenges and lessons learned in building scalable, automated infrastructure for massive internet traffic.

ELK · Grafana · Operations

0 likes · 8 min read

ITPUB

Jun 9, 2017 · Operations

Mastering Effective Monitoring: From Basics to the USE Method

This article explains the fundamentals of monitoring, contrasts the traditional ops and SRE perspectives, defines monitoring objects and metrics, introduces quantitative thinking with SLI/SLO, and presents the USE method with a MySQL example to help engineers detect and prevent failures efficiently.

Operations · SLI · SLO

0 likes · 10 min read

MaGe Linux Operations

Jun 6, 2017 · Operations

Step-by-Step Guide to Deploying Ceph Block Storage on CentOS 7

This article walks through the hardware and software preparation, configuration, deployment, testing, and troubleshooting steps required to set up a Ceph distributed block storage cluster on CentOS 7, including node roles, RAID setups, and ceph‑deploy commands.

CentOS 7 · Ceph · Distributed storage

0 likes · 7 min read

Baidu Waimai Technology Team

Jun 1, 2017 · Operations

Design and Implementation of Nginx Overload Protection Using Lua

This article describes the background, design concepts, selection rationale, implementation principles, algorithm details, and configuration of a Lua‑based overload protection module for Nginx that monitors system load and dynamically rejects traffic to safeguard backend services.

Lua · Nginx · Operations

0 likes · 6 min read

Alibaba Cloud Developer

Jun 1, 2017 · Operations

How Alibaba Engineers Capacity Planning and Full‑Link Load Testing for Massive Sales Events

This article explains Alibaba's four‑step capacity‑planning methodology, the various single‑machine load‑testing techniques, the design of a full‑link load‑testing platform for Double‑11, and the dynamic flow‑control framework that together ensure system stability during extreme traffic spikes.

Alibaba · Operations · capacity planning

0 likes · 18 min read

ITPUB

May 29, 2017 · Operations

Why df and du Show Different Disk Usage on Linux and How to Fix It

This article explains why the Linux commands df and du often report different disk usage figures, detailing three main causes—reserved space, phantom (deleted) files, and data present before mounting—and provides concrete commands and steps to identify and resolve each discrepancy.
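The comparison behind the three causes can also be made from a script. The following is a minimal sketch, not taken from the article: the helper names `filesystem_view` and `walked_bytes` are hypothetical, and it assumes a Unix-like system where `st_blocks` is counted in 512-byte units. The first function reads the filesystem's own accounting (the figures df is based on), the second sums the blocks of everything reachable by walking the tree (roughly what du does).

```python
import os


def filesystem_view(path):
    """Return (used, reserved) bytes as the filesystem itself accounts them."""
    vfs = os.statvfs(path)
    used = (vfs.f_blocks - vfs.f_bfree) * vfs.f_frsize
    # Blocks that are free but not available to unprivileged users.
    reserved = (vfs.f_bfree - vfs.f_bavail) * vfs.f_frsize
    return used, reserved


def walked_bytes(path):
    """Sum allocated blocks of everything reachable under path on one filesystem."""
    root = os.lstat(path)
    seen = {(root.st_dev, root.st_ino)}
    total = root.st_blocks * 512

    def account(full):
        nonlocal total
        try:
            st = os.lstat(full)
        except OSError:
            return False  # vanished or unreadable entry: skip it
        if st.st_dev != root.st_dev:
            return False  # different filesystem: do not count or descend
        if (st.st_dev, st.st_ino) not in seen:  # count hard links once
            seen.add((st.st_dev, st.st_ino))
            total += st.st_blocks * 512
        return True

    for dirpath, dirnames, filenames in os.walk(path):
        # Prune directories that live on another filesystem.
        dirnames[:] = [d for d in dirnames if account(os.path.join(dirpath, d))]
        for name in filenames:
            account(os.path.join(dirpath, name))
    return total
```

Called on a mount point with enough privileges to read the whole tree, a large positive `used - walked_bytes(path)` points toward deleted-but-still-open files or data hidden under another mount, while `reserved` shows how much space is held back from ordinary users.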

Filesystem · Operations · df

0 likes · 4 min read

360 Zhihui Cloud Developer

May 25, 2017 · Operations

How HULK Turned Manual Ops into a Productized Cloud Platform

This article recounts how the 2013 HULK private‑cloud team evolved their operations from manual, repetitive tasks to a fully productized, automated platform, detailing two major upgrades—tooling automation and product‑oriented services—while sharing practical insights on monitoring, alarm management, and user‑centric design.

Operations · cloud platform · productization

0 likes · 12 min read

MaGe Linux Operations

May 16, 2017 · Operations

How Distributed Clusters Achieve Load Balancing: Principles and Practices

This article explains the concepts of distributed clusters and load balancing, contrasting clusters and distributed systems with real‑world analogies, describing various load‑balancing techniques such as DNS, LVS, and reverse proxies, and offers practical guidance on designing simple, reliable, and efficient load‑balancing solutions for distributed back‑ends.

Operations · clusters · distributed systems

0 likes · 11 min read

DevOps

May 16, 2017 · Operations

DevOps Evolution: Key Takeaways from Patrick Debois’ DevOpsDays Austin Slides

The article presents a visual recap of Patrick Debois' DevOpsDays Austin presentation, illustrating the history, culture, practices, challenges, and future directions of DevOps through a series of themed paintings and captions that highlight automation, measurement, feedback loops, and the human side of the movement.

Automation · Culture · Operations

0 likes · 9 min read

ITPUB

May 15, 2017 · Operations

Mastering Online Incident Management: From Detection to Prevention

This article outlines a comprehensive methodology for handling large‑scale online service incidents, covering goals, the "jump‑fill‑avoid" framework, step‑by‑step processes for detection, diagnosis, remediation, and post‑mortem analysis, as well as essential monitoring, logging, and escalation infrastructure.

Incident Management · Operations · SRE

0 likes · 18 min read

Alibaba Cloud Developer

May 12, 2017 · Operations

How Alibaba Engineers Fault Governance and Chaos Engineering for E‑commerce

This article recounts Alibaba's middleware team's QCon Beijing 2017 presentation on fault governance and fault‑drill practices, covering distributed‑system dependency failures, strong/weak dependency concepts, multi‑stage technical evolution, and the design of their chaos‑engineering platform for large‑scale e‑commerce.

Alibaba · Operations · chaos engineering

0 likes · 21 min read

MaGe Linux Operations

May 12, 2017 · Operations

From ¥2k to ¥30k: The Ops Engineer Salary Ladder and Skill Roadmap

This article analyzes how operations engineers in Beijing progress from entry‑level salaries of a few thousand yuan to senior roles earning over thirty thousand, by examining job postings, required skills, and experience levels to map a clear career growth path.

Automation · Career Path · Operations

0 likes · 11 min read

DevOps

May 11, 2017 · Operations

Understanding DevOps: Principles, Differences from Traditional Models, Challenges, and Measurement

This article explains what DevOps is, contrasts it with traditional development‑operations workflows, discusses its benefits and drawbacks, outlines key challenges such as balancing efficiency with stability, responsibility allocation, and assessment, and presents four metrics for evaluating DevOps effectiveness.

Automation · Engineering Efficiency · Operations

0 likes · 9 min read

Tongcheng Travel Technology Center

May 11, 2017 · Operations

Design and Experience of a Near Real-Time Log System Based on Kafka and Elasticsearch

This article describes the architecture, deployment, configuration, maintenance, and performance results of a large‑scale near real‑time logging platform built with Kafka, Flume, and Elasticsearch, highlighting practical lessons and future plans for resource‑efficient operation.

Elasticsearch · Kafka · Operations

0 likes · 6 min read

ITFLY8 Architecture Home

May 10, 2017 · Operations

Mastering F5 Load Balancer: Quick Guide to Hardware Overview and Web Configuration

This article introduces the widely used F5 load‑balancing appliance, detailing its front‑panel indicators, network interfaces, status LEDs, and step‑by‑step procedures for initial web‑based configuration, including default IP, login credentials, and essential system settings such as hostname and root password policies.

F5 · Load Balancer · Operations

0 likes · 5 min read

DevOps

May 9, 2017 · Operations

A Clear and Concise DevOps Implementation Framework: 11 Core Service Capabilities

This article introduces a straightforward DevOps implementation framework that maps eleven essential service capabilities across the software development lifecycle, explains why adopting DevOps is a multi‑year journey, and uses a fitness analogy to illustrate how enterprises can progressively build these capabilities.

Continuous Integration · Operations · continuous delivery

0 likes · 4 min read

JD Retail Technology

May 9, 2017 · Backend Development

Node.js Deployment with Tomcat: Architecture Options and Step‑by‑Step Implementation

This article outlines the rationale for adopting Node.js in the 京友邦 project, compares two deployment architectures—separate Node and Tomcat services versus co‑locating them in a single Docker container—and provides detailed step‑by‑step instructions for packaging, scripting, Nginx configuration, and monitoring to achieve a successful rollout.

Deployment · Docker · Operations

0 likes · 8 min read

Architects Research Society

May 8, 2017 · Operations

Decentralized Procurement of IoT Solutions: Challenges and Best Practices

The article examines why IoT purchases are decentralized across IT, business, and operations units, identifies the true buyers, explains how IoT differs from traditional IT, and offers six practical best‑practice steps for vendors to successfully sell IoT solutions within complex enterprise environments.

Decentralized · IoT · Operations

0 likes · 9 min read

DevOps

May 7, 2017 · Operations

Understanding Agile, Continuous Integration, DevOps, and Continuous Delivery: Concepts, Relationships, and Practical Guidance

The article explains Agile software development, Continuous Integration, DevOps, and Continuous Delivery, examines their inter‑relationships from both technical and human perspectives, and offers practical steps, maturity models, and real‑world case insights for teams seeking faster, reliable software delivery.

Continuous Integration · Operations · continuous delivery

0 likes · 12 min read

Tencent Architect

May 5, 2017 · Operations

Automation and Operational Management for Large-Scale Infrastructure at the Architecture Platform

The article explains how the Architecture Platform team builds a comprehensive, automated operations system—including CMDB, cost budgeting, monitoring, permission management, self‑service tools, and mobile access—to safely and efficiently manage tens of thousands of servers and massive storage services.

Automation · CMDB · Cloud

0 likes · 14 min read

DevOps

May 4, 2017 · Operations

What Security Teams Can Learn from DevOps to Build a Secure Architecture

This article explains how security professionals can adopt DevOps practices—such as cross‑functional collaboration, continuous delivery, and visualized security status—to build a resilient security architecture that aligns with agile development and reduces risk through frequent, small releases.

Operations · continuous delivery · devops

0 likes · 7 min read

Efficient Ops

May 3, 2017 · Operations

How Tencent Scales NBA Live Streams to Millions: Behind the Tech and Operations

This article details Tencent's large‑scale live streaming architecture for NBA games, covering the rapid growth of live video, key technical features, network transmission challenges, multi‑angle production, CDN deployment, monitoring, big‑data processing, and strategies for ensuring low latency and high reliability for millions of concurrent viewers.

Big Data · CDN · Live Streaming

0 likes · 25 min read

Continuous Delivery 2.0

May 1, 2017 · Operations

Implementing Periodic Releases: Strategies, Challenges, and Automation in Software Development

The article describes how a development team transitioned to short‑cycle, periodic releases, outlining the goals, benefits, operational concerns, and a comprehensive set of improvements—including testing strategy, configuration and environment management, and automated deployment pipelines—to maintain quality while increasing release frequency.

Automation · Operations · Testing

0 likes · 14 min read

dbaplus Community

Apr 27, 2017 · Big Data

Why Kafka’s __consumer_offsets Topic Can Fill Your Disk and How to Fix It

The article explains Kafka’s default consumer offset storage mechanism, why the __consumer_offsets system topic can consume massive disk space due to frequent synchronous commits and misconfigured cleanup, and outlines practical steps to reduce offset data and enable proper log compaction.

Consumer offset · Offset Management · Operations

0 likes · 6 min read

Huawei Cloud Developer Alliance

Apr 27, 2017 · Operations

How Shared Thinking Is Reshaping Data Center Infrastructure and Business Models

The article examines how the shared‑economy mindset is transforming data‑center infrastructure—from modular designs and integrated solutions to new business models—driving lower costs, higher efficiency, and a shift from construction‑focused to operation‑focused competition across the entire ecosystem.

Data Center · Industry Analysis · Operations

0 likes · 12 min read

Efficient Ops

Apr 26, 2017 · Operations

Unlock Nginx: Reverse Proxy, Load Balancing & Static Serving Without Add‑ons

This article explains how Nginx can function as a reverse proxy, load balancer, HTTP server with static‑file handling, and forward proxy without relying on third‑party modules, providing configuration examples and discussing the built‑in load‑balancing strategies round‑robin, weight, and ip_hash alongside fair and url_hash, which are normally supplied by third‑party modules.
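To make two of the named strategies concrete, here is a small sketch, assuming nothing about the article's own examples. `Peer`, `pick_weighted`, and `pick_by_client` are hypothetical names. The weighted picker follows the "smooth" weighted round-robin scheme that Nginx is generally described as using; the client-hash picker only illustrates the idea behind ip_hash and uses SHA-256 rather than Nginx's actual hash function.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Peer:
    name: str
    weight: int = 1
    current: int = 0  # running score used by the smooth weighted scheme


def pick_weighted(peers):
    """Smooth weighted round-robin: spreads heavy peers instead of bursting them."""
    total = sum(p.weight for p in peers)
    for p in peers:
        p.current += p.weight
    best = max(peers, key=lambda p: p.current)  # first peer wins on ties
    best.current -= total
    return best.name


def pick_by_client(peers, client_ip):
    """Sticky selection: the same client address always maps to the same peer."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    return peers[int.from_bytes(digest[:4], "big") % len(peers)].name


if __name__ == "__main__":
    backends = [Peer("a", 5), Peer("b", 1), Peer("c", 1)]
    print([pick_weighted(backends) for _ in range(7)])
    print(pick_by_client(backends, "203.0.113.7"))
```

With weights 5, 1, and 1, the weighted picker interleaves the light peers among the heavy one's turns rather than sending five requests in a row to the same backend.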

HTTP server · Nginx · Operations

0 likes · 11 min read

MaGe Linux Operations

Apr 26, 2017 · Operations

Master Zabbix Log Monitoring: Key Configurations and Best Practices

This guide explains how to configure Zabbix log monitoring using log and logtr keys, details each parameter, describes the monitoring mechanics, and shows how to set up items and view results, enabling reliable detection of specific log patterns.

Log Monitoring · Operations · configuration

0 likes · 7 min read

DevOps

Apr 25, 2017 · Operations

Analyzing and Visualizing Docker Logs with the ELK Stack (Part Two)

This article explains how to analyze and visualize Docker container logs using the ELK stack, covering preparation, parsing tips, Kibana query techniques, and example visualizations to help monitor Dockerized environments effectively in production.

Docker · ELK · Kibana

0 likes · 7 min read

dbaplus Community

Apr 23, 2017 · Operations

From Legacy to Scalable: How TianpiaoChe Revamped Its Ops Architecture

Li Qiang, Operations Director at TianpiaoChe, shares the step‑by‑step transformation of a legacy e‑commerce infrastructure, covering network latency fixes, hardware re‑allocation, OS tuning, open‑source component upgrades, virtualization changes, and future plans, providing practical insights for large‑scale site operations.

Network · Operations · architecture

0 likes · 28 min read

MaGe Linux Operations

Apr 22, 2017 · Operations

Essential Ops Learning Roadmap: Master CentOS, Linux Services, and Monitoring Tools

This article outlines a practical operations learning path, comparing CentOS 6 and 7, recommending foundational skills across OS, web services, databases, load balancing, caching, NoSQL, storage, version control, monitoring, and scripting to help engineers stay current and effective.

CentOS · Operations · learning roadmap

0 likes · 4 min read

Meituan Technology Team

Apr 21, 2017 · Operations

Meituan-Dianping DevOps Automation Practices and Philosophy

The Meituan‑Dianping technical salon showcases its DevOps automation philosophy by presenting three core tools—DB automation platform, service tree, and Puppet web management—while also featuring Shanghai Zhaogang Network’s CMDB experience, illustrating how rapid O2O growth drives the need for fast, reliable, and scalable operational automation.

Automation · CMDB · Meituan-Dianping

0 likes · 5 min read

Baidu Waimai Technology Team

Apr 20, 2017 · Databases

Greenplum (GPDB) Architecture, Features, and Operational Tools Overview

This article explains Greenplum's MPP architecture, master‑segment design, high‑availability, interconnect network, rich management tools, parallel query planning, data loading techniques, and additional capabilities such as LDAP authentication and resource queues, demonstrating why it is a strong next‑generation big‑data query engine.

Big Data · Greenplum · MPP

0 likes · 15 min read

360 Zhihui Cloud Developer

Apr 20, 2017 · Backend Development

How We Upgraded to PHP 7: Challenges, Benchmarks, and Ops Insights

This article recounts the Huajiao team's migration from PHP 5 to PHP 7, detailing the motivations, upgrade hurdles, extensive performance benchmarks, gray‑release strategy, and operational steps that together delivered up to 30% CPU savings and significant cost reductions.

Benchmark · Operations · Performance

0 likes · 9 min read

MaGe Linux Operations

Apr 20, 2017 · Operations

How to Install and Configure pnp4nagios for Nagios Performance Graphs

This guide walks through installing pnp4nagios on CentOS 6.8, configuring required packages, compiling the software, testing the installation, understanding its bulk mode with npcd, and adjusting Nagios and pnp4nagios settings to enable dynamic performance graphs.

Operations · monitoring · nagios

0 likes · 9 min read

Qunar Tech Salon

Apr 19, 2017 · Backend Development

Rate Limiting Strategies for API Services: Design, Implementation, and Load Shedding

This article explains why availability and reliability are critical for web APIs, outlines four common rate‑limiting techniques used at Stripe, describes how to choose and implement request, concurrent, usage‑based, and worker‑utilization limiters, and provides practical guidance for safely deploying them in production.
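As a rough illustration of the request-rate limiter idea, the sketch below implements a token bucket, the technique named in this entry's tags. It is a minimal single-process example under stated assumptions, not the implementation the article describes: the `TokenBucket` class and its parameters are hypothetical, and a production limiter would typically keep its state in a shared store.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)  # start full so initial bursts pass
        self.clock = clock             # injectable for deterministic tests
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if the request may proceed."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, never exceeding the bucket size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller would typically answer with HTTP 429


if __name__ == "__main__":
    bucket = TokenBucket(rate=5, capacity=10)
    accepted = sum(bucket.allow() for _ in range(25))
    print(f"accepted {accepted} of 25 back-to-back requests")
```

Passing the clock in as a parameter keeps the refill arithmetic testable without sleeping; one bucket per user or API key would be kept in a dictionary keyed by that identifier.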

API · Operations · Token Bucket

0 likes · 11 min read

DevOps

Apr 18, 2017 · Operations

Understanding Site Reliability Engineering (SRE): Roles, Responsibilities, Skills, and Differences from DevOps

This article explains the concept of Site Reliability Engineering (SRE), its origins at Google, core responsibilities such as IT operations and availability improvement, required skill sets, how it differs from DevOps, and guidance on adopting SRE practices within organizations.

IT Service Management · Operations · Reliability

0 likes · 12 min read

Baidu Waimai Technology Team

Apr 18, 2017 · Industry Insights

Baidu Waimai’s Cloud Migration, AI Logistics, and Architecture – QCon 2017

At QCon Beijing 2017, three senior Baidu Waimai engineers detailed the company’s year‑long migration from IDC to cloud using custom operation platforms, described the AI‑driven, data‑rich logistics scheduling system that outperforms manual dispatch, and shared architectural evolutions that enabled rapid, zero‑downtime scaling of the fast‑growing delivery business.

AI logistics · Big Data · Cloud Migration

0 likes · 5 min read

Hulu Beijing

Apr 18, 2017 · Operations

How Hulu Scales Live Streaming: Challenges and Key Technologies

The article details Hulu's evolution from a simple web video service to a multi‑device platform, highlighting the scalability, micro‑service architecture, DASH streaming, and comprehensive quality monitoring that enable consistent live streaming experiences across diverse US devices.

DASH · Hulu · Live Streaming

0 likes · 6 min read

Efficient Ops

Apr 18, 2017 · Operations

Boost Mobile Game Performance: Ops, Download & Real‑Time Network Hacks

This article outlines a comprehensive solution for mobile game operations, covering the value of modern ops, user‑experience metrics across download, login, gameplay, payment and sentiment, download‑service optimizations such as domain and resource hijack protection, incremental updates, and real‑time battle network enhancements including access‑network, backbone and QoS techniques.

Download Optimization · Mobile Gaming · Operations

0 likes · 23 min read

Continuous Delivery 2.0

Apr 16, 2017 · Operations

Baidu's Traditional Application Operations and Branch Management Process

The article explains Baidu's traditional project branch management approach, the reasons behind mainline release queues, and summarizes the team's continuous delivery transformation, highlighting clear goals, transparent planning, self‑defined processes, story‑driven development, six‑step CI, and automated testing practices.

Baidu · Operations · branch management

0 likes · 6 min read

ITPUB

Apr 15, 2017 · Operations

How to Configure Nginx Load Balancing with Multiple Tomcat Instances on Windows

This step‑by‑step guide shows how to prepare two Tomcat servers, create a simple web project, configure Nginx as a reverse‑proxy load balancer with various strategies, start the services on Windows, and verify that requests are distributed across the Tomcat instances.

Nginx · Operations · Reverse Proxy

0 likes · 6 min read

21CTO

Apr 13, 2017 · Operations

Mastering Internet Performance Engineering and Capacity Planning

This article presents a comprehensive methodology for internet performance engineering, covering non‑functional quality goals, detailed metrics for application servers, databases, caches and message queues, a practical technical review outline, and a real‑world capacity‑planning case study with both maximal and minimal resource solutions.

Non-functional Requirements · Operations · backend-architecture

0 likes · 24 min read

Architecture Digest

Apr 13, 2017 · Operations

Methodology for Internet Architecture Technical Review and Capacity/Performance Evaluation

This article presents a comprehensive methodology for reviewing internet‑scale system architectures, focusing on non‑functional quality attributes such as performance, availability, scalability, security, and maintainability, and provides detailed guidelines, metrics tables, and a classic case study for capacity and performance planning.

Non-functional Requirements · Operations · Performance

0 likes · 27 min read

Efficient Ops

Apr 12, 2017 · Operations

Mastering Enterprise Monitoring: From Basics to Advanced Toolchains

This comprehensive guide explains why monitoring is vital for operations, outlines clear objectives and methods, compares popular open‑source and commercial tools, details a Zabbix‑based workflow, and covers hardware, system, application, network, security, API, performance, and business metrics with practical alerting strategies.

Alerting · Operations · Zabbix

0 likes · 21 min read

21CTO

Apr 10, 2017 · Operations

Alibaba’s Secret to Scaling GitLab: Distributed Sharding and Performance Boosts

This article details how Alibaba Group transformed its GitLab deployment from a single‑node bottleneck into a horizontally scalable, sharded architecture that handles millions of daily requests with high availability, improved performance, and robust data safety.

GitLab · Operations · Sharding

0 likes · 15 min read

ITPUB

Apr 4, 2017 · Operations

Real‑World Ops Pitfalls and Proven Ways to Avoid Them

This article compiles practical experiences from system administrators about common operational pitfalls, their root causes, and concrete mitigation steps, ranging from misconfigured HAProxy timeouts and risky rm commands to Ansible async quirks and cron‑job failures.

Ansible · Linux · Operations

0 likes · 8 min read

Efficient Ops

Mar 30, 2017 · Operations

Why Ops Engineers Are Always the Scapegoat—and How to Turn That Into Value

The article reflects on the challenges faced by operations engineers in small companies, illustrating why they often become scapegoats, and offers practical advice on learning, risk control, communication, and disaster‑recovery drills to increase their value and effectiveness.

Operations · learning · risk management

0 likes · 18 min read

dbaplus Community

Mar 29, 2017 · Operations

Why Does Server IO Spike at 3 AM? Diagnose RAID Battery and Self‑Test Issues

This guide explains why server IO utilization spikes above 60% during early‑morning hours, covering hardware self‑test, RAID battery failures, cache policy misconfigurations, and step‑by‑step commands for MegaRAID and HP servers, plus BIOS adjustments and best‑practice recommendations to prevent performance degradation.

Hardware · IO · MegaCli

0 likes · 16 min read

Alibaba Cloud Developer

Mar 29, 2017 · Operations

How Alibaba Built the ‘Nuclear Weapon’ Full‑Link Stress Test for Double 11

This article chronicles Alibaba's evolution of the full‑link pressure testing platform—from its 2013 inception tackling massive Double 11 traffic, through data construction, isolation, traffic generation, and platform upgrades—to a mature, automated, cloud‑native solution that safeguards large‑scale e‑commerce stability.

Alibaba · Operations · capacity planning

0 likes · 13 min read

Efficient Ops

Mar 28, 2017 · Operations

How We Scaled Server Authentication with OpenLDAP: A Real‑World Operations Journey

This article walks through a vehicle‑networking company's four‑stage journey—selection, requirement analysis, implementation, and evolution—to replace fragmented SSH passwords with a centralized OpenLDAP authentication platform, covering cost decisions, deployment steps, security hardening, and management automation.

OpenLDAP · Operations · authentication

0 likes · 13 min read

Baidu Intelligent Testing

Mar 27, 2017 · Operations

Gray Release (Canary Deployment) Strategies and Practices

The article explains gray release as a smooth, risk‑mitigating deployment method, outlines why it is needed, describes its limitations, and compares four practical gray‑release solutions—including code‑level flags, pre‑release machines, SET isolation, and dynamic routing—before recommending a combined approach.

Canary Deployment · Deployment Strategy · Operations

0 likes · 11 min read

DevOps

Mar 26, 2017 · Operations

DevOps Survey Findings: Adoption Rates, Benefits, Challenges, and Tool Usage

Based on a survey of 300 IT professionals, this report reveals growing DevOps adoption, key motivations such as quality and cost reduction, major obstacles like resource shortages, measurable benefits including cost savings and faster releases, preferred tools, error‑handling practices, and future investment plans.

Adoption · Automation · Challenges

0 likes · 11 min read

ITFLY8 Architecture Home

Mar 26, 2017 · Operations

How Distributed Tracing Powers Modern Microservices: From Zipkin to EagleEye

This article explains why distributed systems need tracing, outlines design goals, compares major implementations like Zipkin, EagleEye, and Hydra, and details the data collection, storage, and analysis pipelines that enable end‑to‑end visibility and performance optimization in large‑scale services.

Distributed Tracing · Observability · Operations

0 likes · 13 min read

MaGe Linux Operations

Mar 23, 2017 · Operations

Why Operations Engineering Is the Hottest Career Path

The article reflects on eight years of operations experience, highlights the bright industry outlook, and outlines four key career paths—operations development, platform R&D, database engineering, and management—showing why skilled ops engineers are increasingly in demand.

IT jobs · Operations

0 likes · 5 min read

DevOps

Mar 21, 2017 · Operations

DevOps Evolution: Software Engineering Development, Transformation Pitfalls, Core Practices, and Ecosystem

This article traces the evolution of software engineering tools leading to DevOps, highlights common transformation pitfalls, outlines core DevOps practices such as autonomous small teams, traceable toolchains, real‑time metrics, and describes the surrounding ecosystem, offering practical guidance for organizations adopting DevOps.

Agile · Operations · continuous delivery

0 likes · 19 min read

Baidu Intelligent Testing

Mar 21, 2017 · Operations

Server Monitoring Solution: Requirements, Design Decisions, and Implementation Details

This article presents a comprehensive server‑side monitoring solution covering functional and performance requirements, monitoring objects, design choices between self‑monitoring and centralized reporting, system architecture, API definitions, key challenges such as key collisions, data formats, storage options, and operational considerations.

Alerting · Operations · Performance

0 likes · 12 min read

DevOps

Mar 20, 2017 · Operations

What DevOps Really Is (and Isn’t): History, Principles, Tools, and Culture

This article explains the origins and background of DevOps, clarifies common misconceptions about its role and title, outlines its cultural principles, surveys the essential toolchain, and discusses how organizations can adopt DevOps practices beyond just development and operations.

Automation · Culture · Operations

0 likes · 13 min read

360 Zhihui Cloud Developer

Mar 20, 2017 · Operations

How 360’s DoctorStarange Boosts Ops with AI‑Driven Prediction, Correlation, and Resource Optimization

This article explains how 360’s DoctorStarange system combines time‑series forecasting, neural‑network predictions, alarm correlation, and a machine‑health scoring model to reduce false alerts, automate remediation, and maximize resource utilization across thousands of production servers.

ARIMA · Operations · Predictive Monitoring

0 likes · 14 min read

MaGe Linux Operations

Mar 17, 2017 · Operations

10 Linux Commands That Can Wreck Your System – Avoid These at All Costs

This article warns about ten powerful Linux commands that, when misused—especially with root privileges—can irreversibly delete data, corrupt disks, or crash the entire system, and offers practical safeguards to prevent accidental disasters.

Operations · command line safety · dangerous-commands

0 likes · 8 min read

High Availability Architecture

Mar 15, 2017 · Operations

Highlights from SRECon17 Americas in San Francisco

The article reports on the SRECon17 Americas conference in San Francisco, summarizing keynote talks, panel sessions, and practical insights from industry leaders such as Stripe, Netflix, Google, and IBM on topics ranging from traffic control and container management to on‑call practices and cost considerations for Site Reliability Engineering.

Google · Netflix · Operations

0 likes · 6 min read

High Availability Architecture

Mar 14, 2017 · Operations

Transforming Operations in the Cloud Era: Tencent Blue Whale’s DevOps Journey

The article examines how Tencent’s Blue Whale platform enables traditional operations teams to evolve into DevOps‑focused, cloud‑native units by automating release, change, and incident processes, integrating big‑data decision support, and delivering low‑cost SaaS tools for a wide range of internal stakeholders.

Automation · Operations · SaaS

0 likes · 20 min read

Efficient Ops

Mar 12, 2017 · Operations

How Tencent Saved 8 Million QQ Users by Migrating Legacy Services

This article recounts how Tencent's operations team tackled the urgent migration of aging data‑center infrastructure to preserve service for 8 million legacy QQ users, detailing the challenges, strategic choices, IP‑level network relocation, and the DevOps practices that ensured a successful cut‑over.

Legacy Migration · Operations · Tencent

0 likes · 15 min read

MaGe Linux Operations

Mar 11, 2017 · Operations

A Day in the Life of a Tencent Operations Engineer: How They Structure Their Work

The article outlines a typical six‑segment workday of a Tencent operations engineer, detailing how they review past results, tackle urgent issues, take breaks, perform routine tasks, debug scripts, and handle after‑hours responsibilities, offering practical insight for aspiring sysadmins.

IT career · Operations · Scripting

0 likes · 5 min read

Architects' Tech Alliance

Mar 10, 2017 · Operations

Comprehensive Guide to SAN Boot: Principles, Benefits, Drawbacks, and Step‑by‑Step Configuration for Linux, HP‑UX, AIX, Solaris with Emulex HBA

This article explains the SAN Boot concept, its advantages and limitations, and provides detailed step‑by‑step instructions for configuring HBA cards and server BIOS (including legacy mode) to enable SAN Boot on Linux, HP‑UX, AIX, and Solaris systems.

HBA · Linux · Operations

0 likes · 9 min read

DevOps

Mar 9, 2017 · Operations

Instantiating DevOps Principles: A Four‑Dimensional Framework of People, Product, Process, and Tools

This article explains the origins of DevOps, presents the CALMS and Three Ways frameworks, and organizes practical DevOps principles into four dimensions—people, product, process, and tools—illustrating how they collectively enable continuous, on‑demand delivery of business value.

Agile · Automation · Culture

0 likes · 16 min read

ITPUB

Mar 9, 2017 · Operations

How the Four‑Eyes Principle Saves IT Ops from Costly Mistakes

The article shares frontline IT operations experiences, emphasizing careful command execution, mandatory operation logs, two‑person verification, and backup strategies to prevent disastrous errors, illustrated by real incidents like a massive Deutsche Bank loss caused by a simple input mistake.

IT best practices · Incident Prevention · Operations

0 likes · 4 min read

MaGe Linux Operations

Mar 8, 2017 · Operations

Master Linux ‘top’ Command: Real‑Time Process Monitoring Guide

This article explains how to use the Linux top command for real‑time system and process monitoring, covering its interface, statistical and process sections, interactive shortcuts, command‑line options, and internal commands to customize and sort the displayed information.

Operations · System Monitoring · process management

0 likes · 8 min read

DevOps

Mar 6, 2017 · Operations

Controlling Work-in-Progress: The Lake‑Water‑Rock Effect and Principles for Setting WIP Limits

This article explains how to control work‑in‑progress using the lake‑water‑rock metaphor, outlines realistic and useful principles for setting WIP limits, describes common limiting methods such as swim‑lane caps, stage caps, and personal caps, and offers practical ways to determine initial limit values.

Kanban · Lean · Operations

0 likes · 9 min read

MaGe Linux Operations

Mar 6, 2017 · Operations

What Do Operations Engineers Actually Do? Key Tasks and Essential Traits

This article explains why the operations role exists in the software lifecycle, outlines the core responsibilities of an operations engineer—including process standardization, automation, monitoring, security, and technology adoption—and highlights the vital personal qualities needed for success.

Operations

0 likes · 4 min read

DevOps

Mar 5, 2017 · Operations

Controlling Work‑in‑Progress: Delay Start and Focus on Completion

The article explains how to control work‑in‑progress by postponing new starts and concentrating on finishing existing tasks, emphasizing that WIP should be measured in delivered user value rather than task count, and outlines practical control techniques for lean product development.

Kanban · Lean · Operations

0 likes · 7 min read

Architecture Digest

Mar 3, 2017 · Operations

High-Concurrency Architecture: Strategies, Testing, and Practical Solutions

This article outlines the design and implementation of high‑concurrency systems, covering server architecture, load balancing, database clustering, caching strategies, message‑queue based asynchronous processing, static data handling, and operational best practices such as monitoring, redundancy, and automation.

Caching · Message Queue · Operations

0 likes · 18 min read

DevOps

Feb 28, 2017 · Operations

Designing a Team Kanban Wall and System: Step-by-Step Guide

This article walks readers through a three-step process for designing a team’s Kanban wall and system, teaching how to analyze value streams, select appropriate visual elements, and create a customized board that supports efficient workflow management.

Kanban · Operations · Visual Management

0 likes · 3 min read

Efficient Ops

Feb 28, 2017 · Operations

Prepare Your E‑Commerce System for Mega‑Sales: Proactive Prevention & Rapid Response

This article outlines a comprehensive PDCA‑based methodology for e‑commerce platforms to proactively prevent issues, quickly detect anomalies, and execute rapid decisions during large‑scale promotions, covering system goal definition, performance evaluation, capacity planning, SLA management, and team/process maturity.

Operations · capacity planning · e-commerce

0 likes · 18 min read

Efficient Ops

Feb 26, 2017 · Operations

How Alibaba Scales Massive Data Platforms: Lessons in Automated Operations

This article explores the challenges of operating Alibaba's large‑scale data platforms, describes the automation platform built to address them, and shares data‑driven, fine‑grained operational practices that enable stable, efficient, and cost‑effective service delivery.

Automation · Big Data · Operations

0 likes · 22 min read

Programmer DD

Feb 26, 2017 · Operations

Zero‑Code Real‑Time Monitoring for Spring Boot with InfluxDB, Telegraf & Grafana

Learn how to achieve near real‑time, time‑series monitoring of Spring Boot applications without writing code by combining Spring Boot Actuator, Jolokia, InfluxDB, Telegraf, and Grafana, while evaluating alternative tools like Prometheus, Graphite, and JMXTrans and understanding their limitations.

Grafana · InfluxDB · Operations

0 likes · 7 min read

ITFLY8 Architecture Home

Feb 24, 2017 · Big Data

How ELK, Kafka, and Spark Streaming Revolutionize Log Management in Big Data Environments

This article explores the evolution of log processing in the big‑data era, detailing how ELK Stack, Kafka, and Spark Streaming work together to provide scalable, real‑time log collection, analysis, and visualization for modern cloud‑native operations.

Big Data · ELK · Kafka

0 likes · 12 min read

DevOps

Feb 23, 2017 · Operations

Comparing ITIL and DevOps: Principles, Automation, and Integration Models

The article examines the conflict and convergence between ITIL and DevOps in modern operations, outlining DevOps principles, automation in deployment and operations, and three integration models that balance management and execution, while highlighting the distinct values and scenarios for each approach.

Automation · ITIL · Operations

0 likes · 12 min read

360 Zhihui Cloud Developer

Feb 23, 2017 · Operations

Key Ops and Cloud Takeaways from Our Tech Exchanges with AutoHome & Sina

The addops team shares practical insights from technical exchange sessions with AutoHome and Sina, covering CMDB design, code release, containerization, monitoring, hybrid‑cloud architecture, virtualization, and machine‑learning applications in operations.

Operations · cloud-native · containerization

0 likes · 6 min read

Efficient Ops

Feb 21, 2017 · Mobile Development

How Alibaba Scales Mobile App Ops: Gray Release, Monitoring, and Rapid Fixes

This article details Alibaba's mobile app operational practices, covering the challenges of client-side maintenance, their high‑frequency release pipeline, gray‑release mechanisms, monitoring, trace systems, remote logging, and rapid issue resolution to ensure stability and performance at massive scale.

Operations · Performance · Trace

0 likes · 21 min read

Architecture Digest

Feb 20, 2017 · Backend Development

YouTube Architecture Overview: High‑Concurrency, High‑Availability Design

This article examines YouTube's large‑scale architecture, detailing its platform components, web and video services, database evolution, data‑center strategy, and key lessons for building high‑concurrency, fault‑tolerant backend systems.

Databases · Operations · YouTube

0 likes · 9 min read

Ctrip Technology

Feb 16, 2017 · Operations

Application‑Based Automated Capacity Management and Utilization Evaluation

The article presents a comprehensive, application‑centric approach to automated capacity management that analyzes why server utilization is low, defines safe usage thresholds, describes a load‑balancer‑driven stress‑testing workflow with regression modeling, and explains how this practice improves resource efficiency, cost savings, and developer‑ops collaboration.

Automation · Operations · capacity management

0 likes · 14 min read

Efficient Ops

Feb 15, 2017 · Operations

Mastering the One‑Second Rule: Boost Mobile User Experience

This article explains how mobile network characteristics, the one‑second rule, and targeted optimizations in access scheduling, protocols, and business logic can dramatically improve download success, startup speed, and overall user experience for mobile services.

Network · Operations · Performance

0 likes · 24 min read

Qunar Tech Salon

Feb 14, 2017 · Operations

Application‑Based Automated Capacity Management and Utilization Evaluation

This article explains how to automate application‑centric capacity assessment, identify the safe utilization thresholds, use load‑balancer‑driven stress testing and regression modeling to pinpoint resource bottlenecks, and improve server usage while maintaining service reliability through close DevOps collaboration.

Automation · Operations · capacity management

0 likes · 15 min read

转转QA

Feb 13, 2017 · Databases

Redis Connection Pool Saturation: A Debugging Tale

A developer recounts how a Redis connection pool overflow across dozens of clusters was traced to a single misbehaving service, diagnosed with netstat and ps commands, and resolved by adjusting configuration and stopping the offending process, illustrating practical troubleshooting of connection limits.

Connection Pool · Operations · Redis

0 likes · 4 min read

Efficient Ops

Feb 9, 2017 · Operations

Automating Application‑Based Capacity Management to Boost Resource Utilization

This article explains how to automate capacity management focused on application performance, identifies common causes of low resource utilization, proposes safe utilization thresholds, describes a testing framework that uses load‑balancer weighting and real‑time monitoring to pinpoint bottlenecks, and outlines how ops and developers can collaborate to improve efficiency.

Automation · Operations · capacity management

0 likes · 18 min read

Efficient Ops

Feb 6, 2017 · Operations

Building Billion‑Scale Web Systems That Auto‑Extinguish Failures

The article shares Tencent’s practical fault‑tolerance journey for a billion‑scale activity platform, covering retry strategies, automatic removal of faulty nodes, timeout tuning, business‑level safeguards, service degradation, and decoupling techniques that together reduce manual firefighting and improve system resilience.

Operations · fault tolerance · large-scale systems

0 likes · 25 min read

Qunar Tech Salon

Feb 4, 2017 · Operations

GitLab Database Deletion Incident: Lessons on Backup, Operations, and High‑Availability Design

The article recounts a GitLab production database deletion caused by a mistaken command, analyzes why the backup mechanisms failed, and offers technical and cultural recommendations—including automation, proper replication, and transparent post‑mortems—to build more reliable, high‑availability systems.

GitLab · Operations · PostgreSQL

0 likes · 15 min read

Efficient Ops

Feb 2, 2017 · Operations

What Happens When a Production Database Is Accidentally Deleted? Lessons from GitLab’s Disaster

This article recounts the GitLab production database deletion incident, analyzes why backup mechanisms failed, shares technical and cultural lessons on operational practices, and offers concrete recommendations for building resilient, high‑availability systems to prevent data loss.

Operations · backup · incident response

0 likes · 16 min read

21CTO

Feb 2, 2017 · Operations

What GitLab’s 300 GB Data Loss Teaches About Backup and Ops Discipline

The GitLab production database was mistakenly deleted during a manual fix, exposing gaps in backup strategies, PostgreSQL configuration, and operational practices, and prompting a detailed post‑mortem that highlights the need for automated recovery, proper tooling, and transparent incident handling.

Data loss · Operations · PostgreSQL

0 likes · 15 min read

Efficient Ops

Feb 2, 2017 · Operations

GitLab.com Database Disaster: How a Mistyped rm Command Wiped 300GB and What We Learned

GitLab.com suffered a catastrophic database outage on February 1, 2017 when an exhausted operator mistakenly ran a destructive rm command on the wrong server, wiping most production data; the incident’s timeline, root causes, recovery steps, and lessons learned are detailed in this post‑mortem.

Database Incident · GitLab · Operations

0 likes · 12 min read

dbaplus Community

Feb 1, 2017 · Databases

When 310 GB Vanished: GitLab’s Backup Failure and What It Teaches Us

A GitLab.com database accident caused the loss of 310 GB of data, exposing multiple failed backup mechanisms and prompting a detailed analysis of technical, operational, and managerial lessons for reliable data protection.

DBA · GitLab · Operations

0 likes · 7 min read

Efficient Ops

Jan 24, 2017 · Databases

Essential DBA Holiday Checklist: Keep Your Databases Safe During Chinese New Year

This guide outlines the critical tasks DBA teams should perform before, during, and after the Chinese New Year holiday, including daily security practices, pre‑holiday inspections, on‑call rotations, post‑holiday reviews, and detailed checklist scripts to ensure database reliability and prevent incidents.

DBA · Databases · Holiday

0 likes · 13 min read

MaGe Linux Operations

Jan 23, 2017 · Operations

Mastering Puppet: How Automated Configuration Management Scales Server Ops

This article explains Puppet's architecture, data flow, and practical examples, showing how automated configuration management can efficiently handle large numbers of servers, reduce manual errors, and improve operational reliability in modern IT environments.

Automation · Operations · Puppet

0 likes · 8 min read

Efficient Ops

Jan 22, 2017 · Operations

What 2016 Ops Teams Learned About Monitoring Tools and Alert Patterns

The 2016 Ops Alert Report reveals Zabbix’s dominance, preferred notification channels, monthly and daily alert trends, peak alert times, regional distribution, and quirky usage statistics, offering valuable insights for operations teams to optimize monitoring and incident response.

Operations · Zabbix · alerts

0 likes · 5 min read

Qudian (formerly Qufenqi) Technology Team

Jan 18, 2017 · Operations

Building a Scalable Business Monitoring System: Architecture, Modules & Lessons

This article presents a comprehensive case study of a business monitoring system, covering its background, architectural analysis, module design, time‑series database selection, visualization with Grafana, alerting strategies, decision‑making logic, and intelligent monitoring experiments, followed by key takeaways and lessons learned.

Grafana · InfluxDB · Operations

0 likes · 12 min read

MaGe Linux Operations

Jan 8, 2017 · Operations

Master Ansible: From Basics to Advanced Modules for Efficient Operations

This guide introduces Ansible for operations, covering its core features, installation, host preparation, key management, essential modules, playbook structure, YAML syntax, handlers, tags, variables, templates, loops, and conditional execution, with practical command examples and visual illustrations.

Ansible · Automation · Operations

0 likes · 8 min read
