Tagged articles
60 articles
Alibaba Cloud Big Data AI Platform
Aug 7, 2025 · Operations

How Alibaba Scales Flink to Millions of Cores: Real‑Time Ops Secrets

This article details Alibaba's decade‑long evolution of its real‑time computing platform, the massive operational challenges of managing Flink clusters at million‑core scale, and the comprehensive strategies—including SLA metrics, self‑healing services, cloud‑native redesign, and job‑level advisory tools—used to ensure stability, cost efficiency, and performance during peak events like Double‑11.

Apache Flink · Cloud Native · Job Advisory
0 likes · 19 min read
Efficient Ops
Jun 21, 2025 · Operations

What a Lychee Delivery Tale Teaches About DevOps and Operations

Through a vivid analogy of transporting lychees to ancient Chang’an, the article illustrates how operations teams must negotiate SLAs, automate monitoring, design high‑availability pipelines, document responsibilities, and avoid the endless cycle of blame, offering practical DevOps strategies for managing zero‑budget, zero‑resource projects.

DevOps · Operations Management · SLA
0 likes · 5 min read
Efficient Ops
May 18, 2025 · Operations

Mastering API Latency: What P90, P95, P99 and SLA Really Mean

This article explains key performance metrics such as API latency, SLA commitments, and percentile indicators P90, P95, and P99, illustrating how to calculate and interpret these values along with average and maximum latency to improve system reliability and user experience.

API latency · Performance Monitoring · SLA
0 likes · 5 min read
Liangxu Linux
Apr 6, 2025 · Operations

How to Define SLIs, SLOs, and SLAs for Effective SRE Practices

This guide explains how SRE teams should collaborate early in the software development lifecycle to define Service Level Indicators (SLIs), set realistic Service Level Objectives (SLOs) and Service Level Agreements (SLAs), and integrate observability signals, error budgeting, risk management, and incident handling into reliable operations.

Error Budget · Observability · Operations
0 likes · 13 min read
MaGe Linux Operations
Dec 26, 2024 · Operations

How to Justify Maintenance Fees When the System Runs Smoothly

When a client claims there is no workload because a system is stable, this article offers practical strategies—such as proactive monitoring, artificial fault injection, transparent SLA reporting, and value‑added services—to demonstrate the necessity of ongoing maintenance fees while balancing client expectations.

IT consulting · SLA · client management
0 likes · 7 min read
DaTaobao Tech
Dec 25, 2024 · Operations

Fundamentals of Service Level Agreements (SLA) for Messaging Middleware

The article explains SLA fundamentals for messaging middleware, defining contracts, SLI/SLO relationships, key metrics such as availability, latency and error‑rate, dynamic lifecycle processes, template components, error‑budget calculations, industry benchmarks, internal monitoring practices, a sample SLA draft, and best‑practice recommendations for continuous improvement.

Messaging Middleware · Operations · Reliability
0 likes · 41 min read
JD Cloud Developers
Nov 27, 2024 · Operations

Mastering SLA, SLO, and SLI: Practical Strategies for Reliable Services

This article explains the core concepts of SLA, SLO, and SLI, demonstrates how to set realistic service level objectives, manage alert noise, and apply practical examples—including API, MQ, and scheduled task monitoring—to improve system reliability and performance during high‑traffic events like 11.11 promotions.

SLA · SLI · SLO
0 likes · 23 min read
Software Development Quality
Jul 30, 2024 · R&D Management

What Does a Comprehensive Quality Service SLA Look Like? A Deep Dive into Testing Standards

This article presents a detailed Service Level Agreement (SLA) for quality and testing services, outlining service items, levels, descriptions, standards, response times, quality commitments, and owners across multiple testing domains such as process control, requirement management, development testing, performance, security, weak network, and specialized testing, providing a complete framework for managing testing quality within an organization.

Performance Testing · Process Control · SLA
0 likes · 41 min read
Software Development Quality
Jul 29, 2024 · Operations

Comprehensive Quality Service SLA Framework: Standards, Metrics, and Management

This article presents a detailed quality service SLA framework covering service directories, customer satisfaction, quality management processes, resource management, measurement systems, training, assurance, improvement, risk management, and associated metrics, response times, and ownership responsibilities for both group and center levels.

SLA · quality management
0 likes · 19 min read
Software Development Quality
Jul 28, 2024 · R&D Management

Designing Effective Quality Service SLA Standards for R&D Teams

This article presents a comprehensive SLA framework for quality and testing services, detailing sections on quality policy, vision, planning, system architecture, management, testing capability, tool support, and data collection, each with service items, levels, descriptions, SLA calculations, quality commitments, and owners.

Process Standards · R&D Operations · SLA
0 likes · 11 min read
Software Development Quality
Jul 28, 2024 · Operations

What’s Inside a Quality Service SLA? A Complete Guide to Standards and Metrics

This article presents a comprehensive SLA framework for quality and testing services, detailing the directory, quality policy, vision, planning, system architecture, and specific process standards with response times, quality commitments, and ownership responsibilities, all aimed at ensuring consistent service delivery.

SLA · quality management · service standards
0 likes · 10 min read
Huolala Tech
Jul 11, 2024 · Operations

How LApiGateway Achieves 99.999% Uptime: Architecture, SLA & Risk Mitigation

LApiGateway, Huolala's internal micro‑service gateway, achieves five‑nine availability through a dual‑plane architecture, comprehensive monitoring, SLA definition, risk classification, heartbeat health checks, traffic migration strategies, strict change governance, and regular fault drills, all detailed in this technical overview.

LApiGateway · Microservice Gateway · SLA
0 likes · 9 min read
DevOps
Oct 26, 2023 · Operations

Design and Implementation of SLA for Object Storage Services

This article explains how to design SLA metrics for object storage services, describes the S3 protocol, proposes availability calculations, outlines monitoring and alerting rules, and provides practical implementation examples using s3cmd, Python boto, and Java SDK to ensure reliable cloud storage operations.

SLA · monitoring · object storage
0 likes · 16 min read
Architects Research Society
Sep 7, 2023 · Fundamentals

Chapter 1: Foundations of Enterprise Architecture

This article introduces the fundamentals of enterprise architecture, defining its scope, reference models, and maturity stages, and explains how architects manage complexity, control costs, ensure SLA compliance, and apply iterative, partitioning, and simplification techniques to modernize enterprise IT systems.

Cost Optimization · Modernization · SLA
0 likes · 14 min read
ByteDance Data Platform
Aug 30, 2023 · Big Data

How We Cut Offline Data Warehouse SLA Delay from 13 Days to Zero with DataLeap

The article details how the "Xingfu Li" real‑estate platform tackled a 13‑day offline data‑warehouse SLA delay by adopting Volcano Engine's DataLeap suite, outlining the challenges, the three‑step governance process, and the measurable improvements achieved across task coverage, alert reduction, and data stability.

Big Data · Data Governance · Data Warehouse
0 likes · 10 min read
dbaplus Community
Jul 27, 2023 · Operations

How to Build Scalable Observability for Cloud‑Native Environments: Lessons from SRE

This article summarizes a technical talk on the challenges of cloud‑native transformation, the design of an application‑centric observability platform using CMDB, Prometheus, Thanos and VictoriaMetrics, practical solutions for high‑cardinality metrics and alerting, and future directions such as eBPF and AI‑driven fault detection.

CMDB · Observability · SLA
0 likes · 14 min read
DeWu Technology
Feb 27, 2023 · Operations

Message Push Monitoring and SLA Practices

The team implemented SLA‑based, node‑level monitoring for mobile push messages—splitting the workflow, measuring latency, blocking volume, and success rates, isolating metrics with Spring AOP, and tracking third‑party vendors—resulting in clear latency standards, doubled peak throughput, faster issue resolution, and improved overall reliability.

Message Push · Operations · SLA
0 likes · 11 min read
21CTO
Nov 15, 2022 · Operations

Mastering SRE: How to Define SLIs, SLOs, SLAs and Build Reliable Systems

This article explains how SRE teams should define Service Level Indicators, Objectives and Agreements, manage reliability, performance, saturation and observability, use proper metrics and tracing, handle error budgets, assess risks, and implement effective incident and project management to create robust, cloud‑native services.

Error Budget · Observability · Reliability
0 likes · 14 min read
ByteDance Data Platform
Aug 24, 2022 · Big Data

How ByteDance Guarantees Real‑Time Data Point Quality with Scalable Validation

This article explains ByteDance's end‑to‑end data‑point (event‑tracking) validation system, covering its technical challenges—usability, accuracy, real‑time visibility, stability, and extensibility—along with SDK integration, QR‑code workflow, JSON‑Schema verification, push‑service architecture, SLA metrics, and future automation plans.

Big Data · JSON Schema · Push Service
0 likes · 11 min read
Software Development Quality
Aug 19, 2022 · Operations

Comprehensive Quality Management SLA Framework for IT Services

This document outlines a detailed Service Level Agreement (SLA) framework covering quality service standards, management processes, testing capabilities, tool support, resource management, measurement systems, risk handling, and continuous improvement to ensure consistent delivery and customer satisfaction across IT operations.

Operations · SLA · Training
0 likes · 17 min read
Software Development Quality
Aug 19, 2022 · Operations

Unlocking Quality Service SLA: Complete Standards and Capabilities Guide

This article presents a comprehensive Quality Service SLA framework, detailing its directory, quality policies, planning, management structures, testing capabilities, tool support, and data collection abilities, providing clear categories, service items, and responsible parties for organizations seeking robust service level agreements.

SLA · Service Management · quality assurance
0 likes · 4 min read
Software Development Quality
Aug 19, 2022 · Operations

Designing Effective Quality & Testing Service SLAs: A Complete Guide

This article presents a detailed framework for quality and testing service SLAs, outlining standard chapters, service items, performance metrics, response times, quality commitments, calculation methods, and ownership responsibilities across data collection, test efficiency, performance monitoring, and quality tracking.

SLA · Service Level Agreement · Testing Metrics
0 likes · 11 min read
Software Development Quality
Aug 19, 2022 · Operations

Designing Effective Quality Service SLA Standards for IT Operations

This article presents a comprehensive framework for quality service SLA standards, detailing the quality policy, vision, planning, management, testing service capabilities, tool support, data collection, and the architecture of the quality system with clear service items, levels, descriptions, SLA calculations, and ownership responsibilities.

SLA · quality management · service standards
0 likes · 10 min read
DataFunSummit
Aug 11, 2022 · Big Data

Huya Data Platform: Cost Reduction and SLA Strategies

This article presents Huya's big data platform evolution, detailing cost‑saving measures, SLA practices, multi‑datacenter architecture, containerized resources, metadata‑driven intelligence, and future directions such as hybrid‑engine materialized views to improve efficiency and service reliability.

Cost Optimization · Data Platform · SLA
0 likes · 15 min read
Architect's Guide
Aug 2, 2022 · Operations

Understanding Service Degradation and Its Practical Strategies

This article explains the concept of service degradation, defines SLA levels, and details various degradation techniques—including fallback data, rate‑limiting, timeout handling, circuit‑breaker retries, and front‑end and back‑end strategies—to maintain high availability during traffic spikes or component failures.

Fallback · SLA · circuit breaker
0 likes · 13 min read
Alibaba Cloud Developer
Jun 28, 2022 · Big Data

How Kuaishou Guarantees Real‑Time Data Warehouse Reliability During Billion‑Scale Events

This article details Kuaishou’s real‑time data warehouse architecture and its comprehensive assurance framework—including forward lifecycle standards, reverse fault‑injection testing, and Spring Festival event practices—highlighting challenges of massive traffic, high timeliness, accuracy, and stability, and outlining future plans for automation, batch‑stream integration, and cost reduction.

Data Warehouse · Flink · Real-time Streaming
0 likes · 23 min read
DeWu Technology
May 16, 2022 · Operations

NOC SLA Implementation for Consumer Trading Platform

To tackle growing production complexity and past incident delays, the consumer trading platform introduced a three‑tier NOC‑SLA with intelligent baselines powered by Facebook Prophet, streamlined alert rules, and an SOS‑linked workflow, boosting detection frequency, cutting critical response times to under five minutes, and improving overall system reliability while emphasizing ongoing baseline and rule maintenance.

Alert Management · NOC · Operations
0 likes · 13 min read
ByteDance Data Platform
May 16, 2022 · Operations

How ByteDance’s SLA Assurance Platform Guarantees Data Reliability at Scale

This article explains how ByteDance’s self‑built SLA assurance platform addresses data pipeline communication costs, unclear responsibilities, and operational pressure by introducing roles, a streamlined signing workflow, checkpoint and recommendation calculations, and real‑time monitoring to achieve a 99.1% SLA compliance rate.

Operations · SLA · monitoring
0 likes · 9 min read
DataFunSummit
Apr 22, 2022 · Big Data

Huya Real-Time Computing SLA Practice: Platform Evolution, Core SLA Definition, Capability Building, and Future Outlook

The talk details Huya’s real‑time computing platform evolution from chaotic early stages to a unified, containerized system, defines core SLA metrics focused on latency compliance, describes capability enhancements such as demand monitoring, task analysis, dynamic scaling, and outlines future goals for usability, stability, openness, and unified stream‑batch processing.

Flink · Real‑Time Computing · SLA
0 likes · 12 min read
DataFunTalk
Apr 15, 2022 · Big Data

Huya Real-Time Computing SLA Practices: Platform Evolution, Core SLA Definition, Capability Building, and Future Outlook

This article details Huya's real‑time computing platform evolution, core SLA definitions focused on latency compliance, capability enhancements such as demand management, task analysis, dynamic resource scaling, and outlines future directions emphasizing usability, stability, openness, and unified batch‑stream processing.

Flink · Real‑Time Computing · SLA
0 likes · 13 min read
ITFLY8 Architecture Home
Nov 4, 2021 · Operations

Mastering Service Degradation: Strategies to Keep Systems Available

Service degradation involves strategically reducing or disabling non‑essential features during traffic spikes or failures to maintain core functionality, covering concepts like SLA levels, fallback data, rate‑limiting, timeout handling, circuit breaking, and front‑end and back‑end downgrade techniques for high‑availability systems.

Operations · SLA · fallback data
0 likes · 14 min read
ITFLY8 Architecture Home
Nov 1, 2021 · Operations

Mastering Service Degradation: Strategies to Keep Your System Available Under Load

Service degradation, a crucial reliability technique, involves selectively disabling non-essential features, applying rate limiting, timeout handling, fallback data, and tiered switches across front‑end, back‑end, and infrastructure layers to maintain core functionality during traffic spikes or component failures, ensuring high availability and meeting SLA targets.

Fallback · Operations · Reliability
0 likes · 13 min read
IT Architects Alliance
Oct 1, 2021 · Operations

Understanding Service Degradation: Definitions, Levels, and Mitigation Strategies

The article explains service degradation concepts, defines SLA levels and the meaning of six nines, and details various degradation techniques such as fallback data, rate‑limiting, timeout, fault handling, read/write strategies, frontend safeguards, and the use of switches and pre‑embedding to maintain system availability during traffic spikes or failures.

Fallback · Operations · SLA
0 likes · 12 min read
IT Architects Alliance
Sep 12, 2021 · Operations

Mastering Service Degradation: Keep Your System Available Under Heavy Load

This article explains the concept of service degradation, defines SLA levels including the six‑nine metric, and details practical strategies such as fallback data, rate‑limiting, timeout handling, read/write degradation, retry mechanisms, and front‑end techniques to maintain high availability during traffic spikes.

Fallback · Microservices · SLA
0 likes · 14 min read
Architect
Sep 11, 2021 · Operations

Understanding Service Degradation and Its Practical Strategies

This article explains the concept of service degradation, its relationship with rate limiting and SLA, and presents various practical mitigation techniques such as fallback data, rate‑limit throttling, timeout handling, fault isolation, retry mechanisms, feature switches, read/write degradation, and front‑end strategies to maintain high availability during traffic spikes or component failures.

Fallback · SLA · circuit breaker
0 likes · 13 min read
dbaplus Community
Aug 25, 2021 · Databases

Master‑Slave, Replica Set, and Sharding: How MongoDB Achieves High Availability

This article explains MongoDB's evolution from Master‑Slave to Replica Set and Sharding architectures, detailing how each model provides high availability, data reliability, and scalability, and offers practical configuration tips to ensure strong consistency and minimal downtime in production deployments.

Database Architecture · MongoDB · Replica Set
0 likes · 20 min read
HelloTech
Jul 12, 2021 · Operations

Introduction to System Stability: Concepts, Metrics, and Practices

The article explains Haro’s approach to system stability—defining high‑availability, key metrics such as SLA, RPO/RTO, MTTR/MTBF, and the 5‑5‑10 rule—while outlining cultural and technical safeguards, full‑team participation, process integration, and incremental tooling to prevent faults and ensure rapid recovery.

MTTR · RPO · RTO
0 likes · 11 min read
High Availability Architecture
May 13, 2021 · Operations

Understanding High Availability: Concepts, Metrics, and Design Practices

This article explains high availability in distributed systems, covering its definition, design objectives, key metrics such as MTBF, MTTR, SLA, and practical design elements like redundancy, monitoring, failover, as well as common Q&A on cost, relationship with other architecture attributes, and implementation considerations.

Distributed Systems · Reliability · SLA
0 likes · 13 min read
dbaplus Community
Mar 5, 2020 · Backend Development

How to Turn Thinking Frameworks into Powerful Architecture Strategies

This article shares practical thinking frameworks such as OGSM and 5W1H, explains iterative internet mindset, demonstrates their application to software architecture design, defines architecture concepts, outlines common architectural goals, evaluation metrics like SLA, and presents a toolbox of techniques ranging from single‑machine performance to micro‑service high‑availability patterns.

5W1H · Microservices · OGSM
0 likes · 30 min read
Youzan Coder
Dec 28, 2019 · Industry Insights

How Youzan Built an End‑to‑End Closed‑Loop Workflow to Cut Demand‑Management Waste

This article examines Youzan's systematic overhaul of merchant feedback handling—introducing a closed‑loop workflow, defining SLA‑based bottleneck mitigation, prioritizing requests, and deploying an online management tool—to reduce waste, improve transparency, and accelerate product iteration across multiple departments.

SLA · demand management · industry insights
0 likes · 13 min read
Architects' Tech Alliance
Dec 5, 2018 · Cloud Computing

AWS Outposts: Dispersing the Last Cloud of Private Cloud Skepticism

The article examines AWS Outposts as a hybrid‑cloud solution, arguing that while it brings public‑cloud services to on‑premises data centers, it cannot match the ultra‑high SLA of pure public cloud, and discusses the broader implications of private‑cloud versus public‑cloud dominance in the IT industry.

AWS · Outposts · SLA
0 likes · 18 min read
DevOps
Aug 13, 2018 · Cloud Computing

Understanding Cloud Computing SLA, Availability, and Compensation: A Comparative Analysis of Major Providers

The article explains cloud computing fundamentals, details Service Level Agreements (SLAs) and their metrics, compares the availability and compensation policies of major Chinese cloud providers, and concludes with a brief DevOps recruitment notice, highlighting both technical insights and industry context.

Availability · Cloud providers · DevOps
0 likes · 11 min read
Alibaba Cloud Infrastructure
Dec 27, 2017 · Operations

Efficient Ticket System Operations During Double 11 Promotion

The article describes how a ticketing system with strict SLA enforcement, automated routing, and team‑based service management enabled rapid, orderly issue handling during the high‑volume Double 11 shopping event, achieving near‑90% resolution within 30 minutes and improving overall business stability.

Double 11 · Operations · SLA
0 likes · 7 min read
Efficient Ops
Nov 27, 2016 · Operations

When Ops Heroes Burn Out: Tackling Personal Heroism in Operations

The article explores personal heroism in operations, defining it as reliance on individual effort to keep flawed systems appearing normal, examines its short‑term benefits and long‑term drawbacks for companies, teams, and the heroes themselves, and offers practical strategies to eliminate this risky mindset.

Operations · SLA · Team Culture
0 likes · 10 min read
Efficient Ops
Nov 9, 2016 · Operations

How to Design Effective SLOs and SLAs: A Technical Deep Dive

This article explains the definitions of service, SLI, SLO, and SLA, outlines how to choose and measure appropriate indicators, shares best practices for setting and improving SLOs, and shows how SLAs combine objectives with consequences to manage service reliability.

Operations · SLA · SLI
0 likes · 11 min read