Tag

SRE

1 view collected around this technical thread.

DevOps Operations Practice
Jun 11, 2025 · Operations

Ops vs DevOps vs SRE: Which Role Matches Your Career Goals?

This article compares traditional Operations (Ops), DevOps, and Site Reliability Engineering (SRE) by outlining their definitions, core responsibilities, typical technology stacks, and career considerations, helping readers understand the distinct philosophies and choose the path that best fits their interests and market demand.

DevOps · Monitoring · Operations
0 likes · 6 min read
DevOps Operations Practice
Jun 7, 2025 · Operations

How Ops Professionals Can Reach a 300k Annual Salary: Real‑World Tips

This article compiles practical advice from experienced operations engineers on the challenges and strategies for achieving a 300,000 CNY yearly salary, covering skill development, career moves, company size, automation, and the evolving role of SRE/DevOps.

DevOps · Operations · SRE
0 likes · 6 min read
Efficient Ops
Jun 3, 2025 · Operations

What Anthropic’s SRE Team Learned: 23 Practical Ops Tips for Scalable AI Infrastructure

This article shares Anthropic’s SRE engineer insights on 23 actionable practices—from schema migration and Karpenter node management to OpenTelemetry adoption, Helm chart storage, and Terraform versus CloudFormation—offering concrete recommendations for building reliable, cost‑effective AI and cloud‑native platforms.

Cloud Native · DevOps · Infrastructure
0 likes · 12 min read
FunTester
May 16, 2025 · Operations

Chaos Engineering: Evolution, Workflow, Advantages, and Practice Principles

This article introduces chaos engineering, a discipline that deliberately injects faults into distributed systems to test and improve resilience. It traces the practice's evolution from Netflix's Chaos Monkey to modern platforms and outlines its operational workflow, benefits, and core principles for reliable system design.

Distributed Systems · Operations · SRE
0 likes · 9 min read
Baidu Geek Talk
Apr 23, 2025 · Operations

Baidu SRE Digital Immunity System: Construction, Evolution, and Practice

Baidu’s SRE digital‑immune system, which has evolved into an AI‑powered intelligent immunity platform, quantifies and mitigates risk across thousands of services by integrating data‑driven monitoring, rule‑based detection, and large‑model GraphRAG knowledge mining. It has cut degradation cases by about 40% and shifted operations from reactive troubleshooting to proactive, data‑centric quality assurance.

AI · Cloud Native · Digital Immunity
0 likes · 14 min read
Efficient Ops
Mar 24, 2025 · Operations

15 Common Ops Mistakes That Can Crash Your System – How to Avoid Them

This article outlines fifteen common operational and management mistakes—such as frequent incidents, excessive new hires, lack of automation, and missing rollback plans—that can trigger system outages, and offers guidance on how teams can strengthen testing, processes, and team capabilities to prevent downtime.

DevOps · Operations · SRE
0 likes · 6 min read
Efficient Ops
Mar 23, 2025 · Operations

Essential Linux Log Files Every SRE Should Monitor

This article outlines the most important Linux log files under /var/log, explains what each records—from system and kernel messages to authentication, web server, database, and firewall events—and shows practical commands for inspecting them, helping SREs improve fault detection and system observability.

Linux · Monitoring · Operations
0 likes · 9 min read
FunTester
Mar 23, 2025 · Operations

The Origin, Development, and Future of Chaos Engineering

Chaos engineering, introduced by Netflix in 2011 to proactively inject failures and test system resilience, has evolved over the past decade into a widely adopted practice integrated with SRE, automation, AI, and Kubernetes. The article also covers best‑practice guidelines and future trends for improving distributed system reliability.

Cloud Native · Kubernetes · SRE
0 likes · 8 min read
Efficient Ops
Mar 18, 2025 · Operations

Is 24/7 On‑Call a Nightmare? Real Ops Insights from Zhihu Discussions

This article compiles diverse Zhihu comments on the reality of 24/7 on‑call duties, contrasting exaggerated myths with practical team‑based solutions, global shift models, backup strategies, and actionable tips for improving operations without sacrificing personal life.

Automation · Operations · SRE
0 likes · 7 min read
Efficient Ops
Mar 4, 2025 · Operations

Mastering SRE: How to Define SLIs, SLOs, and Build Reliable Cloud‑Native Systems

This article explains how SRE teams should collaboratively define Service Level Indicators, Objectives, and Agreements, and then covers reliability, performance, observability signals, error budgeting, risk management, incident handling, and the engineering work needed to build robust cloud‑native platforms.

Error Budget · Observability · SLI
0 likes · 13 min read
Efficient Ops
Feb 26, 2025 · Databases

Efficient Operations for Heterogeneous Databases: Insights from Guangdong Mobile

The article summarizes Lai Kunchi's presentation at the 24th GOPS Global Operations Conference, covering the current state and challenges of database development, Guangdong Mobile's database operation system, and future directions for managing heterogeneous databases in evolving business architectures.

AIOps · Database Operations · DevOps
0 likes · 3 min read
Efficient Ops
Feb 17, 2025 · Operations

From Bronze to AI‑Powered Ops: Mastering the Operations Career Ladder

This article explores the hierarchy of operations roles, outlines five career stages from entry‑level to AI‑driven expert, and offers practical advice on building foundations, automation, high‑availability design, and embracing emerging technologies.

AI · Automation · DevOps
0 likes · 6 min read
Efficient Ops
Feb 6, 2025 · Operations

Inside Alipay’s Full‑Ecosystem Availability Monitoring: Architecture and Practices

At the 2024 GOPS Global Operations Conference in Shanghai, Alipay’s monitoring lead Tang Liang presented the challenges, architecture, risk‑prevention practices, and implementation details of the company’s full‑ecosystem availability monitoring system, highlighting its role in DevOps, SRE, and AIOps initiatives.

AIOps · Cloud Native · DevOps
0 likes · 4 min read
JD Tech Talk
Feb 6, 2025 · Operations

Stability Assurance Mechanisms and Practices for Site Reliability Engineering (SRE)

This article outlines comprehensive stability assurance mechanisms—including standards, process workflows, the distinction between developers and SREs, personal responsibilities, and practical construction directions—to guide teams in building resilient, high‑availability systems through proactive, daily, and incident‑response practices.

Operations · Process · Reliability Engineering
0 likes · 10 min read
Efficient Ops
Jan 20, 2025 · Operations

Inside Qunar’s Pre‑Release Platform: Design, Practice, and Future Outlook

The article recaps Li Jingkang’s presentation at the 2024 GOPS Global Operations Conference, detailing the background, principles, design, and real‑world implementation of Qunar’s pre‑release platform, and outlines its future direction within DevOps, SRE, AIOps, and cloud‑native practices.

AIOps · Cloud Native · DevOps
0 likes · 3 min read
Efficient Ops
Jan 14, 2025 · Operations

What Ops Professionals Learn from Real-World Incident Stories

This article compiles real‑world operations incidents—from accidental database deletions and faulty deployments to hidden data tampering and network device failures—highlighting how quick diagnosis, preventive maintenance, and SRE practices can mitigate impact on users, reputation, and revenue.

Case Studies · Database Backup · DevOps
0 likes · 6 min read
Efficient Ops
Jan 1, 2025 · Operations

What 2024’s Biggest Outages Teach Us About Building Resilient Systems

Reviewing the major service disruptions—from Alibaba Cloud to OpenAI—this article extracts key SRE lessons such as early disaster‑recovery planning, regular backups, load balancing, real‑time monitoring, performance tuning, and capacity planning, urging enterprises to adopt resilient practices for a more stable future.

Operations · Outage Management · Reliability Engineering
0 likes · 6 min read
Efficient Ops
Dec 8, 2024 · Operations

Unlocking BizDevOps: Key Insights from Shanghai’s Enterprise Summit

The article recaps Shanghai’s BizDevOps Enterprise Summit, highlighting five expert sessions on R&D‑operations integration in securities, platform engineering breakthroughs, large‑model agents in financial ops, Ctrip’s 10 PB JuiceFS practice, and core SRE stability strategies for financial firms.

AI agents · BizDevOps · Cloud Native
0 likes · 4 min read
Efficient Ops
Nov 20, 2024 · Operations

How China’s Telecom Leaders Accelerate DevOps & AIOps Standards for Faster Delivery

The article outlines China’s 2024‑2027 information standard action plan and the rollout of ITU DevOps and AIOps assessments, and showcases dozens of telecom projects that achieved significant improvements in delivery speed, reliability, automation, and observability through standardized DevOps, SRE, and AIOps practices.

AIOps · DevOps · SRE
0 likes · 23 min read