Tag

infrastructure

Efficient Ops
Jun 3, 2025 · Operations

What Anthropic’s SRE Team Learned: 23 Practical Ops Tips for Scalable AI Infrastructure

This article shares insights from Anthropic’s SRE team on 23 actionable practices, ranging from schema migration and Karpenter node management to OpenTelemetry adoption, Helm chart storage, and Terraform versus CloudFormation, and offers concrete recommendations for building reliable, cost‑effective AI and cloud‑native platforms.

DevOps · Kubernetes · SRE
0 likes · 12 min read
Raymond Ops
May 27, 2025 · Fundamentals

Understanding Block, File, and Object Storage: Pros, Cons, and Use Cases

This article explains the concepts, advantages, and disadvantages of block storage, file storage, and object storage, compares their architectures, and clarifies when each type is appropriate for different applications and workloads.

block storage · file storage · infrastructure
0 likes · 10 min read
Bilibili Tech
May 27, 2025 · Operations

Automated Server Fault Detection and Repair: Architecture, Methods, and Future Outlook

This article presents a comprehensive overview of server fault management at scale, detailing the classification of failures, shortcomings of traditional manual processes, and the design of an automated detection and repair system that combines in‑band and out‑of‑band data collection, rule‑based alerting, and end‑to‑end repair workflows, while also outlining future directions for intelligent monitoring and reliability.

Automation · Server Fault Management · infrastructure
0 likes · 17 min read
Efficient Ops
May 21, 2025 · Operations

Why We Dropped Kubernetes: Cutting Costs by 62% and Boosting DevOps Happiness

Six months after abandoning Kubernetes, our DevOps team reduced infrastructure spend by 62%, cut deployment time by 89%, eliminated weekend on‑call duties, and improved overall happiness, demonstrating that simplifying the tech stack can deliver substantial operational and business benefits.

Cost Reduction · DevOps · Kubernetes
0 likes · 9 min read
Efficient Ops
May 11, 2025 · Operations

Essential Ops Engineer Toolkit: Must‑Have Tools for Monitoring, Automation, and Troubleshooting

This article presents a comprehensive, scenario‑driven toolbox for operations engineers, covering core SSH utilities, monitoring stacks, automation platforms, log management, network diagnostics, and emerging AI‑augmented practices to help teams select the right tools for modern infrastructure.

Automation · DevOps · infrastructure
0 likes · 9 min read
Alibaba Cloud Infrastructure
Apr 30, 2025 · Artificial Intelligence

Exploring and Practicing a Unified Compute Network for AI at Zuoyebang: Building an Innovation Engine for the AI Era

This article summarizes a presentation by Dong Xiaocong, Zuoyebang's infrastructure leader, on the challenges of AI inference demand and supply, and describes the design and implementation of a unified compute network, including trusted networking, multi‑region container scheduling, and traffic routing, to efficiently serve large‑scale AI models.

AI · Compute Network · Model Distribution
0 likes · 9 min read
Efficient Ops
Apr 16, 2025 · Operations

Top 10 Essential Ops Tools Every Engineer Should Master

This article introduces ten indispensable operations engineering tools—Shell scripts, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing their functions, suitable scenarios, advantages, and real‑world examples, plus sample code snippets to help engineers automate and monitor infrastructure efficiently.

Automation · DevOps · infrastructure
0 likes · 9 min read
Efficient Ops
Jan 20, 2025 · Operations

12 Essential Operations Roles Every Tech Team Needs

Operations is the backbone of modern digital services, and this article breaks down twelve distinct roles—from implementation and system ops to DevOps, big‑data, security, and cloud—explaining their core responsibilities and how they keep online platforms reliable, efficient, and secure.

DevOps · cloud · infrastructure
0 likes · 7 min read
DevOps
Jan 16, 2025 · Operations

Infrastructure as Code (IaC) in DevOps: Solving Modern Operations Challenges

This article explains how Infrastructure as Code (IaC) addresses the inefficiencies of traditional operations, improves collaboration between development and operations, and provides practical steps for implementing IaC within a DevOps workflow to achieve automation, consistency, and faster software delivery.

Automation · DevOps · IaC
0 likes · 10 min read
Bilibili Tech
Dec 20, 2024 · Operations

Evolution of Bilibili's Server Provisioning System: From Traditional PXE to BiliOS and iPXE

To cope with rapid growth, Bilibili replaced its inflexible PXE workflow with a hybrid system using in‑memory BiliOS and iPXE, adding out‑of‑band management, declarative configuration, and multi‑scenario support, which together dramatically boosted provisioning automation, reliability, and efficiency across its data‑center and edge servers.

BiliOS · Deployment · PXE
0 likes · 17 min read
DevOps Engineer
Oct 29, 2024 · Operations

A Day in the Life of a DevOps Engineer

The article walks through a DevOps engineer’s typical workday, from morning Slack checks and task planning, through code repository maintenance, build and release duties, coffee breaks, lunch with teammates, focused afternoon development, and evening family time, highlighting both technical and personal aspects.

Automation · Build · CI/CD
0 likes · 4 min read
Bilibili Tech
Oct 25, 2024 · Operations

Bilibili Data Center Migration: Planning, Execution, and Lessons Learned

Bilibili’s 18‑month, multi‑regional data‑center migration moved tens of thousands of servers using a high‑frequency rolling strategy, combining meticulous planning, cross‑team coordination, automated rack placement and rigorous checklists to achieve significant cost savings, higher utilization, improved stability and greener operations.

Automation · Capacity Planning · data center migration
0 likes · 21 min read
Selected Java Interview Questions
Oct 7, 2024 · Operations

Top 10 Tools Frequently Used by Operations Engineers: Features, Use Cases, and Practical Examples

This article introduces ten essential tools for operations engineers—Shell scripts, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing each tool's functionality, typical scenarios, advantages, and real‑world examples with code snippets for practical automation and monitoring.

Automation · Configuration Management · Containerization
0 likes · 8 min read
DevOps Engineer
Oct 1, 2024 · Operations

What a Chief DevOps Engineer Does: Responsibilities, Required Skills, and Business Benefits

The article explains the role of a chief DevOps engineer, outlining core duties such as infrastructure design, automation, and cultural leadership, the essential technical and soft‑skill requirements, and the advantages this position brings to an organization’s efficiency, reliability, and collaboration.

Automation · Chief Engineer · DevOps
0 likes · 6 min read
Model Perspective
Aug 26, 2024 · Fundamentals

How Coupling and Coordination Models Reveal Gaps in Rural Infrastructure Development

Using coupling and coordination degree models, this article explains why new rural infrastructure alone often fails to improve living standards, illustrates how to quantify mismatches between infrastructure and public services, and offers policy recommendations for balanced, harmonious development.

coordination model · coupling model · infrastructure
0 likes · 5 min read
IT Services Circle
Aug 21, 2024 · Operations

Analysis of NetEase Cloud Music Outage on August 19: Infrastructure Failure and Operational Lessons

On August 19, NetEase Cloud Music suffered a severe infrastructure‑related outage that prevented user login, playlist loading, and song search. The incident prompted a two‑hour recovery effort and a brief free‑membership compensation, and it highlights the critical role of proper change management, gray releases, disaster recovery, and cross‑functional coordination in large‑scale services.

NetEase Cloud Music · disaster recovery · gray release
0 likes · 6 min read
DevOps Operations Practice
Jul 30, 2024 · Cloud Computing

Kubernetes vs OpenStack: A Comprehensive Comparison of Features, Use Cases, and Technical Architecture

This article provides an in‑depth comparison of Kubernetes and OpenStack, covering their core features, typical use cases, architectural differences, installation complexity, ecosystem support, and guidance on selecting the right platform for specific cloud computing needs.

Cloud Computing · Container Orchestration · Kubernetes
0 likes · 7 min read
Architects' Tech Alliance
Jul 28, 2024 · Artificial Intelligence

Design and Optimization Practices for Intelligent Computing Platforms in the Era of Large Models

The article examines the new characteristics, challenges, and technical practices of intelligent computing platforms required for large‑model AI workloads, covering infrastructure adaptation, heterogeneous scheduling, application acceleration, operation reliability, and future directions for simplifying GPU usage and connecting heterogeneous resources.

AI Platform · Large Models · infrastructure
0 likes · 6 min read