Tagged articles

infrastructure

371 articles · Page 1 of 4

Jul 2, 2026 · Industry Insights

Design Insight: Why the 130‑meter Electric Ferry Redefines More Than Its Engine

China Zorrilla, a 130‑meter electric Ro‑Pax ferry built by Incat, demonstrates that true innovation lies in redesigning the entire route system—integrating hull, battery, shore power, and passenger experience—rather than merely swapping a diesel engine for electric propulsion.

battery integration · electric ferry · infrastructure

0 likes · 11 min read

Machine Heart

Jun 29, 2026 · Artificial Intelligence

Open‑Source AI‑Infra Ops Agent Benchmark Powered by Hundreds of Billions of Real Data

The article introduces AISHPerf, the first open‑source benchmark for AI‑infra operations agents built on nearly a hundred‑billion real‑world ops records, detailing its data pipeline, multi‑layer coverage, evaluation metrics, experimental results that show current models lag behind human experts, and future plans to expand and refine the benchmark.

AI Ops · Evaluation Metrics · Fault Injection

0 likes · 16 min read

SuanNi

Jun 16, 2026 · Industry Insights

Anthropic’s Talent Profile: Data Shows They Prefer Infrastructure Veterans Over Scientists

An analysis of 1,680 Anthropic engineer resumes reveals that the company prioritizes senior infrastructure builders (mostly with 12+ years of experience and drawn from Google, Meta, and other FAANG firms) over PhDs or pure research scientists, highlighting a rapid team expansion and distinct hiring patterns.

AI Engineering · Anthropic · FAANG

0 likes · 11 min read

AI Engineering

Jun 15, 2026 · Industry Insights

What Types of Engineers Anthropic Hires: Insights from 1,680 Resumes

An analysis of 1,680 Anthropic engineers shows the company rapidly built a large infra‑focused team, hiring mostly senior staff with a median of 12.2 years of experience, sourcing talent primarily from Google and other FAANG firms, while junior hires are rare and highly selective.

AI industry · Anthropic · FAANG talent

0 likes · 9 min read

Machine Heart

Jun 15, 2026 · Industry Insights

Anthropic’s Hiring Secret: 1680 Engineer Resumes Reveal a Preference for Senior Infrastructure Veterans

An analysis of 1,680 public LinkedIn resumes shows that Anthropic has grown threefold in 18 months by hiring mostly senior engineers with extensive infrastructure experience, sourcing talent primarily from Google and other large‑scale tech firms, while largely ignoring fresh graduates and PhDs.

AI hiring · Anthropic · LinkedIn analysis

0 likes · 10 min read

Tech Freedom Circle

Jun 9, 2026 · Artificial Intelligence

Deep Dive into Harness’s Sandbox Infra: How Deep Agents Enable Secure AI Execution

This article provides a detailed technical analysis of Harness’s Sandbox infrastructure, explaining how Deep Agents’ sandbox backend isolates file operations and command execution, the role of the single execute() entry point, security guarantees, lifecycle management, and practical integration steps for Docker, Kubernetes, or commercial sandbox providers.

AI · Deep Agents · Harness

0 likes · 39 min read

Software Engineering 3.0 Era

May 24, 2026 · Artificial Intelligence

The 6 Essential Components of an Effective AI Harness System

The article breaks down AI Harness Engineering into six indispensable parts—prompt system, tools & skills, infrastructure, orchestration logic, hooks & middleware, and model configuration—explaining their roles, concrete examples, common pitfalls, and how they together turn a powerful base model into a reliable, scalable workplace assistant.

AI Harness · Model Configuration · Orchestration

0 likes · 11 min read

DataFunTalk

May 13, 2026 · Industry Insights

Why Palantir’s Value Is Rising: AI Commoditization, Ontology, and 85% Q1 Revenue Growth

As large‑model capabilities become commoditized, Palantir argues that the true moat lies in its ontology‑driven infrastructure, which integrates business semantics to ensure reliable AI in high‑risk contexts, a strategy reflected in its 85% Q1 revenue jump and a three‑layer AI competition model.

AI commoditization · AI competition · Enterprise AI

0 likes · 11 min read

SuanNi

May 9, 2026 · Industry Insights

Is AI a Bubble? Goldman Sachs Projects $7.6 Trillion AI Infrastructure Over the Next Five Years

Goldman Sachs’ analysis models AI capital expenditure, showing a $7.6 trillion cumulative investment from 2026 to 2031 and highlighting four key variables (chip lifespan, data‑center costs, architecture mix, and construction delays) that determine whether AI infrastructure spending will expand or contract.

AI · CapEx · Chip Lifecycle

0 likes · 13 min read

Digital Planet

May 2, 2026 · Industry Insights

Can AI Actually Lower Enterprise Digitalization Costs?

While many executives believe AI will slash the expenses of digital transformation, the article reveals hidden infrastructure, integration, talent, and ongoing operational costs that often turn AI into a cost‑shifting tool rather than a true cost‑saving solution, especially for core system projects.

AI · Enterprise · Operations

0 likes · 9 min read

Baidu App Technology

Apr 27, 2026 · Artificial Intelligence

Boosting End-to-End Efficiency with AI: From Single-Point Gains to Full Process Integration

The YoyoMan Drama team details how they transformed their product development pipeline by building Prompt‑friendly PRDs, design‑as‑code, AI coding infrastructure, and AI agents, creating a seamless “requirement‑design‑development‑test” loop that shifts work from manual effort to AI‑augmented processes across the entire workflow.

AI · AI coding · Design Automation

0 likes · 27 min read

Top Architect

Apr 24, 2026 · Industry Insights

How Much Bandwidth Does Douyin Need to Support Hundreds of Millions of Simultaneous Users?

Douyin, along with other Chinese internet giants, operates data centers with T‑level (≈1 TB/s) outbound bandwidth, hosts over 200,000 servers, and relies on dual‑link designs, CDN acceleration, and multi‑node load balancing to deliver seamless video streams to hundreds of millions of concurrent users.

ByteDance · CDN · Data Center

0 likes · 9 min read

Ray's Galactic Tech

Apr 19, 2026 · Cloud Native

Building a Production‑Ready Cloud‑Native Kubernetes Platform: From Zero to SRE Success

This article presents a step‑by‑step guide to designing and implementing a production‑grade Kubernetes platform with GitOps, observability, capacity governance, fault‑injection, and SRE practices, showing how to achieve unified delivery, reliability, and low‑cost operation for high‑concurrency business services.

Cloud Native · GitOps · Observability

0 likes · 37 min read

ITPUB

Apr 17, 2026 · Industry Insights

Why LinkedIn Dumped Kafka for Its Own ‘Northguard’ Streaming Engine

LinkedIn, the original home of Apache Kafka, abandoned the platform for a home‑grown system called Northguard, redesigning log storage, decentralizing metadata, and adding a virtualized Xinfra layer to handle trillions of daily events, while still acknowledging Kafka’s relevance for most companies.

LinkedIn · Northguard · Streaming

0 likes · 7 min read

AI Code to Success

Apr 13, 2026 · Industry Insights

Why Anthropic’s Managed Agents Redefine AI Agent Runtime

Anthropic’s Managed Agents transform the cumbersome agent runtime into a modular, production‑ready infrastructure by decoupling the brain, hands, and session layers, improving reliability, security, and performance while offering developers a clear path to build long‑running AI workflows.

AI Agents · Agent Runtime · Anthropic

0 likes · 10 min read

IT Services Circle

Apr 7, 2026 · Industry Insights

How a Single 8 GB Server Powered 500 K Users for 15 Years – The Webminal Story

Webminal, a free online Linux learning platform, has survived for fifteen years on a single 8 GB CentOS server, serving over half a million users by using a minimalist stack—including Python 2.7, Flask, Shellinabox, User Mode Linux and eBPF—while deliberately avoiding modern container orchestration and commercial monetisation.

Case Study · Online Linux · Shellinabox

0 likes · 10 min read

PaperAgent

Mar 29, 2026 · Industry Insights

From Reasoning to Agentic Thinking: How Harnesses Are Redefining AI Development

The article examines the shift from traditional reasoning‑based large‑language‑model pipelines to agentic, harness‑driven AI systems, outlining the definition of a harness, its engineering challenges, architectural components, and the broader implications for training, reinforcement learning, and future research directions.

AI Harness · Intelligent agents · Model Training

0 likes · 16 min read

Model Perspective

Mar 25, 2026 · Industry Insights

Can a 126 km Bridge Make Beijing‑Taipei Driveable? Graph & Game Theory Insights

The article analyzes the proposed G3 Beijing‑Taipei highway, modeling the missing 126‑km Taiwan Strait crossing as a bridge in a graph, estimating network benefits with a gravity model, and applying game‑theoretic signaling to assess the political feasibility of such a connection.

Game Theory · graph theory · infrastructure

0 likes · 7 min read

AI Waka

Mar 25, 2026 · Information Security

How NemoClaw Secures Autonomous AI Agents with Kernel‑Level Sandboxing

This article examines NemoClaw’s three‑layer architecture that adds kernel‑level sandboxing, policy‑driven deployment, and flexible inference routing to OpenClaw, outlines installation steps, compares it with the native OpenClaw runtime, and discusses current limitations for production use.

AI Agent Security · NemoClaw · OpenShell

0 likes · 9 min read

dbaplus Community

Mar 23, 2026 · Operations

How a Single AI‑Driven Command Wiped 2.5 Years of Production Data

In this detailed post‑mortem, Alexey Grigorev recounts how using Claude Code to automate a Terraform deployment unintentionally erased his entire production environment and two‑and‑a‑half years of data, exposing the risks of over‑reliance on AI‑driven automation and highlighting essential safeguards.

AI · AWS · Automation

0 likes · 11 min read

ByteDance Data Platform

Mar 13, 2026 · Artificial Intelligence

Beyond Parameters: How ClawLake Turns Agent Memory into Enterprise‑Level AI Infrastructure

The article explains why an AI agent's capabilities are limited by memory depth rather than model size, reviews three historical memory architectures, highlights their structural shortcomings, and details how the ClawLake solution provides a multi‑layer, multimodal, enterprise‑grade memory infrastructure for OpenClaw agents.

AI · Agent · Enterprise

0 likes · 17 min read

Efficient Ops

Mar 11, 2026 · Operations

How an AI‑Powered Terraform Command Erased 2 Million Records – Lessons for Safe Ops

A single Terraform command executed by the AI assistant Claude Code mistakenly destroyed a production database of over two million records, exposing how over‑reliance on AI, missing state files, weak backup practices, and absent deletion protection can cause massive outages and what safeguards can prevent such incidents.

AI Ops · AWS · Incident Management

0 likes · 6 min read

dbaplus Community

Mar 1, 2026 · Operations

50 High‑Impact IT Operations Projects to Supercharge Your Resume

This guide presents 50 detailed IT operations projects—covering infrastructure, cloud native, automation, monitoring, security, databases, networking, disaster recovery, and DevOps—each with background, tech stack, implementation steps, and quantifiable results to help engineers craft compelling, results‑driven resume entries.

Automation · Cloud · infrastructure

0 likes · 25 min read

Architects' Tech Alliance

Jan 23, 2026 · Artificial Intelligence

What Are the Top 10 Global Computing Power Trends Shaping AI by 2026?

The Global Computing Alliance’s 2026 report outlines ten transformative trends—from explosive AI compute growth and the rise of supernodes to embodied intelligence, heterogeneous architectures, network‑centric designs, and the imminent commercialization of quantum computing—showing how compute power is becoming the strategic engine of the digital economy.

AI · computing power · digital economy

0 likes · 12 min read

DevOps Coach

Jan 20, 2026 · Cloud Native

How to Scale Kubernetes to Hundreds of Clusters: A Practical Enterprise Guide

This article walks you through the complete journey from a single Kubernetes cluster to a production‑grade, multi‑cluster platform, covering managed services, capacity planning, GitOps pipelines, networking, observability, cost optimisation, upgrade strategies, and the people and processes needed for sustainable large‑scale operations.

Cloud Native · Observability · cost management

0 likes · 27 min read

Alibaba Cloud Infrastructure

Jan 4, 2026 · Cloud Native

How OpenKruise Agents Enable Scalable AI Agent Sandboxes on Kubernetes

The article explains how OpenKruise Agents, an open‑source project from Alibaba Cloud, provides a cloud‑native sandbox infrastructure for AI agents on Kubernetes, detailing its architecture, lifecycle management, security challenges, resource pooling, and future roadmap for AI‑driven workloads.

AI Agent · Cloud Native · OpenKruise

0 likes · 17 min read

Raymond Ops

Dec 28, 2025 · Operations

From Zero to Production: Ansible Playbook Design Patterns & Best Practices

This guide walks you through building a production‑grade Ansible automation framework—from identifying common manual‑deployment pain points to defining layered architecture, directory conventions, reusable playbook patterns, high‑availability deployments, performance optimizations, monitoring, security hardening, CI/CD integration, and troubleshooting tips—empowering teams to achieve reliable, scalable operations.

Ansible · Automation · CI/CD

0 likes · 14 min read

DevOps Coach

Dec 24, 2025 · Operations

Will AI Replace Your DevOps Skills? Future‑Proof Your Career Today

The article explains how AI is rapidly automating traditional DevOps tasks—troubleshooting, configuration management, and toolchain mastery—forcing engineers to shift from manual expertise to outcome‑oriented orchestration, and outlines three pillars for building an AI‑native DevOps career.

AI · Automation · Career

0 likes · 8 min read

Raymond Ops

Dec 21, 2025 · Operations

Mastering Ansible: Deep Dive into Architecture, Modules, and Enterprise Automation

This comprehensive guide explains Ansible's agentless architecture, core components, module taxonomy, custom module development, performance tuning, large‑scale design patterns, real‑world LAMP deployment, monitoring integration, and future cloud‑native and AI‑driven trends, providing actionable steps for DevOps engineers.

Ansible · configuration management · infrastructure

0 likes · 15 min read

DevOps Engineer

Dec 10, 2025 · Operations

DevOps Tools as a Car Factory: Packer, Terraform, Ansible, Docker, Kubernetes

The article uses a car‑factory analogy to clarify the distinct roles of DevOps tools—Packer for image building, Terraform for infrastructure provisioning, Ansible for configuration, Docker for containerized applications, and Kubernetes for large‑scale orchestration—showing how they fit into build, provision, and run phases of the IT lifecycle.

Ansible · Docker · Packer

0 likes · 8 min read

Past Memory Big Data

Dec 9, 2025 · Artificial Intelligence

A Decade of Evolution: Inside Pinterest’s AI Platform Journey

Over ten years, Pinterest transformed a fragmented machine‑learning stack into a unified AI platform, iterating through stages from early ad‑hoc pipelines to scalable GPU‑accelerated services, while learning that timing, organizational alignment, and efficiency are crucial for lasting impact.

AI platform · GPU inference · ML Ops

0 likes · 25 min read

21CTO

Dec 3, 2025 · Operations

What My Biggest Developer Mistakes Taught Me About Operations and Resilience

A software engineer recounts three major mistakes—from accidentally deleting thousands of F5 URLs to leaking code externally and being laid off during COVID—highlighting how operational oversights, poor process controls, and personal resilience shape professional growth and underscore the value of empathy and systematic safeguards.

Resilience · failure · infrastructure

0 likes · 14 min read

Amazon Cloud Developers

Dec 3, 2025 · Cloud Computing

The Road to Billions of AI Agents: Key Takeaways from Matt Garman’s re:Invent 2025 Keynote

At AWS re:Invent 2025, CEO Matt Garman outlined four essential pillars for building AI agents, unveiled three frontier agents, introduced the Amazon Nova 2 model series and 25 major cloud service innovations, and argued that billions of agents will soon deliver ten‑fold efficiency gains across enterprises.

AI Agents · AWS · Cloud Computing

0 likes · 20 min read

DevOps Coach

Nov 27, 2025 · Cloud Native

When Kubernetes Is Overkill: A Practical Guide for Small Teams

This article examines why Kubernetes often adds unnecessary complexity for tiny startups, outlines the hidden costs of its operational overhead, and offers concrete alternatives and step‑by‑step advice for when to adopt or avoid container orchestration.

Cloud Native · Platform Engineering · devops

0 likes · 12 min read

Alimama Tech

Nov 26, 2025 · Artificial Intelligence

How Alibaba’s ROCK & ROLL Enable Scalable Agentic AI Training

Alibaba’s open‑source ROCK environment sandbox and the ROLL reinforcement‑learning engine together provide a standardized, high‑throughput training loop that lets developers scale Agentic AI from a single machine to thousands of parallel instances while simplifying debugging and resource management.

Agentic AI · Scalable Training · infrastructure

0 likes · 12 min read

MaGe Linux Operations

Nov 17, 2025 · Operations

Production-Ready Prometheus Alerting: 50+ Core Metrics & Best Practices

This guide details production‑grade Prometheus alerting configurations, covering applicable scenarios, prerequisites, anti‑patterns, environment matrices, step‑by‑step deployment of Node Exporter, Prometheus and Alertmanager, comprehensive rule files, performance testing, troubleshooting, best practices, and ready‑to‑use scripts for backup and health checks.

Alerting · Ops · infrastructure

0 likes · 51 min read

dbaplus Community

Nov 15, 2025 · Operations

What 20 Years of Ops Mishaps Reveal About Infrastructure Resilience

A chronicle of real‑world operations incidents from 2003 to 2024 shows how simple mistakes—mis‑configured passwords, unplugged cables, hardware mix‑ups, and rushed cloud migrations—can cascade into massive outages, offering hard‑earned lessons for anyone managing production systems.

Case Study · Operations · incident

0 likes · 11 min read

IT Services Circle

Nov 7, 2025 · Artificial Intelligence

Why Microsoft’s GPU Fleet Is Sitting Idle – The Power Crisis Behind AI’s Growth

Microsoft CEO Satya Nadella admits that the tech giant’s massive stock of Nvidia GPUs is sitting idle due to insufficient electricity and a lack of ready‑to‑use data‑center facilities, highlighting a broader industry shift in which AI’s soaring compute demand is now constrained by power and infrastructure limits.

AI · Cloud Computing · Data Centers

0 likes · 8 min read

Alibaba Cloud Infrastructure

Oct 29, 2025 · Cloud Native

How Container Services Are Powering the AI Agent Revolution

The article reviews Alibaba Cloud's container service advancements, highlights AI-driven trends such as intelligent agents reshaping applications, the migration of AI infrastructure to cloud‑native platforms, and showcases four customer case studies demonstrating massive efficiency gains and the emergence of containers as the operating system for the AI era.

AI · AI Agents · Cloud Native

0 likes · 6 min read

ITPUB

Oct 28, 2025 · Operations

50 Powerful IT Ops Projects to Supercharge Your Resume

This article compiles 50 detailed IT operations projects across infrastructure, cloud, containers, automation, monitoring, security, databases, networking, disaster recovery and DevOps, each with scenario, tech stack, implementation steps and quantifiable results to help you craft standout résumé entries.

Automation · Cloud · IT Operations

0 likes · 30 min read

Amazon Cloud Developers

Oct 13, 2025 · Artificial Intelligence

Agentic AI Guide: Building and Deploying Robust AI Agents

This article provides a comprehensive technical guide on Agentic AI, detailing the core modules, infrastructure requirements, security considerations, observability practices, and deployment strategies needed to develop and operate production‑ready AI agents.

AI Agents · AgentOps · Agentic AI

0 likes · 27 min read

MaGe Linux Operations

Oct 8, 2025 · Operations

Build an Enterprise‑Grade DevOps Pipeline in 7 Days: Hands‑On Guide + Ready‑to‑Use Scripts

This step‑by‑step guide shows how to create a full‑stack, enterprise‑level DevOps CI/CD pipeline—from environment setup and Docker installation to Jenkins pipeline scripts, Kubernetes deployments, monitoring, security hardening, and cost‑optimisation—enabling teams to reduce release cycles from days to minutes within a week.

Automation · CI/CD · Docker

0 likes · 38 min read

dbaplus Community

Oct 7, 2025 · Operations

Why Ops Engineers Need a Massive Knowledge Stack—and How to Master It

This comprehensive guide explains why modern operations engineers must cover the full technology stack, outlines common learning pitfalls, presents a three‑layer, nine‑domain knowledge framework, and offers a step‑by‑step, personalized roadmap with practical labs and career‑growth advice.

Automation · Operations · career development

0 likes · 14 min read

21CTO

Oct 5, 2025 · Artificial Intelligence

Anthropic Appoints Former Stripe Exec Rahul Patil as CTO Amid AI Infrastructure Race

Anthropic has named former Stripe senior executive Rahul Patil as its new CTO, reshaping its engineering structure to tighten product, infrastructure, and inference teams while facing intense AI infrastructure competition from OpenAI and Meta, and imposing new usage limits on its Claude services.

AI · Anthropic · CTO

0 likes · 4 min read

MaGe Linux Operations

Oct 2, 2025 · Operations

How Ansible Can Deploy 100 Servers in 10 Minutes: A Hands‑On Guide

This article explains why Ansible is the preferred automation tool, outlines its core advantages and architecture, and provides a step‑by‑step, code‑rich tutorial—from installing the control node and configuring SSH keys to writing inventories, ad‑hoc commands, Playbooks, Roles, and a real‑world 100‑server deployment case—showing how to achieve massive scaling with minimal effort.

Ansible · configuration management · infrastructure

0 likes · 29 min read

DevOps Coach

Oct 1, 2025 · Operations

10 Hard‑Earned Infrastructure Lessons Every Engineer Should Know

Drawing from real incidents like SQLite crashes, missing logs, unthrottled APIs, slow container startups, queue bottlenecks, network partitions, unreliable clocks, and weak alerts, this article shares ten concrete infrastructure lessons with code examples, performance data, and practical recommendations to avoid costly pitfalls.

Go · Operations · devops

0 likes · 8 min read

Alibaba Cloud Infrastructure

Oct 1, 2025 · Artificial Intelligence

How Alibaba Cloud’s AI Infra Predictable Network Is Redefining Large-Model Training

The 2025 Alibaba Cloud Apsara Conference (Yunqi) in Hangzhou showcased the AI Infra Predictable Network forum, unveiling next‑generation high‑performance networking architectures and ecosystem collaborations that are reshaping AI infrastructure for large‑model training and inference.

AI · Cloud · Network

0 likes · 6 min read

DataFunTalk

Sep 23, 2025 · Artificial Intelligence

Nvidia and OpenAI Launch the World’s Largest AI Compute Project

Nvidia and OpenAI have forged a strategic partnership to deploy at least 10 GW of GPU power—equivalent to millions of GPUs—with up to $100 billion in investment, marking the biggest AI infrastructure effort ever and promising transformative impacts across industries.

AI · GPU compute · NVIDIA

0 likes · 5 min read

ITFLY8 Architecture Home

Sep 19, 2025 · Fundamentals

What Is Technical Architecture? Core Components and Design Principles Explained

This article explains the definition of technical architecture, outlines its core components such as technical services, components, infrastructure and selection, and illustrates how it supports business, application, and data architectures through layered system diagrams and practical examples.

Enterprise Architecture · Middleware · System Design

0 likes · 3 min read

Ops Development & AI Practice

Sep 2, 2025 · Information Security

How a Tiny XSS Bug in Dev Environments Can Compromise Production Secrets

The article reveals how a seemingly harmless XSS flaw in an internal development platform can be weaponized to steal high‑privilege credentials, pivot across internal services, and ultimately breach production systems, urging teams to treat development environments as critical security frontiers.

Application Security · Credential Theft · DevOps Security

0 likes · 9 min read

Wuming AI

Aug 26, 2025 · Artificial Intelligence

A Layered Overview of Agentic AI: From LLM Foundations to Multi‑Agent Systems

This article presents a hierarchical breakdown of Agentic AI, detailing the foundational large language models, the capabilities of AI agents, the coordination mechanisms of multi‑agent systems, and the supporting infrastructure needed for reliability, scalability, and security.

AI Agents · Agentic AI · LLM

0 likes · 5 min read

AI Info Trend

Aug 25, 2025 · Industry Insights

Can China’s Power Edge Overtake the AI Race? Insights into xAI’s Strategy

Elon Musk claims xAI will soon outpace most rivals, yet the real threat comes from Chinese firms whose superior power supply and hardware manufacturing give them a decisive advantage in the global AI competition, reshaping the industry's infrastructure dynamics.

AI competition · China · Energy

0 likes · 7 min read

Alibaba Cloud Infrastructure

Aug 8, 2025 · Cloud Native

How Cloud‑Native Architecture Powered Lingxi Interactive’s Global Game Expansion

Lingxi Interactive transformed its overseas game publishing by adopting a cloud‑native, ACK‑based infrastructure with a three‑layer design, automated scaling, integrated storage, full‑stack observability, and FinOps practices, achieving higher stability, efficiency, and over 40% cost reduction.

FinOps · Game Development · cloud-native

0 likes · 11 min read

MaGe Linux Operations

Jul 29, 2025 · Operations

How to Build a Production‑Ready Ansible Automation System from Scratch

This comprehensive guide walks you through the pain points of traditional operations and presents a layered, role‑driven Ansible architecture with design patterns, high‑availability deployment examples, performance tweaks, monitoring, security best practices, CI/CD integration, and debugging techniques for building a production‑grade automation framework.

Ansible · infrastructure · playbook

0 likes · 12 min read

Baobao Algorithm Notes

Jul 28, 2025 · Industry Insights

Why AWS Bedrock AgentCore Signals a New Era for Agentic AI Infrastructure

The article analyzes AWS Bedrock AgentCore and related hardware and software requirements for Agentic AI, covering runtime isolation with microVMs, memory architectures, identity and gateway design, zero‑trust networking, and the challenges of multi‑tenant KVCache and context engineering.

AWS Bedrock · Agentic AI · Memory Management

0 likes · 15 min read

Open Source Linux

Jul 23, 2025 · Operations

Master Ansible Playbooks: From Basics to Large‑Scale Cluster Automation

This comprehensive guide walks you through Ansible fundamentals, core components, advanced playbook design, variable management, role architecture, error handling, large‑scale deployment strategies, performance tuning, security hardening, CI/CD integration, and monitoring, empowering you to automate modern infrastructure efficiently.

Ansible · configuration management · devops

0 likes · 14 min read

Efficient Ops

Jul 21, 2025 · Operations

30 Must‑Have DevOps Skills to Boost Your Resume in 2025

This article outlines 30 essential DevOps competencies—from foundational infrastructure and cloud/container orchestration to automation, monitoring, security, and AI‑driven operations—detailing key technologies, real‑world scenarios, and measurable impact, helping professionals craft a standout resume in the evolving operations landscape.

AI Ops · Automation · Cloud

0 likes · 8 min read

Ops Development & AI Practice

Jul 21, 2025 · Industry Insights

Why Building a DEX Is Far More Than Writing Smart Contracts

Running a decentralized exchange requires extensive pre‑launch development, rigorous security audits, robust front‑end design, tokenomics planning, and continuous post‑launch operations such as infrastructure maintenance, security monitoring, liquidity management, and community governance.

DEX · DeFi · Security Audits

0 likes · 7 min read

Ops Development & AI Practice

Jul 19, 2025 · Industry Insights

Step‑by‑Step Guide to Deploying Your Own Solana Testnet Validator

This tutorial walks you through hardware requirements, environment setup, key generation, and command‑line configuration needed to launch and verify a Solana testnet validator node, providing practical tips for cloud or bare‑metal servers.

CLI · Node Setup · Solana

0 likes · 9 min read

IT Architects Alliance

Jul 11, 2025 · Fundamentals

How Do China and the U.S. Stack Up in Tech Infrastructure, Cloud, and AI?

This article compares China and the United States across infrastructure, cloud computing, artificial intelligence, key technologies, innovation ecosystems, and standards, highlighting each nation's strengths, strategic approaches, and the evolving balance of competition and cooperation in global technology development.

China · Cloud Computing · United States

0 likes · 9 min read

Ops Development & AI Practice

Jul 7, 2025 · Cloud Computing

Why Infrastructure Architecture Is the Hidden Backbone of Modern Cloud Systems

Infrastructure architecture, the often‑overlooked foundation of IT, defines how compute, storage, networking, and security are designed, integrated, and automated—linking software, ops, and cloud strategies—through processes like requirement analysis, technology selection, IaC implementation, and continuous optimization for reliability, performance, cost, and operational excellence.

IaC · Operations · cloud architecture

0 likes · 8 min read

Architects' Tech Alliance

Jul 6, 2025 · Fundamentals

Mastering Data Center Essentials: 100 Core Concepts You Must Know

This comprehensive guide walks you through 100 essential data‑center concepts—from basic definitions, tier standards, and modular design to networking layers, storage architectures, compute resources, security measures, operational practices, energy efficiency, emerging technologies, and industry ecosystem—providing a complete knowledge framework for modern digital infrastructure.

Compute · Data Center · infrastructure

0 likes · 21 min read

Alipay Experience Technology

Jul 3, 2025 · Artificial Intelligence

How MCP Transforms Agent Development: From Complex Tools to Plug‑and‑Play

This talk explains the Model Context Protocol (MCP), how it simplifies agent tool integration by replacing numerous custom interfaces with a single standardized protocol, and details its adoption, architecture, security, and future directions within Ant Group's ecosystem.

AI · Agent · LLM

0 likes · 21 min read

Efficient Ops

Jun 24, 2025 · Operations

Essential Ops Toolkit: 58 Core Tools for Modern Infrastructure Management

This article compiles a comprehensive matrix of 58 mainstream operations tools—covering operating systems, open‑source mirrors, containers, AI‑assisted ops, basic services, databases, monitoring, automation, CI/CD and service mesh—to help engineers quickly locate the right technology stack for efficient infrastructure management.

Cloud · Operations · devops

0 likes · 6 min read

ITPUB

Jun 17, 2025 · Artificial Intelligence

Why Private Cloud Is the Best Choice for Enterprise AI Deployment

The article examines why private‑cloud infrastructure, rather than public‑cloud services, offers enterprises better cost control, data sovereignty, customization, and security for building AI‑ready platforms, and outlines five core capabilities needed to achieve this.

AI · Enterprise · Orchestration

0 likes · 11 min read

Smart Era Software Development

Jun 5, 2025 · Artificial Intelligence

How AI Is Taking Over Your IDE: A 5‑Stage Roadmap for Agent‑Native Infrastructure

The article argues that AI’s ultimate goal is not just to write code faster than humans but to control the entire software lifecycle, and it proposes a five‑stage L0‑L5 maturity model for AI‑native infrastructure that moves from simple code generation to a full Agent‑Native operating system.

AI · Agent · Result-as-a-Service

0 likes · 32 min read

Efficient Ops

Jun 3, 2025 · Operations

What Anthropic’s SRE Team Learned: 23 Practical Ops Tips for Scalable AI Infrastructure

This article shares Anthropic’s SRE engineer insights on 23 actionable practices—from schema migration and Karpenter node management to OpenTelemetry adoption, Helm chart storage, and Terraform versus CloudFormation—offering concrete recommendations for building reliable, cost‑effective AI and cloud‑native platforms.

Cloud Native · SRE · devops

0 likes · 12 min read

Raymond Ops

May 27, 2025 · Fundamentals

Understanding Block, File, and Object Storage: Pros, Cons, and Use Cases

This article explains the concepts, advantages, and disadvantages of block storage, file storage, and object storage, compares their architectures, and clarifies when each type is appropriate for different applications and workloads.

block storage · file storage · infrastructure

0 likes · 10 min read

Bilibili Tech

May 27, 2025 · Operations

Automated Server Fault Detection and Repair: Architecture, Methods, and Future Outlook

This article presents a comprehensive overview of server fault management at scale, detailing the classification of failures, shortcomings of traditional manual processes, and the design of an automated detection and repair system that combines in‑band and out‑of‑band data collection, rule‑based alerting, and end‑to‑end repair workflows, while also outlining future directions for intelligent monitoring and reliability.

Automation · Monitoring · Operations

0 likes · 17 min read

Efficient Ops

May 21, 2025 · Operations

Why We Dropped Kubernetes: Cutting Costs by 62% and Boosting DevOps Happiness

Six months after abandoning Kubernetes, our DevOps team reduced infrastructure spend by 62%, cut deployment time by 89%, eliminated weekend on‑call duties, and improved overall happiness, demonstrating that simplifying the tech stack can deliver substantial operational and business benefits.

Operations · cost reduction · devops

0 likes · 9 min read

Architects' Tech Alliance

May 18, 2025 · Industry Insights

What Drives the Rise of AI Data Centers? A Deep Dive into Architecture, Market and Impact

This article analyzes the concept, core functions, industry chain, infrastructure components, and rapid market growth of AI Data Centers, highlighting their distinction from traditional data centers, regional concentration in eastern China, and projected investment and compute capacity through 2028.

AI · Cloud Computing · Data Center

0 likes · 10 min read

Alibaba Cloud Infrastructure

Apr 30, 2025 · Artificial Intelligence

Exploring and Practicing a Unified Compute Network for AI at Zuoyebang: Building an Innovation Engine for the AI Era

This article summarizes a presentation by Dong Xiaocong, Zuoyebang's infrastructure lead, on the mismatch between AI inference demand and supply, and describes the design and implementation of a unified compute network—including trusted networking, multi‑region container scheduling, and traffic routing—to efficiently serve large‑scale AI models.

AI · Model Distribution · compute network

0 likes · 9 min read

MaGe Linux Operations

Apr 17, 2025 · Operations

Top 10 Essential Ops Tools Every Engineer Should Master

This article introduces the ten most frequently used operations engineering tools, detailing each tool's functions, suitable scenarios, advantages, and real‑world examples, and includes practical code snippets to help engineers automate and streamline their daily workflows.

Automation · Linux tools · Operations

0 likes · 8 min read

Efficient Ops

Apr 16, 2025 · Operations

Top 10 Essential Ops Tools Every Engineer Should Master

This article introduces ten indispensable operations engineering tools—Shell scripts, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing their functions, suitable scenarios, advantages, and real‑world examples, plus sample code snippets to help engineers automate and monitor infrastructure efficiently.

Automation · Monitoring · Operations

0 likes · 9 min read

dbaplus Community

Apr 14, 2025 · Operations

20 Critical Server Ops Mistakes to Avoid: Real Cases & Fixes

Drawing from over 500 enterprise server failure incidents, this guide outlines twenty absolutely prohibited server actions across security configuration, system operation, data management, and architecture design, each paired with a real-world case, risk rating, and concrete remediation steps.

backup · devops · infrastructure

0 likes · 13 min read

ITPUB

Apr 13, 2025 · Operations

How Cursor Scaled Its AI Code Editor: Lessons from Indexing to Object Storage

Cursor, the AI‑powered code editor, grew to handle billions of document queries and over a hundred‑million model calls daily, prompting a multi‑stage infrastructure overhaul that moved from a failing YugaByte setup to PostgreSQL RDS, then to object‑storage‑backed databases, while tackling indexing, inference scaling, and cold‑start challenges.

AI · Cloud · Databases

0 likes · 11 min read

Baobao Algorithm Notes

Mar 30, 2025 · Artificial Intelligence

Why Scaling, Data, and Infra Matter More Than Reward Design in R1 Replication

The article analyses two months of community attempts to reproduce DeepSeek R1, highlighting that model scaling, high‑quality data, robust training infrastructure, and careful hyper‑parameter tuning outweigh pure reward‑based tricks, and it outlines common pitfalls and future research directions.

DeepSeek · LLM · RLHF

0 likes · 13 min read

FunTester

Mar 30, 2025 · Cloud Native

Mastering Kubernetes Resources with Java: EndpointSlice, PVC, PV, NetworkPolicy & More

This guide shows how to use the Fabric8 Kubernetes Java client to load, create, apply, list, watch, and delete core Kubernetes objects such as EndpointSlice, PersistentVolumeClaim, PersistentVolume, NetworkPolicy, PodDisruptionBudget, and various RBAC resources, with complete code examples for each operation.

API · Cloud Native · Fabric8

0 likes · 12 min read

21CTO

Mar 18, 2025 · Cloud Native

Why OpenInfra’s Move to the Linux Foundation Signals a New Era for Cloud‑Native Infrastructure

OpenInfra’s decision to join the Linux Foundation marks a strategic shift that unites OpenStack and CNCF resources, promising shared governance, funding, and stronger support for AI‑driven, accelerated‑computing and digital‑sovereignty workloads in the cloud‑native ecosystem.

Linux Foundation · OpenInfra · OpenStack

0 likes · 5 min read

Baidu Geek Talk

Mar 17, 2025 · Industry Insights

From Manual Restarts to Automated Fault Tolerance: The Evolution of AI Training Stability

This article traces the decade‑long evolution of AI training stability—from early small‑model manual operations to large‑scale, multi‑thousand‑GPU clusters—detailing metrics like invalid training time, fault‑tolerance architectures, eBPF‑based hidden‑fault detection, BCCL enhancements, multi‑level restart strategies, and trigger‑based checkpointing that together shrink downtime from minutes to seconds.

AI training · distributed systems · eBPF

0 likes · 22 min read

Ops Development & AI Practice

Mar 10, 2025 · Cloud Computing

Decode AWS EC2 Instance Names: A Complete Guide to Families, Generations, and Specs

This guide explains the systematic naming convention of AWS EC2 instance types, breaking down families, generations, metal indicators, and size specifications, and provides detailed examples to help you quickly identify the right instance for your workload.

AWS · Cloud Computing · EC2

0 likes · 8 min read

Ops Development & AI Practice

Feb 25, 2025 · Cloud Computing

Master Terraform Functions, Expressions, and Meta-Arguments for Powerful IaC

This guide walks through Terraform's built‑in functions for strings, numbers, lists, and maps, explains conditional expressions, string interpolation, template rendering, and demonstrates how to use the count and for_each meta‑arguments to create flexible, reusable infrastructure configurations.

Cloud · IaC · Terraform

0 likes · 11 min read

MaGe Linux Operations

Dec 26, 2024 · Operations

What Does Modern IT Operations Involve? A Complete Guide to Roles & Evolution

This article provides a comprehensive overview of internet operations, detailing the three core pillars of service‑centered stability, security, and efficiency, describing the classification of operation roles, their responsibilities, the evolution of operational practices, and practical advice for aspiring operation engineers.

Site Reliability Engineering · infrastructure

0 likes · 20 min read

AntTech

Dec 20, 2024 · Artificial Intelligence

Ant Group’s Wang Xu on Generative AI, the Emerging LAMP Paradigm, and Infrastructure Evolution

In his MEET2025 keynote, Ant Group’s Wang Xu explains how generative AI models are reshaping traditional database‑centric architectures, driving a new LAMP stack, accelerating AI agent frameworks, and prompting fundamental shifts in infrastructure, security, and developer productivity.

AI Agents · Ant Group · LAMP

0 likes · 9 min read

Bilibili Tech

Dec 20, 2024 · Operations

Evolution of Bilibili's Server Provisioning System: From Traditional PXE to BiliOS and iPXE

To cope with rapid growth, Bilibili replaced its inflexible PXE workflow with a hybrid system using in‑memory BiliOS and iPXE, adding out‑of‑band management, declarative configuration, and multi‑scenario support, which together dramatically boosted provisioning automation, reliability, and efficiency across its data‑center and edge servers.

BiliOS · PXE · Server Provisioning

0 likes · 17 min read

MaGe Linux Operations

Dec 15, 2024 · Operations

Top Open-Source Tools for Unified Accounts, Automation & Infra Ops

This guide surveys a curated set of open‑source solutions—including LDAP, JumpServer, Ansible, dnsmasq, ApacheBench, PortSentry, Vagrant, Docker, ELK and Smokeping—that together enable unified account management, automated deployment, DNS services, stress testing, security hardening, virtualization, log collection and monitoring for robust operations.

Account Management · Automation · infrastructure

0 likes · 8 min read

Radish, Keep Going!

Nov 16, 2024 · Operations

How to Set Up a Local GitLab CI/CD Environment on an M1 MacBook with Docker

This guide walks through installing GitLab CE on an M1 MacBook using an ARM Docker image, configuring ports, setting up SSH access, creating a sample project, and registering a GitLab Runner for CI/CD pipelines, all with step‑by‑step commands and code snippets.

CI/CD · Docker · GitLab

0 likes · 7 min read

Bilibili Tech

Nov 15, 2024 · Operations

How the Bilibili Live‑Streaming Team Supported the S14 Tournament

The Bilibili live‑streaming team’s S14 tournament support showcases how systematic business‑scenario analysis, precise resource forecasting, accelerated fault‑drill and stress‑test workflows, and optimized tooling can deliver stable, low‑cost performance for massive, high‑concurrency events like the 2024 League of Legends World Championship.

Live Streaming · Technical Case Study · Traffic Management

0 likes · 13 min read

Architects' Tech Alliance

Nov 10, 2024 · Industry Insights

AI Compute Infrastructure: Trends, Scaling Laws, and the Rise of Massive Clusters

The article analyzes the development of AI compute infrastructure, detailing the three‑level architecture from chip to cluster, the scaling law linking model parameters to compute demand, the rapid growth of massive “ten‑thousand‑card” clusters worldwide, and the emerging demand for inference workloads driving new deployment and scheduling strategies.

AI compute · Industry Trends · Inference Demand

0 likes · 15 min read

DevOps Engineer

Oct 29, 2024 · Operations

A Day in the Life of a DevOps Engineer

The article walks through a DevOps engineer’s typical workday, from morning Slack checks and task planning, through code repository maintenance, build and release duties, coffee breaks, lunch with teammates, focused afternoon development, and evening family time, highlighting both technical and personal aspects.

Automation · CI/CD · Operations

0 likes · 4 min read

Selected Java Interview Questions

Oct 7, 2024 · Operations

Top 10 Tools Frequently Used by Operations Engineers: Features, Use Cases, and Practical Examples

This article introduces ten essential tools for operations engineers—Shell scripts, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing each tool's functionality, typical scenarios, advantages, and real‑world examples with code snippets for practical automation and monitoring.

Automation · Monitoring · Operations

0 likes · 8 min read

DevOps Engineer

Oct 1, 2024 · Operations

What a Chief DevOps Engineer Does: Responsibilities, Required Skills, and Business Benefits

The article explains the role of a chief DevOps engineer, outlining core duties such as infrastructure design, automation, and cultural leadership, the essential technical and soft‑skill requirements, and the advantages this position brings to an organization’s efficiency, reliability, and collaboration.

Automation · Chief Engineer · Leadership

0 likes · 6 min read

Liangxu Linux

Sep 29, 2024 · Operations

Essential Automation Scripts for Operations: Baselines, Checks, and Repository Structure

This guide presents a comprehensive collection of operations automation scripts—including baseline health checks, business inspections, organized directory structures, naming conventions, and download links—designed to streamline system, network, database, and cloud infrastructure management.

Ansible · Automation · Operations

0 likes · 6 min read

Volcano Engine Developer Services

Sep 2, 2024 · Operations

How ByteDance Scales Disaster Recovery: From Single Data Center to Multi‑Region Active‑Active

This article details ByteDance’s disaster‑recovery evolution—from a single‑data‑center deployment to same‑city multi‑data‑center setups and finally to active‑active multi‑region architectures—explaining the challenges, specific failure scenarios, and the strategic practices used to ensure continuous service during outages.

Disaster Recovery · High Availability · Operations

0 likes · 15 min read

Alibaba Cloud Infrastructure

Sep 2, 2024 · Cloud Native

How Lilith Games Used Cloud‑Native Architecture to Transform AFK Journey

This article examines Lilith Games' cloud‑native migration of the new title AFK Journey, detailing the motivations, technical challenges of containerizing stateful game servers, the adoption of OpenKruise for in‑place updates, and the measurable improvements in resource utilization, release speed, and operational costs.

Cloud Native · Game Development · OpenKruise

0 likes · 8 min read

Model Perspective

Aug 26, 2024 · Fundamentals

How Coupling and Coordination Models Reveal Gaps in Rural Infrastructure Development

Using coupling and coordination degree models, this article explains why new rural infrastructure alone often fails to improve living standards, illustrates how to quantify mismatches between infrastructure and public services, and offers policy recommendations for balanced, harmonious development.

coordination model · coupling model · infrastructure

0 likes · 5 min read

IT Services Circle

Aug 21, 2024 · Operations

Analysis of NetEase Cloud Music Outage on August 19: Infrastructure Failure and Operational Lessons

On August 19, NetEase Cloud Music suffered a severe infrastructure‑related outage that prevented user login, playlist loading, and song search, prompting a two‑hour recovery effort, a brief free‑membership compensation, and highlighting the critical role of proper change management, gray releases, disaster recovery, and cross‑functional coordination in large‑scale services.

Disaster Recovery · NetEase Cloud Music · Operations

0 likes · 6 min read

dbaplus Community

Aug 10, 2024 · Operations

Is Operations Really the Lowest‑Skill Role in IT? Insights from Zhihu Users

A collection of Zhihu answers examines the perception that IT operations is low‑tech, sharing real‑world experiences that reveal hidden complexities, the evolution of ops responsibilities, and why the role is actually far from trivial.

IT career · devops · infrastructure

0 likes · 6 min read

Open Source Linux

Aug 1, 2024 · Operations

Top 10 Essential Ops Tools Every Engineer Should Master

This article introduces ten indispensable tools for operations engineers, detailing each tool's functionality, ideal use cases, key advantages, and practical examples, while also providing code snippets and visual illustrations to help readers understand and apply them effectively.

Automation · Monitoring · Operations

0 likes · 8 min read

Architects' Tech Alliance

Jul 28, 2024 · Artificial Intelligence

Design and Optimization Practices for Intelligent Computing Platforms in the Era of Large Models

The article examines the new characteristics, challenges, and technical practices of intelligent computing platforms required for large‑model AI workloads, covering infrastructure adaptation, heterogeneous scheduling, application acceleration, operation reliability, and future directions for simplifying GPU usage and connecting heterogeneous resources.

AI platform · Performance Optimization · Scheduling

0 likes · 6 min read
