Tagged articles
360 articles
DataFunTalk
May 13, 2026 · Industry Insights

Why Palantir’s Value Is Rising: AI Commoditization, Ontology, and 85% Q1 Revenue Growth

As large‑model capabilities become commoditized, Palantir argues that the true moat lies in its ontology‑driven infrastructure, which integrates business semantics to ensure reliable AI in high‑risk contexts, a strategy reflected in its 85% Q1 revenue jump and a three‑layer AI competition model.

AI commoditization · AI competition · Enterprise AI
0 likes · 11 min read
Digital Planet
May 2, 2026 · Industry Insights

Can AI Actually Lower Enterprise Digitalization Costs?

While many executives believe AI will slash the expenses of digital transformation, the article reveals hidden infrastructure, integration, talent, and ongoing operational costs that often turn AI into a cost‑shifting tool rather than a true cost‑saving solution, especially for core system projects.

AI · Digital Transformation · Enterprise
0 likes · 9 min read
Baidu App Technology
Apr 27, 2026 · Artificial Intelligence

Boosting End-to-End Efficiency with AI: From Single-Point Gains to Full Process Integration

The YoyoMan Drama team details how they transformed their product development pipeline by building Prompt‑friendly PRDs, design‑as‑code, AI coding infrastructure, and AI agents, creating a seamless “requirement‑design‑development‑test” loop that shifts work from manual effort to AI‑augmented processes across the entire workflow.

AI · AI Coding · Collaboration
0 likes · 27 min read
Ray's Galactic Tech
Apr 19, 2026 · Cloud Native

Building a Production‑Ready Cloud‑Native Kubernetes Platform: From Zero to SRE Success

This article presents a step‑by‑step guide to designing and implementing a production‑grade Kubernetes platform with GitOps, observability, capacity governance, fault‑injection, and SRE practices, showing how to achieve unified delivery, reliability, and low‑cost operation for high‑concurrency business services.

Cloud Native · GitOps · Infrastructure
0 likes · 37 min read
ITPUB
Apr 17, 2026 · Industry Insights

Why LinkedIn Dumped Kafka for Its Own ‘Northguard’ Streaming Engine

LinkedIn, the original home of Apache Kafka, abandoned the platform for a home‑grown system called Northguard, redesigning log storage, decentralizing metadata, and adding a virtualized Xinfra layer to handle trillions of daily events, while still acknowledging Kafka’s relevance for most companies.

Distributed Systems · Infrastructure · Kafka
0 likes · 7 min read
AI Code to Success
Apr 13, 2026 · Industry Insights

Why Anthropic’s Managed Agents Redefine AI Agent Runtime

Anthropic’s Managed Agents transform the cumbersome agent runtime into a modular, production‑ready infrastructure by decoupling the brain, hands, and session layers, improving reliability, security, and performance while offering developers a clear path to build long‑running AI workflows.

AI Agents · Anthropic · Infrastructure
0 likes · 10 min read
IT Services Circle
Apr 7, 2026 · Industry Insights

How a Single 8 GB Server Powered 500 K Users for 15 Years – The Webminal Story

Webminal, a free online Linux learning platform, has survived for fifteen years on a single 8 GB CentOS server, serving over half a million users by using a minimalist stack—including Python 2.7, Flask, Shellinabox, User Mode Linux and eBPF—while deliberately avoiding modern container orchestration and commercial monetisation.

Case Study · Infrastructure · Online Linux
0 likes · 10 min read
PaperAgent
Mar 29, 2026 · Industry Insights

From Reasoning to Agentic Thinking: How Harnesses Are Redefining AI Development

The article examines the shift from traditional reasoning‑based large‑language‑model pipelines to agentic, harness‑driven AI systems, outlining the definition of a harness, its engineering challenges, architectural components, and the broader implications for training, reinforcement learning, and future research directions.

AI Harness · Infrastructure · Intelligent agents
0 likes · 16 min read
AI Waka
Mar 25, 2026 · Information Security

How NemoClaw Secures Autonomous AI Agents with Kernel‑Level Sandboxing

This article examines NemoClaw’s three‑layer architecture that adds kernel‑level sandboxing, policy‑driven deployment, and flexible inference routing to OpenClaw, outlines installation steps, compares it with the native OpenClaw runtime, and discusses current limitations for production use.

AI Agent Security · Infrastructure · NemoClaw
0 likes · 9 min read
dbaplus Community
Mar 23, 2026 · Operations

How a Single AI‑Driven Command Wiped 2.5 Years of Production Data

In this detailed post‑mortem, Alexey Grigorev recounts how using Claude Code to automate a Terraform deployment unintentionally erased his entire production environment and two‑and‑a‑half years of data, exposing the risks of over‑reliance on AI‑driven automation and highlighting essential safeguards.

AI · AWS · Automation
0 likes · 11 min read
ByteDance Data Platform
Mar 13, 2026 · Artificial Intelligence

Beyond Parameters: How ClawLake Turns Agent Memory into Enterprise‑Level AI Infrastructure

The article explains why an AI agent's capabilities are limited by memory depth rather than model size, reviews three historical memory architectures, highlights their structural shortcomings, and details how the ClawLake solution provides a multi‑layer, multimodal, enterprise‑grade memory infrastructure for OpenClaw agents.

AI · Agent · Enterprise
0 likes · 17 min read
dbaplus Community
Mar 1, 2026 · Operations

50 High‑Impact IT Operations Projects to Supercharge Your Resume

This guide presents 50 detailed IT operations projects—covering infrastructure, cloud native, automation, monitoring, security, databases, networking, disaster recovery, and DevOps—each with background, tech stack, implementation steps, and quantifiable results to help engineers craft compelling, results‑driven resume entries.

Automation · Infrastructure · Security
0 likes · 25 min read
Architects' Tech Alliance
Jan 23, 2026 · Artificial Intelligence

What Are the Top 10 Global Computing Power Trends Shaping AI by 2026?

The Global Computing Alliance’s 2026 report outlines ten transformative trends—from explosive AI compute growth and the rise of supernodes to embodied intelligence, heterogeneous architectures, network‑centric designs, and the imminent commercialization of quantum computing—showing how compute power is becoming the strategic engine of the digital economy.

AI · Digital Economy · Infrastructure
0 likes · 12 min read
DevOps Coach
Jan 20, 2026 · Cloud Native

How to Scale Kubernetes to Hundreds of Clusters: A Practical Enterprise Guide

This article walks you through the complete journey from a single Kubernetes cluster to a production‑grade, multi‑cluster platform, covering managed services, capacity planning, GitOps pipelines, networking, observability, cost optimisation, upgrade strategies, and the people and processes needed for sustainable large‑scale operations.

Cloud Native · Cost Management · Infrastructure
0 likes · 27 min read
Raymond Ops
Dec 28, 2025 · Operations

From Zero to Production: Ansible Playbook Design Patterns & Best Practices

This guide walks you through building a production‑grade Ansible automation framework—from identifying common manual‑deployment pain points to defining layered architecture, directory conventions, reusable playbook patterns, high‑availability deployments, performance optimizations, monitoring, security hardening, CI/CD integration, and troubleshooting tips—empowering teams to achieve reliable, scalable operations.

Ansible · Automation · DevOps
0 likes · 14 min read
DevOps Coach
Dec 24, 2025 · Operations

Will AI Replace Your DevOps Skills? Future‑Proof Your Career Today

The article explains how AI is rapidly automating traditional DevOps tasks—troubleshooting, configuration management, and toolchain mastery—forcing engineers to shift from manual expertise to outcome‑oriented orchestration, and outlines three pillars for building an AI‑native DevOps career.

AI · Automation · DevOps
0 likes · 8 min read
Raymond Ops
Dec 21, 2025 · Operations

Mastering Ansible: Deep Dive into Architecture, Modules, and Enterprise Automation

This comprehensive guide explains Ansible's agentless architecture, core components, module taxonomy, custom module development, performance tuning, large‑scale design patterns, real‑world LAMP deployment, monitoring integration, and future cloud‑native and AI‑driven trends, providing actionable steps for DevOps engineers.

Ansible · Configuration Management · Infrastructure
0 likes · 15 min read
DevOps Engineer
Dec 10, 2025 · Operations

DevOps Tools as a Car Factory: Packer, Terraform, Ansible, Docker, Kubernetes

The article uses a car‑factory analogy to clarify the distinct roles of DevOps tools—Packer for image building, Terraform for infrastructure provisioning, Ansible for configuration, Docker for containerized applications, and Kubernetes for large‑scale orchestration—showing how they fit into build, provision, and run phases of the IT lifecycle.

Ansible · DevOps · Docker
0 likes · 8 min read
21CTO
Dec 3, 2025 · Operations

What My Biggest Developer Mistakes Taught Me About Operations and Resilience

A software engineer recounts three major mistakes—from accidentally deleting thousands of F5 URLs to leaking code externally and being laid off during COVID—highlighting how operational oversights, poor process controls, and personal resilience shape professional growth and underscore the value of empathy and systematic safeguards.

Infrastructure · Resilience · failure
0 likes · 14 min read
DevOps Coach
Nov 27, 2025 · Cloud Native

When Kubernetes Is Overkill: A Practical Guide for Small Teams

This article examines why Kubernetes often adds unnecessary complexity for tiny startups, outlines the hidden costs of its operational overhead, and offers concrete alternatives and step‑by‑step advice for when to adopt or avoid container orchestration.

Cloud Native · DevOps · Infrastructure
0 likes · 12 min read
Alimama Tech
Nov 26, 2025 · Artificial Intelligence

How Alibaba’s ROCK & ROLL Enable Scalable Agentic AI Training

Alibaba’s open‑source ROCK environment sandbox and the ROLL reinforcement‑learning engine together provide a standardized, high‑throughput training loop that lets developers scale Agentic AI from a single machine to thousands of parallel instances while simplifying debugging and resource management.

Agentic AI · Infrastructure · Scalable Training
0 likes · 12 min read
MaGe Linux Operations
Nov 17, 2025 · Operations

Production-Ready Prometheus Alerting: 50+ Core Metrics & Best Practices

This guide details production‑grade Prometheus alerting configurations, covering applicable scenarios, prerequisites, anti‑patterns, environment matrices, step‑by‑step deployment of Node Exporter, Prometheus and Alertmanager, comprehensive rule files, performance testing, troubleshooting, best practices, and ready‑to‑use scripts for backup and health checks.

Alerting · Infrastructure · Ops
0 likes · 51 min read
dbaplus Community
Nov 15, 2025 · Operations

What 20 Years of Ops Mishaps Reveal About Infrastructure Resilience

A chronicle of real‑world operations incidents from 2003 to 2024 shows how simple mistakes—mis‑configured passwords, unplugged cables, hardware mix‑ups, and rushed cloud migrations—can cascade into massive outages, offering hard‑earned lessons for anyone managing production systems.

Case Study · Incident · Infrastructure
0 likes · 11 min read
Alibaba Cloud Infrastructure
Oct 29, 2025 · Cloud Native

How Container Services Are Powering the AI Agent Revolution

The article reviews Alibaba Cloud's container service advancements, highlights AI-driven trends such as intelligent agents reshaping applications, the migration of AI infrastructure to cloud‑native platforms, and showcases four customer case studies demonstrating massive efficiency gains and the emergence of containers as the operating system for the AI era.

AI · AI Agents · Cloud Native
0 likes · 6 min read
ITPUB
Oct 28, 2025 · Operations

50 Powerful IT Ops Projects to Supercharge Your Resume

This article compiles 50 detailed IT operations projects across infrastructure, cloud, containers, automation, monitoring, security, databases, networking, disaster recovery and DevOps, each with scenario, tech stack, implementation steps and quantifiable results to help you craft standout résumé entries.

Automation · IT Operations · Infrastructure
0 likes · 30 min read
MaGe Linux Operations
Oct 8, 2025 · Operations

Build an Enterprise‑Grade DevOps Pipeline in 7 Days: Hands‑On Guide + Ready‑to‑Use Scripts

This step‑by‑step guide shows how to create a full‑stack, enterprise‑level DevOps CI/CD pipeline—from environment setup and Docker installation to Jenkins pipeline scripts, Kubernetes deployments, monitoring, security hardening, and cost‑optimisation—enabling teams to reduce release cycles from days to minutes within a week.

Automation · DevOps · Docker
0 likes · 38 min read
dbaplus Community
Oct 7, 2025 · Operations

Why Ops Engineers Need a Massive Knowledge Stack—and How to Master It

This comprehensive guide explains why modern operations engineers must cover the full technology stack, outlines common learning pitfalls, presents a three‑layer, nine‑domain knowledge framework, and offers a step‑by‑step, personalized roadmap with practical labs and career‑growth advice.

Automation · Career Development · DevOps
0 likes · 14 min read
21CTO
Oct 5, 2025 · Artificial Intelligence

Anthropic Appoints Former Stripe Exec Rahul Patil as CTO Amid AI Infrastructure Race

Anthropic has named former Stripe senior executive Rahul Patil as its new CTO, reshaping its engineering structure to tighten product, infrastructure, and inference teams while facing intense AI infrastructure competition from OpenAI and Meta, and imposing new usage limits on its Claude services.

AI · Anthropic · CTO
0 likes · 4 min read
MaGe Linux Operations
Oct 2, 2025 · Operations

How Ansible Can Deploy 100 Servers in 10 Minutes: A Hands‑On Guide

This article explains why Ansible is the preferred automation tool, outlines its core advantages and architecture, and provides a step‑by‑step, code‑rich tutorial—from installing the control node and configuring SSH keys to writing inventories, ad‑hoc commands, Playbooks, Roles, and a real‑world 100‑server deployment case—showing how to achieve massive scaling with minimal effort.

Ansible · Configuration Management · Infrastructure
0 likes · 29 min read
DevOps Coach
Oct 1, 2025 · Operations

10 Hard‑Earned Infrastructure Lessons Every Engineer Should Know

Drawing from real incidents like SQLite crashes, missing logs, unthrottled APIs, slow container startups, queue bottlenecks, network partitions, unreliable clocks, and weak alerts, this article shares ten concrete infrastructure lessons with code examples, performance data, and practical recommendations to avoid costly pitfalls.

DevOps · Go · Infrastructure
0 likes · 8 min read
DataFunTalk
Sep 23, 2025 · Artificial Intelligence

Nvidia and OpenAI Launch the World’s Largest AI Compute Project

Nvidia and OpenAI have forged a strategic partnership to deploy at least 10 GW of GPU power—equivalent to millions of GPUs—with up to $100 billion in investment, marking the biggest AI infrastructure effort ever and promising transformative impacts across industries.

AI · GPU compute · Infrastructure
0 likes · 5 min read
Ops Development & AI Practice
Sep 2, 2025 · Information Security

How a Tiny XSS Bug in Dev Environments Can Compromise Production Secrets

The article reveals how a seemingly harmless XSS flaw in an internal development platform can be weaponized to steal high‑privilege credentials, pivot across internal services, and ultimately breach production systems, urging teams to treat development environments as critical security frontiers.

Application Security · DevOps Security · Infrastructure
0 likes · 9 min read
Wuming AI
Aug 26, 2025 · Artificial Intelligence

A Layered Overview of Agentic AI: From LLM Foundations to Multi‑Agent Systems

This article presents a hierarchical breakdown of Agentic AI, detailing the foundational large language models, the capabilities of AI agents, the coordination mechanisms of multi‑agent systems, and the supporting infrastructure needed for reliability, scalability, and security.

AI Agents · Agentic AI · Infrastructure
0 likes · 5 min read
MaGe Linux Operations
Jul 29, 2025 · Operations

How to Build a Production‑Ready Ansible Automation System from Scratch

This comprehensive guide walks you through the pain points of traditional operations and presents a layered, role‑driven Ansible architecture with design patterns, high‑availability deployment examples, performance tweaks, monitoring, security best practices, CI/CD integration, and debugging techniques for building a production‑grade automation framework.

Ansible · Infrastructure · Playbook
0 likes · 12 min read
Baobao Algorithm Notes
Jul 28, 2025 · Industry Insights

Why AWS Bedrock AgentCore Signals a New Era for Agentic AI Infrastructure

The article analyzes AWS Bedrock AgentCore and related hardware and software requirements for Agentic AI, covering runtime isolation with microVMs, memory architectures, identity and gateway design, zero‑trust networking, and the challenges of multi‑tenant KVCache and context engineering.

AWS Bedrock · Agentic AI · Infrastructure
0 likes · 15 min read
Open Source Linux
Jul 23, 2025 · Operations

Master Ansible Playbooks: From Basics to Large‑Scale Cluster Automation

This comprehensive guide walks you through Ansible fundamentals, core components, advanced playbook design, variable management, role architecture, error handling, large‑scale deployment strategies, performance tuning, security hardening, CI/CD integration, and monitoring, empowering you to automate modern infrastructure efficiently.

Ansible · Configuration Management · DevOps
0 likes · 14 min read
Efficient Ops
Jul 21, 2025 · Operations

30 Must‑Have DevOps Skills to Boost Your Resume in 2025

This article outlines 30 essential DevOps competencies—from foundational infrastructure and cloud/container orchestration to automation, monitoring, security, and AI‑driven operations—detailing key technologies, real‑world scenarios, and measurable impact, helping professionals craft a standout resume in the evolving operations landscape.

AI Ops · Automation · DevOps
0 likes · 8 min read
Ops Development & AI Practice
Jul 21, 2025 · Industry Insights

Why Building a DEX Is Far More Than Writing Smart Contracts

Running a decentralized exchange requires extensive pre‑launch development, rigorous security audits, robust front‑end design, tokenomics planning, and continuous post‑launch operations such as infrastructure maintenance, security monitoring, liquidity management, and community governance.

DEX · DeFi · Infrastructure
0 likes · 7 min read
IT Architects Alliance
Jul 11, 2025 · Fundamentals

How Do China and the U.S. Stack Up in Tech Infrastructure, Cloud, and AI?

This article compares China and the United States across infrastructure, cloud computing, artificial intelligence, key technologies, innovation ecosystems, and standards, highlighting each nation's strengths, strategic approaches, and the evolving balance of competition and cooperation in global technology development.

China · Infrastructure · Innovation
0 likes · 9 min read
Ops Development & AI Practice
Jul 7, 2025 · Cloud Computing

Why Infrastructure Architecture Is the Hidden Backbone of Modern Cloud Systems

Infrastructure architecture, the often‑overlooked foundation of IT, defines how compute, storage, networking, and security are designed, integrated, and automated—linking software, ops, and cloud strategies—through processes like requirement analysis, technology selection, IaC implementation, and continuous optimization for reliability, performance, cost, and operational excellence.

DevOps · Infrastructure · Operations
0 likes · 8 min read
Architects' Tech Alliance
Jul 6, 2025 · Fundamentals

Mastering Data Center Essentials: 100 Core Concepts You Must Know

This comprehensive guide walks you through 100 essential data‑center concepts—from basic definitions, tier standards, and modular design to networking layers, storage architectures, compute resources, security measures, operational practices, energy efficiency, emerging technologies, and industry ecosystem—providing a complete knowledge framework for modern digital infrastructure.

Compute · Data center · Infrastructure
0 likes · 21 min read
Efficient Ops
Jun 24, 2025 · Operations

Essential Ops Toolkit: 58 Core Tools for Modern Infrastructure Management

This article compiles a comprehensive matrix of 58 mainstream operations tools—covering operating systems, open‑source mirrors, containers, AI‑assisted ops, basic services, databases, monitoring, automation, CI/CD and service mesh—to help engineers quickly locate the right technology stack for efficient infrastructure management.

DevOps · Infrastructure · Operations
0 likes · 6 min read
ITPUB
Jun 17, 2025 · Artificial Intelligence

Why Private Cloud Is the Best Choice for Enterprise AI Deployment

The article examines why private‑cloud infrastructure, rather than public‑cloud services, offers enterprises better cost control, data sovereignty, customization, and security for building AI‑ready platforms, and outlines five core capabilities needed to achieve this.

AI · Enterprise · Infrastructure
0 likes · 11 min read
Efficient Ops
Jun 3, 2025 · Operations

What Anthropic’s SRE Team Learned: 23 Practical Ops Tips for Scalable AI Infrastructure

This article shares Anthropic’s SRE engineer insights on 23 actionable practices—from schema migration and Karpenter node management to OpenTelemetry adoption, Helm chart storage, and Terraform versus CloudFormation—offering concrete recommendations for building reliable, cost‑effective AI and cloud‑native platforms.

Cloud Native · DevOps · Infrastructure
0 likes · 12 min read
Bilibili Tech
May 27, 2025 · Operations

Automated Server Fault Detection and Repair: Architecture, Methods, and Future Outlook

This article presents a comprehensive overview of server fault management at scale, detailing the classification of failures, shortcomings of traditional manual processes, and the design of an automated detection and repair system that combines in‑band and out‑of‑band data collection, rule‑based alerting, and end‑to‑end repair workflows, while also outlining future directions for intelligent monitoring and reliability.

Automation · Infrastructure · Operations
0 likes · 17 min read
Efficient Ops
May 21, 2025 · Operations

Why We Dropped Kubernetes: Cutting Costs by 62% and Boosting DevOps Happiness

Six months after abandoning Kubernetes, our DevOps team reduced infrastructure spend by 62%, cut deployment time by 89%, eliminated weekend on‑call duties, and improved overall happiness, demonstrating that simplifying the tech stack can deliver substantial operational and business benefits.

Cost reduction · DevOps · Infrastructure
0 likes · 9 min read
Alibaba Cloud Infrastructure
Apr 30, 2025 · Artificial Intelligence

Exploring and Practicing a Unified Compute Network for AI at Zuoyebang: Building an Innovation Engine for the AI Era

This article summarizes Zuoyebang's infrastructure leader Dong Xiaocong's presentation on the challenges of AI inference demand and supply, and describes the design and implementation of a unified compute network—including trusted networking, multi‑region container scheduling, and traffic routing—to efficiently serve large‑scale AI models.

AI · Compute Network · Infrastructure
0 likes · 9 min read
MaGe Linux Operations
Apr 17, 2025 · Operations

Top 10 Essential Ops Tools Every Engineer Should Master

This article introduces the ten most frequently used operations engineering tools, detailing each tool's functions, suitable scenarios, advantages, and real‑world examples, and includes practical code snippets to help engineers automate and streamline their daily workflows.

Automation · Infrastructure · Linux tools
0 likes · 8 min read
Efficient Ops
Apr 16, 2025 · Operations

Top 10 Essential Ops Tools Every Engineer Should Master

This article introduces ten indispensable operations engineering tools—Shell scripts, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing their functions, suitable scenarios, advantages, and real‑world examples, plus sample code snippets to help engineers automate and monitor infrastructure efficiently.

Automation · DevOps · Infrastructure
0 likes · 9 min read
dbaplus Community
Apr 14, 2025 · Operations

20 Critical Server Ops Mistakes to Avoid: Real Cases & Fixes

Drawing from over 500 enterprise server failure incidents, this guide outlines twenty absolutely prohibited server actions across security configuration, system operation, data management, and architecture design, each paired with a real-world case, risk rating, and concrete remediation steps.

Backup · DevOps · Infrastructure
0 likes · 13 min read
ITPUB
Apr 13, 2025 · Operations

How Cursor Scaled Its AI Code Editor: Lessons from Indexing to Object Storage

Cursor, the AI‑powered code editor, grew to handle billions of document queries and over a hundred‑million model calls daily, prompting a multi‑stage infrastructure overhaul that moved from a failing YugaByte setup to PostgreSQL RDS, then to object‑storage‑backed databases, while tackling indexing, inference scaling, and cold‑start challenges.

AI · Inference · Infrastructure
0 likes · 11 min read
FunTester
Mar 30, 2025 · Cloud Native

Mastering Kubernetes Resources with Java: EndpointSlice, PVC, PV, NetworkPolicy & More

This guide shows how to use the Fabric8 Kubernetes Java client to load, create, apply, list, watch, and delete core Kubernetes objects such as EndpointSlice, PersistentVolumeClaim, PersistentVolume, NetworkPolicy, PodDisruptionBudget, and various RBAC resources, with complete code examples for each operation.

API · Cloud Native · DevOps
0 likes · 12 min read
Baidu Geek Talk
Mar 17, 2025 · Industry Insights

From Manual Restarts to Automated Fault Tolerance: The Evolution of AI Training Stability

This article traces the decade‑long evolution of AI training stability—from early small‑model manual operations to large‑scale, multi‑thousand‑GPU clusters—detailing metrics like invalid training time, fault‑tolerance architectures, eBPF‑based hidden‑fault detection, BCCL enhancements, multi‑level restart strategies, and trigger‑based checkpointing that together shrink downtime from minutes to seconds.

AI training · Distributed Systems · Infrastructure
0 likes · 22 min read
MaGe Linux Operations
Dec 26, 2024 · Operations

What Does Modern IT Operations Involve? A Complete Guide to Roles & Evolution

This article provides a comprehensive overview of internet operations, detailing the three core pillars of service‑centered stability, security, and efficiency, describing the classification of operation roles, their responsibilities, the evolution of operational practices, and practical advice for aspiring operation engineers.

Infrastructure · Site Reliability Engineering
0 likes · 20 min read
Bilibili Tech
Dec 20, 2024 · Operations

Evolution of Bilibili's Server Provisioning System: From Traditional PXE to BiliOS and iPXE

To cope with rapid growth, Bilibili replaced its inflexible PXE workflow with a hybrid system using in‑memory BiliOS and iPXE, adding out‑of‑band management, declarative configuration, and multi‑scenario support, which together dramatically boosted provisioning automation, reliability, and efficiency across its data‑center and edge servers.

BiliOS · Deployment · Infrastructure
0 likes · 17 min read
MaGe Linux Operations
Dec 15, 2024 · Operations

Top Open-Source Tools for Unified Accounts, Automation & Infra Ops

This guide surveys a curated set of open‑source solutions—including LDAP, JumpServer, Ansible, dnsmasq, ApacheBench, PortSentry, Vagrant, Docker, ELK and Smokeping—that together enable unified account management, automated deployment, DNS services, stress testing, security hardening, virtualization, log collection and monitoring for robust operations.

Account Management · Automation · Infrastructure
0 likes · 8 min read
Bilibili Tech
Nov 15, 2024 · Operations

How the Bilibili Live‑Streaming Team Supported the S14 Tournament

The Bilibili live‑streaming team’s S14 tournament support showcases how systematic business‑scenario analysis, precise resource forecasting, accelerated fault‑drill and stress‑test workflows, and optimized tooling can deliver stable, low‑cost performance for massive, high‑concurrency events like the 2024 League of Legends World Championship.

Infrastructure · Technical Case Study · event support
0 likes · 13 min read
Architects' Tech Alliance
Nov 10, 2024 · Industry Insights

AI Compute Infrastructure: Trends, Scaling Laws, and the Rise of Massive Clusters

The article analyzes the development of AI compute infrastructure, detailing the three‑level architecture from chip to cluster, the scaling law linking model parameters to compute demand, the rapid growth of massive “ten‑thousand‑card” clusters worldwide, and the emerging demand for inference workloads driving new deployment and scheduling strategies.

AI compute · Inference Demand · Infrastructure
0 likes · 15 min read
DevOps Engineer
Oct 29, 2024 · Operations

A Day in the Life of a DevOps Engineer

The article walks through a DevOps engineer’s typical workday, from morning Slack checks and task planning, through code repository maintenance, build and release duties, coffee breaks, lunch with teammates, focused afternoon development, and evening family time, highlighting both technical and personal aspects.

Automation · DevOps · Infrastructure
0 likes · 4 min read
Selected Java Interview Questions
Oct 7, 2024 · Operations

Top 10 Tools Frequently Used by Operations Engineers: Features, Use Cases, and Practical Examples

This article introduces ten essential tools for operations engineers—Shell scripts, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing each tool's functionality, typical scenarios, advantages, and real‑world examples with code snippets for practical automation and monitoring.

Automation · Infrastructure · Operations
0 likes · 8 min read
Volcano Engine Developer Services
Sep 2, 2024 · Operations

How ByteDance Scales Disaster Recovery: From Single Data Center to Multi‑Region Active‑Active

This article details ByteDance’s disaster‑recovery evolution—from a single‑data‑center deployment to same‑city multi‑data‑center setups and finally to active‑active multi‑region architectures—explaining the challenges, specific failure scenarios, and the strategic practices used to ensure continuous service during outages.

Infrastructure · Operations · disaster recovery
0 likes · 15 min read
Alibaba Cloud Infrastructure
Sep 2, 2024 · Cloud Native

How Lilith Games Used Cloud‑Native Architecture to Transform AFK Journey

This article examines Lilith Games' cloud‑native migration of the new title AFK Journey, detailing the motivations, technical challenges of containerizing stateful game servers, the adoption of OpenKruise for in‑place updates, and the measurable improvements in resource utilization, release speed, and operational costs.

Cloud Native · DevOps · Game Development
0 likes · 8 min read
IT Services Circle
Aug 21, 2024 · Operations

Analysis of NetEase Cloud Music Outage on August 19: Infrastructure Failure and Operational Lessons

On August 19, NetEase Cloud Music suffered a severe infrastructure‑related outage that prevented user login, playlist loading, and song search, prompting a two‑hour recovery effort, a brief free‑membership compensation, and highlighting the critical role of proper change management, gray releases, disaster recovery, and cross‑functional coordination in large‑scale services.

Infrastructure · NetEase Cloud Music · Operations
0 likes · 6 min read
Open Source Linux
Aug 1, 2024 · Operations

Top 10 Essential Ops Tools Every Engineer Should Master

This article introduces ten indispensable tools for operations engineers, detailing each tool's functionality, ideal use cases, key advantages, and practical examples, while also providing code snippets and visual illustrations to help readers understand and apply them effectively.

Automation · Configuration Management · Infrastructure
0 likes · 8 min read
Architects' Tech Alliance
Jul 28, 2024 · Artificial Intelligence

Design and Optimization Practices for Intelligent Computing Platforms in the Era of Large Models

The article examines the new characteristics, challenges, and technical practices of intelligent computing platforms required for large‑model AI workloads, covering infrastructure adaptation, heterogeneous scheduling, application acceleration, operational reliability, and future directions for simplifying GPU usage and connecting heterogeneous resources.

AI Platform · Infrastructure · Performance Optimization
0 likes · 6 min read
Programmer DD
Jul 24, 2024 · Databases

How We Cut Redis Costs by ¥460k Monthly: 10 Proven Optimization Strategies

In 2023, a TapTap infra team reduced Redis operating costs by 460,000 CNY per month through low‑cost ESSD instances, traffic compression, unused‑instance cleanup, TTL management, data migration, online compression, and targeted cleaning, detailing ten concrete measures and the open‑source tools that enabled zero‑downtime optimization.

Cost Optimization · Database Management · Infrastructure
0 likes · 17 min read
Architects' Tech Alliance
Jul 22, 2024 · Fundamentals

Comprehensive Overview of Data Center Architecture and Its Core Components

This article provides a detailed overview of modern data center architecture, covering physical and IT infrastructure, network topologies such as three‑tier and spine‑leaf, storage solutions like DAS, NAS and SAN, server designs, cloud data‑center components, physical site considerations, and various data‑center deployment models.

Data center · Infrastructure · Storage Systems
0 likes · 20 min read
Baidu Geek Talk
Jul 15, 2024 · Industry Insights

How AI Is Revolutionizing Physical Network Fault Localization

This article explains how Baidu Cloud evolved from manual and integrated network fault detection to AI-driven localization using large language models, detailing structured prompting, multi‑agent workflows, and real‑world comparisons that demonstrate improved accuracy and faster mitigation.

AI · Fault Localization · Infrastructure
0 likes · 14 min read
Efficient Ops
Jun 23, 2024 · Operations

Top 10 Essential Ops Tools Every Engineer Should Master

This article introduces ten indispensable tools for operations engineers, detailing each tool's functionality, typical use cases, key advantages, and real‑world examples, while also providing a practical Shell script and an Ansible playbook to illustrate automation in daily workflows.

Infrastructure · devops tools · monitoring
0 likes · 8 min read
Practical DevOps Architecture
Jun 13, 2024 · Operations

Comprehensive Data Center Operations Training Course Overview

This extensive training program covers everything a data center operations engineer needs—from foundational infrastructure management and server hardware maintenance to advanced network configuration, security hardening, monitoring, fault handling, and practical hands‑on skills for real‑world challenges.

Data center · Infrastructure · Operations
0 likes · 6 min read
DevOps Engineer
May 29, 2024 · Cloud Computing

Overview of the Python Software Foundation (PSF) Infrastructure

The article provides a comprehensive overview of the Python Software Foundation's infrastructure, detailing its team, cloud providers, data centers, and the hosting arrangements for numerous services such as PyPI, docs, bug trackers, and the main Python website.

Cloud Services · Infrastructure · Operations
0 likes · 9 min read
dbaplus Community
May 28, 2024 · Operations

Top 10 Essential Tools Every Operations Engineer Should Master

This guide reviews ten indispensable tools for operations engineers, detailing each tool's functions, ideal scenarios, advantages, and real‑world examples, and includes practical code snippets for automation, monitoring, container management, and log analysis.

Automation · DevOps · Infrastructure
0 likes · 8 min read
Architects' Tech Alliance
May 24, 2024 · Industry Insights

What Drives AI's Future? A Four‑Layer Industry Framework Explained

This article breaks down the AI ecosystem into four layers—AI hardware and cloud services, model and algorithm advances, MLOps middleware, and B2B/B2C applications—highlighting how hardware cost reductions, cloud integration, model breakthroughs, and middleware providers shape the market and adoption speed.

AI · AI applications · Industry analysis
0 likes · 6 min read
Bilibili Tech
Apr 30, 2024 · Industry Insights

How Bilibili’s Smart Cabling Platform Boosts Data Center Efficiency

This article examines Bilibili's data‑center cabling challenges and presents a smart management platform that digitizes design, automates routing with scenario‑based and shortest‑path algorithms, streamlines task creation and operation, ultimately reducing installation time and improving maintenance efficiency.

Automation · Cabling · Data center
0 likes · 12 min read
Architects' Tech Alliance
Apr 24, 2024 · Industry Insights

What Is Hyper‑Converged Infrastructure and Why It’s Transforming Data Centers

Hyper‑converged infrastructure (HCI) integrates compute, storage, and networking into a single software‑defined platform, offering simplified management, improved efficiency, seamless scalability, and lower total cost of ownership, making it a preferred architecture for modern data centers and cloud‑native workloads.

Data center · HCI · Infrastructure
0 likes · 28 min read