Tagged articles

infrastructure

371 articles · Page 1 of 4
Design Hub
Jul 2, 2026 · Industry Insights

Design Insight: Why the 130‑meter Electric Ferry Redefines More Than Its Engine

China Zorrilla, a 130‑meter electric Ro‑Pax ferry built by Incat, demonstrates that true innovation lies in redesigning the entire route system—integrating hull, battery, shore power, and passenger experience—rather than merely swapping a diesel engine for electric propulsion.

battery integration · electric ferry · infrastructure
0 likes · 11 min read
Machine Heart
Jun 29, 2026 · Artificial Intelligence

Open‑Source AI‑Infra Ops Agent Benchmark Powered by Hundreds of Billions of Real Data

The article introduces AISHPerf, the first open‑source benchmark for AI‑infra operations agents built on nearly a hundred‑billion real‑world ops records, detailing its data pipeline, multi‑layer coverage, evaluation metrics, experimental results that show current models lag behind human experts, and future plans to expand and refine the benchmark.

AI Ops · Evaluation Metrics · Fault Injection
0 likes · 16 min read
AI Engineering
Jun 15, 2026 · Industry Insights

What Types of Engineers Anthropic Hires: Insights from 1,680 Resumes

An analysis of 1,680 Anthropic engineers shows the company rapidly built a large infra‑focused team, hiring mostly senior staff with a median of 12.2 years of experience and sourcing talent primarily from Google and other FAANG firms, while junior hires are rare and highly selective.

AI industry · Anthropic · FAANG talent
0 likes · 9 min read
Machine Heart
Jun 15, 2026 · Industry Insights

Anthropic’s Hiring Secret: 1680 Engineer Resumes Reveal a Preference for Senior Infrastructure Veterans

An analysis of 1,680 public LinkedIn resumes shows that Anthropic has grown threefold in 18 months by hiring mostly senior engineers with extensive infrastructure experience, sourcing talent primarily from Google and other large‑scale tech firms, while largely ignoring fresh graduates and PhDs.

AI hiring · Anthropic · LinkedIn analysis
0 likes · 10 min read
Tech Freedom Circle
Jun 9, 2026 · Artificial Intelligence

Deep Dive into Harness’s Sandbox Infra: How Deep Agents Enable Secure AI Execution

This article provides a detailed technical analysis of Harness’s Sandbox infrastructure, explaining how Deep Agents’ sandbox backend isolates file operations and command execution, the role of the single execute() entry point, security guarantees, lifecycle management, and practical integration steps for Docker, Kubernetes, or commercial sandbox providers.

AI · Deep Agents · Harness
0 likes · 39 min read
Software Engineering 3.0 Era
May 24, 2026 · Artificial Intelligence

The 6 Essential Components of an Effective AI Harness System

The article breaks down AI Harness Engineering into six indispensable parts—prompt system, tools & skills, infrastructure, orchestration logic, hooks & middleware, and model configuration—explaining their roles, concrete examples, common pitfalls, and how they together turn a powerful base model into a reliable, scalable workplace assistant.

AI Harness · Model Configuration · Orchestration
0 likes · 11 min read
DataFunTalk
May 13, 2026 · Industry Insights

Why Palantir’s Value Is Rising: AI Commoditization, Ontology, and 85% Q1 Revenue Growth

As large‑model capabilities become commoditized, Palantir argues that the true moat lies in its ontology‑driven infrastructure, which integrates business semantics to ensure reliable AI in high‑risk contexts, a strategy reflected in its 85% Q1 revenue jump and a three‑layer AI competition model.

AI commoditization · AI competition · Enterprise AI
0 likes · 11 min read
Digital Planet
May 2, 2026 · Industry Insights

Can AI Actually Lower Enterprise Digitalization Costs?

While many executives believe AI will slash the expenses of digital transformation, the article reveals hidden infrastructure, integration, talent, and ongoing operational costs that often turn AI into a cost‑shifting tool rather than a true cost‑saving solution, especially for core system projects.

AI · Enterprise · Operations
0 likes · 9 min read
Baidu App Technology
Apr 27, 2026 · Artificial Intelligence

Boosting End-to-End Efficiency with AI: From Single-Point Gains to Full Process Integration

The YoyoMan Drama team details how they transformed their product development pipeline by building Prompt‑friendly PRDs, design‑as‑code, AI coding infrastructure, and AI agents, creating a seamless “requirement‑design‑development‑test” loop that shifts work from manual effort to AI‑augmented processes across the entire workflow.

AI · AI coding · Design Automation
0 likes · 27 min read
Ray's Galactic Tech
Apr 19, 2026 · Cloud Native

Building a Production‑Ready Cloud‑Native Kubernetes Platform: From Zero to SRE Success

This article presents a step‑by‑step guide to designing and implementing a production‑grade Kubernetes platform with GitOps, observability, capacity governance, fault‑injection, and SRE practices, showing how to achieve unified delivery, reliability, and low‑cost operation for high‑concurrency business services.

Cloud Native · GitOps · Observability
0 likes · 37 min read
ITPUB
Apr 17, 2026 · Industry Insights

Why LinkedIn Dumped Kafka for Its Own ‘Northguard’ Streaming Engine

LinkedIn, the original home of Apache Kafka, abandoned the platform for a home‑grown system called Northguard, redesigning log storage, decentralizing metadata, and adding a virtualized Xinfra layer to handle trillions of daily events, while still acknowledging Kafka’s relevance for most companies.

LinkedIn · Northguard · Streaming
0 likes · 7 min read
AI Code to Success
Apr 13, 2026 · Industry Insights

Why Anthropic’s Managed Agents Redefine AI Agent Runtime

Anthropic’s Managed Agents transform the cumbersome agent runtime into a modular, production‑ready infrastructure by decoupling the brain, hands, and session layers, improving reliability, security, and performance while offering developers a clear path to build long‑running AI workflows.

AI Agents · Agent Runtime · Anthropic
0 likes · 10 min read
IT Services Circle
Apr 7, 2026 · Industry Insights

How a Single 8 GB Server Powered 500 K Users for 15 Years – The Webminal Story

Webminal, a free online Linux learning platform, has survived for fifteen years on a single 8 GB CentOS server, serving over half a million users by using a minimalist stack—including Python 2.7, Flask, Shellinabox, User Mode Linux and eBPF—while deliberately avoiding modern container orchestration and commercial monetisation.

Case Study · Online Linux · Shellinabox
0 likes · 10 min read
PaperAgent
Mar 29, 2026 · Industry Insights

From Reasoning to Agentic Thinking: How Harnesses Are Redefining AI Development

The article examines the shift from traditional reasoning‑based large‑language‑model pipelines to agentic, harness‑driven AI systems, outlining the definition of a harness, its engineering challenges, architectural components, and the broader implications for training, reinforcement learning, and future research directions.

AI Harness · Intelligent agents · Model Training
0 likes · 16 min read
AI Waka
Mar 25, 2026 · Information Security

How NemoClaw Secures Autonomous AI Agents with Kernel‑Level Sandboxing

This article examines NemoClaw’s three‑layer architecture that adds kernel‑level sandboxing, policy‑driven deployment, and flexible inference routing to OpenClaw, outlines installation steps, compares it with the native OpenClaw runtime, and discusses current limitations for production use.

AI Agent Security · NemoClaw · OpenShell
0 likes · 9 min read
dbaplus Community
Mar 23, 2026 · Operations

How a Single AI‑Driven Command Wiped 2.5 Years of Production Data

In this detailed post‑mortem, Alexey Grigorev recounts how using Claude Code to automate a Terraform deployment unintentionally erased his entire production environment and two‑and‑a‑half years of data, exposing the risks of over‑reliance on AI‑driven automation and highlighting essential safeguards.

AI · AWS · Automation
0 likes · 11 min read
ByteDance Data Platform
Mar 13, 2026 · Artificial Intelligence

Beyond Parameters: How ClawLake Turns Agent Memory into Enterprise‑Level AI Infrastructure

The article explains why an AI agent's capabilities are limited by memory depth rather than model size, reviews three historical memory architectures, highlights their structural shortcomings, and details how the ClawLake solution provides a multi‑layer, multimodal, enterprise‑grade memory infrastructure for OpenClaw agents.

AI · Agent · Enterprise
0 likes · 17 min read
Efficient Ops
Mar 11, 2026 · Operations

How an AI‑Powered Terraform Command Erased 2 Million Records – Lessons for Safe Ops

A single Terraform command executed by the AI assistant Claude Code mistakenly destroyed a production database of over two million records, exposing how over‑reliance on AI, missing state files, weak backup practices, and absent deletion protection can cause massive outages and what safeguards can prevent such incidents.

AI Ops · AWS · Incident Management
0 likes · 6 min read
dbaplus Community
Mar 1, 2026 · Operations

50 High‑Impact IT Operations Projects to Supercharge Your Resume

This guide presents 50 detailed IT operations projects—covering infrastructure, cloud native, automation, monitoring, security, databases, networking, disaster recovery, and DevOps—each with background, tech stack, implementation steps, and quantifiable results to help engineers craft compelling, results‑driven resume entries.

Automation · Cloud · infrastructure
0 likes · 25 min read
Architects' Tech Alliance
Jan 23, 2026 · Artificial Intelligence

What Are the Top 10 Global Computing Power Trends Shaping AI by 2026?

The Global Computing Alliance’s 2026 report outlines ten transformative trends—from explosive AI compute growth and the rise of supernodes to embodied intelligence, heterogeneous architectures, network‑centric designs, and the imminent commercialization of quantum computing—showing how compute power is becoming the strategic engine of the digital economy.

AI · computing power · digital economy
0 likes · 12 min read
DevOps Coach
Jan 20, 2026 · Cloud Native

How to Scale Kubernetes to Hundreds of Clusters: A Practical Enterprise Guide

This article walks you through the complete journey from a single Kubernetes cluster to a production‑grade, multi‑cluster platform, covering managed services, capacity planning, GitOps pipelines, networking, observability, cost optimisation, upgrade strategies, and the people and processes needed for sustainable large‑scale operations.

Cloud Native · Observability · cost management
0 likes · 27 min read
Raymond Ops
Dec 28, 2025 · Operations

From Zero to Production: Ansible Playbook Design Patterns & Best Practices

This guide walks you through building a production‑grade Ansible automation framework—from identifying common manual‑deployment pain points to defining layered architecture, directory conventions, reusable playbook patterns, high‑availability deployments, performance optimizations, monitoring, security hardening, CI/CD integration, and troubleshooting tips—empowering teams to achieve reliable, scalable operations.

Ansible · Automation · CI/CD
0 likes · 14 min read
DevOps Coach
Dec 24, 2025 · Operations

Will AI Replace Your DevOps Skills? Future‑Proof Your Career Today

The article explains how AI is rapidly automating traditional DevOps tasks—troubleshooting, configuration management, and toolchain mastery—forcing engineers to shift from manual expertise to outcome‑oriented orchestration, and outlines three pillars for building an AI‑native DevOps career.

AI · Automation · Career
0 likes · 8 min read
Raymond Ops
Dec 21, 2025 · Operations

Mastering Ansible: Deep Dive into Architecture, Modules, and Enterprise Automation

This comprehensive guide explains Ansible's agentless architecture, core components, module taxonomy, custom module development, performance tuning, large‑scale design patterns, real‑world LAMP deployment, monitoring integration, and future cloud‑native and AI‑driven trends, providing actionable steps for DevOps engineers.

Ansible · configuration management · infrastructure
0 likes · 15 min read
DevOps Engineer
Dec 10, 2025 · Operations

DevOps Tools as a Car Factory: Packer, Terraform, Ansible, Docker, Kubernetes

The article uses a car‑factory analogy to clarify the distinct roles of DevOps tools—Packer for image building, Terraform for infrastructure provisioning, Ansible for configuration, Docker for containerized applications, and Kubernetes for large‑scale orchestration—showing how they fit into build, provision, and run phases of the IT lifecycle.

Ansible · Docker · Packer
0 likes · 8 min read
Past Memory Big Data
Dec 9, 2025 · Artificial Intelligence

A Decade of Evolution: Inside Pinterest’s AI Platform Journey

Over ten years, Pinterest transformed a fragmented machine‑learning stack into a unified AI platform, iterating through stages from early ad‑hoc pipelines to scalable GPU‑accelerated services, while learning that timing, organizational alignment, and efficiency are crucial for lasting impact.

AI platform · GPU inference · ML Ops
0 likes · 25 min read
21CTO
Dec 3, 2025 · Operations

What My Biggest Developer Mistakes Taught Me About Operations and Resilience

A software engineer recounts three major mistakes—from accidentally deleting thousands of F5 URLs to leaking code externally and being laid off during COVID—highlighting how operational oversights, poor process controls, and personal resilience shape professional growth and underscore the value of empathy and systematic safeguards.

Resilience · failure · infrastructure
0 likes · 14 min read
Amazon Cloud Developers
Dec 3, 2025 · Cloud Computing

The Road to Billions of AI Agents: Key Takeaways from Matt Garman’s re:Invent 2025 Keynote

At AWS re:Invent 2025, CEO Matt Garman outlined four essential pillars for building AI agents, unveiled three frontier agents, introduced the Amazon Nova 2 model series and 25 major cloud service innovations, and argued that billions of agents will soon deliver ten‑fold efficiency gains across enterprises.

AI Agents · AWS · Cloud Computing
0 likes · 20 min read
DevOps Coach
Nov 27, 2025 · Cloud Native

When Kubernetes Is Overkill: A Practical Guide for Small Teams

This article examines why Kubernetes often adds unnecessary complexity for tiny startups, outlines the hidden costs of its operational overhead, and offers concrete alternatives and step‑by‑step advice for when to adopt or avoid container orchestration.

Cloud Native · Platform Engineering · devops
0 likes · 12 min read
Alimama Tech
Nov 26, 2025 · Artificial Intelligence

How Alibaba’s ROCK & ROLL Enable Scalable Agentic AI Training

Alibaba’s open‑source ROCK environment sandbox and the ROLL reinforcement‑learning engine together provide a standardized, high‑throughput training loop that lets developers scale Agentic AI from a single machine to thousands of parallel instances while simplifying debugging and resource management.

Agentic AI · Scalable Training · infrastructure
0 likes · 12 min read
MaGe Linux Operations
Nov 17, 2025 · Operations

Production-Ready Prometheus Alerting: 50+ Core Metrics & Best Practices

This guide details production‑grade Prometheus alerting configurations, covering applicable scenarios, prerequisites, anti‑patterns, environment matrices, step‑by‑step deployment of Node Exporter, Prometheus and Alertmanager, comprehensive rule files, performance testing, troubleshooting, best practices, and ready‑to‑use scripts for backup and health checks.

Alerting · Ops · infrastructure
0 likes · 51 min read
dbaplus Community
Nov 15, 2025 · Operations

What 20 Years of Ops Mishaps Reveal About Infrastructure Resilience

A chronicle of real‑world operations incidents from 2003 to 2024 shows how simple mistakes—mis‑configured passwords, unplugged cables, hardware mix‑ups, and rushed cloud migrations—can cascade into massive outages, offering hard‑earned lessons for anyone managing production systems.

Case Study · Operations · incident
0 likes · 11 min read
IT Services Circle
Nov 7, 2025 · Artificial Intelligence

Why Microsoft’s GPU Fleet Is Sitting Idle – The Power Crisis Behind AI’s Growth

Microsoft CEO Satya Nadella admits that the tech giant’s massive stock of Nvidia GPUs is sitting idle because of insufficient electricity and a lack of ready‑to‑use data‑center facilities, highlighting a broader industry shift in which AI’s soaring compute demand is now constrained by power and infrastructure limits.

AI · Cloud Computing · Data Centers
0 likes · 8 min read
Alibaba Cloud Infrastructure
Oct 29, 2025 · Cloud Native

How Container Services Are Powering the AI Agent Revolution

The article reviews Alibaba Cloud's container service advancements, highlights AI-driven trends such as intelligent agents reshaping applications, the migration of AI infrastructure to cloud‑native platforms, and showcases four customer case studies demonstrating massive efficiency gains and the emergence of containers as the operating system for the AI era.

AI · AI Agents · Cloud Native
0 likes · 6 min read
ITPUB
Oct 28, 2025 · Operations

50 Powerful IT Ops Projects to Supercharge Your Resume

This article compiles 50 detailed IT operations projects across infrastructure, cloud, containers, automation, monitoring, security, databases, networking, disaster recovery and DevOps, each with scenario, tech stack, implementation steps and quantifiable results to help you craft standout résumé entries.

Automation · Cloud · IT Operations
0 likes · 30 min read
Amazon Cloud Developers
Oct 13, 2025 · Artificial Intelligence

Agentic AI Guide: Building and Deploying Robust AI Agents

This article provides a comprehensive technical guide on Agentic AI, detailing the core modules, infrastructure requirements, security considerations, observability practices, and deployment strategies needed to develop and operate production‑ready AI agents.

AI Agents · AgentOps · Agentic AI
0 likes · 27 min read
MaGe Linux Operations
Oct 8, 2025 · Operations

Build an Enterprise‑Grade DevOps Pipeline in 7 Days: Hands‑On Guide + Ready‑to‑Use Scripts

This step‑by‑step guide shows how to create a full‑stack, enterprise‑level DevOps CI/CD pipeline—from environment setup and Docker installation to Jenkins pipeline scripts, Kubernetes deployments, monitoring, security hardening, and cost‑optimisation—enabling teams to reduce release cycles from days to minutes within a week.

Automation · CI/CD · Docker
0 likes · 38 min read
dbaplus Community
Oct 7, 2025 · Operations

Why Ops Engineers Need a Massive Knowledge Stack—and How to Master It

This comprehensive guide explains why modern operations engineers must cover the full technology stack, outlines common learning pitfalls, presents a three‑layer, nine‑domain knowledge framework, and offers a step‑by‑step, personalized roadmap with practical labs and career‑growth advice.

Automation · Operations · career development
0 likes · 14 min read
21CTO
Oct 5, 2025 · Artificial Intelligence

Anthropic Appoints Former Stripe Exec Rahul Patil as CTO Amid AI Infrastructure Race

Anthropic has named former Stripe senior executive Rahul Patil as its new CTO, reshaping its engineering structure to tighten product, infrastructure, and inference teams while facing intense AI infrastructure competition from OpenAI and Meta, and imposing new usage limits on its Claude services.

AI · Anthropic · CTO
0 likes · 4 min read
MaGe Linux Operations
Oct 2, 2025 · Operations

How Ansible Can Deploy 100 Servers in 10 Minutes: A Hands‑On Guide

This article explains why Ansible is the preferred automation tool, outlines its core advantages and architecture, and provides a step‑by‑step, code‑rich tutorial—from installing the control node and configuring SSH keys to writing inventories, ad‑hoc commands, Playbooks, Roles, and a real‑world 100‑server deployment case—showing how to achieve massive scaling with minimal effort.

Ansible · configuration management · infrastructure
0 likes · 29 min read
DevOps Coach
Oct 1, 2025 · Operations

10 Hard‑Earned Infrastructure Lessons Every Engineer Should Know

Drawing from real incidents like SQLite crashes, missing logs, unthrottled APIs, slow container startups, queue bottlenecks, network partitions, unreliable clocks, and weak alerts, this article shares ten concrete infrastructure lessons with code examples, performance data, and practical recommendations to avoid costly pitfalls.

Go · Operations · devops
0 likes · 8 min read
DataFunTalk
Sep 23, 2025 · Artificial Intelligence

Nvidia and OpenAI Launch the World’s Largest AI Compute Project

Nvidia and OpenAI have forged a strategic partnership to deploy at least 10 GW of GPU power—equivalent to millions of GPUs—with up to $100 billion in investment, marking the biggest AI infrastructure effort ever and promising transformative impacts across industries.

AI · GPU compute · NVIDIA
0 likes · 5 min read
Ops Development & AI Practice
Sep 2, 2025 · Information Security

How a Tiny XSS Bug in Dev Environments Can Compromise Production Secrets

The article reveals how a seemingly harmless XSS flaw in an internal development platform can be weaponized to steal high‑privilege credentials, pivot across internal services, and ultimately breach production systems, urging teams to treat development environments as critical security frontiers.

Application Security · Credential Theft · DevOps Security
0 likes · 9 min read
Wuming AI
Aug 26, 2025 · Artificial Intelligence

A Layered Overview of Agentic AI: From LLM Foundations to Multi‑Agent Systems

This article presents a hierarchical breakdown of Agentic AI, detailing the foundational large language models, the capabilities of AI agents, the coordination mechanisms of multi‑agent systems, and the supporting infrastructure needed for reliability, scalability, and security.

AI Agents · Agentic AI · LLM
0 likes · 5 min read
MaGe Linux Operations
Jul 29, 2025 · Operations

How to Build a Production‑Ready Ansible Automation System from Scratch

This comprehensive guide walks you through the pain points of traditional operations and presents a layered, role‑driven Ansible architecture with design patterns, high‑availability deployment examples, performance tweaks, monitoring, security best practices, CI/CD integration, and debugging techniques for building a production‑grade automation framework.

Ansible · infrastructure · playbook
0 likes · 12 min read
Baobao Algorithm Notes
Jul 28, 2025 · Industry Insights

Why AWS Bedrock AgentCore Signals a New Era for Agentic AI Infrastructure

The article analyzes AWS Bedrock AgentCore and related hardware and software requirements for Agentic AI, covering runtime isolation with microVMs, memory architectures, identity and gateway design, zero‑trust networking, and the challenges of multi‑tenant KVCache and context engineering.

AWS Bedrock · Agentic AI · Memory Management
0 likes · 15 min read
Open Source Linux
Jul 23, 2025 · Operations

Master Ansible Playbooks: From Basics to Large‑Scale Cluster Automation

This comprehensive guide walks you through Ansible fundamentals, core components, advanced playbook design, variable management, role architecture, error handling, large‑scale deployment strategies, performance tuning, security hardening, CI/CD integration, and monitoring, empowering you to automate modern infrastructure efficiently.

Ansible · configuration management · devops
0 likes · 14 min read
Efficient Ops
Jul 21, 2025 · Operations

30 Must‑Have DevOps Skills to Boost Your Resume in 2025

This article outlines 30 essential DevOps competencies—from foundational infrastructure and cloud/container orchestration to automation, monitoring, security, and AI‑driven operations—detailing key technologies, real‑world scenarios, and measurable impact, helping professionals craft a standout resume in the evolving operations landscape.

AI Ops · Automation · Cloud
0 likes · 8 min read
Ops Development & AI Practice
Jul 21, 2025 · Industry Insights

Why Building a DEX Is Far More Than Writing Smart Contracts

Running a decentralized exchange requires extensive pre‑launch development, rigorous security audits, robust front‑end design, tokenomics planning, and continuous post‑launch operations such as infrastructure maintenance, security monitoring, liquidity management, and community governance.

DEX · DeFi · Security Audits
0 likes · 7 min read
IT Architects Alliance
Jul 11, 2025 · Fundamentals

How Do China and the U.S. Stack Up in Tech Infrastructure, Cloud, and AI?

This article compares China and the United States across infrastructure, cloud computing, artificial intelligence, key technologies, innovation ecosystems, and standards, highlighting each nation's strengths, strategic approaches, and the evolving balance of competition and cooperation in global technology development.

China · Cloud Computing · United States
0 likes · 9 min read
Ops Development & AI Practice
Jul 7, 2025 · Cloud Computing

Why Infrastructure Architecture Is the Hidden Backbone of Modern Cloud Systems

Infrastructure architecture, the often‑overlooked foundation of IT, defines how compute, storage, networking, and security are designed, integrated, and automated—linking software, ops, and cloud strategies—through processes like requirement analysis, technology selection, IaC implementation, and continuous optimization for reliability, performance, cost, and operational excellence.

IaC · Operations · cloud architecture
0 likes · 8 min read
Architects' Tech Alliance
Jul 6, 2025 · Fundamentals

Mastering Data Center Essentials: 100 Core Concepts You Must Know

This comprehensive guide walks you through 100 essential data‑center concepts—from basic definitions, tier standards, and modular design to networking layers, storage architectures, compute resources, security measures, operational practices, energy efficiency, emerging technologies, and industry ecosystem—providing a complete knowledge framework for modern digital infrastructure.

Compute · Data Center · infrastructure
0 likes · 21 min read
Efficient Ops
Jun 24, 2025 · Operations

Essential Ops Toolkit: 58 Core Tools for Modern Infrastructure Management

This article compiles a comprehensive matrix of 58 mainstream operations tools—covering operating systems, open‑source mirrors, containers, AI‑assisted ops, basic services, databases, monitoring, automation, CI/CD and service mesh—to help engineers quickly locate the right technology stack for efficient infrastructure management.

Cloud · Operations · devops
0 likes · 6 min read
ITPUB
Jun 17, 2025 · Artificial Intelligence

Why Private Cloud Is the Best Choice for Enterprise AI Deployment

The article examines why private‑cloud infrastructure, rather than public‑cloud services, offers enterprises better cost control, data sovereignty, customization, and security for building AI‑ready platforms, and outlines five core capabilities needed to achieve this.

AI · Enterprise · Orchestration
0 likes · 11 min read
Smart Era Software Development
Jun 5, 2025 · Artificial Intelligence

How AI Is Taking Over Your IDE: A 5‑Stage Roadmap for Agent‑Native Infrastructure

The article argues that AI’s ultimate goal is not just to write code faster than humans but to control the entire software lifecycle, and it proposes a five‑stage L0‑L5 maturity model for AI‑native infrastructure that moves from simple code generation to a full Agent‑Native operating system.

AI · Agent · Result-as-a-Service
0 likes · 32 min read
Bilibili Tech
May 27, 2025 · Operations

Automated Server Fault Detection and Repair: Architecture, Methods, and Future Outlook

This article presents a comprehensive overview of server fault management at scale, detailing the classification of failures, shortcomings of traditional manual processes, and the design of an automated detection and repair system that combines in‑band and out‑of‑band data collection, rule‑based alerting, and end‑to‑end repair workflows, while also outlining future directions for intelligent monitoring and reliability.

Automation · Monitoring · Operations
0 likes · 17 min read
Efficient Ops
May 21, 2025 · Operations

Why We Dropped Kubernetes: Cutting Costs by 62% and Boosting DevOps Happiness

Six months after abandoning Kubernetes, our DevOps team reduced infrastructure spend by 62%, cut deployment time by 89%, eliminated weekend on‑call duties, and improved overall happiness, demonstrating that simplifying the tech stack can deliver substantial operational and business benefits.

Operations · cost reduction · devops
0 likes · 9 min read
Alibaba Cloud Infrastructure
Apr 30, 2025 · Artificial Intelligence

Exploring and Practicing a Unified Compute Network for AI at Zuoyebang: Building an Innovation Engine for the AI Era

This article summarizes Zuoyebang's infrastructure leader Dong Xiaocong's presentation on the challenges of AI inference demand and supply, and describes the design and implementation of a unified compute network—including trusted networking, multi‑region container scheduling, and traffic routing—to efficiently serve large‑scale AI models.

AI · Model Distribution · compute network
0 likes · 9 min read
MaGe Linux Operations
Apr 17, 2025 · Operations

Top 10 Essential Ops Tools Every Engineer Should Master

This article introduces the ten most frequently used operations engineering tools, detailing each tool's functions, suitable scenarios, advantages, and real‑world examples, and includes practical code snippets to help engineers automate and streamline their daily workflows.

Automation · Linux tools · Operations
0 likes · 8 min read
Efficient Ops
Apr 16, 2025 · Operations

Top 10 Essential Ops Tools Every Engineer Should Master

This article introduces ten indispensable operations engineering tools—Shell scripts, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing their functions, suitable scenarios, advantages, and real‑world examples, plus sample code snippets to help engineers automate and monitor infrastructure efficiently.

Automation · Monitoring · Operations
0 likes · 9 min read
dbaplus Community
Apr 14, 2025 · Operations

20 Critical Server Ops Mistakes to Avoid: Real Cases & Fixes

Drawing from over 500 enterprise server failure incidents, this guide outlines twenty absolutely prohibited server actions across security configuration, system operation, data management, and architecture design, each paired with a real-world case, risk rating, and concrete remediation steps.

backup · devops · infrastructure
0 likes · 13 min read
ITPUB
Apr 13, 2025 · Operations

How Cursor Scaled Its AI Code Editor: Lessons from Indexing to Object Storage

Cursor, the AI‑powered code editor, grew to handle billions of document queries and over a hundred‑million model calls daily, prompting a multi‑stage infrastructure overhaul that moved from a failing YugaByte setup to PostgreSQL RDS, then to object‑storage‑backed databases, while tackling indexing, inference scaling, and cold‑start challenges.

AI · Cloud · Databases
0 likes · 11 min read
FunTester
Mar 30, 2025 · Cloud Native

Mastering Kubernetes Resources with Java: EndpointSlice, PVC, PV, NetworkPolicy & More

This guide shows how to use the Fabric8 Kubernetes Java client to load, create, apply, list, watch, and delete core Kubernetes objects such as EndpointSlice, PersistentVolumeClaim, PersistentVolume, NetworkPolicy, PodDisruptionBudget, and various RBAC resources, with complete code examples for each operation.

API · Cloud Native · Fabric8
0 likes · 12 min read
Baidu Geek Talk
Mar 17, 2025 · Industry Insights

From Manual Restarts to Automated Fault Tolerance: The Evolution of AI Training Stability

This article traces the decade‑long evolution of AI training stability—from early small‑model manual operations to large‑scale, multi‑thousand‑GPU clusters—detailing metrics like invalid training time, fault‑tolerance architectures, eBPF‑based hidden‑fault detection, BCCL enhancements, multi‑level restart strategies, and trigger‑based checkpointing that together shrink downtime from minutes to seconds.

AI training · distributed systems · eBPF
0 likes · 22 min read
MaGe Linux Operations
Dec 26, 2024 · Operations

What Does Modern IT Operations Involve? A Complete Guide to Roles & Evolution

This article provides a comprehensive overview of internet operations, detailing the three core pillars of service‑centered stability, security, and efficiency, describing the classification of operation roles, their responsibilities, the evolution of operational practices, and practical advice for aspiring operation engineers.

Site Reliability Engineering · infrastructure
0 likes · 20 min read
Bilibili Tech
Dec 20, 2024 · Operations

Evolution of Bilibili's Server Provisioning System: From Traditional PXE to BiliOS and iPXE

To cope with rapid growth, Bilibili replaced its inflexible PXE workflow with a hybrid system using in‑memory BiliOS and iPXE, adding out‑of‑band management, declarative configuration, and multi‑scenario support, which together dramatically boosted provisioning automation, reliability, and efficiency across its data‑center and edge servers.

BiliOS · PXE · Server Provisioning
0 likes · 17 min read
MaGe Linux Operations
Dec 15, 2024 · Operations

Top Open-Source Tools for Unified Accounts, Automation & Infra Ops

This guide surveys a curated set of open‑source solutions—including LDAP, JumpServer, Ansible, dnsmasq, ApacheBench, PortSentry, Vagrant, Docker, ELK and Smokeping—that together enable unified account management, automated deployment, DNS services, stress testing, security hardening, virtualization, log collection and monitoring for robust operations.

Account Management · Automation · infrastructure
0 likes · 8 min read
Bilibili Tech
Nov 15, 2024 · Operations

Bilibili Live Streaming Team's S14 Tournament Support in Practice

The Bilibili live‑streaming team’s S14 tournament support showcases how systematic business‑scenario analysis, precise resource forecasting, accelerated fault‑drill and stress‑test workflows, and optimized tooling can deliver stable, low‑cost performance for massive, high‑concurrency events like the 2024 League of Legends World Championship.

Live Streaming · Technical Case Study · Traffic Management
0 likes · 13 min read
Architects' Tech Alliance
Nov 10, 2024 · Industry Insights

AI Compute Infrastructure: Trends, Scaling Laws, and the Rise of Massive Clusters

The article analyzes the development of AI compute infrastructure, detailing the three‑level architecture from chip to cluster, the scaling law linking model parameters to compute demand, the rapid growth of massive “ten‑thousand‑card” clusters worldwide, and the emerging demand for inference workloads driving new deployment and scheduling strategies.

AI compute · Industry Trends · Inference Demand
0 likes · 15 min read
DevOps Engineer
Oct 29, 2024 · Operations

A Day in the Life of a DevOps Engineer

The article walks through a DevOps engineer’s typical workday, from morning Slack checks and task planning, through code repository maintenance, build and release duties, coffee breaks, lunch with teammates, focused afternoon development, and evening family time, highlighting both technical and personal aspects.

Automation · CI/CD · Operations
0 likes · 4 min read
Selected Java Interview Questions
Oct 7, 2024 · Operations

Top 10 Tools Frequently Used by Operations Engineers: Features, Use Cases, and Practical Examples

This article introduces ten essential tools for operations engineers—Shell scripts, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing each tool's functionality, typical scenarios, advantages, and real‑world examples with code snippets for practical automation and monitoring.

Automation · Monitoring · Operations
0 likes · 8 min read
Volcano Engine Developer Services
Sep 2, 2024 · Operations

How ByteDance Scales Disaster Recovery: From Single Data Center to Multi‑Region Active‑Active

This article details ByteDance’s disaster‑recovery evolution—from a single‑room deployment to same‑city multi‑data‑center setups and finally to active‑active multi‑region architectures—explaining the challenges, specific failure scenarios, and the strategic practices used to ensure continuous service during outages.

Disaster Recovery · High Availability · Operations
0 likes · 15 min read
Alibaba Cloud Infrastructure
Sep 2, 2024 · Cloud Native

How Lilith Games Used Cloud‑Native Architecture to Transform AFK Journey

This article examines Lilith Games' cloud‑native migration of the new title AFK Journey, detailing the motivations, technical challenges of containerizing stateful game servers, the adoption of OpenKruise for in‑place updates, and the measurable improvements in resource utilization, release speed, and operational costs.

Cloud Native · Game Development · OpenKruise
0 likes · 8 min read
IT Services Circle
Aug 21, 2024 · Operations

Analysis of NetEase Cloud Music Outage on August 19: Infrastructure Failure and Operational Lessons

On August 19, NetEase Cloud Music suffered a severe infrastructure‑related outage that prevented user login, playlist loading, and song search, prompting a two‑hour recovery effort, a brief free‑membership compensation, and highlighting the critical role of proper change management, gray releases, disaster recovery, and cross‑functional coordination in large‑scale services.

Disaster Recovery · NetEase Cloud Music · Operations
0 likes · 6 min read
Open Source Linux
Aug 1, 2024 · Operations

Top 10 Essential Ops Tools Every Engineer Should Master

This article introduces ten indispensable tools for operations engineers, detailing each tool's functionality, ideal use cases, key advantages, and practical examples, while also providing code snippets and visual illustrations to help readers understand and apply them effectively.

Automation · Monitoring · Operations
0 likes · 8 min read
Architects' Tech Alliance
Jul 28, 2024 · Artificial Intelligence

Design and Optimization Practices for Intelligent Computing Platforms in the Era of Large Models

The article examines the new characteristics, challenges, and technical practices of intelligent computing platforms required for large‑model AI workloads, covering infrastructure adaptation, heterogeneous scheduling, application acceleration, operation reliability, and future directions for simplifying GPU usage and connecting heterogeneous resources.

AI platform · Performance Optimization · Scheduling
0 likes · 6 min read