Tagged articles
969 articles
Woodpecker Software Testing
Mar 3, 2026 · Artificial Intelligence

2026 In‑Depth Comparison of RAG Testing Tools: Finding the Most Trustworthy Solution

RAG systems have reached a trustworthiness tipping point, and in 2026 a surge of testing challenges demands new evaluation metrics; this article benchmarks twelve leading retrieval‑augmented generation testing tools across retrieval quality, generation controllability, observability, security compliance, and CI/CD integration, revealing which solutions best address real‑world finance and government use cases.

AI testing · Observability · RAG
0 likes · 8 min read
Woodpecker Software Testing
Mar 3, 2026 · Operations

Self-Healing Test Scripts: End Frequent Maintenance Hassles

The article explains how self‑healing test scripts, built on observable snapshots, strategy libraries, and lightweight decision engines, can automatically detect UI changes, diagnose locator failures, and apply semantic or visual fixes, dramatically reducing maintenance time and manual intervention in fast‑paced continuous delivery environments.

Observability · Python · Selenium
0 likes · 7 min read
Alibaba Cloud Native
Mar 2, 2026 · Artificial Intelligence

How to Make AI Agents Auditable and Controlled with OpenClaw, SLS, and OTEL

This article explains how to combine OpenClaw session logs, application logs, and OpenTelemetry metrics in Alibaba Cloud SLS to answer who triggered an AI agent, what actions were taken, how much it cost, and whether the behavior is traceable, enabling a complete observability and security solution for AI agents.

AI Agent · Metrics · OTEL
0 likes · 26 min read
Woodpecker Software Testing
Mar 1, 2026 · Artificial Intelligence

Optimizing RAG System Performance: A Practical Testing Guide

The article presents a systematic framework for testing and optimizing Retrieval‑Augmented Generation (RAG) systems, detailing performance‑sensitive bottlenecks, a three‑dimensional test matrix, real‑world case studies, and test‑driven engineering practices to ensure stable, fast, and accurate AI services.

AI · Benchmarking · Observability
0 likes · 9 min read
Code Wrench
Feb 28, 2026 · Backend Development

Why Explicit Code Beats Clever Tricks: Go’s Industrial Programming Principles

The article revisits Peter Bourgon’s “Go for Industrial Programming,” explaining how explicit, readable code, strict dependency handling, disciplined concurrency, robust observability, and simple flag‑based configuration empower Go teams to build maintainable, long‑lived backend systems.

Go · Industrial Programming · Observability
0 likes · 7 min read
Raymond Ops
Feb 26, 2026 · Operations

What Core Skills Do 500k‑CNY Ops Engineers Master?

This article breaks down the essential technical and soft‑skill competencies—ranging from deep Linux kernel knowledge and database optimization to cloud‑native Kubernetes expertise, observability, automation, cost‑saving architecture, and security—that distinguish high‑salary operations engineers and provides a practical roadmap for achieving them.

Kubernetes · Observability · Operations
0 likes · 38 min read
Architect
Feb 25, 2026 · Backend Development

Why OpenClaw Uses sessionKey as Partition Key and How Its Dual‑Queue Design Guarantees Order and Throughput

The article explains how OpenClaw tackles common multi‑agent messaging problems by treating sessionKey as a partition key, redefining DM scope for multi‑source inputs, employing a dual‑layer queue with per‑session serialization and global lane throttling, and exposing configurable knobs for micro‑batching, backpressure, and observability.

Message Queue · Observability · OpenClaw
0 likes · 11 min read
Raymond Ops
Feb 24, 2026 · Cloud Native

Master Enterprise Monitoring: Build a Prometheus + Grafana Observability Platform

This guide details how to design and implement an enterprise‑grade cloud‑native observability platform using Prometheus for metrics collection and Grafana for visualization, covering architecture, high‑availability deployment, alerting, dashboard automation, case studies, best‑practice recommendations, and future trends.

Cloud Native · Grafana · Observability
0 likes · 24 min read
High Availability Architecture
Feb 22, 2026 · Artificial Intelligence

Why Traces, Not Code, Are the New Source of Truth in AI Agents

The article explains how AI agent development shifts the source of truth from static code to dynamic execution traces, reshaping debugging, testing, performance optimization, monitoring, and team collaboration around trace‑based observability for reliable, high‑quality agents.

AI agents · Observability · debugging
0 likes · 11 min read
Architect's Guide
Feb 21, 2026 · Backend Development

Essential Microservice Design Patterns Every Backend Engineer Should Know

This article surveys common microservice design patterns—including decomposition, integration, event‑driven, cross‑cutting concerns, and observability—explaining their goals, trade‑offs, and practical implementation steps to help architects build scalable, resilient backend systems.

Backend Architecture · Microservices · Observability
0 likes · 20 min read
Fighter's World
Feb 14, 2026 · Industry Insights

Can Pace’s Vertical AI Win the $70B Insurance BPO Market or Expand to a $400B BFSI Constellation?

The article analyzes how Pace, a tiny AI‑driven insurance BPO startup, aims to capture the $70 billion insurance BPO market with outcome‑based pricing and 100% POC success, while positioning itself for a longer‑term expansion into the $400 billion BFSI sector through reusable assets and a Constellation‑style acquisition strategy.

AI · BPO · FDE
0 likes · 22 min read
Alibaba Cloud Native
Feb 13, 2026 · Cloud Native

How a Tea Chain Achieved Seamless Mega‑Promotions with Cloud‑Native Architecture

Facing massive traffic spikes from viral marketing events, the leading tea brand Guming transformed its digital foundation by adopting a cloud‑native micro‑service architecture, leveraging Alibaba Cloud MSE and RocketMQ Serverless to achieve elastic scaling, cost savings, strong consistency, and full‑stack observability for stable, high‑speed operations.

Digital Transformation · Messaging · Microservices
0 likes · 8 min read
AI Tech Publishing
Feb 6, 2026 · Artificial Intelligence

2026 Large Model Engineering Roadmap: From Foundations to Production

This roadmap outlines a step‑by‑step learning path for building, optimizing, and safely deploying large language model systems, covering fundamentals, vector stores, RAG, advanced techniques, fine‑tuning, inference speed, deployment, observability, agents, and production safeguards.

Deployment · Fine-tuning · Inference
0 likes · 5 min read
Raymond Ops
Feb 2, 2026 · Operations

10 Essential PromQL Queries Every Ops Engineer Should Master

This article presents ten practical PromQL query examples covering CPU, memory, disk, network, database, Kubernetes, and business metrics, explains the underlying concepts, provides alert thresholds and best‑practice tips, and includes advanced optimization and alert‑rule design guidance for reliable monitoring.

Alerting · Metrics · Observability
0 likes · 22 min read
Architecture Digest
Jan 30, 2026 · Backend Development

How Hera Transforms SpringBoot Logging: A Step‑by‑Step Integration Guide

Integrating the Hera log platform into SpringBoot resolves common distributed‑system logging pain points—centralized storage, full‑trace linkages, and cost‑effective retention—by adding a non‑intrusive agent, configuring custom fields, enabling trace IDs, and providing a web console for rapid, multi‑service debugging and analysis.

Distributed Systems · Hera · Observability
0 likes · 14 min read
Code Wrench
Jan 27, 2026 · Artificial Intelligence

Building a Multi‑Agent AI System: Easy‑Agent’s Foreman, Coder, and Researcher

This article explains how the easy‑agent project evolved from a single monolithic AI into a multi‑agent architecture with specialized Foreman, Coder, and Researcher agents, covering design principles, communication mechanisms, task decomposition, fault tolerance, parallel execution, observability, and future extensions, complete with code examples and open‑source links.

AI · Agent Architecture · Go
0 likes · 13 min read
Alibaba Cloud Observability
Jan 26, 2026 · Cloud Native

How LoongCollector Delivers 10× Throughput and 80% Resource Savings in Cloud‑Native Observability

LoongCollector, the open‑source cloud‑native collector behind Alibaba Cloud's Simple Log Service, achieves ten‑fold higher throughput, up to 80% lower CPU and memory usage, near‑linear scaling, zero‑copy processing, lock‑free event pools and adaptive concurrency, while guaranteeing enterprise‑grade reliability for petabyte‑scale log and metric ingestion.

High Throughput · LoongCollector · Observability
0 likes · 16 min read
Alibaba Cloud Observability
Jan 26, 2026 · Cloud Native

Solving Edge Observability: How LoongCollector Ensures Reliable Data Collection

This article explains the three major challenges of collecting observability data on edge devices—unstable networks, reliable delivery, and bandwidth limits—and shows how LoongCollector’s persistent‑asynchronous architecture, smart back‑pressure, and configurable flow control provide a low‑resource, high‑reliability solution with real‑world performance results.

Edge Computing · Observability · Performance
0 likes · 14 min read
Efficient Ops
Jan 20, 2026 · Operations

Deploy Netdata for Real‑Time System Monitoring in Seconds

This guide introduces Netdata, an open‑source real‑time monitoring solution, outlines its key features, and provides step‑by‑step installation instructions for Linux and Docker, along with configuration of auto‑discovery, alerts, core metrics, and UI previews.

DevOps · Docker · Linux
0 likes · 5 min read
DevOps Coach
Jan 20, 2026 · Cloud Native

How to Scale Kubernetes to Hundreds of Clusters: A Practical Enterprise Guide

This article walks you through the complete journey from a single Kubernetes cluster to a production‑grade, multi‑cluster platform, covering managed services, capacity planning, GitOps pipelines, networking, observability, cost optimisation, upgrade strategies, and the people and processes needed for sustainable large‑scale operations.

Cloud Native · Cost Management · Infrastructure
0 likes · 27 min read
Alibaba Cloud Infrastructure
Jan 15, 2026 · Cloud Native

Deploy Alibaba Cloud Service Mesh (ASM): Gateways, Traffic Management & Zero‑Trust

This guide explains how to set up Alibaba Cloud Service Mesh (ASM) on an ACK Kubernetes cluster, covering prerequisites, two methods of cluster registration, creation of north‑south and east‑west gateways, traffic routing with HTTPRoute, security policies using PeerAuthentication and AuthorizationPolicy, and observability configuration via Telemetry.

ASM · Alibaba Cloud · Gateway API
0 likes · 9 min read
Alibaba Cloud Observability
Jan 12, 2026 · Mobile Development

How to Bridge the Mobile Observability Gap with End‑to‑End Trace Integration

This article explains why mobile‑side observability often falls into a black hole, outlines a four‑step solution that makes the mobile client the first hop of a distributed trace using standard protocols, and demonstrates the approach with a real‑world slow‑query debugging case on Alibaba Cloud RUM.

Mobile · Observability · Performance
0 likes · 14 min read
Alibaba Cloud Developer
Jan 12, 2026 · Operations

Why Traditional Monitoring Fails and How UModel Redefines Observability for AI‑Powered Ops

The article explains how legacy monitoring based on isolated metrics, traces, and logs cannot keep up with the massive, fragmented, and dynamic data of modern IT systems, and introduces UModel—a graph‑based observability model that bridges data, model, and engineering gaps to enable AI‑driven operations.

Graph Modeling · Observability · Operations
0 likes · 11 min read
Tech Verticals & Horizontals
Jan 8, 2026 · Artificial Intelligence

Google Agent Whitepaper: Building Production‑Ready AI Agents from Architecture to Ops

This whitepaper explains how modern AI agents evolve from simple language models to autonomous, multi‑step systems, detailing their core components, five‑step reasoning loop, classification levels, design patterns, deployment options, observability, security, and continuous learning with concrete examples.

AI agents · Agent Architecture · Deployment
0 likes · 49 min read
MaGe Linux Operations
Jan 7, 2026 · Operations

How to Eliminate Alert Fatigue: 10 Proven Prometheus Alerting Techniques

This comprehensive guide walks you through the architecture of Prometheus and Alertmanager, shows how to design, write, and test robust alert rules, and shares ten practical techniques—including proper for‑durations, rate() usage, recording rules, multi‑level alerts, and inhibition—to dramatically reduce alert noise and improve SRE reliability.

Alerting · Alertmanager · DevOps
0 likes · 40 min read
DeWu Technology
Jan 7, 2026 · Operations

From Chaos to Clarity: Building Full‑Stack Observability for Poizon’s Algorithm Ecosystem

This article details how Poizon’s algorithm platform evolved from fragmented tracing to a unified, scenario‑driven observability system that standardizes traces, metrics, logs, and events, introduces a knowledge‑graph of algorithm scenes, and applies compression, async reporting, and advanced anomaly detection to improve stability and debugging efficiency.

Algorithm Platform · Distributed Tracing · Log Standardization
0 likes · 26 min read
Huolala Tech
Jan 7, 2026 · Operations

How Exemplar Bridges the Last‑Mile Gap in Observability

Facing the “last mile” challenge of correlating metrics, logs, and traces, the article examines common heterogeneous storage architectures, critiques existing Exemplar implementations, and presents Huolala’s end‑to‑end solution that treats Exemplar as an independent observable dimension, detailing its data model, SDK integration, collector, and interactive visualization.

Exemplar · LogAggregation · Metrics
0 likes · 22 min read
Alibaba Cloud Native
Jan 3, 2026 · Operations

Turning Chaotic Observability Data into Actionable Graphs with UModel

This article examines the evolution of IT observability, explains why traditional metrics, traces, and logs fall short for AI‑driven operations, and introduces UModel—a graph‑based universal observability model that structures fragmented data into a semantic runtime context for autonomous AIOps agents.

Cloud Native · Graph Modeling · Observability
0 likes · 12 min read
MaGe Linux Operations
Dec 24, 2025 · Backend Development

Mastering OpenTelemetry: From Setup to Advanced Sampling and Production‑Ready Practices

This guide walks through the fundamentals of OpenTelemetry, covering component architecture, environment setup, SDK and Collector configuration for Java, Go, and Kubernetes, and dives into common pitfalls, performance tuning, security hardening, high‑availability deployment, and advanced tail‑based sampling strategies.

Collector · Distributed Tracing · Kubernetes
0 likes · 27 min read
DevOps Coach
Dec 22, 2025 · R&D Management

Why We Abandoned Scrum: Inside Our Developer‑Led Delivery Transformation

After discovering that traditional Agile rituals stifled high‑output engineering teams, we rebuilt our workflow around autonomous, domain‑owned squads using GitHub PRs, feature flags, and real‑time metrics, resulting in dramatically faster deployments, fewer incidents, and higher developer satisfaction.

Agile Transformation · Developer-Led Delivery · Flow Engineering
0 likes · 8 min read
Ray's Galactic Tech
Dec 19, 2025 · Cloud Native

Mastering Kubernetes Networking: From Core Model to Production‑Ready Practices

This comprehensive guide explains Kubernetes' core networking model, CNI plugins, service networking, ingress, network policies, DNS, service mesh, advanced CNI features, kube‑proxyless alternatives, multi‑cluster setups, security, observability, and troubleshooting techniques for building high‑performance, secure, and observable clusters.

CNI · Cloud Native · NetworkPolicy
0 likes · 10 min read
Alibaba Cloud Native
Dec 19, 2025 · Artificial Intelligence

What Enterprises Are Learning from the State of Agent Engineering Report

The recent LangChain "State of Agent Engineering" report, combined with data from the AI‑Native Application Architecture whitepaper, reveals rapid production adoption of AI agents, persistent quality challenges, widespread observability, multi‑model strategies, and evolving evaluation practices across organizations of all sizes.

AI agents · Evaluation · LLM
0 likes · 10 min read
Alibaba Cloud Observability
Dec 15, 2025 · Cloud Native

How UModel PaaS API Simplifies Observability Queries with Unified Entity Search

This article explains how the UModel PaaS API abstracts complex observability concepts—such as EntitySet, DataSet, StorageLink, and Filter—into a unified, object‑oriented query interface, offering Table, Object, and metadata modes, code examples, UI and SDK usage, and AI‑agent integration for efficient, low‑maintenance monitoring.

AI Agent · API · Cloud Native
0 likes · 16 min read
Ray's Galactic Tech
Dec 13, 2025 · Cloud Native

Mastering Kubernetes Observability: From Basic Metrics to Production‑Ready Practices

This guide explains how to build a robust Kubernetes observability system, covering core concepts, why traditional monitoring fails, paradigm shifts, best‑practice recommendations, and real‑world case studies that illustrate troubleshooting, alert design, cost and security monitoring, and a step‑by‑step adoption checklist.

Cloud Native · Observability · Prometheus
0 likes · 10 min read
Alibaba Cloud Native
Dec 9, 2025 · Cloud Native

How UModel Simplifies Observability with Unified Entity Search and Table/Object Modes

This article explains how UModel abstracts observability data into unified table and object models, hides complex routing and field‑mapping logic, provides a single SPL‑based query language, supports metadata reflection for AI agents, and offers SDK and dry‑run examples to streamline metric, log, and trace queries across multiple storage backends.

AI Agent · API · Observability
0 likes · 15 min read
Alibaba Cloud Observability
Dec 9, 2025 · Cloud Native

Unlocking System Insights with Graph Queries in Cloud‑Native Observability

This article explains how integrating graph‑based data models into cloud‑native observability platforms transforms isolated metric monitoring into a relational view, enabling powerful queries such as graph‑match and Cypher to perform fault impact analysis, root‑cause tracing, and security audits across services, pods, and infrastructure.

Cypher · Graph Database · Observability
0 likes · 29 min read
Alibaba Cloud Native
Dec 6, 2025 · Cloud Native

How Graph Queries Transform Cloud‑Native Observability and Fault Diagnosis

In modern cloud‑native systems, treating each service, container, or middleware as an isolated entity hides the essential connections between components, so this article explains how integrating graph‑based data models and query languages like graph‑match and Cypher unlocks powerful fault‑impact analysis, topology insights, and performance‑optimized troubleshooting.

Cypher · Observability · fault-analysis
0 likes · 28 min read
Alibaba Cloud Observability
Dec 1, 2025 · Cloud Native

How Entity Explorer Revolutionizes Cloud‑Native Observability with USearch and SPL

Entity Explorer provides a unified, high‑performance way to discover, query, and visualize billions of heterogeneous infrastructure, application, and business entities in cloud‑native environments, tackling massive data scale, semantic heterogeneity, and tight UI coupling through a USearch‑based search engine, scenario‑driven apps, dynamic topology, and model‑driven rendering.

Entity Explorer · Observability · SPL
0 likes · 18 min read
Alibaba Cloud Developer
Dec 1, 2025 · Operations

How to Uncover Hidden Java Memory Leaks in Kubernetes Pods with Alibaba Cloud OS Console

When migrating automotive workloads to cloud-native containers, unexpected OOMKilled pods often hide a large amount of Java memory consumption caused by JNI, libc, and Transparent Huge Pages, which can be identified and resolved using the Alibaba Cloud OS Console's memory panorama analysis and hotspot tracing features.

Alibaba Cloud · JNI · Kubernetes
0 likes · 11 min read
Huya Tech Engineering
Nov 28, 2025 · Operations

How LLMs Accelerate Root‑Cause Diagnosis in Large‑Scale Microservices

By abstracting a massive microservice system as a dynamic multi‑layer graph and integrating large language models, the article outlines three evolution stages—from manual expert debugging to rule‑based AIOps and finally LLM‑driven cognitive reasoning—detailing practical workflows, context engineering, and real‑world case studies that dramatically improve MTTR and accuracy.

Context Engineering · LLM · Microservices
0 likes · 20 min read
Java Web Project
Nov 27, 2025 · Artificial Intelligence

How Spring AI Alibaba Admin Overcomes Enterprise AI Agent Deployment Pain Points

Spring AI Alibaba Admin addresses three major engineering obstacles—inefficient prompt debugging, unreliable AI quality assessment, and opaque production operations—by providing a full AI agent lifecycle platform with versioned prompt management, dataset versioning, flexible evaluator configuration, experiment automation, and end‑to‑end observability.

AI Agent · Enterprise AI · Observability
0 likes · 10 min read
DevOps Coach
Nov 26, 2025 · Operations

Why Kubernetes Monitoring Is Essential and How to Implement Best Practices

This article explains why monitoring is critical in dynamic Kubernetes environments, outlines the expanded observability scope introduced by containers and the control plane, and provides a practical checklist of best‑practice steps—including namespaces, labeling, resource limits, health probes, centralized telemetry, automation, and version upgrades—to achieve reliable production‑grade observability.

Cloud Native · DevOps · Kubernetes
0 likes · 7 min read
Alibaba Cloud Native
Nov 26, 2025 · Cloud Native

How Entity Explorer Redefines Cloud‑Native Observability with Unified Queries and Model‑Driven UI

Entity Explorer introduces a unified, model‑driven approach to cloud‑native observability that classifies infrastructure, application, business, and operations entities, tackles massive‑scale data, heterogeneity, and UI coupling challenges, and delivers fast, contextual search and visual analysis through USearch and SPL languages.

Cloud Native · Entity · Observability
0 likes · 20 min read
IT Architects Alliance
Nov 25, 2025 · Operations

Making Architecture Decisions Observable with DevOps Monitoring

The article explains how to integrate architecture decision tracking into DevOps monitoring, detailing tagging, multi‑layer metric design, time‑window analysis, automated alerts, reporting, and continuous optimization to turn architectural choices into measurable, data‑driven outcomes.

DevOps · Metrics · Observability
0 likes · 9 min read
Alibaba Cloud Native
Nov 25, 2025 · Artificial Intelligence

AI‑Native Architecture Insights: Highlights from AgentX 2025 SECon

The AgentX 2025 SECon AI‑native application track, co‑hosted by Alibaba Cloud and the Institute of Information, delivered deep technical insights on AI‑native architecture, the AgentScope 1.0 framework, AI gateway capabilities, and observability‑driven reliability for long‑cycle agents, summarised here for practitioners.

AI gateway · AI-native · AgentScope
0 likes · 7 min read
DevOps Coach
Nov 24, 2025 · Operations

10 Essential Grafana Dashboards to Spot Incidents Early

This guide presents ten essential Grafana dashboards—covering SLO burn, user‑journey funnel, infrastructure USE metrics, queue lag, database health, cache hit‑rate, CDN latency, rollout guardrails, trace topology, and a command‑center view—each explained with its purpose, panel layout, and ready‑to‑use PromQL or LogQL queries.

Dashboards · Grafana · Observability
0 likes · 13 min read
Ops Development Stories
Nov 24, 2025 · Operations

How to Deploy OpenTelemetry, Grafana Tempo, and Jaeger with Docker Compose for End-to-End Tracing

This guide walks you through setting up a complete tracing pipeline using OpenTelemetry, Grafana Tempo, and Jaeger with Docker‑Compose, covering Tempo installation, collector configuration, sample application deployment, and Grafana UI integration to visualize traces, including code snippets and step‑by‑step commands.

Docker Compose · Grafana Tempo · Observability
0 likes · 7 min read
JavaGuide
Nov 19, 2025 · Artificial Intelligence

Spring AI 1.1 Released: Explosive New Features for Java AI Development

Spring AI 1.1.0 arrives with a major overhaul, adding out‑of‑the‑box Model Context Protocol support, five‑mode prompt caching that can cut LLM costs by up to 90%, reasoning APIs, recursive advisors, a broadened model ecosystem, enhanced vector‑store and chat‑memory options, and richer observability integrations.

AI integration · MCP · Observability
0 likes · 9 min read
Instant Consumer Technology Team
Nov 17, 2025 · Cloud Native

How We Built a Scalable Traffic Governance System for Thousands of Microservices

This article details a company’s step‑by‑step evolution from basic observability to a full‑stack traffic governance framework—including automated tracing, adaptive rate‑limiting, circuit‑breaking, and intelligent gray‑release—enabling stable operation of a microservice ecosystem with tens of thousands of instances while cutting MTTR to minutes and resource waste by over 20%.

Cloud Native · Microservices · Observability
0 likes · 24 min read
Alibaba Cloud Observability
Nov 17, 2025 · Operations

How to Build Full‑Stack Observability for Dify LLM Apps Using Alibaba Cloud Monitoring

This guide explains how to achieve end‑to‑end observability for Dify low‑code LLM applications by combining Dify's built‑in monitoring, third‑party tracing services like Langfuse, and Alibaba Cloud's CloudMonitor with Python and Go probes, covering component‑level tracing, configuration steps, and trace linking for debugging and performance optimization.

Alibaba Cloud · Dify · Observability
0 likes · 27 min read
Alibaba Cloud Developer
Nov 17, 2025 · Operations

Achieving Full‑Stack Observability for Dify Agentic Apps with Alibaba Cloud Monitoring

This guide explains the observability challenges of Dify's low‑code LLM platform, analyzes its native and third‑party monitoring capabilities, and provides a step‑by‑step solution using Alibaba Cloud's non‑intrusive Python and Go probes, Trace Link integration, and detailed deployment instructions to monitor every component from the API to plugins and sandbox.

Alibaba Cloud · Dify · Observability
0 likes · 28 min read
dbaplus Community
Nov 10, 2025 · Backend Development

Why Most Developers Fail at Logging and How to Master It

This article reveals common logging pitfalls that cause silent failures, explains three levels of logging maturity from rookie to expert, and provides concrete Java code examples, structured‑logging techniques, MDC usage, and automated alerting to turn logs into a powerful observability tool.

Observability · best-practices · error-handling
0 likes · 14 min read
DevOps Coach
Nov 10, 2025 · Operations

How to Use SRE Metrics for Data‑Driven Reliability and Faster Releases

This guide explains the SRE framework—SLA, SLO, SLI hierarchy, golden signals, error budgets, and DORA metrics—showing how to instrument a Python app with OpenTelemetry, query Prometheus, avoid common pitfalls, and adopt a cultural and technical process that balances feature velocity with system stability.

DoRA · Error Budget · Golden Signals
0 likes · 18 min read
Ops Development Stories
Nov 10, 2025 · Operations

Build a Low‑Cost Observability Platform with OpenObserve and Vector

This guide walks you through the architecture, deployment, and configuration of the Rust‑based OpenObserve observability platform together with the high‑performance Vector data pipeline, covering log, metric, and trace collection, Docker‑Compose setup, UI usage, and common FAQs for small teams.

Observability · Vector · cloud-native
0 likes · 11 min read
Alibaba Cloud Observability
Nov 10, 2025 · Cloud Native

How a Next‑Gen Cloud‑Native Observability Platform Boosted Ticketing Stability by 80%

A leading digital‑entertainment group tackled severe stability and monitoring challenges in its high‑traffic ticketing system by building a cloud‑native, full‑link observability platform on Alibaba Cloud, achieving an 80% improvement in fault detection speed, a 40% reduction in operational costs, and establishing data‑driven operations as the digital foundation for product growth.

Observability · Operations · aiops
0 likes · 15 min read
Efficient Ops
Nov 9, 2025 · Operations

How Tencent’s PCG Achieves Full‑Link Observability and AI‑Powered SRE

The talk details Tencent PCG’s end‑to‑end observability platform, its data‑standardization pipeline, client‑backend session linking, AI‑enhanced SRE Agent with large language models, and the roadmap toward a SaaS offering, illustrating how modern operations integrate AI for rapid fault localization.

AI · Observability · SRE
0 likes · 17 min read
Didi Tech
Nov 7, 2025 · Cloud Native

How Didi’s Open‑Source Projects Are Shaping Cloud‑Native Innovation at Zhejiang University

On November 3, Didi Open‑Source presented its ecosystem and four flagship projects—XIAOJUSURVEY, HUATUO, MPX, and KnowStreaming—to over a hundred Zhejiang University software students, sharing insights on enterprise‑grade open‑source practices, cloud‑native observability, cross‑platform development, and the role of open source in talent cultivation.

AI · Cross‑platform development · Observability
0 likes · 7 min read
Architect
Nov 6, 2025 · Operations

Why Most Teams Should Choose Loki Over ELK for Log Management – A Cost‑Effective Guide

This comprehensive guide compares ELK, EFK, and Loki log‑management solutions, analyzing their architecture, performance, cost, and use‑case suitability, and provides a decision framework, real‑world case studies, migration strategies, and optimization tips to help teams select the most efficient logging stack for their needs.

Cost Optimization · ELK · Log Management
0 likes · 36 min read
JakartaEE China Community
Nov 4, 2025 · Operations

How Logs, Traces, and Metrics Differ—and Why It Matters

Logs, tracing, and metrics each serve distinct monitoring goals—logs capture discrete events for debugging and audit, traces map request flows to pinpoint performance bottlenecks, and metrics provide time‑series health data; understanding their differences and integrating tools like ELK, OpenTelemetry, Prometheus, and Grafana enables robust observability.

ELK · Grafana · Metrics
0 likes · 7 min read
Mingyi World Elasticsearch
Nov 2, 2025 · Backend Development

What’s New in the Elasticsearch 9.x Documentation?

The Elasticsearch 9.x documentation has moved to a new URL, unified version handling, reorganized by solution use‑cases, separated release notes, added versioned API paths, and introduced client library navigation and versioning guides, all aimed at improving discoverability and developer efficiency.

API · Documentation · Elasticsearch
0 likes · 7 min read
FunTester
Oct 31, 2025 · Fundamentals

Master Defensive Programming: Turn Failures into Manageable Events

This article explains why defensive programming is essential, outlines its core principles, presents common failure scenarios and practical guidelines, and shows how testing and observability can turn inevitable errors into controlled, recoverable events that keep systems stable and maintainable.

Error Handling · Observability · defensive programming
0 likes · 9 min read
Alibaba Cloud Developer
Oct 30, 2025 · Artificial Intelligence

Why AI Agents Aren’t As Simple As They Appear: Engineering Challenges and Solutions

Building AI agents may seem straightforward with frameworks like LangChain, but hidden complexities in orchestration, memory management, reproducibility, and scalability turn simple demos into fragile systems, requiring systematic engineering, observability, and robust design to achieve reliable, production‑grade intelligent agents.

AI agents · Agent Design · LangChain
0 likes · 21 min read
Ops Community
Oct 29, 2025 · Cloud Native

ELK vs Loki: Which Kubernetes Log Solution Saves Cost and Boosts Performance?

This article compares ELK and Loki for Kubernetes log collection, covering scenarios, prerequisites, architectural differences, storage costs, query performance, deployment steps with Helm, best‑practice optimizations, and troubleshooting tips to help you choose the most efficient solution.

Cloud Native · ELK · Kubernetes
0 likes · 12 min read
Java Tech Enthusiast
Oct 28, 2025 · Backend Development

Why Rewriting a Java Microservice in Rust Cut Costs and Boosted Performance

A senior engineer recounts how replacing a noisy Java billing microservice with a lean Rust implementation slashed latency, reduced CPU and memory usage, lowered infrastructure bills, and exposed cultural and organizational challenges, offering a practical roadmap for teams considering similar migrations.

Backend · Observability · Rust
0 likes · 11 min read
Alibaba Cloud Observability
Oct 27, 2025 · Operations

From Data Silos to Intelligent Insights: Building Future‑Ready Operation Intelligence

This article explains how enterprises can transform massive, fragmented operation data—technical, business, and security—into high‑value intelligent signals by unifying storage, enriching context, applying AI, and delivering a single, observable platform that enables proactive, data‑driven decision making.

AI · Data Platform · Observability
0 likes · 18 min read
DevOps Coach
Oct 22, 2025 · Cloud Native

Simplify Scalable Kubernetes Pod Logging with Grafana podLogs

This guide explains how Grafana's podLogs feature, powered by Vector.dev, transforms raw Kubernetes pod logs into enriched, searchable, cluster‑wide observability data, covering why pod‑level logs matter, configuration steps, advanced custom log paths, and practical examples.

Cloud Native · Grafana · Kubernetes
0 likes · 14 min read
IT Architects Alliance
Oct 22, 2025 · Cloud Native

Avoid the Top 5 Cloud Migration Mistakes: Proven Cloud‑Native Strategies

This article analyzes the five most common cloud‑migration pitfalls—lift‑and‑shift, network latency, incomplete data‑architecture transformation, weak security models, and poor observability—offering concrete cloud‑native solutions, migration matrices, code examples, and best‑practice guidelines for successful architectural evolution.

Architecture · Cloud Native · DevOps
0 likes · 12 min read
Linux Kernel Journey
Oct 21, 2025 · Industry Insights

Bridging the GPU Observability Gap: Why eBPF on GPUs Matters

The article explains how bpftime extends eBPF to NVIDIA and AMD GPUs, exposing fine‑grained execution details that traditional CPU‑side tools miss, and demonstrates a unified, programmable observability stack that overcomes the limitations of existing GPU profilers in both synchronous and asynchronous workloads.

CUDA · GPU · Observability
0 likes · 23 min read
Alibaba Cloud Observability
Oct 20, 2025 · Cloud Native

How ‘泡姆泡姆’ Leverages Cloud‑Native Architecture for Global Low‑Latency Gaming

The multiplayer party game 泡姆泡姆 combines colorful shooting, match‑3, physics puzzles and arcade mini‑games, and uses a cloud‑native stack on Alibaba Cloud Container Service with OpenKruiseGame, KEDA‑driven auto‑scaling, multi‑region deployment, zero‑downtime updates and a three‑layer observability platform to deliver seamless low‑latency experiences worldwide.

Game Development · Observability · Scalability
0 likes · 10 min read
JavaGuide
Oct 17, 2025 · Artificial Intelligence

Alibaba Open‑Sources Spring AI Alibaba Admin: A Full‑Lifecycle AI Agent Platform

Spring AI Alibaba extends Spring AI with multi‑agent and enterprise features, but faces three engineering hurdles—inefficient prompt debugging, unguaranteed AI quality, and opaque operations—so Alibaba released Spring AI Alibaba Admin, offering prompt templating, dataset versioning, evaluator configuration, experiment management, and deep observability to streamline AI agent development and deployment.

AI Agent · Dataset Versioning · Evaluator
0 likes · 8 min read
Alibaba Cloud Native
Oct 16, 2025 · Artificial Intelligence

How Spring AI Alibaba Admin Powers Data‑Centric AI Agent Development and Ops

This article outlines the industry shift toward large‑scale AI Agent deployment, identifies key engineering challenges such as prompt management, quality assessment, and observability, and presents Spring AI Alibaba Admin—a cloud‑native platform that offers prompt, dataset, evaluator, and tracing capabilities, complete with setup instructions and future roadmap.

AI Agent · Nacos · Observability
0 likes · 15 min read
Huawei Cloud Developer Alliance
Oct 16, 2025 · Operations

How HyperRouter Enables Deterministic Operations for L4 Load Balancing

This article explains how Huawei Cloud's HyperRouter implements deterministic operations through a combination of L4/L7 load‑balancing co‑design, high‑performance data‑plane choices, self‑healing mechanisms, point‑to‑point architecture, Cell + Shuffle‑Sharding isolation, and user‑centric observability, providing a reproducible blueprint for reliable cloud services.

Cloud Native · DPDK · Observability
0 likes · 17 min read
MaGe Linux Operations
Oct 14, 2025 · Cloud Native

How Loki + S3 Cuts Log Storage Costs by Up to 90% at PB Scale

This article explains how the cloud‑native Loki logging system combined with S3 object storage can reduce PB‑level log storage expenses by 80‑90%, while simplifying operations, improving query performance, and meeting compliance requirements through detailed architecture, configuration, deployment, and real‑world case studies.

Cost Optimization · Log Management · Loki
0 likes · 23 min read
MaGe Linux Operations
Oct 12, 2025 · Operations

How to Balance Loki Tag Design and Chunk Compression to Tame Log Floods

Learn how to design low‑cardinality Loki tags, fine‑tune Chunk compression settings, and implement best‑practice configurations, pipelines, and monitoring to prevent memory overload, improve query performance, and efficiently manage massive log volumes in cloud‑native environments.

Log Management · Loki · Observability
0 likes · 38 min read
Cognitive Technology Team
Oct 12, 2025 · Backend Development

Resilient Microservices: Practical Patterns to Keep Your Services Alive

Learn how to tame chaotic microservices with practical resilience patterns—circuit breakers, bulkheads, smart retries, timeouts with fallbacks, and event‑driven messaging—plus tool recommendations and observability tips that ensure your system stays responsive even when individual services fail.

Observability · Resilience · Retry
0 likes · 9 min read