Tagged articles

Observability

1054 articles · Page 3 of 11

Jan 23, 2026 · Backend Development

Are You Still Using Old Java Patterns? 2026 Spring Boot Trends That May Obsolete Them

The article analyzes how Spring Boot is reshaping Java backend development in 2026 by embracing virtual threads, GraalVM native images, built‑in observability, modern security practices, immutable records, and cloud‑native defaults, urging developers to abandon legacy coding habits in favor of these emerging trends.

GraalVM · Native Image · Observability

0 likes · 8 min read

Volcano Engine Developer Services

Jan 21, 2026 · Operations

How Tail‑Based Sampling Boosts Distributed Tracing Accuracy While Cutting Costs

This article explains the challenges of accurate RED metric collection in high‑traffic microservices, compares head‑based and tail‑based sampling, and details Volcano Engine APMPlus's multi‑level, hash‑routed tail sampling design, performance optimizations, and real‑world evaluation results.

APM · Distributed Tracing · Kubernetes

0 likes · 13 min read

Efficient Ops

Jan 20, 2026 · Operations

Deploy Netdata for Real‑Time System Monitoring in Seconds

This guide introduces Netdata, an open‑source real‑time monitoring solution, outlines its key features, and provides step‑by‑step installation instructions for Linux and Docker, along with auto‑discovery and alert configuration, an overview of core metrics, and UI previews.

Docker · Linux · Netdata

0 likes · 5 min read

DevOps Coach

Jan 20, 2026 · Cloud Native

How to Scale Kubernetes to Hundreds of Clusters: A Practical Enterprise Guide

This article walks you through the complete journey from a single Kubernetes cluster to a production‑grade, multi‑cluster platform, covering managed services, capacity planning, GitOps pipelines, networking, observability, cost optimisation, upgrade strategies, and the people and processes needed for sustainable large‑scale operations.

Kubernetes · Observability · cloud-native

0 likes · 27 min read

Efficient Ops

Jan 18, 2026 · Cloud Native

How to Deploy Loki for Cloud‑Native Log Management with Promtail and Grafana

This guide explains Loki's lightweight cloud‑native logging architecture, shows step‑by‑step configuration of Promtail, Loki service, and Grafana integration, and provides concrete YAML and systemd examples for collecting and visualizing secure logs.

Grafana · LogQL · Observability

0 likes · 10 min read

Alibaba Cloud Infrastructure

Jan 15, 2026 · Cloud Native

Deploy Alibaba Cloud Service Mesh (ASM): Gateways, Traffic Management & Zero‑Trust

This guide explains how to set up Alibaba Cloud Service Mesh (ASM) on an ACK Kubernetes cluster, covering prerequisites, two methods of cluster registration, creation of north‑south and east‑west gateways, traffic routing with HTTPRoute, security policies using PeerAuthentication and AuthorizationPolicy, and observability configuration via Telemetry.

ASM · Alibaba Cloud · Gateway API

0 likes · 9 min read

Alibaba Cloud Observability

Jan 12, 2026 · Mobile Development

How to Bridge the Mobile Observability Gap with End‑to‑End Trace Integration

This article explains why mobile‑side observability often falls into a black hole, outlines a four‑step solution that makes the mobile client the first hop of a distributed trace using standard protocols, and demonstrates the approach with a real‑world slow‑query debugging case on Alibaba Cloud RUM.

Observability · Performance · Tracing

0 likes · 14 min read

Alibaba Cloud Developer

Jan 12, 2026 · Operations

Why Traditional Monitoring Fails and How UModel Redefines Observability for AI‑Powered Ops

The article explains how legacy monitoring based on isolated metrics, traces, and logs cannot keep up with the massive, fragmented, and dynamic data of modern IT systems, and introduces UModel—a graph‑based observability model that bridges data, model, and engineering gaps to enable AI‑driven operations.

AIOps · Graph Modeling · Observability

0 likes · 11 min read

Ops Development Stories

Jan 12, 2026 · Operations

Choosing the Best 2026 Observability Stack: From Collection to Alerts

This article reviews the 2026 observability landscape, outlines selection principles, compares open‑source and commercial solutions for data collection, storage, alerting and event management, and discusses how AI is reshaping monitoring and AIOps practices.

Alerting · Observability · SRE

0 likes · 9 min read

Alibaba Cloud Native

Jan 11, 2026 · Cloud Native

How to Bridge the Mobile Observability Gap with End‑to‑End Trace Integration

This article explains why mobile observability often falls into a black hole, outlines a four‑step solution that makes the mobile client the first hop of a distributed trace by sharing a common Trace ID, and demonstrates the approach with a real‑world slow‑query debugging case using Alibaba Cloud RUM.

APM · Distributed Tracing · Observability

0 likes · 13 min read

Tech Verticals & Horizontals

Jan 8, 2026 · Artificial Intelligence

Google Agent Whitepaper: Building Production‑Ready AI Agents from Architecture to Ops

This whitepaper explains how modern AI agents evolve from simple language models to autonomous, multi‑step systems, detailing their core components, five‑step reasoning loop, classification levels, design patterns, deployment options, observability, security, and continuous learning with concrete examples.

AI agents · Deployment · Multi-Agent Systems

0 likes · 49 min read

MaGe Linux Operations

Jan 7, 2026 · Operations

How to Eliminate Alert Fatigue: 10 Proven Prometheus Alerting Techniques

This comprehensive guide walks you through the architecture of Prometheus and Alertmanager, shows how to design, write, and test robust alert rules, and shares ten practical techniques (including appropriate 'for' durations, rate() usage, recording rules, multi‑level alerts, and inhibition) to dramatically reduce alert noise and improve SRE reliability.

Alerting · Alertmanager · Observability

0 likes · 40 min read

DeWu Technology

Jan 7, 2026 · Operations

From Chaos to Clarity: Building Full‑Stack Observability for Poizon’s Algorithm Ecosystem

This article details how Poizon’s algorithm platform evolved from fragmented tracing to a unified, scenario‑driven observability system that standardizes traces, metrics, logs, and events, introduces a knowledge graph of algorithm scenarios, and applies compression, async reporting, and advanced anomaly detection to improve stability and debugging efficiency.

Algorithm Platform · Anomaly Detection · Distributed Tracing

0 likes · 26 min read

Huolala Tech

Jan 7, 2026 · Operations

How Exemplar Bridges the Last‑Mile Gap in Observability

Facing the “last mile” challenge of correlating metrics, logs, and traces, the article examines common heterogeneous storage architectures, critiques existing Exemplar implementations, and presents Huolala’s end‑to‑end solution that treats Exemplar as an independent observable dimension, detailing its data model, SDK integration, collector, and interactive visualization.

Exemplar · LogAggregation · Observability

0 likes · 22 min read

LuTiao Programming

Jan 5, 2026 · Backend Development

8 Spring Boot Trends Shaping 2026: What You Must Adopt Now

The article outlines eight pivotal 2026 Spring Boot trends—from mandatory migration to Java 17 and Jakarta namespaces, GraalVM native images, reactive WebFlux, OpenAPI/GraphQL APIs, zero‑trust Spring Security 6, Spring Cloud governance, Actuator observability, to AI integration—explaining why each matters and how to prepare.

AI integration · GraalVM · Observability

0 likes · 8 min read

Alibaba Cloud Observability

Jan 5, 2026 · Cloud Native

How Go Compile‑Time Instrumentation Enables Zero‑Code OpenTelemetry Tracing

The article explains a Go compile‑time instrumentation tool that automatically injects OpenTelemetry tracing into binaries without source changes, compares it with eBPF and manual instrumentation, and provides usage steps, code examples, and links to the open‑source project.

Automatic tracing · Compile‑time instrumentation · Observability

0 likes · 8 min read

Past Memory Big Data

Jan 4, 2026 · Industry Insights

Upgrade Your Stack: 2025 Apache Top-Level Projects You Should Know

The article reviews the eleven Apache projects graduating to top-level status in 2025, explaining how each (ranging from big‑data shuffle services and unified data processing to DevOps analytics, web frameworks, and messaging platforms) addresses specific infrastructure challenges and why they merit inclusion in modern technology stacks.

Data Infrastructure · Observability · Web Framework

0 likes · 11 min read

Alibaba Cloud Native

Jan 3, 2026 · Operations

Turning Chaotic Observability Data into Actionable Graphs with UModel

This article examines the evolution of IT observability, explains why traditional metrics, traces, and logs fall short for AI‑driven operations, and introduces UModel—a graph‑based universal observability model that structures fragmented data into a semantic runtime context for autonomous AIOps agents.

AIOps · Graph Modeling · Observability

0 likes · 12 min read

IT Services Circle

Jan 2, 2026 · Backend Development

Unlock Go 1.26’s New Goroutine Scheduling Metrics for Better Observability

Go 1.26 introduces six runtime/metrics counters that expose total and current goroutine counts, runnable and running states, goroutines in system calls, goroutines waiting on resources, and the number of active threads, enabling precise production‑level monitoring and faster diagnosis of scheduling issues.

Observability · Performance · Runtime Metrics

0 likes · 8 min read

Alibaba Cloud Native

Dec 30, 2025 · Cloud Native

Key Takeaways from Guangzhou AI‑Native App Salon: AgentScope, HiMarket, Serverless

The Guangzhou AI‑native application salon gathered over 140 tech professionals to share deep technical insights on AgentScope Java, the HiMarket AI platform, Serverless‑based AgentRun, LoongSuite observability, and RocketMQ‑driven A2A communication, concluding with a hands‑on workshop for building intelligent agents.

AI · Observability · Serverless

0 likes · 4 min read

MaGe Linux Operations

Dec 24, 2025 · Backend Development

Mastering OpenTelemetry: From Setup to Advanced Sampling and Production‑Ready Practices

This guide walks through the fundamentals of OpenTelemetry, covering component architecture, environment setup, SDK and Collector configuration for Java, Go, and Kubernetes, and dives into common pitfalls, performance tuning, security hardening, high‑availability deployment, and advanced tail‑based sampling strategies.

Collector · Distributed Tracing · Jaeger

0 likes · 27 min read

Amazon Cloud Developers

Dec 24, 2025 · Artificial Intelligence

Evaluating Agent Observability: A Multi‑Dimensional Framework for Behavior, Quality, and Cost

The guide outlines a comprehensive, multi‑dimensional observability framework for AI agents—covering behavior insight, quality assessment, latency and token metrics, tool‑call tracking, error tracing, and cost monitoring—while demonstrating practical implementation with OpenTelemetry, Amazon CloudWatch, and open‑source tools such as MLflow and Langfuse.

Amazon CloudWatch · LangFuse · MLflow

0 likes · 27 min read

xkx's Tech General Store

Dec 23, 2025 · Operations

Why Teams Choose SkyWalking: Lightweight Deployment and Monitoring Tips

This article walks through the architecture, single‑node deployment steps, configuration details, core feature usage with a RuoYi example, and common pitfalls of Apache SkyWalking, showing how backend teams can quickly achieve observability for microservices.

APM · Deployment · Observability

0 likes · 8 min read

DevOps Coach

Dec 22, 2025 · R&D Management

Why We Abandoned Scrum: Inside Our Developer‑Led Delivery Transformation

After discovering that traditional Agile rituals stifled high‑output engineering teams, we rebuilt our workflow around autonomous, domain‑owned squads using GitHub PRs, feature flags, and real‑time metrics, resulting in dramatically faster deployments, fewer incidents, and higher developer satisfaction.

Agile Transformation · Developer-Led Delivery · Flow Engineering

0 likes · 8 min read

Ray's Galactic Tech

Dec 19, 2025 · Cloud Native

Mastering Kubernetes Networking: From Core Model to Production‑Ready Practices

This comprehensive guide explains Kubernetes' core networking model, CNI plugins, service networking, ingress, network policies, DNS, service mesh, advanced CNI features, kube‑proxyless alternatives, multi‑cluster setups, security, observability, and troubleshooting techniques for building high‑performance, secure, and observable clusters.

CNI · NetworkPolicy · Observability

0 likes · 10 min read

Alibaba Cloud Native

Dec 19, 2025 · Artificial Intelligence

What Enterprises Are Learning from the State of Agent Engineering Report

The recent LangChain "State of Agent Engineering" report, combined with data from the AI‑Native Application Architecture whitepaper, reveals rapid production adoption of AI agents, persistent quality challenges, widespread adoption of observability, multi‑model strategies, and evolving evaluation practices across organizations of all sizes.

AI agents · Evaluation · LLM

0 likes · 10 min read

Ray's Galactic Tech

Dec 17, 2025 · Cloud Native

Mastering Kubernetes Rolling Updates: From Safe Deployments to Automated Rollbacks

This article systematically explains production‑grade Kubernetes rolling updates, covering core principles, parameter tuning, risk‑control mechanisms, rollback strategies, monitoring integration, and advanced deployment patterns to achieve zero‑downtime releases with automated safety nets.

CI/CD · Deployment · GitOps

0 likes · 13 min read

Alibaba Cloud Observability

Dec 15, 2025 · Cloud Native

How UModel PaaS API Simplifies Observability Queries with Unified Entity Search

This article explains how the UModel PaaS API abstracts complex observability concepts—such as EntitySet, DataSet, StorageLink, and Filter—into a unified, object‑oriented query interface, offering Table, Object, and metadata modes, code examples, UI and SDK usage, and AI‑agent integration for efficient, low‑maintenance monitoring.

AI Agent · API · Observability

0 likes · 16 min read

Ray's Galactic Tech

Dec 13, 2025 · Cloud Native

Mastering Kubernetes Observability: From Basic Metrics to Production‑Ready Practices

This guide explains how to build a robust Kubernetes observability system, covering core concepts, why traditional monitoring fails, paradigm shifts, best‑practice recommendations, and real‑world case studies that illustrate troubleshooting, alert design, cost and security monitoring, and a step‑by‑step adoption checklist.

Observability · Prometheus · cloud-native

0 likes · 10 min read

Java Companion

Dec 12, 2025 · Backend Development

Integrate OpenTelemetry with Spring Boot in 5 Minutes for Microservice Monitoring and Tracing

This guide shows how to quickly add OpenTelemetry to a Spring Boot microservice, covering Docker‑based Jaeger setup, Maven dependencies, YAML configuration, automatic instrumentation, custom spans, production tuning, e‑commerce tracing examples, and common pitfalls to avoid.

Grafana · Jaeger · Observability

0 likes · 9 min read

Alibaba Cloud Native

Dec 9, 2025 · Cloud Native

How UModel Simplifies Observability with Unified Entity Search and Table/Object Modes

This article explains how UModel abstracts observability data into unified table and object models, hides complex routing and field‑mapping logic, provides a single SPL‑based query language, supports metadata reflection for AI agents, and offers SDK and dry‑run examples to streamline metric, log, and trace queries across multiple storage backends.

AI Agent · API · Observability

0 likes · 15 min read

Alibaba Cloud Observability

Dec 9, 2025 · Cloud Native

Unlocking System Insights with Graph Queries in Cloud‑Native Observability

This article explains how integrating graph‑based data models into cloud‑native observability platforms transforms isolated metric monitoring into a relational view, enabling powerful queries such as graph‑match and Cypher to perform fault impact analysis, root‑cause tracing, and security audits across services, pods, and infrastructure.

Cypher · Observability · Performance Optimization

0 likes · 29 min read

Alibaba Cloud Native

Dec 6, 2025 · Cloud Native

How Graph Queries Transform Cloud‑Native Observability and Fault Diagnosis

In modern cloud‑native systems, treating each service, container, or middleware as an isolated entity hides the essential connections between components, so this article explains how integrating graph‑based data models and query languages like graph‑match and Cypher unlocks powerful fault‑impact analysis, topology insights, and performance‑optimized troubleshooting.

Cypher · Observability · fault-analysis

0 likes · 28 min read

Alibaba Cloud Native

Dec 4, 2025 · Cloud Native

Mastering Entity Queries in UModel: Fast, Cross‑Domain Retrieval and Analysis

This article explains how UModel’s Entity query, built on the USearch engine, enables fast, precise, and cross‑domain retrieval of runtime entity data, outlines its storage architecture, query syntax, scoring mechanisms, performance tips, and real‑world use cases for observability operations.

Observability · SPL · Search

0 likes · 14 min read

Smart Era Software Development

Dec 2, 2025 · Artificial Intelligence

The Prompt Software Crisis: Engineering Challenges of Agentic AI Systems

The rise of large language models has created a prompt‑software crisis for Agentic AI, where fragile natural‑language prompts cause robustness, observability, and adaptability problems that existing software‑engineering methods fail to address, motivating the need for a new systematic framework.

Adaptability · Observability · Robustness

0 likes · 12 min read

Alibaba Cloud Developer

Dec 2, 2025 · Operations

How a Multi‑Agent AI System Revolutionizes AIOps Root‑Cause Analysis

This article details a multi‑agent AIOps solution built on the Dify platform that automates fault detection, root‑cause analysis, and incident reporting by integrating metrics, logs, and trace data, dramatically reducing mean time to detect and resolve complex cloud‑native service failures.

AIOps · Dify · MCP

0 likes · 41 min read

Alibaba Cloud Observability

Dec 1, 2025 · Cloud Native

How Entity Explorer Revolutionizes Cloud‑Native Observability with USearch and SPL

Entity Explorer provides a unified, high‑performance way to discover, query, and visualize billions of heterogeneous infrastructure, application, and business entities in cloud‑native environments, tackling massive data scale, semantic heterogeneity, and tight UI coupling through a USearch‑based search engine, scenario‑driven apps, dynamic topology, and model‑driven rendering.

Entity Explorer · Observability · SPL

0 likes · 18 min read

Alibaba Cloud Developer

Dec 1, 2025 · Operations

How to Uncover Hidden Java Memory Leaks in Kubernetes Pods with Alibaba Cloud OS Console

When migrating automotive workloads to cloud-native containers, unexpected OOMKilled pods often trace back to large amounts of hidden Java process memory consumed by JNI, libc, and Transparent Huge Pages, which can be identified and resolved using the Alibaba Cloud OS Console's memory panorama analysis and hotspot tracing features.

Alibaba Cloud · JNI · Java

0 likes · 11 min read

Huya Tech Engineering

Nov 28, 2025 · Operations

How LLMs Accelerate Root‑Cause Diagnosis in Large‑Scale Microservices

By abstracting a massive microservice system as a dynamic multi‑layer graph and integrating large language models, the article outlines three evolution stages—from manual expert debugging to rule‑based AIOps and finally LLM‑driven cognitive reasoning—detailing practical workflows, context engineering, and real‑world case studies that dramatically improve MTTR and accuracy.

AIOps · Context Engineering · LLM

0 likes · 20 min read

Java Web Project

Nov 27, 2025 · Artificial Intelligence

How Spring AI Alibaba Admin Overcomes Enterprise AI Agent Deployment Pain Points

Spring AI Alibaba Admin addresses three major engineering obstacles—inefficient prompt debugging, unreliable AI quality assessment, and opaque production operations—by providing a full AI agent lifecycle platform with versioned prompt management, dataset versioning, flexible evaluator configuration, experiment automation, and end‑to‑end observability.

AI Agent · Enterprise AI · Observability

0 likes · 10 min read

DevOps Coach

Nov 26, 2025 · Operations

Why Kubernetes Monitoring Is Essential and How to Implement Best Practices

This article explains why monitoring is critical in dynamic Kubernetes environments, outlines the expanded observability scope introduced by containers and the control plane, and provides a practical checklist of best‑practice steps—including namespaces, labeling, resource limits, health probes, centralized telemetry, automation, and version upgrades—to achieve reliable production‑grade observability.

Best Practices · Kubernetes · Observability

0 likes · 7 min read

Alibaba Cloud Native

Nov 26, 2025 · Cloud Native

How Entity Explorer Redefines Cloud‑Native Observability with Unified Queries and Model‑Driven UI

Entity Explorer introduces a unified, model‑driven approach to cloud‑native observability that classifies infrastructure, application, business, and operations entities, tackles massive‑scale data, heterogeneity, and UI coupling challenges, and delivers fast, contextual search and visual analysis through USearch and the SPL query language.

Entity · Observability · SPL

0 likes · 20 min read

IT Architects Alliance

Nov 25, 2025 · Operations

Making Architecture Decisions Observable with DevOps Monitoring

The article explains how to integrate architecture decision tracking into DevOps monitoring, detailing tagging, multi‑layer metric design, time‑window analysis, automated alerts, reporting, and continuous optimization to turn architectural choices into measurable, data‑driven outcomes.

Observability · cloud-native · devops

0 likes · 9 min read

Alibaba Cloud Native

Nov 25, 2025 · Artificial Intelligence

AI‑Native Architecture Insights: Highlights from AgentX 2025 SECon

The AgentX 2025 SECon AI‑native application track, co‑hosted by Alibaba Cloud and the Institute of Information, delivered deep technical insights on AI‑native architecture, the AgentScope 1.0 framework, AI gateway capabilities, and observability‑driven reliability for long‑cycle agents, summarised here for practitioners.

AI gateway · AI-native · AgentScope

0 likes · 7 min read

DevOps Coach

Nov 24, 2025 · Operations

10 Essential Grafana Dashboards to Spot Incidents Early

This guide presents ten essential Grafana dashboards—covering SLO burn, user‑journey funnel, infrastructure USE metrics, queue lag, database health, cache hit‑rate, CDN latency, rollout guardrails, trace topology, and a command‑center view—each explained with its purpose, panel layout, and ready‑to‑use PromQL or LogQL queries.

Dashboards · Grafana · Observability

0 likes · 13 min read

Ops Development Stories

Nov 24, 2025 · Operations

How to Deploy OpenTelemetry, Grafana Tempo, and Jaeger with Docker Compose for End-to-End Tracing

This guide walks you through setting up a complete tracing pipeline using OpenTelemetry, Grafana Tempo, and Jaeger with Docker Compose, covering Tempo installation, collector configuration, sample application deployment, and Grafana UI integration to visualize traces, with code snippets and step‑by‑step commands.

Docker Compose · Grafana Tempo · Observability

0 likes · 7 min read

Code Ape Tech Column

Nov 22, 2025 · Backend Development

What’s New in Spring Boot 4? A Deep Dive into the Latest Spring Ecosystem Overhaul

Spring Boot 4 launches alongside Spring Framework 7, Spring Data 2025.1 and Spring AI 1.1, introducing Jakarta EE 11, JSpecify null‑safety, build‑time optimizations with Project Leyden, a declarative HTTP client, Jackson 3 support, native API versioning, integrated OpenTelemetry, and a dual‑track AI strategy.

AI · Java · Observability

0 likes · 8 min read

DevOps Coach

Nov 22, 2025 · Operations

What’s New in Grafana 12.3? Interactive Learning, Deep Log Insights, and Expanded Data Sources

Grafana 12.3 adds Interactive Learning for context‑aware help, a rebuilt log panel with faster rendering and richer features, new visualization options like panel‑level time settings and Switch variables, plus numerous data‑source enhancements and a critical CVE‑2025‑41115 security fix.

DataSources · Grafana · Logging

0 likes · 11 min read

JavaGuide

Nov 19, 2025 · Artificial Intelligence

Spring AI 1.1 Released: Explosive New Features for Java AI Development

Spring AI 1.1.0 arrives with a major overhaul, adding out‑of‑the‑box Model Context Protocol support, five‑mode prompt caching that can cut LLM costs by up to 90%, reasoning APIs, recursive advisors, a broadened model ecosystem, enhanced vector‑store and chat‑memory options, and richer observability integrations.

AI integration · Java · MCP

0 likes · 9 min read

Instant Consumer Technology Team

Nov 17, 2025 · Cloud Native

How We Built a Scalable Traffic Governance System for Thousands of Microservices

This article details a company’s step‑by‑step evolution from basic observability to a full‑stack traffic governance framework—including automated tracing, adaptive rate‑limiting, circuit‑breaking, and intelligent gray‑release—enabling stable operation of a microservice ecosystem with tens of thousands of instances while cutting MTTR to minutes and resource waste by over 20%.

Observability · Service Mesh · Traffic Management

0 likes · 24 min read

Alibaba Cloud Observability

Nov 17, 2025 · Operations

How to Build Full‑Stack Observability for Dify LLM Apps Using Alibaba Cloud Monitoring

This guide explains how to achieve end‑to‑end observability for Dify low‑code LLM applications by combining Dify's built‑in monitoring, third‑party tracing services like Langfuse, and Alibaba Cloud's CloudMonitor with Python and Go probes, covering component‑level tracing, configuration steps, and trace linking for debugging and performance optimization.

Alibaba Cloud · Dify · Observability

0 likes · 27 min read

Alibaba Cloud Developer

Nov 17, 2025 · Operations

Achieving Full‑Stack Observability for Dify Agentic Apps with Alibaba Cloud Monitoring

This guide explains the observability challenges of Dify's low‑code LLM platform, analyzes its native and third‑party monitoring capabilities, and provides a step‑by‑step solution using Alibaba Cloud's non‑intrusive Python and Go probes, Trace Link integration, and detailed deployment instructions to monitor every component from the API to plugins and sandbox.

Alibaba Cloud · Dify · Observability

0 likes · 28 min read

Network Intelligence Research Center (NIRC)

Nov 15, 2025 · Cloud Native

Why OpenTelemetry Is Becoming the De Facto Observability Standard for Cloud‑Native Systems

The article explains OpenTelemetry’s three core components—SDKs, Collector, and Operator—detailing how the Operator’s automatic injection simplifies Kubernetes deployments and how the modular Collector can export telemetry to any backend such as Jaeger.

Collector · Kubernetes · Observability

0 likes · 7 min read

dbaplus Community

Nov 10, 2025 · Backend Development

Why Most Developers Fail at Logging and How to Master It

This article reveals common logging pitfalls that cause silent failures, explains three levels of logging maturity from rookie to expert, and provides concrete Java code examples, structured‑logging techniques, MDC usage, and automated alerting to turn logs into a powerful observability tool.

Logging · MDC · Observability

0 likes · 14 min read

DevOps Coach

Nov 10, 2025 · Operations

How to Use SRE Metrics for Data‑Driven Reliability and Faster Releases

This guide explains the SRE framework—SLA, SLO, SLI hierarchy, golden signals, error budgets, and DORA metrics—showing how to instrument a Python app with OpenTelemetry, query Prometheus, avoid common pitfalls, and adopt a cultural and technical process that balances feature velocity with system stability.

DORA · Error Budget · Golden Signals

0 likes · 18 min read

Ops Development Stories

Nov 10, 2025 · Operations

Build a Low‑Cost Observability Platform with OpenObserve and Vector

This guide walks you through the architecture, deployment, and configuration of the Rust‑based OpenObserve observability platform together with the high‑performance Vector data pipeline, covering log, metric, and trace collection, Docker Compose setup, UI usage, and FAQs for small teams.

Observability · Tracing · Vector

0 likes · 11 min read

Alibaba Cloud Observability

Nov 10, 2025 · Cloud Native

How a Next‑Gen Cloud‑Native Observability Platform Boosted Ticketing Stability by 80%

A leading digital‑entertainment group tackled severe stability and monitoring challenges in its high‑traffic ticketing system by building a cloud‑native, full‑link observability platform on Alibaba Cloud. The result was an 80% improvement in fault detection speed and a 40% reduction in operational costs, with data‑driven operations established as the digital foundation for product growth.

AIOps · Observability · Operations

0 likes · 15 min read

Efficient Ops

Nov 9, 2025 · Operations

How Tencent’s PCG Achieves Full‑Link Observability and AI‑Powered SRE

The talk details Tencent PCG’s end‑to‑end observability platform, its data‑standardization pipeline, client‑to‑backend session linking, an SRE Agent enhanced with large language models, and the roadmap toward a SaaS offering, illustrating how modern operations teams integrate AI for rapid fault localization.

AI · Large Language Model · Observability

0 likes · 17 min read

Didi Tech

Nov 7, 2025 · Cloud Native

How Didi’s Open‑Source Projects Are Shaping Cloud‑Native Innovation at Zhejiang University

On November 3, Didi Open‑Source presented its ecosystem and four flagship projects (XIAOJUSURVEY, HUATUO, MPX, and KnowStreaming) to more than a hundred software students at Zhejiang University, sharing insights on enterprise‑grade open‑source practices, cloud‑native observability, cross‑platform development, and the role of open source in cultivating talent.

AI · Cross‑Platform Development · Observability

0 likes · 7 min read

Architect

Nov 6, 2025 · Operations

Why Most Teams Should Choose Loki Over ELK for Log Management – A Cost‑Effective Guide

This comprehensive guide compares ELK, EFK, and Loki log‑management solutions, analyzing their architecture, performance, cost, and use‑case suitability, and provides a decision framework, real‑world case studies, migration strategies, and optimization tips to help teams select the most efficient logging stack for their needs.

Cost Optimization · ELK · Observability

0 likes · 36 min read

Java Backend Technology

Nov 6, 2025 · Operations

Boost Java Performance with MyPerf4J: High‑Throughput, Low‑Latency Monitoring

MyPerf4J is a high‑throughput, low‑latency Java performance monitoring tool that uses a non‑intrusive JavaAgent to collect real‑time method, memory, GC, and class metrics, offering developers quick bottleneck detection in development and continuous observability in production.

Java · JavaAgent · Observability

0 likes · 6 min read

DevOps Coach

Nov 5, 2025 · Operations

How to Pilot AIOps: A Practical Guide to Reducing Alert Noise and Boosting Reliability

This guide explains what AIOps is, why it matters, how it fits into modern observability stacks, and provides a step‑by‑step pilot plan, quick‑win ideas, build‑or‑buy considerations, a tiny Python anomaly‑detection sample, safety tips, risk traps, and metrics to prove its impact.

AIOps · Alert Noise Reduction · Anomaly Detection

0 likes · 12 min read

Linux Ops Smart Journey

Nov 4, 2025 · Operations

How to Build a Production‑Ready, High‑Availability VictoriaMetrics Cluster

This guide walks you through deploying a fault‑tolerant, scalable VictoriaMetrics monitoring cluster on bare‑metal or virtual machines, covering architecture, component setup, systemd services, HAProxy load balancing, and verification steps for a production‑grade observability solution.

HAProxy · High Availability · Observability

0 likes · 8 min read

JakartaEE China Community

Nov 4, 2025 · Operations

How Logs, Traces, and Metrics Differ—and Why It Matters

Logs, traces, and metrics each serve distinct monitoring goals: logs capture discrete events for debugging and auditing, traces map request flows to pinpoint performance bottlenecks, and metrics provide time‑series health data. Understanding their differences and integrating tools such as ELK, OpenTelemetry, Prometheus, and Grafana enables robust observability.

ELK · Grafana · Observability

0 likes · 7 min read

Mingyi World Elasticsearch

Nov 2, 2025 · Backend Development

What’s New in the Elasticsearch 9.x Documentation?

The Elasticsearch 9.x documentation has moved to a new URL, unified its version handling, been reorganized around solution use cases, separated out the release notes, added versioned API paths, and introduced client‑library navigation and versioning guides, all aimed at improving discoverability and developer efficiency.

API · Elasticsearch · Observability

0 likes · 7 min read

FunTester

Oct 31, 2025 · Fundamentals

Master Defensive Programming: Turn Failures into Manageable Events

This article explains why defensive programming is essential, outlines its core principles, presents common failure scenarios and practical guidelines, and shows how testing and observability can turn inevitable errors into controlled, recoverable events that keep systems stable and maintainable.

Defensive Programming · Error handling · Observability

0 likes · 9 min read

Alibaba Cloud Developer

Oct 30, 2025 · Artificial Intelligence

Why AI Agents Aren’t As Simple As They Appear: Engineering Challenges and Solutions

Building AI agents may seem straightforward with frameworks like LangChain, but hidden complexities in orchestration, memory management, reproducibility, and scalability turn simple demos into fragile systems, requiring systematic engineering, observability, and robust design to achieve reliable, production‑grade intelligent agents.

AI agents · Agent Design · LangChain

0 likes · 21 min read

Ops Community

Oct 29, 2025 · Cloud Native

ELK vs Loki: Which Kubernetes Log Solution Saves Cost and Boosts Performance?

This article compares ELK and Loki for Kubernetes log collection, covering scenarios, prerequisites, architectural differences, storage costs, query performance, deployment steps with Helm, best‑practice optimizations, and troubleshooting tips to help you choose the most efficient solution.

ELK · Kubernetes · Logging

0 likes · 12 min read

Java Tech Enthusiast

Oct 28, 2025 · Backend Development

Why Rewriting a Java Microservice in Rust Cut Costs and Boosted Performance

A senior engineer recounts how replacing a noisy Java billing microservice with a lean Rust implementation slashed latency, reduced CPU and memory usage, lowered infrastructure bills, and exposed cultural and organizational challenges, offering a practical roadmap for teams considering similar migrations.

Observability · Rust · Service Migration

0 likes · 11 min read

Rare Earth Juejin Tech Community

Oct 28, 2025 · Backend Development

Mastering Log Practices: From Rookie Mistakes to Expert Observability

This article walks developers through common logging pitfalls, explains three maturity levels of log implementation, and provides concrete Java examples and best‑practice techniques such as structured JSON logs, MDC trace IDs, and log‑bomb avoidance to turn logs into a powerful observability tool.

Logging · MDC · Observability

0 likes · 14 min read

Alibaba Cloud Observability

Oct 27, 2025 · Operations

From Data Silos to Intelligent Insights: Building Future‑Ready Operation Intelligence

This article explains how enterprises can transform massive, fragmented operation data—technical, business, and security—into high‑value intelligent signals by unifying storage, enriching context, applying AI, and delivering a single, observable platform that enables proactive, data‑driven decision making.

AI · Data Platform · Observability

0 likes · 18 min read

DevOps Coach

Oct 22, 2025 · Cloud Native

Simplify Scalable Kubernetes Pod Logging with Grafana podLogs

This guide explains how Grafana's podLogs feature, powered by Vector.dev, transforms raw Kubernetes pod logs into enriched, searchable, cluster‑wide observability data, covering why pod‑level logs matter, configuration steps, advanced custom log paths, and practical examples.

Grafana · Kubernetes · Logging

0 likes · 14 min read

IT Architects Alliance

Oct 22, 2025 · Cloud Native

Avoid the Top 5 Cloud Migration Mistakes: Proven Cloud‑Native Strategies

This article analyzes the five most common cloud‑migration pitfalls—lift‑and‑shift, network latency, incomplete data‑architecture transformation, weak security models, and poor observability—offering concrete cloud‑native solutions, migration matrices, code examples, and best‑practice guidelines for successful architectural evolution.

Cloud Migration · Observability · architecture

0 likes · 12 min read

Linux Kernel Journey

Oct 21, 2025 · Industry Insights

Bridging the GPU Observability Gap: Why eBPF on GPUs Matters

The article explains how bpftime extends eBPF to NVIDIA and AMD GPUs, exposing fine‑grained execution details that traditional CPU‑side tools miss, and demonstrates a unified, programmable observability stack that overcomes the limitations of existing GPU profilers in both synchronous and asynchronous workloads.

CUDA · GPU · Observability

0 likes · 23 min read

Alibaba Cloud Observability

Oct 20, 2025 · Cloud Native

How ‘泡姆泡姆’ Leverages Cloud‑Native Architecture for Global Low‑Latency Gaming

The multiplayer party game 泡姆泡姆 combines colorful shooting, match‑3, physics puzzles, and arcade mini‑games. To deliver a seamless, low‑latency experience worldwide, it runs a cloud‑native stack on Alibaba Cloud Container Service with OpenKruiseGame, KEDA‑driven autoscaling, multi‑region deployment, zero‑downtime updates, and a three‑layer observability platform.

Game Development · Observability · cloud-native

0 likes · 10 min read

JavaGuide

Oct 17, 2025 · Artificial Intelligence

Alibaba Open‑Sources Spring AI Alibaba Admin: A Full‑Lifecycle AI Agent Platform

Spring AI Alibaba extends Spring AI with multi‑agent and enterprise features, but teams still face three engineering hurdles: inefficient prompt debugging, no guarantee of AI output quality, and opaque operations. To address them, Alibaba released Spring AI Alibaba Admin, which offers prompt templating, dataset versioning, evaluator configuration, experiment management, and deep observability to streamline AI agent development and deployment.

AI Agent · Evaluator · Experiment Management

0 likes · 8 min read

Alibaba Cloud Native

Oct 16, 2025 · Artificial Intelligence

How Spring AI Alibaba Admin Powers Data‑Centric AI Agent Development and Ops

This article outlines the industry shift toward large‑scale AI agent deployment, identifies key engineering challenges such as prompt management, quality assessment, and observability, and presents Spring AI Alibaba Admin, a cloud‑native platform that offers prompt, dataset, evaluator, and tracing capabilities, complete with setup instructions and a future roadmap.

AI Agent · Java · Observability

0 likes · 15 min read

Linux Ops Smart Journey

Oct 16, 2025 · Operations

Master Nightingale Monitoring: Add Data Sources, Query Metrics, Build Dashboards

This guide walks you through setting up the open‑source Nightingale monitoring platform—adding Prometheus as a data source, performing metric queries with PromQL, and creating visual dashboards—providing practical steps for building an observable, reliable operations environment.

Nightingale · Observability · Prometheus

0 likes · 5 min read

Huawei Cloud Developer Alliance

Oct 16, 2025 · Operations

How HyperRouter Enables Deterministic Operations for L4 Load Balancing

This article explains how Huawei Cloud's HyperRouter implements deterministic operations through a combination of L4/L7 load‑balancing co‑design, high‑performance data‑plane choices, self‑healing mechanisms, point‑to‑point architecture, Cell + Shuffle‑Sharding isolation, and user‑centric observability, providing a reproducible blueprint for reliable cloud services.

DPDK · Observability · SRE

0 likes · 17 min read

Amazon Cloud Developers

Oct 15, 2025 · Artificial Intelligence

From PoC to Production: Build a Full‑Featured Customer Support Agent with Amazon Bedrock AgentCore

This article walks through turning a simple proof‑of‑concept customer‑support AI agent into a production‑ready system by leveraging Amazon Bedrock AgentCore services (Memory, Gateway, Identity, Observability, and Runtime) with only minimal code changes and without months of custom infrastructure work.

AI Agent · AgentCore · Amazon Bedrock

0 likes · 30 min read

MaGe Linux Operations

Oct 14, 2025 · Cloud Native

How Loki + S3 Cuts Log Storage Costs by Up to 90% at PB Scale

This article explains how the cloud‑native Loki logging system combined with S3 object storage can reduce PB‑level log storage expenses by 80‑90%, while simplifying operations, improving query performance, and meeting compliance requirements through detailed architecture, configuration, deployment, and real‑world case studies.

Cost Optimization · Observability · S3

0 likes · 23 min read

Amazon Cloud Developers

Oct 14, 2025 · Artificial Intelligence

Secure, Reliable, and Scalable AI Agent Deployments with Amazon Bedrock AgentCore

Amazon Bedrock AgentCore delivers an end‑to‑end, enterprise‑grade platform for building, deploying, and operating AI agents, offering built‑in security, observability, memory, tool integration, and runtime scaling that lets organizations move agents from pilot to production at scale.

AI agents · AgentCore · Amazon Bedrock

0 likes · 13 min read

Amazon Cloud Developers

Oct 13, 2025 · Artificial Intelligence

Agentic AI Guide: Building and Deploying Robust AI Agents

This article provides a comprehensive technical guide on Agentic AI, detailing the core modules, infrastructure requirements, security considerations, observability practices, and deployment strategies needed to develop and operate production‑ready AI agents.

AI agents · AgentOps · Memory Management

0 likes · 27 min read

MaGe Linux Operations

Oct 12, 2025 · Operations

How to Balance Loki Tag Design and Chunk Compression to Tame Log Floods

Learn how to design low‑cardinality Loki labels (tags), fine‑tune chunk compression settings, and apply best‑practice configurations, pipelines, and monitoring to prevent memory overload, improve query performance, and efficiently manage massive log volumes in cloud‑native environments.

Observability · chunk compression · log management

0 likes · 38 min read

Cognitive Technology Team

Oct 12, 2025 · Backend Development

Resilient Microservices: Practical Patterns to Keep Your Services Alive

Learn how to tame chaotic microservices with practical resilience patterns—circuit breakers, bulkheads, smart retries, timeouts with fallbacks, and event‑driven messaging—plus tool recommendations and observability tips that ensure your system stays responsive even when individual services fail.

Observability · Resilience · bulkhead

0 likes · 9 min read

Su San Talks Tech

Oct 10, 2025 · Operations

How to Boost System Stability: Observability, Resilience, and High‑Availability Strategies

This comprehensive guide explains how to improve system stability and reduce online incidents by building observability, implementing distributed tracing, applying rate‑limiting and circuit‑breaker patterns, adopting blue‑green and gray deployments, managing data consistency with distributed transactions, planning capacity, optimizing performance, and preparing emergency response plans.

Deployment Strategies · Distributed Tracing · High Availability

0 likes · 19 min read

Linux Code Review Hub

Oct 9, 2025 · Operations

Non‑Intrusive MCP Observability with eBPF: Introducing MCPSpy

The article explains how the emerging Model Context Protocol (MCP) for AI tools lacks visibility, outlines security and monitoring challenges, compares alternative tracing methods, and presents MCPSpy—a Linux‑only eBPF‑based, non‑intrusive solution that captures MCP stdio traffic, parses JSON‑RPC messages, and outputs human‑readable or JSON logs.

AI security · MCP · Observability

0 likes · 17 min read

Radish, Keep Going!

Oct 9, 2025 · Operations

Add Observability to Legacy Java Apps with OpenTelemetry Agent (Zero Code)

This guide shows how to use the OpenTelemetry Java Agent to instantly add observability—metrics, traces, and error reporting—to long‑standing legacy Java applications without modifying a single line of code, covering setup, environment configuration, health monitoring, performance tracing, and visualizing data in Grafana.

Java · Observability · OpenTelemetry

0 likes · 7 min read

MaGe Linux Operations

Oct 7, 2025 · Operations

7 Fatal Monitoring Alert Mistakes That Keep You Up at 3 AM—and How to Fix Them

This article examines why ops engineers are repeatedly woken by false alerts, outlines seven common monitoring alert pitfalls—from over‑alerting to static thresholds—and provides practical solutions such as golden‑signal rules, dynamic baselines, alert enrichment, routing, suppression, and continuous quality audits.

Alerting · Observability · Operations

0 likes · 27 min read

Architect's Guide

Oct 7, 2025 · Backend Development

Mastering Backend Architecture: From Microservices to Service Mesh and Message Queues

This article presents a comprehensive roadmap for backend architects, covering microservice fundamentals, design principles, gateway patterns, communication protocols, service registration, configuration management, observability pillars, service mesh options, and a detailed comparison of modern message‑queue technologies.

Message Queue · Observability · Service Mesh

0 likes · 29 min read

IT Architects Alliance

Oct 6, 2025 · Cloud Native

Mastering Cloud‑Native Observability: From Metrics to Tracing

The article explains why enterprises struggle with cloud‑native observability, outlines the exponential complexity and dynamic nature of modern microservice environments, and presents a comprehensive three‑pillar approach—metrics, logging, tracing—along with practical Prometheus, OpenTelemetry, and sidecar configurations, storage choices, sampling, alerting, cost‑control, team upskilling, and future trends such as AIOps and eBPF.

Observability · OpenTelemetry · Prometheus

0 likes · 12 min read

MaGe Linux Operations

Oct 6, 2025 · Cloud Native

Prometheus vs Cloud Provider Monitoring: Which Is the Most Cost‑Effective Choice for 2025?

This article compares open‑source Prometheus + Grafana with managed cloud monitoring services, evaluating deployment complexity, functionality, scalability, security, and total cost of ownership across small, medium, and large workloads, and provides practical decision‑making guidance for teams of different sizes and requirements.

Observability · Prometheus · cloud-native

0 likes · 56 min read

MaGe Linux Operations

Oct 5, 2025 · Operations

ELK vs EFK vs Loki: Which Log Solution Saves Money and Boosts Performance?

This in‑depth technical guide compares ELK, EFK, and Loki across cost, performance, deployment complexity, feature completeness, and suitability for small‑to‑large teams, providing real‑world case studies, decision trees, migration steps, and cost‑optimization tips to help you choose the most efficient logging stack for your organization.

EFK · ELK · Observability

0 likes · 39 min read

IT Architects Alliance

Oct 2, 2025 · Cloud Native

Mastering Cloud‑Native Architecture: 6 Core Principles Every Engineer Should Know

This article outlines six fundamental cloud‑native architecture principles—immutable infrastructure, service mesh, observability, declarative APIs, resilient design, and shift‑left security—explaining their purpose, key practices, code examples, and how they interrelate to build scalable, reliable, and secure distributed systems.

Declarative API · Observability · Resilience

0 likes · 11 min read

Alibaba Cloud Observability

Sep 29, 2025 · Cloud Native

How Bull Group Boosted Observability by Migrating from SkyWalking to Alibaba Cloud ARMS

This article details Bull Group's journey from an open‑source SkyWalking monitoring setup to Alibaba Cloud ARMS, outlining the architectural challenges, technical selection criteria, migration steps, and the resulting improvements in observability, AI‑IoT integration, and operational efficiency.

AI · APM · Alibaba Cloud

0 likes · 19 min read

Linux Ops Smart Journey

Sep 25, 2025 · Cloud Native

How to Monitor Envoy Metrics with Prometheus, Grafana, and Nacos

This guide explains how to enable Envoy's admin interface, register the service with Nacos, scrape metrics using Prometheus, and visualize them in Grafana, providing a complete observability pipeline for cloud‑native deployments.

Envoy · Grafana · Observability

0 likes · 4 min read

Tech Freedom Circle

Sep 25, 2025 · Operations

RAGFlow Link Tracing: GPS‑Style Observability for LLM‑Powered Applications

The article explains why RAGFlow needs end‑to‑end link tracing, introduces OpenTelemetry’s core concepts, shows how custom tracing utilities are implemented in Python, describes the layered architecture, provides concrete Docker and YAML configurations, and offers best‑practice guidelines for performance monitoring and fault diagnosis.

LLM · Observability · OpenTelemetry

0 likes · 24 min read

IT Architects Alliance

Sep 20, 2025 · Operations

Mastering Microservice Governance: Tracing, Config, and Monitoring Strategies

This article explores the three core challenges of microservice governance—distributed tracing, centralized configuration management, and comprehensive monitoring—offering practical solutions, tool comparisons, and best‑practice guidelines to help architects build reliable, observable, and maintainable systems.

Distributed Tracing · Observability · cloud-native

0 likes · 12 min read

MaGe Linux Operations

Sep 18, 2025 · Cloud Native

Master Helm: Proven Best Practices for Kubernetes Deployments

This comprehensive guide walks you through Helm's architecture, chart structuring, template development, dependency management, production deployment strategies, security hardening, observability integration, testing, performance tuning, and enterprise governance, providing actionable examples and code snippets to help you become a Helm expert in cloud‑native environments.

Deployment · Observability · chart

0 likes · 22 min read

Ops Community

Sep 15, 2025 · Cloud Native

Master Kubernetes Log Collection: From Basics to Advanced EFK & Loki Solutions

This comprehensive guide explains why log management is critical for large Kubernetes clusters, outlines common pain points, presents full‑stack architectures, details EFK and Loki implementations with code samples, and offers performance, security, cost‑optimization, and future‑trend recommendations.

EFK · Kubernetes · Observability

0 likes · 16 min read