Tagged articles

Observability

1054 articles · Page 4 of 11

Sep 12, 2025 · Operations

How to Build End‑to‑End Observability for Large‑Model Applications on Alibaba Cloud

This guide explains how to design and implement a complete observability solution for large‑model AI services on Alibaba Cloud, covering architecture, core metrics, logging standards, demo code, log collection, dashboard design, alerting, monitoring tools, troubleshooting SOPs, and recovery procedures.

AI Operations · Alibaba Cloud · Cloud Monitoring

0 likes · 21 min read

dbaplus Community

Sep 11, 2025 · Cloud Native

Building a Scalable Kubernetes Monitoring Architecture and Alert Management

This guide presents a comprehensive, layered Kubernetes monitoring architecture—including control plane, node, resource, and extension layers—detailing high‑availability Prometheus deployment, alert grouping strategies, custom CRD metrics, visualization dashboards, and practical best‑practice recommendations for reliable observability in cloud‑native environments.

Alerting · Kubernetes · Observability

0 likes · 11 min read

Ops Community

Sep 8, 2025 · Operations

Mastering Distributed Log Architecture: From Flume to ELK and Beyond

This comprehensive guide walks you through the challenges of large‑scale log collection, real‑time processing, storage optimization, and visualization, detailing practical configurations for Flume, Logstash, Elasticsearch, Kibana, Filebeat, Kafka, Kubernetes, and future AIOps integrations to build a reliable, cost‑effective distributed logging system.

ELK · Flume · Kafka

0 likes · 24 min read

Tech Freedom Circle

Sep 4, 2025 · Backend Development

How to Solve ES Latency in MySQL‑Canal Sync and Indexing Scenarios?

The article dissects the interview question about ES latency in a MySQL‑Canal‑to‑Elasticsearch pipeline, explains the root causes across four system layers, and presents a comprehensive four‑layer optimization, end‑to‑end observability, routing‑based degradation, and a Java‑based LatencyProbe component to measure and control delay.

Canal · Data synchronization · Elasticsearch

0 likes · 17 min read

Java One

Sep 3, 2025 · Operations

How to Install, Configure, and Run Prometheus: A Step‑by‑Step Guide

This guide walks you through installing Prometheus via binary download, configuring global scrape settings and job definitions, running the server with command‑line options, and using the web UI and PromQL to verify target health and query metrics, illustrated with screenshots and example queries.

Installation · Observability · PromQL

0 likes · 6 min read

Java One

Sep 1, 2025 · Cloud Native

How Prometheus Transforms Cloud‑Native Monitoring: Architecture, Data Model, and PromQL Basics

This article explains Prometheus' origins, open‑source development, CNCF graduation, core components, time‑series data model, text‑based metric protocol, powerful PromQL queries, service discovery mechanisms, and alerting practices, providing a comprehensive guide for cloud‑native observability.

Observability · PromQL · Prometheus

0 likes · 8 min read

Architect's Guide

Sep 1, 2025 · Operations

How Does Distributed Link Tracing Work? Inside SkyWalking’s Architecture

This article explains the concept of distributed link tracing, its principles, metrics, and implementation details—including monolithic and microservice approaches, OpenTracing standards, and how SkyWalking solves challenges like automatic span collection, context propagation, unique trace IDs, and sampling performance.

Distributed Tracing · Observability · OpenTracing

0 likes · 12 min read

Alibaba Cloud Native

Aug 31, 2025 · Cloud Native

How Ctrip Scaled AI Model Access with Higress: Architecture, Challenges, and Solutions

Ctrip’s R&D team built an AI gateway using Higress to unify access to diverse large‑model services, addressing authentication, traffic control, fault tolerance, monitoring, and integration with internal MCP platforms, while sharing practical lessons and future plans.

Higress · MCP integration · Observability

0 likes · 14 min read

php Courses

Aug 29, 2025 · Operations

How to Build a Real‑Time PHP Log Event Pipeline for Instant Insights

Learn how to transform PHP logs into real‑time, structured events by implementing a log event pipeline that includes JSON logging, lightweight collectors like Filebeat, streaming platforms such as Kafka or Flink, enrichment, and visualization with Grafana, enabling instant monitoring, alerting, and data‑driven decisions.

Flink · Grafana · Kafka

0 likes · 7 min read

Nightwalker Tech

Aug 28, 2025 · Operations

How to Diagnose and Fix E‑commerce Order Failures with Observability, APM, and Distributed Tracing

This article explains the hierarchical relationship between APM, distributed tracing, and observability, walks through a real Double‑11 e‑commerce incident, and demonstrates how a well‑designed observability stack can pinpoint the root cause, apply emergency fixes, and restore system performance within minutes.

APM · Distributed Tracing · Fault diagnosis

0 likes · 16 min read

Xiaohongshu Tech REDtech

Aug 27, 2025 · Databases

How RedHub Revolutionizes Database Access for Billion‑User Scale

RedHub is a next‑generation database proxy built by Xiaohongshu that unifies fragmented access methods, leverages PolarDB‑X for distributed SQL execution, and delivers high‑performance, highly available, and easily observable database connectivity, enabling seamless migration and advanced features for massive‑scale workloads.

Database Proxy · Distributed SQL · High Availability

0 likes · 15 min read

Su San Talks Tech

Aug 27, 2025 · Backend Development

Master Distributed Tracing with SkyWalking: Principles, Architecture & Practices

This article explains the fundamentals of distributed tracing in microservice architectures, details the OpenTracing standard, examines SkyWalking’s design, sampling strategies, context propagation, and plugin development, and shares practical implementation experiences and performance comparisons, helping engineers choose and integrate effective tracing solutions.

Distributed Tracing · Java · Observability

0 likes · 19 min read

Tencent Cloud Developer

Aug 26, 2025 · Artificial Intelligence

Building a Scalable, Observable Recommendation Scheduling Engine from Scratch

This article explains how recommendation systems work, distinguishes online services from offline computation, outlines a typical recommendation flow, and presents a three‑stage evolution (1.0, 2.0, 3.0) with design principles for stability, observability, and efficiency, culminating in a DAG‑based orchestration and traceable execution.

AI · Observability · System Design

0 likes · 13 min read

Wuming AI

Aug 26, 2025 · Artificial Intelligence

A Layered Overview of Agentic AI: From LLM Foundations to Multi‑Agent Systems

This article presents a hierarchical breakdown of Agentic AI, detailing the foundational large language models, the capabilities of AI agents, the coordination mechanisms of multi‑agent systems, and the supporting infrastructure needed for reliability, scalability, and security.

AI agents · LLM · Multi-Agent Systems

0 likes · 5 min read

Kuaishou Tech

Aug 20, 2025 · Frontend Development

How AI Is Transforming Frontend Development: Highlights from Kuaishou’s Tech Salon

The Kuaishou AI‑driven Frontend Technology Evolution salon gathered over 300 engineers and 46,000 online viewers to showcase how AI is reshaping large‑scale front‑end development across business, R&D, and infrastructure, with deep dives into AI‑native platforms, AIDevOps, intelligent agents, AI‑powered D2C, and observability.

AI · AIDevOps · Agent

0 likes · 11 min read

dbaplus Community

Aug 19, 2025 · Operations

Avoid These 10 System Architecture Sins That Sabotage Scaling

The article enumerates ten deadly system‑architecture mistakes—such as assuming natural scaling, treating microservices as monoliths, ignoring eventual consistency, over‑relying on a single database, lacking observability, over‑designing, mixing stateful logic, skipping chaos testing, underestimating third‑party risk, and ignoring human cost—providing concrete code examples, diagrams, and actionable lessons to prevent costly failures at scale.

Observability · Performance · microservices

0 likes · 10 min read

360 Zhihui Cloud Developer

Aug 8, 2025 · Operations

Quickly Deploy Prometheus Nginx Log Exporter for Deep Nginx Monitoring

This guide explains how to install and configure the prometheus-nginxlog-exporter in the Yunzhou Observability platform, covering its core features, metric types, one‑click deployment steps, chart visualization, alert rule setup, and common troubleshooting tips for comprehensive Nginx monitoring.

Exporter · Nginx · Observability

0 likes · 9 min read

Didi Tech

Aug 7, 2025 · Cloud Native

How HUATUO Revolutionizes Cloud‑Native Observability with Zero‑Impact BPF Tracing

HUATUO, Didi's open‑source cloud‑native observability project, leverages BPF‑based low‑overhead kernel tracing, unified metric and event frameworks, automatic flame‑graph generation, and seamless integration with Prometheus, Grafana and Elasticsearch to provide panoramic, zero‑intrusive monitoring and continuous performance profiling for complex production environments.

BPF · Observability · cloud-native

0 likes · 11 min read

Alibaba Cloud Big Data AI Platform

Aug 6, 2025 · Operations

How Alibaba Cloud’s Serverless Elasticsearch Powers Data‑Driven Operations

Alibaba Cloud’s Serverless Elasticsearch service, combined with the SREWorks data‑driven operations platform, offers a cloud‑native, real‑time search and analytics engine that integrates metric and log collection, cost management, and health monitoring to enhance scalability, performance, and operational efficiency for enterprise applications.

DataOps · Elasticsearch · Observability

0 likes · 11 min read

StarRocks

Aug 6, 2025 · Databases

How Qunar Migrated to StarRocks: Architecture, Performance Gains & Best Practices

This article details Qunar's transition to StarRocks as a unified OLAP engine, covering the business background, engine evaluation, architecture redesign, observability, high‑availability strategies, query‑performance optimizations, real‑world application cases, community contributions, and future plans.

Data Platform · High Availability · OLAP

0 likes · 21 min read

Alibaba Cloud Observability

Aug 4, 2025 · Cloud Native

How LoongCollector Redefines Observability for Cloud‑Native AI Workloads

LoongCollector, the core component of Alibaba Cloud's LoongSuite, delivers zero‑intrusion, multi‑tenant, high‑performance data collection and processing for AI services, enabling full‑stack observability across logs, metrics, traces, events and profiles in cloud‑native environments.

AI · Kubernetes · Observability

0 likes · 16 min read

Qunar Tech Salon

Jul 22, 2025 · Databases

Quark’s Data Platform Upgrade with StarRocks: Architecture, Performance, Roadmap

This article details how Quark’s data platform consolidated multiple analytics engines into a unified StarRocks‑based OLAP solution, covering business background, engine selection, architecture redesign, performance tuning, operational practices, and future plans for scalability and reliability.

Data Platform · Kubernetes · OLAP

0 likes · 19 min read

DevOps Operations Practice

Jul 22, 2025 · Operations

Top 7 DevOps Best Practices to Accelerate Delivery and Boost Reliability

These seven essential DevOps best practices—from cultural transformation and full automation to continuous integration, observability, security, cloud-native microservices, and performance optimization—guide teams in accelerating software delivery, enhancing quality, ensuring reliability, and reducing costs through collaborative, automated, and measurable processes.

Automation · CI/CD · Observability

0 likes · 4 min read

Alibaba Cloud Native

Jul 18, 2025 · Artificial Intelligence

How AI Agent Architecture Is Evolving to Redefine Software Engineering

The article outlines the rapid evolution of AI Agent technology stacks, detailing multi‑dimensional development across perception, decision, memory, and tool integration, while highlighting cloud‑native deployment models, observability challenges, and the open‑source LoongSuite suite that provides high‑performance, low‑cost monitoring for AI workloads.

AI Agent · LoongSuite · Observability

0 likes · 19 min read

Efficient Ops

Jul 15, 2025 · Operations

Top Open‑Source Log Management Tools Compared: Filebeat, Graylog, ELK, Loki, and More

This article reviews the most popular log‑management solutions, summarizing each tool's core features, pricing model, advantages, and drawbacks to help readers choose the right logging stack for their observability needs.

ELK · Grafana Loki · Observability

0 likes · 16 min read

Ops Development & AI Practice

Jul 12, 2025 · Cloud Native

Mastering Observability: A Deep Dive into OpenTelemetry’s Architecture

This article explains OpenTelemetry’s purpose, three‑layer architecture (instrumentation, collector, backend), practical Go instrumentation code, and how the collector processes and exports telemetry to both open‑source and SaaS backends, helping developers avoid vendor lock‑in and achieve unified observability.

Collector · Distributed Tracing · Instrumentation

0 likes · 9 min read

DeWu Technology

Jul 7, 2025 · Cloud Native

How to Achieve Service‑Level NAS Traffic Tracing with eBPF and Kubernetes

This article explains how to design and implement a service‑level NAS traffic tracing solution using Linux eBPF, NFS kernel hooks, and Kubernetes metadata to correlate container processes with NAS devices, generate real‑time metrics, and visualize them in Prometheus dashboards.

Kubernetes · NAS · NFS

0 likes · 18 min read

Java Architect Essentials

Jul 6, 2025 · Operations

How Logback, MDC, and ELK Can Rescue Your Nighttime Log Chaos

This article explains how chaotic, multi‑framework logging in Java microservices leads to debugging nightmares, and demonstrates a three‑step solution—standardizing on Logback, adding traceable MDC identifiers, and visualizing logs with ELK—to achieve unified log formats, sensitive data masking, and dramatically faster issue resolution.

ELK · Logback · Logging

0 likes · 10 min read

Alibaba Cloud Native

Jul 1, 2025 · Cloud Native

How Alibaba Cloud Function Compute Uses OpenTelemetry for Full‑Stack Tracing

The article explains how Alibaba Cloud Function Compute upgraded its tracing capabilities from Jaeger 2.0 to the OpenTelemetry W3C standard, delivering end‑to‑end observability, transparent cold‑start analysis, cross‑environment context propagation, dynamic sampling, and AI‑assisted debugging for serverless workloads.

Function Compute · Observability · OpenTelemetry

0 likes · 6 min read

macrozheng

Jul 1, 2025 · Operations

Best Log Management Tools Compared: Filebeat, Graylog, ELK, Loki, Datadog & More

This article provides a comprehensive comparison of popular log management solutions—including Filebeat, Graylog, the Elastic (ELK) stack, Grafana Loki, LogDNA, Datadog, Logstash, Fluentd, and Splunk—detailing their main features, pricing models, advantages, and drawbacks to help you choose the right tool for your needs.

ELK Stack · Observability · Operations

0 likes · 16 min read

Alibaba Cloud Native

Jun 28, 2025 · Cloud Native

Deploying vLLM with llmaz and Higress: A Step‑by‑Step Cloud‑Native Guide

This tutorial walks through deploying vLLM inference services on a GPU‑enabled Kubernetes cluster using llmaz, configuring Higress as an AI gateway for traffic control, observability, and fallback model switching, and demonstrates end‑to‑end request testing.

Fallback · Higress · Observability

0 likes · 15 min read

AI Algorithm Path

Jun 26, 2025 · Artificial Intelligence

The 10 Essential Components of a Retrieval‑Augmented Generation (RAG) System

This guide breaks down the ten core building blocks of a production‑ready RAG pipeline—from input handling and vector stores to prompt engineering, LLM inference, observability, and evaluation—showing why each piece matters, common pitfalls, and practical best‑practice recommendations.

LLM · Observability · RAG

0 likes · 9 min read

Alibaba Cloud Observability

Jun 24, 2025 · Operations

Avoid These 6 Log Management Anti‑Patterns to Keep Your Observability Reliable

This article examines common log‑management anti‑patterns—such as copy‑truncate rotation, NAS storage, multi‑process writes, file‑hole creation, frequent overwrites, and Vim edits—explains why they cause data loss or duplicate collection, and offers practical best‑practice recommendations for reliable log handling in cloud‑native environments.

Anti-patterns · Best Practices · Observability

0 likes · 8 min read

AI Large Model Application Practice

Jun 23, 2025 · Databases

How Google’s MCP Toolbox Simplifies Enterprise Database Access for LLM Agents

This guide explains Google’s open‑source MCP Toolbox for Databases, covering its core concepts, installation, configuration, two usage modes (native SDK and MCP), example LangGraph agent integration, security features, observability, and practical code snippets for building reliable LLM‑driven database tools.

Databases · LLM Agents · MCP Toolbox

0 likes · 11 min read

Tencent Technical Engineering

Jun 20, 2025 · Artificial Intelligence

Mastering AI Agents: Core Concepts, Protocols, and Golang Frameworks for Multi‑Agent Collaboration

This comprehensive article explores the evolution of AI agents, explains key protocols like MCP and A2A, compares reasoning frameworks such as CoT, ReAct, and Plan‑and‑Execute, and demonstrates how Golang frameworks Eino and tRPC‑A2A‑Go enable elegant development, orchestration, and observability of complex multi‑agent systems with practical code examples and visual diagrams.

A2A · AI Agent · Eino

0 likes · 55 min read

Alibaba Cloud Developer

Jun 17, 2025 · Artificial Intelligence

Why AI Agent Engineering Is the Missing Link to Scalable, Usable AI

This article dissects AI Agent engineering into product and technical dimensions, explaining how demand modeling, UI/UX design, prompt engineering, multi‑agent architecture, feedback loops, security, and observability together determine whether an AI assistant is usable, reliable, and ready for large‑scale deployment.

AI Agent · Observability · Product design

0 likes · 22 min read

Alibaba Cloud Native

Jun 12, 2025 · Artificial Intelligence

Why AI Agent Engineering Matters: From Product Design to Technical Architecture

This article breaks down AI agent engineering into product and technical engineering, explains how demand modeling, UI/UX design, prompt engineering, multi‑agent coordination, and observability combine to make AI agents usable, scalable, and trustworthy, and shows concrete examples and implementation patterns.

AI · Agent Engineering · Observability

0 likes · 23 min read

vivo Internet Technology

Jun 11, 2025 · Big Data

How Vivo Built a Scalable Pulsar Monitoring System for Trillion‑Message Workloads

This article details Vivo's end‑to‑end Pulsar observability solution, covering the challenges of Prometheus‑based monitoring, the architecture of the alerting pipeline, adaptor development, metric optimizations for subscription backlog and bundle load, and fixes for KoP lag reporting issues.

Big Data · Observability · Prometheus

0 likes · 12 min read

Liangxu Linux

Jun 10, 2025 · Cloud Native

Why Loki Is the Ideal Cloud‑Native Log Aggregator for Prometheus & Grafana

Loki, an open‑source log aggregation system from Grafana Labs, integrates tightly with Prometheus and Grafana, stores logs efficiently using object storage, offers a simple label‑based model, and provides cost‑effective, high‑performance logging for cloud‑native environments while outlining its architecture, usage, configuration, advantages, limitations, and retention policies.

Grafana · Observability · Prometheus

0 likes · 10 min read

Big Data Technology Tribe

Jun 10, 2025 · Cloud Native

Mastering eBPF Maps: Design, Implementation, and Real‑World Use Cases

This article provides an in‑depth analysis of BPF maps—explaining their design principles, core features, various map types with code examples, and the macro expansion process that turns high‑level BCC helpers into native kernel map definitions for cloud‑native observability.

BCC · BPF maps · Linux kernel

0 likes · 12 min read

JakartaEE China Community

Jun 9, 2025 · Cloud Native

How to Choose the Right Cloud‑Native Microservice Framework (MicroProfile vs Spring)

This article explains why cloud‑native microservices are beneficial, defines their key characteristics, compares the MicroProfile and Spring frameworks, and provides detailed code examples for REST APIs, configuration, fault tolerance, security, health checks, metrics, and distributed tracing to help developers select the most suitable technology stack.

Kubernetes · MicroProfile · Observability

0 likes · 26 min read

Alibaba Cloud Developer

Jun 6, 2025 · Big Data

Why Observability 2.0 and SLS Data Pipelines Are Revolutionizing Log Analytics

This article explains how Observability 2.0 reshapes log, metric and trace management by unifying health views, introduces the evolution of Alibaba Cloud's SLS data pipeline, compares its three service modes, and demonstrates performance, cost and integration benefits for large‑scale, real‑time log processing.

Big Data · Observability · SLS

0 likes · 11 min read

JavaEdge

Jun 5, 2025 · Artificial Intelligence

How Amazon’s Strands Agents SDK Simplifies Building AI Agents

Amazon’s newly open‑source Strands Agents SDK lets developers create AI agents with minimal code by defining prompts, tools, and models, offering a lightweight, production‑ready framework that supports multiple model providers, observability, multi‑agent collaboration, and extensible tooling via dedicated packages.

AI agents · Amazon · LLM

0 likes · 7 min read

Linux Ops Smart Journey

May 29, 2025 · Cloud Native

Master Kubernetes Monitoring with kube-state-metrics and Prometheus

This guide walks you through deploying kube-state-metrics, configuring Prometheus scrape jobs, verifying metric collection, and adding Grafana dashboards to achieve a visible, manageable, and reliable Kubernetes monitoring solution for large‑scale clusters.

Kubernetes · Observability · Prometheus

0 likes · 7 min read

Java Architecture Diary

May 26, 2025 · Artificial Intelligence

How to Build Enterprise‑Ready AI Monitoring with Spring AI and Micrometer

This article explains why observability is essential for Spring AI applications, outlines common cost‑control and performance challenges, and provides a step‑by‑step guide—including Maven setup, client configuration, service implementation, metric exposure, Zipkin tracing, and architecture insights—to create a fully observable, enterprise‑grade AI translation service.

Observability · Spring AI · Tracing

0 likes · 12 min read

Programmer DD

May 21, 2025 · Artificial Intelligence

What’s New in Spring AI 1.0 GA? A Deep Dive into Java AI Features

Spring AI 1.0 GA introduces a comprehensive suite of AI capabilities for Java developers, including a ChatClient supporting 20 models, vector‑store integrations, RAG pipelines, advanced chat memory, @Tool function calling, model evaluation, observability, Model Context Protocol, and autonomous agents, with examples for major cloud providers.

AI models · Java · MCP

0 likes · 6 min read

dbaplus Community

May 20, 2025 · Operations

How to Build a Production‑Ready, High‑Availability Kubernetes Cluster from Scratch

This guide walks through designing, deploying, securing, monitoring, backing up, and maintaining a production‑grade Kubernetes cluster, sharing real‑world pitfalls, configuration snippets, and best‑practice recommendations for high availability, security, observability, and upgrade strategies.

Kubernetes · Observability · Production

0 likes · 11 min read

Alibaba Cloud Native

May 20, 2025 · Cloud Native

How Observability 2.0 Redefines Cloud‑Native Log Pipelines and Cuts Costs by 66%

Observability 2.0 unifies logs, metrics and traces into a single platform, introduces event‑centric Wide Events, and drives a complete redesign of Alibaba Cloud's SLS data pipeline that delivers higher performance, lower latency, richer low‑code SPL processing, and up to a 66.7% reduction in processing costs.

Cost Optimization · Observability · Performance

0 likes · 12 min read

Alibaba Cloud Observability

May 19, 2025 · Information Security

How Tool‑Poisoning Attacks Exploit MCP and What to Do About It

This article analyzes the security risks of the Model Context Protocol (MCP), demonstrates a tool‑poisoning attack that steals private keys via malicious tool descriptions, explores client‑side and server‑side threat vectors, and presents observability‑based mitigation using eBPF and LoongCollector.

AI model security · MCP · Observability

0 likes · 23 min read

Alibaba Cloud Observability

May 19, 2025 · Cloud Native

How LoongCollector Transforms Log Collection with High‑Performance Pipelines

LoongCollector, the 2025 evolution of iLogtail, introduces a fully redesigned pipeline architecture, hot‑reload isolation, significant CPU and memory reductions, and advanced monitoring, delivering up to 80% higher log‑collection throughput for cloud‑native environments while ensuring seamless upgrades and multi‑region fault tolerance.

Observability · log collection · pipeline

0 likes · 14 min read

Alibaba Cloud Developer

May 16, 2025 · Artificial Intelligence

Designing Robust MCP Servers for Alibaba Cloud Observability 2.0 – Lessons & Best Practices

This article explains the Model Context Protocol (MCP), its components, and how to integrate MCP servers with Alibaba Cloud Observability 2.0, offering practical design experiences, tool simplification tips, default parameter strategies, output size control, and future AI‑driven observability insights.

LLM · MCP · Observability

0 likes · 17 min read

dbaplus Community

May 11, 2025 · Operations

Mastering SRE’s Four Golden Signals with Prometheus: A Practical Guide

This guide explains the four SRE golden signals—Latency, Traffic, Errors, and Saturation—covers their definitions, how to measure them with Prometheus in Node.js, compares them to RED and USE frameworks, and provides concrete alerting rules for reliable service monitoring.

Golden Signals · Observability · Prometheus

0 likes · 12 min read

Bilibili Tech

May 9, 2025 · Artificial Intelligence

How an AI Gateway Scales LLM Services: Architecture, Auth, Quotas, and Load Balancing

This article explains the design of an AI gateway that centralizes LLM access, detailing its background, overall architecture, authentication, quota management, multi‑model routing, load‑balancing strategies, multi‑tenant isolation, observability features, and the supported API protocols for enterprise integration.

AI gateway · API Gateway · LLM

0 likes · 17 min read

StarRocks

May 8, 2025 · Backend Development

How Grab Supercharged Spark Observability 10× with StarRocks – Inside the Iris Architecture

Grab replaced its fragmented Grafana‑Superset stack with a StarRocks‑backed Iris platform, achieving over ten‑fold query speedups, 40% lower resource usage, and a unified real‑time and historical data store for Spark observability across its Southeast Asian super‑app ecosystem.

Data Platform · Kafka · Materialized Views

0 likes · 16 min read

Liangxu Linux

May 7, 2025 · Operations

How to Install and Configure Loki, Promtail, and Grafana for Log Aggregation on Rocky Linux

This step‑by‑step guide shows how to install Loki, configure its YAML file, set up Promtail to ship logs, install Grafana, add Loki as a data source, and use LogQL to query logs—including collecting Nginx JSON logs—on a Rocky Linux system.

Grafana · LogQL · Observability

0 likes · 10 min read

Efficient Ops

May 7, 2025 · Operations

Why Choose SigNoz for Open‑Source Observability? A Deep Dive

This article introduces SigNoz, a self‑hosted open‑source observability platform that unifies metrics, logs, and traces, outlines its core capabilities, shows how to install it with Docker, and compares its resource efficiency to commercial solutions like DataDog and Elastic.

Observability · OpenTelemetry · Operations

0 likes · 4 min read

macrozheng

May 7, 2025 · Backend Development

What’s New in Spring Boot 3.5? 13 Must‑Know Features for Java Backend Developers

Spring Boot 3.5 introduces a suite of enhancements—including task decorator support, the Vibur connection pool, SSL health metrics, flexible configuration loading, automatic Trace‑ID headers, richer Actuator capabilities, functional programming hooks, and many more—each explained with code examples and practical usage tips for modern Java backend development.

Backend Development · Observability · Spring Boot

0 likes · 10 min read

Java Architecture Diary

May 6, 2025 · Backend Development

Spring Boot 3.5 Release: Top 13 New Features You Must Know

Spring Boot 3.5 introduces a suite of powerful enhancements—including task decorator support, a new Vibur connection pool, SSL monitoring, flexible environment variable loading, Actuator-triggered Quartz jobs, automatic Trace ID headers, structured log customization, functional routing insights, expanded SSL client support, OpenTelemetry upgrades, Spring Batch tweaks, OAuth 2.0 JWT profiling, and functional bean registration—providing developers with richer capabilities for modern Java backend applications.

Backend Development · Observability · Spring Boot

0 likes · 11 min read

Linux Kernel Journey

May 5, 2025 · Operations

Reflections on the 3rd eBPF Developer Conference: Harnessing eBPF for AI

The article recaps the 3rd eBPF Developer Conference in Xi'an, highlighting talks on BPF‑on‑MPTCP, system‑wide PGO, bperf, autonomous‑driving use cases, and AI‑driven observability, while sharing the author's insights on continuous profiling, SysOM, and future challenges of scaling eBPF with large models.

AI · Linux · Observability

0 likes · 10 min read

Raymond Ops

Apr 30, 2025 · Cloud Native

Master Loki Logging: Step-by-Step Kubernetes Deployment & Troubleshooting Guide

This comprehensive guide explains Loki's lightweight log aggregation architecture, compares it with ELK, details AllInOne, Helm, Kubernetes, and bare‑metal deployment methods, shows Promtail and Logstash integration, and provides practical troubleshooting tips for common issues.

Logging · Observability · Troubleshooting

0 likes · 23 min read

Efficient Ops

Apr 29, 2025 · Operations

Master Linux Performance: Essential Monitoring Tools & Commands

This guide compiles the most important Linux performance analysis utilities—such as vmstat, iostat, dstat, iotop, pidstat, top, htop, mpstat, netstat, ps, strace, uptime, lsof, and perf—explaining their usage, output fields, and how they fit into a comprehensive system observability workflow.

Command-line Tools · Linux · Observability

0 likes · 15 min read

Efficient Ops

Apr 25, 2025 · Operations

How Changan Auto Earned Top‑Tier DevOps Certification with a Full‑Link Observability Platform

Changan Automobile’s full‑link observability platform passed both ITU DevOps international and domestic standards assessments, showcasing its advanced monitoring capabilities, improved system stability, and strategic role in the company’s digital transformation, while the interview reveals implementation challenges, benefits, and future AI‑driven enhancements.

Full‑Link Monitoring · Observability · Operations

0 likes · 21 min read

Alibaba Cloud Native

Apr 23, 2025 · Cloud Native

Diagnosing Slow Deployments in Alibaba Cloud SAE: A Visualized, Step‑by‑Step Guide

This article analyzes the common pain points of Alibaba Cloud Serverless App Engine (SAE) deployments—slow release times, opaque status details, and black‑box instance startup—then presents a visualized, observable, and explainable solution that pinpoints bottlenecks, offers concrete optimizations, and demonstrates the resulting performance improvements.

Alibaba Cloud · Deployment Optimization · Observability

0 likes · 11 min read

Baidu Geek Talk

Apr 23, 2025 · Operations

Baidu SRE Digital Immunity System: Construction, Evolution, and Practice

Baidu’s SRE digital‑immune system, evolved into an AI‑powered intelligent immunity platform, quantifies and mitigates risk across thousands of services by integrating data‑driven monitoring, rule‑based detection, and large‑model GraphRAG knowledge mining, cutting degradation cases by ~40% and shifting operations from reactive troubleshooting to proactive, data‑centric quality assurance.

AI · Digital Immunity · Observability

0 likes · 14 min read

Linux Kernel Journey

Apr 23, 2025 · Industry Insights

Highlights from the 3rd eBPF Developer Conference: A Technical Recap

The 3rd eBPF Developer Conference, held on April 19, 2025 at Xi'an University of Posts and Telecommunications, featured 36 expert talks on eBPF advancements, network and security innovations, observability, and performance optimization, alongside a project marketplace and student projects. Video recordings and PPT slides are available to the community.

Linux kernel · Observability · Performance

0 likes · 7 min read

dbaplus Community

Apr 22, 2025 · Backend Development

Explore Elasticsearch 9.0: Performance Boosts, AI Features & Security Upgrades

Elasticsearch 9.0, released on April 15, 2025, builds on Lucene 10.1.0 to deliver major performance gains, introduces Better Binary Quantization, Elastic Distributions of OpenTelemetry, LLM observability, AI‑driven attack discovery, enhanced ES|QL, and is available via Elastic Cloud with deployment tips and examples.

AI · Cloud · Elasticsearch

0 likes · 7 min read

Zhuanzhuan Tech

Apr 16, 2025 · Backend Development

Analyzing Log4j2 Asynchronous Logging Blocking and Strategies for Fine-Grained Log Control

This article examines the causes of Log4j2 asynchronous logging blockage in high‑throughput Java services, explains the underlying Disruptor mechanics, and proposes a dual‑track logging architecture with compile‑time bytecode enhancement and IDE plugins for line‑level log activation.

Asynchronous Logging · Java · Logging Strategy

0 likes · 15 min read

21CTO

Apr 9, 2025 · Operations

9 Must‑Have Container Monitoring Tools and Best Practices for Modern Cloud‑Native Environments

This article reviews nine practical container‑monitoring solutions—from Last9 and Prometheus to Dynatrace and Elastic Observability—detailing their key features, pricing, and why developers prefer them, and then offers comprehensive best‑practice guidance for metrics, tagging, alerts, and advanced observability strategies in Kubernetes‑driven cloud‑native deployments.

Alerting · Kubernetes · Observability

0 likes · 25 min read

Liangxu Linux

Apr 6, 2025 · Operations

How to Define SLIs, SLOs, and SLAs for Effective SRE Practices

This guide explains how SRE teams should collaborate early in the software development lifecycle to define Service Level Indicators (SLIs), set realistic Service Level Objectives (SLOs) and Service Level Agreements (SLAs), and integrate observability signals, error budgeting, risk management, and incident handling into reliable operations.

Error Budget · Incident Management · Observability

0 likes · 13 min read

ByteDance Cloud Native

Apr 3, 2025 · Operations

How to Seamlessly Integrate CloudWeGo with APMPlus for Full‑Stack Observability

This article explains the challenges of observability in distributed microservice and LLM architectures, introduces CloudWeGo and APMPlus, and provides step‑by‑step integration guides for Kitex, Hertz, and Eino frameworks, including code samples, data reporting methods, and advanced monitoring features such as RED metrics, LLM‑specific indicators, service topology, and future roadmap.

APM · APMPlus · CloudWeGo

0 likes · 13 min read

Volcano Engine Developer Services

Apr 1, 2025 · Artificial Intelligence

Taming High Cardinality in AI Model & Autonomous Driving Monitoring with Prometheus

This article explores how high cardinality in Prometheus metrics impacts AI large‑model and autonomous‑driving observability, explains the underlying concepts, outlines the performance and cost challenges, and presents practical design, collection, and query‑side solutions—including metric modeling, pre‑aggregation, and remote‑read pushdown—to keep monitoring efficient and scalable.

AI monitoring · Cardinality · Observability

0 likes · 12 min read

ByteDance Cloud Native

Mar 27, 2025 · Operations

Taming High Cardinality in AI & Autonomous Driving with Prometheus

This article shares practical experience from Volcengine's managed Prometheus service and its deep integration with large‑model and autonomous‑driving platforms, explaining what high cardinality is, its impact on monitoring systems, root causes, and a range of design, collection, and analysis techniques to mitigate it.

AI · Observability · Prometheus

0 likes · 12 min read

Airbnb Technology Team

Mar 24, 2025 · Artificial Intelligence

Chronon: Open‑Source Feature Platform for Machine Learning – Architecture, Workflow, and Code Examples

Chronon is an open‑source ML feature platform that lets engineers declaratively define, compute, and serve both batch and real‑time features with built‑in observability, data‑quality checks, and a low‑latency retrieval API, ensuring online‑offline consistency while simplifying pipeline management and enabling future automation.

Chronon · Observability · Streaming

0 likes · 13 min read

Alibaba Cloud Observability

Mar 24, 2025 · Artificial Intelligence

Achieving Full Observability for AI Inference Apps with Prometheus

This article explores the observability challenges of AI inference services, outlines a comprehensive Prometheus‑based metric collection strategy, and demonstrates practical monitoring implementations for Ray Serve, vLLM, GPU resources, and custom metrics to build stable, high‑performance inference pipelines.

AI inference · Observability · Prometheus

0 likes · 19 min read

Alibaba Cloud Observability

Mar 24, 2025 · Information Security

DeepSeek ClickHouse Leak: AI Data Risks & Cloud Native Log Service Safeguards

An exposed ClickHouse database at DeepSeek revealed over a million sensitive logs—including chats, API keys, and backend details—highlighting AI data security gaps, while Alibaba Cloud’s Log Service (SLS) offers comprehensive protection through access control, data masking, fine-grained query limits, and real‑time monitoring.

AI · Log Service · Observability

0 likes · 11 min read

Rare Earth Juejin Tech Community

Mar 23, 2025 · Frontend Development

Designing Effective Front-End Error Monitoring and Reporting Strategies

This article explains the core value of front‑end error monitoring, outlines key error categories, presents practical code examples for capturing explicit, implicit, resource, promise and framework errors, and proposes a multi‑layer defense strategy to improve observability, response time and team collaboration.

Observability · error-monitoring · web

0 likes · 12 min read

360 Zhihui Cloud Developer

Mar 20, 2025 · Operations

Unlocking Application Reliability: Core APM Modules and Yunzhou’s OpenTelemetry Design

This article explains Application Performance Monitoring (APM), its key benefits such as business continuity, performance optimization, and cost reduction, outlines essential APM modules, and details Yunzhou Observation’s OpenTelemetry‑based design, data ingestion, processing, visualization, and future roadmap for observability.

APM · Observability · OpenTelemetry

0 likes · 10 min read

Tencent Cloud Developer

Mar 19, 2025 · Cloud Native

Kubernetes Monitoring: Why It’s Needed, Core Components, and Metric Exposure

Monitoring Kubernetes is essential to detect resource contention, component failures, and network issues; it involves tracking core component metrics such as API server latency, etcd write times, scheduler delays, as well as node‑level CPU, memory, disk, and network statistics, pod health, and custom application metrics exposed via Prometheus exporters for comprehensive observability.

Exporters · Kubernetes · Observability

0 likes · 23 min read

Architect

Mar 18, 2025 · Artificial Intelligence

2025 AI Agent Technology Stack: Layers, Core Functions, and Future Directions

The article outlines the 2025 AI Agent technology stack, detailing its five layered architecture—model serving, storage & memory, tooling, framework orchestration, and deployment—while discussing current trends, challenges, and future directions such as tool ecosystem expansion, self‑evolution, and edge‑cloud hybrid deployments.

AI Agent · Deployment · Observability

0 likes · 12 min read

Cloud Native Technology Community

Mar 18, 2025 · Cloud Native

Best Practices for Managing Core Services in Large‑Scale Kubernetes Deployments

Scaling Kubernetes across dozens or hundreds of clusters requires standardized core services—networking, security, observability, and automation—so organizations should adopt templated configurations, GitOps tools, centralized monitoring, and automated certificate management to reduce complexity, improve security, and lower operational overhead.

Automation · GitOps · Kubernetes

0 likes · 8 min read

Lobster Programming

Mar 17, 2025 · Operations

How to Build a Lightweight Loki Logging Stack with Promtail and Grafana

This guide walks you through setting up a SpringBoot application, configuring Logback, installing and configuring Loki, Promtail, and Grafana, and comparing the lightweight Loki stack with the traditional ELK solution for efficient log collection and visualization.

ELK · Grafana · Logging

0 likes · 14 min read

AI Algorithm Path

Mar 15, 2025 · Artificial Intelligence

Why the Industry Is Shifting From AI Agents to Agentic Workflows

The article explains that low accuracy and security risks of current AI agents—evidenced by a Claude AI Agent achieving only 14% of human performance and an average success rate of about 20%—are driving a move toward agentic workflows, which offer observable, auditable, and data‑synthesizing pipelines that dramatically improve enterprise productivity.

AI agents · Automation · Data Synthesis

0 likes · 7 min read

Alibaba Cloud Observability

Mar 13, 2025 · Databases

How MetricStore 2.0 Redefines Cloud‑Native Time‑Series Storage Performance

MetricStore 2.0 introduces a comprehensive overhaul of memory, file, compute, and transport layers for cloud‑native time‑series data, delivering higher compression, lower latency, multi‑tenant resource control, and support for dynamic schemas, while addressing the scalability limits of its 1.0 predecessor.

Observability · cloud-native · time series

0 likes · 21 min read

MaGe Linux Operations

Mar 6, 2025 · Operations

How Large Language Models Are Revolutionizing SRE from Firefighting to Proactive Ops

This article explores how open‑source large language models like DeepSeek empower SRE teams to shift from reactive firefighting to proactive, predictive operations, detailing technical principles, real‑world case studies, essential skill sets, and future trends that reshape the operations landscape.

AI Ops · Automation · Large Language Models

0 likes · 8 min read

360 Zhihui Cloud Developer

Feb 27, 2025 · Operations

How 360’s Unified Alert Service Boosts System Reliability and Cuts MTTR

This article explains the importance, pain points, architecture, core capabilities, and future roadmap of the 360 Zhihui Cloud "Yunzhou" unified alert service, showing how it improves observability, reduces alert noise, and accelerates incident response for modern cloud‑native systems.

Alerting · Observability · Operations

0 likes · 14 min read

Alibaba Cloud Native

Feb 25, 2025 · Cloud Native

Turning APIServer Logs into Time‑Series Metrics for Fast Root‑Cause Detection

This article explains how to enrich Kubernetes APIServer observability by converting access logs into time‑series metrics, applying SPL‑based aggregation, anomaly detection, and root‑cause drill‑down, and supplementing with OpenTelemetry tracing to quickly pinpoint failures during large‑scale outages.

AIOps · Observability · Prometheus

0 likes · 11 min read

Linux Ops Smart Journey

Feb 18, 2025 · Cloud Native

Deploy Filebeat with Helm on Kubernetes: Automated Log Collection to Kafka

This step‑by‑step guide shows how to use a Helm chart to deploy Filebeat in a Kubernetes cluster, automatically collect container logs, and forward them to a Kafka cluster for reliable, scalable observability.

Kafka · Kubernetes · Observability

0 likes · 7 min read

Sanyou's Java Diary

Feb 17, 2025 · Operations

How Visualized Full‑Link Log Tracing Boosts Business Debugging Efficiency

This article introduces a visualized full‑link log tracing solution that organizes and dynamically links business logs by leveraging DSL definitions, distributed parameter propagation, and a tree‑structured storage model, enabling fast, end‑to‑end issue localization in complex microservice systems such as the Dazhong Dianping content platform.

Big Data · Observability · log tracing

0 likes · 25 min read

Alibaba Cloud Observability

Feb 17, 2025 · Operations

What’s Driving Observability in 2025? AIOps, OpenTelemetry, and eBPF Trends

The article outlines 2025 observability trends, covering the rise of AIOps platforms, AI‑driven prediction, OpenTelemetry becoming the de‑facto standard, unified telemetry platforms, the shift of observability left and right, eBPF’s role in platform engineering, and cost‑effective strategies for modern cloud‑native environments.

AIOps · Observability · OpenTelemetry

0 likes · 10 min read

Infra Learning Club

Feb 16, 2025 · Operations

GPUprobe: Using eBPF to Monitor CUDA Memory Leaks

The article introduces GPUprobe, an eBPF‑based tool that provides lightweight, continuous, application‑level monitoring of CUDA memory allocation, leaks, and kernel launches, compares it with NSight Systems and DCGM, and demonstrates near‑zero overhead integration with Prometheus and Grafana through detailed code examples and real‑world output analysis.

GPU monitoring · Grafana · Memory Leak Detection

0 likes · 13 min read

Efficient Ops

Feb 12, 2025 · R&D Management

How NIO Built a Unified Work Platform for Automotive Digital Cockpits

The article summarizes NIO R&D architect Min Jie’s presentation at the 2024 GOPS Global Operations Conference, detailing the development of an integrated work platform for automotive digital cockpits, the conference’s focus on DevOps, AIOps, cloud‑native and security, and the broader vision for measurable, observable engineering practices.

Digital Cockpit · Observability · Platform Engineering

0 likes · 3 min read

ITPUB

Feb 11, 2025 · Operations

Why Your Monitoring Fails and How to Build Effective Observability Data

Many companies deploy fragmented monitoring and observability tools yet still struggle to pinpoint incidents; this article analyzes the root causes—under‑utilized tools and scenario‑agnostic data—and offers practical steps to organize metrics, build layered insights, and improve fault‑resolution efficiency.

Data Engineering · Observability · SRE

0 likes · 12 min read

Alibaba Cloud Observability

Feb 11, 2025 · Information Security

DeepSeek Attack Reveals AI Security Risks and Cloud‑Native Observability Best Practices

The article examines DeepSeek's rapid rise and the large‑scale malicious attacks it faced, highlighting AI security vulnerabilities, and then provides a detailed, cloud‑native guide on building a comprehensive, observable security architecture on Alibaba Cloud using DDoS protection, WAF, logging, and anomaly detection.

AI security · Alibaba Cloud · DDoS protection

0 likes · 13 min read

DeWu Technology

Feb 10, 2025 · Operations

White‑Screen Operations Platform for Multi‑Cloud Kubernetes Middleware Management

The White‑Screen Operations Platform unifies multi‑cloud Kubernetes cluster and middleware management—automating Kafka, Elasticsearch, node, PV, and YAML tasks through a visual UI, eliminating fragmented command‑line scripts, cutting operation times from hours to minutes, standardizing processes, providing auditability, and delivering significant cost savings while scaling for future Kubernetes resources.

Automation · Kubernetes · Middleware

0 likes · 20 min read

Alibaba Cloud Native

Feb 7, 2025 · Information Security

How DeepSeek’s Attack Highlights the Need for Robust Cloud‑Native Security Observability

The article examines DeepSeek’s rapid rise, the large‑scale malicious attacks it suffered, and then provides a detailed, cloud‑native security observability guide using Alibaba Cloud services such as DDoS protection, WAF, CLB, SAS, and SLS for logging, monitoring, anomaly detection, and alert response.

AI security · Alibaba Cloud · DDoS protection

0 likes · 15 min read

DataFunSummit

Jan 23, 2025 · Artificial Intelligence

Improving Observability in Multi‑Agent Systems: Analysis and Extension of OpenAI Swarm

This article examines the research‑oriented topic of observability in multi‑agent systems, reviews existing open‑source MAS frameworks such as Swarm, MetaGPT, AutoGen, and AutoGPT, identifies their observability challenges, and proposes extensions and visualization techniques to enhance debugging, testing, and control of OpenAI Swarm‑based applications.

AI · Multi-Agent Systems · Observability

0 likes · 26 min read

IT Architects Alliance

Jan 22, 2025 · Cloud Native

Understanding Service Mesh: Concepts, Capabilities, Tools, and Challenges in the Cloud‑Native Era

The article explains what a service mesh is, its core components, key capabilities such as traffic management, security, observability, and resilience, reviews major tools like Istio, Linkerd and Consul Connect, and discusses the operational challenges and future directions within cloud‑native environments.

Observability · Performance · Service Mesh

0 likes · 17 min read

DeWu Technology

Jan 20, 2025 · Backend Development

Migrating Observability Compute Layer from Java to Rust: Ownership, Concurrency, Deployment, and Monitoring

The article details how moving a high‑throughput observability compute layer from Java to Rust, leveraging Rust's ownership model, zero‑cost async, and static binary deployment, cut memory usage by roughly 68% and CPU consumption by 40%. It also outlines the monitoring setup, the concurrency model, and the challenges of Rust's steep learning curve.

Deployment · Observability · Rust

0 likes · 18 min read

Go Development Architecture Practice

Jan 17, 2025 · Backend Development

Mastering Go Backend: Project Structure, Error Handling, and Observability Best Practices

This article explores practical Go backend development techniques, covering project organization, package naming, internal packages, init usage, layer separation (controller, service, dao), dependency injection, global variable pitfalls, observability with logging, tracing and monitoring, comprehensive error handling, and DAO layer automation.

Observability · backend · dependency-injection

0 likes · 23 min read

dbaplus Community

Jan 15, 2025 · Cloud Native

What’s New in Prometheus 3.0? UI Overhaul, Remote Write 2.0, UTF‑8 & OTLP Support

Prometheus 3.0, the first major release in seven years, introduces a revamped UI, Remote Write 2.0 with native metadata and histogram support, full UTF‑8 metric and label names, OTLP ingestion, performance gains over 2.x, and a roadmap of upcoming cloud‑native enhancements.

OTLP · Observability · Prometheus

0 likes · 9 min read
