Tagged articles
969 articles
Sanyou's Java Diary
Feb 17, 2025 · Operations

How Visualized Full‑Link Log Tracing Boosts Business Debugging Efficiency

This article introduces a visualized full‑link log tracing solution that organizes and dynamically links business logs by leveraging DSL definitions, distributed parameter propagation, and a tree‑structured storage model, enabling fast, end‑to‑end issue localization in complex microservice systems such as the Dazhong Dianping content platform.

Big Data · Microservices · Observability
0 likes · 25 min read
Alibaba Cloud Observability
Feb 17, 2025 · Operations

What’s Driving Observability in 2025? AIOps, OpenTelemetry, and eBPF Trends

The article outlines 2025 observability trends, covering the rise of AIOps platforms, AI‑driven prediction, OpenTelemetry becoming the de‑facto standard, unified telemetry platforms, the shift of observability left and right, eBPF’s role in platform engineering, and cost‑effective strategies for modern cloud‑native environments.

Observability · OpenTelemetry · aiops
0 likes · 10 min read
Infra Learning Club
Feb 16, 2025 · Operations

GPUprobe: Using eBPF to Monitor CUDA Memory Leaks

The article introduces GPUprobe, an eBPF‑based tool that provides lightweight, continuous, application‑level monitoring of CUDA memory allocation, leaks, and kernel launches, compares it with NSight Systems and DCGM, and demonstrates near‑zero overhead integration with Prometheus and Grafana through detailed code examples and real‑world output analysis.

GPU monitoring · Grafana · Observability
0 likes · 13 min read
Efficient Ops
Feb 12, 2025 · R&D Management

How NIO Built a Unified Work Platform for Automotive Digital Cockpits

The article summarizes NIO R&D architect Min Jie’s presentation at the 2024 GOPS Global Operations Conference, detailing the development of an integrated work platform for automotive digital cockpits, the conference’s focus on DevOps, AIOps, cloud‑native and security, and the broader vision for measurable, observable engineering practices.

DevOps · Digital Cockpit · Observability
0 likes · 3 min read
ITPUB
Feb 11, 2025 · Operations

Why Your Monitoring Fails and How to Build Effective Observability Data

Many companies deploy fragmented monitoring and observability tools yet still struggle to pinpoint incidents; this article analyzes the root causes—under‑utilized tools and scenario‑agnostic data—and offers practical steps to organize metrics, build layered insights, and improve fault‑resolution efficiency.

Observability · SRE · data engineering
0 likes · 12 min read
Alibaba Cloud Observability
Feb 11, 2025 · Information Security

DeepSeek Attack Reveals AI Security Risks and Cloud‑Native Observability Best Practices

The article examines DeepSeek's rapid rise and the large‑scale malicious attacks it faced, highlighting AI security vulnerabilities, and then provides a detailed, cloud‑native guide on building a comprehensive, observable security architecture on Alibaba Cloud using DDoS protection, WAF, logging, and anomaly detection.

AI security · Alibaba Cloud · DDoS protection
0 likes · 13 min read
DeWu Technology
Feb 10, 2025 · Operations

White‑Screen Operations Platform for Multi‑Cloud Kubernetes Middleware Management

The White‑Screen Operations Platform unifies multi‑cloud Kubernetes cluster and middleware management—automating Kafka, Elasticsearch, node, PV, and YAML tasks through a visual UI, eliminating fragmented command‑line scripts, cutting operation times from hours to minutes, standardizing processes, providing auditability, and delivering significant cost savings while scaling for future Kubernetes resources.

Kubernetes · Observability · Operator
0 likes · 20 min read
Alibaba Cloud Native
Feb 7, 2025 · Information Security

How DeepSeek’s Attack Highlights the Need for Robust Cloud‑Native Security Observability

The article examines DeepSeek’s rapid rise, the large‑scale malicious attacks it suffered, and then provides a detailed, cloud‑native security observability guide using Alibaba Cloud services such as DDoS protection, WAF, CLB, SAS, and SLS for logging, monitoring, anomaly detection, and alert response.

AI security · Alibaba Cloud · Cloud Native
0 likes · 15 min read
DataFunSummit
Jan 23, 2025 · Artificial Intelligence

Improving Observability in Multi‑Agent Systems: Analysis and Extension of OpenAI Swarm

This article examines the research‑oriented topic of observability in multi‑agent systems, reviews existing open‑source MAS frameworks such as Swarm, MetaGPT, AutoGen, and AutoGPT, identifies their observability challenges, and proposes extensions and visualization techniques to enhance debugging, testing, and control of OpenAI Swarm‑based applications.

AI · Agent Frameworks · Multi-Agent Systems
0 likes · 26 min read
IT Architects Alliance
Jan 22, 2025 · Cloud Native

Understanding Service Mesh: Concepts, Capabilities, Tools, and Challenges in the Cloud‑Native Era

The article explains what a service mesh is, its core components, key capabilities such as traffic management, security, observability, and resilience, reviews major tools like Istio, Linkerd and Consul Connect, and discusses the operational challenges and future directions within cloud‑native environments.

Observability · Performance · Service Mesh
0 likes · 17 min read
DeWu Technology
Jan 20, 2025 · Backend Development

Migrating Observability Compute Layer from Java to Rust: Ownership, Concurrency, Deployment, and Monitoring

The article details how moving a high‑throughput observability compute layer from Java to Rust, leveraging Rust’s ownership model, zero‑cost async, and static binary deployment, cut memory usage by roughly 68% and CPU consumption by 40%. It also outlines the monitoring setup, the concurrency model, and the challenges of Rust’s steep learning curve.

Deployment · Observability · Rust
0 likes · 18 min read
Go Development Architecture Practice
Jan 17, 2025 · Backend Development

Mastering Go Backend: Project Structure, Error Handling, and Observability Best Practices

This article explores practical Go backend development techniques, covering project organization, package naming, internal packages, init usage, layer separation (controller, service, dao), dependency injection, global variable pitfalls, observability with logging, tracing and monitoring, comprehensive error handling, and DAO layer automation.

Backend · Observability · dependency-injection
0 likes · 23 min read
Alibaba Cloud Developer
Jan 8, 2025 · Cloud Native

Ensuring Massive Kubernetes Cluster Stability: Lessons from the OpenAI Outage

Using the recent OpenAI service disruption as a case study, this article examines the stability challenges of large‑scale Kubernetes deployments and details how Alibaba Cloud Container Service and its Prometheus‑based observability solutions enhance reliability through high‑availability architecture, optimized exporters, out‑of‑band data links, and best‑practice guidelines.

Alibaba Cloud · Large-Scale Clusters · Observability
0 likes · 22 min read
Alibaba Cloud Observability
Jan 6, 2025 · Operations

How Synthetic Monitoring Boosts Network Reliability and User Experience

This article explains the importance of network stability, outlines major real‑world outages, and introduces synthetic monitoring—its functions, advantages, disadvantages, and various types such as protocol, browser, and internal monitoring—while comparing probe point categories and guiding enterprises on selecting the right strategy to improve service reliability and performance.

Network Reliability · Observability · Operations
0 likes · 12 min read
Alibaba Cloud Infrastructure
Jan 3, 2025 · Cloud Native

How to Enable LLM Traffic Observability with Alibaba Cloud Service Mesh (ASM)

This guide explains how to use Alibaba Cloud Service Mesh (ASM) to add infrastructure‑level observability for large language model (LLM) traffic, covering custom access‑log fields, new Prometheus metrics for token usage, and adding model dimensions to native Istio metrics, with step‑by‑step commands and configuration examples.

ASM · Kubernetes · LLM
0 likes · 14 min read
Zhihu Tech Column
Dec 31, 2024 · Cloud Native

Cloud Native Innovation Forum: AutoMQ Table Topic, OceanBase Integrated Database, and Observability Practices

The article recaps Zhihu's Cloud Native Innovation Forum where experts from AutoMQ, OceanBase, and Flashcat shared practical solutions on streaming data ingestion, unified database architectures, and AI‑driven observability, highlighting real‑world deployments, performance optimizations, and cost‑saving strategies.

AI · AutoMQ · Cloud Native
0 likes · 10 min read
Alibaba Cloud Observability
Dec 30, 2024 · Operations

Alibaba Cloud’s Mint Tracing Framework and FAMOS Diagnosis Earn Top‑Conference Spot

Alibaba Cloud’s recent research breakthroughs—Mint, a cost‑efficient tracing framework that captures all request flows while drastically cutting storage and network overhead, and FAMOS, a multi‑modal fault‑diagnosis method for microservice systems—have been accepted to the prestigious ASPLOS and ICSE conferences, marking the first top‑conference publications in observability for the company.

Cloud Computing · Fault Diagnosis · Microservices
0 likes · 6 min read
Alibaba Cloud Developer
Dec 26, 2024 · Cloud Native

How a New Telemetry Service Overwhelmed OpenAI’s Kubernetes API Server

An in‑depth post‑mortem reveals how OpenAI’s newly deployed telemetry service generated massive Kubernetes API requests, overloading the API server, breaking DNS resolution, and slowing recovery, while contrasting OpenAI’s approach with LoongCollector/iLogtail’s design to minimize API load and improve cluster stability.

API Server · Cloud Native · Cluster Reliability
0 likes · 15 min read
Alibaba Cloud Infrastructure
Dec 25, 2024 · Cloud Native

Ensuring Stability of Large‑Scale Kubernetes Clusters: Lessons from the OpenAI Incident and Alibaba Cloud Practices

This article analyses the OpenAI large‑scale Kubernetes outage, explains the inherent risks of massive K8s clusters, and presents Alibaba Cloud's architectural enhancements, observability improvements, and best‑practice guidelines for running Kubernetes clusters with thousands of nodes reliably and with high availability.

Cloud Native · Kubernetes · Large-Scale Clusters
0 likes · 21 min read
Alibaba Cloud Infrastructure
Dec 17, 2024 · Cloud Native

Recap of Kubernetes Community Day 2024 Jakarta: Generative AI, eRDMA, Container Security, and Observability

The Kubernetes Community Day held in Jakarta on November 30, 2024 featured Alibaba Cloud experts presenting best‑practice sessions on scaling generative AI workloads, eRDMA network acceleration, container image security, and OpenTelemetry‑based observability within the ACK Kubernetes platform.

Cloud Native · Container Security · Kubernetes
0 likes · 6 min read
macrozheng
Dec 3, 2024 · Backend Development

Master Spring Boot 3.4: Key Changes, New Features, and Migration Guide

This comprehensive guide explores Spring Boot 3.4’s performance boosts, enhanced observability, and developer experience improvements, detailing major changes such as RestClient/RestTemplate auto‑configuration, bean validation updates, graceful shutdown, structured logging formats, observability enhancements, dependency upgrades, testing enhancements, and deprecated feature handling, with practical code snippets.

Observability · backend-development · configuration
0 likes · 9 min read
Architect
Nov 29, 2024 · Operations

How to Combine SkyWalking and ELK for End-to-End Trace ID Logging

This article explains how to integrate SkyWalking's distributed tracing with an ELK logging stack, embed Trace IDs into logs via SkyWalking layouts or MDC, and use Kibana to query and visualize trace‑linked log data for comprehensive microservice observability.

APM · ELK · Microservices
0 likes · 11 min read
58 Tech
Nov 27, 2024 · Operations

Building an Observability System for Cloud Authentication: Practices, Metrics, and Lessons Learned

This article details how 58 Group’s cloud authentication service introduced an observability framework—optimizing logs, employing distributed tracing, defining SLO/SLA metrics, and implementing burn‑rate alerts—to improve fault detection, reduce false alarms, and achieve faster root‑cause analysis across the system.

Distributed Tracing · Error Budget · Observability
0 likes · 16 min read
Java Architecture Diary
Nov 25, 2024 · Backend Development

Master Spring Boot 3.4 Upgrade: Key Changes, Configurations & Code Samples

Spring Boot 3.4 introduces performance boosts, enhanced observability, and developer experience improvements, and this guide walks you through the most critical changes—including RestClient auto‑configuration, bean validation updates, graceful shutdown, structured logging formats, dependency upgrades, testing enhancements, and deprecated feature handling—complete with configuration snippets and code examples.

Observability · backend-development · configuration
0 likes · 8 min read
ITPUB
Nov 23, 2024 · Operations

Zabbix vs Prometheus: Which Monitoring Tool Wins for Modern Cloud Environments?

This article compares Zabbix and Prometheus across performance, data collection, visualization, and alerting, highlighting their architectural differences, ecosystem strengths, and suitability for traditional data‑center monitoring versus dynamic cloud‑native workloads.

Alerting · Observability · Prometheus
0 likes · 11 min read
Alibaba Cloud Native
Nov 18, 2024 · Information Security

How Browser Synthetic Monitoring Detects CDN Supply‑Chain Attacks

The article explains how browser‑based synthetic monitoring can observe the full user experience, use rich assertions and multi‑step scripts to spot CDN supply‑chain poisoning and traffic hijacking, illustrated with real polyfill.io and BootCDN attack cases.

CDN poisoning · Observability · Supply Chain Attack
0 likes · 10 min read
Linux Kernel Journey
Nov 14, 2024 · Artificial Intelligence

Deep Dive: How DeepFlow Collects Business Metrics for Large‑Model Services

This article explains how China Mobile built a hybrid‑cloud production environment for its customer‑service LLM, using eBPF and WebAssembly plugins from DeepFlow to achieve zero‑intrusion observability, automatically capture full‑stack topology, application/network metrics, and key LLM business indicators such as TTFT, TPOT, and token throughput.

DeepFlow · Grafana · LLM
0 likes · 19 min read
Alibaba Cloud Observability
Nov 8, 2024 · Operations

Why Alibaba Cloud’s New Java Agent Outperforms OpenTelemetry in Performance and Features

This article examines the evolution from ARMS Java Agent to the OTel‑based Alibaba Cloud Java Agent 4.x, comparing tracing, metrics, logging, and profiling capabilities, highlighting innovative designs such as muzzle‑check and VirtualField, and detailing the performance, stability, and community contributions that make the new agent a superior observability solution.

Observability · tracing
0 likes · 21 min read
Cloud Native Technology Community
Nov 7, 2024 · Cloud Native

Top Microservices Trends Shaping 2025: Edge, Serverless, AI & More

Microservices are evolving toward 2025 with trends such as edge computing, container orchestration via Kubernetes, DevSecOps, serverless functions, AI-driven management, advanced observability, API gateways, service meshes, multi-language services, event-driven designs, improved data handling, low-code integration, and stronger resilience, reshaping agile, scalable software development.

AI · Cloud Native · DevSecOps
0 likes · 10 min read
dbaplus Community
Oct 28, 2024 · Operations

How We Built a Real‑Time Cross‑Platform Troubleshooting System for Live Streaming

The article describes a high‑efficiency, cross‑device real‑time troubleshooting system for live‑streaming services, covering its motivation, key monitoring, unified trace design, component evolution, data processing, storage, and visualization, and demonstrates how these measures dramatically improved issue‑resolution speed and system stability.

Distributed Tracing · Observability · Performance Optimization
0 likes · 14 min read
360 Zhihui Cloud Developer
Oct 28, 2024 · Operations

How Zero‑Intrusion eBPF Transforms TCP Network Monitoring and Troubleshooting

This article explains how zero‑intrusion eBPF technology enables detailed, non‑disruptive TCP network monitoring, covering data collection interfaces, aggregation methods, implementation steps, usage limitations, and practical installation and visualization guidance for improving network performance and fault analysis.

Linux kernel · Network Monitoring · Observability
0 likes · 9 min read
Efficient Ops
Oct 24, 2024 · Operations

How Migu’s AI‑Powered Observability Boosts Cloud Gaming Operations

During the 24th GOPS Global Operations Conference, Migu Interactive Entertainment’s Vice President Su Yi discussed how their AI‑driven AIOps observability framework, validated by ITU standards, enhances cloud gaming platform stability, accelerates issue detection, and supports China Mobile’s 5G‑based digital transformation.

AI · Digital Transformation · Observability
0 likes · 19 min read
Efficient Ops
Oct 21, 2024 · Operations

Essential Prometheus Best Practices: Avoid Common Pitfalls and Boost Reliability

This article shares practical Prometheus best‑practice tips—from understanding its accuracy‑reliability trade‑offs and self‑monitoring, to avoiding NFS storage, managing high‑cardinality metrics, handling rate() and recording‑rule pitfalls, and fine‑tuning alerting—so you can run a stable, low‑cost monitoring stack.

Alerting · Cloud Native · Observability
0 likes · 10 min read
JD Tech Talk
Oct 21, 2024 · Operations

Observability and Quality Assurance: Strategies for Test Teams

This article examines how test teams can enhance application observability and quality assurance by distinguishing observability from traditional monitoring, defining goals, outlining a monitoring foundation, and proposing module‑level and system‑level strategies for proactive fault detection, data analysis, and alerting.

Observability · Quality assurance · monitoring
0 likes · 12 min read
JD Cloud Developers
Oct 21, 2024 · Operations

How Test Teams Can Build Observability Beyond Traditional Monitoring

This article examines how quality assurance engineers can adopt observability principles—distinct from conventional monitoring—to enhance system health detection, root‑cause analysis, and proactive risk mitigation across resources, services, business functions, data, and logs.

Observability · Operations · Quality assurance
0 likes · 17 min read
Efficient Ops
Oct 19, 2024 · Operations

How Migu’s Cloud Gaming Platform Achieved Leading AIOps Observability Standards

Migu Interactive Entertainment’s interview reveals how its cloud gaming platform leveraged AI, 5G, and standardized observability practices to pass both international and domestic AIOps assessments, highlighting the strategic importance of intelligent operations for business continuity in complex, distributed systems.

AI · Digital Transformation · Intelligent Operations
0 likes · 17 min read
Lobster Programming
Oct 17, 2024 · Operations

Designing Scalable Log Systems: From Monoliths to Microservices

Effective logging is crucial for developers to diagnose system errors, and this article compares traditional monolithic file‑based logging with modern microservice‑oriented solutions such as ELK, MongoDB, and Loki, outlining their architectures, advantages, and selection criteria.

ELK · Loki · MongoDB
0 likes · 5 min read
Alibaba Cloud Native
Oct 11, 2024 · Cloud Native

Can iLogtail Replace Logstash? A Deep Dive into Performance and Architecture

This article examines the traditional ELK stack, compares iLogtail with Filebeat and Logstash in real‑world performance tests, analyzes why iLogtail could not previously replace Logstash, and presents five concrete engineering solutions that enable iLogtail to become a viable, high‑performance alternative for log collection and processing.

Cloud Native · ELK · Observability
0 likes · 12 min read
Alibaba Cloud Observability
Oct 9, 2024 · Cloud Native

How iLogtail Evolved Over 13 Years to Lead Cloud‑Native Observability

iLogtail, a lightweight log collector, has transformed over 13 years from a simple log‑gathering tool into a full‑stack, cloud‑native observability platform, introducing Go plugins, high‑performance C++ pipelines, SPL processing, modular architecture, and advanced self‑monitoring, reflecting broader trends in data collection technology.

Observability · Performance Optimization · log collection
0 likes · 22 min read
Rare Earth Juejin Tech Community
Oct 9, 2024 · Operations

Introducing Kyanos: A Lightweight eBPF‑Based Tool for Fast Network Issue Diagnosis

Kyanos is an open‑source command‑line utility that leverages eBPF to provide low‑overhead, kernel‑compatible network tracing and performance analysis for HTTP, MySQL, and Redis traffic, offering simple watch and stat commands that replace slow tcpdump workflows with seconds‑level diagnostics.

Observability · Performance debugging · command-line tool
0 likes · 11 min read
Alibaba Cloud Infrastructure
Sep 29, 2024 · Cloud Native

Building a Production‑Grade Observability System for Alibaba Cloud ACK Container Service

The presentation outlines Alibaba Cloud's ACK container service observability framework, covering its architecture, key capabilities such as eBPF‑based tracing, GPU profiling, network diagnostics, storage monitoring, and FinOps integration, and demonstrates how these features support AI workloads, large‑scale production stability, and automated incident response.

AI · Cloud Native · Container Service
0 likes · 15 min read
Alibaba Cloud Observability
Sep 29, 2024 · Cloud Native

How to Achieve End-to-End Traceability with RUM and OpenTelemetry

This article explores the challenges of linking Real User Monitoring (RUM) with backend tracing, presents a comprehensive end-to-end traceability solution based on OpenTelemetry and the W3C Trace Context protocol, and offers best-practice guidance for integrating RUM into full-stack observability pipelines.

Observability · OpenTelemetry · RUM
0 likes · 15 min read
Alibaba Cloud Native
Sep 26, 2024 · Cloud Native

How iLogtail Evolved: From Simple Log Collector to Cloud‑Native Observability Platform

This article chronicles iLogtail's 13‑year journey—from its 2013 inception as a basic log collector to a fully open‑source, cloud‑native observability platform—highlighting technical milestones, emerging trends in log agents, architectural innovations, performance breakthroughs, and future directions.

Cloud Native · Observability · iLogtail
0 likes · 21 min read
AntData
Sep 26, 2024 · Databases

Apache HoraeDB (CeresDB): An Open‑Source Distributed Time‑Series Database

Apache HoraeDB (CeresDB) is an open‑source, distributed, high‑availability time‑series database developed by Ant Group, supporting multi‑dimensional queries, compatible with Prometheus and OpenTSDB, and offering SQL and OLAP capabilities for use cases such as APM, IoT monitoring, financial analytics, and AI‑infra observability.

Distributed Systems · Observability · Open-source
0 likes · 5 min read
Sohu Tech Products
Sep 25, 2024 · Cloud Native

Observability Concepts and OpenTelemetry Architecture Overview

Observability turns a black‑box application into a transparent system by gathering logs, metrics, and traces, using alerts to spot anomalies and then linking trace IDs to logs. OpenTelemetry standardizes this with instrumented client agents, a Collector (receivers, processors, exporters), and backend storage. Java agents, span propagation, exemplars, eBPF, and bundles such as SigNoz or OpenObserve let teams choose between a custom OTel stack and an all‑in‑one solution.

Cloud Native · Metrics · Observability
0 likes · 11 min read
DevOps Operations Practice
Sep 25, 2024 · Operations

Prometheus 3.0‑beta Released: New UI, Remote Write 2.0, OpenTelemetry Support, and Other Major Changes

Prometheus 3.0‑beta introduces a completely redesigned UI, Remote Write 2.0 with native support for metadata and histograms, built‑in OpenTelemetry metrics handling, UTF‑8 label support, native histograms, and several feature‑flag removals, while encouraging community testing before production use.

BetaRelease · Observability · OpenTelemetry
0 likes · 6 min read
dbaplus Community
Sep 23, 2024 · Operations

How Bilibili Scaled Monitoring: From Prometheus to a 2.0 VM‑Flink Architecture

Bilibili rebuilt its monitoring platform to handle explosive metric growth by separating collection, storage, and compute, adopting VictoriaMetrics, zone‑based scheduling, and Flink‑driven pre‑aggregation, which together improved stability, query performance, cloud data quality, and overall observability.

Flink · Observability · Prometheus
0 likes · 31 min read
Ops Development Stories
Sep 19, 2024 · Artificial Intelligence

How to Connect Qwen LLMs with Higress AI Gateway: A Hands‑On Guide

This tutorial walks through setting up a local k3d cluster, installing Higress, and using its AI plugins—including AI Proxy, AI JSON formatter, AI Agent, and AI Statistics—to integrate and observe Alibaba Cloud's Qwen large language models across various use cases such as weather and flight queries.

AI gateway · AI plugins · Higress
0 likes · 30 min read
Architect
Sep 13, 2024 · Operations

Introducing MyPerf4J: A High‑Performance Java Monitoring and Statistics Tool

The article presents MyPerf4J, a Java‑agent based, low‑overhead performance monitoring library that provides real‑time metrics such as method latency, QPS, memory usage, GC statistics, and class loading, along with quick‑start instructions, configuration details, and open‑source links for Java backend services.

Backend · JavaAgent · Metrics
0 likes · 7 min read
Architect
Sep 12, 2024 · Operations

How Bilibili Scaled Its Monitoring: From Prometheus OOMs to VictoriaMetrics & Flink Pre‑Aggregation

The article details Bilibili's evolution of its monitoring platform, describing the stability and performance challenges of a Prometheus‑Thanos stack, the redesign using VictoriaMetrics, collection‑storage separation, unit‑level disaster recovery, query‑tree auto‑replacement, Flink‑based pre‑aggregation, Grafana upgrades, and future roadmap for observability.

Cloud Native · Flink · Metrics
0 likes · 30 min read
BirdNest Tech Talk
Sep 11, 2024 · Cloud Native

How to Build a Complete eBPF Development Environment on Ubuntu

This guide walks through the purpose, advantages, required Linux packages, Go libraries, exact installation commands, and version details needed to set up a functional eBPF development environment on an Ubuntu system, while explaining each step’s rationale.

Cloud Native · Development Environment · Go
0 likes · 10 min read
Xiaohongshu Tech REDtech
Sep 9, 2024 · Cloud Native

Applying eBPF for Cloud‑Native Observability and Continuous Profiling

By deploying eBPF agents as DaemonSets that hook kernel network and performance events, the Xiaohongshu observability team extended cloud‑native monitoring from the application to the kernel, delivering real‑time traffic analysis and low‑overhead continuous profiling for C++ services, aggregating data into centralized collectors for dashboards, flame‑graphs, and rapid root‑cause diagnosis.

Kubernetes · Observability · Profiling
0 likes · 37 min read
21CTO
Aug 30, 2024 · Backend Development

How to Stay Ahead as a Java Developer: Tips for JDK 21, Spring Boot 3.2, and Beyond

This article compiles practical advice for Java developers who feel out of practice, covering migration to JDK 21, Spring Boot 3.2 observability, new language features, community resources, and strategies to boost confidence and stay current with the evolving Java ecosystem.

JDK 21 · Observability · backend-development
0 likes · 9 min read
Su San Talks Tech
Aug 28, 2024 · Operations

SkyWalking Guide: Setup, Tracing, Logging & Alerts for Distributed Apps

This article walks through SkyWalking, an open‑source APM solution, covering its architecture, server and client installation, configuration for MySQL persistence, log collection, performance profiling, and alerting, while comparing it with Spring Cloud Sleuth + Zipkin and showing practical code examples.

Distributed Tracing · Microservices · Observability
0 likes · 15 min read
Sohu Tech Products
Aug 21, 2024 · Operations

Step-by-Step Guide: Integrating OpenTelemetry Tracing in Java and Go Projects

This tutorial walks through setting up OpenTelemetry tracing from scratch for both Java and Go microservices, covering collector and Jaeger deployment, required dependencies, configuration parameters, code examples for automatic and manual instrumentation, and how to add custom span attributes and spans.

Distributed Tracing · Go · Observability
0 likes · 15 min read
Alibaba Cloud Native
Aug 21, 2024 · Cloud Native

What Drives iLogtail Adoption? Insights from a Two‑Year Community Survey

A two‑year community survey of the open‑source iLogtail collector reveals that high performance, container‑friendly design, extensive plugin ecosystem, and strong Kubernetes integration drive widespread production use, while users request better documentation, a more polished ConfigServer tool, and clearer contribution pathways.

Cloud Native · Observability · log collection
0 likes · 10 min read
DevOps
Aug 20, 2024 · Operations

CI/CD Maturity Levels and DevOps Practices in Chinese Enterprises

This article analyzes the current state of DevOps adoption in China, presents detailed CI/CD capability levels with a maturity model table, and discusses future operational trends such as automation, AIOps, security integration, observability, and reliability engineering to guide enterprises toward more efficient software delivery.

CI/CD · DevOps · Observability
0 likes · 20 min read
Alibaba Cloud Observability
Aug 15, 2024 · Cloud Native

How SPL’s High‑Performance Mode Transforms Log Query at Scale

This article explains how the SLS Processing Language (SPL) combines pipeline syntax with SQL‑like operators, introduces a high‑performance mode that pushes computation to storage nodes and uses vectorized processing, and demonstrates sub‑second query times on billions of log entries while supporting rich filtering, histogram visualization, and random paging.

Observability · SPL · high performance query
0 likes · 17 min read
Alibaba Cloud Observability
Aug 15, 2024 · Cloud Native

How LoongCollector Transforms iLogtail into a Next‑Gen Cloud‑Native Observability Agent

This article chronicles the two‑year evolution of iLogtail into LoongCollector, detailing its origins, technical milestones, community contributions, feature set—including high‑performance pipelines, programmable SPL, extensive K8s support, and unified config management—and outlines the roadmap that positions it as a leading cloud‑native observability solution.

Observability · Pipeline · cloud-native
0 likes · 19 min read
Eric Tech Circle
Aug 15, 2024 · Backend Development

Lightweight Distributed Tracing in Spring Cloud Without Third‑Party Tools

This guide shows how to implement end‑to‑end trace ID propagation across Spring Cloud gateways, downstream services, and asynchronous threads using a custom GlobalTraceFilter, a patched LogbackMDCAdapter with Alibaba TransmittableThreadLocal, and minimal configuration, avoiding heavyweight tracing libraries.

Distributed Tracing · Microservices · Observability
0 likes · 5 min read
Sohu Tech Products
Aug 14, 2024 · Operations

How to Combine SkyWalking and ELK for End-to-End Trace ID Logging

This article explains why ELK alone lacks Trace ID support, describes the architectures of SkyWalking and ELK, compares their capabilities, and provides step‑by‑step configurations—including a Logback layout and MDC approach—to embed Trace IDs into logs for full distributed tracing.

APM · Distributed Tracing · ELK
0 likes · 10 min read
Alibaba Cloud Native
Aug 12, 2024 · Cloud Native

How SPL’s High‑Performance Mode Supercharges Log Queries in the Cloud

Log data’s immutable, random, and multi‑source nature makes traditional search inefficient, so Alibaba Cloud’s SLS introduces the SPL pipeline language, combining Unix‑style piping with SQL‑like functions, and leverages computation push‑down, vectorized processing, and optimized I/O to deliver high‑performance log queries at scale.

Cloud Native · Observability · SPL
0 likes · 18 min read
ITPUB
Aug 11, 2024 · Operations

Scaling Bilibili’s Metrics Platform with VictoriaMetrics and Flink Pre‑aggregation

This article details how Bilibili redesigned its monitoring system to overcome explosive metric growth by separating collection and storage, adopting VictoriaMetrics, implementing zone‑based scheduling, automating PromQL query replacement, and using Flink for efficient pre‑aggregation, resulting in dramatically lower latency and higher stability.

Architecture · Flink · Observability
0 likes · 31 min read
Wukong Talks Architecture
Aug 9, 2024 · Operations

Integrating SkyWalking with ELK for Distributed Trace ID Logging

This article explains how to combine SkyWalking and the ELK stack to embed Trace IDs into logs, enabling end‑to‑end request tracing, discusses the strengths and limitations of each platform, and provides configuration examples for Logback, MDC, and Kibana visualisation.

Distributed Tracing · ELK · Observability
0 likes · 12 min read
Bilibili Tech
Aug 9, 2024 · Operations

Design and Optimization of Monitoring 2.0 Architecture with VictoriaMetrics and Flink

The new Monitoring 2.0 architecture separates collection, compute and storage, adopts VictoriaMetrics for compact time‑series storage and a zone‑based scheduler, introduces push‑based ingestion, uses Flink for real‑time pre‑aggregation and automatic PromQL rewrite, delivering ten‑fold query speedups, sub‑300 ms p90 latency, and dramatically higher write and query throughput.

Flink · Metrics · Observability
0 likes · 29 min read
ITPUB
Aug 8, 2024 · Operations

Why Solid Monitoring Must Come Before Observability Projects (And How to Build It)

Before launching costly observability initiatives, ensure your monitoring is comprehensive and efficient, covering business, application, component, resource, network, and endpoint metrics, and that you have the data collection, storage, alerting, and event‑distribution capabilities to turn raw signals into actionable insights.

Alerting · Observability · monitoring
0 likes · 9 min read
Alibaba Cloud Native
Aug 7, 2024 · Operations

How iLogtail Achieves Million‑Scale Observability with SRE Practices

This article details how Alibaba Cloud's iLogtail agent, serving tens of thousands of hosts and containers, overcomes unique stability challenges through a comprehensive SRE approach that spans design, development, testing, gray‑release, operations, and customer‑support, ultimately boosting reliability and reducing incident rates.

Cloud Native · Observability · SRE
0 likes · 32 min read
FunTester
Jul 30, 2024 · Operations

Mastering True Observability: Models, Practices, and AI‑Driven Automation

This article explains why true observability is essential for modern software, outlines its five core pillars, details a four‑stage maturity model with benefits and drawbacks, and provides practical steps—including data collection, team organization, and AI automation—to advance from basic monitoring to predictive, self‑healing systems.

AI · Maturity Model · Observability
0 likes · 13 min read
DaTaobao Tech
Jul 29, 2024 · Operations

Testing Environment Reliability, Routing Isolation, Monitoring, and Efficient Deployment Practices

Alibaba Taotian’s testing platform now lets business owners self‑service reliable environments by binding accounts to isolated routes, monitoring lightweight health metrics with automated self‑healing, accelerating deployments via code caching and JVM tricks, and enabling rapid “time‑travel” scenario testing, while planning tighter observability and production alignment.

Observability · Testing Environment · deployment efficiency
0 likes · 11 min read
Architecture and Beyond
Jul 28, 2024 · Frontend Development

Comprehensive Guide to Front‑End Stability: Observability, Full‑Chain Monitoring, High‑Availability Architecture, Performance Management, Risk Governance, Process Mechanisms, and Engineering Practices

This extensive article presents a systematic approach to front‑end stability, covering observability systems, full‑chain monitoring, high‑availability design, performance management, risk governance, process mechanisms, and engineering practices to ensure reliable user experiences and business continuity.

Frontend · Observability · Performance
0 likes · 44 min read
MaGe Linux Operations
Jul 23, 2024 · Operations

Master Loki Logging: Deploy, Configure, and Troubleshoot on Kubernetes

This guide walks you through Loki, a lightweight log aggregation system, covering its architecture, advantages, deployment options (All‑In‑One, Kubernetes, and bare‑metal), Promtail configuration, Helm installation, and common troubleshooting steps for reliable log collection and querying in Grafana.

Kubernetes · Loki · Observability
0 likes · 26 min read
ITPUB
Jul 22, 2024 · Operations

How We Upgraded Watcher to Second‑Level Monitoring for Real‑Time Order Alerts

This article details the end‑to‑end redesign of Qunar Travel's Watcher monitoring platform from minute‑level to second‑level precision, covering architectural changes, storage engine migration, client‑side metric collection, server‑side scheduling, dashboard and alarm adaptations, and the resulting operational improvements.

DevOps · Observability · Time Series
0 likes · 20 min read
Bilibili Tech
Jul 19, 2024 · Big Data

Bilibili's One-Stop Big Data Cluster Management Platform (BMR) - Architecture and Implementation

Bilibili’s one‑stop Big Data Cluster Management Platform (BMR) consolidates HDFS, Spark, Flink, ClickHouse, Kafka and other services into a unified system. It evolved through four stages (standardization, metadata‑driven construction, containerization, and observability) to address node consistency, scaling, fault self‑healing, and resource optimization, and it now delivers elastic scaling and automated start/stop, with further cost‑saving and stability enhancements planned.

Cluster Management · Containerization · Observability
0 likes · 12 min read
MaGe Linux Operations
Jul 16, 2024 · Cloud Native

How Prometheus Sends Alerts: Rules, Templates, and Frequency Explained

This article explains how Prometheus generates and sends alerts, covering the definition of alert rules with PromQL, grouping, templating, configuring evaluation intervals, deploying a custom alert receiver in Kubernetes, and analyzing alert payloads and delivery frequency, while also detailing alert silencing and resolution behavior.

Alerting · Alertmanager · Go
0 likes · 26 min read
Spring Full-Stack Practical Cases
Jul 14, 2024 · Backend Development

Master Spring Boot Observability with @Timed, @Counted, and @MeterTag

Learn how to enable comprehensive observability in Spring Boot 3.2.5 by leveraging Micrometer’s @Timed, @Counted, and @MeterTag annotations, configuring Actuator endpoints, and customizing aspects to monitor method execution time, request counts, and parameters, complete with practical code examples and Prometheus integration.

Micrometer · Observability · Prometheus
0 likes · 7 min read
Huawei Cloud Developer Alliance
Jul 10, 2024 · Cloud Native

Why CNCF’s Acceptance of openGemini Boosts Cloud‑Native Time‑Series Databases

The Cloud Native Computing Foundation (CNCF) has officially welcomed Huawei Cloud’s open‑source high‑performance time‑series database project openGemini, highlighting its role in advancing cloud‑native database technology, supporting massive observability data storage and analysis, and fostering community growth and industry adoption.

CNCF · Observability · cloud-native
0 likes · 4 min read
Cloud Native Technology Community
Jul 9, 2024 · Cloud Native

Answering the Top 9 Questions About Monitoring in Kubernetes

This article discusses essential Kubernetes monitoring topics, including cost tracking, tool selection, observability frameworks, responsibility allocation, baseline establishment, namespace best practices, the importance of monitoring, backup solutions, and a comparison of Datadog versus Splunk for metrics.

Datadog · Kubernetes · Observability
0 likes · 6 min read
Yum! Tech Team
Jul 3, 2024 · Backend Development

Implementing Sentinel for Traffic Protection and Rate Limiting in a Large-Scale Restaurant Digital Platform

This article details how a large restaurant chain leveraged the open‑source Sentinel framework to implement comprehensive traffic protection, rate limiting, and circuit‑breaking across millions of daily orders, describing challenges, design choices, high‑availability rule distribution, monitoring, user‑experience considerations, and providing Java code examples for integration.

Backend · Observability · java
0 likes · 11 min read
Efficient Ops
Jul 1, 2024 · Cloud Native

How to Monitor Business Metrics with Prometheus in Kubernetes

This article explains the concept of observability, details Prometheus metric definitions and types, and provides Go code examples for exposing, defining, generating, and scraping business‑level metrics in a Kubernetes‑based cloud‑native environment.

Go · Kubernetes · Metrics
0 likes · 11 min read
IT Services Circle
Jul 1, 2024 · Operations

Understanding Distributed Tracing with SkyWalking: Principles, Architecture, and Practical Implementation

This article explains the fundamentals of distributed tracing in microservice environments, introduces OpenTracing standards, details SkyWalking's architecture and sampling strategies, evaluates its performance against competitors, and shares practical company adaptations such as custom plugins, forced sampling, and trace ID logging.

Distributed Tracing · Observability · OpenTracing
0 likes · 15 min read
MaGe Linux Operations
Jul 1, 2024 · Operations

Mastering Jaeger: A Complete Guide to Distributed Tracing and Deployment

Jaeger is an open‑source, CNCF‑graduated distributed tracing system built by Uber, and this guide explains its core concepts, architecture, sampling strategies, and various deployment options—including all‑in‑one, Kubernetes, and OpenTelemetry—plus how it compares with other tracing tools.

Distributed Tracing · Kubernetes · Observability
0 likes · 13 min read