Tagged articles

Observability

1054 articles · Page 5 of 11

Jan 10, 2025 · Operations

Deploy Loki & Grafana for Lightweight Log Monitoring with Docker and SpringBoot

This tutorial walks through setting up the lightweight Loki log aggregation system together with Grafana using Docker Compose, integrating Loki into a SpringBoot application via the loki‑logback‑appender, and configuring Grafana dashboards to visualize distributed logs on a low‑memory server.

Docker · Grafana · Log Monitoring

0 likes · 7 min read

Alibaba Cloud Developer

Jan 8, 2025 · Cloud Native

Ensuring Massive Kubernetes Cluster Stability: Lessons from the OpenAI Outage

Using the recent OpenAI service disruption as a case study, this article examines the stability challenges of large‑scale Kubernetes deployments and details how Alibaba Cloud Container Service and its Prometheus‑based observability solutions enhance reliability through high‑availability architecture, optimized exporters, out‑of‑band data links, and best‑practice guidelines.

Alibaba Cloud · Large-Scale Clusters · Observability

0 likes · 22 min read

Alibaba Cloud Observability

Jan 6, 2025 · Operations

How Synthetic Monitoring Boosts Network Reliability and User Experience

This article explains the importance of network stability, outlines major real‑world outages, and introduces synthetic monitoring—its functions, advantages, disadvantages, and various types such as protocol, browser, and internal monitoring—while comparing probe point categories and guiding enterprises on selecting the right strategy to improve service reliability and performance.

Network Reliability · Observability · Operations

0 likes · 12 min read

Alibaba Cloud Infrastructure

Jan 3, 2025 · Cloud Native

How to Enable LLM Traffic Observability with Alibaba Cloud Service Mesh (ASM)

This guide explains how to use Alibaba Cloud Service Mesh (ASM) to add infrastructure‑level observability for large language model (LLM) traffic, covering custom access‑log fields, new Prometheus metrics for token usage, and adding model dimensions to native Istio metrics, with step‑by‑step commands and configuration examples.

ASM · Kubernetes · LLM

0 likes · 14 min read

Zhihu Tech Column

Dec 31, 2024 · Cloud Native

Cloud Native Innovation Forum: AutoMQ Table Topic, OceanBase Integrated Database, and Observability Practices

The article recaps Zhihu's Cloud Native Innovation Forum where experts from AutoMQ, OceanBase, and Flashcat shared practical solutions on streaming data ingestion, unified database architectures, and AI‑driven observability, highlighting real‑world deployments, performance optimizations, and cost‑saving strategies.

AI · AutoMQ · Observability

0 likes · 10 min read

Alibaba Cloud Observability

Dec 30, 2024 · Operations

Alibaba Cloud’s Mint Tracing Framework and FAMOS Diagnosis Earn Top‑Conference Spot

Alibaba Cloud’s recent research breakthroughs—Mint, a cost‑efficient tracing framework that captures all request flows while drastically cutting storage and network overhead, and FAMOS, a multi‑modal fault‑diagnosis method for microservice systems—have been accepted to the prestigious ASPLOS and ICSE conferences, marking the first top‑conference publications in observability for the company.

Cloud Computing · Fault diagnosis · Observability

0 likes · 6 min read

Alibaba Cloud Developer

Dec 26, 2024 · Cloud Native

How a New Telemetry Service Overwhelmed OpenAI’s Kubernetes API Server

An in‑depth post‑mortem reveals how OpenAI’s newly deployed telemetry service generated massive Kubernetes API requests, overloading the API server, breaking DNS resolution, and slowing recovery, while contrasting OpenAI’s approach with LoongCollector/iLogtail’s design to minimize API load and improve cluster stability.

API Server · Observability · Telemetry

0 likes · 15 min read

Alibaba Cloud Infrastructure

Dec 25, 2024 · Cloud Native

Ensuring Stability of Large‑Scale Kubernetes Clusters: Lessons from the OpenAI Incident and Alibaba Cloud Practices

This article analyses the OpenAI large‑scale Kubernetes outage, explains the inherent risks of massive K8s clusters, and presents Alibaba Cloud's architectural enhancements, observability improvements, and best‑practice guidelines to achieve high‑availability and reliable operation of thousands‑node Kubernetes environments.

High Availability · Kubernetes · Large-Scale Clusters

0 likes · 21 min read

Alibaba Cloud Native

Dec 24, 2024 · Operations

How to Quickly Diagnose Error and Latency Issues in Cloud‑Native Applications

This article outlines a practical, end‑to‑end approach for identifying and resolving both error‑related and slow‑request problems in online systems by leveraging trace links, correlated logs, entity relationships, and large‑language‑model‑driven analysis to achieve rapid root‑cause isolation.

APM · LLM · Observability

0 likes · 12 min read

Alibaba Cloud Infrastructure

Dec 17, 2024 · Cloud Native

Recap of Kubernetes Community Day 2024 Jakarta: Generative AI, eRDMA, Container Security, and Observability

The Kubernetes Community Day held in Jakarta on November 30, 2024 featured Alibaba Cloud experts presenting best‑practice sessions on scaling generative AI workloads, eRDMA network acceleration, container image security, and OpenTelemetry‑based observability within the ACK Kubernetes platform.

Generative AI · Kubernetes · Observability

0 likes · 6 min read

macrozheng

Dec 3, 2024 · Backend Development

Master Spring Boot 3.4: Key Changes, New Features, and Migration Guide

This comprehensive guide explores Spring Boot 3.4’s performance boosts, enhanced observability, and developer experience improvements, detailing major changes such as RestClient/RestTemplate auto‑configuration, bean validation updates, graceful shutdown, structured logging formats, observability enhancements, dependency upgrades, testing enhancements, and deprecated feature handling, with practical code snippets.

Backend Development · Java · Observability

0 likes · 9 min read

Architect

Nov 29, 2024 · Operations

How to Combine SkyWalking and ELK for End-to-End Trace ID Logging

This article explains how to integrate SkyWalking's distributed tracing with an ELK logging stack, embed Trace IDs into logs via SkyWalking layouts or MDC, and use Kibana to query and visualize trace‑linked log data for comprehensive microservice observability.

APM · ELK · Logging

0 likes · 11 min read

58 Tech

Nov 27, 2024 · Operations

Building an Observability System for Cloud Authentication: Practices, Metrics, and Lessons Learned

This article details how 58 Group’s cloud authentication service introduced an observability framework—optimizing logs, employing distributed tracing, defining SLO/SLA metrics, and implementing burn‑rate alerts—to improve fault detection, reduce false alarms, and achieve faster root‑cause analysis across the system.

Distributed Tracing · Error Budget · Observability

0 likes · 16 min read

Java Architecture Diary

Nov 25, 2024 · Backend Development

Master Spring Boot 3.4 Upgrade: Key Changes, Configurations & Code Samples

Spring Boot 3.4 introduces performance boosts, enhanced observability, and developer experience improvements, and this guide walks you through the most critical changes—including RestClient auto‑configuration, bean validation updates, graceful shutdown, structured logging formats, dependency upgrades, testing enhancements, and deprecated feature handling—complete with configuration snippets and code examples.

Backend Development · Logging · Observability

0 likes · 8 min read

ITPUB

Nov 23, 2024 · Operations

Zabbix vs Prometheus: Which Monitoring Tool Wins for Modern Cloud Environments?

This article compares Zabbix and Prometheus across performance, data collection, visualization, and alerting, highlighting their architectural differences, ecosystem strengths, and suitability for traditional data‑center monitoring versus dynamic cloud‑native workloads.

Alerting · Observability · Prometheus

0 likes · 11 min read

Alibaba Cloud Native

Nov 18, 2024 · Information Security

How Browser Synthetic Monitoring Detects CDN Supply‑Chain Attacks

The article explains how browser‑based synthetic monitoring can observe the full user experience, use rich assertions and multi‑step scripts to spot CDN supply‑chain poisoning and traffic hijacking, illustrated with real polyfill.io and BootCDN attack cases.

CDN poisoning · Observability · Supply Chain Attack

0 likes · 10 min read

Linux Kernel Journey

Nov 14, 2024 · Artificial Intelligence

Deep Dive: How DeepFlow Collects Business Metrics for Large‑Model Services

This article explains how China Mobile built a hybrid‑cloud production environment for its customer‑service LLM, using eBPF and WebAssembly plugins from DeepFlow to achieve zero‑intrusion observability, automatically capture full‑stack topology, application/network metrics, and key LLM business indicators such as TTFT, TPOT, and token throughput.

DeepFlow · Grafana · LLM

0 likes · 19 min read

Alibaba Cloud Observability

Nov 13, 2024 · Cloud Native

How iLogtail’s Community Contributors Shaped Cloud‑Native Log Collection

Celebrating iLogtail’s two‑year open‑source anniversary, this article explores the journeys of community committers, their technical contributions such as Kafka dynamic‑topic support and data flattening, and how their work advances cloud‑native observability and log collection for enterprises.

Community · Observability · iLogtail

0 likes · 14 min read

Ops Development Stories

Nov 12, 2024 · Cloud Native

Why VictoriaLogs Beats Loki: Fast, Low‑Resource Log Management in Kubernetes

This guide walks through deploying VictoriaLogs in Kubernetes with Helm, configuring Promtail, exploring the LogsQL query language, comparing resource usage to Grafana Loki, and integrating the VictoriaLogs datasource into Grafana for efficient, low‑overhead log monitoring.

Kubernetes · LogsSQL · Observability

0 likes · 22 min read

Alibaba Cloud Observability

Nov 8, 2024 · Operations

Why Alibaba Cloud’s New Java Agent Outperforms OpenTelemetry in Performance and Features

This article examines the evolution from ARMS Java Agent to the OTel‑based Alibaba Cloud Java Agent 4.x, comparing tracing, metrics, logging, and profiling capabilities, highlighting innovative designs such as muzzle‑check and VirtualField, and detailing the performance, stability, and community contributions that make the new agent a superior observability solution.

Observability · Tracing

0 likes · 21 min read

Cloud Native Technology Community

Nov 7, 2024 · Cloud Native

Top Microservices Trends Shaping 2025: Edge, Serverless, AI & More

Microservices are evolving toward 2025 with trends such as edge computing, container orchestration via Kubernetes, DevSecOps, serverless functions, AI-driven management, advanced observability, API gateways, service meshes, multi-language services, event-driven designs, improved data handling, low-code integration, and stronger resilience, reshaping agile, scalable software development.

AI · DevSecOps · Edge Computing

0 likes · 10 min read

dbaplus Community

Oct 28, 2024 · Operations

How We Built a Real‑Time Cross‑Platform Troubleshooting System for Live Streaming

The article describes a high‑efficiency, cross‑device real‑time troubleshooting system for live‑streaming services, covering its motivation, key monitoring, unified trace design, component evolution, data processing, storage, and visualization, and demonstrates how these measures dramatically improved issue‑resolution speed and system stability.

Distributed Tracing · Live Streaming · Observability

0 likes · 14 min read

360 Zhihui Cloud Developer

Oct 28, 2024 · Operations

How Zero‑Intrusion eBPF Transforms TCP Network Monitoring and Troubleshooting

This article explains how zero‑intrusion eBPF technology enables detailed, non‑disruptive TCP network monitoring, covering data collection interfaces, aggregation methods, implementation steps, usage limitations, and practical installation and visualization guidance for improving network performance and fault analysis.

Linux kernel · Network Monitoring · Observability

0 likes · 9 min read

Efficient Ops

Oct 24, 2024 · Operations

How Migu’s AI‑Powered Observability Boosts Cloud Gaming Operations

During the 24th GOPS Global Operations Conference, Migu Interactive Entertainment’s Vice President Su Yi discussed how their AI‑driven AIOps observability framework, validated by ITU standards, enhances cloud gaming platform stability, accelerates issue detection, and supports China Mobile’s 5G‑based digital transformation.

AI · AIOps · Observability

0 likes · 19 min read

Efficient Ops

Oct 21, 2024 · Operations

Essential Prometheus Best Practices: Avoid Common Pitfalls and Boost Reliability

This article shares practical Prometheus best‑practice tips—from understanding its accuracy‑reliability trade‑offs and self‑monitoring, to avoiding NFS storage, managing high‑cardinality metrics, handling rate() and recording‑rule pitfalls, and fine‑tuning alerting—so you can run a stable, low‑cost monitoring stack.

Alerting · Observability · Operations

0 likes · 10 min read

JD Tech Talk

Oct 21, 2024 · Operations

Observability and Quality Assurance: Strategies for Test Teams

This article examines how test teams can enhance application observability and quality assurance by distinguishing observability from traditional monitoring, defining goals, outlining a monitoring foundation, and proposing module‑level and system‑level strategies for proactive fault detection, data analysis, and alerting.

Observability · Testing · monitoring

0 likes · 12 min read

JD Cloud Developers

Oct 21, 2024 · Operations

How Test Teams Can Build Observability Beyond Traditional Monitoring

This article examines how quality assurance engineers can adopt observability principles—distinct from conventional monitoring—to enhance system health detection, root‑cause analysis, and proactive risk mitigation across resources, services, business functions, data, and logs.

Observability · Operations · monitoring

0 likes · 17 min read

Efficient Ops

Oct 19, 2024 · Operations

How Migu’s Cloud Gaming Platform Achieved Leading AIOps Observability Standards

Migu Interactive Entertainment’s interview reveals how its cloud gaming platform leveraged AI, 5G, and standardized observability practices to pass both international and domestic AIOps assessments, highlighting the strategic importance of intelligent operations for business continuity in complex, distributed systems.

AI · AIOps · Intelligent Operations

0 likes · 17 min read

Lobster Programming

Oct 17, 2024 · Operations

Designing Scalable Log Systems: From Monoliths to Microservices

Effective logging is crucial for developers to diagnose system errors, and this article compares traditional monolithic file‑based logging with modern microservice‑oriented solutions such as ELK, MongoDB, and Loki, outlining their architectures, advantages, and selection criteria.

ELK · Logging · MongoDB

0 likes · 5 min read

Alibaba Cloud Big Data AI Platform

Oct 12, 2024 · Operations

How GitOps Powers AI‑Driven Large‑Scale Cloud‑Native Operations

The article summarizes Alibaba Cloud's 2024 conference talks on AI‑enhanced observability, presenting a cloud‑native GitOps solution for massive clusters and showcasing large‑model applications in intelligent Q&A and diagnosis to improve operational stability, cost, and efficiency.

AIOps · GitOps · Intelligent Operations

0 likes · 6 min read

Alibaba Cloud Native

Oct 11, 2024 · Cloud Native

Can iLogtail Replace Logstash? A Deep Dive into Performance and Architecture

This article examines the traditional ELK stack, compares iLogtail with Filebeat and Logstash in real‑world performance tests, analyzes why iLogtail could not previously replace Logstash, and presents five concrete engineering solutions that enable iLogtail to become a viable, high‑performance alternative for log collection and processing.

ELK · Observability · cloud-native

0 likes · 12 min read

Alibaba Cloud Observability

Oct 9, 2024 · Cloud Native

How iLogtail Evolved Over 13 Years to Lead Cloud‑Native Observability

iLogtail, a lightweight log collector, has transformed over 13 years from a simple log‑gathering tool into a full‑stack, cloud‑native observability platform, introducing Go plugins, high‑performance C++ pipelines, SPL processing, modular architecture, and advanced self‑monitoring, reflecting broader trends in data collection technology.

Observability · Performance Optimization · log collection

0 likes · 22 min read

Rare Earth Juejin Tech Community

Oct 9, 2024 · Operations

Introducing Kyanos: A Lightweight eBPF‑Based Tool for Fast Network Issue Diagnosis

Kyanos is an open‑source command‑line utility that leverages eBPF to provide low‑overhead, kernel‑compatible network tracing and performance analysis for HTTP, MySQL, and Redis traffic, offering simple watch and stat commands that replace slow tcpdump workflows with seconds‑level diagnostics.

Observability · Performance debugging · command-line tool

0 likes · 11 min read

Alibaba Cloud Infrastructure

Sep 29, 2024 · Cloud Native

Building a Production‑Grade Observability System for Alibaba Cloud ACK Container Service

The presentation outlines Alibaba Cloud's ACK container service observability framework, covering its architecture, key capabilities such as eBPF‑based tracing, GPU profiling, network diagnostics, storage monitoring, and FinOps integration, and demonstrates how these features support AI workloads, large‑scale production stability, and automated incident response.

AI · Container Service · FinOps

0 likes · 15 min read

Alibaba Cloud Observability

Sep 29, 2024 · Cloud Native

How to Achieve End-to-End Traceability with RUM and OpenTelemetry

This article explores the challenges of linking Real User Monitoring (RUM) with backend tracing, presents a comprehensive end-to-end traceability solution based on OpenTelemetry and the W3C Trace Context protocol, and offers best-practice guidance for integrating RUM into full-stack observability pipelines.

Observability · OpenTelemetry · RUM

0 likes · 15 min read

Alibaba Cloud Native

Sep 26, 2024 · Cloud Native

How iLogtail Evolved: From Simple Log Collector to Cloud‑Native Observability Platform

This article chronicles iLogtail's 13‑year journey—from its 2013 inception as a basic log collector to a fully open‑source, cloud‑native observability platform—highlighting technical milestones, emerging trends in log agents, architectural innovations, performance breakthroughs, and future directions.

Observability · cloud-native · iLogtail

0 likes · 21 min read

AntData

Sep 26, 2024 · Databases

Apache HoraeDB (CeresDB): An Open‑Source Distributed Time‑Series Database

Apache HoraeDB (CeresDB) is an open‑source, distributed, high‑availability time‑series database developed by Ant Group, supporting multi‑dimensional queries, compatible with Prometheus and OpenTSDB, and offering SQL and OLAP capabilities for use cases such as APM, IoT monitoring, financial analytics, and AI‑infra observability.

Observability · SQL · distributed systems

0 likes · 5 min read

Sohu Tech Products

Sep 25, 2024 · Cloud Native

Observability Concepts and OpenTelemetry Architecture Overview

Observability turns a black‑box application into an observable system by gathering logs, metrics, and traces, using alerts to spot anomalies, and then linking trace IDs to logs. OpenTelemetry standardizes this with instrumented client agents, a Collector (receivers, processors, exporters), and backend storage, while Java agents, span propagation, exemplars, eBPF, and bundles like SigNoz or OpenObserve let teams choose between a custom OTel stack and a bundled solution.

Observability · OpenTelemetry · Tracing

0 likes · 11 min read

DevOps Operations Practice

Sep 25, 2024 · Operations

Prometheus 3.0‑beta Released: New UI, Remote Write 2.0, OpenTelemetry Support, and Other Major Changes

Prometheus 3.0‑beta introduces a completely redesigned UI, Remote Write 2.0 with native support for metadata and histograms, built‑in OpenTelemetry metrics handling, UTF‑8 label support, native histograms, and several feature‑flag removals, while encouraging community testing before production use.

BetaRelease · Observability · OpenTelemetry

0 likes · 6 min read

dbaplus Community

Sep 23, 2024 · Operations

How Bilibili Scaled Monitoring: From Prometheus to a 2.0 VM‑Flink Architecture

Bilibili rebuilt its monitoring platform to handle explosive metric growth by separating collection, storage, and compute, adopting VictoriaMetrics, zone‑based scheduling, and Flink‑driven pre‑aggregation, which together improved stability, query performance, cloud data quality, and overall observability.

Flink · Observability · Prometheus

0 likes · 31 min read

BirdNest Tech Talk

Sep 22, 2024 · Operations

How to Trace Go Goroutine State Changes with eBPF Uprobes and Ring Buffers

This article explains how to monitor Go goroutine state transitions without modifying source code by attaching eBPF uprobes to the runtime.casgstatus function, defining a custom data struct, using a ring buffer to deliver events, and processing them in a Go user‑space program.

Observability · eBPF · go

0 likes · 16 min read

Ops Development Stories

Sep 19, 2024 · Artificial Intelligence

How to Connect Qwen LLMs with Higress AI Gateway: A Hands‑On Guide

This tutorial walks through setting up a local k3d cluster, installing Higress, and using its AI plugins—including AI Proxy, AI JSON formatter, AI Agent, and AI Statistics—to integrate and observe Alibaba Cloud's Qwen large language models across various use cases such as weather and flight queries.

AI Plugins · AI gateway · Higress

0 likes · 30 min read

Architect

Sep 13, 2024 · Operations

Introducing MyPerf4J: A High‑Performance Java Monitoring and Statistics Tool

The article presents MyPerf4J, a Java‑agent based, low‑overhead performance monitoring library that provides real‑time metrics such as method latency, QPS, memory usage, GC statistics, and class loading, along with quick‑start instructions, configuration details, and open‑source links for Java backend services.

Java · JavaAgent · Observability

0 likes · 7 min read

Top Architect

Sep 13, 2024 · Operations

Comparison of ELK, EFK, and PLG Logging Systems and Their Architectural Differences

The article explains the components and workflows of ELK, EFK, and PLG (Promtail + Loki + Grafana) logging stacks, compares their architectures, and highlights the trade‑offs between Elasticsearch‑based solutions and Loki‑based solutions for observability in cloud‑native environments.

EFK · ELK · Grafana

0 likes · 8 min read

Architect

Sep 12, 2024 · Operations

How Bilibili Scaled Its Monitoring: From Prometheus OOMs to VictoriaMetrics & Flink Pre‑Aggregation

The article details Bilibili's evolution of its monitoring platform, describing the stability and performance challenges of a Prometheus‑Thanos stack, the redesign using VictoriaMetrics, collection‑storage separation, unit‑level disaster recovery, query‑tree auto‑replacement, Flink‑based pre‑aggregation, Grafana upgrades, and future roadmap for observability.

Flink · Observability · Prometheus

0 likes · 30 min read

BirdNest Tech Talk

Sep 11, 2024 · Cloud Native

How to Build a Complete eBPF Development Environment on Ubuntu

This guide walks through the purpose, advantages, required Linux packages, Go libraries, exact installation commands, and version details needed to set up a functional eBPF development environment on an Ubuntu system, while explaining each step’s rationale.

Development Environment · Linux · Observability

0 likes · 10 min read

Cloud Native Technology Community

Sep 10, 2024 · Industry Insights

What Makes Cloudflare AI Gateway Stand Out? A Deep Dive into AI API Gateway Features

This article analyzes the emerging AI Gateway market, compares major products such as Kong, Gloo, Higress, Portkey, and OneAPI, and provides a detailed technical review of Cloudflare AI Gateway’s architecture, capabilities, advantages, limitations, and practical usage for LLM integration.

AI gateway · API Gateway · Cloudflare

0 likes · 9 min read

Xiaohongshu Tech REDtech

Sep 9, 2024 · Cloud Native

Applying eBPF for Cloud‑Native Observability and Continuous Profiling

By deploying eBPF agents as DaemonSets that hook kernel network and performance events, the Xiaohongshu observability team extended cloud‑native monitoring from the application to the kernel, delivering real‑time traffic analysis and low‑overhead continuous profiling for C++ services, aggregating data into centralized collectors for dashboards, flame‑graphs, and rapid root‑cause diagnosis.

Kubernetes · Observability · Profiling

0 likes · 37 min read

Soul Technical Team

Sep 2, 2024 · Databases

Comparative Analysis of VictoriaMetrics and Thanos for Large‑Scale Metric Storage

This article examines the migration from Thanos to VictoriaMetrics for large‑scale metric storage, detailing background challenges, VictoriaMetrics architecture and storage engine, data write and read processes, and a comparative analysis of performance, scalability, and operational costs between the two systems.

Observability · Performance · Thanos

0 likes · 15 min read

21CTO

Aug 30, 2024 · Backend Development

How to Stay Ahead as a Java Developer: Tips for JDK 21, Spring Boot 3.2, and Beyond

This article compiles practical advice for Java developers feeling out‑of‑practice, covering migration to JDK 21, Spring Boot 3.2 observability, new language features, community resources, and strategies to boost confidence and stay current with the evolving Java ecosystem.

Backend Development · JDK 21 · Java

0 likes · 9 min read

Alibaba Cloud Observability

Aug 29, 2024 · Cloud Native

What Drives iLogtail Adoption? Insights from the 2‑Year Community Survey

The two‑year iLogtail community survey reveals that high performance, container‑friendly design, and a rich plugin ecosystem drive adoption, while users request better documentation, a more active development roadmap, and improved configuration tools to boost community participation and ecosystem growth.

Kubernetes · Observability · Plugins

0 likes · 10 min read

Efficient Ops

Aug 28, 2024 · Operations

Mastering Prometheus: Architecture, Metrics, and Real-World Monitoring Techniques

This article provides a comprehensive overview of Prometheus, covering its architecture, suitable and unsuitable use cases, data model with TSDB and WAL, metric types, PromQL query language, and practical examples for monitoring CPU, memory, and disk usage in Kubernetes environments.

Kubernetes · Observability · PromQL

0 likes · 12 min read

Su San Talks Tech

Aug 28, 2024 · Operations

SkyWalking Guide: Setup, Tracing, Logging & Alerts for Distributed Apps

This article walks through SkyWalking, an open‑source APM solution, covering its architecture, server and client installation, configuration for MySQL persistence, log collection, performance profiling, and alerting, while comparing it with Spring Cloud Sleuth + Zipkin and showing practical code examples.

Distributed Tracing · Java · Observability

0 likes · 15 min read

Sohu Tech Products

Aug 21, 2024 · Operations

Step-by-Step Guide: Integrating OpenTelemetry Tracing in Java and Go Projects

This tutorial walks through setting up OpenTelemetry tracing from scratch for both Java and Go microservices, covering collector and Jaeger deployment, required dependencies, configuration parameters, code examples for automatic and manual instrumentation, and how to add custom span attributes and spans.

Distributed Tracing · Java · Observability

0 likes · 15 min read

Alibaba Cloud Native

Aug 21, 2024 · Cloud Native

What Drives iLogtail Adoption? Insights from a Two‑Year Community Survey

A two‑year community survey of the open‑source iLogtail collector reveals that high performance, container‑friendly design, extensive plugin ecosystem, and strong Kubernetes integration drive widespread production use, while users request better documentation, a more polished ConfigServer tool, and clearer contribution pathways.

Observability · cloud-native · log collection

0 likes · 10 min read

DevOps

Aug 20, 2024 · Operations

CI/CD Maturity Levels and DevOps Practices in Chinese Enterprises

This article analyzes the current state of DevOps adoption in China, presents detailed CI/CD capability levels with a maturity model table, and discusses future operational trends such as automation, AIOps, security integration, observability, and reliability engineering to guide enterprises toward more efficient software delivery.

Automation · CI/CD · Observability

0 likes · 20 min read

macrozheng

Aug 20, 2024 · Operations

Boost Java App Performance with MyPerf4J: High‑Throughput, Low‑Latency Monitoring

MyPerf4J is a high‑performance, non‑intrusive Java Agent that records millions of method calls per second with nanosecond precision, offering real‑time metrics, low memory overhead, and comprehensive monitoring for both development and production environments.

Java · MyPerf4J · Observability

0 likes · 8 min read

Alibaba Cloud Observability

Aug 15, 2024 · Cloud Native

How SPL’s High‑Performance Mode Transforms Log Query at Scale

This article explains how the SLS Processing Language (SPL) combines pipeline syntax with SQL‑like operators, introduces a high‑performance mode that pushes computation to storage nodes and uses vectorized processing, and demonstrates sub‑second query times on billions of log entries while supporting rich filtering, histogram visualization, and random paging.

Observability · SPL · high performance query

0 likes · 17 min read

Alibaba Cloud Observability

Aug 15, 2024 · Cloud Native

How LoongCollector Transforms iLogtail into a Next‑Gen Cloud‑Native Observability Agent

This article chronicles the two‑year evolution of iLogtail into LoongCollector, detailing its origins, technical milestones, community contributions, feature set—including high‑performance pipelines, programmable SPL, extensive K8s support, and unified config management—and outlines the roadmap that positions it as a leading cloud‑native observability solution.

Observability · cloud-native · data-collection

0 likes · 19 min read

Eric Tech Circle

Aug 15, 2024 · Backend Development

Lightweight Distributed Tracing in Spring Cloud Without Third‑Party Tools

This guide shows how to implement end‑to‑end trace ID propagation across Spring Cloud gateways, downstream services, and asynchronous threads using a custom GlobalTraceFilter, a patched LogbackMDCAdapter with Alibaba TransmittableThreadLocal, and minimal configuration, avoiding heavyweight tracing libraries.

Distributed Tracing · Logback · MDC

0 likes · 5 min read

Sohu Tech Products

Aug 14, 2024 · Operations

How to Combine SkyWalking and ELK for End-to-End Trace ID Logging

This article explains why ELK alone lacks Trace ID support, describes the architectures of SkyWalking and ELK, compares their capabilities, and provides step‑by‑step configurations—including a Logback layout and MDC approach—to embed Trace IDs into logs for full distributed tracing.

APM · Distributed Tracing · ELK

0 likes · 10 min read

Alibaba Cloud Native

Aug 12, 2024 · Cloud Native

How SPL’s High‑Performance Mode Supercharges Log Queries in the Cloud

Log data’s immutable, random, and multi‑source nature makes traditional search inefficient, so Alibaba Cloud’s SLS introduces the SPL pipeline language, combining Unix‑style piping with SQL‑like functions, and leverages computation push‑down, vectorized processing, and optimized I/O to deliver high‑performance log queries at scale.

Observability · SPL · cloud-native

0 likes · 18 min read

ITPUB

Aug 11, 2024 · Operations

Scaling Bilibili’s Metrics Platform with VictoriaMetrics and Flink Pre‑aggregation

This article details how Bilibili redesigned its monitoring system to overcome explosive metric growth by separating collection and storage, adopting VictoriaMetrics, implementing zone‑based scheduling, automating PromQL query replacement, and using Flink for efficient pre‑aggregation, resulting in dramatically lower latency and higher stability.

Flink · Observability · PromQL

0 likes · 31 min read

Wukong Talks Architecture

Aug 9, 2024 · Operations

Integrating SkyWalking with ELK for Distributed Trace ID Logging

This article explains how to combine SkyWalking and the ELK stack to embed Trace IDs into logs, enabling end‑to‑end request tracing, discusses the strengths and limitations of each platform, and provides configuration examples for Logback, MDC, and Kibana visualisation.

Distributed Tracing · ELK · Logging

0 likes · 12 min read

Bilibili Tech

Aug 9, 2024 · Operations

Design and Optimization of Monitoring 2.0 Architecture with VictoriaMetrics and Flink

The new Monitoring 2.0 architecture separates collection, compute and storage, adopts VictoriaMetrics for compact time‑series storage and a zone‑based scheduler, introduces push‑based ingestion, uses Flink for real‑time pre‑aggregation and automatic PromQL rewrite, delivering ten‑fold query speedups, sub‑300 ms p90 latency, and dramatically higher write and query throughput.

Flink · Observability · Prometheus

0 likes · 29 min read

ITPUB

Aug 8, 2024 · Operations

Why Solid Monitoring Must Come Before Observability Projects (And How to Build It)

Before launching costly observability initiatives, ensure your monitoring is comprehensive and efficient, covering business, application, component, resource, network, and endpoint metrics, and that you have the data collection, storage, alerting, and event‑distribution capabilities to turn raw signals into actionable insights.

Alerting · Observability · monitoring

0 likes · 9 min read

Alibaba Cloud Native

Aug 7, 2024 · Operations

How iLogtail Achieves Million‑Scale Observability with SRE Practices

This article details how Alibaba Cloud's iLogtail agent, serving tens of thousands of hosts and containers, overcomes unique stability challenges through a comprehensive SRE approach that spans design, development, testing, gray‑release, operations, and customer‑support, ultimately boosting reliability and reducing incident rates.

Observability · Reliability Engineering · SRE

0 likes · 32 min read

FunTester

Jul 30, 2024 · Operations

Mastering True Observability: Models, Practices, and AI‑Driven Automation

This article explains why true observability is essential for modern software, outlines its five core pillars, details a four‑stage maturity model with benefits and drawbacks, and provides practical steps—including data collection, team organization, and AI automation—to advance from basic monitoring to predictive, self‑healing systems.

AI · Automation · Logging

0 likes · 13 min read

DaTaobao Tech

Jul 29, 2024 · Operations

Testing Environment Reliability, Routing Isolation, Monitoring, and Efficient Deployment Practices

Alibaba Taotian’s testing platform now gives business owners self‑service access to reliable environments by binding accounts to isolated routes, monitoring lightweight health metrics with automated self‑healing, accelerating deployments via code caching and JVM tricks, and enabling rapid “time‑travel” scenario testing. Tighter observability and closer alignment with production are planned.

Observability · Testing Environment · deployment efficiency

0 likes · 11 min read

Architecture and Beyond

Jul 28, 2024 · Frontend Development

Comprehensive Guide to Front‑End Stability: Observability, Full‑Chain Monitoring, High‑Availability Architecture, Performance Management, Risk Governance, Process Mechanisms, and Engineering Practices

This extensive article presents a systematic approach to front‑end stability, covering observability systems, full‑chain monitoring, high‑availability design, performance management, risk governance, process mechanisms, and engineering practices to ensure reliable user experiences and business continuity.

Frontend · Observability · Performance

0 likes · 44 min read

MaGe Linux Operations

Jul 23, 2024 · Operations

Master Loki Logging: Deploy, Configure, and Troubleshoot on Kubernetes

This guide walks you through Loki, a lightweight log aggregation system, covering its architecture, advantages, deployment options (All‑In‑One, Kubernetes, and bare‑metal), Promtail configuration, Helm installation, and common troubleshooting steps for reliable log collection and querying in Grafana.

Kubernetes · Observability · loki

0 likes · 26 min read

ITPUB

Jul 22, 2024 · Operations

How We Upgraded Watcher to Second‑Level Monitoring for Real‑Time Order Alerts

This article details the end‑to‑end redesign of Qunar's Watcher monitoring platform from minute‑level to second‑level precision, covering architectural changes, storage engine migration, client‑side metric collection, server‑side scheduling, dashboard and alarm adaptations, and the resulting operational improvements.

Observability · devops · monitoring

0 likes · 20 min read

Efficient Ops

Jul 21, 2024 · Cloud Native

Choosing the Right Log Stack: ELK vs EFK vs PLG (Loki) Explained

This article compares popular log aggregation stacks—ELK, EFK, and the PLG combination of Promtail, Loki, and Grafana—detailing their components, architecture, and trade‑offs for cloud‑native environments such as Kubernetes.

EFK · ELK · Grafana

0 likes · 6 min read

Bilibili Tech

Jul 19, 2024 · Big Data

Bilibili's One-Stop Big Data Cluster Management Platform (BMR) - Architecture and Implementation

Bilibili’s one‑stop Big Data Cluster Management Platform (BMR) consolidates HDFS, Spark, Flink, ClickHouse, Kafka and other services into a unified system. It evolved through four stages (standardization, metadata‑driven construction, containerization, and observability), addressing node consistency, scaling, fault self‑healing, and resource optimization. The platform now delivers elastic scaling and automated start/stop, with further cost‑saving and stability enhancements planned.

Observability · big data platform · cluster management

0 likes · 12 min read

Spring Full-Stack Practical Cases

Jul 19, 2024 · Backend Development

Boost Spring Boot 3.2 Performance: CDS, Virtual Threads, GraalVM & Observability

This article explains how Spring Boot 3.2 can be accelerated with Class Data Sharing, virtual threads, GraalVM native images, ProblemDetail error handling, and new observability features, providing practical commands and code examples for faster startup, lower memory usage, and richer monitoring.

CDS · GraalVM · JVM

0 likes · 9 min read

MaGe Linux Operations

Jul 16, 2024 · Cloud Native

How Prometheus Sends Alerts: Rules, Templates, and Frequency Explained

This article explains how Prometheus generates and sends alerts, covering the definition of alert rules with PromQL, grouping, templating, configuring evaluation intervals, deploying a custom alert receiver in Kubernetes, and analyzing alert payloads and delivery frequency, while also detailing alert silencing and resolution behavior.

Alerting · Alertmanager · Kubernetes

0 likes · 26 min read

Alibaba Cloud Infrastructure

Jul 15, 2024 · Operations

Managing LLM Traffic in Alibaba Service Mesh (ASM): Routing, Observability, and Security

This article explains how to use Alibaba Service Mesh (ASM) to register large language model (LLM) providers, configure LLMProvider and LLMRoute resources, and implement traffic routing, observability, and security for LLM services through step‑by‑step Kubernetes manifests and curl tests.

AI · ASM · Kubernetes

0 likes · 13 min read

Spring Full-Stack Practical Cases

Jul 14, 2024 · Backend Development

Master Spring Boot Observability with @Timed, @Counted, and @MeterTag

Learn how to enable comprehensive observability in Spring Boot 3.2.5 by leveraging Micrometer’s @Timed, @Counted, and @MeterTag annotations, configuring Actuator endpoints, and customizing aspects to monitor method execution time, request counts, and parameters, complete with practical code examples and Prometheus integration.

Observability · Prometheus · Spring Boot

0 likes · 7 min read

MaGe Linux Operations

Jul 13, 2024 · Operations

Unlocking Observability: A Complete Guide to OpenTelemetry Architecture and APIs

This article explains what OpenTelemetry is, its core components, key terminology, benefits, usage steps, and detailed architecture—including APIs, SDK pipelines, and the collector—providing a comprehensive overview for developers and operators seeking vendor‑neutral observability solutions.

Observability · OpenTelemetry · Telemetry

0 likes · 13 min read

Huawei Cloud Developer Alliance

Jul 10, 2024 · Cloud Native

Why CNCF’s Acceptance of openGemini Boosts Cloud‑Native Time‑Series Databases

The Cloud Native Computing Foundation (CNCF) has officially welcomed Huawei Cloud’s open‑source high‑performance time‑series database project openGemini, highlighting its role in advancing cloud‑native database technology, supporting massive observability data storage and analysis, and fostering community growth and industry adoption.

CNCF · Observability · cloud-native

0 likes · 4 min read

Cloud Native Technology Community

Jul 9, 2024 · Cloud Native

Answering the Top 9 Questions About Monitoring in Kubernetes

This article discusses essential Kubernetes monitoring topics, including cost tracking, tool selection, observability frameworks, responsibility allocation, baseline establishment, namespace best practices, the importance of monitoring, backup solutions, and a comparison of Datadog versus Splunk for metrics.

Datadog · Kubernetes · Observability

0 likes · 6 min read

Yum! Tech Team

Jul 3, 2024 · Backend Development

Implementing Sentinel for Traffic Protection and Rate Limiting in a Large-Scale Restaurant Digital Platform

This article details how a large restaurant chain leveraged the open‑source Sentinel framework to implement comprehensive traffic protection, rate limiting, and circuit‑breaking across millions of daily orders, describing challenges, design choices, high‑availability rule distribution, monitoring, user‑experience considerations, and providing Java code examples for integration.

Java · Observability · Sentinel

0 likes · 11 min read

Efficient Ops

Jul 1, 2024 · Cloud Native

How to Monitor Business Metrics with Prometheus in Kubernetes

This article explains the concept of observability, details Prometheus metric definitions and types, and provides Go code examples for exposing, defining, generating, and scraping business‑level metrics in a Kubernetes‑based cloud‑native environment.

Kubernetes · Observability · Prometheus

0 likes · 11 min read

IT Services Circle

Jul 1, 2024 · Operations

Understanding Distributed Tracing with SkyWalking: Principles, Architecture, and Practical Implementation

This article explains the fundamentals of distributed tracing in microservice environments, introduces OpenTracing standards, details SkyWalking's architecture and sampling strategies, evaluates its performance against competitors, and shares practical company adaptations such as custom plugins, forced sampling, and trace ID logging.

Distributed Tracing · Java · Observability

0 likes · 15 min read

MaGe Linux Operations

Jul 1, 2024 · Operations

Mastering Jaeger: A Complete Guide to Distributed Tracing and Deployment

Jaeger is an open‑source, CNCF‑graduated distributed tracing system built by Uber, and this guide explains its core concepts, architecture, sampling strategies, and various deployment options—including all‑in‑one, Kubernetes, and OpenTelemetry—plus how it compares with other tracing tools.

Distributed Tracing · Jaeger · Kubernetes

0 likes · 13 min read

Meituan Technology Team

Jun 27, 2024 · Mobile Development

Meituan Mini‑Program Testability Practices and Implementation

Meituan’s Tech Salon Session 77 describes how the company built a generic JavaScript‑hooking SDK, packaged as an NPM module, that makes mini‑programs on platforms such as WeChat, Alipay and Kuaishou observable and controllable across the UI, storage, network and system layers. The SDK enables automated cache management, request mocking, state inspection and visual diff testing. The ticket team used it to achieve over 30 % test scenario coverage and 100 % page coverage and to discover hundreds of defects. The session also outlines plans to stabilize and expand the framework.

Automation · JavaScript · Mini Program

0 likes · 19 min read

Alibaba Cloud Observability

Jun 27, 2024 · Operations

iLogtail SPL vs Logstash: Faster, Lighter, More Flexible Log Processing

iLogtail 2.0 introduces an SPL processing mode that outperforms Logstash’s filter plugins in functionality, resource consumption, and throughput across multiple test scenarios, offering lower CPU and memory usage, faster start‑up, and superior handling of complex JSON and high‑volume log streams.

Log Processing · Observability · SPL

0 likes · 17 min read

dbaplus Community

Jun 24, 2024 · Operations

How Qunar Achieved Sub‑Second Monitoring to Slash Fault Detection Time to Under 1 Minute

Qunar’s Watcher monitoring platform was upgraded from minute‑level to second‑level precision, redesigning storage, data collection, and alerting pipelines, adopting VictoriaMetrics, enhancing client SDKs, and adding fine‑grained alarm rules, which reduced fault detection from four minutes to under one minute while improving reliability and scalability.

Observability · devops · monitoring

0 likes · 20 min read

Sanyou's Java Diary

Jun 24, 2024 · Operations

How Visualized Full‑Chain Log Tracing Transforms Complex Business Systems

This article explains a new visualized full‑chain log tracing solution that organizes business logs by logical flow, dynamically links them during execution, and provides a visual, searchable view of the entire business process, dramatically improving issue localization in large‑scale distributed systems.

Observability · Operations · backend

0 likes · 26 min read

Sohu Tech Products

Jun 20, 2024 · Cloud Native

How to Expose and Collect Metrics with OpenTelemetry and Prometheus in Cloud‑Native Java Apps

This article explains the background of metrics in cloud‑native systems, shows how to expose custom Prometheus metrics using OpenTelemetry's MeterProvider, compares different exporters, and provides a complete Pulsar client example with code snippets and configuration for end‑to‑end observability.

Java · Observability · OpenTelemetry

0 likes · 10 min read

Cloud Native Technology Community

Jun 19, 2024 · Cloud Native

Lessons Learned from Migrating Applications to Kubernetes

This article recounts a two‑year journey of moving from Ansible‑based EC2 deployments to a Kubernetes‑centric platform, detailing motivations, migration strategies, operational challenges, tooling choices, cost considerations, and practical lessons for teams contemplating a similar cloud‑native transformation.

CI/CD · Infrastructure Migration · Kubernetes

0 likes · 19 min read

360 Smart Cloud

Jun 18, 2024 · Cloud Native

Understanding eBPF: Principles, Applications, Development Process, and Sample Programs

This article introduces eBPF, explains its zero‑intrusion observability advantages, describes its architecture and runtime workflow, outlines development prerequisites and compilation steps, and provides concrete Go‑based examples for tracing bash commands and measuring TCP connection latency.

Observability · eBPF · kernel

0 likes · 13 min read

Bilibili Tech

Jun 18, 2024 · Frontend Development

Design and Implementation of a Front-End Observability System for Business Monitoring

The article describes a unified front‑end observability platform that standardizes data‑point collection via a common SDK, automatically generates health and business dashboards, integrates real‑time monitoring and heatmaps, and has been adopted on 140 pages, delivering faster first‑screen loads, lower error and bounce rates, and higher conversion.

Frontend · Observability · System Design

0 likes · 22 min read

DevOps

Jun 16, 2024 · Operations

Performance Engineering Challenges and Practices for Software‑Defined Vehicles

The article examines how the shift to Software‑Defined Vehicles introduces complex performance engineering challenges across software, hardware, and organizational domains, and proposes an engineering‑driven, continuous‑observability approach—including modeling, monitoring, iterative optimization, and specialized team structures—to sustainably improve automotive software performance.

Observability · Performance Optimization · SDV

0 likes · 17 min read

21CTO

Jun 7, 2024 · Artificial Intelligence

Why AI Gateways Are the Next Evolution of API Gateways

AI gateways have emerged as essential infrastructure for modern AI applications, offering specialized security, load balancing, cost management, and observability that go beyond traditional API gateways, and understanding their differences and deployment considerations is crucial for developers and ops teams.

AI Infrastructure · AI gateway · API Gateway

0 likes · 10 min read

Linux Code Review Hub

Jun 5, 2024 · Operations

Observing Virtio‑Net NIC Queues with eBPF: A Practical Guide

This article explains how to extend eBPF to make virtio‑net NIC queue metrics observable, walks through the front‑end send/receive flow, defines key queue indices, integrates the probes into the rtrace tool, and demonstrates fault detection with real‑time data.

Linux kernel · NIC queue · Observability

0 likes · 17 min read

Efficient Ops

Jun 4, 2024 · Operations

How Huya Unified Its Monitoring Platform with OpenTelemetry for Zero‑Cost Integration

This article details Huya's transition from fragmented, non‑standard monitoring solutions to a unified OpenTelemetry‑based platform, covering project background, pain points, design decisions, SDK architecture, data pipeline, storage, alerting, root‑cause analysis, and future plans, highlighting the benefits of standardization and zero‑cost service integration.

Huya · Observability · OpenTelemetry

0 likes · 13 min read

Alibaba Cloud Observability

May 29, 2024 · Cloud Native

Why iLogtail Needed a Complete Architecture Overhaul and How It Was Done

This article explains the evolution of iLogtail from a single‑file collector to a multi‑language, plugin‑based observability pipeline, outlines the motivations for refactoring, describes the new unified data model, plugin abstractions, pipeline design, configuration management, hot‑reload mechanisms, and the separation of enterprise and open‑source code, providing a comprehensive view of the architectural upgrade.

C# · Golang · Observability

0 likes · 43 min read

Alibaba Cloud Observability

May 29, 2024 · Operations

How Alibaba Cloud Prometheus Enables Ultra‑Fast Host Monitoring for Supercomputing

This article explains the unique business characteristics of large‑scale supercomputing workloads, outlines the observability challenges they pose, and details how Alibaba Cloud Prometheus host monitoring provides automated service discovery, rapid probe deployment, fine‑grained metrics, and ready‑to‑use Grafana dashboards to achieve second‑level monitoring at massive scale.

Observability · host monitoring · supercomputing

0 likes · 15 min read

Alibaba Cloud Observability

May 29, 2024 · Cloud Native

How to Achieve End-to-End Cloud Native Tracing and Solve the 3 Major Challenges

This article explains why distributed tracing is essential for modern cloud‑native systems, outlines the three toughest problems—instrumentation, data collection, and context propagation—and shows how Alibaba Cloud ARMS and OpenTelemetry provide a comprehensive, multi‑language solution for end‑to‑end traceability.

ARMS · Alibaba Cloud · Distributed Tracing

0 likes · 14 min read
