
How F6 Engineered Cloud‑Native Observability: From ELK to eBPF and OpenTelemetry

This article examines how F6 tackled growing stability demands by evolving from traditional ELK‑based logging to a cloud‑native observability stack that combines Grafana, Prometheus, eBPF, OpenTelemetry, and ARMS, illustrating practical steps, challenges, and lessons learned for modern microservice environments.

Alibaba Cloud Native

Aug 2, 2022

Introduction

R&D engineers often face two puzzling questions: "Why doesn't it run?" and "Why does it run?" Observability promises a systematic way to answer both. The concept reached a wider developer audience in 2017, when Twitter engineer Cindy Sridharan published "Monitoring and Observability," which distinguished monitoring (is the service up?) from observability (why is it down?).

Growth of Observability

Search trends show a rapid rise in interest in observability, especially in China, driven by the spread of Site Reliability Engineering (SRE) and the hiring that followed. As a business scales, its stability challenges grow with it, which makes observability a critical system attribute.

Challenges and Organizational Influence

F6 Automotive Technology, a leading automotive‑aftermarket platform, saw its merchant count surge and added consumer‑facing (to‑C) services such as VIN decoding. This growth raised the bar for stability. Conway's Law describes how organizational structure tends to be mirrored in system design, here in microservice boundaries, and the resulting call graphs became complex enough to hinder a holistic understanding of the system.

Traditional Monitoring & Log Collection

ELK Stack + ElastAlert – Elasticsearch, Logstash, and Kibana were used to collect and search logs, with ElastAlert generating alerts from them.

While ELK provided basic log search and alerting, developers still faced difficulty tracing distributed interactions across services.

Architecture Upgrade & Observability Introduction

Grafana + Zorka – Grafana replaced Kibana for dashboards; Zorka collected Java metrics and fed them to Zabbix, and the Zabbix data was visualized in Grafana.

These tools improved metric visualization and alerting but still required manual configuration for each service.

Cloud‑Native Transformation

Kubernetes – Adopted for container orchestration, enabling liveness/readiness probes for self‑healing.
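
As an illustration of what the probes call into, here is a minimal sketch of a service exposing separate liveness and readiness endpoints using only the Python standard library. The paths `/healthz` and `/readyz`, the `ready` flag, and the helper `start_probe_server` are assumptions made for this example; the article does not say which endpoints F6's services expose.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical flag: set once startup work (connections, caches) is done.
ready = threading.Event()


class ProbeHandler(BaseHTTPRequestHandler):
    """Answers the two probe paths; both path names are assumptions."""

    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: the process is up and able to answer HTTP.
            self._reply(200, "alive")
        elif self.path == "/readyz":
            # Readiness: accept traffic only after startup has finished.
            if ready.is_set():
                self._reply(200, "ready")
            else:
                self._reply(503, "not ready")
        else:
            self._reply(404, "not found")

    def _reply(self, status, text):
        body = text.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep probe traffic out of the application log


def start_probe_server(host="127.0.0.1", port=0):
    """Serve probes from a daemon thread; port 0 picks a free port.

    Inside a pod the host would need to be "0.0.0.0" so the kubelet can reach it.
    """
    server = ThreadingHTTPServer((host, port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A failing liveness probe leads Kubernetes to restart the container, while a failing readiness probe only removes the pod from service endpoints, which is why the two checks are kept separate here.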

Prometheus & ARMS – Prometheus became the primary metrics collector; Alibaba Cloud ARMS provided zero‑code tracing for middleware such as Kafka, MySQL, and Dubbo.

JMX Exporter – Used to expose JVM metrics to Prometheus without relying on ARMS for all Java services.

The monitoring model shifted from a push‑based approach (fixed hosts) to a pull‑based model aligned with Prometheus, accommodating dynamic pod lifecycles.
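
In the pull model, each pod only has to publish its current values at an HTTP endpoint (conventionally `/metrics`) and Prometheus scrapes whatever pods exist at that moment. The sketch below renders samples in the Prometheus text exposition format by hand, to show what a scrape returns. The function name `render_metrics` and the input shape are assumptions for this example; real services would normally use an official client library instead.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline in a label value."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_metrics(samples):
    """Render samples in the Prometheus text exposition format.

    `samples` maps metric name -> (type, help text, [(labels, value), ...]);
    this input shape is an assumption made for the sketch.
    """
    lines = []
    for name, (metric_type, help_text, series) in samples.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        for labels, value in series:
            if labels:
                pairs = ",".join(
                    f'{key}="{escape_label_value(val)}"'
                    for key, val in sorted(labels.items())
                )
                lines.append(f"{name}{{{pairs}}} {value}")
            else:
                lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

Because the pod only answers scrapes, nothing in the service needs to know where the monitoring backend lives, which is what makes this model fit short-lived pods.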

Alerting Enhancements

By integrating Apollo configuration center with a custom Go service, alerts from Grafana/Prometheus are enriched with owner information (name, phone) and delivered via DingTalk, dramatically increasing alert read‑through rates.
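
The enrichment step can be sketched as follows. This is written in Python rather than the team's Go, and it is not their implementation: the `OWNERS` table stands in for data the article says lives in Apollo, the `service` label is an assumed routing key, and the alert shape follows the Alertmanager webhook format (`labels`, `annotations`, `status`). The output follows the text-message shape accepted by DingTalk custom robots (`msgtype`, `text.content`, `at.atMobiles`).

```python
# Hypothetical ownership table; the article keeps this data in Apollo.
OWNERS = {
    "order-service": {"name": "Li Lei", "phone": "13800000000"},
}


def enrich_alert(alert, owners):
    """Attach the owning engineer to one Alertmanager-style alert dict."""
    service = alert.get("labels", {}).get("service", "")
    enriched = dict(alert)  # shallow copy so the caller's alert is untouched
    enriched["owner"] = owners.get(service)  # None when nobody is registered
    return enriched


def to_dingtalk_payload(alert):
    """Build a text message body for a DingTalk custom-robot webhook."""
    labels = alert.get("labels", {})
    summary = alert.get("annotations", {}).get("summary", "")
    owner = alert.get("owner")
    lines = [
        f"[{alert.get('status', 'firing')}] {labels.get('alertname', 'unknown')}",
        f"service: {labels.get('service', 'unknown')}",
        f"summary: {summary}",
    ]
    mobiles = []
    if owner:
        lines.append(f"owner: {owner['name']} ({owner['phone']})")
        mobiles.append(owner["phone"])  # @-mention the owner in the group
    return {
        "msgtype": "text",
        "text": {"content": "\n".join(lines)},
        "at": {"atMobiles": mobiles, "isAtAll": False},
    }
```

Naming and @-mentioning one specific owner, rather than broadcasting to a whole group, is the part that plausibly drives the higher read-through rate described above.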

Advanced Observability Practices

Trace ID Propagation – ARMS now injects trace IDs into HTTP headers, enabling log‑trace correlation in Kibana.
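
On the application side, log-trace correlation comes down to reading the trace ID from the incoming request and stamping it on every log line, so that a search for one ID in Kibana returns the whole request path. The sketch below assumes the W3C `traceparent` header and lowercased header keys; the article does not say which header ARMS uses, so treat the header name, the logger name, and the log format as placeholders.

```python
import contextvars
import logging
import re

# Holds the trace ID of the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")

# W3C traceparent: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^[0-9a-f]{2}-([0-9a-f]{32})-[0-9a-f]{16}-[0-9a-f]{2}$"
)


def extract_trace_id(headers):
    """Return the 32-hex trace ID from `traceparent`, or None if absent/invalid."""
    value = headers.get("traceparent", "")
    match = TRACEPARENT_RE.match(value.strip().lower())
    return match.group(1) if match else None


class TraceIdFilter(logging.Filter):
    """Copies the current trace ID onto every log record."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


def build_logger(stream):
    """Create a logger whose lines carry a trace_id field."""
    logger = logging.getLogger("traced-app")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


def handle_request(headers, logger):
    """Bind the trace ID for the duration of one request, then restore it."""
    token = trace_id_var.set(extract_trace_id(headers) or "-")
    try:
        logger.info("order created")
    finally:
        trace_id_var.reset(token)
```

Using a context variable rather than passing the ID through every function call keeps the business code unchanged, which matches the low-intrusion goal of agent-based tracing.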

Root‑Cause Analysis Service – A lightweight service clusters logs using text‑similarity algorithms to suggest probable root causes.
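
The article does not name the similarity algorithm, so the following is only one plausible shape for such a service: mask volatile tokens (numbers, hex addresses), then greedily group lines whose normalized text is similar enough according to `difflib.SequenceMatcher` from the standard library. The threshold of 0.8 and the function names are assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize(line):
    """Mask volatile tokens so repeated errors compare as near-identical."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    line = re.sub(r"\d+", "<NUM>", line)
    return line.strip().lower()


def cluster_logs(lines, threshold=0.8):
    """Greedy single-pass clustering by text similarity.

    Each line joins the first cluster whose template is at least
    `threshold` similar; otherwise it starts a new cluster.
    Returns clusters sorted by size, largest first.
    """
    clusters = []  # each item: {"template": str, "members": [str, ...]}
    for line in lines:
        norm = normalize(line)
        for cluster in clusters:
            ratio = SequenceMatcher(None, cluster["template"], norm).ratio()
            if ratio >= threshold:
                cluster["members"].append(line)
                break
        else:
            clusters.append({"template": norm, "members": [line]})
    return sorted(clusters, key=lambda c: len(c["members"]), reverse=True)
```

The largest cluster in an incident window is a candidate root cause to look at first, not a verdict; a greedy pass like this is cheap but order-dependent, so a production service would likely need something more robust.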

Chaos Engineering – Used as an observability‑driven practice to proactively expose system weaknesses, following the recommendations of CNCF's special interest group.

OpenTelemetry Adoption

The team moved toward a unified observability view by adopting OpenTelemetry, aiming to correlate logs, metrics, and traces across services and reduce data silos.

eBPF‑Based One‑Stop Observability

eBPF components were introduced to capture low‑level system and network metrics, addressing latency, traffic, error‑rate, and saturation issues that surface beyond the application layer.
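
To make the four signals concrete, the sketch below derives them from per-request records of the kind a kernel-level probe could report. It contains no eBPF code: the `Flow` record, the `in_flight` and `max_in_flight` inputs, and the nearest-rank p95 are all assumptions chosen for illustration, not the event layout of any real eBPF tool.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    """One observed request; a hypothetical record shape."""

    latency_ms: float
    is_error: bool


def golden_signals(flows, window_seconds, in_flight, max_in_flight):
    """Summarize one time window into latency, traffic, errors, saturation."""
    latencies = sorted(flow.latency_ms for flow in flows)
    count = len(latencies)
    if count:
        # Nearest-rank p95 using integer math: rank = ceil(0.95 * count).
        rank = (95 * count + 99) // 100
        p95 = latencies[rank - 1]
        error_rate = sum(1 for flow in flows if flow.is_error) / count
    else:
        p95 = 0.0
        error_rate = 0.0
    return {
        "latency_p95_ms": p95,
        "traffic_rps": count / window_seconds,
        "error_rate": error_rate,
        # Saturation here is concurrency relative to an assumed capacity.
        "saturation": in_flight / max_in_flight,
    }
```

The appeal of collecting these at the kernel level is that the same four numbers are available for every service and protocol without touching application code.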

Cost Observability

Using the open‑source Kubecost component, the team monitors resource usage and generates cost reports, which help developers right‑size CPU and memory allocations.
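
The arithmetic behind such a report can be sketched in a few lines. This is a simplified model and not Kubecost's actual allocation logic: cost is charged on requested resources at assumed flat hourly prices, and efficiency is usage divided by request. All field names and prices are hypothetical.

```python
def pod_cost_report(pods, cpu_price_per_core_hour, mem_price_per_gib_hour, hours):
    """Estimate cost and request efficiency per pod, most expensive first.

    Each pod dict is assumed to carry: name, cpu_request (cores),
    cpu_used (cores), mem_request_gib, mem_used_gib.
    """
    report = []
    for pod in pods:
        hourly = (
            pod["cpu_request"] * cpu_price_per_core_hour
            + pod["mem_request_gib"] * mem_price_per_gib_hour
        )
        report.append(
            {
                "name": pod["name"],
                "cost": hourly * hours,
                # Low efficiency means the request is larger than needed.
                "cpu_efficiency": pod["cpu_used"] / pod["cpu_request"],
                "mem_efficiency": pod["mem_used_gib"] / pod["mem_request_gib"],
            }
        )
    return sorted(report, key=lambda row: row["cost"], reverse=True)
```

A pod that ranks high on cost and low on both efficiency figures is the obvious first candidate for smaller requests.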

Future Outlook

The roadmap includes deeper integration of eBPF, expanded chaos‑engineering experiments, and broader OpenTelemetry coverage to achieve a fully observable, cost‑effective cloud‑native platform.
