How Exemplar Bridges the Last‑Mile Gap in Observability
Facing the “last mile” challenge of correlating metrics, logs, and traces, this article surveys common heterogeneous storage architectures, critiques existing Exemplar implementations, and presents HuoLala’s end‑to‑end solution, which treats Exemplar as an independent observable dimension and covers its data model, SDK integration, collector, and interactive visualization.
Problem Overview
In modern observability stacks, metrics, logs, and distributed traces are stored in heterogeneous systems—time‑series databases (e.g., Prometheus, InfluxDB) for metrics, log stores (e.g., Elasticsearch, Loki) for logs, and tracing back‑ends (e.g., Jaeger, Tempo, SkyWalking) for traces. While this separation optimizes each workload, it creates a hard barrier for cross‑type correlation, forcing engineers to manually stitch context across subsystems when diagnosing incidents.
Existing Correlation Capabilities at HuoLala
HuoLala’s Monitor system currently provides three basic association methods:
Metric‑to‑trace linking via application name + metric type + tags.
Metric‑to‑log linking using ClickHouse’s native metric support for middleware such as Kong.
Business‑order‑to‑trace directed queries based on pre‑defined reserved fields.
Evaluation shows that these paths still leave significant gaps, especially for metric‑trace and metric‑log fusion.
Limitations of Mainstream Exemplar Solutions
We analyzed OpenTelemetry, Grafana, and Dynatrace OneAgent implementations and identified three fundamental limitations:
Positioning limitation: Exemplar is treated as an attribute of a metric rather than an independent observable dimension, restricting analysis to metric‑driven queries.
Architecture limitation: Tight coupling with time‑series storage prevents independent scaling and raises query‑experience fragmentation.
Value limitation: Exemplar data cannot be queried, aggregated, or filtered like logs, and its business context (e.g., order ID) remains hidden.
Our Vision: Exemplar as an Independent Dimension
To overcome these constraints we propose the "Exemplar‑as‑a‑Dimension" concept, which gives Exemplar a dual identity:
It remains the bridge between metrics, traces, and logs.
It is also a structured log stream that can be queried, aggregated, and visualized independently.
Key design principles include:
Dual identity: Exemplar links metric → trace and also supports metric → log mapping.
Dimensional reduction: By storing Exemplar in a log‑oriented engine, teams can obtain business context without deploying a full log system.
Scenario expansion: Exemplar can be reverse‑aggregated to generate new metrics (e.g., error‑rate per business ID).
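The “scenario expansion” principle can be sketched in a few lines: once Exemplar rows carry business tags, a brand‑new metric such as error rate per business dimension falls out of a simple group‑by. The `ExemplarRecord` shape and its field names below are illustrative assumptions, not HuoLala’s actual storage schema.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ReverseAggregation {
    // Hypothetical stored shape: each Exemplar row keeps its business tags
    // (here business_type) plus an error flag derived from status labels.
    public record ExemplarRecord(String businessType, boolean isError) {}

    // Reverse-aggregate stored Exemplar rows into a new derived metric:
    // error rate per business dimension.
    public static Map<String, Double> errorRateByBusiness(List<ExemplarRecord> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                ExemplarRecord::businessType,
                Collectors.averagingDouble(r -> r.isError() ? 1.0 : 0.0)));
    }
}
```

In a real deployment this aggregation would run as a query against the Exemplar store rather than in application memory; the stream pipeline only illustrates the shape of the computation.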
Technical Architecture
We selected VictoriaLogs (VLog) as the dedicated Exemplar store because it offers schemaless ingestion, tight integration with the VictoriaMetrics ecosystem, and low storage cost (≈1/10 of Elasticsearch for comparable volumes).
Schemaless indexing: All fields are auto‑indexed, eliminating the need for predefined schemas.
VictoriaMetrics compatibility: Shared operational tooling and a LogsQL query syntax aligned with PromQL.
Performance: Benchmarks show comparable throughput with a fraction of the resource consumption.
Data is transmitted using a text‑based protocol compatible with OpenMetrics. The core Exemplar line includes a label set (e.g., trace_id, segment_id) and a value with an optional nanosecond timestamp.
# TYPE foo_request_duration_seconds histogram
foo_request_duration_seconds_bucket{app="foo-service",env="prod",method="GET",status="200",le="0.005"} 0
# Exemplar {trace_id="abc123def4567890",segment_id="xyz789",user_id="user-123",endpoint="/api/v1/query",client_ip="192.168.1.100"} 0.008456 1700000000123456789

The SDK pushes Exemplar data via a push model (instead of Prometheus pull) to avoid aggregation loss. It uses a sliding‑window cache (default 10 MB or 10 s) and applies back‑pressure protection and gzip compression.
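The push‑side behavior just described (sliding‑window cache, size/age flush thresholds, gzip compression) can be sketched as follows. The class name, thresholds, and the Consumer<byte[]> transport are assumptions standing in for the real SDK internals, not HuoLala’s published API:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import java.util.zip.GZIPOutputStream;

public class ExemplarBuffer {
    private final StringBuilder window = new StringBuilder();
    private final int maxBytes;          // size cap, e.g. 10 MB in the real SDK
    private final long maxAgeMillis;     // age cap, e.g. 10 s in the real SDK
    private final Consumer<byte[]> transport; // stands in for the HTTP push
    private long windowStart = System.currentTimeMillis();

    public ExemplarBuffer(int maxBytes, long maxAgeMillis, Consumer<byte[]> transport) {
        this.maxBytes = maxBytes;
        this.maxAgeMillis = maxAgeMillis;
        this.transport = transport;
    }

    /** Buffer one OpenMetrics-style Exemplar line; flush when either cap is hit. */
    public synchronized void append(String exemplarLine) {
        window.append(exemplarLine).append('\n');
        if (window.length() >= maxBytes
                || System.currentTimeMillis() - windowStart >= maxAgeMillis) {
            flush();
        }
    }

    /** Gzip the current window, hand it to the transport, then reset the window. */
    public synchronized void flush() {
        if (window.length() == 0) return;
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(window.toString().getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            // back-pressure protection: drop the window rather than block callers
        }
        transport.accept(bos.toByteArray());
        window.setLength(0);
        windowStart = System.currentTimeMillis();
    }
}
```

A timer-driven flush (so idle windows still drain within the age cap) is omitted here for brevity.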
// Java example using Micrometer extensions
Timer.Sample sample = Timer.start();

// Counter with an Exemplar tag: userId travels with the individual sample,
// not with the metric series, so it does not inflate series cardinality.
LalaMetricRegistry.counterBuilder("exemplar_counter_test")
    .tag("city_id", "13333")
    .tags("business_type", "driver")
    .withExemplarTag("userId", "uid123445")
    .increment();

// Timer with SLO buckets; stopping the sample records the observed duration.
sample.stop(LalaMetricRegistry.timerBuilder("exemplar_timer_test")
    .tag("city_id", "1333")
    .serviceLevelObjectives(Duration.ofMillis(3), Duration.ofMillis(10), Duration.ofMillis(100))
    .register());

The collector, named Exemplar‑Collector, receives SDK pushes, performs relabeling (dropping job and instance), writes data to VLog via the internal insert API, and records metadata (app‑id ↔ metric) in MySQL for query optimization.
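A minimal sketch of the collector’s relabeling and serialization step, assuming VictoriaLogs’ JSON‑line ingestion convention (`_time`/`_msg` fields); the method names and record layout are illustrative, not the actual Exemplar‑Collector code:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class ExemplarRelabel {
    // Labels that only identify the scrape target and carry no business value.
    private static final Set<String> DROP = Set.of("job", "instance");

    /** Drop target-identity labels before the Exemplar is persisted. */
    public static Map<String, String> relabel(Map<String, String> labels) {
        Map<String, String> out = new LinkedHashMap<>(labels);
        out.keySet().removeAll(DROP);
        return out;
    }

    /** Serialize one Exemplar as a JSON line for VLog's jsonline insert API. */
    public static String toJsonLine(String metric, double value, long tsNanos,
                                    Map<String, String> labels) {
        StringBuilder sb = new StringBuilder("{");
        sb.append("\"_time\":\"").append(tsNanos).append("\",");
        sb.append("\"_msg\":\"").append(metric).append("\",");
        sb.append("\"value\":\"").append(value).append("\"");
        for (var e : relabel(labels).entrySet()) {
            sb.append(",\"").append(e.getKey()).append("\":\"")
              .append(e.getValue()).append("\"");
        }
        return sb.append("}").toString();
    }
}
```

Production code would use a JSON library with proper escaping; the hand-rolled builder here only shows which fields survive relabeling.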
Visualization and Interaction
Integrated into the in‑house Lala‑Monitor platform, Exemplar is visualized through four core capabilities:
Smart association: Automatic linking of chart points to Exemplar logs; manual overrides via app‑id and LogsQL are also supported.
Drill‑down: One‑click transition from metric → Exemplar log → trace, preserving label context and time window.
Log analysis: Full‑text search, field filtering, regex, and Top‑N statistics similar to Elasticsearch, but optimized for the structured Exemplar schema.
Multi‑dimensional statistics: Automatic generation of time‑series charts, leaderboards, pie charts, and bar charts based on high‑cardinality business dimensions (e.g., user_id, driver_id).
Images illustrate the smart association UI, drill‑down flow, and multi‑dimensional dashboards.
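As an illustration of the drill‑down mechanics, a chart click could be translated into a VLog query URL roughly like this; /select/logsql/query is VictoriaLogs’ standard HTTP query endpoint, while the base URL, filter shape, and label names are assumptions:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class DrillDownQuery {
    /**
     * Build a VLog query URL from a clicked chart point: metric name,
     * preserved label context, and the chart's time window.
     */
    public static String buildQueryUrl(String baseUrl, String metric,
                                       Map<String, String> labels,
                                       String start, String end) {
        // LogsQL-style filter: match the metric in _msg plus each label field.
        StringBuilder q = new StringBuilder("_msg:").append(metric);
        labels.forEach((k, v) ->
                q.append(' ').append(k).append(":\"").append(v).append('"'));
        return baseUrl + "/select/logsql/query?query="
                + URLEncoder.encode(q.toString(), StandardCharsets.UTF_8)
                + "&start=" + start + "&end=" + end;
    }
}
```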
Future Outlook
In 2026 we plan to deepen Exemplar’s role in data fusion:
Precise metric‑to‑trace mapping by standardizing TraceID/SpanID reporting across Java frameworks.
Automated fault‑diagnosis pipelines that trigger on metric alerts, retrieve relevant Exemplar samples, batch‑query traces, and converge on root‑cause dimensions (e.g., error_code=DB_TIMEOUT).
Collapsing multi‑metric alerts by intersecting Exemplar trace_id sets to reduce alert storms.
Extending Exemplar support to lower‑level components such as SOA middleware, MySQL/Redis clients, and other infrastructure services.
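The alert‑collapsing idea above reduces to a set intersection over the trace_id values attached to each alert’s Exemplar samples: a large overlap suggests one underlying fault, so the alerts can be merged. The overlap threshold and types here are assumptions, not the planned implementation:

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class AlertCollapser {
    /** trace_ids common to every alert's Exemplar set. */
    public static Set<String> sharedTraces(List<Set<String>> alertTraceIds) {
        Iterator<Set<String>> it = alertTraceIds.iterator();
        if (!it.hasNext()) return Set.of();
        Set<String> common = new HashSet<>(it.next());
        while (it.hasNext()) common.retainAll(it.next());
        return common;
    }

    /** Collapse when the intersection covers enough of the smallest alert's traces. */
    public static boolean shouldCollapse(List<Set<String>> alertTraceIds,
                                         double minOverlap) {
        int smallest = alertTraceIds.stream().mapToInt(Set::size).min().orElse(0);
        if (smallest == 0) return false;
        return (double) sharedTraces(alertTraceIds).size() / smallest >= minOverlap;
    }
}
```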
These efforts aim to turn raw observability data into actionable, end‑to‑end insight, shortening MTTR and improving system reliability.