
From Chaos to Clarity: Building Full‑Stack Observability for Poizon’s Algorithm Ecosystem

This article details how Poizon’s algorithm platform evolved from fragmented tracing to a unified, scenario‑driven observability system: it standardizes traces, metrics, logs, and events, introduces a knowledge graph of algorithm scenes, and applies compression, asynchronous reporting, and advanced anomaly detection to improve stability and debugging efficiency.

DeWu Technology

Background and Motivation

At Poizon (得物), the algorithm ecosystem spans search, community recommendation, image recognition, and advertising, with Java gateways dispatching requests to high‑performance C++ engines such as DSearch, DGraph, and DFeature. Rapid growth created observability gaps, prompting the construction of a business‑scenario‑centric, full‑link change‑event center to improve transparency, stability, and fault‑recovery speed.

Four Pillars of Observability and Vision

The team defined a four‑pillar model—Trace, Metric, Log, and Event—linked by the slogan “Trace as the path, Metric as the pulse, Log as the evidence, Event as the source.” The goal is to break data silos and enable intelligent algorithm governance.

Trace Standardization

Existing C++ services lacked a Trace SDK, leaving them isolated from the micro‑service observability mesh. Poizon therefore built a custom C++ Trace2.0 SDK (based on OpenTelemetry) under strict performance and compatibility constraints:

CPU and memory overhead of Span creation, context propagation, and attribute writing had to be minimal.

The stock OpenTelemetry C++ SDK’s generic design introduced unacceptable latency under high QPS, motivating a leaner custom implementation.

Compatibility with brpc+bthread scheduling required a lock‑free, thread‑friendly implementation.

Dependency conflicts (e.g., Protobuf ABI mismatches) had to be avoided.

Key components include:

APM Cpp SDK for Span creation, collection, and Kafka reporting.

brpc‑tracer adapter supporting HTTP and baidu‑std protocols.

Engine integration via the brpc‑tracer adapter.

To reduce bandwidth, the SDK compresses payloads using length filtering, field compression, batch aggregation, static‑info extraction, and Snappy compression (≈30% reduction).
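The engine SDK itself is C++, but the batching‑plus‑compression idea is easy to sketch in Java with snappy‑java; the class name, size threshold, and overall structure below are illustrative, not the SDK’s actual API.

import org.xerial.snappy.Snappy;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;

// Illustrative sketch: aggregate a batch of pre-serialized spans and
// Snappy-compress the payload before handing it to the Kafka producer.
public class SpanBatchCompressor {

    public byte[] compressBatch(List<byte[]> serializedSpans) throws IOException {
        ByteArrayOutputStream batch = new ByteArrayOutputStream();
        for (byte[] span : serializedSpans) {
            // Length filtering: skip oversized span payloads (threshold is illustrative).
            if (span.length > 64 * 1024) {
                continue;
            }
            batch.write(span, 0, span.length);
        }
        // Snappy trades a little CPU for a sizeable payload reduction (~30% per the article).
        return Snappy.compress(batch.toByteArray());
    }
}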

Asynchronous reporting uses an MPSC lock‑free ring queue: business threads enqueue spans without blocking, a background thread flushes them, and data is dropped when the queue is full so business threads are never stalled.
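The production queue is a lock‑free MPSC ring buffer inside the C++ SDK; the Java sketch below only illustrates the reporting policy it implements, namely “never block the business thread, drop when full, flush from a single background thread.”

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of the async reporting policy: producers enqueue without blocking,
// one background thread drains and flushes, and spans are dropped when the
// queue is full so callers are never stalled.
public class AsyncSpanReporter {

    private final BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(8192);

    public AsyncSpanReporter() {
        Thread flusher = new Thread(this::flushLoop, "span-flusher");
        flusher.setDaemon(true);
        flusher.start();
    }

    // Called on the business thread; offer() returns false instead of blocking
    // when the queue is full, so the span is simply dropped.
    public void report(byte[] serializedSpan) {
        queue.offer(serializedSpan);
    }

    private void flushLoop() {
        List<byte[]> batch = new ArrayList<>();
        while (!Thread.currentThread().isInterrupted()) {
            try {
                batch.add(queue.take());   // wait for at least one span
                queue.drainTo(batch, 511); // then drain up to a full batch
                // compress + send to Kafka would happen here
                batch.clear();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}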

Log Standardization

Java logs already followed a strict schema, while C++ logs were inconsistent. The team introduced a unified log path and format:

/logs/{app_name}/{app_name}-error.log

Each log line follows a pipe‑delimited schema:

timestamp|pid:tid|level|[app,trace_id,span_id,scene,errCode,]|interface|line|[zone,cluster,]|exception|message
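A minimal parsing sketch for this schema (field names mirror the schema; the class itself and its error handling are illustrative):

// Splitting with a limit of 9 keeps any '|' characters inside the trailing message intact.
public class ErrorLogLine {
    public String timestamp, pidTid, level, traceBlock, iface, line,
                  zoneBlock, exception, message;

    public static ErrorLogLine parse(String raw) {
        String[] parts = raw.split("\\|", 9);
        if (parts.length < 9) {
            throw new IllegalArgumentException("unexpected log format: " + raw);
        }
        ErrorLogLine l = new ErrorLogLine();
        l.timestamp  = parts[0];
        l.pidTid     = parts[1];  // "pid:tid"
        l.level      = parts[2];
        l.traceBlock = parts[3];  // "[app,trace_id,span_id,scene,errCode,]"
        l.iface      = parts[4];  // interface
        l.line       = parts[5];
        l.zoneBlock  = parts[6];  // "[zone,cluster,]"
        l.exception  = parts[7];
        l.message    = parts[8];
        return l;
    }
}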

Log template clustering uses a regular‑expression mask followed by the Drain algorithm (online hierarchical clustering) to extract semi‑structured templates, enabling downstream analysis such as anomaly detection.
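Drain itself builds a fixed‑depth parse tree over log tokens; the sketch below shows only the preliminary masking stage, where variable fragments are replaced so that structurally identical lines collapse into one template. The regex patterns and sample lines are illustrative, not the team’s actual rules.

import java.util.LinkedHashMap;
import java.util.Map;

// Mask obviously variable tokens (IPs, long hex ids, numbers) so that lines
// sharing a structure map to the same key; Drain then refines these groups
// with its online hierarchical clustering.
public class LogTemplateMasker {

    public static String mask(String message) {
        return message
                .replaceAll("\\b\\d{1,3}(\\.\\d{1,3}){3}\\b", "<IP>")
                .replaceAll("\\b[0-9a-fA-F]{16,}\\b", "<ID>")
                .replaceAll("\\b\\d+\\b", "<NUM>");
    }

    public static void main(String[] args) {
        Map<String, Integer> templates = new LinkedHashMap<>();
        String[] lines = {
                "query product 10086 timeout after 200 ms",   // hypothetical samples
                "query product 20500 timeout after 350 ms"
        };
        for (String line : lines) {
            templates.merge(mask(line), 1, Integer::sum);
        }
        // Both samples collapse into "query product <NUM> timeout after <NUM> ms"
        templates.forEach((t, c) -> System.out.println(c + "x " + t));
    }
}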

Scenario‑Centric Knowledge Graph (AlgoScene)

Instead of viewing the system purely as a physical call graph, the team models each business scenario as a node (AlgoScene) and connects operators and components via RPC calls. A scene may consist of multiple operators, each built from zero or more components invoked through HTTP/GRPC/Dubbo/Redis/BRPC.

Scene context is propagated using Baggage:

Context ctx = AlgoBaggageOperator.putAlgoSceneToBaggage("trans_product");
try (Scope scope = ctx.activate()) {
    // business logic
}

During data cleaning, the algo_scene field is split into algoScene (full path), rootScene (first scene), and currentScene (last scene) for fine‑grained analysis.
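A sketch of that cleaning step, assuming the full scene path arrives as a delimiter‑separated string (the "#" separator is an assumption for illustration; the article does not state the delimiter):

// Derive rootScene / currentScene from the propagated scene path.
public class SceneFields {
    static final String SEPARATOR = "#"; // assumed delimiter

    public final String algoScene;    // full path, e.g. "trans_product#rerank" (second hop hypothetical)
    public final String rootScene;    // first scene on the path
    public final String currentScene; // last scene on the path

    public SceneFields(String rawAlgoScene) {
        this.algoScene = rawAlgoScene;
        String[] hops = rawAlgoScene.split(SEPARATOR);
        this.rootScene = hops[0];
        this.currentScene = hops[hops.length - 1];
    }
}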

Dynamic Metadata and Streaming Computation

A configuration‑center subscription system publishes dynamic metadata to the tracing pipeline, allowing rapid changes without code redeployment. Graph data is stored in Neo4j, while time‑series metrics reside in VictoriaMetrics.

Streaming queries use EPL‑style SQL to compute multi‑dimensional aggregates, e.g.:

@TimeWindow(10)
@Metric(name = 'algo_redis_client', tags = {'algoScene','rootScene','currentScene','props','env','serviceName','clusterName','redisUrl','statusCode'}, fields = {'timerCount','timerSum','timerMax'}, sampling='sampling')
SELECT algoScene, rootScene, currentScene, get_value(origin.props) AS props, env, serviceName, clusterName, statusCode, redisUrl,
       trunc_sec(startTime,10) AS timestamp,
       max(duration) AS timerMax,
       sum(duration) AS timerSum,
       count(1) AS timerCount,
       sampling(new Object[]{duration,traceId}) AS sampling
FROM algoRedisSpan AS origin
GROUP BY algoScene, rootScene, currentScene, props, env, serviceName, clusterName, redisUrl, statusCode, trunc_sec(startTime,10)

Intelligent Evolution: Anomaly Detection and Periodic Pattern Recognition

Two core algorithms were introduced:

Adaptive Periodicity Recognition: replaces FFT with a self‑adaptive method that evaluates candidate periods via lag‑1 autocorrelation, handling noisy, non‑stationary traffic (see the first sketch after this list).

Improved IQR Anomaly Detection: extends the classic IQR rule with zero‑baseline handling, dual thresholds, and tunable upper/lower quartile multipliers to reduce false alarms in skewed error‑count distributions (see the second sketch below).
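A minimal sketch of the periodicity check: for each candidate period, compute the autocorrelation of the series at that lag and keep the best candidate above a threshold. The article’s exact lag‑1 formulation and its thresholds may differ; this is an illustrative reading, not the team’s implementation.

// Score each candidate period by the autocorrelation at that lag.
public class PeriodDetector {

    public static int detectPeriod(double[] series, int[] candidatePeriods, double minScore) {
        int best = -1;
        double bestScore = minScore;
        for (int period : candidatePeriods) {
            double score = autocorrelation(series, period);
            if (score > bestScore) {
                bestScore = score;
                best = period;
            }
        }
        return best; // -1 means "no convincing period found"
    }

    static double autocorrelation(double[] x, int lag) {
        int n = x.length;
        if (lag <= 0 || lag >= n) return 0.0;
        double mean = 0.0;
        for (double v : x) mean += v;
        mean /= n;
        double num = 0.0, den = 0.0;
        for (int i = 0; i < n; i++) {
            den += (x[i] - mean) * (x[i] - mean);
            if (i + lag < n) num += (x[i] - mean) * (x[i + lag] - mean);
        }
        return den == 0.0 ? 0.0 : num / den;
    }
}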

Results show significant noise reduction for zero‑baseline metrics and the ability to spot localized anomalies in periodic signals.
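For the improved IQR rule, a minimal sketch under illustrative defaults (nearest‑rank quartiles, an absolute floor for zero baselines, separate upper and lower multipliers); the team’s tuned parameters and exact quartile method are not given in the article.

import java.util.Arrays;

// IQR-based detector with the two tweaks described above: separate
// upper/lower multipliers and zero-baseline handling so a vanishing IQR
// does not flag every nonzero point. Constants are illustrative.
public class IqrAnomalyDetector {

    public static boolean isAnomaly(double[] window, double value,
                                    double upperK, double lowerK, double zeroFloor) {
        double[] sorted = window.clone();
        Arrays.sort(sorted);
        double q1 = percentile(sorted, 0.25);
        double q3 = percentile(sorted, 0.75);
        double iqr = q3 - q1;

        // Zero-baseline handling: with an (almost) all-zero window, require the
        // point to clear an absolute floor instead of a vanishing IQR band.
        if (q3 == 0.0) {
            return value > zeroFloor;
        }
        double upper = q3 + upperK * iqr;
        double lower = q1 - lowerK * iqr;
        return value > upper || value < lower;
    }

    // Nearest-rank percentile; good enough for a sketch.
    static double percentile(double[] sorted, double p) {
        int idx = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(idx, sorted.length - 1))];
    }
}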

Event Standardization

Events from >10 sources (config center, release platform, experiment platform, etc.) are normalized into a unified schema containing Source, ChangeObject, ChangeStatus, StartTime, ChangeName, Severity, beforeChangeContent, changeContent, and optional extraInfo (scene, isGlobal, isReboot, …). Events are ingested via OpenAPI into Elasticsearch and linked to traces through Baggage and InnerBaggage, enabling causal correlation between change events and performance anomalies.
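A sketch of the normalized payload as a plain Java class; the field names mirror the schema above, while the types and the extraInfo layout are assumptions.

import java.time.Instant;
import java.util.Map;

// Unified change event. Field names follow the schema described above;
// types and the extraInfo map are illustrative.
public class ChangeEvent {
    public String source;                 // config center, release platform, experiment platform, ...
    public String changeObject;
    public String changeStatus;
    public Instant startTime;
    public String changeName;
    public String severity;
    public String beforeChangeContent;
    public String changeContent;
    public Map<String, Object> extraInfo; // scene, isGlobal, isReboot, ...
}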

Outcome and Future Work

The first phase integrated Trace, Metric, Log, and Event data, providing a vertical view from infrastructure to business logic and enabling rapid fault isolation. The second phase will focus on offline change ingestion, ErrLog/Business‑code standardization, and extending observability to business‑level SLA metrics, completing a closed‑loop “system‑visible → business‑stable” monitoring ecosystem.

Key Takeaways

Observability must be built around business scenarios, not just technical layers.

Standardizing trace, log, metric, and event formats enables cross‑service correlation and automated root‑cause analysis.

Performance‑critical SDKs require custom compression, async pipelines, and lock‑free data structures.

Advanced anomaly detection (adaptive periodicity + improved IQR) dramatically reduces false alarms in high‑variance environments.

Tags: observability, anomaly detection, distributed tracing, performance engineering, Algorithm Platform, Log Standardization