How Lianjia Built LTrace: A Low‑Overhead, Scalable Distributed Tracing Platform
This article explains how Lianjia designed and implemented LTrace, a near‑zero‑intrusion, high‑performance distributed tracing system that captures full request chains across heterogeneous services, supports multi‑language environments, offers flexible sampling, and enables rapid fault isolation and performance optimization.
Background
Modern internet services run on massive distributed clusters that span many teams, multiple languages, and thousands of servers. As Lianjia’s business grew, its distributed system became increasingly complex, so detecting request‑level failures quickly and pinpointing them precisely became critical.
Why LTrace Was Created
To address these challenges, Lianjia built the LTrace platform, a full‑stack tracing solution that collects rich request‑level data, visualizes inter‑service relationships, and enables fast problem localization, bottleneck identification, and code‑level performance analysis.
Key Design Challenges
Stability: The platform must impose negligible overhead and never affect the stability of production services.
High Performance: It must ingest and store billions of trace records in real time.
Transparency & Extensibility: Low coupling, hidden implementation details, and easy extensibility are essential.
Cross‑Language & Multi‑Protocol: Support for Java, PHP, and other heterogeneous systems is required.
Core Features
1. Call Chain Visualization – Reconstructs a complete call‑graph for each request, showing service nodes, IPs, timestamps, durations, network latency, and exception details.
2. Rapid Issue Localization – Allows operators to instantly identify the faulty service node, host IP, and root cause without manually logging into multiple machines.
3. Bottleneck Detection – Highlights nodes that dominate latency, guiding targeted optimizations.
4. Code‑Level Optimization Insights – Detects redundant calls, such as repeated calls with identical parameters, that can be merged or batch‑processed to improve efficiency.
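As one way to picture the fourth feature, the sketch below groups the spans of a single trace by service, method, and parameters, then reports any combination that repeats. The span fields (`service`, `method`, `params`) and the sample data are hypothetical; the article does not describe LTrace's actual record format or detection logic.

```python
from collections import Counter


def find_redundant_calls(spans, min_repeats=2):
    """Return {(service, method, params): count} for calls repeated in one trace.

    Assumes each span carries flat, hashable parameter values.
    """
    counts = Counter(
        (s["service"], s["method"], tuple(sorted(s["params"].items())))
        for s in spans
    )
    return {key: n for key, n in counts.items() if n >= min_repeats}


# Hypothetical spans collected for one TraceId.
spans = [
    {"service": "house", "method": "getById", "params": {"id": 7}},
    {"service": "house", "method": "getById", "params": {"id": 7}},
    {"service": "house", "method": "getById", "params": {"id": 8}},
    {"service": "agent", "method": "getById", "params": {"id": 7}},
]

redundant = find_redundant_calls(spans)
for (service, method, params), n in redundant.items():
    print(f"{service}.{method}{dict(params)} called {n} times")
```

A repeated key is only a hint: the caller may be able to cache the first result or switch to a batch endpoint, but that decision still needs a look at the code.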
Platform Architecture
The architecture consists of three layers:
Data collection via lightweight middleware that writes trace data to local files and forwards them through rsyslog to a message queue.
Real‑time aggregation and storage in HBase and Elasticsearch.
Front‑end UI for querying, analyzing, and visualizing trace data.
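The collection layer keeps the application's own work small: it only appends to a local file, and shipping is left to rsyslog and the message queue. A minimal sketch of that first step follows, assuming one JSON object per line; the file name, field names, and JSON encoding are illustrative assumptions, not LTrace's documented format.

```python
import json
import os
import tempfile


def append_span(log_path, span):
    """Append one span record as a single JSON line to a local file."""
    line = json.dumps(span, separators=(",", ":"), sort_keys=True)
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(line + "\n")


def read_spans(log_path):
    """Read the records back, as a downstream consumer might."""
    with open(log_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical local trace file; a log shipper would tail this path.
log_path = os.path.join(tempfile.mkdtemp(), "ltrace.log")
append_span(log_path, {"trace_id": "abc123", "span_id": "0", "service": "gateway"})
append_span(log_path, {"trace_id": "abc123", "span_id": "0.1", "service": "house"})

print(read_spans(log_path))
```

Because the request path never talks to the network for tracing, a slow or unavailable queue cannot block a business call in this arrangement.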
Fundamental Concepts
TraceId: A globally unique identifier that propagates across RPC calls.
SpanId: A hierarchical identifier (e.g., 0, 0.1, 0.1.1) marking each RPC’s position in the trace tree.
Annotation / BinaryAnnotation: Key‑value pairs capturing important moments and business‑specific data (e.g., phone number, user ID).
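The hierarchical SpanId scheme can be sketched in a few lines. In this illustration each span counts its outgoing calls and appends the counter to its own id, which yields 0, 0.1, 0.2, 0.1.1 and so on. The `SpanContext` class and the UUID‑based TraceId are assumptions made for the example; the article does not say how LTrace generates either value.

```python
import uuid
from dataclasses import dataclass, field


def new_trace_id():
    """Assumed format: a random 32-character hex string."""
    return uuid.uuid4().hex


@dataclass
class SpanContext:
    trace_id: str
    span_id: str = "0"  # the root span of a trace
    _children: int = field(default=0, repr=False)

    def child(self):
        """Context to send along with the next outgoing RPC from this span."""
        self._children += 1
        return SpanContext(self.trace_id, f"{self.span_id}.{self._children}")


root = SpanContext(new_trace_id())
first = root.child()    # first downstream call made by the root
second = root.child()   # second downstream call made by the root
nested = first.child()  # a call made while serving the first one

print(root.span_id, first.span_id, second.span_id, nested.span_id)
```

Every context shares the same `trace_id`, so the TraceId ties the records together while the SpanId alone encodes where each call sits in the tree.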
Trace Tree & Span Nodes
A request’s call chain forms a Trace tree. Each RPC in the chain is one Span node, identified by its SpanId, and the SpanId hierarchy (0, then 0.1, then 0.1.1) gives the parent‑child links between nodes.
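Because a SpanId contains its parent's id as a prefix, the tree can be rebuilt from an unordered set of spans without any extra parent pointer. The following sketch shows the idea; the function names are illustrative and not part of LTrace.

```python
from collections import defaultdict


def parent_of(span_id):
    """'0.1.1' -> '0.1'; the root span '0' has no parent."""
    head, sep, _ = span_id.rpartition(".")
    return head if sep else None


def build_tree(span_ids):
    """Map each parent SpanId to its children, ordered by call sequence."""
    children = defaultdict(list)
    for sid in span_ids:
        parent = parent_of(sid)
        if parent is not None:
            children[parent].append(sid)
    for kids in children.values():
        kids.sort(key=lambda s: [int(part) for part in s.split(".")])
    return dict(children)


def render(children, span_id="0", depth=0):
    """Return the tree as indented lines, depth first."""
    lines = ["  " * depth + span_id]
    for kid in children.get(span_id, []):
        lines.extend(render(children, kid, depth + 1))
    return lines


# Spans usually arrive out of order from different hosts.
tree = build_tree(["0.2", "0", "0.1.1", "0.1", "0.1.2"])
print("\n".join(render(tree)))
```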
Instrumentation
LTrace provides near‑zero‑intrusion agents for various languages (Java, PHP, etc.) that automatically generate the TraceId and SpanId and collect metadata such as service name, IP, latency, hierarchy, exceptions, and custom business fields.
Each RPC in a distributed request has four instrumentation points:
Client Send – records outgoing request parameters.
Server Receive – captures inbound request details.
Server Send – logs response data.
Client Receive – records the final response.
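The four points are enough to separate time spent in the server from time spent on the network. The sketch below uses the abbreviations `cs`, `sr`, `ss`, and `cr`, borrowed from Zipkin‑style tracers; they are not confirmed as LTrace field names, and the millisecond values are made up.

```python
from dataclasses import dataclass


@dataclass
class RpcTimestamps:
    cs: int  # Client Send, read from the client's clock (ms)
    sr: int  # Server Receive, read from the server's clock (ms)
    ss: int  # Server Send, read from the server's clock (ms)
    cr: int  # Client Receive, read from the client's clock (ms)

    def client_duration(self):
        """Total time the caller waited."""
        return self.cr - self.cs

    def server_duration(self):
        """Time the callee spent handling the request."""
        return self.ss - self.sr

    def network_latency(self):
        """Round-trip time spent outside the server (network, queuing)."""
        return self.client_duration() - self.server_duration()


t = RpcTimestamps(cs=1000, sr=1004, ss=1054, cr=1060)
print(t.client_duration(), t.server_duration(), t.network_latency())
```

Each duration is taken from a single host's clock, so the network figure stays meaningful even when the client and server clocks are not synchronized.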
Data Storage & Query
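Trace data is stored in HBase with the TraceId as the row key, so every record belonging to one request lands in the same row. Within that row, each record is identified by its SpanId combined with a type flag (c for client, s for server), so the entire call chain is aggregated naturally and can be fetched with a single row lookup.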
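To make the layout concrete, here is a tiny in‑memory stand‑in for such a table. It is not an HBase client, and the `SpanId:flag` qualifier format is an assumption; the article only states that the SpanId and the type flag are combined.

```python
from collections import defaultdict


class TraceTable:
    """In-memory model of a wide-column table keyed by TraceId."""

    def __init__(self):
        self._rows = defaultdict(dict)  # row key -> {qualifier: record}

    def put(self, trace_id, span_id, side, record):
        """Store one record; side is 'c' (client) or 's' (server)."""
        if side not in ("c", "s"):
            raise ValueError("side must be 'c' or 's'")
        self._rows[trace_id][f"{span_id}:{side}"] = record

    def get_trace(self, trace_id):
        """Fetch every record of one request with a single row lookup."""
        return dict(self._rows.get(trace_id, {}))


table = TraceTable()
table.put("abc123", "0", "s", {"service": "gateway", "ms": 60})
table.put("abc123", "0.1", "c", {"service": "gateway", "ms": 42})
table.put("abc123", "0.1", "s", {"service": "house", "ms": 35})
table.put("zzz999", "0", "s", {"service": "gateway", "ms": 12})

chain = table.get_trace("abc123")
print(chain)
```

Client and server records for the same RPC arrive from different hosts at different times; writing them under separate qualifiers lets each side be stored independently without overwriting the other.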
Key Characteristics
Low Overhead & Sampling – Default 1‑in‑100 sampling reduces impact, with customizable policies for low‑traffic services, full sampling for exceptions, and manual overrides for testing.
Log‑TraceId Binding – TraceId can be injected into application logs, allowing seamless correlation across services and easy identification of upstream callers.
Custom TraceId Support – Enables developers and testers to inject their own TraceId for debugging.
Multi‑Threading Compatibility – Provides APIs that propagate TraceId across thread pools.
Simple, Reliable Integration – Minimal configuration steps; platform availability exceeds 99.99% and does not affect business system stability.
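Several of these characteristics can be sketched together: a sampling decision, a custom TraceId bound to log lines, and propagation into a thread pool. This is an illustration built on Python's standard library, not LTrace's API; every name here (`should_sample`, `TraceIdFilter`, `submit_with_trace`, the `demo-trace-id` value) is hypothetical.

```python
import contextvars
import logging
import random
from concurrent.futures import ThreadPoolExecutor

# Holds the TraceId of the request currently being handled.
_trace_id = contextvars.ContextVar("trace_id", default="-")


def should_sample(rate=0.01, has_error=False, forced=False, rng=random.random):
    """Sample 1 in 100 by default; always keep failed or forced requests."""
    return forced or has_error or rng() < rate


class TraceIdFilter(logging.Filter):
    """Copy the current TraceId onto every log record."""

    def filter(self, record):
        record.trace_id = _trace_id.get()
        return True


def submit_with_trace(executor, fn, *args):
    """Run fn in the pool inside a copy of the caller's context."""
    ctx = contextvars.copy_context()
    return executor.submit(ctx.run, fn, *args)


def worker():
    return _trace_id.get()


handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s [%(trace_id)s] %(message)s"))
handler.addFilter(TraceIdFilter())
log = logging.getLogger("ltrace.demo")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

_trace_id.set("demo-trace-id")  # e.g. a custom TraceId supplied by a tester
log.info("order created")       # the log line now carries the TraceId

with ThreadPoolExecutor(max_workers=1) as pool:
    propagated = submit_with_trace(pool, worker).result()

print(propagated)
```

The explicit copy in `submit_with_trace` matters because a pool thread does not necessarily share the submitting thread's context variables, which is the gap a propagation API has to close.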
Experience & Lessons Learned
LTrace clarifies inter‑system relationships, helping locate latency hotspots and improve overall performance.
It validates that requests reach intended service providers, ensuring correctness.
Integration with existing monitoring systems enables automatic alerting with full trace context when anomalies occur.
Beike Product & Technology
