How Lianjia Built LTrace: A Low‑Overhead, Scalable Distributed Tracing Platform

This article explains how Lianjia designed and implemented LTrace, a near‑zero‑intrusion, high‑performance distributed tracing system that captures full request chains across heterogeneous services, supports multi‑language environments, offers flexible sampling, and enables rapid fault isolation and performance optimization.

Beike Product & Technology

Jul 16, 2017

Background

Modern internet services run on massive distributed clusters, typically built by different teams in different languages and spread across thousands of servers. As Lianjia’s business grew, its distributed system became increasingly complex, making rapid detection and precise pinpointing of request‑level failures critical.

Why LTrace Was Created

To address these challenges, Lianjia built the LTrace platform, a full‑stack tracing solution that collects rich request‑level data, visualizes inter‑service relationships, and enables fast problem localization, bottleneck identification, and code‑level performance analysis.

Key Design Challenges

Stability: The platform must impose negligible overhead and never affect the stability of production services.

High Performance: It must ingest and store billions of trace records in real time.

Transparency & Extensibility: Low coupling, hidden implementation details, and easy extensibility are essential.

Cross‑Language & Multi‑Protocol: Support for Java, PHP, and other heterogeneous systems is required.

Core Features

1. Call Chain Visualization – Reconstructs a complete call‑graph for each request, showing service nodes, IPs, timestamps, durations, network latency, and exception details.

2. Rapid Issue Localization – Allows operators to instantly identify the faulty service node, host IP, and root cause without manually logging into multiple machines.

3. Bottleneck Detection – Highlights nodes that dominate latency, guiding targeted optimizations.

4. Code‑Level Optimization Insights – Detects redundant calls or identical parameters that can be batch‑processed to improve efficiency.
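The fourth feature can be illustrated with a minimal sketch. It assumes each span of one trace has already been reduced to a record with service, method, and params fields; these field names and the find_redundant_calls helper are hypothetical and not part of LTrace's published interface. The idea is simply to group calls by target and arguments and report any group that repeats.

```python
from collections import Counter

def find_redundant_calls(spans, threshold=2):
    """Return call signatures that occur at least `threshold` times in one trace.

    Assumes parameter values are hashable (strings, numbers, tuples).
    """
    counts = Counter(
        (s["service"], s["method"], tuple(sorted(s["params"].items())))
        for s in spans
    )
    return {key: n for key, n in counts.items() if n >= threshold}

# Hypothetical spans from a single request.
spans = [
    {"service": "house", "method": "getById", "params": {"id": 1}},
    {"service": "house", "method": "getById", "params": {"id": 1}},
    {"service": "house", "method": "getById", "params": {"id": 2}},
    {"service": "user", "method": "getProfile", "params": {"uid": 7}},
]
redundant = find_redundant_calls(spans)
```

Here the two identical getById calls with id 1 are reported as one repeated signature, which is the kind of hint that suggests caching or batching.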

Platform Architecture

The architecture consists of three layers:

Data collection via lightweight middleware that writes trace data to local files and forwards them through rsyslog to a message queue.

Real‑time aggregation and storage in HBase and Elasticsearch.

Front‑end UI for querying, analyzing, and visualizing trace data.

Fundamental Concepts

TraceId : A globally unique identifier that propagates across RPC calls.

SpanId : Hierarchical identifier (e.g., 0, 0.1, 0.1.1) marking each RPC’s position in the trace tree.

Annotation / BinaryAnnotation : Key‑value pairs capturing important moments and business‑specific data (e.g., phone number, user ID).
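The hierarchical SpanId scheme can be sketched as follows. This is an illustration under stated assumptions, not LTrace's actual agent code: the TraceContext class and next_child method are hypothetical names, and using a random UUID as the TraceId is an assumption, since the article does not say how TraceIds are generated.

```python
import uuid

class TraceContext:
    """Carries the TraceId and SpanId for one node in the call chain."""

    def __init__(self, trace_id=None, span_id="0"):
        # Assumption: a random hex UUID stands in for the globally unique TraceId.
        self.trace_id = trace_id or uuid.uuid4().hex
        self.span_id = span_id
        self._children = 0

    def next_child(self):
        """Create the context for the next outgoing RPC made from this node."""
        self._children += 1
        return TraceContext(self.trace_id, f"{self.span_id}.{self._children}")

root = TraceContext()          # entry point of the request, SpanId "0"
first = root.next_child()      # first downstream call, SpanId "0.1"
second = root.next_child()     # second downstream call, SpanId "0.2"
nested = first.next_child()    # call made by the first callee, SpanId "0.1.1"
```

Every child shares the root's TraceId, while the SpanId alone encodes both depth and call order.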

Trace Tree & Span Nodes

A request’s call chain forms a Trace tree. Each RPC in the chain is a Span node in that tree, identified by its SpanId, and the SpanId’s prefix identifies the parent Span.
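Because the SpanId encodes its own position, the tree can be rebuilt from a flat, unordered list of SpanIds. The sketch below assumes the dotted format shown earlier (0, 0.1, 0.1.1); parent_of and build_tree are hypothetical helper names.

```python
def parent_of(span_id):
    """Return the parent SpanId, or None for the root span."""
    head, sep, _ = span_id.rpartition(".")
    return head if sep else None

def build_tree(span_ids):
    """Map each SpanId to its ordered list of child SpanIds."""
    children = {sid: [] for sid in span_ids}
    root = None
    # Sort numerically per segment so that "0.10" comes after "0.2".
    for sid in sorted(span_ids, key=lambda s: [int(p) for p in s.split(".")]):
        parent = parent_of(sid)
        if parent is None:
            root = sid
        elif parent in children:
            children[parent].append(sid)
        # Spans whose parent record is missing are left unlinked.
    return root, children

root_id, tree = build_tree(["0.1.1", "0", "0.2", "0.1"])
```

No parent pointer needs to be stored with a span; the parent is recovered by dropping the last segment of the SpanId.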

Instrumentation

LTrace provides near‑zero‑intrusion agents for various languages (Java, PHP, etc.) that automatically generate TraceId, SpanId, and collect metadata such as service name, IP, latency, hierarchy, exceptions, and custom business fields.

Each RPC in a distributed request has four instrumentation points:

Client Send – records outgoing request parameters.

Server Receive – captures inbound request details.

Server Send – logs response data.

Client Receive – records the final response.
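The four points above are enough to derive the durations and network latency shown in the call chain view. The following sketch assumes each point records a millisecond timestamp; the RpcTimestamps type and its field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RpcTimestamps:
    client_send: int      # ms, client clock
    server_receive: int   # ms, server clock
    server_send: int      # ms, server clock
    client_receive: int   # ms, client clock

def timings(ts):
    """Derive per-RPC durations from the four instrumentation points."""
    total = ts.client_receive - ts.client_send    # what the caller experienced
    server = ts.server_send - ts.server_receive   # time spent inside the callee
    return {"total_ms": total, "server_ms": server, "network_ms": total - server}

t = timings(RpcTimestamps(1000, 1004, 1054, 1060))
```

Each duration is computed from two timestamps taken on the same machine, so the network figure does not depend on the client and server clocks being synchronized.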

Data Storage & Query

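Trace data is stored in HBase with TraceId as the row key. Within a trace, each record is identified by its SpanId combined with a type flag (c for client, s for server), so every span of a request naturally aggregates under one key and the entire call chain can be fetched with a single fast lookup.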
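The layout can be pictured with a small in-memory stand-in. This is not HBase client code and not LTrace's real schema: the InMemoryTraceStore class is hypothetical, and the "SpanId:flag" column naming is an assumption about how SpanId and the type flag are combined.

```python
from collections import defaultdict

class InMemoryTraceStore:
    """Toy model of the layout: one row per TraceId, one column per span side."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, trace_id, span_id, side, record):
        if side not in ("c", "s"):
            raise ValueError("side must be 'c' (client) or 's' (server)")
        # Assumed column name: SpanId plus type flag, e.g. "0.1:c".
        self._rows[trace_id][f"{span_id}:{side}"] = record

    def get_trace(self, trace_id):
        """Fetch the whole call chain with a single lookup by row key."""
        return dict(self._rows.get(trace_id, {}))

store = InMemoryTraceStore()
store.put("t1", "0", "s", {"service": "gateway"})
store.put("t1", "0.1", "c", {"service": "gateway"})
store.put("t1", "0.1", "s", {"service": "house"})
store.put("t2", "0", "s", {"service": "other"})
chain = store.get_trace("t1")
```

Records written by different services at different times land in the same row, so reading a trace never requires a scan or a join.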

Key Characteristics

Low Overhead & Sampling – The default 1‑in‑100 sampling rate keeps overhead low. Policies can be customized for low‑traffic services, requests that raise exceptions are always sampled, and sampling can be forced manually for testing.
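A sampling policy along these lines might look like the sketch below. The Sampler class, its parameters, and the idea of giving low‑traffic services a higher per‑service rate are assumptions made for illustration; the article does not describe the real policy mechanics.

```python
import random

class Sampler:
    """Probabilistic sampler with per-service rates and forced sampling."""

    def __init__(self, default_rate=0.01, per_service=None, rng=None):
        self.default_rate = default_rate          # 1 in 100 by default
        self.per_service = per_service or {}      # assumed overrides, e.g. low-traffic services
        self._rng = rng or random.Random()

    def should_sample(self, service, has_exception=False, forced=False):
        # Exceptions and manual overrides bypass the probabilistic check.
        if forced or has_exception:
            return True
        rate = self.per_service.get(service, self.default_rate)
        return self._rng.random() < rate

sampler = Sampler(default_rate=0.01, per_service={"admin-console": 1.0})
always = sampler.should_sample("admin-console")
on_error = sampler.should_sample("search", has_exception=True)
```

In a real system the decision would typically be made once at the entry point and propagated with the TraceId, so that a request is either traced end to end or not at all.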

Log‑TraceId Binding – TraceId can be injected into application logs, allowing seamless correlation across services and easy identification of upstream callers.
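Log binding can be sketched with Python's standard logging module. This shows the general technique rather than LTrace's agent: the current_trace_id variable, the TraceIdFilter class, the logger name, and the log format are all hypothetical.

```python
import contextvars
import io
import logging

# Holds the TraceId of the request currently being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Attach the current TraceId to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True

stream = io.StringIO()  # stands in for the application's log file
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s [trace=%(trace_id)s] %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("ltrace.demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

current_trace_id.set("abc123")
logger.info("order created")
```

With the TraceId in every line, searching the logs of any service for one TraceId returns all entries produced by the same request.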

Custom TraceId Support – Enables developers and testers to inject their own TraceId for debugging.

Multi‑Threading Compatibility – Provides APIs that propagate TraceId across thread pools.
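Propagation across a thread pool can be sketched as follows. Pool worker threads do not reliably see the submitting thread's context variables, so the caller's context is captured at submit time and the task runs inside it. The trace_id_var variable and submit_with_trace helper are hypothetical names, not LTrace's API.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

trace_id_var = contextvars.ContextVar("trace_id", default=None)

def submit_with_trace(executor, fn, *args, **kwargs):
    """Submit fn so that it runs with a snapshot of the caller's context."""
    ctx = contextvars.copy_context()
    return executor.submit(ctx.run, fn, *args, **kwargs)

def read_trace_id():
    # Runs on a worker thread; sees the TraceId captured at submit time.
    return trace_id_var.get()

trace_id_var.set("abc123")
with ThreadPoolExecutor(max_workers=2) as pool:
    traced = submit_with_trace(pool, read_trace_id).result()
```

Each task gets its own copy of the context, so concurrent requests sharing one pool cannot overwrite each other's TraceId.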

Simple, Reliable Integration – Minimal configuration steps; platform availability exceeds 99.99% and does not affect business system stability.

Experience & Lessons Learned

LTrace clarifies inter‑system relationships, helping locate latency hotspots and improve overall performance.

It validates that requests reach intended service providers, ensuring correctness.

Integration with existing monitoring systems enables automatic alerting with full trace context when anomalies occur.
