
Practices and Techniques for Large‑Scale Distributed Trace Data Analysis at ByteDance

This article presents ByteDance’s experience building a massive trace‑data analysis platform, covering observability fundamentals, the evolution of its distributed tracing system, aggregation computation models, technical architecture choices, and concrete use cases such as precise topology, traffic estimation, dependency analysis, performance anti‑patterns, bottleneck detection, and error propagation.

ByteDance Terminal Technology

1. Overview

With the rapid growth of micro‑service architectures, distributed tracing has become a critical component of observability. After years of development, ByteDance’s tracing system now covers most online services, handling tens of thousands of micro‑services and millions of instances. The next challenge is extracting higher‑level insights from massive trace data to support architecture optimization, service governance, and cost reduction.

2. Observability and Tracing

2.1 Basic Concepts

Observability tools collect data such as traces, logs, metrics, profiling, events, and CMDB metadata, enabling operators to diagnose issues quickly by correlating alerts with trace details.

2.2 ByteDance Tracing System

The system evolved from Trace 1.0 (2019) to a unified observation platform, Argos (2020), and now supports over 50,000 micro‑services, 3 PB of storage, and a throughput of 20 million spans per second.

3. Trace‑Analysis Technical Practice

3.1 Scenarios

Beyond single‑trace debugging, higher‑level questions include stability (which services can be degraded), capacity planning (which services need scaling), and cost‑performance (identifying inefficiencies). These require automated aggregation of massive trace datasets.

3.2 Core Principle

Trace analysis follows a MapReduce‑style aggregation, optionally combined with subscription rules, to produce results for downstream applications.
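The MapReduce‑style aggregation described above can be sketched as follows. This is a minimal illustration, not ByteDance’s actual pipeline; the span tuple layout and the per‑service statistics are hypothetical:

```python
from collections import defaultdict

# Hypothetical span records: (trace_id, service, parent_service, duration_ms, is_error)
spans = [
    ("t1", "gateway", None, 120, False),
    ("t1", "orders", "gateway", 80, False),
    ("t2", "gateway", None, 200, True),
    ("t2", "orders", "gateway", 150, True),
]

def map_span(span):
    """Map phase: emit an aggregation key and a partial value per span."""
    trace_id, service, parent, duration, is_error = span
    return (service, (1, duration, int(is_error)))

def reduce_values(values):
    """Reduce phase: merge partial values into per-service statistics."""
    calls = sum(v[0] for v in values)
    total_ms = sum(v[1] for v in values)
    errors = sum(v[2] for v in values)
    return {"calls": calls, "avg_ms": total_ms / calls, "errors": errors}

grouped = defaultdict(list)
for span in spans:
    key, value = map_span(span)
    grouped[key].append(value)

stats = {service: reduce_values(vals) for service, vals in grouped.items()}
# e.g. stats["orders"] -> {"calls": 2, "avg_ms": 115.0, "errors": 1}
```

The same map/reduce pair can in principle back all three execution modes discussed below: a streaming engine applies it per time window, an ad‑hoc query applies it to a sampled slice, and a batch job applies it to a full day of traces.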

3.3 Architecture Options

Three computation modes are evaluated:

Streaming computation – near‑real‑time results, high data completeness, but limited to predefined time windows.

Ad‑hoc (sampling) computation – flexible queries with low extra cost, but reduced completeness.

Offline batch computation – high completeness and low operational cost, but with hour‑ or day‑level latency.

Based on requirements such as real‑time needs, data completeness, and ad‑hoc flexibility, ByteDance adopted an integrated solution that supports all three modes using a unified data model and logical operators.

4. Real‑World Applications

4.1 Precise Topology Calculation

By storing per‑node topology graphs in a graph database, ByteDance can retrieve exact upstream/downstream dependencies for any service, with flexible depth and granularity.
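A depth‑bounded traversal of such a topology graph can be sketched in plain Python. The edge list and service names here are invented for illustration; in production this query would run inside the graph database itself:

```python
from collections import defaultdict, deque

# Hypothetical call edges derived from aggregated spans: (caller, callee)
edges = [
    ("gateway", "orders"),
    ("gateway", "users"),
    ("orders", "inventory"),
    ("orders", "payments"),
    ("payments", "risk"),
]

# Index the edges in both directions for upstream and downstream queries.
downstream = defaultdict(set)
upstream = defaultdict(set)
for caller, callee in edges:
    downstream[caller].add(callee)
    upstream[callee].add(caller)

def neighbors(graph, start, max_depth):
    """Breadth-first walk up to max_depth hops, mimicking a graph-DB traversal."""
    seen, frontier = set(), deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen

# All services within 2 hops downstream of "gateway"
print(neighbors(downstream, "gateway", 2))
```

Keeping both directions indexed is what makes “who calls me?” as cheap as “whom do I call?”, which is the property the precise‑topology use case relies on.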

4.2 Full‑Link Traffic Estimation

Using streaming aggregation of trace counts and sampling rates, the system estimates traffic flow and proportion across the entire call graph, supporting capacity planning and cost governance.
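The estimation step amounts to inverse‑probability weighting: each sampled trace stands in for 1/rate real traces. A minimal sketch, with made‑up edge counts and sampling rates:

```python
from collections import defaultdict

# Hypothetical sampled edge counts: (caller, callee, observed_traces, sampling_rate)
sampled = [
    ("gateway", "orders", 120, 0.01),   # 1% head-based sampling
    ("gateway", "users",  300, 0.01),
    ("orders", "payments", 60, 0.01),
]

estimated = defaultdict(float)
for caller, callee, observed, rate in sampled:
    # Scale the observed count up by the inverse of the sampling rate.
    estimated[(caller, callee)] += observed / rate

# Traffic share of one edge among all of gateway's outbound calls
total_out = sum(v for (caller, _), v in estimated.items() if caller == "gateway")
share = estimated[("gateway", "orders")] / total_out
print(estimated[("gateway", "orders")], round(share, 3))
```

When different traces carry different sampling rates, summing `observed / rate` per edge (rather than scaling a grand total once) keeps the estimate unbiased, which is why the rate travels with each record in this sketch.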

4.3 Strong/Weak Dependency Analysis

Streaming computation identifies whether a downstream service is a strong or weak dependency based on error propagation, aiding downgrade plans, timeout configuration, and automated root‑cause analysis.
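One simple way to operationalize the strong/weak distinction is to measure how often a downstream failure propagates into an upstream failure within the same trace. The records and the 50% threshold below are illustrative assumptions, not ByteDance’s actual classifier:

```python
# Hypothetical per-trace observations for one (caller -> callee) pair:
# did the downstream call fail, and did the upstream request fail with it?
observations = [
    {"downstream_error": True,  "upstream_error": True},
    {"downstream_error": True,  "upstream_error": True},
    {"downstream_error": True,  "upstream_error": False},
    {"downstream_error": False, "upstream_error": False},
]

def classify(obs, threshold=0.5):
    """Label a dependency 'strong' if downstream failures usually
    propagate to the upstream caller, else 'weak'."""
    failures = [o for o in obs if o["downstream_error"]]
    if not failures:
        return "unknown"  # no failure evidence in this window
    propagated = sum(o["upstream_error"] for o in failures) / len(failures)
    return "strong" if propagated >= threshold else "weak"

print(classify(observations))  # 2 of 3 failures propagated
```

A “strong” label suggests the callee needs strict timeout and fallback design, while a “weak” label makes the edge a candidate for degradation plans.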

4.4 Performance Anti‑Pattern Detection

The platform automatically discovers patterns such as call amplification, duplicate calls, read‑write amplification, and serial loops, providing worst‑case samples and traffic context for remediation.
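Duplicate‑call detection, one of the anti‑patterns listed above, can be sketched as counting repeated identical calls within a single trace. The span triple shape and threshold are hypothetical:

```python
from collections import Counter

# Hypothetical spans within one trace: (service, endpoint, request_key)
trace_spans = [
    ("orders", "user.Get", "uid=42"),
    ("orders", "user.Get", "uid=42"),   # duplicate call, same arguments
    ("orders", "user.Get", "uid=42"),
    ("orders", "stock.Check", "sku=7"),
]

def find_duplicate_calls(spans, min_repeats=2):
    """Flag identical (service, endpoint, request) triples repeated in one
    trace -- typical candidates for batching or request-level caching."""
    counts = Counter(spans)
    return {call: n for call, n in counts.items() if n >= min_repeats}

print(find_duplicate_calls(trace_spans))
```

Call amplification is the same idea one level up: instead of exact request keys, count how many downstream calls a single upstream request fans out into, and flag ratios above a threshold.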

4.5 Full‑Link Performance Bottleneck Analysis

Aggregated trace data reveals systemic latency patterns and worst‑case paths, supporting both ad‑hoc and offline analysis modes.
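A minimal form of this analysis ranks services by tail latency over aggregated span durations. The latency figures are fabricated, and the nearest‑rank percentile here is a simplification of whatever estimator a production system would use:

```python
# Hypothetical per-span latencies (ms) grouped by service, aggregated across traces.
latencies = {
    "gateway": [10, 12, 11, 400],
    "orders":  [80, 85, 90, 95],
    "db":      [5, 6, 5, 7],
}

def percentile(values, p):
    """Nearest-rank percentile -- enough for a sketch."""
    ordered = sorted(values)
    k = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[k]

# Rank services by p99 latency to surface systemic bottlenecks.
ranked = sorted(latencies, key=lambda s: percentile(latencies[s], 99), reverse=True)
print(ranked)  # worst tail first
```

Ranking by tail rather than average matters here: `gateway` looks healthy on average but dominates the p99, which is exactly the kind of systemic pattern single‑trace debugging tends to miss.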

4.6 Error Propagation Chain Analysis

By aggregating error traces, the system uncovers common error sources, propagation paths, and impact scopes, useful for long‑term stability improvements.
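Aggregating root sources across error chains can be sketched by taking the deepest erroring span of each failed trace and counting how often each service appears there. The chains below are invented for illustration:

```python
from collections import Counter

# Hypothetical error chains from failed traces, ordered from the
# upstream entry point down to the deepest erroring service.
error_chains = [
    ("gateway", "orders", "db"),
    ("gateway", "orders", "db"),
    ("gateway", "users", "cache"),
]

# The deepest erroring span is taken as the likely root source of each chain.
root_sources = Counter(chain[-1] for chain in error_chains)
print(root_sources.most_common())
```

Sorting root sources by frequency turns thousands of individual failures into a short list of chronic offenders, which is what makes this useful for long‑term stability work rather than incident response.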

5. Summary and Outlook

The article outlines how ByteDance moved from building basic tracing capabilities to a comprehensive trace‑analysis platform that supports real‑time, ad‑hoc, and offline scenarios, delivering actionable insights for architecture governance, capacity planning, fault isolation, and performance optimization.

Future work includes continuous data‑quality improvement, expanding scenario‑specific APIs, increasing automation through AI‑driven analysis, and deeper integration with cloud‑native observability standards such as OpenTelemetry.

Tags: big data, stream processing, microservices, observability, graph database, distributed tracing, trace analysis
Written by ByteDance Terminal Technology, the official account of ByteDance Terminal Technology, sharing technical insights and team updates.