How ByteDance Built an EB‑Scale Log Service: Design & Optimization
This article details the evolution of ByteDance's TLS (Tinder Log Service) from a Loki‑based prototype to an EB‑scale, cloud‑native log system, covering its core properties, data organization, architecture, caching, hybrid storage, private codec, ecosystem compatibility, intelligent features, and real‑world case studies.
Introduction
In the early days of observability, logs were used mainly for post‑mortem fault review. Modern log systems now aim to solve metrics, tracing, and debugging in a single pipeline, with trace treated as a special log format and metrics generated from logs in real time.
Advantages of Logs
Generation and collection are easy; most programming languages provide logging frameworks.
Collection is side‑channel and requires no changes to user applications.
Logs retain rich detail for deep analysis.
Challenges
Massive volume, bursty traffic, and unstructured data make log utilization difficult.
ByteDance TLS Overview
TLS (Tinder Log Service) is a unified log system used across ByteDance’s internal products and Volcano Engine. It handles EB‑scale log volume and provides ingestion, storage, processing, query, alerting, consumption, and delivery.
Evolution
1.0: Built on a Loki‑like design with HDFS storage; lacked indexing, resulting in slow queries.
2.0: Added a custom inverted index and replaced HDFS with ByteDance’s pooled storage (bytestore), dramatically improving query performance and adding trace data.
3.0 (TLS): Introduced columnar storage, an OLAP engine, an AI engine, and high‑performance hybrid storage, reaching EB‑scale while enhancing ecosystem compatibility.
Core Properties of Modern Log Systems
High performance : Real‑time queries must return billions of rows within seconds.
Elasticity & high availability : Handle unpredictable bursts and ensure stable service.
Efficiency : Reduce cost while delivering high throughput.
Ecosystem compatibility : Support diverse user workloads and integrations.
Rich functionality : Processing, dashboards, alerts, and multi‑scenario support.
Intelligence : AIOps clustering, text analysis, machine‑learning operators, and natural‑language assistants.
Data Organization for Performance
TLS exposes four data planes—raw log, index, columnar, and storage—each with its own performance characteristics. Logs are written sequentially (append‑only) and read via a shard concept similar to Kafka partitions. An inverted index maps keywords to line numbers, and hourly grouping reduces the amount of data scanned per query.
System Architecture
The system is split into storage, compute, and control clusters. Compute components include:
API Gateway : Unified entry point for users.
ShardServer : Handles raw log write, consumption, and index lookup.
Query : OLAP engine for columnar SQL analysis.
IndexServer : Builds inverted index and columnar files.
SearchServer : Executes index queries.
Multi‑Level Caching & Hotspot Elimination
Caches exist at ShardServer, IndexServer, Query, SearchServer, and Bytestore. A queue‑based back‑pressure mechanism detects busy nodes and redirects traffic to less loaded nodes, preventing hotspot bottlenecks.
Real‑Time Index Visibility
Memory‑visible indexing achieves sub‑3‑second visibility even on HDD. A commit‑point mechanism prevents phantom reads after node failures.
Log Collector Client
LogCollector uses a multi‑pipeline design with adaptive back‑pressure, batch processing, parallel execution, zero‑copy I/O, memory pooling, and JSON‑level optimizations to maximize ingestion performance.
Handling Rapid Business Growth
Strategies include multi‑AZ, multi‑cluster deployment, storage‑compute separation, serverless K8s‑native services, and automated load balancing for seamless scaling.
Tenant Isolation
Flow control is applied per shard, tenant, topic, and resource type (CPU, memory, I/O) for write, read, query, and storage access, ensuring fair resource usage and preventing overload.
Efficient Hybrid Storage
Writes are aggregated into a WAL, encoded with erasure coding (EC), and written to SSD in full‑stripe batches; large sequential writes bypass SSD and go directly to HDD. This reduces SSD wear, network traffic, and improves latency.
Private Codec Protocol
Key compression maps frequently repeated log keys to numeric identifiers, saving ~30% of storage and network bandwidth. Separate header and data encoding allows fast header‑only reads without full deserialization.
Ecosystem Compatibility
TLS supports standard protocols (Kafka, OpenTelemetry, S3, Grafana, SQL92) and also offers high‑performance private interfaces for ingestion, consumption, and analysis.
Data Processing (ETL)
TLS provides a Python‑like scripting language for log enrichment, filtering, masking, and routing, enabling users to build custom pipelines with minimal effort.
Intelligent Operations & AI Assistant
Machine‑learning clustering, text analysis, and template matching power smart AIOps, while a natural‑language assistant can generate SQL, regex, and charts from user prompts.
Case Studies
Internal: Object storage migrated from a custom three‑system solution to TLS, reducing cost and operational complexity.
External: An international travel agency uses TLS as an online OLAP database for real‑time ticket pricing, alerts, and analytics, achieving high availability over two years.
Future Outlook
Continued focus on scalability, AI‑driven analytics, and broader ecosystem integration.
Volcano Engine Developer Services
The Volcano Engine Developer Community, Volcano Engine's TOD community, connects the platform with developers, offering cutting-edge tech content and diverse events, nurturing a vibrant developer culture, and co-building an open-source ecosystem.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
