
Design and Optimization of Trace2.0: A High‑Performance Backend Tracing System

Trace2.0 is an OpenTelemetry‑based application monitoring system that processes petabyte‑scale trace data using multi‑channel client protocols, gRPC, load‑balancing optimizations, ZSTD compression, Kafka pipelines, ClickHouse storage, and a JDK 21 upgrade with virtual threads, achieving significant performance and cost improvements.


Trace2.0 is a monitoring system introduced by the Poizon team that adopts the OpenTelemetry protocol to collect and analyze massive trace data. Since the end of 2021 it has handled daily data volumes of several petabytes and billions of spans, with peak traffic reaching tens of millions of spans per second.

Client Multi‑Channel Protocol – The system uses both HTTP and gRPC for span transmission, preferring gRPC for its binary format, high performance, and low network overhead. To avoid long‑lived connection issues, the maxConnectionAge parameter is set on the Netty server:

// Cap connection lifetime so clients periodically reconnect and rebalance
NettyServerBuilder.forPort(8081)
    .maxConnectionAge(grpcConfig.getMaxConnectionAgeInSeconds(), TimeUnit.SECONDS)
    .build();

Initially, a load‑balancer (SLB) was used, but as traffic grew the architecture shifted to direct multi‑channel connections to backend servers, reducing latency and simplifying the topology.
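With direct connections, the client itself must spread load across backend endpoints. A minimal sketch of the selection logic, as a plain round-robin picker over a configured endpoint list (the class and endpoint names here are illustrative; in practice each endpoint would front a long-lived gRPC channel):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Round-robin selection over a fixed list of backend endpoints.
// The server-side maxConnectionAge forces periodic reconnects, so newly
// added backends also start receiving traffic over time.
public class RoundRobinPicker {
    private final List<String> endpoints;
    private final AtomicInteger cursor = new AtomicInteger();

    public RoundRobinPicker(List<String> endpoints) {
        this.endpoints = List.copyOf(endpoints);
    }

    public String next() {
        // floorMod keeps the index non-negative even after int overflow
        int i = Math.floorMod(cursor.getAndIncrement(), endpoints.size());
        return endpoints.get(i);
    }
}
```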

Data Compression – To improve compression ratios, spans are grouped by key within time windows, converted to a SpanList, and written incrementally to a ZstdOutputStream. The core compression code is:

private FixedByteArrayOutputStream baos;
private OutputStream out;

public void initOutputStream() throws IOException {
    this.baos = new FixedByteArrayOutputStream(4096);
    this.out = new ZstdOutputStream(this.baos, 3); // compression level 3
}

public void write(byte[] body) throws IOException {
    out.write(Bytes.toBytes(body.length)); // 4-byte length prefix per record
    out.write(body);
}

// Close the current window, return its compressed bytes, and start a new stream.
public byte[] flush() throws IOException {
    out.close();
    byte[] data = baos.toByteArray();
    baos.reset();
    out = new ZstdOutputStream(baos, 3); // keep the same compression level
    return data;
}

Online measurements show a 5× improvement for index data and a 17× improvement for detailed trace data.
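The windowed write/flush cycle above can be exercised end to end. The sketch below uses the JDK's DeflaterOutputStream purely as a stdlib stand-in for the zstd-jni ZstdOutputStream used in production; the framing is the same: each record is a 4-byte big-endian length followed by the body.

```java
import java.io.*;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Length-prefixed framing plus stream compression, mirroring the windowed
// compressor above (DeflaterOutputStream stands in for ZstdOutputStream).
public class WindowCompressor {
    private final ByteArrayOutputStream baos = new ByteArrayOutputStream(4096);
    private DataOutputStream out = new DataOutputStream(new DeflaterOutputStream(baos));

    public void write(byte[] body) throws IOException {
        out.writeInt(body.length); // 4-byte big-endian length prefix
        out.write(body);
    }

    // Close the current window, return its compressed bytes, start a new one.
    public byte[] flush() throws IOException {
        out.close();
        byte[] data = baos.toByteArray();
        baos.reset();
        out = new DataOutputStream(new DeflaterOutputStream(baos));
        return data;
    }

    // Decode the first record of a compressed window.
    public static byte[] firstRecord(byte[] window) throws IOException {
        DataInputStream in = new DataInputStream(
                new InflaterInputStream(new ByteArrayInputStream(window)));
        byte[] body = new byte[in.readInt()];
        in.readFully(body);
        return body;
    }
}
```

Grouping similar spans into one window before compressing is what drives the ratio up: the compressor sees long runs of near-identical keys and values within each window.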

Backend Architecture (Pipeline Model) – The processing pipeline follows a Source‑Processor‑Sink pattern using Kafka. A simplified configuration is:

component:
  source:
    kafka:
      - name: "otelTraceKafkaConsumer"
        topics: "otel-span"
        consumerGroup: "otel_storage_trace"
        parallel: 1
        servers: "otel-kafka.com:9092"
        targets: "decodeProcessor"
  processor:
    - name: "decodeProcessor"
      clazz: "org.poizon.apm.component.processor.DecodeProcessor"
      parallel: 4
      targets: "filterProcessor"
    - name: "filterProcessor"
      clazz: "org.poizon.apm.component.processor.FilterProcessor"
      parallel: 2
      targets: "spanMetricExtractor,metadataExtractor,topologyExtractor"
    - name: "spanMetricExtractor"
      clazz: "org.poizon.apm.component.processor.SpanMetricExtractor"
      parallel: 2
      props:
        topic: "otel-spanMetric"
      targets: "otel_kafka"
    - name: "metadataExtractor"
      clazz: "org.poizon.apm.component.processor.MetadataExtractor"
      parallel: 2
      props:
        topic: "otel-metadata"
      targets: "otel_kafka"
    - name: "topologyExtractor"
      clazz: "org.poizon.apm.component.processor.TopologyExtractor"
      parallel: 2
      props:
        topic: "otel-topology"
      targets: "otel_kafka"
  sink:
    kafka:
      - name: "otel_kafka"
        topics: "otel-spanMetric,otel-metadata,otel-topology"
        props:
          bootstrap.servers: otel-kafka.com:9092
          key.serializer: org.apache.kafka.common.serialization.ByteArraySerializer
          value.serializer: org.apache.kafka.common.serialization.ByteArraySerializer
          compression.type: zstd

Trace data is first sent to the OTel server, then routed to different Kafka topics based on the application name, deserialized, cleaned, and transformed. A Disruptor buffer with a multi‑producer‑single‑consumer model reduces thread contention and improves concurrency.
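The Source-Processor-Sink wiring in the configuration above can be sketched as a small dispatcher: each named component transforms a record and forwards the result to its configured targets (the interfaces here are illustrative, not Trace2.0's actual classes, and the real system runs each component on its own thread pool behind a Disruptor buffer):

```java
import java.util.*;
import java.util.function.UnaryOperator;

// Minimal Source-Processor-Sink dispatcher mirroring the YAML wiring:
// each component transforms a record and fans the result out to every
// component named in its "targets" list; unregistered names act as sinks.
public class Pipeline {
    private final Map<String, UnaryOperator<String>> processors = new HashMap<>();
    private final Map<String, List<String>> targets = new HashMap<>();
    final List<String> sinkOutput = new ArrayList<>();

    public void addProcessor(String name, UnaryOperator<String> fn, String... next) {
        processors.put(name, fn);
        targets.put(name, List.of(next));
    }

    public void emit(String component, String record) {
        UnaryOperator<String> fn = processors.get(component);
        if (fn == null) {          // no processor registered: treat as a sink
            sinkOutput.add(record);
            return;
        }
        String out = fn.apply(record);
        for (String next : targets.get(component)) {
            emit(next, out);       // fan out to every configured target
        }
    }
}
```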

JDK 21 Upgrade – The backend was upgraded from JDK 8/17 to JDK 21, leveraging ZGC and virtual threads. Maven properties were set as follows:

<properties>
    <maven.compiler.source>21</maven.compiler.source>
    <maven.compiler.target>21</maven.compiler.target>
    <maven.compiler.release>21</maven.compiler.release>
</properties>

JVM options include:

-Xms22g -Xmx22g
-XX:+UseZGC
-XX:MaxMetaspaceSize=512m
-XX:+UseStringDeduplication
-XX:ZCollectionInterval=120
-XX:ReservedCodeCacheSize=256m
-XX:InitialCodeCacheSize=256m
-XX:ConcGCThreads=2
-XX:ParallelGCThreads=6
-XX:ZAllocationSpikeTolerance=5
-XX:+UnlockDiagnosticVMOptions
-XX:-ZProactive
-Xlog:safepoint,classhisto*=trace,age*,gc*=info:file=/logs/gc-%t.log:time,tid,tags:filecount=5,filesize=50m
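Which collector the JVM actually selected can be confirmed at runtime through the standard management beans; a small diagnostic sketch (the reported names depend on the JVM and the flags in use):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// List the garbage collectors the running JVM selected;
// under -XX:+UseZGC this reports ZGC's collector beans.
public class GcCheck {
    public static java.util.List<String> collectorNames() {
        return ManagementFactory.getGarbageCollectorMXBeans().stream()
                .map(GarbageCollectorMXBean::getName)
                .toList();
    }

    public static void main(String[] args) {
        collectorNames().forEach(System.out::println);
    }
}
```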

Virtual threads are used for task execution:

// Switch to a virtual-thread-per-task executor
ExecutorService executorService = Executors.newVirtualThreadPerTaskExecutor();
List<CompletableFuture<Void>> completableFutureList = combinerList.stream()
    .map(task -> CompletableFuture.runAsync(() -> {
        // business logic
    }, executorService))
    .toList();
// Wait for all tasks to complete
completableFutureList.forEach(CompletableFuture::join);

After the upgrade, throughput increased significantly, CPU utilization dropped by more than 10%, and GC pauses fell to sub-millisecond levels, while overall storage costs grew only modestly despite a four-fold increase in data volume.

The article concludes that the combination of protocol optimization, pipeline processing, aggressive compression, and a modern JDK runtime delivers a scalable, cost‑effective tracing solution, while future work will focus on elastic‑resource‑driven designs to further improve horizontal scalability.

Tags: backend architecture, Kafka, OpenTelemetry, ClickHouse, tracing, JDK 21, data compression
Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
