Scaling Microservice Tracing with Zipkin and StarRocks: A Practical Guide

This article explains how Sohu Smart Media built a high‑performance tracing system for microservices by integrating Zipkin for data collection with StarRocks for storage and analytics, covering architecture, data models, SQL queries, Flink processing, and real‑world results that boost observability and engineering efficiency.

In modern microservice architectures, traditional monitoring struggles to locate faults because requests span many services, languages, and machines. The three core data types—Logging, Metrics, and Tracing—provide different visibility levels, with Tracing offering the most detailed view of request lifecycles.

Tracing Fundamentals

Tracing records a Trace (the full request path) composed of multiple Span records. A Span captures the service name, timestamps, duration, and optional tags. Two Span kinds exist:

RPC Span: generated by client-server RPC calls; the client and the server each report an entry, and the two entries share the same Span ID.

Messaging Span: generated by internal asynchronous messaging; it does not share a Span ID with RPC Spans.

These concepts enable reconstruction of call graphs, latency breakdowns, and dependency analysis.
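As a minimal sketch of these concepts in Python, a Span can be modeled as a small record and a Trace as the Spans that share one Trace ID. The `Span` class, the `group_by_trace` helper, and all sample values are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]   # None for the root span of a trace
    service: str
    name: str
    kind: Optional[str]        # e.g. "CLIENT" or "SERVER"; None for other spans
    timestamp_us: int          # start time in microseconds
    duration_us: int
    tags: Dict[str, str] = field(default_factory=dict)


def group_by_trace(spans: List[Span]) -> Dict[str, List[Span]]:
    """Collect spans into traces, each ordered by start time."""
    traces: Dict[str, List[Span]] = defaultdict(list)
    for span in spans:
        traces[span.trace_id].append(span)
    for members in traces.values():
        members.sort(key=lambda s: s.timestamp_us)
    return dict(traces)


spans = [
    Span("X", "2", "1", "service-b", "get /b", "SERVER", 1_500, 4_000),
    Span("X", "1", None, "service-a", "get /a", "SERVER", 1_000, 9_000),
    Span("Y", "7", None, "service-a", "get /a", "SERVER", 2_000, 3_000),
]
traces = group_by_trace(spans)
print({tid: [s.span_id for s in members] for tid, members in traces.items()})
```

Once Spans are grouped this way, call graphs and latency breakdowns are derived from the `parent_id` links and durations inside each group.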

System Model and Example Trace

A user request to Service A creates a global Trace ID (X). Service A may call Services B and C in parallel, and each call produces its own Spans under that Trace ID. Fig. 3 in the original article illustrates this flow; only parallel calls are considered in the later analysis.
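The example trace can be sketched in Python as a parent-to-children map built from each Span's parent ID. The span values, the `build_children`, `render`, and `parallel_wait` helpers, and the rule that a caller waits for its slowest parallel child are assumptions for illustration, not taken from the original article.

```python
from collections import defaultdict

# Hypothetical spans for one trace: A calls B and C in parallel.
spans = [
    {"traceId": "X", "id": "a", "parentId": None, "service": "A", "duration": 120},
    {"traceId": "X", "id": "b", "parentId": "a", "service": "B", "duration": 80},
    {"traceId": "X", "id": "c", "parentId": "a", "service": "C", "duration": 50},
]


def build_children(spans):
    """Index spans by their parent ID (None holds the root spans)."""
    children = defaultdict(list)
    for span in spans:
        children[span["parentId"]].append(span)
    return children


def render(children, parent_id=None, depth=0):
    """Return the call tree as indented text lines, depth first."""
    lines = []
    for span in children.get(parent_id, []):
        lines.append("  " * depth + f'{span["service"]} ({span["duration"]} us)')
        lines.extend(render(children, span["id"], depth + 1))
    return lines


def parallel_wait(children, span):
    """With parallel calls, the caller waits for its slowest child."""
    return max((c["duration"] for c in children.get(span["id"], [])), default=0)


children = build_children(spans)
tree = render(children)
print("\n".join(tree))
print("A waits on children for", parallel_wait(children, spans[0]), "us")
```

Under the parallel-call assumption, the time Service A spends waiting is bounded by the slower of B and C rather than by their sum.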

Data Collection with Zipkin

Zipkin provides automatic instrumentation for many languages (e.g., Spring Cloud Sleuth for Java). Applications embed the Zipkin client library, configure a sampling rate (commonly 1% or 0.1%), and send JSON-encoded Span batches to Kafka or an HTTP endpoint. The original article shows a sample JSON payload.
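That payload is not reproduced here. As a rough sketch, a batch can be built in Python using the Zipkin v2 JSON field names that the jsonpaths setting later in this article reads (traceId, id, parentId, kind, name, timestamp, duration, localEndpoint.serviceName, tags). The `make_span` helper and every value below are hypothetical.

```python
import json


def make_span(trace_id, span_id, service, name, kind, timestamp_us, duration_us,
              parent_id=None, tags=None):
    """Build one span as a dict using Zipkin v2 JSON field names."""
    span = {
        "traceId": trace_id,
        "id": span_id,
        "kind": kind,
        "name": name,
        "timestamp": timestamp_us,   # microseconds since the epoch
        "duration": duration_us,     # microseconds
        "localEndpoint": {"serviceName": service},
        "tags": tags or {},
    }
    if parent_id is not None:        # root spans carry no parentId
        span["parentId"] = parent_id
    return span


batch = [
    make_span("5af7183fb1d4cf5f", "5af7183fb1d4cf5f", "service-a", "get /a",
              "SERVER", 1641258123000000, 9000,
              tags={"http.method": "GET", "http.path": "/a"}),
    make_span("5af7183fb1d4cf5f", "6b221d5bc9e6496c", "service-a", "get /b",
              "CLIENT", 1641258123001000, 4000,
              parent_id="5af7183fb1d4cf5f"),
]

# A batch is sent as one JSON array of spans.
payload = json.dumps(batch)
print(payload)
```

Sending the batch as a JSON array is what makes the "strip_outer_array" setting in the load job below necessary.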

Choosing a Storage Backend

Zipkin supports MySQL, Cassandra, and Elasticsearch as storage backends, but each has limitations for large-scale observability:

MySQL cannot handle billions of rows per day.

Cassandra lacks efficient aggregation.

Elasticsearch supports simple aggregations but struggles with joins, window functions, and real-time analytics.

To overcome these issues, Sohu migrated Trace data to StarRocks, a high-performance analytical database. The migration required only two steps: defining the tables with CREATE TABLE and creating a ROUTINE LOAD job to ingest JSON from Kafka.

StarRocks Table Schema

CREATE TABLE `zipkin` (
  `traceId` varchar(24) NULL,
  `id` varchar(24) NULL COMMENT 'Span ID',
  `localEndpoint_serviceName` varchar(512) NULL,
  `dt` int(11) NULL,
  `parentId` varchar(24) NULL,
  `timestamp` bigint(20) NULL,
  `kind` varchar(16) NULL,
  `duration` int(11) NULL,
  `name` varchar(300) NULL,
  `tag_error` int(11) DEFAULT '0',
  INDEX service_name_idx (`localEndpoint_serviceName`) USING BITMAP
) ENGINE=OLAP
DUPLICATE KEY(`traceId`,`parentId`,`id`,`timestamp`,`localEndpoint_serviceName`,`dt`)
PARTITION BY RANGE(`dt`) (
  -- Each partition covers one day as a half-open range [start, end).
  PARTITION p20220104 VALUES [('20220104'),('20220105')),
  PARTITION p20220105 VALUES [('20220105'),('20220106'))
)
DISTRIBUTED BY HASH(`id`) BUCKETS 100
PROPERTIES (
  "replication_num" = "3",
  "dynamic_partition.enable" = "true",
  "dynamic_partition.time_unit" = "DAY",
  "dynamic_partition.start" = "-30",
  "dynamic_partition.end" = "2"
);

A second table zipkin_trace_perf stores performance‑focused fields for bottleneck analysis.

Routine Load Configuration

CREATE ROUTINE LOAD zipkin_routine_load ON zipkin COLUMNS( 
  id, kind, localEndpoint_serviceName, traceId, `name`, `timestamp`, `duration`, 
  `localEndpoint_ipv4`, `remoteEndpoint_ipv4`, `remoteEndpoint_port`, `shared`, 
  `parentId`, `tags_http_path`, `tags_http_method`, `tags_controller_class`, 
  `tags_controller_method`, tmp_tag_error, 
  tag_error = if(`tmp_tag_error` IS NULL, 0, 1), 
  error_msg = tmp_tag_error, 
  dt = from_unixtime(`timestamp`/1000000, '%Y%m%d'), 
  hr = from_unixtime(`timestamp`/1000000, '%H'), 
  `min` = from_unixtime(`timestamp`/1000000, '%i') 
) PROPERTIES ( 
  "desired_concurrent_number" = "3", 
  "max_batch_interval" = "50", 
  "max_batch_rows" = "300000", 
  "max_batch_size" = "209715200", 
  "max_error_number" = "1000000", 
  "strict_mode" = "false", 
  "format" = "json", 
  "strip_outer_array" = "true", 
  "jsonpaths" = "[\"$.id\",\"$.kind\",\"$.localEndpoint.serviceName\",\"$.traceId\",\"$.name\",\"$.timestamp\",\"$.duration\",\"$.localEndpoint.ipv4\",\"$.remoteEndpoint.ipv4\",\"$.remoteEndpoint.port\",\"$.shared\",\"$.parentId\",\"$.tags.http.path\",\"$.tags.http.method\",\"$.tags.mvc.controller.class\",\"$.tags.mvc.controller.method\",\"$.tags.error\"]" 
) FROM KAFKA ("kafka_broker_list" = "IP1:PORT1,IP2:PORT2,IP3:PORT3", "kafka_topic" = "XXXXXXXXX");
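To make the column mapping concrete, the following Python sketch mimics what the jsonpaths and derived-column expressions above do to one Span: it flattens nested fields and computes tag_error, error_msg, dt, hr, and min from the microsecond timestamp. The `flatten` function is hypothetical, it covers only a subset of the columns, and it formats times in UTC, whereas from_unixtime in the database follows the session time zone.

```python
from datetime import datetime, timezone


def flatten(span):
    """Flatten one Zipkin span dict into a row, mirroring the load job's columns."""
    # Zipkin timestamps are in microseconds, hence the division by 1,000,000.
    when = datetime.fromtimestamp(span["timestamp"] / 1_000_000, tz=timezone.utc)
    error = span.get("tags", {}).get("error")
    return {
        "traceId": span["traceId"],
        "id": span["id"],
        "parentId": span.get("parentId"),
        "kind": span.get("kind"),
        "name": span.get("name"),
        "localEndpoint_serviceName": span.get("localEndpoint", {}).get("serviceName"),
        "timestamp": span["timestamp"],
        "duration": span.get("duration"),
        "tag_error": 0 if error is None else 1,   # if(tmp_tag_error IS NULL, 0, 1)
        "error_msg": error,
        "dt": when.strftime("%Y%m%d"),            # '%Y%m%d'
        "hr": when.strftime("%H"),                # '%H'
        "min": when.strftime("%M"),               # MySQL-style '%i' means minutes
    }


row = flatten({
    "traceId": "5af7183fb1d4cf5f",
    "id": "6b221d5bc9e6496c",
    "kind": "SERVER",
    "name": "get /a",
    "timestamp": 1641258123000000,
    "duration": 9000,
    "localEndpoint": {"serviceName": "service-a"},
    "tags": {"error": "500"},
})
print(row["dt"], row["hr"], row["min"], row["tag_error"])
```

Precomputing dt, hr, and min at load time lets later queries group by day, hour, or minute without re-parsing the timestamp.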

Flink Parent‑ID Resolution

To convert Messaging Spans into a pure RPC view, a Flink job processes the raw JSON stream, groups Spans by traceId and service name, and rewrites the parent ID of each RPC client Span to point to the corresponding RPC server Span. The job then writes the enriched data back to StarRocks:

env.addSource(getKafkaSource())
   .map(JSON.parseArray(_))
   .flatMap(_.asScala.map(_.asInstanceOf[JSONObject]))
   .map(jsonToBean(_))
   .keyBy(span => keyOfTrace(span))
   .window(ProcessingTimeSessionWindows.withGap(Time.seconds(10)))
   .aggregate(new TraceAggregateFunction)
   .flatMap(spans => spans)
   .addSink(StarRocksSink.sink(StarRocksSinkOptions.builder().withProperty("XXX", "XXX").build()));
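The body of TraceAggregateFunction is not shown in the source. The Python sketch below illustrates one plausible reading of the rule described above: within each (traceId, service) group, every CLIENT Span is re-parented to that group's SERVER Span. The `resolve_parents` function, the flat `service` key, and the choice of the earliest SERVER Span when several exist are assumptions.

```python
from collections import defaultdict


def resolve_parents(spans):
    """Point each CLIENT span at the SERVER span of the same trace and service."""
    groups = defaultdict(list)
    for span in spans:
        groups[(span["traceId"], span["service"])].append(span)

    resolved = []
    for members in groups.values():
        servers = [s for s in members if s.get("kind") == "SERVER"]
        # Assumption: if a group has several SERVER spans, use the earliest one.
        server = min(servers, key=lambda s: s["timestamp"]) if servers else None
        for span in members:
            out = dict(span)  # copy so the input list is left untouched
            if server is not None and span.get("kind") == "CLIENT":
                out["parentId"] = server["id"]
            resolved.append(out)
    return resolved


spans = [
    {"traceId": "X", "id": "s1", "parentId": None, "service": "A",
     "kind": "SERVER", "timestamp": 1000},
    # Intermediate span produced by internal messaging inside service A.
    {"traceId": "X", "id": "m1", "parentId": "s1", "service": "A",
     "kind": None, "timestamp": 1100},
    # Outbound RPC whose recorded parent is the messaging span.
    {"traceId": "X", "id": "c1", "parentId": "m1", "service": "A",
     "kind": "CLIENT", "timestamp": 1200},
]
resolved = {s["id"]: s for s in resolve_parents(spans)}
print(resolved["c1"]["parentId"])
```

After this step the client Span hangs directly off the server Span, so service-to-service joins no longer have to walk through messaging Spans.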

Analytical Queries

Using the zipkin table, the original article provides SQL for several common observability scenarios:

Upstream request statistics (hourly QPS, latency percentiles, error rates) for a given service.

Downstream response statistics with similar metrics.

Service‑internal processing by grouping on Span name to isolate endpoint latency.

Service topology by joining client and server spans to build a call graph with average durations.

Performance bottleneck analysis that ranks the longest‑lasting spans per trace and aggregates their occurrence percentages.
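The original SQL is not reproduced in this summary. To illustrate the first scenario in the list, the Python sketch below computes hourly request counts, latency percentiles, and error rates for one service from rows shaped like the loaded columns. The `percentile` and `hourly_stats` helpers and the sample rows are hypothetical, and the exact nearest-rank percentile used here can differ from an approximate database function.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile over a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def hourly_stats(rows, service):
    """Per-hour request count, p50/p95 latency, and error rate for one service."""
    buckets = defaultdict(list)
    for row in rows:
        # Requests received by the service are its SERVER spans.
        if row["localEndpoint_serviceName"] == service and row["kind"] == "SERVER":
            buckets[(row["dt"], row["hr"])].append(row)

    stats = {}
    for key, members in sorted(buckets.items()):
        durations = [m["duration"] for m in members]
        stats[key] = {
            "requests": len(members),
            "p50_us": percentile(durations, 50),
            "p95_us": percentile(durations, 95),
            "error_rate": sum(m["tag_error"] for m in members) / len(members),
        }
    return stats


def row(kind, duration, tag_error=0):
    return {"localEndpoint_serviceName": "service-a", "kind": kind,
            "dt": "20220104", "hr": "01", "duration": duration,
            "tag_error": tag_error}


rows = [row("SERVER", 100), row("SERVER", 200), row("SERVER", 300),
        row("SERVER", 400, tag_error=1), row("CLIENT", 999)]
stats = hourly_stats(rows, "service-a")
print(stats)
```

Because the rows are sampled, the request count would need to be divided by the sampling rate to estimate true traffic.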

Each query demonstrates how StarRocks' analytical features (e.g., percentile_approx, window functions, and bitmap indexes) enable fast, multi-dimensional exploration of tracing data.
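The rank-then-aggregate pattern behind the bottleneck analysis can be sketched in Python as follows: rank Spans by duration within each trace, keep the top entries, and report how often each (service, Span name) pair appears among them. The `bottleneck_share` function and the sample data are hypothetical. Root Spans are skipped on the assumption that they always span the whole request; the source does not say how the original query treats them.

```python
from collections import Counter, defaultdict


def bottleneck_share(spans, top_n=1):
    """Share of traces in which each (service, name) ranks among the slowest spans."""
    by_trace = defaultdict(list)
    for span in spans:
        if span["parentId"] is not None:   # assumption: ignore root spans
            by_trace[span["traceId"]].append(span)

    hits = Counter()
    for members in by_trace.values():
        # Equivalent in spirit to ranking by duration within each trace.
        ranked = sorted(members, key=lambda s: s["duration"], reverse=True)
        for span in ranked[:top_n]:
            hits[(span["service"], span["name"])] += 1

    total = len(by_trace)
    return {key: count / total for key, count in hits.items()}


def trace(trace_id, b_duration, c_duration):
    return [
        {"traceId": trace_id, "id": "a", "parentId": None,
         "service": "A", "name": "get /a", "duration": 100},
        {"traceId": trace_id, "id": "b", "parentId": "a",
         "service": "B", "name": "get /b", "duration": b_duration},
        {"traceId": trace_id, "id": "c", "parentId": "a",
         "service": "C", "name": "get /c", "duration": c_duration},
    ]


spans = trace("X", 80, 30) + trace("Y", 20, 60) + trace("Z", 70, 10)
shares = bottleneck_share(spans)
print(shares)
```

In this sample, Service B is the slowest callee in two of three traces, which is the kind of signal the occurrence percentages are meant to surface.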

Practical Impact

After deployment, more than 30 services (hundreds of instances) send Trace data at a 1% sampling rate, generating more than 1 billion rows per day. The integrated Zipkin + StarRocks solution provides:

Real‑time alerting on latency percentiles and error rates.

Fine‑grained metric aggregation (day, hour, minute) for capacity planning.

Exploratory fault analysis across services, endpoints, and time windows.

Reduced operational overhead: developers only need to add the Zipkin SDK and configure Kafka, eliminating manual agent installation for logging or metrics.

Future Optimizations

Leverage StarRocks UDAFs and window functions to perform Parent‑ID resolution directly in the database, removing the Flink dependency.

Fully ingest the tags field by using StarRocks’ upcoming JSON data type.

Improve the Zipkin UI to surface more StarRocks‑driven dashboards and queries.

Conclusion

Integrating Zipkin with StarRocks transforms microservice monitoring from basic Monitoring (knowing whether a system works) to full Observability (understanding why it fails), delivering both analytical depth and engineering efficiency.

Written by StarRocks