
How Spark Enables Real‑Time Microservice Performance Tracing in the Cloud

This article explains how IBM Research leverages Spark to capture and analyze network traffic of microservice‑based applications in an OpenStack cloud, providing real‑time transaction tracing and batch latency statistics to reveal service dependencies and performance bottlenecks.

Art of Distributed System Architecture Design

Microservice architectures decompose applications into loosely coupled services that communicate via REST APIs, offering high agility but making end‑to‑end performance monitoring difficult when dozens of services and hundreds of instances are involved.

To address this, IBM Research built a platform‑level performance analysis tool that non‑intrusively captures inter‑service network traffic in cloud environments. The system must handle massive real‑time tenant traffic, reconstruct application topologies, and trace individual requests across services, so Spark is used for both batch and streaming analytics.

Experimental Environment

The experiment runs on an OpenStack cloud with a small Spark cluster. Microservice applications from multiple tenants execute on separate Nova compute hosts. Each host runs a software tap that captures packets on the tenant network; the captured wire data is fed into a Kafka bus. A Spark connector reads from Kafka and performs real‑time analysis.

Analysis Goals

Determine how information flows through services when responding to an end‑user request (transaction tracing).

Identify call and response relationships among microservices within a given time window.

Measure response times of each microservice during that window.

To meet these goals, two Spark applications were developed:

A real‑time transaction‑tracing app built on Spark Streaming.

A batch analysis app that generates service call graphs and latency statistics.

Transaction Tracing Method

Because the application is treated as a black box without a global request identifier, causal relationships are inferred using the nesting algorithm introduced by Aguilera et al. (SOSP 2003). The algorithm constructs a graph where edges represent interactions between services, using timestamps to infer causality. It was adapted to operate on sliding windows of packet streams rather than offline trace sets.
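
The nesting idea can be illustrated with a minimal Python sketch. It is not the IBM Research implementation and it simplifies the published algorithm: a call is treated as nested in another call when the parent was sent to the child's caller and encloses the child in time, and ties are broken by choosing the enclosing call that started latest. The `Call` record, `infer_parents`, and `in_window` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Call:
    """One request/response pair seen on the wire (hypothetical record shape)."""
    src: str      # calling service
    dst: str      # called service
    start: float  # time the request was observed
    end: float    # time the response was observed


def in_window(calls: List[Call], window_start: float, window_end: float) -> List[Call]:
    """Keep only the calls that start and finish inside one window."""
    return [c for c in calls if window_start <= c.start and c.end <= window_end]


def infer_parents(calls: List[Call]) -> Dict[Call, Optional[Call]]:
    """Map each call to the call it is most plausibly nested in, or None."""
    parents: Dict[Call, Optional[Call]] = {}
    for child in calls:
        # A candidate parent targets the child's caller and encloses it in time.
        candidates = [
            p for p in calls
            if p is not child
            and p.dst == child.src
            and p.start <= child.start
            and child.end <= p.end
        ]
        # Simplification (assumption): prefer the enclosing call that started latest.
        parents[child] = max(candidates, key=lambda p: p.start, default=None)
    return parents


# Illustrative data only: one user request fanning out across three services.
calls = [
    Call("client", "web", 0.0, 10.0),
    Call("web", "auth", 1.0, 3.0),
    Call("web", "db", 4.0, 9.0),
    Call("db", "cache", 5.0, 6.0),
]
parents = infer_parents(in_window(calls, 0.0, 10.0))
```

Following the `parents` mapping from any call back to a call with no parent yields one inferred transaction. Because only timestamps are used, concurrent requests to the same service can be attributed to the wrong parent, which is why the result is an inference rather than ground truth.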

The workflow (see Figure 3) reads captured packets in PCAP format, groups them into DStreams, extracts the five‑tuple (src_ip, src_port, dst_ip, dst_port, protocol) of each HTTP request/response pair, and feeds these pairs to the nesting algorithm. The resulting traces are stored in a time‑series database (InfluxDB).
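
The pairing step can be sketched in plain Python, outside Spark. The sketch assumes packets have already been decoded into small summaries and that responses on a connection arrive in the same order as their requests. `FiveTuple`, `Packet`, and `pair_requests` are hypothetical names for this example, not part of the described system.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class FiveTuple(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str

    def flipped(self) -> "FiveTuple":
        """The same connection as seen in the opposite direction."""
        return FiveTuple(self.dst_ip, self.dst_port,
                         self.src_ip, self.src_port, self.protocol)


class Packet(NamedTuple):
    """Hypothetical decoded packet summary (not a raw PCAP record)."""
    ts: float
    flow: FiveTuple
    is_request: bool  # True for an HTTP request, False for an HTTP response


def pair_requests(packets: Iterable[Packet]) -> List[Tuple[FiveTuple, float, float]]:
    """Return (request flow, request time, response time) per matched pair."""
    pending: Dict[FiveTuple, List[float]] = {}
    pairs: List[Tuple[FiveTuple, float, float]] = []
    for pkt in sorted(packets, key=lambda p: p.ts):
        if pkt.is_request:
            pending.setdefault(pkt.flow, []).append(pkt.ts)
            continue
        # A response travels in the opposite direction of its request.
        request_flow = pkt.flow.flipped()
        waiting = pending.get(request_flow)
        if waiting:
            # Assumption: responses come back in request order per connection.
            pairs.append((request_flow, waiting.pop(0), pkt.ts))
    return pairs


# Illustrative data only.
req = FiveTuple("10.0.0.1", 40000, "10.0.0.2", 80, "tcp")
packets = [
    Packet(1.25, req.flipped(), False),  # response, listed out of order
    Packet(1.00, req, True),             # request
]
pairs = pair_requests(packets)
```

Each resulting pair carries the endpoints and the two timestamps that a timestamp-based causality step needs as input. A response with no matching request in the batch is dropped here; a streaming version would have to decide how long to keep unmatched requests across windows.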

Transaction tracing workflow diagram

Batch Analysis Application

The batch job reads the stored transaction traces from InfluxDB, converts each trace into <vertex, edge> pairs, and aggregates them into two RDDs (vertices and edges). Vertices are further parsed by name, and a directed graph is constructed to compute call relationships and per‑edge latency statistics. The resulting graphs illustrate the application’s state over a time interval (Figures 6‑7).
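
As a rough illustration of the aggregation, the following plain Python sketch collects vertices and per-edge latency statistics from a list of traces. It is a stand-in for the Spark job, not a reproduction of it: the trace shape (a list of caller, callee, latency hops), the function name `build_call_graph`, and the chosen statistics are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Set, Tuple

# Hypothetical trace shape: a list of (caller, callee, latency_ms) hops.
Trace = List[Tuple[str, str, float]]
Edge = Tuple[str, str]


def build_call_graph(traces: List[Trace]) -> Tuple[Set[str], Dict[Edge, dict]]:
    """Aggregate traces into a vertex set and per-edge latency statistics."""
    vertices: Set[str] = set()
    latencies: Dict[Edge, List[float]] = defaultdict(list)
    for trace in traces:
        for caller, callee, latency_ms in trace:
            vertices.update((caller, callee))
            # Directed edge: caller -> callee.
            latencies[(caller, callee)].append(latency_ms)
    edges = {
        edge: {"count": len(vals), "mean_ms": mean(vals), "max_ms": max(vals)}
        for edge, vals in latencies.items()
    }
    return vertices, edges


# Illustrative data only: two traces through the same application.
traces = [
    [("web", "auth", 2.0), ("web", "db", 5.0)],
    [("web", "db", 7.0)],
]
vertices, edges = build_call_graph(traces)
```

The `edges` dictionary holds what a call graph with latency annotations needs: which service calls which, how often, and how long those calls took within the interval. On a cluster, the same grouping would naturally be expressed as a keyed aggregation over the edge data rather than an in-memory dictionary.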

Service call graph
Latency statistics
Batch analysis results

Conclusions and Future Work

The Spark‑based platform demonstrates that a unified big‑data system can simultaneously support real‑time streaming, batch processing, and graph analytics for microservice performance monitoring. Ongoing work will evaluate scalability: whether adding hosts increases data extraction speed linearly, and whether the platform can handle thousands of tenant traces.
