How Spark Enables Real‑Time Microservice Performance Tracing in the Cloud
This article describes how IBM Research uses Spark to analyze captured network traffic of microservice‑based applications in an OpenStack cloud, providing real‑time transaction tracing and batch latency statistics that reveal service dependencies and performance bottlenecks.
Microservice architectures decompose applications into loosely coupled services that communicate via REST APIs, offering high agility but making end‑to‑end performance monitoring difficult when dozens of services and hundreds of instances are involved.
To address this, IBM Research built a platform‑level performance analysis tool that non‑intrusively captures inter‑service network traffic in cloud environments. The system must handle massive real‑time tenant traffic, reconstruct application topologies, and trace individual requests across services, so Spark is used for both batch and streaming analytics.
Experimental Environment
The experiment runs on an OpenStack cloud with a small Spark cluster. Microservice applications from multiple tenants execute on separate Nova compute hosts. Each host runs a software tap that captures packets on the tenant network; the captured wire data is fed into a Kafka bus. A Spark connector reads from Kafka and performs real‑time analysis.
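The source does not show the ingestion code, so the following is a minimal sketch of how the Kafka‑to‑Spark leg might look, assuming a hypothetical topic named `packet-capture` that carries one captured‑packet summary per record (the topic name, broker address, and record format are assumptions, not the tool's actual configuration; the sketch uses the spark-streaming-kafka-0-10 API).

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object CaptureStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("packet-capture-stream")
    val ssc  = new StreamingContext(conf, Seconds(5))   // 5-second micro-batches

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka:9092",             // assumed broker address
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "trace-analyzer",
      "auto.offset.reset"  -> "latest"
    )

    // Each record's value is assumed to be one captured-packet summary line.
    val packets = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("packet-capture"), kafkaParams)
    ).map(_.value())

    packets.count().print()   // sanity check: records arriving per batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```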
Analysis Goals
Determine how information flows through services when responding to an end‑user request (transaction tracing).
Identify call and response relationships among microservices within a given time window.
Measure response times of each microservice during that window.
To meet these goals, two Spark applications were developed:
A real‑time transaction‑tracing app built on Spark Streaming.
A batch analysis app that generates service call graphs and latency statistics.
Transaction Tracing Method
Because the application is treated as a black box without a global request identifier, causal relationships are inferred using the nesting algorithm introduced by Aguilera et al. (SOSP 2003). The algorithm constructs a graph where edges represent interactions between services, using timestamps to infer causality. It was adapted to operate on sliding windows of packet streams rather than offline trace sets.
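To make the nesting idea concrete, here is a simplified illustration, not the full algorithm of Aguilera et al., which additionally scores competing candidate parents. This sketch treats a call B as a child of call A when B is issued by A's callee and B's entire request/response interval falls inside A's interval, preferring the tightest enclosing parent; the `Call` record shape is an assumption for illustration.

```scala
// Reconstructed request/response pair between two services.
case class Call(id: Long, caller: String, callee: String,
                reqTs: Long, respTs: Long)   // timestamps in microseconds

// Returns a childId -> parentId map inferred purely from timing and endpoints.
def inferNesting(calls: Seq[Call]): Map[Long, Long] = {
  val candidates = for {
    child  <- calls
    parent <- calls
    if parent.id != child.id &&
       parent.callee == child.caller &&      // child was issued by parent's callee
       parent.reqTs <= child.reqTs &&
       child.respTs <= parent.respTs         // child interval nested inside parent's
  } yield (child.id, parent)

  candidates.groupBy(_._1).map { case (childId, cands) =>
    // Prefer the tightest enclosing call as the parent.
    val parent = cands.map(_._2).minBy(p => p.respTs - p.reqTs)
    childId -> parent.id
  }
}
```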
The workflow (see Figure 3) ingests captured packets in PCAP format, groups them into DStreams, extracts the five‑tuple (src_ip, src_port, dst_ip, dst_port, protocol) for each HTTP request/response pair, and feeds the pairs to the nesting algorithm. Results are written to a time‑series database (InfluxDB).
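A sketch of the pairing step follows, operating on the `packets` DStream from the earlier Kafka sketch. The CSV field layout and the REQ/RESP markers are assumptions for illustration, not the tool's actual wire format: requests are keyed by their five‑tuple, responses by the reversed tuple, and a join within each micro‑batch matches the two directions of the same conversation and yields a per‑pair latency.

```scala
import org.apache.spark.streaming.dstream.DStream

// Assumed record layout (one CSV line per packet summary):
// timestamp,src_ip,src_port,dst_ip,dst_port,protocol,http_kind  (REQ or RESP)
case class Pkt(ts: Long, srcIp: String, srcPort: Int,
               dstIp: String, dstPort: Int, proto: String, kind: String)

def parse(line: String): Pkt = {
  val f = line.split(',')
  Pkt(f(0).toLong, f(1), f(2).toInt, f(3), f(4).toInt, f(5), f(6))
}

// Pair each HTTP response with its request inside a micro-batch window.
def pairRpcs(packets: DStream[String]): DStream[((String, Int, String, Int), Long)] = {
  val pkts  = packets.map(parse).filter(_.proto == "HTTP")
  val reqs  = pkts.filter(_.kind == "REQ")
    .map(p => ((p.srcIp, p.srcPort, p.dstIp, p.dstPort), p.ts))
  val resps = pkts.filter(_.kind == "RESP")
    .map(p => ((p.dstIp, p.dstPort, p.srcIp, p.srcPort), p.ts))  // reversed tuple
  // Joined key: five-tuple of the request; value: response time minus request time.
  reqs.join(resps).mapValues { case (reqTs, respTs) => respTs - reqTs }
}
```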
Batch Analysis Application
The batch job reads the stored transaction traces from InfluxDB, converts each trace into <vertex, edge> pairs, and aggregates them into two RDDs, one of vertices and one of edges. Vertices are keyed by service name, and a directed graph is built from them to compute call relationships and per‑edge latency statistics. The resulting graphs depict the application's state over a chosen time interval (Figures 6‑7).
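The source does not show the batch job's code; the GraphX sketch below illustrates the general approach with stand‑in sample data (the trace row shape and the sample values are assumptions): service names are mapped to numeric vertex ids, each observed call becomes a latency‑weighted directed edge, and mean latency is aggregated per (caller, callee) edge.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("trace-batch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical trace rows read back from InfluxDB:
// (callerService, calleeService, latencyMicros) -- stand-in sample data.
val traces: RDD[(String, String, Long)] = sc.parallelize(Seq(
  ("frontend", "orders", 1200L),
  ("orders", "inventory", 450L),
  ("frontend", "orders", 1350L)
))

// Assign stable numeric ids to service names to serve as GraphX vertex ids.
val services: RDD[(VertexId, String)] =
  traces.flatMap(t => Seq(t._1, t._2)).distinct().zipWithIndex().map(_.swap)
val idOf = services.map(_.swap).collectAsMap()   // small name -> id lookup map

// One directed edge per observed call, carrying its latency as the attribute.
val edges: RDD[Edge[Long]] =
  traces.map(t => Edge(idOf(t._1), idOf(t._2), t._3))

val callGraph = Graph(services, edges)

// Mean latency per (caller, callee) edge over the interval.
val meanLatency = callGraph.edges
  .map(e => ((e.srcId, e.dstId), (e.attr, 1L)))
  .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
  .mapValues { case (sum, n) => sum.toDouble / n }
```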
Conclusions and Future Work
The Spark‑based platform demonstrates that a single big‑data system can simultaneously support real‑time streaming, batch processing, and graph analytics for microservice performance monitoring. Ongoing work will evaluate scalability: whether adding capture hosts increases data‑extraction throughput linearly, and whether the system can sustain traces from thousands of tenants.