
How Spark Enables Real‑Time Microservice Performance Profiling

This article explains how IBM Research and Cloudinsight use Apache Spark to capture, analyze, and visualize microservice communication in real time, addressing challenges of observability, bottleneck detection, and latency attribution in large‑scale cloud environments.


Editor's Note

While microservice architectures give developers agility, observing the system as a whole becomes a major pain point. This article, compiled by Cloudinsight engineers, shows how IBM Research uses Spark to analyze and profile microservice performance.

Introduction

Microservices are increasingly popular due to their flexibility; applications are decomposed into loosely coupled services that communicate via REST APIs. This design enables rapid, independent iteration of services and can dramatically improve deployment capabilities, but it also makes system‑wide observability difficult.

Content Overview

In production, end‑to‑end visibility is essential for quickly diagnosing performance degradation, yet dozens of microservices (each with hundreds of instances) make this challenging.

How does information flow through services? Where are the bottlenecks? Is user‑experience latency caused by the network or by services in the call chain?

To meet the growing demand for performance-analysis tools in cloud environments, IBM Research is building a platform-level real-time profiling solution, analogous to existing auto-scaling and load-balancing services.

By capturing and analyzing network communication among microservices in a non‑intrusive way, the system can discover application topology, trace individual requests, and handle massive real‑time tenant traffic. Spark is chosen because it supports both batch and streaming analytics.

Spark Operational Analysis

A simple experiment demonstrates how Spark can be used for operational analysis. The environment consists of an OpenStack cloud running a set of microservice‑based applications across different tenant networks, plus a small Spark cluster.

Network taps on each Nova compute host capture packets, which are sent to a Kafka bus. Spark connectors consume the Kafka stream for real‑time processing.

Two Spark applications are developed to answer key questions:

How does information flow through services when responding to end-user requests (transaction tracing)?

What are the call relationships among microservices within a given time window?

What are the response times of each microservice in that window?

The applications are:

Real‑time transaction tracing

Batch analysis to generate communication graphs and latency statistics

Real‑time Transaction Tracing Application

This app builds causal relationships between request‑response pairs across services without requiring globally unique identifiers, treating the application as a black box.

It adapts the nesting algorithm from Aguilera et al. (SOSP 2003) to infer causality by examining the timestamps of inter-service calls.

The algorithm checks timestamps, constructs a directed graph, and filters low‑confidence edges, operating on sliding windows of packet streams.
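The core idea of the nesting heuristic can be sketched in a few lines of plain Python: a call B is a likely child of a call A if B is issued by A's destination and B's request/response interval falls entirely inside A's. This is an illustrative stand-in, not the actual IBM implementation; the `Call` type, service names, and timestamps are invented for the example, and the real system applies this logic per sliding window with confidence filtering.

```python
from dataclasses import dataclass

@dataclass
class Call:
    src: str        # calling service
    dst: str        # called service
    req_ts: float   # request timestamp
    resp_ts: float  # response timestamp

def infer_causality(calls):
    """Infer likely parent->child edges: B is nested in A if B is issued
    by A's destination and B's time interval lies inside A's interval."""
    edges = []
    for a in calls:
        for b in calls:
            if a is b:
                continue
            if (b.src == a.dst
                    and a.req_ts < b.req_ts
                    and b.resp_ts < a.resp_ts):
                edges.append((a, b))
    return edges

# Hypothetical trace: a frontend fans out to two backends.
calls = [
    Call("client", "frontend", 0.0, 10.0),
    Call("frontend", "auth", 1.0, 3.0),
    Call("frontend", "orders", 4.0, 9.0),
    Call("orders", "db", 5.0, 8.0),
]
for parent, child in infer_causality(calls):
    print(f"{parent.src}->{parent.dst} caused {child.src}->{child.dst}")
```

Because nesting alone can produce false positives when intervals happen to overlap, the paper's algorithm additionally scores candidate edges and discards low-confidence ones, as noted above.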

Workflow:

Packets arrive as PCAP files.

Individual streams are extracted and grouped into DStreams.

Within each time window, HTTP request/response pairs are matched using the five-tuple fields (srcip, srcport, destip, destport, protocol).

The resulting trace data is stored in a time‑series database (InfluxDB).
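The five-tuple matching step above can be sketched as follows. This is a simplified, hedged illustration in plain Python rather than the DStream-based code the article describes: packet dicts, field names, and the example window are assumptions, and a real implementation would handle retransmissions and pipelined requests per connection.

```python
from collections import defaultdict

def match_pairs(packets):
    """Pair each HTTP response with the oldest unmatched request whose
    five-tuple is the reverse of the response's five-tuple."""
    pending = defaultdict(list)  # five-tuple -> queue of unmatched requests
    pairs = []
    for pkt in sorted(packets, key=lambda p: p["ts"]):
        if pkt["kind"] == "request":
            key = (pkt["srcip"], pkt["srcport"],
                   pkt["destip"], pkt["destport"], pkt["proto"])
            pending[key].append(pkt)
        else:
            # a response travels in the opposite direction of its request
            rkey = (pkt["destip"], pkt["destport"],
                    pkt["srcip"], pkt["srcport"], pkt["proto"])
            if pending[rkey]:
                req = pending[rkey].pop(0)
                pairs.append((req, pkt, pkt["ts"] - req["ts"]))
    return pairs

# Hypothetical one-window capture: one request and its response.
window = [
    {"kind": "request", "srcip": "10.0.0.1", "srcport": 43210,
     "destip": "10.0.0.2", "destport": 8080, "proto": "TCP", "ts": 1.00},
    {"kind": "response", "srcip": "10.0.0.2", "srcport": 8080,
     "destip": "10.0.0.1", "destport": 43210, "proto": "TCP", "ts": 1.25},
]
for req, resp, latency in match_pairs(window):
    print(f"{req['srcip']} -> {req['destip']}: {latency:.2f}s")
```

Each matched (request, response, latency) triple is what the pipeline then writes to the time-series store.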

Standard Batch Analysis Application

The second Spark job runs as a batch process, producing service call graphs and latency statistics for a given time window.

It extracts independent transaction traces from InfluxDB, converts each trace into a list of (source, destination) pairs, and aggregates them into two RDDs: one for vertices and one for edges.

Vertex names are further resolved, and the call graph is computed as a directed graph with per‑edge latency statistics.
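The aggregation described above can be illustrated with a small pure-Python sketch that stands in for the Spark RDD operations: each trace becomes (source, destination, latency) records, which are folded into a vertex set and an edge map with per-edge statistics. The trace format, service names, and statistic choices here are assumptions for illustration, not the actual batch job.

```python
from collections import defaultdict
from statistics import mean

def build_call_graph(traces):
    """Aggregate transaction traces into a vertex set and an edge map
    carrying per-edge call counts and latency statistics (milliseconds)."""
    vertices = set()
    edge_latencies = defaultdict(list)
    for trace in traces:
        for src, dst, latency_ms in trace:
            vertices.update((src, dst))
            edge_latencies[(src, dst)].append(latency_ms)
    edges = {edge: {"calls": len(ls),
                    "avg_ms": mean(ls),
                    "max_ms": max(ls)}
             for edge, ls in edge_latencies.items()}
    return vertices, edges

# Hypothetical traces pulled from the time-series store for one window.
traces = [
    [("frontend", "auth", 12.0), ("frontend", "orders", 40.0)],
    [("frontend", "auth", 18.0), ("orders", "db", 7.5)],
]
vertices, edges = build_call_graph(traces)
print(sorted(vertices))
print(edges[("frontend", "auth")])
```

In the Spark version, the vertex and edge collections would be two RDDs, so the same per-edge reduction parallelizes across the cluster.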

Figures illustrate the evolving call graph and latency distribution for a tenant application.

Conclusion

Using the Spark platform, diverse analysis workloads—batch, streaming, and graph processing—can operate simultaneously on a unified big‑data platform.

Future work will explore scalability, such as linearly increasing data extraction speed by adding hosts while handling thousands of tenant traces, with ongoing updates to follow.

Tags: big data, microservices, real-time analytics, performance profiling, Spark, operational monitoring
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.
