Big Data 32 min read

Understanding Spark and Flink RPC Implementations: A Code Reading Guide

This article explains how to read and compare the RPC implementations of Spark and Flink, covering Actor Model concepts, Akka integration, message handling, threading models, and practical code‑reading techniques while providing detailed code excerpts and architectural analysis.

Big Data Technology & Architecture
Big Data Technology & Architecture
Big Data Technology & Architecture
Understanding Spark and Flink RPC Implementations: A Code Reading Guide

Introduction to Actor Model and Akka

Reading code starts with understanding the problem it solves, the existing solutions, and their trade‑offs; the code itself is merely the concrete expression of that reasoning.

Distributed systems require RPC mechanisms; this article focuses on Spark and Flink implementations built on the Actor Model.

Actor Model Basics

Actor – the communication entity

Message – the payload

Mailbox – single‑threaded FIFO processor

Actors are referenced via ActorRef within an ActorSystem, which Spark and Flink embed in their RPC layers.

Spark RPC Implementation

Spark replaced Akka (SPARK‑5293) with a Netty‑based RPC that mirrors Actor concepts: RpcEndpoint ↔ Actor RpcEndpointRef ↔ ActorRef RpcEnv ↔ ActorSystem

The RpcEndpoint trait defines lifecycle and message‑handling methods such as receive, receiveAndReply, onStart, and onStop:

private[spark] trait RpcEndpoint {
  final def self: RpcEndpointRef = ???
  final def stop(): Unit = ???
  val rpcEnv: RpcEnv = ???
  def receive: PartialFunction[Any, Unit] = ???
  def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = ???
  def onError(cause: Throwable): Unit = ???
  def onConnected(remoteAddress: RpcAddress): Unit = ???
  def onDisconnected(remoteAddress: RpcAddress): Unit = ???
  def onNetworkError(cause: Throwable, remoteAddress: RpcAddress): Unit = ???
  def onStart(): Unit = ???
  def onStop(): Unit = ???
}

Message dispatch is performed in Inbox.process, routing RpcMessage (request‑reply) and OneWayMessage (fire‑and‑forget) to the appropriate handler.

Flink RPC Implementation

Flink still relies on Akka; its RPC layer is defined in org.apache.flink.runtime.rpc and includes abstractions such as RpcService, RpcServer, RpcEndpoint, and RpcGateway. The connection flow creates a dynamic proxy via Java InvocationHandler:

private<C extends RpcGateway> CompletableFuture<C> connectInternal(String address, Class<C> clazz, Function<ActorRef, InvocationHandler> factory) {
  final ActorSelection actorSel = actorSystem.actorSelection(address);
  final Future<ActorIdentity> identify = Patterns.ask(actorSel, new Identify(42), timeout);
  final CompletableFuture<ActorRef> actorRefFuture = FutureUtils.toJava(identify).thenApply(id -> id.getRef());
  final CompletableFuture<HandshakeSuccessMessage> handshakeFuture = actorRefFuture.thenCompose(ref -> FutureUtils.toJava(Patterns.ask(ref, new RemoteHandshakeMessage(clazz, getVersion()), timeout)));
  return actorRefFuture.thenCombineAsync(handshakeFuture, (ref, ignored) -> {
    InvocationHandler ih = factory.apply(ref);
    return (C) Proxy.newProxyInstance(getClass().getClassLoader(), new Class[]{clazz}, ih);
  }, actorSystem.dispatcher());
}

The InvocationHandler translates method calls into RpcInvocation messages, handling fire‑and‑forget, ask (returning CompletableFuture), and synchronous return values.

Thread Model and MainThreadExecutable

Both Spark and Flink expose a MainThreadExecutable interface (implemented by the RPC handler) that schedules Runnable or Callable tasks on the RPC dispatcher thread, providing runAsync, scheduleRunAsync, and callAsync methods.

public void runAsync(Runnable runnable) { scheduleRunAsync(runnable, 0L); }
public void scheduleRunAsync(Runnable runnable, long delayMillis) {
  if (isLocal) {
    long atTime = delayMillis == 0 ? 0 : System.nanoTime() + delayMillis * 1_000_000;
    tell(new RunAsync(runnable, atTime));
  } else { throw new RuntimeException(...); }
}
public <V> CompletableFuture<V> callAsync(Callable<V> callable, Time timeout) {
  if (isLocal) {
    return (CompletableFuture<V>) ask(new CallAsync(callable), timeout);
  } else { throw new RuntimeException(...); }
}

These methods are essentially wrappers around Akka’s tell and ask, but they expose a Java‑friendly API for asynchronous execution within the RPC’s main thread.

Code‑Reading Tips

When approaching complex systems, first identify the problem domain, then locate the core abstractions (e.g., RpcEndpoint, RpcServer), skim boilerplate (configuration, error handling), and focus on the divergent logic that differentiates implementations.

Testing, unit tests, and example projects serve as valuable “experiments” for validating understanding of the code paths described.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Distributed SystemsFlinkRPCcode readingSparkactor-model
Big Data Technology & Architecture
Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.