Understanding Spark and Flink RPC Implementations: A Code Reading Guide
This article explains how to read and compare the RPC implementations of Spark and Flink, covering Actor Model concepts, Akka integration, message handling, threading models, and practical code‑reading techniques while providing detailed code excerpts and architectural analysis.
Introduction to Actor Model and Akka
Reading code starts with understanding the problem it solves, the existing solutions, and their trade‑offs; the code itself is merely the concrete expression of that reasoning.
Distributed systems require RPC mechanisms; this article focuses on Spark and Flink implementations built on the Actor Model.
Actor Model Basics
Actor – the communication entity
Message – the payload
Mailbox – FIFO queue whose messages the actor processes one at a time
Actors are addressed through an ActorRef and live inside an ActorSystem. Flink embeds Akka's ActorSystem in its RPC layer, while Spark reimplements the same concepts on top of Netty.
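The three concepts above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: MiniActor, MiniActorDemo, and the poison-pill stop signal are hypothetical names invented here, not Akka, Spark, or Flink APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical minimal actor: one mailbox, drained by one thread in FIFO order. */
final class MiniActor<M> {
    private final BlockingQueue<M> mailbox = new LinkedBlockingQueue<>();
    private final Thread worker;

    MiniActor(Consumer<M> behavior, M poisonPill) {
        worker = new Thread(() -> {
            try {
                while (true) {
                    M msg = mailbox.take();         // blocks until a message arrives
                    if (msg == poisonPill) return;  // stop signal, compared by identity
                    behavior.accept(msg);           // only this thread runs the behavior
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
    }

    /** Fire-and-forget send; safe to call from any thread. */
    void tell(M msg) { mailbox.add(msg); }

    /** Waits until the actor has consumed the poison pill. */
    void awaitTermination() {
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

final class MiniActorDemo {
    static List<String> run() {
        String stop = new String("stop");  // distinct instance used only as the stop signal
        List<String> seen = new ArrayList<>();
        MiniActor<String> actor = new MiniActor<>(seen::add, stop);
        actor.tell("a");
        actor.tell("b");
        actor.tell("c");
        actor.tell(stop);
        actor.awaitTermination();
        return seen;  // messages in the order they were sent
    }
}
```

Because the behavior only ever runs on the worker thread, the list in MiniActorDemo needs no lock. That is the property both RPC layers try to preserve for endpoint state.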
Spark RPC Implementation
Spark replaced Akka (SPARK‑5293) with a Netty‑based RPC that mirrors the Actor concepts:
RpcEndpoint ↔ Actor
RpcEndpointRef ↔ ActorRef
RpcEnv ↔ ActorSystem
The RpcEndpoint trait defines lifecycle and message‑handling methods such as receive, receiveAndReply, onStart, and onStop:
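private[spark] trait RpcEndpoint {
  // Method bodies are elided with ??? in this excerpt.

  // Reference to this endpoint, usable once it has been registered.
  final def self: RpcEndpointRef = ???
  final def stop(): Unit = ???

  // The RpcEnv this endpoint is registered with; supplied by the implementation.
  val rpcEnv: RpcEnv

  // Handles fire-and-forget messages.
  def receive: PartialFunction[Any, Unit] = ???
  // Handles request-reply messages; the reply goes through the context.
  def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = ???

  // Error and connection callbacks.
  def onError(cause: Throwable): Unit = ???
  def onConnected(remoteAddress: RpcAddress): Unit = ???
  def onDisconnected(remoteAddress: RpcAddress): Unit = ???
  def onNetworkError(cause: Throwable, remoteAddress: RpcAddress): Unit = ???

  // Lifecycle callbacks.
  def onStart(): Unit = ???
  def onStop(): Unit = ???
}
Message dispatch is performed in Inbox.process, which routes each RpcMessage (request‑reply) to receiveAndReply and each OneWayMessage (fire‑and‑forget) to receive.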
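The routing step can be illustrated with a small Java analogue. Spark's Inbox is written in Scala and handles more message kinds than this; the types below (InboxMessage, OneWay, Rpc, Endpoint, MiniInbox) are hypothetical simplifications, and the reply callback merely stands in for RpcCallContext.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical Java analogue of the two message kinds; not Spark's actual classes.
interface InboxMessage {}

final class OneWay implements InboxMessage {
    final Object content;
    OneWay(Object content) { this.content = content; }
}

final class Rpc implements InboxMessage {
    final Object content;
    final Consumer<Object> reply;  // stands in for the reply side of RpcCallContext
    Rpc(Object content, Consumer<Object> reply) {
        this.content = content;
        this.reply = reply;
    }
}

// Simplified endpoint exposing only the two message handlers.
interface Endpoint {
    void receive(Object msg);
    void receiveAndReply(Object msg, Consumer<Object> reply);
}

final class MiniInbox {
    private final Queue<InboxMessage> messages = new ArrayDeque<>();
    private final Endpoint endpoint;

    MiniInbox(Endpoint endpoint) { this.endpoint = endpoint; }

    synchronized void post(InboxMessage m) { messages.add(m); }

    /** Drains queued messages in order and routes each one by its kind. */
    synchronized void process() {
        InboxMessage m;
        while ((m = messages.poll()) != null) {
            if (m instanceof Rpc) {
                Rpc rpc = (Rpc) m;
                endpoint.receiveAndReply(rpc.content, rpc.reply);
            } else if (m instanceof OneWay) {
                endpoint.receive(((OneWay) m).content);
            }
        }
    }
}

final class MiniInboxDemo {
    static List<String> run() {
        List<String> log = new ArrayList<>();
        MiniInbox inbox = new MiniInbox(new Endpoint() {
            public void receive(Object msg) { log.add("one-way:" + msg); }
            public void receiveAndReply(Object msg, Consumer<Object> reply) { reply.accept("pong"); }
        });
        inbox.post(new OneWay("heartbeat"));
        inbox.post(new Rpc("ping", r -> log.add("reply:" + r)));
        inbox.process();
        return log;
    }
}
```

When reading the real Inbox, the same question applies: which message type ends up in which endpoint method, and who is responsible for sending the reply.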
Flink RPC Implementation
Flink still relies on Akka; its RPC layer is defined in org.apache.flink.runtime.rpc and includes abstractions such as RpcService, RpcServer, RpcEndpoint, and RpcGateway. The connection flow creates a dynamic proxy via Java InvocationHandler:
private <C extends RpcGateway> CompletableFuture<C> connectInternal(
        String address, Class<C> clazz, Function<ActorRef, InvocationHandler> factory) {
    // Resolve the remote actor behind the address.
    final ActorSelection actorSel = actorSystem.actorSelection(address);
    final Future<ActorIdentity> identify = Patterns.ask(actorSel, new Identify(42), timeout);
    final CompletableFuture<ActorRef> actorRefFuture =
            FutureUtils.toJava(identify).thenApply(id -> id.getRef());
    // Check that the remote endpoint supports the requested gateway and version.
    final CompletableFuture<HandshakeSuccessMessage> handshakeFuture = actorRefFuture.thenCompose(
            ref -> FutureUtils.toJava(
                    Patterns.ask(ref, new RemoteHandshakeMessage(clazz, getVersion()), timeout)));
    // Once both succeed, wrap the ActorRef in a dynamic proxy implementing the gateway.
    return actorRefFuture.thenCombineAsync(handshakeFuture, (ref, ignored) -> {
        InvocationHandler ih = factory.apply(ref);
        return (C) Proxy.newProxyInstance(getClass().getClassLoader(), new Class[]{clazz}, ih);
    }, actorSystem.dispatcher());
}
The InvocationHandler translates each method call into an RpcInvocation message. Methods returning void are sent fire‑and‑forget, methods returning CompletableFuture are sent as an ask, and any other return type blocks on the ask result to produce a synchronous value.
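The proxy technique itself needs nothing beyond the JDK. The sketch below is a simplified illustration: GreeterGateway, Invocation, ProxyDemo, and the transport function are hypothetical names standing in for a Flink gateway, the RpcInvocation message, and Akka's tell/ask.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical gateway; in Flink it would extend RpcGateway.
interface GreeterGateway {
    void notifyStarted(String who);               // fire-and-forget
    CompletableFuture<String> greet(String who);  // ask with an asynchronous result
}

// Simplified stand-in for an invocation message: method name plus arguments.
final class Invocation {
    final String method;
    final Object[] args;
    Invocation(String method, Object[] args) {
        this.method = method;
        this.args = args;
    }
    @Override public String toString() { return method + Arrays.toString(args); }
}

final class ProxyDemo {
    /** Builds a proxy that turns every gateway call into an Invocation passed to transport. */
    static <C> C connect(Class<C> gateway,
                         Function<Invocation, CompletableFuture<Object>> transport) {
        InvocationHandler handler = (proxy, method, args) -> {
            CompletableFuture<Object> result =
                    transport.apply(new Invocation(method.getName(), args));
            Class<?> returnType = method.getReturnType();
            if (returnType == void.class) return null;                 // tell: ignore the result
            if (returnType == CompletableFuture.class) return result;  // ask: hand back the future
            return result.join();                                      // synchronous: block for the value
        };
        return gateway.cast(Proxy.newProxyInstance(
                gateway.getClassLoader(), new Class<?>[] {gateway}, handler));
    }

    static String run() {
        StringBuilder sent = new StringBuilder();
        GreeterGateway g = connect(GreeterGateway.class, inv -> {
            sent.append(inv).append(';');  // record what would go over the wire
            return CompletableFuture.<Object>completedFuture("hello " + inv.args[0]);
        });
        g.notifyStarted("tm-1");
        String reply = g.greet("jm").join();
        return sent + reply;
    }
}
```

The caller only ever sees the typed gateway interface; the choice between tell, ask, and blocking is made from the method's return type inside the handler.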
Thread Model and MainThreadExecutable
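Flink exposes a MainThreadExecutable interface (implemented by the invocation handler) that schedules Runnable or Callable tasks on the endpoint's main thread through runAsync, scheduleRunAsync, and callAsync. Spark has no interface of that name; its closest counterpart is ThreadSafeRpcEndpoint, which guarantees that an endpoint's messages are processed one at a time.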
public void runAsync(Runnable runnable) {
    scheduleRunAsync(runnable, 0L);
}

public void scheduleRunAsync(Runnable runnable, long delayMillis) {
    if (isLocal) {
        // 0 means "run as soon as possible"; otherwise an absolute deadline in nanoseconds.
        long atTime = delayMillis == 0 ? 0 : System.nanoTime() + delayMillis * 1_000_000;
        tell(new RunAsync(runnable, atTime));
    } else {
        throw new RuntimeException(...);
    }
}

public <V> CompletableFuture<V> callAsync(Callable<V> callable, Time timeout) {
    if (isLocal) {
        return (CompletableFuture<V>) ask(new CallAsync(callable), timeout);
    } else {
        throw new RuntimeException(...);
    }
}
These methods are essentially wrappers around Akka’s tell and ask, but they expose a Java‑friendly API for asynchronous execution within the RPC’s main thread.
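The effect of this API can be reproduced with a single-threaded scheduler from the JDK. The sketch below is an assumption-laden stand-in, not Flink code: MainThreadSketch and MainThreadDemo are hypothetical names, the executor replaces the actor's mailbox, and the timeout parameter of callAsync is omitted.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for an endpoint's main thread, backed by one scheduler thread. */
final class MainThreadSketch implements AutoCloseable {
    private final ScheduledExecutorService mainThread =
            Executors.newSingleThreadScheduledExecutor(r -> new Thread(r, "rpc-main"));

    void runAsync(Runnable runnable) {
        scheduleRunAsync(runnable, 0L);
    }

    void scheduleRunAsync(Runnable runnable, long delayMillis) {
        mainThread.schedule(runnable, delayMillis, TimeUnit.MILLISECONDS);
    }

    /** Runs the callable on the main thread and exposes its result as a future. */
    <V> CompletableFuture<V> callAsync(Callable<V> callable) {
        CompletableFuture<V> result = new CompletableFuture<>();
        mainThread.execute(() -> {
            try {
                result.complete(callable.call());
            } catch (Exception e) {
                result.completeExceptionally(e);
            }
        });
        return result;
    }

    @Override public void close() { mainThread.shutdown(); }
}

final class MainThreadDemo {
    static String run() {
        try (MainThreadSketch rpc = new MainThreadSketch()) {
            int[] counter = {0};  // only touched on "rpc-main", so no lock is needed
            rpc.runAsync(() -> counter[0]++);
            rpc.runAsync(() -> counter[0]++);
            // Queued behind the two increments, so it observes both of them.
            return rpc.callAsync(
                    () -> Thread.currentThread().getName() + ":" + counter[0]).join();
        }
    }
}
```

The design point is the same as in the actor version: any thread may submit work, but endpoint state is read and written by one thread only, so the submitted closures can mutate it without synchronization.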
Code‑Reading Tips
When approaching complex systems, first identify the problem domain, then locate the core abstractions (e.g., RpcEndpoint, RpcServer), skim boilerplate (configuration, error handling), and focus on the divergent logic that differentiates implementations.
Unit tests and example projects serve as valuable “experiments” for validating your understanding of the code paths described above.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
