Understanding Bytecode, Code Generation, Serialization, and Data Processing Techniques in Spark and Flink
This article explains how bytecode and code‑generation improve Spark SQL performance, compares Java I/O and MapReduce InputFormats, reviews serialization choices in Spark and Flink, and describes reflection‑based DataFrame creation, storage‑memory eviction, fail‑fast design, and ConcurrentHashMap usage in big‑data frameworks.
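Bytecode is the intermediate representation executed by the JVM: Java and Scala sources are compiled to it, and the JIT compiler turns hot bytecode into machine code. Spark SQL uses runtime code generation to exploit this. Instead of interpreting an expression tree row by row, it generates Java source for the expressions in a query plan and compiles that source to bytecode with Janino, a lightweight embedded compiler. This removes most of the virtual-function calls and boxing that tree interpretation incurs. Expression-level codegen arrived in Spark 1.x; Spark 2.0 added whole-stage code generation, which fuses the operators of a stage into a single generated function.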
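The difference between interpreting an expression tree and running generated code can be sketched in plain Java. The classes below (`Expr`, `Col`, `Lit`, `Add`, `Mul`, `toJava`, `fused`) are hypothetical names invented for this illustration, not Spark's Catalyst API, and `fused` is a hand-written stand-in for what a compiler such as Janino would produce from the generated source text.

```java
// Illustration only: contrasts tree-walking interpretation with the flat code
// a generator could emit. None of these names come from Spark.
class CodegenSketch {

    // Interpreted form: every node costs one virtual call per row.
    interface Expr { double eval(double[] row); }

    static final class Col implements Expr {
        final int index;
        Col(int index) { this.index = index; }
        public double eval(double[] row) { return row[index]; }
    }

    static final class Lit implements Expr {
        final double value;
        Lit(double value) { this.value = value; }
        public double eval(double[] row) { return value; }
    }

    static final class Add implements Expr {
        final Expr left, right;
        Add(Expr left, Expr right) { this.left = left; this.right = right; }
        public double eval(double[] row) { return left.eval(row) + right.eval(row); }
    }

    static final class Mul implements Expr {
        final Expr left, right;
        Mul(Expr left, Expr right) { this.left = left; this.right = right; }
        public double eval(double[] row) { return left.eval(row) * right.eval(row); }
    }

    // Renders the tree as one Java expression: the kind of source text a
    // code generator would hand to a runtime compiler.
    static String toJava(Expr e) {
        if (e instanceof Col) return "row[" + ((Col) e).index + "]";
        if (e instanceof Lit) return Double.toString(((Lit) e).value);
        if (e instanceof Add) {
            Add a = (Add) e;
            return "(" + toJava(a.left) + " + " + toJava(a.right) + ")";
        }
        if (e instanceof Mul) {
            Mul m = (Mul) e;
            return "(" + toJava(m.left) + " * " + toJava(m.right) + ")";
        }
        throw new IllegalArgumentException("unknown node: " + e);
    }

    // Hand-written stand-in for the compiled form of (row[0] + 1.0) * row[1]:
    // no tree, no virtual dispatch, just arithmetic the JIT can inline.
    static double fused(double[] row) { return (row[0] + 1.0) * row[1]; }

    public static void main(String[] args) {
        Expr tree = new Mul(new Add(new Col(0), new Lit(1.0)), new Col(1));
        double[] row = {2.0, 3.0};
        System.out.println("generated source: " + toJava(tree));
        System.out.println("interpreted: " + tree.eval(row) + ", fused: " + fused(row));
    }
}
```

Both paths compute the same value; the generated form simply reaches it without walking the tree for every row.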
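MapReduce InputFormats such as TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, and CombineTextInputFormat define how input files on HDFS are divided into splits and how each split is parsed into key/value records. Underneath, the data still moves through ordinary Java streams. A sample utility method for copying bytes between streams demonstrates typical Java I/O handling: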
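    // Copies every byte from in to out through a buffer of buffSize bytes.
    // Needs java.io.InputStream, OutputStream, PrintStream, and IOException.
    public static void copyBytes(InputStream in, OutputStream out, int buffSize) throws IOException {
        // PrintStream swallows I/O errors, so keep a reference for checkError() below.
        PrintStream ps = out instanceof PrintStream ? (PrintStream) out : null;
        byte[] buf = new byte[buffSize];
        // read() returns -1 at end of stream, which ends the loop.
        for (int bytesRead = in.read(buf); bytesRead >= 0; bytesRead = in.read(buf)) {
            out.write(buf, 0, bytesRead);
            if (ps != null && ps.checkError()) {
                throw new IOException("Unable to write to output stream.");
            }
        }
    }

Java's built-in serialization is often too heavyweight for large-scale data processing, because every stream carries class descriptors and type metadata alongside the field values. Big-data frameworks therefore rely on denser formats: general-purpose libraries such as Kryo, Avro, and Protobuf, or, in Flink's case, its own type-specific serializers, with Kryo as the fallback for types Flink cannot analyze. Flink's serializers write records as binary data into managed memory segments, which avoids the per-object header (typically 12 to 16 bytes on a 64-bit JVM) that ordinary Java objects carry on the heap.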
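The size gap is easy to observe with the standard library alone. The sketch below serializes the same object twice, once with `ObjectOutputStream` and once field by field with `DataOutputStream`. The `User` class and its fields are hypothetical, and the hand-rolled encoding only stands in for what a compact serializer does; it is not the format used by Kryo or by Flink.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class SerializationSizeSketch {

    // Hypothetical record type used only for this comparison.
    static final class User implements Serializable {
        private static final long serialVersionUID = 1L;
        final int id;
        final String name;
        final double score;
        User(int id, String name, double score) {
            this.id = id;
            this.name = name;
            this.score = score;
        }
    }

    // Built-in serialization: writes a stream header, the class descriptor
    // (class name, field names and types), and then the field values.
    static int javaSerializedSize(User u) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(u);
            }
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Field-by-field encoding: only the values, because reader and writer
    // are assumed to agree on the schema in advance.
    static int compactSize(User u) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                out.writeInt(u.id);        // 4 bytes
                out.writeUTF(u.name);      // 2-byte length prefix + modified UTF-8
                out.writeDouble(u.score);  // 8 bytes
            }
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        User u = new User(42, "alice", 9.5);
        System.out.println("Java serialization: " + javaSerializedSize(u) + " bytes");
        System.out.println("Field-by-field:     " + compactSize(u) + " bytes");
    }
}
```

The compact form is smaller because the schema lives in the code rather than in the stream, which is the same trade-off schema-aware serializers make.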
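Spark offers two serialization options: Java serialization (based on ObjectOutputStream), which is the default, and the faster, more compact Kryo library. Spark's tuning guide describes Kryo as often as much as ten times faster. Registering classes in advance matters because an unregistered class forces Kryo to store the full class name with every object.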
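Switching serializers is a configuration change. The fragment below is a sketch of a `spark-defaults.conf` entry using Spark's documented property names; the two class names are placeholders, so substitute the types your job actually shuffles or caches.

```properties
# Use Kryo instead of Java serialization
spark.serializer                 org.apache.spark.serializer.KryoSerializer
# Fail on unregistered classes rather than silently writing full class names
spark.kryo.registrationRequired  true
# Placeholder class names (hypothetical), comma-separated
spark.kryo.classesToRegister     com.example.User,com.example.Order
```

Setting `spark.kryo.registrationRequired` to `true` is strict and may surface classes you did not expect to register, so it can be easier to enable it in testing first.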
Spark SQL can convert an RDD to a DataFrame in two ways: by using reflection to infer the schema from the fields of a case class (Scala) or the properties of a JavaBean (Java), or by explicitly supplying a StructType schema. The reflection path is concise when the record type is known at compile time, while the explicit path gives full control and is the only choice when the structure is known only at runtime.
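The idea behind the reflection path can be shown without Spark. The sketch below inspects a class with `java.lang.reflect` and maps each field to the name of the corresponding Spark SQL type. `Person`, `inferSchema`, and `sqlTypeName` are hypothetical names for this illustration; Spark's own Java path reads bean getters rather than raw fields, so treat this as a simplification of the mechanism, not a copy of it.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.TreeMap;

class SchemaInferenceSketch {

    // Hypothetical record type whose schema we want to derive.
    static final class Person {
        String name;
        int age;
        double salary;
    }

    // Builds a column-name -> type-name mapping from the declared fields.
    // A TreeMap keeps the result in a deterministic (alphabetical) order,
    // since getDeclaredFields() does not guarantee any particular order.
    static Map<String, String> inferSchema(Class<?> type) {
        Map<String, String> schema = new TreeMap<>();
        for (Field f : type.getDeclaredFields()) {
            // Skip compiler-generated and static fields; they are not columns.
            if (f.isSynthetic() || Modifier.isStatic(f.getModifiers())) {
                continue;
            }
            schema.put(f.getName(), sqlTypeName(f.getType()));
        }
        return schema;
    }

    // Maps a few Java types to the names of the matching Spark SQL types.
    static String sqlTypeName(Class<?> c) {
        if (c == int.class || c == Integer.class) return "IntegerType";
        if (c == long.class || c == Long.class) return "LongType";
        if (c == double.class || c == Double.class) return "DoubleType";
        if (c == boolean.class || c == Boolean.class) return "BooleanType";
        if (c == String.class) return "StringType";
        throw new IllegalArgumentException("unsupported field type: " + c.getName());
    }

    public static void main(String[] args) {
        System.out.println(inferSchema(Person.class));
    }
}
```

An explicit schema replaces the reflective loop with a hand-built list of columns, which is why it also works for data whose shape is only discovered at runtime.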
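For storage memory management, each Spark executor keeps its cached blocks in a LinkedHashMap ordered by access. When storage memory runs short, blocks are evicted in least-recently-used (LRU) order. A block whose storage level includes disk is written to disk before it leaves memory; a memory-only block is simply dropped and recomputed if it is needed again.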
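The eviction policy can be sketched with an access-ordered `LinkedHashMap`. The class below is a simplified model under stated assumptions: `LruBlockStoreSketch`, `Block`, and the in-memory `disk` map are hypothetical stand-ins, sizes are plain byte-array lengths, and none of Spark's additional rules (such as unified memory accounting) are modeled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LruBlockStoreSketch {

    static final class Block {
        final byte[] data;
        final boolean useDisk; // true if the storage level also includes disk
        Block(byte[] data, boolean useDisk) { this.data = data; this.useDisk = useDisk; }
    }

    private final long capacityBytes;
    private long usedBytes = 0;
    // accessOrder = true: iteration runs from least to most recently accessed.
    private final LinkedHashMap<String, Block> memory = new LinkedHashMap<>(16, 0.75f, true);
    // Stand-in for a disk store, kept in memory to stay self-contained.
    private final Map<String, byte[]> disk = new HashMap<>();

    LruBlockStoreSketch(long capacityBytes) { this.capacityBytes = capacityBytes; }

    // Caches a block and returns the ids evicted to make room for it.
    synchronized List<String> put(String id, byte[] data, boolean useDisk) {
        if (data.length > capacityBytes) {
            throw new IllegalArgumentException("block larger than the whole cache");
        }
        Block old = memory.remove(id);
        if (old != null) usedBytes -= old.data.length;

        List<String> evicted = new ArrayList<>();
        Iterator<Map.Entry<String, Block>> it = memory.entrySet().iterator();
        while (usedBytes + data.length > capacityBytes && it.hasNext()) {
            Map.Entry<String, Block> eldest = it.next();
            Block victim = eldest.getValue();
            if (victim.useDisk) disk.put(eldest.getKey(), victim.data); // spill first
            usedBytes -= victim.data.length;
            evicted.add(eldest.getKey());
            it.remove();
        }
        memory.put(id, new Block(data, useDisk));
        usedBytes += data.length;
        return evicted;
    }

    // A memory hit counts as an access and moves the block to the young end.
    synchronized byte[] get(String id) {
        Block b = memory.get(id);
        return b != null ? b.data : disk.get(id);
    }

    // containsKey does not change the access order.
    synchronized boolean inMemory(String id) { return memory.containsKey(id); }
    synchronized boolean onDisk(String id) { return disk.containsKey(id); }
}
```

Passing `true` as the third constructor argument is what turns the map into an LRU structure; with the default insertion order, the loop would evict the oldest insert instead of the least recently used block.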
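Flink adopts a fail-fast strategy: when a task fails, the streaming job is stopped, operator state is restored from the latest completed checkpoint, sources rewind to the positions recorded in that checkpoint, and processing resumes from there. Because records after the checkpoint are replayed against the restored state, each record affects the state exactly once. End-to-end exactly-once delivery additionally requires replayable sources and transactional or idempotent sinks.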
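Why restoring state and input position together yields exactly-once state updates can be shown with a toy operator. The sketch below sums a replayable array, snapshots `(offset, sum)` at a fixed interval, and injects one failure. `CheckpointRestoreSketch`, `Checkpoint`, and `run` are hypothetical names; real Flink checkpoints are asynchronous, distributed, and barrier-based, none of which is modeled here.

```java
class CheckpointRestoreSketch {

    // A consistent snapshot: input position and operator state taken together.
    static final class Checkpoint {
        final int offset;
        final long sum;
        Checkpoint(int offset, long sum) { this.offset = offset; this.sum = sum; }
    }

    // Sums a replayable source, checkpointing every `interval` records.
    // One failure is injected just before the record at `failAtOffset`
    // (pass -1 to run without a failure).
    static long run(int[] source, int interval, int failAtOffset) {
        Checkpoint last = new Checkpoint(0, 0L);
        boolean failureInjected = false;
        while (true) {
            // (Re)start from the latest checkpoint; in-flight state is discarded.
            int offset = last.offset;
            long sum = last.sum;
            try {
                while (offset < source.length) {
                    if (!failureInjected && offset == failAtOffset) {
                        failureInjected = true;
                        throw new IllegalStateException("simulated task failure");
                    }
                    sum += source[offset];
                    offset++;
                    if (offset % interval == 0) {
                        last = new Checkpoint(offset, sum);
                    }
                }
                return sum;
            } catch (IllegalStateException e) {
                // Fail fast: stop here and let the outer loop restore the
                // checkpoint, replaying the records that came after it.
            }
        }
    }

    public static void main(String[] args) {
        int[] source = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println("with failure:    " + run(source, 3, 7));
        System.out.println("without failure: " + run(source, 3, -1));
    }
}
```

The record at offset 6 is processed twice in the failing run, but its first contribution is thrown away with the in-flight state, so the final sum matches the failure-free run. Keeping the sum while rewinding the offset, or the reverse, would break that guarantee.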
Both Spark and Flink make extensive use of ConcurrentHashMap for configuration and runtime parameters, illustrating the importance of thread‑safe collections in distributed data processing systems.
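A minimal sketch of that usage pattern follows: one `ConcurrentHashMap` holds settings that several threads may initialize, and another holds counters that several threads update. `SharedSettingsSketch`, `setDefault`, `increment`, and `countConcurrently` are hypothetical names for this illustration and are not taken from either framework.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class SharedSettingsSketch {

    private final ConcurrentHashMap<String, String> settings = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, Long> counters = new ConcurrentHashMap<>();

    // Atomically sets a value only if the key is absent and returns the
    // value that ends up in the map, so racing threads agree on the winner.
    String setDefault(String key, String value) {
        String previous = settings.putIfAbsent(key, value);
        return previous != null ? previous : value;
    }

    // merge() performs the read-modify-write atomically per key, so no
    // external lock is needed.
    void increment(String metric) {
        counters.merge(metric, 1L, Long::sum);
    }

    long count(String metric) {
        return counters.getOrDefault(metric, 0L);
    }

    // Runs `threads` workers that each increment the same counter
    // `perThread` times and returns the final count.
    static long countConcurrently(int threads, int perThread) {
        SharedSettingsSketch shared = new SharedSettingsSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < perThread; i++) {
                    shared.increment("records");
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return shared.count("records");
    }

    public static void main(String[] args) {
        System.out.println(countConcurrently(4, 1000));
    }
}
```

With a plain `HashMap` and a separate `get` followed by `put`, concurrent increments could be lost or the map corrupted; the atomic per-key operations are what make the shared structure safe without a global lock.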
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
