An Overview of Apache Avro: Schema, Serialization Formats, Container Files, and RPC Usage
Apache Avro is a high‑performance binary data serialization system that originated in the Hadoop project. It uses JSON‑defined schemas to provide compact storage, efficient network transfer, a container file format suited to MapReduce, and RPC communication, all without requiring code generation or explicit field numbers.
Avro (pronounced [ævrə]) is a Hadoop sub‑project led by Doug Cutting, the creator of Hadoop. It provides a high‑performance middleware for binary data serialization and transmission, used by HBase, Hive, and other Hadoop components for data exchange.
Key features include rich data structures; a fast, compressible binary format that saves storage space and network bandwidth; container files for persistent data; built‑in RPC support; and straightforward integration with dynamic languages.
Compared with similar systems such as Google Protocol Buffers and Facebook Thrift, Avro differs in three ways: it is dynamically typed, so no code generation is required; it embeds minimal type information in the data because the schema is available when data is read; and it identifies fields by name rather than by numeric ID, which simplifies schema evolution.
Avro schemas, expressed as JSON objects, define data structures much as a Java class definition does. A schema is required for both serialization and deserialization, and schema and data are stored together, making data self‑describing and fast to process, which is especially convenient for dynamic and scripting languages.
The schema language supports eight primitive types (null, boolean, int, long, float, double, bytes, string) and six complex types (record, enum, array, map, union, fixed). Each complex type is defined by a set of required and optional attributes, allowing users to construct sophisticated data models.
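As an illustration, the following hypothetical schema defines a `User` record that combines primitive types with an array and a union (the record name and fields here are invented for the example, not taken from the Avro documentation):

```json
{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "fields": [
    {"name": "name",     "type": "string"},
    {"name": "age",      "type": "int"},
    {"name": "emails",   "type": {"type": "array", "items": "string"}},
    {"name": "nickname", "type": ["null", "string"], "default": null}
  ]
}
```

The union `["null", "string"]` is the idiomatic way to mark a field as optional in Avro.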
Avro offers two encoding methods: a compact binary encoding for efficient serialization and a JSON encoding mainly for debugging or web‑based applications. Serialization follows a depth‑first, left‑to‑right traversal of the schema, with straightforward rules for primitive types and more elaborate rules for complex types.
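The binary encoding rules for the most common primitives can be sketched in a few lines. The helpers below are a minimal illustration, not the Avro library's API: per the specification, `int` and `long` values are zigzag‑mapped and then written as variable‑length base‑128 little‑endian bytes, and a `string` is a length prefix (encoded as a long) followed by UTF‑8 bytes.

```python
def zigzag(n: int) -> int:
    """Map a signed integer to an unsigned one so small magnitudes
    stay small (sketch; assumes values fit in 64 bits)."""
    return (n << 1) ^ (n >> 63)

def encode_long(n: int) -> bytes:
    """Avro long: zigzag, then variable-length base-128, little-endian,
    with the high bit of each byte marking 'more bytes follow'."""
    n = zigzag(n)
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit set
        else:
            out.append(byte)
            return bytes(out)

def encode_string(s: str) -> bytes:
    """Avro string: byte count (as an Avro long) + UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data
```

For example, `encode_long(0)` yields a single zero byte and `encode_long(-1)` yields `0x01`, which is why small values, positive or negative, stay compact.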
For MapReduce integration, Avro defines a container file format that contains a single schema, stores objects in compressed blocks, and inserts synchronization markers between blocks to facilitate file splitting and fault tolerance. The file consists of a header (magic number, metadata, 16‑byte sync marker) and a series of data blocks, each preceded by the count of objects and the compressed size.
Metadata includes the schema (under the key "avro.schema") and a codec ("avro.codec") indicating how data blocks are compressed (null or deflate). Users can add their own metadata entries, but the "avro." prefix is reserved for Avro's own keys. This design allows block‑level operations such as splitting and seeking without deserializing the entire file.
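The header layout described above can be sketched as follows. This is an illustrative reconstruction, not the library's writer: the header is the 4‑byte magic `Obj` plus version byte 1, a metadata map (entry count, then length‑prefixed key/value pairs, then a zero terminator), and a random 16‑byte sync marker.

```python
import json
import os

def _varint(n: int) -> bytes:
    """Avro long: zigzag then variable-length base-128, little-endian."""
    n = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte, n = n & 0x7F, n >> 7
        out.append((byte | 0x80) if n else byte)
        if not n:
            return bytes(out)

def _bytes_field(data: bytes) -> bytes:
    """Avro bytes/string: length prefix (as a long) + raw bytes."""
    return _varint(len(data)) + data

def container_header(schema: dict, codec: str = "null") -> bytes:
    """Build an Avro object container file header:
    magic, file metadata map, and a 16-byte sync marker."""
    meta = {
        "avro.schema": json.dumps(schema).encode("utf-8"),
        "avro.codec": codec.encode("utf-8"),
    }
    header = bytearray(b"Obj\x01")      # magic 'O','b','j' + version 1
    header += _varint(len(meta))        # metadata map: entry count
    for key, value in meta.items():
        header += _bytes_field(key.encode("utf-8"))
        header += _bytes_field(value)
    header += _varint(0)                # zero count ends the map
    header += os.urandom(16)            # random sync marker
    return bytes(header)
```

Each subsequent data block then carries its object count, its (possibly compressed) byte size, the serialized objects, and a copy of the same sync marker, which is what makes the file splittable.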
Avro can also serve as an RPC framework: during an initial handshake, client and server exchange their schemas, so both sides can reconcile field names and handle missing or extra fields. Messages are packaged into buffers and transmitted over a transport layer, commonly HTTP POST; each buffer starts with a four‑byte length followed by its payload, and a message is terminated by an empty (zero‑length) buffer.
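The framing scheme can be sketched independently of the handshake. The two functions below are an illustration of the buffer layout just described, not Avro's transport API (the 8 KB buffer size is an arbitrary choice for the example):

```python
import struct

def frame_message(payload: bytes, buffer_size: int = 8192) -> bytes:
    """Split an RPC message into framed buffers: each buffer is a
    4-byte big-endian length plus data; a zero-length buffer ends
    the message."""
    out = bytearray()
    for i in range(0, len(payload), buffer_size):
        chunk = payload[i:i + buffer_size]
        out += struct.pack(">I", len(chunk)) + chunk
    out += struct.pack(">I", 0)  # empty buffer: end-of-message marker
    return bytes(out)

def unframe_message(data: bytes) -> bytes:
    """Reassemble a framed message (inverse of frame_message)."""
    payload, pos = bytearray(), 0
    while True:
        (length,) = struct.unpack_from(">I", data, pos)
        pos += 4
        if length == 0:
            return bytes(payload)
        payload += data[pos:pos + length]
        pos += length
```

Length‑prefixed buffers let a receiver read a complete message without parsing its contents, and the empty trailing buffer delimits messages on a stream transport.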