Design and Implementation of Ozone Data Exploration Service (Recon Server)
This article explains the design of a data exploration service for large‑scale distributed storage systems, detailing metadata synchronization, index reconstruction, aggregation tables, node‑level statistics, a user console, and the transition from checkpoint‑based snapshots to delta updates using RocksDB WAL in Hadoop Ozone Recon Server.
In large‑scale distributed storage systems, administrators and users often need specific query capabilities such as identifying the directory with the most data or the node storing the most files, which go beyond traditional file system tools like fsck. This article discusses how to design a data exploration service, using Hadoop Ozone Recon as a reference.
The service focuses on synchronizing core metadata (e.g., HDFS FSImage) from a standby node rather than scanning all physical data. Periodic synchronization of a read‑only standby metadata file, or incremental WAL updates, provides a safe basis for analysis.
To improve query efficiency, the service may rebuild index structures, such as inverted indexes or prefix‑matched keys, so that aggregations run faster and results are refreshed at each metadata sync interval.
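The prefix‑matched key idea can be sketched as follows. This is a minimal illustration, assuming keys are full paths held in a sorted map, as they would be in an ordered key‑value store. The class name `PrefixIndex` and the sample paths are hypothetical and are not taken from Recon.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: in a sorted key space, one range scan
// covers everything under a directory prefix.
class PrefixIndex {
    // Full path -> size in bytes, kept in lexicographic key order.
    private final TreeMap<String, Long> sizes = new TreeMap<>();

    void put(String path, long bytes) {
        sizes.put(path, bytes);
    }

    // Sums the sizes of all keys that start with the given prefix.
    long bytesUnder(String prefix) {
        long total = 0;
        // tailMap starts at the first key >= prefix. Because keys are sorted,
        // all matches are contiguous, so the scan stops at the first miss.
        for (Map.Entry<String, Long> e : sizes.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            total += e.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        PrefixIndex index = new PrefixIndex();
        index.put("/vol1/bucket1/dirA/file1", 100L);
        index.put("/vol1/bucket1/dirA/file2", 200L);
        index.put("/vol1/bucket1/dirB/file3", 50L);
        System.out.println(index.bytesUnder("/vol1/bucket1/dirA/"));
        System.out.println(index.bytesUnder("/vol1/bucket1/"));
    }
}
```

With this layout, a question such as "how much data sits under this directory" costs a scan proportional to the number of matching keys rather than to the whole key space.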
Aggregated results are stored in summary tables, for example:
CREATE TABLE `num_blocks_per_container` (
  `container_id` BIGINT,
  `num_blocks` INT,
  PRIMARY KEY (`container_id`)
);
These tables are refreshed regularly so users can query near‑real‑time data, such as the number of blocks in each container.
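As a rough sketch of how the rows of such a table could be computed during a refresh, the example below counts blocks per container in memory. It assumes the input is simply the container ID of every known block. The class `ContainerBlockCounter` and its method are hypothetical names, not Recon code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of building the rows of num_blocks_per_container.
class ContainerBlockCounter {
    // Takes the container ID of every known block and returns
    // container_id -> num_blocks, sorted by container ID.
    static Map<Long, Integer> countBlocks(List<Long> containerIdOfEachBlock) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (Long containerId : containerIdOfEachBlock) {
            // Start at 1 for a new container, otherwise add 1 to the count.
            counts.merge(containerId, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Long> blocks = Arrays.asList(1L, 1L, 2L, 1L, 3L, 2L);
        for (Map.Entry<Long, Integer> row : countBlocks(blocks).entrySet()) {
            System.out.println(row.getKey() + " -> " + row.getValue());
        }
    }
}
```

Each map entry corresponds to one row that a refresh job would write into the summary table.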
Additional node‑level statistics (disk usage, total blocks, etc.) are also collected and stored in the same aggregation tables, providing valuable insights for system administrators.
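A node‑level row and a "which node holds the most blocks" query might look like the sketch below. The field names (`nodeId`, `diskUsedBytes`, `totalBlocks`) are assumptions chosen to match the statistics named above, not an actual Recon schema.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: one statistics row per storage node.
class NodeStats {
    final String nodeId;
    final long diskUsedBytes;
    final long totalBlocks;

    NodeStats(String nodeId, long diskUsedBytes, long totalBlocks) {
        this.nodeId = nodeId;
        this.diskUsedBytes = diskUsedBytes;
        this.totalBlocks = totalBlocks;
    }

    // Returns the node holding the most blocks, or null for an empty list.
    // On a tie, the node that appears first in the list wins.
    static NodeStats nodeWithMostBlocks(List<NodeStats> nodes) {
        NodeStats best = null;
        for (NodeStats n : nodes) {
            if (best == null || n.totalBlocks > best.totalBlocks) {
                best = n;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<NodeStats> nodes = new ArrayList<>();
        nodes.add(new NodeStats("node-a", 500L, 40L));
        nodes.add(new NodeStats("node-b", 900L, 75L));
        nodes.add(new NodeStats("node-c", 300L, 12L));
        System.out.println(nodeWithMostBlocks(nodes).nodeId);
    }
}
```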
The service is exposed through a user console, with a Console Server acting as a proxy that can enforce authentication and offer browser‑based query interfaces.
Initially, Recon Server used periodic RocksDB checkpoint snapshots from the OzoneManager follower, which caused latency and overhead due to snapshot generation, compression, transfer, and decompression.
To achieve near‑real‑time synchronization, the design shifts to delta updates using RocksDB’s Write‑Ahead Log (WAL). The relevant RocksDB API is:
/**
 * Returns an iterator that is positioned at a write-batch containing
 * seq_number. If the sequence number is non existent, it returns an iterator
 * at the first available seq_no after the requested seq_no.
 *
 * Must set WAL_ttl_seconds or WAL_size_limit_MB to large values to
 * use this api, else the WAL files will get cleared aggressively and the
 * iterator might become invalid before an update is read.
 *
 * @param sequenceNumber sequence number offset
 * @return TransactionLogIterator instance.
 * @throws RocksDBException if iterator cannot be retrieved from native-side.
 */
public TransactionLogIterator getUpdatesSince(final long sequenceNumber) throws RocksDBException {
  return new TransactionLogIterator(getUpdatesSince(nativeHandle_, sequenceNumber));
}
The sequence number corresponds to a transaction ID; each WriteBatch may contain multiple records. By applying each batch’s updates, the Recon Server updates its aggregation tables via a handler (OMDBUpdatesHandler) and asynchronous tasks (ReconDBUpdate), ensuring consistent and timely data.
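The delta‑update flow can be illustrated without RocksDB. The sketch below uses an in‑memory sorted map as a stand‑in for the WAL and is not the RocksDB API. The names `DeltaSyncSketch`, `Update`, `appendBatch`, and `syncDeltas` are hypothetical, and the model is simplified to one sequence number per batch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a write-ahead log; not the RocksDB API.
class DeltaSyncSketch {
    // One record inside a batch: a block added to or removed from a container.
    static final class Update {
        final long containerId;
        final int blockDelta; // +1 for a put, -1 for a delete

        Update(long containerId, int blockDelta) {
            this.containerId = containerId;
            this.blockDelta = blockDelta;
        }
    }

    // Batches keyed by sequence number, in log order.
    private final TreeMap<Long, List<Update>> log = new TreeMap<>();
    // Aggregation state: container_id -> num_blocks.
    private final Map<Long, Integer> numBlocksPerContainer = new TreeMap<>();
    // Sequence number of the last batch already applied.
    private long lastAppliedSeq = 0;

    void appendBatch(long seq, List<Update> updates) {
        log.put(seq, updates);
    }

    // Applies every batch with a sequence number greater than lastAppliedSeq
    // and returns how many batches were applied.
    int syncDeltas() {
        int applied = 0;
        for (Map.Entry<Long, List<Update>> batch
                : log.tailMap(lastAppliedSeq, false).entrySet()) {
            for (Update u : batch.getValue()) {
                numBlocksPerContainer.merge(u.containerId, u.blockDelta, Integer::sum);
            }
            lastAppliedSeq = batch.getKey();
            applied++;
        }
        return applied;
    }

    int blocksIn(long containerId) {
        Integer n = numBlocksPerContainer.get(containerId);
        return n == null ? 0 : n;
    }

    public static void main(String[] args) {
        DeltaSyncSketch sync = new DeltaSyncSketch();
        sync.appendBatch(1L, Arrays.asList(new Update(7L, 1), new Update(7L, 1)));
        sync.appendBatch(2L, Arrays.asList(new Update(8L, 1)));
        System.out.println("applied " + sync.syncDeltas() + " batches");
        sync.appendBatch(3L, Arrays.asList(new Update(7L, -1)));
        System.out.println("applied " + sync.syncDeltas() + " batches");
        System.out.println("container 7 has " + sync.blocksIn(7L) + " blocks");
    }
}
```

Remembering the last applied sequence number is what makes each sync incremental: a second call reads only the batches written since the first, so no full snapshot needs to be transferred.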
The article concludes with a diagram of the Recon Server architecture and encourages readers to apply these design principles to their own big‑data systems.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
