
How Pharos Accelerates HBase Multi‑Condition Queries with Low‑Latency Indexing

This article examines Pharos, Everbright Bank's home‑grown HBase indexing middleware, detailing why existing secondary‑index solutions fall short, the design goals of low latency, simple architecture and non‑intrusiveness, and the concrete storage, pagination, and transaction‑consistency techniques that enable fast complex queries on massive data.

dbaplus Community

Background

With the rapid adoption of NoSQL technologies, many enterprises have built large‑scale HBase clusters to store massive datasets. However, HBase’s native key‑value model excels only at primary‑key lookups; complex multi‑condition queries suffer from poor performance, prompting the need for efficient secondary‑index mechanisms.

Why Existing Index Solutions Fall Short

Three common approaches were evaluated:

Elasticsearch + HBase : Combines two mature products and is easy to assemble, but it introduces a heavy architecture, higher operational cost, and additional network hops that increase latency.

Phoenix : Provides SQL‑like access and strong community backing, yet its heavyweight implementation lacks true index push‑down and often requires two round‑trips per query, leading to noticeable overhead.

Cloud‑provider Index Services : Offer turnkey solutions with low operational cost, but many financial institutions cannot use public clouds for security reasons, and the underlying implementations still rely on the Elasticsearch‑plus‑HBase combination, inheriting the same latency penalties.

Because none of these met Everbright Bank’s requirements for low‑latency, cost‑effective, and non‑intrusive indexing, a custom solution was pursued.

Pharos Overview

Pharos is an internally developed HBase middleware focused on secondary indexing. Its name, derived from the word “pharos” (lighthouse), reflects its purpose: guiding queries to the right data quickly, much like a lighthouse guides ships.

Design Goals

Low latency : Enable real‑time query responses suitable for both analytical and transactional workloads.

Simple architecture : Minimize deployment and operational complexity so developers can adopt it easily.

Non‑intrusiveness : Avoid modifying HBase core code or maintaining separate HBase forks, ensuring compatibility with upstream releases.

Key Design Decisions

Storage Strategy

Pharos adds an independent column family to each HBase table to store index entries. Because each column family is persisted in its own HFiles, index data lives in files separate from the table data, which limits the I/O impact on normal reads and writes, while index and data still reside in the same region.

Two index models are possible: a “partitioned index” (index rows share the same region as the data they point to) and a “global index” (index entries kept in separate index files). Pharos chooses partitioned indexing to achieve low latency, accepting the trade‑off that globally unique indexes are not supported.
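The uniqueness trade‑off can be illustrated with a small in‑memory sketch. This is not Pharos code: the class `PartitionedTable`, its `insert_unique` method, and the sample keys are hypothetical, and the article does not say Pharos performs any uniqueness check. The sketch only shows why a check that sees a single region's index cannot detect a duplicate stored in another region.

```python
from bisect import bisect_right


class PartitionedTable:
    """Toy table whose index is split per region (hypothetical, for illustration)."""

    def __init__(self, split_points):
        # Sorted row keys at which a new region begins.
        self.split_points = split_points
        # One local index (a set of indexed values) per region.
        self.local_index = [set() for _ in range(len(split_points) + 1)]

    def region_of(self, row_key):
        # Regions cover contiguous row-key ranges, as in HBase.
        return bisect_right(self.split_points, row_key)

    def insert_unique(self, row_key, email):
        region = self.region_of(row_key)
        # Only this region's index is visible to the check.
        if email in self.local_index[region]:
            return False
        self.local_index[region].add(email)
        return True


table = PartitionedTable(["m"])  # two regions: keys < "m" and keys >= "m"
same_region = [
    table.insert_unique("alice", "x@example.com"),
    table.insert_unique("bob", "x@example.com"),
]
other_region = table.insert_unique("zoe", "x@example.com")
```

The duplicate is rejected inside one region but accepted in the other, which is the limitation a partitioned index accepts in exchange for keeping lookups local.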

Storage Model

Each index record’s key starts with the region header, followed by indexed column values and the index name; the data record’s key is appended at the end. The value stores serialized metadata. This layout enables fast index lookups while keeping storage overhead modest.
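The key layout described above might be sketched as follows. The separator byte, the function names, and the sample values are assumptions made for illustration; the article does not specify the actual encoding, and the sketch assumes field values never contain the separator. Starting the key with the region header presumably keeps each index row sorted inside its own region's key range.

```python
SEP = b"\x00"  # assumed field separator; the article does not specify one


def index_row_key(region_header, column_values, index_name, data_row_key):
    """Region header, indexed column values, index name, then the data row key."""
    return SEP.join([region_header, *column_values, index_name, data_row_key])


def scan_prefix(region_header, leading_values):
    """Prefix shared by every index row in a region with these leading values."""
    return SEP.join([region_header, *leading_values]) + SEP


# Hypothetical index on (city, age) inside a region whose header is b"r1".
rows = sorted(
    index_row_key(b"r1", [city, age], b"idx_city_age", pk)
    for city, age, pk in [
        (b"beijing", b"030", b"u7"),
        (b"beijing", b"025", b"u3"),
        (b"shanghai", b"030", b"u9"),
    ]
)

# A lookup on city = beijing becomes a prefix scan over the sorted index rows.
prefix = scan_prefix(b"r1", [b"beijing"])
hits = [key for key in rows if key.startswith(prefix)]

# The data row key is the last field of each matching index key.
data_keys = [key.rsplit(SEP, 1)[-1] for key in hits]
```

Because the data row key is embedded at the end of the index key, a matching index row already identifies the data row to fetch.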

Pagination Mechanism

Instead of the traditional client‑side tracking of page‑end row keys for each region, Pharos introduces a central client that caches region breakpoints and issues a lightweight Session ID to the application. The application only needs to retain this Session ID to continue paging across regions.
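A minimal in‑memory sketch of this session‑based paging is shown below. The class `PagingClient`, its methods, and the region contents are hypothetical; a real implementation would scan HBase regions rather than Python lists, and the article does not describe how sessions are stored or expired.

```python
import uuid
from bisect import bisect_right


class PagingClient:
    """Toy central client that hides per-region breakpoints behind a session ID."""

    def __init__(self, regions):
        # regions: {region_name: sorted list of matching row keys}.
        # Each list stands in for the result of an index scan on one region.
        self._regions = regions
        self._sessions = {}

    def open(self, page_size):
        session_id = uuid.uuid4().hex
        self._sessions[session_id] = {"page_size": page_size, "breakpoints": {}}
        return session_id

    def next_page(self, session_id):
        state = self._sessions[session_id]
        page = []
        for name in sorted(self._regions):
            keys = self._regions[name]
            last = state["breakpoints"].get(name)
            # Resume just after the last row key returned from this region.
            start = 0 if last is None else bisect_right(keys, last)
            take = keys[start:start + state["page_size"] - len(page)]
            if take:
                page.extend(take)
                state["breakpoints"][name] = take[-1]
            if len(page) == state["page_size"]:
                break
        return page


client = PagingClient({"r1": ["a", "b", "c"], "r2": ["d", "e"]})
sid = client.open(page_size=2)
pages = [client.next_page(sid) for _ in range(4)]
```

The application holds only `sid`; the second page spans two regions without the caller ever seeing a region boundary or a row‑key breakpoint.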

Index‑Data Transaction Consistency

Because HBase lacks cross‑row transactions, Pharos cannot guarantee atomic updates between index and data rows. The solution draws inspiration from Google’s Percolator model: when an index entry is written, it carries an “uncertain” flag; after the data row is committed, the flag flips to “committed”. If a rollback occurs, the flag remains uncertain. During reads, Pharos checks the flag; if it is still uncertain, Pharos re‑validates the data row and updates the flag accordingly.

This approach adds roughly a 15% latency overhead on writes due to the extra flag update, but read‑time impact is negligible because uncertain states are rare.
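The flag protocol can be sketched with a toy in‑memory table. Everything here is hypothetical: the class `Table`, the `crash_after` parameter used to simulate failures, and in particular the choice to delete an index entry whose data row cannot be validated, since the article only says the flag is updated “accordingly”.

```python
UNCERTAIN, COMMITTED = "uncertain", "committed"


class Table:
    """Toy table with one secondary index guarded by a per-entry flag."""

    def __init__(self):
        self.data = {}   # row_key -> value
        self.index = {}  # (value, row_key) -> flag

    def put(self, row_key, value, crash_after=None):
        entry = (value, row_key)
        self.index[entry] = UNCERTAIN   # step 1: index entry, flagged uncertain
        if crash_after == "index":
            return                      # simulated failure before the data write
        self.data[row_key] = value      # step 2: data row
        if crash_after == "data":
            return                      # simulated failure before the flag flip
        self.index[entry] = COMMITTED   # step 3: extra write that flips the flag

    def lookup(self, value):
        hits = []
        for entry in [e for e in self.index if e[0] == value]:
            _, row_key = entry
            if self.index[entry] == UNCERTAIN:
                # Re-validate against the data row before trusting the entry.
                if self.data.get(row_key) == value:
                    self.index[entry] = COMMITTED
                else:
                    del self.index[entry]  # assumption: drop the stale entry
                    continue
            hits.append(row_key)
        return hits


t = Table()
t.put("u1", "beijing")
t.put("u2", "beijing", crash_after="index")  # index written, data never committed
t.put("u3", "beijing", crash_after="data")   # data committed, flag never flipped
result = t.lookup("beijing")
```

Step 3 is the extra write behind the write‑side overhead, while the read path pays for re‑validation only when it meets an entry still marked uncertain.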

Future Work

Pharos is currently in internal testing (v0.3 slated for release). Upcoming features focus on further query‑performance improvements, a proprietary data organization format to keep index and data co‑distributed during region splits, and the evolution from a pure secondary‑index component to a full‑featured middleware.

One critical enhancement is solving the region‑split problem: when a region splits, traditional indexes lose their co‑location with data. Pharos v0.22 introduced delayed index loading after split, while v0.3 will embed its own data layout to maintain index‑data alignment automatically.
