
Understanding Secondary Indexes and Coprocessor Solutions in HBase

This article explains the concept of secondary indexes in HBase, describes how coprocessors (including observers and endpoints) enable server‑side processing, compares coprocessor‑based solutions such as Apache Phoenix with non‑coprocessor approaches using Elasticsearch or Solr, and outlines their advantages and trade‑offs.

Big Data Technology & Architecture

Nov 7, 2021


What is a Secondary Index

In HBase, the rowkey acts as the primary index: rows are stored sorted by rowkey within each region, and the RegionServer persists them in an LSM-tree structure so that lookups and range scans by rowkey are efficient. However, HBase natively supports only rowkey-based access and full-table scans, which makes queries on other dimensions (for example, filtering by a column value) difficult.

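The purpose of a secondary index is to establish a mapping from column values to rowkeys, so that a query on a column value can first find the matching rowkeys and then read those rows by rowkey instead of scanning the whole table.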
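The mapping can be pictured with a small in-memory model. This is only a sketch under assumptions: the table contents, the `build_index` and `full_scan` helpers, and the `city` column are made up for illustration and are not HBase APIs.

```python
from collections import defaultdict

# Toy "table": rowkey -> {column: value}. Real HBase keeps rows sorted by rowkey.
table = {
    "user001": {"city": "Beijing", "age": "31"},
    "user002": {"city": "Shanghai", "age": "27"},
    "user003": {"city": "Beijing", "age": "45"},
}

def build_index(rows, column):
    """Map each value of `column` to the sorted list of rowkeys holding it."""
    index = defaultdict(list)
    for rowkey in sorted(rows):
        value = rows[rowkey].get(column)
        if value is not None:
            index[value].append(rowkey)
    return dict(index)

def full_scan(rows, column, value):
    """Without an index, every row has to be inspected."""
    return [rk for rk in sorted(rows) if rows[rk].get(column) == value]

city_index = build_index(table, "city")

# With the index, the lookup touches one entry instead of every row.
via_index = city_index.get("Beijing", [])
via_scan = full_scan(table, "city", "Beijing")
```

Both paths return the same rowkeys; the difference is that the scan cost grows with the table size, while the index lookup only depends on the number of matches.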

Coprocessor

Before building secondary indexes, it is necessary to understand the Coprocessor feature. In HBase versions before 0.92, even a simple task such as counting rows required a MapReduce job. Coprocessors, introduced in 0.92, allow business logic to run on the RegionServer next to the data, which reduces data transfer to the client and enables functions such as permission checks, secondary indexes, and integrity constraints.

Types of Coprocessors

Observer coprocessor – similar to database triggers; invoked on events like prePut, postPut, etc.

Endpoint coprocessor – similar to stored procedures; the client calls server‑side code explicitly for operations such as aggregation.
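The endpoint idea can be sketched without HBase at all. In the toy model below, each dictionary stands for one region, `region_endpoint` plays the role of the server-side code, and `client_aggregate` is the client merging partial results. All names and data are hypothetical and only illustrate why aggregation near the data reduces transfer.

```python
# Each "region" holds one slice of the table: rowkey -> numeric value.
regions = [
    {"row-a": 3, "row-b": 5},
    {"row-m": 7},
    {"row-x": 1, "row-y": 2, "row-z": 4},
]

def region_endpoint(region):
    """Runs 'server-side': returns one small (count, sum) pair per region."""
    return len(region), sum(region.values())

def client_aggregate(all_regions):
    """The client merges partial results instead of pulling every row."""
    partials = [region_endpoint(r) for r in all_regions]
    total_rows = sum(count for count, _ in partials)
    total_sum = sum(part_sum for _, part_sum in partials)
    return total_rows, total_sum

row_count, value_sum = client_aggregate(regions)
```

Only one small tuple per region crosses the simulated network boundary, regardless of how many rows each region holds.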

Observer coprocessors come in four main types: RegionObserver, RegionServerObserver, WALObserver, and MasterObserver. In HBase 1.x and earlier, each of these interfaces extends the Coprocessor interface.

Example flow with RegionObserver: the client sends a get request → the CoprocessorHost intercepts it → preGet() runs → the region processes the request → postGet() runs → the result is returned to the client.
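The flow above can be mimicked with a few plain classes. This is a simplified sketch, not the HBase API: `Region`, `AuditObserver`, and `CoprocessorHost` are hypothetical stand-ins, and returning `False` from `pre_get` to veto a request is an assumption made for brevity (real observers signal this differently).

```python
class Region:
    """Holds the rows and performs the actual read."""
    def __init__(self, rows):
        self.rows = rows

    def get(self, rowkey):
        return self.rows.get(rowkey)

class AuditObserver:
    """Hypothetical observer: logs each hook call and blocks 'secret' rows."""
    def __init__(self):
        self.events = []

    def pre_get(self, rowkey):
        self.events.append(("preGet", rowkey))
        return not rowkey.startswith("secret")  # False vetoes the request

    def post_get(self, rowkey, result):
        self.events.append(("postGet", rowkey))
        return result

class CoprocessorHost:
    """Intercepts requests and wraps region processing with observer hooks."""
    def __init__(self, region, observers):
        self.region = region
        self.observers = observers

    def get(self, rowkey):
        for obs in self.observers:
            if not obs.pre_get(rowkey):
                raise PermissionError(f"get on {rowkey!r} rejected by observer")
        result = self.region.get(rowkey)
        for obs in self.observers:
            result = obs.post_get(rowkey, result)
        return result

observer = AuditObserver()
host = CoprocessorHost(
    Region({"row1": {"cf:name": "alice"}, "secret-row": {"cf:name": "root"}}),
    [observer],
)
result = host.get("row1")
```

The same wrapping pattern is what makes observers suitable for permission checks: the pre-hook runs before the region is touched, so a rejected request never reads any data.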

Coprocessor Solutions (e.g., Phoenix)

Since HBase 0.94, the official documentation has suggested maintaining a secondary index table with custom Coprocessor logic and a dual‑write strategy: every write to the data table also triggers a write to the index table. Open‑source implementations include Huawei’s hindex and Apache Phoenix (SQL on HBase), which supports covered, functional, global, and local indexes.
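A dual-write index can be sketched as follows. This is a toy model under stated assumptions: `IndexedTable` is hypothetical, the `_post_put` method stands in for an observer hook, and the index key layout (`value`, a zero byte, then the data rowkey) is one common convention chosen for illustration, not something the article prescribes.

```python
class IndexedTable:
    """Toy data table plus index table kept in sync on every put."""
    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.data = {}    # data table:  rowkey -> {column: value}
        self.index = {}   # index table: "value\x00rowkey" -> rowkey

    def _index_key(self, value, rowkey):
        return f"{value}\x00{rowkey}"

    def put(self, rowkey, columns):
        old_value = self.data.get(rowkey, {}).get(self.indexed_column)
        self.data.setdefault(rowkey, {}).update(columns)  # first write: data
        self._post_put(rowkey, old_value, columns)        # second write: index

    def _post_put(self, rowkey, old_value, columns):
        """Stand-in for an observer hook that maintains the index table."""
        new_value = columns.get(self.indexed_column)
        if new_value is None or new_value == old_value:
            return
        if old_value is not None:
            # Drop the stale entry so the old value no longer finds this row.
            self.index.pop(self._index_key(old_value, rowkey), None)
        self.index[self._index_key(new_value, rowkey)] = rowkey

    def lookup(self, value):
        """Prefix scan over the index table, then return the data rowkeys."""
        prefix = f"{value}\x00"
        return sorted(rk for key, rk in self.index.items() if key.startswith(prefix))

users = IndexedTable("city")
users.put("user001", {"city": "Beijing"})
users.put("user002", {"city": "Beijing"})
users.put("user001", {"city": "Shenzhen"})  # update moves user001 in the index
```

The update case is the part that is easy to get wrong: without removing the stale entry, a lookup for the old value would keep returning a row that no longer matches.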

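Advantages: index management details are hidden from users, because the index is maintained on the server side. Disadvantages: the approach is invasive, since the index logic runs inside the RegionServer and adds overhead to it.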

Non‑Coprocessor Solutions

Alternatives avoid Coprocessors and build the index outside HBase, using Elasticsearch or Apache Solr, both of which are based on Apache Lucene. Tools such as Lily HBase Indexer (NGDATA) monitor the HBase WAL to update Solr indexes asynchronously, while Cloudera’s CDH Search integrates similar functionality.
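The asynchronous, log-driven style of indexing can be illustrated with a toy model. Everything here is an assumption for illustration: `ToyHBase` keeps its write-ahead log as a Python list, and `AsyncIndexer.poll` stands in for whatever mechanism a real indexer uses to consume new log entries.

```python
class ToyHBase:
    """Stores rows and appends every edit to a write-ahead log."""
    def __init__(self):
        self.rows = {}
        self.wal = []  # append-only list of (rowkey, column, value) edits

    def put(self, rowkey, column, value):
        self.wal.append((rowkey, column, value))
        self.rows.setdefault(rowkey, {})[column] = value

class AsyncIndexer:
    """Replays new log entries into search-side documents when polled."""
    def __init__(self, source, indexed_columns):
        self.source = source
        self.indexed_columns = set(indexed_columns)
        self.offset = 0        # position of the next unread log entry
        self.documents = {}    # rowkey -> {column: value} for indexed columns

    def poll(self):
        new_edits = self.source.wal[self.offset:]
        for rowkey, column, value in new_edits:
            if column in self.indexed_columns:
                self.documents.setdefault(rowkey, {})[column] = value
        self.offset += len(new_edits)
        return len(new_edits)

hbase = ToyHBase()
indexer = AsyncIndexer(hbase, indexed_columns=["status"])

hbase.put("order1", "status", "paid")
hbase.put("order1", "note", "gift wrap")

visible_before_poll = "order1" in indexer.documents  # index lags behind writes
applied = indexer.poll()
```

The write path never waits for the indexer, which keeps load off the RegionServer; the trade-off shown by `visible_before_poll` is that a freshly written row is not searchable until the indexer catches up.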

Typical workflow: index relevant columns and rowkey in Solr, query Solr to obtain matching rowkeys, then fetch rows from HBase.
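The two-step lookup can be sketched in a few lines. This is a simplified model, not the Solr or HBase client API: `build_search_index`, `search`, and `fetch_rows` are hypothetical helpers, and the order data is invented for the example.

```python
# Toy HBase table: rowkey -> columns.
hbase_rows = {
    "order1": {"status": "paid", "city": "Beijing", "amount": "120"},
    "order2": {"status": "paid", "city": "Shanghai", "amount": "80"},
    "order3": {"status": "refunded", "city": "Beijing", "amount": "45"},
    "order4": {"status": "paid", "city": "Beijing", "amount": "300"},
}

def build_search_index(rows, fields):
    """Index only the chosen fields: (field, value) -> set of rowkeys."""
    index = {}
    for rowkey, columns in rows.items():
        for field in fields:
            if field in columns:
                index.setdefault((field, columns[field]), set()).add(rowkey)
    return index

def search(index, **criteria):
    """Step 1: return rowkeys matching ALL field=value criteria."""
    matches = None
    for field, value in criteria.items():
        hits = index.get((field, value), set())
        matches = hits if matches is None else matches & hits
    return sorted(matches or [])

def fetch_rows(rows, rowkeys):
    """Step 2: fetch the full rows from the table by rowkey."""
    return {rk: rows[rk] for rk in rowkeys if rk in rows}

search_index = build_search_index(hbase_rows, fields=["status", "city"])
rowkeys = search(search_index, status="paid", city="Beijing")
results = fetch_rows(hbase_rows, rowkeys)
```

Only the indexed fields and the rowkey live in the search side; unindexed columns such as `amount` are read from the table in the second step.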

Other Solutions

Companies with large data teams often customize their own ES/Solr clusters for high‑performance indexing and search.


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
