An Overview of Lucene: Architecture, Indexing Workflow, and Code Implementation

The article introduces Apache Lucene 7.3.1, explains its core architecture and index hierarchy, details the two‑phase indexing and search workflow with code examples for document addition, deletion, merging, and query execution, and highlights its suitability for small‑to‑medium projects versus distributed alternatives.

vivo Internet Technology

Jul 14, 2021

This article introduces Apache Lucene (version 7.3.1), an open-source full-text search library. It covers what Lucene is, where it is typically used, how an index is generated, how a search is processed, and how result ranking can be tuned.

1. Lucene Basics

Lucene is a top-level Apache project (it began as a sub-project of Apache Jakarta) that offers a complete indexing and query engine with language-specific analyzers. It is the core library behind popular search servers such as Elasticsearch and Solr.

Typical use cases involve small‑to‑medium data sets; for very large indexes, distributed solutions like Elasticsearch are recommended.

2. Lucene Working Flow

The indexing process consists of two phases:

Creation phase: Documents are added via IndexWriter.addDocument, which produces the forward index (stored field) files; a flush then writes the buffered inverted index to disk as part of a new segment, and merges later rewrite these files as segments are combined.

Search phase: Users submit queries; IndexReader reads the index, IndexSearcher retrieves the matching documents, and the results are sorted by the score that the chosen scoring algorithm assigns.

The overall flow is illustrated by a diagram (omitted).

3. Lucene Index Structure

Lucene’s logical hierarchy consists of index → segment → document → field → term. The forward index stores raw document data, while the inverted index enables fast term-to-document lookups.

Key components:

Index: The complete set of segments, stored in a Directory (typically on disk).

Segment: An independent sub-index; a large number of segments increases I/O overhead, so Lucene merges them periodically.

Document: A collection of fields stored within a segment.

Field: An individual piece of data (e.g., title, body) within a document.

Term: A token produced by an analyzer; the basis for full-text search.

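The inverted index is stored mainly in three files per segment: .tim (the term dictionary), .tip (the term index, which points into the term dictionary), and .doc (the postings: document IDs and term frequencies). From version 4 onward, Lucene uses a Finite State Transducer (FST) for the term index, which reduces the memory consumed by term dictionaries.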
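The difference between the forward and inverted index can be sketched in a few lines of plain Java. This is a toy in-memory model, not Lucene's file format: the class name InvertedIndexSketch and its methods are made up for illustration, and tokenization is reduced to lowercasing and splitting on whitespace.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of a forward and an inverted index; not Lucene's on-disk format. */
class InvertedIndexSketch {
    // Forward index: the position in the list is the doc ID, the value is the raw text.
    private final List<String> forward = new ArrayList<>();
    // Inverted index: term -> ascending list of doc IDs that contain the term.
    private final Map<String, List<Integer>> inverted = new TreeMap<>();

    /** Stores the text and indexes its lowercased, whitespace-separated terms. */
    int add(String text) {
        int docId = forward.size();
        forward.add(text);
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> postings = inverted.computeIfAbsent(term, t -> new ArrayList<>());
            // Skip the doc ID if this term already occurred in the same document.
            if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                postings.add(docId);
            }
        }
        return docId;
    }

    /** Term-to-documents lookup (the inverted direction). */
    List<Integer> postings(String term) {
        return inverted.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyList());
    }

    /** Document-to-content lookup (the forward direction). */
    String stored(int docId) {
        return forward.get(docId);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("Lucene builds an inverted index");
        index.add("An index has segments");
        System.out.println(index.postings("index")); // prints [0, 1]
        System.out.println(index.stored(1));         // prints An index has segments
    }
}
```

A sorted map is used here only because a term dictionary is kept in term order; Lucene's actual structures (FST plus block-encoded postings) are far more compact.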

4. Document CRUD Operations

All index modifications are performed through IndexWriter, which relies on Directory (storage abstraction) and IndexWriterConfig (configuration).

4.1 Adding Documents

When a document is added, the indexing thread first obtains a ThreadState, which holds a DocumentsWriterPerThread (DWPT), from a pool of free states. The following snippet shows how a ThreadState is obtained and locked:

ThreadState getAndLock(Thread requestingThread, DocumentsWriter documentsWriter) {
  ThreadState threadState = null;
  synchronized (this) {
    if (freeList.isEmpty()) {
      // No idle ThreadState: create one (newThreadState() returns it already locked)
      return newThreadState();
    } else {
      // LIFO: reuse the most recently released ThreadState
      threadState = freeList.remove(freeList.size() - 1);
      // Prefer a ThreadState that already has an initialized DocumentsWriterPerThread
      if (threadState.dwpt == null) {
        for (int i = 0; i < freeList.size(); i++) {
          ThreadState ts = freeList.get(i);
          if (ts.dwpt != null) {
            freeList.set(i, threadState);
            threadState = ts;
            break;
          }
        }
      }
    }
  }
  threadState.lock();
  return threadState;
}
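The free-list logic above can be tried out in isolation with a simplified model. The sketch below is not Lucene code: StatePoolSketch, State, and the buffer field (standing in for dwpt) are hypothetical names, and the pool always locks the state at the end rather than inside a factory method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

/** Simplified model of a LIFO pool that prefers already-initialized states. */
class StatePoolSketch {
    static class State extends ReentrantLock {
        Object buffer; // stands in for the dwpt field; null means "not initialized"
    }

    private final List<State> freeList = new ArrayList<>();

    State getAndLock() {
        State state;
        synchronized (this) {
            if (freeList.isEmpty()) {
                state = new State();
            } else {
                // LIFO: take the most recently released state first.
                state = freeList.remove(freeList.size() - 1);
                if (state.buffer == null) {
                    // Swap in a free state that already holds buffered data, if any.
                    for (int i = 0; i < freeList.size(); i++) {
                        State candidate = freeList.get(i);
                        if (candidate.buffer != null) {
                            freeList.set(i, state);
                            state = candidate;
                            break;
                        }
                    }
                }
            }
        }
        // Lock outside the synchronized block so waiting does not stall the pool.
        state.lock();
        return state;
    }

    void release(State state) {
        state.unlock();
        synchronized (this) {
            freeList.add(state);
        }
    }

    public static void main(String[] args) {
        StatePoolSketch pool = new StatePoolSketch();
        State a = pool.getAndLock();
        State b = pool.getAndLock();
        a.buffer = "buffered docs";
        pool.release(a);
        pool.release(b); // b is now the last entry of the free list
        State next = pool.getAndLock();
        System.out.println(next == a); // prints true: the initialized state wins
    }
}
```

Preferring a state that already buffers documents means that data does not sit idle in memory when the number of indexing threads drops.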

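During indexing, each field is processed according to its FieldType. The code below demonstrates how Lucene decides whether to write postings, store the field, or add doc values. In this excerpt, fp is the per-field state; it stays null until one of the three branches initializes it: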

// 1. Indexed field: invert it into postings
if (fieldType.indexOptions() != IndexOptions.NONE) {
  fp = getOrAddField(fieldName, fieldType, true);
  boolean first = fp.fieldGen != fieldGen;
  fp.invert(field, first);
  if (first) {
    fields[fieldCount++] = fp;
    fp.fieldGen = fieldGen;
  }
} else {
  verifyUnIndexedFieldType(fieldName, fieldType);
}

// 2. Stored field: write the original value to the stored-fields files
if (fieldType.stored()) {
  if (fp == null) {
    fp = getOrAddField(fieldName, fieldType, false);
  }
  String value = field.stringValue();
  if (value != null && value.length() > IndexWriter.MAX_STORED_STRING_LENGTH) {
    throw new IllegalArgumentException("stored field \"" + field.name() + "\" is too large (" + value.length() + " characters) to store");
  }
  try {
    storedFieldsConsumer.writeField(fp.fieldInfo, field);
  } catch (Throwable th) {
    throw AbortingException.wrap(th);
  }
}

// 3. Doc values: add the value to the column-oriented store
DocValuesType dvType = fieldType.docValuesType();
if (dvType == null) {
  throw new NullPointerException("docValuesType must not be null (field: \"" + fieldName + "\")");
}
if (dvType != DocValuesType.NONE) {
  if (fp == null) {
    fp = getOrAddField(fieldName, fieldType, false);
  }
  indexDocValue(fp, dvType, field);
}
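The point of the excerpt is that the three checks are independent: one field can be indexed, stored, both, or neither, and can carry doc values on top. A toy router makes this visible. FieldRoutingSketch and FieldSpec are hypothetical names, and the three lists merely stand in for the postings, stored-fields, and doc-values writers.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy router mirroring three independent per-field checks; all names are hypothetical. */
class FieldRoutingSketch {
    static class FieldSpec {
        final String name;
        final boolean indexed;
        final boolean stored;
        final boolean docValues;

        FieldSpec(String name, boolean indexed, boolean stored, boolean docValues) {
            this.name = name;
            this.indexed = indexed;
            this.stored = stored;
            this.docValues = docValues;
        }
    }

    final List<String> inverted = new ArrayList<>();
    final List<String> storedFields = new ArrayList<>();
    final List<String> docValueFields = new ArrayList<>();

    /** Each flag is checked on its own, so a field may land in several destinations. */
    void process(FieldSpec field) {
        if (field.indexed) {
            inverted.add(field.name);
        }
        if (field.stored) {
            storedFields.add(field.name);
        }
        if (field.docValues) {
            docValueFields.add(field.name);
        }
    }

    public static void main(String[] args) {
        FieldRoutingSketch chain = new FieldRoutingSketch();
        chain.process(new FieldSpec("title", true, true, false));  // searchable and retrievable
        chain.process(new FieldSpec("body", true, false, false));  // searchable only
        chain.process(new FieldSpec("price", false, false, true)); // sortable column only
        System.out.println("inverted=" + chain.inverted);
        System.out.println("stored=" + chain.storedFields);
        System.out.println("docValues=" + chain.docValueFields);
    }
}
```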

Tokenization is performed by an Analyzer, which builds a TokenStream from a tokenizer followed by a chain of filters. The following excerpt shows how StandardAnalyzer builds its component chain:

protected TokenStreamComponents createComponents(final String fieldName) {
  final StandardTokenizer src = new StandardTokenizer();
  src.setMaxTokenLength(maxTokenLength);
  TokenStream tok = new StandardFilter(src);
  tok = new LowerCaseFilter(tok);
  tok = new StopFilter(tok, stopwords);
  return new TokenStreamComponents(src, tok) {
    @Override
    protected void setReader(final Reader reader) {
      src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
      super.setReader(reader);
    }
  };
}
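The effect of such a chain (tokenize, lowercase, remove stop words) can be imitated with plain Java. This is only an approximation under stated assumptions: the splitting rule is a simple regular expression rather than StandardTokenizer's Unicode segmentation, the stop-word list is invented for the example, and AnalyzerChainSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy tokenizer -> lowercase -> stop-word pipeline; not Lucene's implementation. */
class AnalyzerChainSketch {
    // Illustrative stop-word list, assumed for this example only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Tokenizer stage: split on anything that is not a letter or a digit.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            // Lowercase stage.
            String lower = raw.toLowerCase(Locale.ROOT);
            // Stop-word stage: drop very common words.
            if (STOP_WORDS.contains(lower)) {
                continue;
            }
            tokens.add(lower);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick-Brown fox is a FOX")); // prints [quick, brown, fox, fox]
    }
}
```

The order of the stages matters: lowercasing runs before the stop-word check, so "The" is removed even though the stop list holds only lowercase entries.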

Deletions are not applied immediately: delete terms are added to a queue and applied during a flush. An update is handled as a delete by term followed by an add of the new document. Example deletion code:

public synchronized long deleteTerms(final Term... terms) throws IOException {
  final DocumentsWriterDeleteQueue deleteQueue = this.deleteQueue;
  long seqNo = deleteQueue.addDelete(terms);
  flushControl.doOnDelete();
  lastSeqNo = Math.max(lastSeqNo, seqNo);
  if (applyAllDeletes(deleteQueue)) {
    seqNo = -seqNo;
  }
  return seqNo;
}
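The idea of queueing deletes and applying them later can be shown with a small model. The sketch below is an assumption-laden simplification, not Lucene's DocumentsWriterDeleteQueue: documents are reduced to an ID string, adds and deletes share one ordered queue, and DeleteQueueSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model: adds and deletes are queued in order and only take effect on flush. */
class DeleteQueueSketch {
    private static final class Op {
        final boolean isDelete;
        final String id;

        Op(boolean isDelete, String id) {
            this.isDelete = isDelete;
            this.id = id;
        }
    }

    private final List<Op> queue = new ArrayList<>();
    private final Set<String> liveDocs = new LinkedHashSet<>();
    private long nextSeqNo = 1;

    /** Queues an add and returns its sequence number. */
    synchronized long addDocument(String id) {
        queue.add(new Op(false, id));
        return nextSeqNo++;
    }

    /** Queues a delete and returns its sequence number; nothing is removed yet. */
    synchronized long deleteByTerm(String id) {
        queue.add(new Op(true, id));
        return nextSeqNo++;
    }

    /** Replays the queue in sequence order, so a delete only hits earlier adds. */
    synchronized void flush() {
        for (Op op : queue) {
            if (op.isDelete) {
                liveDocs.remove(op.id);
            } else {
                liveDocs.add(op.id);
            }
        }
        queue.clear();
    }

    synchronized List<String> liveDocs() {
        return new ArrayList<>(liveDocs);
    }

    public static void main(String[] args) {
        DeleteQueueSketch writer = new DeleteQueueSketch();
        writer.addDocument("1");
        writer.addDocument("2");
        writer.deleteByTerm("1");
        System.out.println(writer.liveDocs()); // prints []: nothing is visible before the flush
        writer.flush();
        System.out.println(writer.liveDocs()); // prints [2]
    }
}
```

Because operations are replayed in sequence order, re-adding document "1" after the delete would keep it alive, which is the behavior an update relies on.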

Flush is triggered either by document count (maxBufferedDocs) or RAM usage (ramBufferSizeMB). The snippet below shows the RAM-based flush decision:

if (totalRam >= limit) {
  if (infoStream.isEnabled("FP")) {
    infoStream.message("FP", "trigger flush: activeBytes=" + control.activeBytes() + " deleteBytes=" + control.getDeleteBytesUsed() + " vs limit=" + limit);
  }
  markLargestWriterPending(control, state, totalRam);
}
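The RAM-based rule (once total buffered bytes reach the limit, flush the largest writer) can be sketched as follows. This is a simplified model with hypothetical names (RamFlushSketch, onBytesUsed); it tracks only per-writer byte counts and ignores the bytes used by buffered deletes that the real check also considers.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy RAM-based flush policy: when the total reaches the limit, pick the largest writer. */
class RamFlushSketch {
    private final long limitBytes;
    private final Map<String, Long> activeBytes = new LinkedHashMap<>();

    RamFlushSketch(double ramBufferSizeMB) {
        this.limitBytes = (long) (ramBufferSizeMB * 1024 * 1024);
    }

    /** Records new usage; returns the writer chosen for flushing, or null if under the limit. */
    String onBytesUsed(String writer, long bytes) {
        activeBytes.merge(writer, bytes, Long::sum);
        long total = 0;
        for (long used : activeBytes.values()) {
            total += used;
        }
        if (total < limitBytes) {
            return null;
        }
        // Over the limit: select the writer holding the most bytes.
        String largest = null;
        for (Map.Entry<String, Long> entry : activeBytes.entrySet()) {
            if (largest == null || entry.getValue() > activeBytes.get(largest)) {
                largest = entry.getKey();
            }
        }
        // Its buffer is handed off for flushing, so it no longer counts as active.
        activeBytes.remove(largest);
        return largest;
    }

    public static void main(String[] args) {
        RamFlushSketch policy = new RamFlushSketch(1.0); // 1 MB limit, chosen for the demo
        System.out.println(policy.onBytesUsed("w1", 400_000)); // prints null
        System.out.println(policy.onBytesUsed("w2", 500_000)); // prints null
        System.out.println(policy.onBytesUsed("w1", 200_000)); // prints w1
    }
}
```

Flushing the largest writer frees the most memory per flush and tends to produce fewer, larger segments.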

Segment merging involves two components: a MergePolicy (such as TieredMergePolicy or LogMergePolicy) selects which segments to merge, and a MergeScheduler runs the selected merges. The following code illustrates how pending merges are discovered:

private synchronized boolean updatePendingMerges(MergePolicy mergePolicy, MergeTrigger trigger, int maxNumSegments) throws IOException {
  // True once the merge policy returns a merge specification
  boolean newMergesFound = false;
  final MergePolicy.MergeSpecification spec;
  if (maxNumSegments != UNBOUNDED_MAX_MERGE_SEGMENTS) {
    spec = mergePolicy.findForcedMerges(segmentInfos, maxNumSegments, Collections.unmodifiableMap(segmentsToMerge), this);
    newMergesFound = spec != null;
    if (newMergesFound) {
      for (int i = 0; i < spec.merges.size(); i++) {
        MergePolicy.OneMerge merge = spec.merges.get(i);
        merge.maxNumSegments = maxNumSegments;
      }
    }
  } else {
    spec = mergePolicy.findMerges(trigger, segmentInfos, this);
  }
  newMergesFound = spec != null;
  if (newMergesFound) {
    for (int i = 0; i < spec.merges.size(); i++) {
      registerMerge(spec.merges.get(i));
    }
  }
  return newMergesFound;
}
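What a merge policy decides can be illustrated with a deliberately naive rule: when there are too many segments, merge the smallest ones into one. This is not TieredMergePolicy or LogMergePolicy; MergeSketch, mergeOnce, maxSegments, and mergeFactor are hypothetical names, and segments are represented by their sizes only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Naive merge selection: combine the smallest segments when there are too many. */
class MergeSketch {
    /**
     * Runs one merge pass. If more than maxSegments exist, the mergeFactor smallest
     * segments are replaced by a single segment of their combined size.
     * Returns the resulting segment sizes in ascending order.
     */
    static List<Long> mergeOnce(List<Long> segmentSizes, int maxSegments, int mergeFactor) {
        List<Long> sorted = new ArrayList<>(segmentSizes);
        Collections.sort(sorted);
        if (sorted.size() <= maxSegments) {
            return sorted; // nothing to do
        }
        int count = Math.min(mergeFactor, sorted.size());
        long mergedSize = 0;
        for (int i = 0; i < count; i++) {
            mergedSize += sorted.get(i);
        }
        List<Long> result = new ArrayList<>(sorted.subList(count, sorted.size()));
        result.add(mergedSize);
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        List<Long> sizes = Arrays.asList(100L, 5L, 3L, 50L, 2L);
        System.out.println(mergeOnce(sizes, 3, 3)); // prints [10, 50, 100]
    }
}
```

Merging small segments first keeps the cost of each merge low; real policies additionally weigh deleted documents and size tiers.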

5. Search Execution

Loading an index involves reading segment metadata (via the segments_N commit file and the per-segment .si files; the older segments.gen file is no longer written in Lucene 7) and then opening the inverted index and stored field files. Once loaded, an IndexReader is wrapped by an IndexSearcher, which executes queries (e.g., BooleanQuery, PhraseQuery, TermQuery, PrefixQuery) against the index and scores the hits with a similarity algorithm such as BM25.
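At the level of postings, a query that requires two terms to match boils down to intersecting two sorted lists of document IDs. The sketch below shows the classic two-pointer intersection; it is a conceptual illustration with hypothetical names, whereas Lucene's own scorers work on iterators and can skip ahead within a postings list.

```java
import java.util.ArrayList;
import java.util.List;

/** Two-pointer intersection of sorted postings lists, as used conceptually for AND queries. */
class ConjunctionSketch {
    /** Both arrays must be sorted in ascending order without duplicates. */
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> matches = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                matches.add(a[i]); // document contains both terms
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++; // advance the list that is behind
            } else {
                j++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        int[] luceneDocs = {1, 3, 5, 8};
        int[] indexDocs = {2, 3, 8, 9};
        System.out.println(intersect(luceneDocs, indexDocs)); // prints [3, 8]
    }
}
```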

Relevance ranking can be tuned by wrapping a query in a BoostQuery or by supplying an alternative Similarity implementation.
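To make the scoring step concrete, here is a minimal BM25 calculation for a single term. It follows the commonly published formula with k1 = 1.2 and b = 0.75 and multiplies in a boost factor; it assumes exact document lengths, whereas Lucene stores lengths in a compressed form, so actual scores can differ slightly. Bm25Sketch and its method names are hypothetical.

```java
/** Single-term BM25 score with exact document lengths; a sketch, not Lucene's BM25Similarity. */
class Bm25Sketch {
    static final double K1 = 1.2;  // term-frequency saturation
    static final double B = 0.75;  // strength of length normalization

    /** Inverse document frequency: rarer terms get a higher weight. */
    static double idf(long docCount, long docFreq) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /**
     * @param freq      occurrences of the term in the document
     * @param docLen    length of the document in terms
     * @param avgDocLen average document length in the index
     * @param docCount  number of documents in the index
     * @param docFreq   number of documents containing the term
     * @param boost     multiplier such as the one a boosted query would contribute
     */
    static double score(double freq, double docLen, double avgDocLen,
                        long docCount, long docFreq, double boost) {
        double lengthNorm = K1 * (1 - B + B * docLen / avgDocLen);
        return boost * idf(docCount, docFreq) * freq * (K1 + 1) / (freq + lengthNorm);
    }

    public static void main(String[] args) {
        double shortDoc = score(2, 50, 100, 1000, 10, 1.0);
        double longDoc = score(2, 400, 100, 1000, 10, 1.0);
        double boosted = score(2, 400, 100, 1000, 10, 2.0);
        System.out.println(shortDoc > longDoc);      // prints true: same frequency, shorter document
        System.out.println(boosted == 2 * longDoc);  // prints true: the boost scales the score
    }
}
```

The example shows the two levers mentioned above: a boost rescales a clause's contribution, while changing the similarity changes the formula itself.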

6. Conclusion

Lucene provides powerful full‑text search capabilities for small‑to‑medium projects, but it has limitations: high memory consumption for large indexes, complex configuration for each field, and lack of native clustering. For large‑scale scenarios, distributed search engines like Elasticsearch are recommended.
