Class RagBuilder

java.lang.Object
  ecmwf.common.ai.RagBuilder

public class RagBuilder extends Object

A Lucene-based Retrieval-Augmented Generation (RAG) index builder and search engine. Supports BM25 keyword search, vector similarity search, and hybrid ranking using Reciprocal Rank Fusion (RRF). Indexes large document collections incrementally, paragraph by paragraph, to keep memory use low.

Features:

  • BM25 keyword search with Lucene StandardAnalyzer and QueryParser
  • Vector similarity search using embeddings and cosine similarity
  • Hybrid ranking via reciprocalRankFusion(ScoreDoc[], ScoreDoc[], int)
  • Incremental, memory-efficient indexing using paragraph splitting and batch embedding
  • Metadata support including filename, source path, paragraph index, and preview (first sentence)
  • Failure-tolerant indexing: logs individual document/paragraph errors instead of throwing
  • Configurable RRF smoothing parameter (rrfK) to adjust fusion weighting
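
The hybrid ranking step can be illustrated with a minimal, standalone sketch of Reciprocal Rank Fusion in its standard formulation, where each ranked list contributes 1 / (k + rank) per document. This is not the class's own reciprocalRankFusion(ScoreDoc[], ScoreDoc[], int): the class name RrfSketch, the method fuse, and the use of plain int document IDs instead of Lucene ScoreDoc objects are assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of ecmwf.common.ai.
class RrfSketch {

    /**
     * Fuses two ranked lists of document IDs (best first) with RRF.
     * Each list contributes 1 / (k + rank) per document, with rank starting at 1.
     */
    static List<Integer> fuse(int[] keywordHits, int[] vectorHits, int k, int topK) {
        Map<Integer, Double> scores = new HashMap<>();
        accumulate(scores, keywordHits, k);
        accumulate(scores, vectorHits, k);

        List<Integer> ids = new ArrayList<>(scores.keySet());
        // Highest fused score first; ties broken by document ID for determinism.
        ids.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return ids.subList(0, Math.min(topK, ids.size()));
    }

    private static void accumulate(Map<Integer, Double> scores, int[] ranked, int k) {
        for (int i = 0; i < ranked.length; i++) {
            double contribution = 1.0 / (k + i + 1); // i is 0-based, rank is 1-based
            scores.merge(ranked[i], contribution, Double::sum);
        }
    }

    public static void main(String[] args) {
        int[] bm25 = {7, 3, 9};   // keyword ranking, best first
        int[] vector = {3, 5, 7}; // vector ranking, best first
        System.out.println(fuse(bm25, vector, 60, 3)); // prints [3, 7, 5]
    }
}
```

Documents that appear in both lists (3 and 7 here) outrank documents found by only one ranker. A larger k shrinks the gap between adjacent ranks, so agreement between the two lists matters more than a top position in either one.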

Thread-safety:

  • search(String, int) is thread-safe (read-only IndexSearcher)
  • Indexing is single-threaded and not safe for concurrent writes

Typical usage:

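Path indexPath = Path.of("data/lucene-index");
EmbeddingModel embeddingModel = ...; // your embedding model
// embeddingBatch = 64 (segments per embedding batch), rrfK = 60
RagBuilder rag = new RagBuilder(indexPath, embeddingModel, 64, 60);
// Index the Markdown files under data/docs unless the index already exists
rag.buildIfNeeded(Path.of("data/docs"), path -> path.toString().endsWith(".md"));
// Hybrid search returning at most 5 segments
List<TextSegment> results = rag.search("example query", 5);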

Designed for moderate to large collections. For very large corpora, consider incremental or distributed indexing/search solutions.

Author:
Laurent Gougeon
  • Constructor Summary

    Constructors
    Constructor
    Description
    RagBuilder(Path indexPath, dev.langchain4j.model.embedding.EmbeddingModel embeddingModel, int embeddingBatch, int rrfK)
    Constructs a new RagBuilder.
  • Method Summary

    Modifier and Type
    Method
    Description
    void
    buildIfNeeded(Path docsPath, Predicate<Path> filter)
    Builds the RAG index if it does not already exist.
    List<dev.langchain4j.data.segment.TextSegment>
    search(String query, int topK)
    Performs hybrid search over BM25 and vector embeddings.

    Methods inherited from class Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • RagBuilder

      public RagBuilder(Path indexPath, dev.langchain4j.model.embedding.EmbeddingModel embeddingModel, int embeddingBatch, int rrfK)
      Constructs a new RagBuilder.
      Parameters:
      indexPath - path to store the Lucene index
      embeddingModel - embedding model for semantic search
      embeddingBatch - batch size for embedding segments
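      rrfK - smoothing constant k for Reciprocal Rank Fusion; in the standard formulation each ranked list contributes 1 / (k + rank) per document, so larger values reduce the difference between adjacent ranks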
  • Method Details

    • buildIfNeeded

      public void buildIfNeeded(Path docsPath, Predicate<Path> filter)
      Builds the RAG index if it does not already exist.

      Uses paragraph splitting, batch embedding, and incremental indexing. Each Lucene document stores the paragraph text, its embedding, and metadata (filename, source path, paragraph index, and preview).

      Parameters:
      docsPath - the root directory containing documents
      filter - predicate to select files; if null, all files are indexed
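
The paragraph splitting and batching described for buildIfNeeded can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the class's implementation: the names ParagraphBatchSketch, splitParagraphs and batches are hypothetical, paragraphs are assumed to be separated by blank lines, and the embedding and Lucene write steps are only indicated by a comment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of ecmwf.common.ai.
class ParagraphBatchSketch {

    /** Splits text into non-empty paragraphs, assuming blank lines separate them. */
    static List<String> splitParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        // Two line breaks with optional whitespace between them mark a paragraph boundary.
        for (String block : text.split("\\R\\s*\\R")) {
            String trimmed = block.strip();
            if (!trimmed.isEmpty()) {
                paragraphs.add(trimmed);
            }
        }
        return paragraphs;
    }

    /** Groups items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> batches(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> out = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            out.add(new ArrayList<>(items.subList(start, end)));
        }
        return out;
    }

    public static void main(String[] args) {
        String doc = "First paragraph.\n\nSecond paragraph,\nstill second.\n\n\nThird paragraph.";
        List<String> paragraphs = splitParagraphs(doc);
        for (List<String> batch : batches(paragraphs, 2)) {
            // A real indexer would embed the batch here and write one document per paragraph.
            System.out.println(batch.size() + " paragraph(s) in batch");
        }
    }
}
```

Processing one batch at a time means only embeddingBatch paragraphs and their vectors need to be held in memory at once, which is the usual motivation for batching during indexing.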
    • search

      public List<dev.langchain4j.data.segment.TextSegment> search(String query, int topK)
      Performs hybrid search over BM25 and vector embeddings.

      Dynamically adjusts candidate pool size for longer queries. Returns top-k segments ranked by combined relevance. Each segment includes metadata: filename, sourcePath, paragraph index, and preview.

      Parameters:
      query - the user query
      topK - maximum number of results to return
      Returns:
      list of at most topK TextSegments, ranked by combined relevance; empty if the search fails or the query is blank
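
The documentation does not state how the candidate pool is adjusted for longer queries. Purely as an illustration of the idea, the sketch below grows the pool with the number of whitespace-separated query tokens. The class name CandidatePoolSketch, the method candidatePoolSize, the multiplier of 4, the step of 5 tokens, and the cap of 1000 are all hypothetical choices, not values taken from RagBuilder.

```java
// Hypothetical illustration; the real sizing rule in RagBuilder is not documented.
class CandidatePoolSketch {

    /** Assumed upper bound on candidates fetched from each ranker. */
    static final int MAX_POOL = 1000;

    /**
     * Returns how many candidates to fetch from each ranker before fusion.
     * Assumed heuristic: 4 x topK as a base, plus one extra topK per 5 query tokens.
     */
    static int candidatePoolSize(String query, int topK) {
        if (query == null || query.isBlank() || topK <= 0) {
            return 0; // nothing to search for
        }
        int tokens = query.strip().split("\\s+").length;
        int pool = topK * (4 + tokens / 5);
        return Math.min(pool, MAX_POOL);
    }

    public static void main(String[] args) {
        System.out.println(candidatePoolSize("example query", 5)); // prints 20
        System.out.println(candidatePoolSize(
                "how does the hybrid ranking combine keyword and vector results in practice", 5)); // prints 30
    }
}
```

Fetching more candidates than topK from each ranker gives the fusion step room to promote documents that rank moderately in both lists but at the top of neither.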