ecmwf.common.ai.RagBuilder

public class RagBuilder extends Object

A Lucene-based Retrieval-Augmented Generation (RAG) builder and search engine. Supports BM25 keyword search, vector similarity search, and hybrid ranking using Reciprocal Rank Fusion (RRF). Designed for memory-efficient, operational indexing of large document collections.

Features:

- BM25 keyword search with Lucene StandardAnalyzer and QueryParser
- Vector similarity search using embeddings and cosine similarity
- Hybrid ranking via reciprocalRankFusion(ScoreDoc[], ScoreDoc[], int)
- Incremental, memory-efficient indexing using paragraph splitting and batch embedding
- Metadata support including filename, source path, paragraph index, and preview (first sentence)
- Failure-tolerant indexing: logs individual document/paragraph errors instead of throwing
- Configurable RRF parameter to adjust fusion weighting
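
The hybrid ranking feature can be illustrated with a minimal sketch of Reciprocal Rank Fusion. This is not the body of reciprocalRankFusion(ScoreDoc[], ScoreDoc[], int), which this page does not show; it assumes the standard formulation in which each ranked list contributes 1 / (k + rank) per document. The class name RrfSketch, the method fuse, and the use of plain string IDs instead of Lucene ScoreDoc objects are hypothetical, chosen so the example runs without Lucene on the classpath.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch; not the RagBuilder implementation.
public class RrfSketch {

    /**
     * Fuses two ranked lists of document IDs (best hit first).
     * Each list contributes 1 / (k + rank) per document, with rank starting at 1.
     */
    public static List<String> fuse(List<String> bm25Ranking, List<String> vectorRanking, int k) {
        Map<String, Double> scores = new HashMap<>();
        accumulate(scores, bm25Ranking, k);
        accumulate(scores, vectorRanking, k);

        List<String> fused = new ArrayList<>(scores.keySet());
        // Highest fused score first; ties are broken by ID so the order is deterministic.
        fused.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return fused;
    }

    private static void accumulate(Map<String, Double> scores, List<String> ranking, int k) {
        for (int i = 0; i < ranking.size(); i++) {
            int rank = i + 1; // ranks are 1-based
            scores.merge(ranking.get(i), 1.0 / (k + rank), Double::sum);
        }
    }

    public static void main(String[] args) {
        List<String> bm25 = List.of("a", "b", "c");
        List<String> vector = List.of("b", "d", "a");
        System.out.println(fuse(bm25, vector, 60));
    }
}
```

With k = 60 the two example lists fuse to [b, a, d, c]: documents found by both searches outrank those found by only one. Under this formulation, a larger k flattens the gap between adjacent ranks (1/61 versus 1/62 at k = 60, compared with 1/2 versus 1/3 at k = 1), which is presumably what the configurable RRF parameter adjusts.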

Thread-safety:

- search(String, int) is thread-safe (read-only IndexSearcher)
- Indexing is single-threaded and not safe for concurrent writes

Typical usage:

Path indexPath = Path.of("data/lucene-index");
EmbeddingModel embeddingModel = ...; // your embedding model
RagBuilder rag = new RagBuilder(indexPath, embeddingModel, 64, 60); // embeddingBatch = 64, rrfK = 60
rag.buildIfNeeded(Path.of("data/docs"), path -> path.toString().endsWith(".md")); // index only .md files
List<TextSegment> results = rag.search("example query", 5); // at most 5 segments

Designed for moderate to large collections. For very large corpora, consider incremental or distributed indexing/search solutions.

Author: Laurent Gougeon

Constructor Summary

RagBuilder(Path indexPath, dev.langchain4j.model.embedding.EmbeddingModel embeddingModel, int embeddingBatch, int rrfK)
  Constructs a new RagBuilder.

Method Summary

void buildIfNeeded(Path docsPath, Predicate<Path> filter)
  Builds the RAG index if it does not already exist.

List<dev.langchain4j.data.segment.TextSegment> search(String query, int topK)
  Performs hybrid search over BM25 and vector embeddings.

Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- RagBuilder
  
  public RagBuilder(Path indexPath, dev.langchain4j.model.embedding.EmbeddingModel embeddingModel, int embeddingBatch, int rrfK)
  
  Constructs a new RagBuilder.
  
  Parameters:
  
  indexPath - path to store the Lucene index
  
  embeddingModel - embedding model for semantic search
  
  embeddingBatch - batch size for embedding segments
  
  rrfK - smoothing parameter for Reciprocal Rank Fusion

Method Details
- buildIfNeeded
  
  public void buildIfNeeded(Path docsPath, Predicate<Path> filter)
  
  Builds the RAG index if it does not already exist.
  Uses paragraph splitting, batch embedding, and incremental indexing. Each Lucene document stores the text, embedding, and metadata including preview.
  
  Parameters:
  
  docsPath - the root directory containing documents
  
  filter - predicate to select files; if null, all files are indexed
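
The paragraph splitting and batch embedding described for buildIfNeeded might look roughly like the sketch below. The real implementation is not shown on this page, so everything here is an assumption: the class name ParagraphBatchSketch, the rule that paragraphs are separated by blank lines, and the helper names splitParagraphs and batches are all hypothetical. The batch size plays the role of the embeddingBatch constructor argument.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not the RagBuilder implementation.
public class ParagraphBatchSketch {

    /** Splits text into paragraphs on blank lines (an assumed rule), dropping empty fragments. */
    public static List<String> splitParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        // A line break, optional whitespace, then another line break marks a paragraph boundary.
        for (String part : text.split("\\R\\s*\\R")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                paragraphs.add(trimmed);
            }
        }
        return paragraphs;
    }

    /** Groups items into consecutive batches of at most batchSize elements. */
    public static <T> List<List<T>> batches(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> result = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            result.add(new ArrayList<>(items.subList(start, end)));
        }
        return result;
    }
}
```

Processing one batch at a time means only that batch's paragraphs need to be embedded and held in memory before being written to the index, which is one way the memory-efficient behaviour could be achieved.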
- search
  
  public List<dev.langchain4j.data.segment.TextSegment> search(String query, int topK)
  
  Performs hybrid search over BM25 and vector embeddings.
  Dynamically adjusts candidate pool size for longer queries. Returns top-k segments ranked by combined relevance. Each segment includes metadata: filename, sourcePath, paragraph index, and preview.
  
  Parameters:
  
  query - the user query
  
  topK - maximum number of results to return
  
  Returns:
  
  list of top-k TextSegments; empty if search fails or query is blank
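
For the vector side of the hybrid search, the class description names cosine similarity as the measure. A minimal sketch of that computation follows. How RagBuilder computes or delegates it is not documented here, so the class name CosineSketch, the float[] representation, and the convention of returning 0 for a zero vector are assumptions made for illustration.

```java
// Hypothetical stand-alone sketch; not the RagBuilder implementation.
public class CosineSketch {

    /** Returns the cosine of the angle between two equal-length vectors. */
    public static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // assumed convention: a zero vector is similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0, so the measure compares the direction of the query embedding with that of each paragraph embedding.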
