Class RagBuilder
A Lucene-based Retrieval-Augmented Generation (RAG) builder and search engine. Supports BM25 keyword search, vector similarity search, and hybrid ranking using Reciprocal Rank Fusion (RRF). Designed for memory-efficient, operational indexing of large document collections.
Features:
- BM25 keyword search with Lucene StandardAnalyzer and QueryParser
- Vector similarity search using embeddings and cosine similarity
- Hybrid ranking via reciprocalRankFusion(ScoreDoc[], ScoreDoc[], int)
- Incremental, memory-efficient indexing using paragraph splitting and batch embedding
- Metadata support including filename, source path, paragraph index, and preview (first sentence)
- Failure-tolerant indexing: logs individual document/paragraph errors instead of throwing
- Configurable RRF parameter to adjust fusion weighting
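To make the hybrid ranking feature concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of document ids. It is not the class's actual reciprocalRankFusion implementation, which operates on Lucene ScoreDoc arrays; the class name RrfSketch, the method fuse, and the tie-breaking rule are assumptions made for illustration. Each id earns 1 / (k + rank) for every list it appears in, so a larger k flattens the advantage of top-ranked hits.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of RagBuilder.
final class RrfSketch {

    /**
     * Fuses two ranked lists of document ids (best hit first).
     * Each id earns 1 / (k + rank) per list, with rank starting at 1.
     */
    static List<Integer> fuse(int[] keywordRanking, int[] vectorRanking, int k) {
        Map<Integer, Double> scores = new HashMap<>();
        accumulate(scores, keywordRanking, k);
        accumulate(scores, vectorRanking, k);

        List<Integer> fused = new ArrayList<>(scores.keySet());
        // Highest fused score first; ties broken by smaller id (an assumption, for determinism).
        fused.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return fused;
    }

    private static void accumulate(Map<Integer, Double> scores, int[] ranking, int k) {
        for (int i = 0; i < ranking.length; i++) {
            int rank = i + 1; // ranks are 1-based
            scores.merge(ranking[i], 1.0 / (k + rank), Double::sum);
        }
    }
}
```

With k = 60, fusing the keyword ranking {1, 2, 3} with the vector ranking {2, 4, 1} places document 2 first, because it ranks highly in both lists, followed by 1, 4, and 3.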
Thread-safety:
- search(String, int) is thread-safe (read-only IndexSearcher)
- Indexing is single-threaded and not safe for concurrent writes
Typical usage:
Path indexPath = Path.of("data/lucene-index");
EmbeddingModel embeddingModel = ...; // your embedding model
RagBuilder rag = new RagBuilder(indexPath, embeddingModel, 64, 60);
rag.buildIfNeeded(Path.of("data/docs"), path -> path.toString().endsWith(".md"));
List<TextSegment> results = rag.search("example query", 5);
Designed for moderate to large collections. For very large corpora, consider incremental or distributed indexing/search solutions.
- Author: Laurent Gougeon
Constructor Summary
Constructor: RagBuilder(Path indexPath, dev.langchain4j.model.embedding.EmbeddingModel embeddingModel, int embeddingBatch, int rrfK)
Description: Constructs a new RagBuilder.
Constructor Details
RagBuilder
public RagBuilder(Path indexPath, dev.langchain4j.model.embedding.EmbeddingModel embeddingModel, int embeddingBatch, int rrfK)
Constructs a new RagBuilder.
- Parameters:
indexPath - path to store the Lucene index
embeddingModel - embedding model for semantic search
embeddingBatch - batch size for embedding segments
rrfK - smoothing parameter for Reciprocal Rank Fusion
Method Details
buildIfNeeded
Builds the RAG index if it does not already exist. Uses paragraph splitting, batch embedding, and incremental indexing. Each Lucene document stores the text, embedding, and metadata including preview.
- Parameters:
docsPath - the root directory containing documents
filter - predicate to select files; if null, all files are indexed
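The indexing steps described for buildIfNeeded (paragraph splitting, a first-sentence preview, and batch embedding) can be sketched roughly as follows. This is an illustration under stated assumptions, not the class's internal code: the class IndexingSketch and its methods are hypothetical, paragraphs are assumed to be separated by blank lines, and the preview uses a naive "first terminal punctuation mark" rule that will mis-split abbreviations such as "e.g.".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of RagBuilder.
final class IndexingSketch {

    /** Splits text into paragraphs on blank lines, dropping empty pieces. */
    static List<String> splitParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        for (String piece : text.split("\\R\\s*\\R")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                paragraphs.add(trimmed);
            }
        }
        return paragraphs;
    }

    /** Returns the text up to and including the first '.', '!' or '?'. */
    static String preview(String paragraph) {
        for (int i = 0; i < paragraph.length(); i++) {
            char c = paragraph.charAt(i);
            if (c == '.' || c == '!' || c == '?') {
                return paragraph.substring(0, i + 1);
            }
        }
        return paragraph; // no sentence terminator found
    }

    /** Groups items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> batches(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> result = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            result.add(new ArrayList<>(items.subList(start, end)));
        }
        return result;
    }
}
```

Each batch would then be passed to the embedding model in a single call, so memory use is bounded by the batch size (the embeddingBatch constructor parameter) rather than by the size of the document.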
search
Performs hybrid search over BM25 and vector embeddings. Dynamically adjusts candidate pool size for longer queries. Returns top-k segments ranked by combined relevance. Each segment includes metadata: filename, sourcePath, paragraph index, and preview.
- Parameters:
query - the user query
topK - maximum number of results to return
- Returns:
list of top-k TextSegments; empty if search fails or query is blank
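The vector side of the hybrid search ranks segments by cosine similarity between the query embedding and each stored embedding. The sketch below shows that measure on plain float arrays. The class CosineSketch is hypothetical, and returning 0 for a zero-length vector is an assumption; the page does not say how RagBuilder handles that case.

```java
// Hypothetical illustration; not part of RagBuilder.
final class CosineSketch {

    /** Cosine similarity of two equal-length vectors, in the range [-1, 1]. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // assumption: treat a zero vector as unrelated to everything
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Because the measure depends only on direction, vectors such as {1, 2} and {2, 4} score as identical, which is why embedding magnitude does not affect the ranking.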