https://github.com/gsaini/lucene-by-example
Hands-on learning project for Apache Lucene 10.x on Java 25 — 10 progressive modules covering indexing, analyzers, query types, query parser, highlighting, faceting, sorting/scoring, updates, custom analyzers, and autocomplete.
- Host: GitHub
- URL: https://github.com/gsaini/lucene-by-example
- Owner: gsaini
- Created: 2026-05-14T01:10:38.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-05-15T04:29:36.000Z (about 1 month ago)
- Last Synced: 2026-06-14T23:34:28.985Z (3 days ago)
- Topics: analyzer, apache-lucene, bm25, full-text-search, getting-started, indexing, information-retrieval, java, java-25, learning, lucene, maven, search, search-engine, tutorial
- Language: Java
- Size: 48.8 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Lucene by Example
A hands-on learning project that walks through the core features of
[Apache Lucene](https://lucene.apache.org/) one self-contained module at a time.
Everything runs entirely in memory against a tiny built-in book catalogue, so
you can read a module, run it, tweak it, and immediately see how the output
changes — no external services, no setup.
---
## Table of contents
- [Requirements](#requirements)
- [Running](#running)
- [Running the tests](#running-the-tests)
- [What each module covers](#what-each-module-covers)
- [Architecture](#architecture)
  - [End-to-end pipeline](#end-to-end-pipeline)
  - [Indexing pipeline (write path)](#indexing-pipeline-write-path)
  - [Searching pipeline (read path)](#searching-pipeline-read-path)
  - [Inverted index — the core data structure](#inverted-index--the-core-data-structure)
  - [Field-type decision matrix](#field-type-decision-matrix)
  - [Project module map](#project-module-map)
- [Three rules to remember](#three-rules-to-remember)
- [Where to go next](#where-to-go-next)
---
## Requirements
- Java 25+ (current LTS)
- Maven 3.8+
- Lucene 10.4.0 (declared in [pom.xml](pom.xml), pulled by Maven)
## Running
Run every module in order:
```bash
mvn -q compile exec:java
```
Run a single module by number (1–10):
```bash
mvn -q compile exec:java -Dexec.args=3
```
Run a few modules in sequence:
```bash
mvn -q compile exec:java -Dexec.args="1 3 7"
```
## Running the tests
Each module has a matching integration-test class under [src/test/java/com/example/lucene/](src/test/java/com/example/lucene/)
that builds a real in-memory index, runs real queries, and asserts on real results — no mocks.
```bash
# Run every test
mvn -q test
# Run a single test class
mvn -q test -Dtest=Module03_QueryTypesIT
# Run a single method
mvn -q test -Dtest=Module03_QueryTypesIT#fuzzy_query
```
The tests double as executable documentation: each `@DisplayName` describes the Lucene behaviour
the assertion locks in, so reading the test list is another way to learn what each module covers.
## What each module covers
| # | Module | What you'll learn |
| - | - | - |
| 1 | [Module01_HelloLucene.java](src/main/java/com/example/lucene/Module01_HelloLucene.java) | Directory, Analyzer, IndexWriter, IndexSearcher, TermQuery — the minimum viable pipeline. |
| 2 | [Module02_FieldsAndAnalyzers.java](src/main/java/com/example/lucene/Module02_FieldsAndAnalyzers.java) | StringField vs TextField vs StoredField vs Point vs DocValues; how analyzers produce different tokens. |
| 3 | [Module03_QueryTypes.java](src/main/java/com/example/lucene/Module03_QueryTypes.java) | TermQuery, PhraseQuery, BooleanQuery (MUST / SHOULD / MUST_NOT / FILTER), WildcardQuery, PrefixQuery, FuzzyQuery, RegexpQuery, numeric range queries. |
| 4 | [Module04_QueryParser.java](src/main/java/com/example/lucene/Module04_QueryParser.java) | Lucene's classic query-string syntax, including MultiFieldQueryParser with per-field boosts. |
| 5 | [Module05_Highlighting.java](src/main/java/com/example/lucene/Module05_Highlighting.java) | Generating snippet fragments with matched terms wrapped in HTML tags. |
| 6 | [Module06_Faceting.java](src/main/java/com/example/lucene/Module06_Faceting.java) | Sidebar-style facet counts using FacetField + Taxonomy index. |
| 7 | [Module07_SortingAndScoring.java](src/main/java/com/example/lucene/Module07_SortingAndScoring.java) | Sort by doc-values fields; FunctionScoreQuery to blend BM25 with a numeric signal. |
| 8 | [Module08_UpdatesAndDeletes.java](src/main/java/com/example/lucene/Module08_UpdatesAndDeletes.java) | updateDocument by primary key, deleteDocuments by Term and Query, deleteAll. |
| 9 | [Module09_CustomAnalyzer.java](src/main/java/com/example/lucene/Module09_CustomAnalyzer.java) | Building an Analyzer pipeline with stop-words, synonyms, stemming, edge n-grams, ASCII folding. |
| 10 | [Module10_Suggester.java](src/main/java/com/example/lucene/Module10_Suggester.java) | AnalyzingInfixSuggester for fast autocomplete. |
---
## Architecture
### End-to-end pipeline
The big picture: a domain object enters on the left, an index is built in the middle, and
queries flow back through the right.
```text
┌──────────────────────────────────────────────────────────────────────────┐
│                          WRITE PATH (indexing)                           │
└──────────────────────────────────────────────────────────────────────────┘
 ┌────────────┐     field     ┌────────────┐   analyze    ┌────────────┐
 │ Domain     │ ───mapping──▶ │ Document   │ ───tokens──▶ │ Analyzer   │
 │ object     │               │ + Fields   │              │ chain      │
 │ (Book,     │               │            │              │            │
 │ Product…)  │               └─────┬──────┘              └─────┬──────┘
 └────────────┘                     │                           │
                                    ▼                           ▼
                           ┌──────────────────┐       ┌──────────────────┐
                           │ IndexWriter      │◀──────│ Token stream     │
                           │ (transactional,  │       │ + attributes     │
                           │ one per index)   │       └──────────────────┘
                           └────────┬─────────┘
                                    │ flush / commit
                                    ▼
                           ┌──────────────────┐
                           │ Directory        │  (FSDirectory, MMapDirectory,
                           │ ┌────────────┐   │   ByteBuffersDirectory, …)
                           │ │ segment_1  │   │
                           │ │ segment_2  │   │  ── segments are immutable
                           │ │ segment_3  │   │     and merged in the background
                           │ └────────────┘   │
                           └────────┬─────────┘
                                    │
┌───────────────────────────────────┼──────────────────────────────────────┐
│                                   ▼        READ PATH (searching)         │
└───────────────────────────────────┬──────────────────────────────────────┘
                                    │
                           ┌────────▼─────────┐
                           │ IndexReader      │  point-in-time snapshot
                           │ (DirectoryReader │  of all segments
                           │ .open(dir))      │
                           └────────┬─────────┘
                                    │
                           ┌────────▼─────────┐       ┌────────────────────┐
                           │ IndexSearcher    │◀──────│ Query              │
                           │ (BM25Similarity, │       │ (TermQuery,        │
                           │ collectors,      │       │ BooleanQuery,      │
                           │ rewrites)        │       │ PhraseQuery, …)    │
                           └────────┬─────────┘       └────────────────────┘
                                    │
                                    ▼
                           ┌──────────────────┐
                           │ TopDocs          │  ranked ScoreDoc[]
                           │ (scores + ids)   │  + optional facets,
                           │                  │  highlights, sorts
                           └──────────────────┘
```
### Indexing pipeline (write path)
Inside `IndexWriter.addDocument(...)`, each tokenized `Field` (such as a `TextField`) flows through
the analyzer chain, while untokenized fields such as `StringField` are indexed verbatim. Depending on
which `FieldType` flags are set, the resulting values are recorded in postings, doc-values, points,
and stored fields.
```text
Document
 ├── StringField  "id"           ─────▶ exact-term postings (no analysis)
 ├── TextField    "title"        ─────▶ Analyzer ─▶ tokens ─▶ postings
 ├── TextField    "description"  ─────▶ Analyzer ─▶ tokens ─▶ postings
 ├── IntPoint     "year"         ─────▶ BKD tree (range queries)
 ├── DoubleDocValuesField        ─────▶ columnar doc-values (sort / facet / function)
 ├── SortedDocValuesField        ─────▶ columnar doc-values (sort / facet)
 ├── FacetField   "Category"     ─────▶ Taxonomy index (facet counts)
 └── StoredField  "raw"          ─────▶ stored-fields blob (retrieval only)

Analyzer chain
───────────────
raw text ──▶ Tokenizer ──▶ TokenFilter ──▶ TokenFilter ──▶ … ──▶ indexed tokens
                 ▲              ▲               ▲
                 │              │               │
                 │       LowerCaseFilter   StopFilter  SynonymGraphFilter  PorterStemFilter
                 │
      e.g. StandardTokenizer (Unicode word breaks)
```
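The analyzer chain in the diagram above can be mimicked in a few lines of plain Java. This is a toy sketch with no Lucene dependency: `AnalyzerChainSketch`, `tokenize`, `lowerCase`, `dropStopWords`, and the stop-word set are made-up names, and the regex split is only a rough stand-in for `StandardTokenizer`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzer chain (hypothetical names, not Lucene's API):
// raw text -> tokenizer -> lower-case filter -> stop-word filter.
final class AnalyzerChainSketch {

    // Crude tokenizer: split on any run of characters that are not letters or digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Plays the role of LowerCaseFilter.
    static List<String> lowerCase(List<String> tokens) {
        return tokens.stream().map(t -> t.toLowerCase(Locale.ROOT)).toList();
    }

    // Plays the role of StopFilter. Unlike Lucene, this toy does not track token positions.
    static List<String> dropStopWords(List<String> tokens, Set<String> stopWords) {
        return tokens.stream().filter(t -> !stopWords.contains(t)).toList();
    }

    // The whole chain, applied in a fixed order.
    static List<String> analyze(String text, Set<String> stopWords) {
        return dropStopWords(lowerCase(tokenize(text)), stopWords);
    }

    public static void main(String[] args) {
        Set<String> stopWords = Set.of("in", "the");
        System.out.println(analyze("Lucene in Action", stopWords)); // [lucene, action]
        // A query term has to pass through the same chain to match the indexed tokens.
        System.out.println(analyze("ACTION", stopWords));           // [action]
    }
}
```

Because the indexed token is `action`, a raw query for `ACTION` only matches once it has been run through the same chain.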
### Searching pipeline (read path)
```text
user input ──▶ QueryParser ──▶ Query tree ──▶ rewrite ──▶ Weight ──▶ Scorer
                                                                       │
                                               per-segment iteration ──┘
                                                         │
                                 BM25 score + Similarity │
                                                         ▼
                                                 TopDocsCollector
                                                         │
                                                         ▼
                           ┌─────────────────────────────┴───┐
                           │ TopDocs (ScoreDoc[] + totalHits)│
                           └─────────────────────────────────┘
                                                         │
                           ┌─────────────────────────────┼─────────────────────────────┐
                           ▼                             ▼                             ▼
               StoredFields (retrieval)       Highlighter (snippets)            Facets (counts)
```
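The collection step in the diagram keeps only the best `k` hits instead of sorting every match. A minimal sketch of that idea in plain Java uses a bounded min-heap. `TopKSketch` and `Hit` are illustrative names, not Lucene classes, and the tie-break (the lower doc id wins on equal scores) is an assumption chosen to resemble score-ordered Lucene results.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Toy top-k collection (hypothetical names, not Lucene's collector API).
final class TopKSketch {

    record Hit(int docId, float score) {}

    // Orders hits from worst to best: lower score is worse; on equal scores,
    // the higher docId is treated as worse (an assumption for this sketch).
    private static final Comparator<Hit> WORST_FIRST = (a, b) -> {
        int byScore = Float.compare(a.score(), b.score());
        return byScore != 0 ? byScore : Integer.compare(b.docId(), a.docId());
    };

    static List<Hit> topK(List<Hit> candidates, int k) {
        // Min-heap whose head is always the worst hit currently kept.
        PriorityQueue<Hit> heap = new PriorityQueue<>(WORST_FIRST);
        for (Hit hit : candidates) {
            heap.add(hit);
            if (heap.size() > k) {
                heap.poll(); // evict the worst hit so at most k remain
            }
        }
        List<Hit> result = new ArrayList<>(heap);
        result.sort(WORST_FIRST.reversed()); // best hit first
        return result;
    }

    public static void main(String[] args) {
        List<Hit> matches = List.of(
                new Hit(1, 0.5f), new Hit(2, 2.0f), new Hit(3, 1.0f), new Hit(4, 2.0f));
        for (Hit hit : topK(matches, 2)) {
            System.out.println(hit.docId() + " " + hit.score());
        }
    }
}
```

The heap never holds more than `k + 1` entries, so memory stays bounded no matter how many documents match.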
### Inverted index — the core data structure
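This is the idea every Lucene feature is built on: instead of storing "doc → words", Lucene
flips it to "word → docs". Looking up a term is then a search in the sorted term dictionary
(guided by an FST prefix index, so the cost depends mainly on the term's length rather than on
the number of documents), followed by a walk over the term's postings list.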
```text
Forward (what we wrote)
───────────────────────
  docId=1   "Lucene in Action"
  docId=2   "Effective Java"
  docId=3   "Java Concurrency in Practice"

Inverted (what Lucene stores)
─────────────────────────────
  term           postings list (docId → freq, positions, offsets)
  ───────────    ───────────────────────────────────────────────
  action     ──▶ [ (1, freq=1, pos=[2]) ]
  concurrency──▶ [ (3, freq=1, pos=[1]) ]
  effective  ──▶ [ (2, freq=1, pos=[0]) ]
  in         ──▶ [ (1, freq=1, pos=[1]), (3, freq=1, pos=[2]) ]
  java       ──▶ [ (2, freq=1, pos=[1]), (3, freq=1, pos=[0]) ]
  lucene     ──▶ [ (1, freq=1, pos=[0]) ]
  practice   ──▶ [ (3, freq=1, pos=[3]) ]
      ▲ stored in segment files: .tim/.tip (term dictionary), .doc/.pos (postings)
```
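The flip from forward to inverted form shown above can be reproduced with a sorted map from term to postings. The sketch below is plain Java, not Lucene code: `InvertedIndexSketch`, `Posting`, `termQuery`, and `phraseQuery` are hypothetical names, analysis is reduced to lower-casing and whitespace splitting, and documents are assumed to be added in increasing `docId` order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index with positions (hypothetical names, not Lucene's API).
final class InvertedIndexSketch {

    // One posting: a document and the positions at which the term occurs in it.
    record Posting(int docId, List<Integer> positions) {}

    // A TreeMap keeps terms sorted, loosely mirroring a sorted term dictionary.
    private final Map<String, List<Posting>> postings = new TreeMap<>();

    // Assumes documents are added in increasing docId order.
    void add(int docId, String text) {
        String[] tokens = text.strip().toLowerCase(Locale.ROOT).split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            List<Posting> list = postings.computeIfAbsent(tokens[pos], t -> new ArrayList<>());
            if (list.isEmpty() || list.get(list.size() - 1).docId() != docId) {
                list.add(new Posting(docId, new ArrayList<>()));
            }
            list.get(list.size() - 1).positions().add(pos);
        }
    }

    // Term lookup: return the docIds on the term's postings list.
    List<Integer> termQuery(String term) {
        return postings.getOrDefault(term, List.of()).stream().map(Posting::docId).toList();
    }

    // Two-word phrase: both terms in the same doc, with `second` exactly one position after `first`.
    List<Integer> phraseQuery(String first, String second) {
        List<Integer> hits = new ArrayList<>();
        for (Posting a : postings.getOrDefault(first, List.of())) {
            for (Posting b : postings.getOrDefault(second, List.of())) {
                if (a.docId() == b.docId()
                        && a.positions().stream().anyMatch(p -> b.positions().contains(p + 1))) {
                    hits.add(a.docId());
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Lucene in Action");
        index.add(2, "Effective Java");
        index.add(3, "Java Concurrency in Practice");
        System.out.println(index.termQuery("java"));                  // [2, 3]
        System.out.println(index.phraseQuery("java", "concurrency")); // [3]
    }
}
```

The nested loop in `phraseQuery` is quadratic and only suitable for a demo; real postings are sorted by doc id so two lists can be intersected in a single merge pass.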
A `TermQuery("java")` walks the postings list under `java` → docs `[2, 3]`. A `PhraseQuery`
also walks positions to verify that the words appear adjacent and in order. BM25 scoring uses
the term frequency, document frequency, and field-length norms recorded in this same index.
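To make the scoring inputs concrete, here is a hedged sketch of a single-term BM25 score in plain Java. It assumes the common BM25 form with `k1 = 1.2` and `b = 0.75`, which are also the documented defaults of Lucene's `BM25Similarity`. `Bm25Sketch` and its method names are illustrative. Real Lucene scores will differ slightly, among other reasons because Lucene stores field lengths in a lossy one-byte norm.

```java
// Toy single-term BM25 (hypothetical class, not Lucene's BM25Similarity).
final class Bm25Sketch {

    static final double K1 = 1.2;  // term-frequency saturation
    static final double B = 0.75;  // strength of length normalisation

    // Rarer terms (small docFreq relative to docCount) get a larger weight.
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // freq: occurrences of the term in this document's field
    // docLength / avgDocLength: field length in tokens, and the average over the index
    static double score(double freq, double docLength, double avgDocLength,
                        long docCount, long docFreq) {
        double lengthNorm = K1 * (1.0 - B + B * docLength / avgDocLength);
        return idf(docCount, docFreq) * freq / (freq + lengthNorm);
    }

    public static void main(String[] args) {
        // Three titles with lengths 3, 2 and 4 tokens: average length 3.
        // "java" occurs once in doc 2 (length 2) and once in doc 3 (length 4).
        double doc2 = score(1, 2, 3.0, 3, 2);
        double doc3 = score(1, 4, 3.0, 3, 2);
        System.out.println("doc 2 scores higher than doc 3: " + (doc2 > doc3)); // true
    }
}
```

With equal term frequency, the shorter title scores higher, which is the length normalisation at work.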
### Field-type decision matrix
A quick reference for which field type to pick for which purpose:
| Need | Use this field |
| --------------------------- | ----------------------------------------------------------------------- |
| Exact-match on an ID/code | `StringField` |
| Full-text search | `TextField` (with the right Analyzer) |
| Just return it with the hit | `StoredField` |
| Numeric range query | `IntPoint` / `LongPoint` / `DoublePoint` |
| Sort or facet | `SortedDocValuesField`, `NumericDocValuesField`, `DoubleDocValuesField` |
| Facet counts (taxonomy) | `FacetField` (+ `FacetsConfig.build(...)`) |
| Autocomplete | feed source into `AnalyzingInfixSuggester` |
> One logical field often becomes 2–3 Lucene fields. For example, `year` is usually
> `IntPoint` (range query) + `NumericDocValuesField` (sort) + `StoredField` (retrieval).
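Why sorting and faceting need their own doc-values field becomes clearer with a toy model: the inverted index answers "which docs contain this term?", while a doc-values column answers "what is this doc's value?". The plain-Java sketch below models two such columns as arrays indexed by doc id. `DocValuesSketch`, its method names, and the sample years and categories are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy doc-values columns (hypothetical names, not Lucene's API).
final class DocValuesSketch {

    // Column-oriented storage: array index = docId, element = that doc's value.
    private final int[] yearByDoc;
    private final String[] categoryByDoc;

    DocValuesSketch(int[] yearByDoc, String[] categoryByDoc) {
        this.yearByDoc = yearByDoc;
        this.categoryByDoc = categoryByDoc;
    }

    // Sort matching docIds by year, newest first, reading only the numeric column.
    List<Integer> sortByYearDesc(List<Integer> hits) {
        List<Integer> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingInt((Integer docId) -> yearByDoc[docId]).reversed());
        return sorted;
    }

    // Count matching docs per category, reading only the category column.
    Map<String, Integer> facetCounts(List<Integer> hits) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int docId : hits) {
            counts.merge(categoryByDoc[docId], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample data for three documents with docIds 0, 1, 2.
        DocValuesSketch columns = new DocValuesSketch(
                new int[] {2010, 2018, 2006},
                new String[] {"search", "java", "java"});
        System.out.println(columns.sortByYearDesc(List.of(0, 1, 2))); // [1, 0, 2]
        System.out.println(columns.facetCounts(List.of(0, 1, 2)));    // {java=2, search=1}
    }
}
```

Neither operation touches postings or stored fields, which is why a field that must be searched, sorted, and returned ends up being added in several forms.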
### Project module map
How the 10 modules fit on the architecture diagram:
```text
┌─────────────────────────────────────────────────────────────┐
│                       INDEX BUILDING                        │
│                                                             │
│   Module 1  Hello Lucene        Module 8  Update/Delete     │
│   Module 2  Field types         Module 9  Custom Analyzer   │
│   Module 6  Facet indexing                                  │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                          QUERYING                           │
│                                                             │
│   Module 3  Query types         Module 4  QueryParser       │
│   Module 7  Sort / function     Module 10 Suggester         │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                       POST-PROCESSING                       │
│                                                             │
│   Module 5  Highlighting        Module 6  Facet counts      │
└─────────────────────────────────────────────────────────────┘
```
---
## Three rules to remember
1. **Field type decides what queries are possible.** A field that is not indexed
cannot be searched. A field that is not stored cannot be retrieved.
Sorting and faceting need a doc-values flavour of the field.
2. **Use the same Analyzer for indexing and searching.** Otherwise your query
terms won't match the tokens you wrote to the index. Module 2 makes this
obvious by showing the token output of four analyzers side-by-side.
3. **Documents are immutable.** "Update" means delete + add, keyed off a unique
field. See Module 8.
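Rule 3 can be pictured with a toy append-only store: an update never rewrites an existing entry, it marks the old one as deleted and appends the new one. The sketch below is plain Java with hypothetical names (`UpdateSketch`, `Doc`, `deleteById`). The `BitSet` of tombstones is only loosely analogous to the way Lucene tracks deleted documents per segment.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy "update = delete + add" store (hypothetical names, not Lucene's API).
final class UpdateSketch {

    record Doc(String id, String title) {}

    private final List<Doc> docs = new ArrayList<>(); // append-only: entries are never edited
    private final BitSet deleted = new BitSet();      // tombstones, one bit per slot

    void add(Doc doc) {
        docs.add(doc);
    }

    // Marks every live doc with this id as deleted and returns how many were marked.
    int deleteById(String id) {
        int marked = 0;
        for (int slot = 0; slot < docs.size(); slot++) {
            if (!deleted.get(slot) && docs.get(slot).id().equals(id)) {
                deleted.set(slot);
                marked++;
            }
        }
        return marked;
    }

    // An update is a delete keyed on the unique id, followed by an add.
    void update(Doc doc) {
        deleteById(doc.id());
        add(doc);
    }

    // Docs that have not been tombstoned, in insertion order.
    List<Doc> liveDocs() {
        List<Doc> live = new ArrayList<>();
        for (int slot = 0; slot < docs.size(); slot++) {
            if (!deleted.get(slot)) {
                live.add(docs.get(slot));
            }
        }
        return live;
    }

    // Total slots ever written, including tombstoned ones.
    int maxDoc() {
        return docs.size();
    }

    public static void main(String[] args) {
        UpdateSketch index = new UpdateSketch();
        index.add(new Doc("1", "Lucene in Action"));
        index.update(new Doc("1", "Lucene in Action, 2nd Edition"));
        System.out.println(index.liveDocs().size()); // 1
        System.out.println(index.maxDoc());          // 2
    }
}
```

The old entry still occupies a slot after the update; in this toy nothing ever reclaims it, whereas Lucene drops deleted documents when segments are merged.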
## Where to go next
- The official [Lucene 10.4.0 demo](https://lucene.apache.org/core/10_4_0/demo/index.html)
shows indexing of real files from disk.
- [Lucene's MIGRATE.md](https://github.com/apache/lucene/blob/main/lucene/MIGRATE.md)
is the best place to see what changes between major versions (e.g. 9.x → 10.x removed the
static `FacetsCollector.search(...)` helper in favour of `FacetsCollectorManager`, used in
Module 6 of this project).
- Real-world systems built on Lucene worth studying: Elasticsearch, OpenSearch,
Solr — they reuse the APIs you've practised in this project.