{"id":50864271,"url":"https://github.com/gsaini/lucene-by-example","last_synced_at":"2026-06-14T23:34:32.508Z","repository":{"id":357982620,"uuid":"1238267214","full_name":"gsaini/lucene-by-example","owner":"gsaini","description":"Hands-on learning project for Apache Lucene 10.x on Java 25 — 10 progressive modules covering indexing, analyzers, query types, query parser, highlighting, faceting, sorting/scoring, updates, custom analyzers, and autocomplete.","archived":false,"fork":false,"pushed_at":"2026-05-15T04:29:36.000Z","size":50,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-14T23:34:28.985Z","etag":null,"topics":["analyzer","apache-lucene","bm25","full-text-search","getting-started","indexing","information-retrieval","java","java-25","learning","lucene","maven","search","search-engine","tutorial"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gsaini.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-14T01:10:38.000Z","updated_at":"2026-05-15T04:29:40.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/gsaini/lucene-by-example","commit_stats":null,"previous_names":["gsaini/lucene-by-example"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/gsaini/lucene-by-example","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/repositories/gsaini%2Flucene-by-example","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gsaini%2Flucene-by-example/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gsaini%2Flucene-by-example/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gsaini%2Flucene-by-example/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gsaini","download_url":"https://codeload.github.com/gsaini/lucene-by-example/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gsaini%2Flucene-by-example/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34342089,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-14T02:00:07.365Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analyzer","apache-lucene","bm25","full-text-search","getting-started","indexing","information-retrieval","java","java-25","learning","lucene","maven","search","search-engine","tutorial"],"created_at":"2026-06-14T23:34:31.870Z","updated_at":"2026-06-14T23:34:32.485Z","avatar_url":"https://github.com/gsaini.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Lucene by 
Example\n\n![Java](https://img.shields.io/badge/Java-25-007396?style=for-the-badge\u0026logo=openjdk\u0026logoColor=white)\n![Apache Lucene](https://img.shields.io/badge/Apache%20Lucene-10.4.0-D22128?style=for-the-badge\u0026logo=apache\u0026logoColor=white)\n![Maven](https://img.shields.io/badge/Maven-3.8+-C71A36?style=for-the-badge\u0026logo=apachemaven\u0026logoColor=white)\n![JUnit 5](https://img.shields.io/badge/JUnit-5.12-25A162?style=for-the-badge\u0026logo=junit5\u0026logoColor=white)\n![License](https://img.shields.io/badge/License-Apache%202.0-0F80C1?style=for-the-badge\u0026logo=apache\u0026logoColor=white)\n![Platform](https://img.shields.io/badge/Platform-Cross--Platform-4EAA25?style=for-the-badge\u0026logo=linux\u0026logoColor=white)\n![Status](https://img.shields.io/badge/Status-Learning%20Project-0A66C2?style=for-the-badge\u0026logo=readthedocs\u0026logoColor=white)\n\n[![GitHub last commit](https://img.shields.io/github/last-commit/gsaini/lucene-by-example?style=flat-square)](https://github.com/gsaini/lucene-by-example/commits/main)\n[![GitHub repo size](https://img.shields.io/github/repo-size/gsaini/lucene-by-example?style=flat-square)](https://github.com/gsaini/lucene-by-example)\n[![GitHub stars](https://img.shields.io/github/stars/gsaini/lucene-by-example?style=social)](https://github.com/gsaini/lucene-by-example/stargazers)\n\nA hands-on learning project that walks through the core features of\n[Apache Lucene](https://lucene.apache.org/) one self-contained module at a time.\n\nEverything runs entirely in memory against a tiny built-in book catalogue, so\nyou can read a module, run it, tweak it, and immediately see how the output\nchanges — no external services, no setup.\n\n---\n\n## Table of contents\n\n- [Requirements](#requirements)\n- [Running](#running)\n- [Running the tests](#running-the-tests)\n- [What each module covers](#what-each-module-covers)\n- [Architecture](#architecture)\n  - [End-to-end pipeline](#end-to-end-pipeline)\n  - 
[Indexing pipeline (write path)](#indexing-pipeline-write-path)\n  - [Searching pipeline (read path)](#searching-pipeline-read-path)\n  - [Inverted index — the core data structure](#inverted-index--the-core-data-structure)\n  - [Field-type decision matrix](#field-type-decision-matrix)\n  - [Project module map](#project-module-map)\n- [Three rules to remember](#three-rules-to-remember)\n- [Where to go next](#where-to-go-next)\n\n---\n\n## Requirements\n\n- Java 25+ (current LTS)\n- Maven 3.8+\n- Lucene 10.4.0 (declared in [pom.xml](pom.xml), pulled by Maven)\n\n## Running\n\nRun every module in order:\n\n```bash\nmvn -q compile exec:java\n```\n\nRun a single module by number (1–10):\n\n```bash\nmvn -q compile exec:java -Dexec.args=3\n```\n\nRun a few modules in sequence:\n\n```bash\nmvn -q compile exec:java -Dexec.args=\"1 3 7\"\n```\n\n## Running the tests\n\nEach module has a matching integration-test class under [src/test/java/com/example/lucene/](src/test/java/com/example/lucene/)\nthat builds a real in-memory index, runs real queries, and asserts on real results — no mocks.\n\n```bash\n# Run every test\nmvn -q test\n\n# Run a single test class\nmvn -q test -Dtest=Module03_QueryTypesIT\n\n# Run a single method\nmvn -q test -Dtest=Module03_QueryTypesIT#fuzzy_query\n```\n\nThe tests double as executable documentation: each `@DisplayName` describes the Lucene behaviour\nthe assertion locks in, so reading the test list is another way to learn what each module covers.\n\n## What each module covers\n\n| # | Module | What you'll learn |\n| - | - | - |\n| 1 | [Module01_HelloLucene.java](src/main/java/com/example/lucene/Module01_HelloLucene.java) | Directory, Analyzer, IndexWriter, IndexSearcher, TermQuery — the minimum viable pipeline. |\n| 2 | [Module02_FieldsAndAnalyzers.java](src/main/java/com/example/lucene/Module02_FieldsAndAnalyzers.java) | StringField vs TextField vs StoredField vs Point vs DocValues; how analyzers produce different tokens. 
|\n| 3 | [Module03_QueryTypes.java](src/main/java/com/example/lucene/Module03_QueryTypes.java) | TermQuery, PhraseQuery, BooleanQuery (MUST / SHOULD / MUST_NOT / FILTER), WildcardQuery, PrefixQuery, FuzzyQuery, RegexpQuery, numeric range queries. |\n| 4 | [Module04_QueryParser.java](src/main/java/com/example/lucene/Module04_QueryParser.java) | Lucene's classic query-string syntax, including MultiFieldQueryParser with per-field boosts. |\n| 5 | [Module05_Highlighting.java](src/main/java/com/example/lucene/Module05_Highlighting.java) | Generating snippet fragments with matched terms wrapped in HTML tags. |\n| 6 | [Module06_Faceting.java](src/main/java/com/example/lucene/Module06_Faceting.java) | Sidebar-style facet counts using FacetField + Taxonomy index. |\n| 7 | [Module07_SortingAndScoring.java](src/main/java/com/example/lucene/Module07_SortingAndScoring.java) | Sort by doc-values fields; FunctionScoreQuery to blend BM25 with a numeric signal. |\n| 8 | [Module08_UpdatesAndDeletes.java](src/main/java/com/example/lucene/Module08_UpdatesAndDeletes.java) | updateDocument by primary key, deleteDocuments by Term and Query, deleteAll. |\n| 9 | [Module09_CustomAnalyzer.java](src/main/java/com/example/lucene/Module09_CustomAnalyzer.java) | Building an Analyzer pipeline with stop-words, synonyms, stemming, edge n-grams, ASCII folding. |\n| 10 | [Module10_Suggester.java](src/main/java/com/example/lucene/Module10_Suggester.java) | AnalyzingInfixSuggester for fast autocomplete. 
|\n\n---\n\n## Architecture\n\n### End-to-end pipeline\n\nThe big picture: a domain object enters on the left, an index is built in the middle, and\nqueries flow back through the right.\n\n```text\n   ┌──────────────────────────────────────────────────────────────────────────┐\n   │                          WRITE PATH (indexing)                           │\n   └──────────────────────────────────────────────────────────────────────────┘\n\n   ┌────────────┐    field      ┌────────────┐   analyze    ┌────────────┐\n   │  Domain    │ ───mapping──▶ │  Document  │ ───tokens──▶ │  Analyzer  │\n   │  object    │               │  + Fields  │              │   chain    │\n   │ (Book,     │               │            │              │            │\n   │  Product…) │               └─────┬──────┘              └─────┬──────┘\n   └────────────┘                     │                           │\n                                      ▼                           ▼\n                              ┌──────────────────┐       ┌──────────────────┐\n                              │   IndexWriter    │◀──────│   Token stream   │\n                              │ (transactional,  │       │  + attributes    │\n                              │  one per index)  │       └──────────────────┘\n                              └────────┬─────────┘\n                                       │ flush / commit\n                                       ▼\n                              ┌──────────────────┐\n                              │     Directory    │   (FSDirectory, MMapDirectory,\n                              │  ┌────────────┐  │    ByteBuffersDirectory, …)\n                              │  │ segment_1  │  │\n                              │  │ segment_2  │  │   ── segments are immutable\n                              │  │ segment_3  │  │      and merged in the background\n                              │  └────────────┘  │\n                              └────────┬─────────┘\n                                   
    │\n   ┌───────────────────────────────────┼──────────────────────────────────────┐\n   │                                   ▼                READ PATH (searching) │\n   └───────────────────────────────────┬──────────────────────────────────────┘\n                                       │\n                              ┌────────▼─────────┐\n                              │  IndexReader     │   point-in-time snapshot\n                              │ (DirectoryReader │   of all segments\n                              │  .open(dir))     │\n                              └────────┬─────────┘\n                                       │\n                              ┌────────▼─────────┐       ┌────────────────────┐\n                              │  IndexSearcher   │◀──────│      Query         │\n                              │ (BM25Similarity, │       │ (TermQuery,        │\n                              │  collectors,     │       │  BooleanQuery,     │\n                              │  rewrites)       │       │  PhraseQuery, …)   │\n                              └────────┬─────────┘       └────────────────────┘\n                                       │\n                                       ▼\n                              ┌──────────────────┐\n                              │     TopDocs      │   ranked ScoreDoc[]\n                              │  (scores + ids)  │   + optional facets,\n                              │                  │     highlights, sorts\n                              └──────────────────┘\n```\n\n### Indexing pipeline (write path)\n\nInside `IndexWriter.addDocument(...)`, each `Field` flows through the analyzer chain, and the\nresulting tokens are recorded in postings, doc-values, points and stored fields — depending on\nwhich `FieldType` flags were set.\n\n```text\n   Document\n   ├── StringField \"id\"       ─────▶ exact-term postings        (no analysis)\n   ├── TextField \"title\"      ─────▶ Analyzer ─▶ tokens ─▶ postings\n   ├── TextField 
\"description\"─────▶ Analyzer ─▶ tokens ─▶ postings\n   ├── IntPoint  \"year\"       ─────▶ BKD tree                    (range queries)\n   ├── DoubleDocValuesField   ─────▶ columnar doc-values         (sort / facet / function)\n   ├── SortedDocValuesField   ─────▶ columnar doc-values         (sort / facet)\n   ├── FacetField \"Category\"  ─────▶ Taxonomy index              (facet counts)\n   └── StoredField \"raw\"      ─────▶ stored-fields blob          (retrieval only)\n\n                                  Analyzer chain\n                                  ───────────────\n   raw text ──▶ Tokenizer ──▶ TokenFilter ──▶ TokenFilter ──▶ … ──▶ indexed tokens\n                  ▲              ▲              ▲\n                  │              │              │\n                  │       LowerCaseFilter  StopFilter   SynonymGraphFilter   PorterStemFilter\n                  │\n              e.g. StandardTokenizer (Unicode word breaks)\n```\n\n### Searching pipeline (read path)\n\n```text\n   user input ──▶ QueryParser ──▶ Query tree ──▶ rewrite ──▶ Weight ──▶ Scorer\n                                                                          │\n                                                  per-segment iteration ──┘\n                                                                          │\n                                          BM25 score + Similarity         │\n                                                                          ▼\n                                                                  TopDocsCollector\n                                                                          │\n                                                                          ▼\n                                            ┌─────────────────────────────┴───┐\n                                            │ TopDocs (ScoreDoc[] + totalHits)│\n                                            └─────────────────────────────────┘\n                                                        
  │\n                            ┌─────────────────────────────┼─────────────────────────────┐\n                            ▼                             ▼                             ▼\n                  StoredFields (retrieval)        Highlighter (snippets)        Facets (counts)\n```\n\n### Inverted index — the core data structure\n\nThis is the idea every Lucene feature is built on: instead of storing \"doc → words\", Lucene\nflips it to \"word → docs\". Looking up a term is then a seek into a sorted term dictionary,\nwhose cost grows with the term's length rather than with the number of documents, followed by a\nwalk over its postings list.\n\n```text\n                           Forward (what we wrote)\n                           ───────────────────────\n   docId=1   \"Lucene in Action\"\n   docId=2   \"Effective Java\"\n   docId=3   \"Java Concurrency in Practice\"\n\n                           Inverted (what Lucene stores)\n                           ─────────────────────────────\n   term            postings list (docId → freq, positions, offsets)\n   ─────────       ───────────────────────────────────────────────\n   action     ──▶  [ (1, freq=1, pos=[2]) ]\n   concurrency──▶  [ (3, freq=1, pos=[1]) ]\n   effective  ──▶  [ (2, freq=1, pos=[0]) ]\n   in         ──▶  [ (1, freq=1, pos=[1]), (3, freq=1, pos=[2]) ]\n   java       ──▶  [ (2, freq=1, pos=[1]), (3, freq=1, pos=[0]) ]\n   lucene     ──▶  [ (1, freq=1, pos=[0]) ]\n   practice   ──▶  [ (3, freq=1, pos=[3]) ]\n\n   ▲ stored in segment files: .tim/.tip (term dictionary), .doc/.pos (postings)\n```\n\nA `TermQuery(\"java\")` walks the postings list under `java` → docs `[2, 3]`. A `PhraseQuery`\nalso walks positions to verify words appear adjacent. 
BM25 scoring uses the frequency and\nlength normalisation from this same index.\n\n### Field-type decision matrix\n\nA quick reference for which field type to pick for which purpose:\n\n| Need                        | Use this field                                                          |\n| --------------------------- | ----------------------------------------------------------------------- |\n| Exact-match on an ID/code   | `StringField`                                                           |\n| Full-text search            | `TextField` (with the right Analyzer)                                   |\n| Just return it with the hit | `StoredField`                                                           |\n| Numeric range query         | `IntPoint` / `LongPoint` / `DoublePoint`                                |\n| Sort or facet               | `SortedDocValuesField`, `NumericDocValuesField`, `DoubleDocValuesField` |\n| Facet counts (taxonomy)     | `FacetField` (+ `FacetsConfig.build(...)`)                              |\n| Autocomplete                | feed source into `AnalyzingInfixSuggester`                              |\n\n\u003e One logical field often becomes 2–3 Lucene fields. 
For example, `year` is usually\n\u003e `IntPoint` (range query) + `NumericDocValuesField` (sort) + `StoredField` (retrieval).\n\n### Project module map\n\nHow the 10 modules fit on the architecture diagram:\n\n```text\n                ┌─────────────────────────────────────────────────────────────┐\n                │                       INDEX BUILDING                        │\n                │                                                             │\n                │   Module 1  Hello Lucene        Module 8  Update/Delete    │\n                │   Module 2  Field types         Module 9  Custom Analyzer  │\n                │   Module 6  Facet indexing                                  │\n                └────────────────────────┬────────────────────────────────────┘\n                                         │\n                                         ▼\n                ┌─────────────────────────────────────────────────────────────┐\n                │                          QUERYING                           │\n                │                                                             │\n                │   Module 3  Query types         Module 4  QueryParser      │\n                │   Module 7  Sort / function     Module 10 Suggester        │\n                └────────────────────────┬────────────────────────────────────┘\n                                         │\n                                         ▼\n                ┌─────────────────────────────────────────────────────────────┐\n                │                    POST-PROCESSING                          │\n                │                                                             │\n                │   Module 5  Highlighting        Module 6  Facet counts     │\n                └─────────────────────────────────────────────────────────────┘\n```\n\n---\n\n## Three rules to remember\n\n1. **Field type decides what queries are possible.** A field that is not indexed\n   cannot be searched. 
A field that is not stored cannot be retrieved.\n   Sorting and faceting need a doc-values flavour of the field.\n2. **Use the same Analyzer for indexing and searching.** Otherwise your query\n   terms won't match the tokens you wrote to the index. Module 2 makes this\n   obvious by showing the token output of four analyzers side-by-side.\n3. **Documents are immutable.** \"Update\" means delete + add, keyed off a unique\n   field. See Module 8.\n\n## Where to go next\n\n- The official [Lucene 10.4.0 demo](https://lucene.apache.org/core/10_4_0/demo/index.html)\n  shows indexing of real files from disk.\n- [Lucene's MIGRATE.md](https://github.com/apache/lucene/blob/main/lucene/MIGRATE.md)\n  is the best place to see what changes between major versions (e.g. 9.x → 10.x removed the\n  static `FacetsCollector.search(...)` helper in favour of `FacetsCollectorManager`, used in\n  Module 6 of this project).\n- Real-world systems built on Lucene worth studying: Elasticsearch, OpenSearch,\n  Solr — they reuse the APIs you've practised in this project.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgsaini%2Flucene-by-example","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgsaini%2Flucene-by-example","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgsaini%2Flucene-by-example/lists"}