https://github.com/shubham0204/full-text-search
Full Text Search built for PDFs, DOCX using Inverted Index in Java
- Host: GitHub
- URL: https://github.com/shubham0204/full-text-search
- Owner: shubham0204
- License: apache-2.0
- Created: 2024-03-13T02:31:17.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-04-12T01:55:45.000Z (over 1 year ago)
- Last Synced: 2025-01-23T13:43:54.188Z (9 months ago)
- Topics: full-text-search, information-retrieval, java
- Language: Java
- Homepage:
- Size: 124 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Funding: .github/FUNDING.yml
- License: LICENSE
# Full Text Search On Local Files With Inverted Index

## Demo

## Features
* Text extraction from PDFs, Microsoft Word DOCX and text-based formats
* Disk-persistence of inverted index
* Validation of inverted index
* Command-line utility

## Setup
Make sure Java is installed on your system, with `JAVA_HOME` pointing to a JDK installation.
Clone the project from the GitHub repository and build it with the `gradlew` wrapper script present in the root of the
repository,
```
$> git clone https://github.com/shubham0204/full-text-search
$> cd full-text-search
$> ./gradlew build
```

To execute the tests,
```
$> ./gradlew test
```

To build the fat/uber JAR,
```
$> ./gradlew shadowJar
```

## Usage
### Index
```
$> java -jar fulltextsearch.jar index build [dir]
$> java -jar fulltextsearch.jar index info [dir]
$> java -jar fulltextsearch.jar index rm [dir]
```

Use `fulltextsearch index --help` for a description of each command.
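
As a rough sketch of what building and querying an inverted index involves, the class below maps each token to the set of files that contain it and answers a query by intersecting those sets (AND semantics). The class name `InvertedIndexSketch`, its methods, the tokenization rule and the in-memory `Map` are assumptions made for illustration and are not taken from this repository. Text extraction from PDF/DOCX files and disk persistence are left out.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: a hypothetical class, not the repository's actual code. */
public class InvertedIndexSketch {

    // token -> names of the documents that contain it
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Lowercases the text and splits it on runs of non-alphanumeric characters. */
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    /** Records every token of the (already extracted) text under docName. */
    public void add(String docName, String text) {
        for (String token : tokenize(text)) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docName);
            }
        }
    }

    /** Returns the documents that contain all tokens of the query. */
    public Set<String> search(String query) {
        Set<String> result = null;
        for (String token : tokenize(query)) {
            if (token.isEmpty()) {
                continue;
            }
            Set<String> docs = postings.getOrDefault(token, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);   // first token: start from its postings
            } else {
                result.retainAll(docs);         // later tokens: intersect
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("notes.txt", "Inverted index maps each word to documents");
        index.add("report.pdf", "Full text search over an inverted index");
        System.out.println(index.search("inverted index")); // [notes.txt, report.pdf]
        System.out.println(index.search("full text"));      // [report.pdf]
    }
}
```

Keeping the postings as sorted sets makes the intersection easy to read and the output deterministic. A real index would typically also store term positions or frequencies to support ranking.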
### Query
```
$> fulltextsearch query [dir]
```

## Dependencies
* [Apache PdfBox](https://pdfbox.apache.org/)
* [Apache POI](https://poi.apache.org/)
* [picocli](https://picocli.info/)
* [shadow](https://github.com/johnrengelman/shadow)

## Useful Resources
* [Wikipedia - Full Text Search](https://en.wikipedia.org/wiki/Full-text_search)
* [Let's build a Full-Text Search engine](https://artem.krylysov.com/blog/2020/07/28/lets-build-a-full-text-search-engine/)
* [Building a full-text search engine in 150 lines of Python code](https://bart.degoe.de/building-a-full-text-search-engine-150-lines-of-code/)
* [Building a Full Text Search Engine](https://blog.quastor.org/p/building-full-text-search-engine)
* [How to Implement Inverted Index Data Structure in Java](https://taruntelang.medium.com/how-to-implement-inverted-index-data-structure-in-java-14067093acd4)
* [Reddit discussion on `full-text-search`](https://www.reddit.com/r/learnjava/comments/1bs8v5w/project_full_text_search_on_local_files_with/)