https://github.com/dtrckd/docsearch
Full text search in your pdf documents.
cli search-engine
Last synced: 12 months ago
- Host: GitHub
- URL: https://github.com/dtrckd/docsearch
- Owner: dtrckd
- Created: 2019-12-14T12:43:50.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2022-06-02T11:23:28.000Z (about 4 years ago)
- Last Synced: 2025-04-10T14:07:14.885Z (about 1 year ago)
- Topics: cli, search-engine
- Language: Python
- Homepage:
- Size: 20.5 KB
- Stars: 4
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Docsearch
**Docsearch** is a self-hosted search engine for your PDF documents, based on [pmk](https://github.com/dtrckd/pymake) and [Whoosh](https://github.com/mchaput/whoosh).
# Install
You may need the following packages: `apt-get install poppler-utils default-jdk`

Clone the repo, enter its directory, and run:

    make setup

# Overview
This repo provides a **search engine** experience that follows the pymake design pattern.
The context of this [Pymake](https://github.com/dtrckd/pymake) program is as follows:
* **Data**: the documents to search are PDF files (articles, for example) located on your hard drive.
* **Model**: a BM25 model, which assumes a bag-of-words representation of the documents.
* **Script**: there are two scripts:
  + a fit script that builds the index of the *input data*,
  + a search script that returns relevant documents, given a *query*.
* **Spec**: experiment specs are defined individually for each script, in the `_default_expe` attribute in the class headers.
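
To make the **Model** item more concrete, here is a minimal, self-contained sketch of BM25 scoring over a bag-of-words representation. It is not the code used by Docsearch or Whoosh: the function names `tokenize` and `bm25_scores`, the whitespace tokenizer, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace: a deliberately naive bag-of-words tokenizer."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query.

    k1 and b are common textbook defaults, not values taken from Docsearch.
    """
    docs = [tokenize(d) for d in documents]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            n_t = doc_freq[term]
            # Smoothed inverse document frequency (the "+ 1" keeps it positive).
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            tf = term_freq[term]
            # Term-frequency saturation with document-length normalization.
            norm = tf + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores


documents = [
    "full text search in pdf documents",
    "a recipe for apple pie",
    "search engines rank documents by relevance",
]
scores = bm25_scores("pdf search", documents)
best = max(range(len(documents)), key=scores.__getitem__)
print(documents[best])
```

Word order is ignored: only term counts, document length, and how rare a term is across the collection enter the score, which is what the bag-of-words assumption means in practice.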
# Usage
#### The "fit" script
First, index your PDF documents (grab a coffee):

    pmk -x fit --path path/to/your/pdfs/

Alternatively, you can turn on feature extraction (much slower), using [cermine](http://cermine.ceon.pl):

    # Index while trying to extract the title, authors and publication date.
    pmk -x fit --path path/to/your/pdfs/ --extract-feature

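
Before a PDF can be indexed, its text has to be extracted. The sketch below shows one way this could be done with the `pdftotext` tool from `poppler-utils` (one of the packages listed in the install step). Whether the fit script proceeds exactly this way is an assumption, and the helper names `find_pdfs`, `pdftotext_command` and `extract_text` are hypothetical.

```python
import subprocess
from pathlib import Path


def find_pdfs(root):
    """Recursively list the PDF files under root, sorted for a stable order."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def pdftotext_command(pdf_path):
    """Build the pdftotext command line; the trailing "-" sends the text to stdout."""
    return ["pdftotext", "-enc", "UTF-8", str(pdf_path), "-"]


def extract_text(pdf_path):
    """Run pdftotext (from poppler-utils) and return the extracted text."""
    result = subprocess.run(
        pdftotext_command(pdf_path), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_text(pdf)` for each `pdf` in `find_pdfs("path/to/your/pdfs/")` would yield the raw text that an indexer could then consume. It requires `pdftotext` to be on the `PATH`.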
#### The "search" script
Then, search documents based on their text content:

    pmk -x search "your text search request"

Show only the first match:

    pmk -x search "your text search request" --limit 1

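
The split between indexing once and querying many times, and the effect of a result limit, can be illustrated with a toy inverted index. This is only a sketch under simplifying assumptions: the functions `build_index` and `search` are hypothetical, and documents are ranked by the number of distinct query terms they contain rather than by BM25.

```python
from collections import Counter, defaultdict


def build_index(documents):
    """'fit' step: map each term to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for term in text.lower().split():
            index[term].add(name)
    return index


def search(index, query, limit=None):
    """'search' step: rank documents by how many distinct query terms they contain."""
    hits = Counter()
    for term in set(query.lower().split()):
        for name in index.get(term, ()):
            hits[name] += 1
    # Sort by descending match count, then by name so ties are deterministic.
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    # limit=None keeps every match; limit=1 keeps only the best one.
    return [name for name, _ in ranked[:limit]]


documents = {
    "a.pdf": "full text search in pdf documents",
    "b.pdf": "search engines rank documents",
    "c.pdf": "apple pie recipe",
}
index = build_index(documents)
print(search(index, "text search"))           # ['a.pdf', 'b.pdf']
print(search(index, "text search", limit=1))  # ['a.pdf']
```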
# Pymake tips
List information about the project:
* What experiments are there: `pmk -l spec`
* What models are there: `pmk -l model`
* What scripts are there: `pmk -l script`
* Show the signatures of the methods in a script (here, the 'ir' script): `pmk -l --script ir`

To add new models, scripts, or specs, create them in the dedicated folder, following the base class implementations.