https://github.com/dtrckd/docsearch

Full text search in your pdf documents.

cli search-engine

Last synced: about 1 year ago

Host: GitHub
URL: https://github.com/dtrckd/docsearch
Owner: dtrckd
Created: 2019-12-14T12:43:50.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2022-06-02T11:23:28.000Z (about 4 years ago)
Last Synced: 2025-04-10T14:07:14.885Z (over 1 year ago)
Topics: cli, search-engine
Language: Python
Homepage:
Size: 20.5 KB
Stars: 4
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Docsearch

**Docsearch** is a self-hosted search engine for your PDF documents, based on [pmk](https://github.com/dtrckd/pymake) and [Whoosh](https://github.com/mchaput/whoosh).

# Install

You may need the following packages: `apt-get install poppler-utils default-jdk`

Clone the repository, enter its directory, and run:

    make setup

# Overview

This repo provides a **search engine** experience that follows the pymake design pattern.

The context of this [Pymake](https://github.com/dtrckd/pymake) program is as follows:

* **Data**: the documents to search are PDF files (articles, for example) located on your hard drive.

* **Model**: a BM25 ranking model, which assumes a bag-of-words representation of each document.

* **Script**: There are two scripts:

    + a fit script that builds the index of the *input data*,

    + a search script that returns relevant documents, given a *query*.

* **Spec**: experiment specs are defined individually for each script, in the `_default_expe` attribute at the top of the script's class.
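
To make the **Model** item above concrete, here is a minimal, pure-Python sketch of BM25 scoring over a bag-of-words representation. It is not Docsearch's implementation (the project relies on Whoosh for indexing and ranking); the function names, the whitespace tokenizer, the smoothed IDF variant, and the `k1` and `b` values are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query string."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)  # bag of words: term order is ignored
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF variant that stays positive for frequent terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "bayesian inference for topic models",
    "stochastic block models for networks",
    "a recipe for sourdough bread",
]
scores = bm25_scores("topic models", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

The first document matches both query terms and therefore ranks highest; the third matches none and scores zero. The sketch assumes a non-empty document list.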

# Usage

#### The "fit" script

First, index your PDF documents (this can take a while, so grab a coffee):

    pmk -x fit --path path/to/your/pdfs/   

Alternatively, you can turn on feature extraction (much slower), which uses [CERMINE](http://cermine.ceon.pl):

    # Index while trying to extract the title, authors and publication date.
    pmk -x fit --path path/to/your/pdfs/ --extract-feature
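
Before anything can be indexed, the text has to be pulled out of each PDF. The sketch below shows one plausible way to do that with `pdftotext`, which ships with the `poppler-utils` package listed under Install. Whether the fit script works exactly this way is an assumption, and the helper names (`find_pdfs`, `pdf_to_text`, `collect_texts`) are hypothetical.

```python
import subprocess
from pathlib import Path


def find_pdfs(root):
    """Return all PDF files under root (recursively), sorted for a stable order."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def pdf_to_text(pdf_path):
    """Extract plain text with pdftotext; the "-" argument writes to stdout."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def collect_texts(root):
    """Map each PDF path to its extracted text, skipping files that fail."""
    texts = {}
    for pdf in find_pdfs(root):
        try:
            texts[str(pdf)] = pdf_to_text(pdf)
        except (subprocess.CalledProcessError, FileNotFoundError):
            # Unreadable PDF, or pdftotext is not installed.
            continue
    return texts
```

Calling `collect_texts("path/to/your/pdfs/")` would return a dictionary ready to be fed to an indexer. Skipping failed files keeps one corrupt PDF from aborting a long indexing run.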

#### The "search" script

Then, search your documents based on their text content:

    pmk -x search "your text search request"

Show only the first match:

    pmk -x search "your text search request" --limit 1
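
The effect of `--limit` can be pictured as keeping only the highest-scoring matches. The following sketch illustrates that idea on a plain dictionary of scores; it is not how the search script is implemented, and `top_matches` and the sample scores are hypothetical.

```python
import heapq


def top_matches(scored_docs, limit=None):
    """Return (path, score) pairs by decreasing score, at most `limit` of them."""
    # Documents with a zero score did not match the query at all.
    ranked = [(path, score) for path, score in scored_docs.items() if score > 0]
    if limit is None:
        return sorted(ranked, key=lambda item: item[1], reverse=True)
    # nlargest avoids sorting the full list when only a few results are needed.
    return heapq.nlargest(limit, ranked, key=lambda item: item[1])


doc_scores = {"a.pdf": 0.4, "b.pdf": 2.1, "c.pdf": 0.0, "d.pdf": 1.3}
print(top_matches(doc_scores, limit=1))  # [('b.pdf', 2.1)]
```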

# Pymake tips

List information about the project:

* What experiments are there: `pmk -l spec`

* What models are there: `pmk -l model`

* What scripts are there: `pmk -l script`

* Show the signatures of the methods in a script (here the 'ir' script): `pmk -l --script ir`

To add new models, new scripts, or specs, you need to create them in the dedicated folder following the base class implementations.
