https://github.com/abdullahwaqar/docsearx

A simple search engine that ranks pdfs based on search keyword & TF-IDF weights and cosine similarity.

Host: GitHub
URL: https://github.com/abdullahwaqar/docsearx
Owner: abdullahwaqar
License: apache-2.0
Created: 2019-12-31T10:25:22.000Z (almost 5 years ago)
Default Branch: master
Last Pushed: 2023-03-02T10:36:25.000Z (over 1 year ago)
Last Synced: 2023-03-05T15:52:15.400Z (over 1 year ago)
Topics: inmemory, nltk, search-engine
Language: Python
Homepage:
Size: 1.95 MB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 15
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# docsearcx | A minimal document search engine.
![alt text](https://github.com/abdullahwaqar/docsearx/blob/master/docs/Screenshot.png "Web App Screenshot")

docsearcx is a simple search engine that ranks ***PDFs*** against a search keyword, using term frequency-inverse document frequency (TF-IDF) weights and cosine similarity to retrieve the most relevant documents.
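
As a rough illustration of the cosine-similarity step, the sketch below ranks documents against a query when both are represented as sparse `{term: weight}` dictionaries. It is not taken from this project's code: the function names `cosine_similarity` and `rank`, the file names, and the hand-picked weights are all hypothetical stand-ins for real TF-IDF vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_a * norm_b)


def rank(query_vec, doc_vecs):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = {name: cosine_similarity(query_vec, vec)
              for name, vec in doc_vecs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical, hand-picked weights standing in for real TF-IDF vectors.
doc_vecs = {
    "a.pdf": {"search": 0.8, "engine": 0.6},
    "b.pdf": {"cooking": 0.9, "recipes": 0.4},
    "c.pdf": {"search": 0.3, "index": 0.7},
}
query_vec = {"search": 1.0}

for name, score in rank(query_vec, doc_vecs):
    print(f"{name}: {score:.3f}")
```

Because cosine similarity normalizes by vector length, a long document is not favored merely for containing more words; only the direction of its weight vector relative to the query matters.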

## Limitation
Because this is a proof of concept (POC), the application relies on in-memory storage.

---

## Setup
### Installing Pipenv
If pipenv is already installed, skip this step.

```pip install pipenv```

### Installing Dependencies

```pipenv install```

Then activate the virtual environment shell:

```pipenv shell```

### Running the Flask app

```python app.py```

### Running Client

```
cd client/

npm install

npm run serve
```

---

### Term Frequency-Inverse Document Frequency
TF-IDF is a numerical statistic that reflects how important a word is to a document in a corpus. The TF-IDF value increases proportionally to the number of times a word appears in the document, but is offset by the number of documents in the corpus that contain the word, which helps to control for the fact that some words are generally more common than others.
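
The weighting described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not this project's implementation: it assumes whitespace tokenization, relative term frequency, and the unsmoothed `log(N / df)` form of inverse document frequency, and the `tf_idf` helper name is hypothetical. Real systems often use a smoothed variant and a proper tokenizer.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # tf = relative frequency in this document; idf = log(N / df).
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus; whitespace splitting is an assumption made for brevity.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

for doc_weights in tf_idf(docs):
    top = sorted(doc_weights.items(), key=lambda kv: kv[1], reverse=True)[:2]
    print(top)
```

In this toy corpus, "the" occurs in every document, so its inverse document frequency is `log(3 / 3) = 0` and its weight vanishes no matter how often it appears, while a word such as "mat" that occurs in only one document receives the highest weight there.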