# Local File Search using Embeddings

Scripts to replicate simple file search and RAG in a directory with embeddings and Language Models.

A from-scratch implementation, ***no Vector DBs yet.***

***A simplified use case***: You have thousands of research papers but don't know which ones contain the content you want. You search with a rough query and get adequately good results.

https://github.com/user-attachments/assets/0b1bfc91-868b-4aa7-80ba-9e0a730c4b4b
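
The core idea is straightforward: embed every file (or chunk of a file) once, embed the query at search time, and rank by cosine similarity. A minimal sketch of that loop with [Sentence Transformers](https://www.sbert.net/) (the corpus and variable names below are illustrative, not the repository's actual code):

```
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy corpus: one string per file (the real scripts use chunked file contents).
docs = [
    "Attention-based neural machine translation.",
    "Convolutional networks for image classification.",
    "Bayesian optimization of hyperparameters.",
]

doc_embeddings = model.encode(docs, convert_to_tensor=True)
query_embedding = model.encode("papers about transformers", convert_to_tensor=True)

# Rank documents by cosine similarity to the query.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx].item():.3f}  {docs[idx]}")
```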

## Setup

Before moving forward with any of the installation steps, ensure that you have CUDA > 12.1 installed globally on your system. This is necessary for building Flash Attention.
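
A quick way to confirm which CUDA version your PyTorch build sees (assuming PyTorch is already installed):

```
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"
```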

### Ubuntu

Run the following in terminal in your preferred virtual/conda environment.

```
sh setup.sh
```

This installs the requirements from the `requirements.txt` file.

### Windows

Install the required version of PyTorch first, preferably the latest stable supported by this repository. This is a necessary step to build Flash Attention correctly on Windows.

```
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1
```

Next, install the rest of the requirements.

```
pip install -r requirements.txt
```

## Updates

* September 4, 2024: Added image, PDF, and text file chat to `app.py` with multiple Phi model options.

* September 1, 2024: Now you can upload PDFs directly to the Gradio UI (`python app.py`) and start chatting.

## Steps to Chat with Any PDF in the Gradio UI

***Run `app.py` and select any PDF file in the Gradio UI to interactively chat with the document: just execute `python app.py` and start chatting.***
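
For a rough idea of how such a UI is wired together, a stripped-down Gradio chat loop looks like the sketch below. The `answer_from_pdf` function is a hypothetical placeholder for the repository's retrieval and LLM pipeline, not its actual code:

```
import gradio as gr

def answer_from_pdf(message, history):
    # Hypothetical placeholder: the real app embeds the query, retrieves
    # matching chunks from the uploaded document, and answers with an LLM.
    return f"You asked: {message}"

gr.ChatInterface(fn=answer_from_pdf, title="Chat with your PDF").launch()
```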

## Steps to Run Through CLI

* (**Optional**) Download the `papers.csv` file from [here](https://www.kaggle.com/datasets/benhamner/nips-papers?select=papers.csv) and keep it in the `data` directory. **You can also keep PDF files in the directory and pass the directory path**.

* (**Optional**) *Execute this step only if you downloaded the above CSV file; it is not needed if you have your own text files or PDFs in a directory*. Run the `csv_to_text_files.py` script to generate a directory of text files from the CSV file, as sketched below.
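
The conversion itself is simple; a minimal equivalent with pandas would look like this (it assumes the Kaggle CSV's `title` and `paper_text` columns, and the output path is illustrative):

```
import os

import pandas as pd

df = pd.read_csv("data/papers.csv")
os.makedirs("data/paper_files", exist_ok=True)

# Write each paper out as its own text file.
for i, row in df.iterrows():
    with open(f"data/paper_files/{i}.txt", "w", encoding="utf-8") as f:
        f.write(f"{row['title']}\n\n{row['paper_text']}")  # assumed column names
```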

* Run `create_embeddings.py` to generate the embeddings, which are stored in a JSON file in the `data` directory. Check the scripts for the respective file names. ***Check `src/create_embeddings.py`*** for the relevant command line arguments to be passed; a conceptual sketch of the script's inner loop follows the argument list below.

* Example invocation:

```
python create_embeddings.py --index-file-name index_file_to_store_embeddings.json --directory-path path/to/directory/containing/files/to/embed
```

* Additional command line arguments:

* `--add-file-content`: Store the text chunks in the JSON file; required if you plan to do RAG during file search.
* `--model`: Any [Sentence Transformer](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html) model tag. Default is `all-MiniLM-L6-v2`.
* `--chunk-size` and `--overlap`: Chunk size for creating embeddings and overlap between chunks.
* `--njobs`: Number of parallel processes to use. Useful when creating embeddings for hundreds of files in a directory.
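
Conceptually, the script splits each file into overlapping chunks, embeds every chunk, and writes the results to a JSON index. A hedged sketch of that inner loop (the JSON schema and file paths are illustrative, not the repository's exact format):

```
import json

from sentence_transformers import SentenceTransformer

def chunk_text(words, chunk_size=512, overlap=50):
    """Yield overlapping windows of words."""
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        yield " ".join(words[start:start + chunk_size])

model = SentenceTransformer("all-MiniLM-L6-v2")
index = []

for path in ["data/paper_files/0.txt"]:  # iterate over your directory here
    words = open(path, encoding="utf-8").read().split()
    for chunk in chunk_text(words):
        index.append({
            "file": path,
            "content": chunk,  # only kept when --add-file-content is passed
            "embedding": model.encode(chunk).tolist(),
        })

with open("data/index_file_to_store_embeddings.json", "w") as f:
    json.dump(index, f)
```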

* Then run `search.py` with the path to the respective embedding file to start the search, and type in a search query. A sketch of the search step follows the argument list below.

* General example:

```
python search.py --index-file path/to/index.json
```

The above command prints a list of the top-K files that match the query.

* Additional command line arguments:

* `--extract-content`: Whether to print the related content or not. Only works if `--add-file-content` was passed during creation of embeddings.
* `--model`: Sentence Transformer model tag if a model other than `all-MiniLM-L6-v2` was used during the creation of embeddings.
* `--topk`: Top K embeddings to match and output to the user.
* `--llm-call`: Use an LLM to restructure the answer to the question asked. Only works if `--extract-content` is passed, as the model needs context. Currently the Phi-3 Mini 4K model is used.
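
At search time the process is inverted: load the JSON index, embed the query with the same model, and rank chunks by cosine similarity. A minimal sketch (reusing the illustrative schema from the embedding sketch above, not necessarily the repository's):

```
import json

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
with open("data/index_file_to_store_embeddings.json") as f:
    index = json.load(f)

embeddings = np.array([item["embedding"] for item in index])
query = model.encode("attention mechanisms in translation")

# Cosine similarity is the dot product of L2-normalized vectors.
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
query /= np.linalg.norm(query)
scores = embeddings @ query

for i in scores.argsort()[::-1][:3]:  # top-3 matches
    print(f"{scores[i]:.3f}  {index[i]['file']}")
```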

## Datasets

* [NIPS Research Papers](https://www.kaggle.com/datasets/benhamner/nips-papers?select=papers.csv)