https://github.com/sovit-123/local_file_search
Local file search using embedding techniques
- Host: GitHub
- URL: https://github.com/sovit-123/local_file_search
- Owner: sovit-123
- License: mit
- Created: 2024-07-08T17:33:43.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-02T11:20:42.000Z (7 months ago)
- Last Synced: 2025-03-25T08:38:15.568Z (7 months ago)
- Topics: embedding-models, embeddings, nlp, vector-search
- Language: Python
- Homepage:
- Size: 110 KB
- Stars: 6
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Local File Search using Embeddings
Scripts to perform simple file search and RAG over a directory of files using embeddings and language models.
A from-scratch implementation, ***no vector DBs yet.***
***A simplified use case***: You have thousands of research papers but don't know which ones contain the content you are looking for. You run a search with a rough query and get adequately good results.
https://github.com/user-attachments/assets/0b1bfc91-868b-4aa7-80ba-9e0a730c4b4b
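At its core, the search works by embedding the files once, embedding the query at search time, and ranking files by cosine similarity. Below is a minimal sketch of that idea using Sentence Transformers; the document list and variable names are illustrative and do not mirror the repository's actual code.
```
from sentence_transformers import SentenceTransformer, util

# Illustrative corpus; the repository builds this from files in a directory.
documents = [
    "Attention-based models for neural machine translation.",
    "Convolutional networks for image classification.",
    "Reinforcement learning for robotic control.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # default model in this repository
doc_embeddings = model.encode(documents, convert_to_tensor=True)

query = "papers about attention and transformers"
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank documents by cosine similarity to the query and print the best matches.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
for idx in scores.argsort(descending=True)[:2].tolist():
    print(f"{scores[idx].item():.3f}  {documents[idx]}")
```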
## Setup
Before moving forward with any of the installation steps, ensure that CUDA > 12.1 is installed globally on your system. This is necessary for building Flash Attention.
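You can check the globally installed toolkit with `nvcc --version`. The snippet below is a generic sanity check of what PyTorch itself sees and is not part of this repository:
```
import torch

# CUDA version PyTorch was built against and whether a GPU is visible.
# Note: Flash Attention compiles against the globally installed CUDA toolkit
# (nvcc), which can differ from the version bundled with PyTorch.
print(torch.version.cuda)
print(torch.cuda.is_available())
```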
### Ubuntu
Run the following in terminal in your preferred virtual/conda environment.
```
sh setup.sh
```
This will install the requirements from the `requirements.txt` file.
### Windows
Install the required version of PyTorch first, preferably the latest stable supported by this repository. This is a necessary step to build Flash Attention correctly on Windows.
```
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1
```
Next, install the rest of the requirements.
```
pip install -r requirements.txt
```
## Updates
* September 4, 2024: Added image, PDF, and text file chat to `app.py` with multiple Phi model options.
* September 1, 2024: Now you can upload PDFs directly to the Gradio UI (`python app.py`) and start chatting.
## Steps to Chat with Any PDF in the Gradio UI
***You can run `app.py` and select any PDF file in the Gradio UI to interactively chat with the document.*** ***(Just execute `python app.py` and start chatting)***
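For context, this kind of chat UI is typically a thin Gradio wrapper around a chat callback that does retrieval plus an LLM call. The sketch below is a generic illustration, assuming only that `gradio` is installed; it is not the repository's actual `app.py`.
```
import gradio as gr

def respond(message, history):
    # Placeholder: the real app would answer using chunks retrieved from the
    # selected PDF (or image/text file) plus an LLM such as Phi-3 Mini.
    return f"(stub) You asked: {message}"

# gr.ChatInterface wraps a chat loop around the callback; launch() starts the UI.
gr.ChatInterface(respond).launch()
```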
## Steps to Run Through CLI
* (**Optional**) Download the `papers.csv` file from [here](https://www.kaggle.com/datasets/benhamner/nips-papers?select=papers.csv) and keep it in the `data` directory. **You can also keep PDF files in the directory and pass the directory path**.
* (**Optional**) *Execute this step only if you download the above CSV file. Not needed if you have your own text files or PDFs in a directory*. Run the `csv_to_text_files.py` script to generate a directory of text files from the CSV file.
* Run `create_embeddings.py` to generate the embeddings, which are stored in a JSON file in the `data` directory. Check the scripts for the respective file names. ***Check `src/create_embedding.py`*** for the relevant command line arguments to pass.
* Embedding generation example:
```
python create_embeddings.py --index-file-name index_file_to_store_embeddings.json --directory-path path/to/directory/containing/files/to/embed
```
* Additional command line arguments:
* `--add-file-content`: Store the text chunks in the JSON file if you plan to do RAG while doing file search.
* `--model`: Any [Sentence Transformer](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html) model tag. Default is `all-MiniLM-L6-v2`.
* `--chunk-size` and `--overlap`: Chunk size for creating embeddings and overlap between chunks.
* `--njobs`: Number of parallel processes to use. Useful when creating embeddings for hundreds of files in a directory.
* Then run `search.py` with the path to the respective embedding file to start the search and type in the search query (a hedged sketch of this index-and-search flow appears after this list).
* General example:
```
python search.py --index-file path/to/index.json
```
The above command simply outputs a list of the top-K files that match the query.
* Additional command line arguments:
* `--extract-content`: Whether to print the related content or not. Only works if `--add-file-content` was passed during creation of embeddings.
* `--model`: Sentence Transformer model tag if a model other than `all-MiniLM-L6-v2` was used during the creation of embeddings.
* `--topk`: Top K embeddings to match and output to the user.
* `--llm-call`: Use an LLM to restructure the answer for the question asked. Only works if `--extract-content` is passed as the model will need context. Currently the Phi-3 Mini 4K model is used.
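To make the flags above concrete, here is a hedged end-to-end sketch of the flow the two scripts implement: chunk text with overlap, embed the chunks, store them in a JSON index, then rank the chunks against a query. The file names, JSON keys, and helper functions here are assumptions for illustration, not the repository's exact format.
```
import json

import torch
from sentence_transformers import SentenceTransformer, util

def chunk_text(text, chunk_size=128, overlap=16):
    # Word-level chunks with overlap, mirroring --chunk-size / --overlap.
    words = text.split()
    step = max(chunk_size - overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")

# --- Indexing (conceptually what create_embeddings.py does) ---
text = open("data/some_paper.txt", encoding="utf-8").read()  # hypothetical input file
index = [
    {"file": "data/some_paper.txt", "chunk": chunk, "embedding": model.encode(chunk).tolist()}
    for chunk in chunk_text(text)
]
with open("data/index.json", "w") as f:  # hypothetical index file name
    json.dump(index, f)

# --- Searching (conceptually what search.py does) ---
index = json.loads(open("data/index.json").read())
embeddings = torch.tensor([entry["embedding"] for entry in index])
query_embedding = model.encode("attention mechanisms for sequence models")
scores = util.cos_sim(query_embedding, embeddings)[0]
for i in scores.argsort(descending=True)[:3].tolist():  # top-k results
    print(f"{scores[i].item():.3f}  {index[i]['file']}")
```
In the actual scripts the index is built once and reused across searches, which is why `search.py` only needs the path to the JSON index file.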
## Datasets
* [NIPS Research Papers](https://www.kaggle.com/datasets/benhamner/nips-papers?select=papers.csv)