https://github.com/j-sephb-lt-n/semantic-search-engine
(abandoned) Tool for searching for passages within a document
embeddings lancedb rag search semantic-search
- Host: GitHub
- URL: https://github.com/j-sephb-lt-n/semantic-search-engine
- Owner: J-sephB-lt-n
- License: gpl-3.0
- Created: 2024-06-27T12:36:55.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-07-14T21:01:09.000Z (about 1 year ago)
- Last Synced: 2024-12-28T00:21:07.891Z (10 months ago)
- Topics: embeddings, lancedb, rag, search, semantic-search
- Language: Python
- Homepage:
- Size: 185 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# semantic-search-engine
I'm abandoning this repo in order to pursue this one instead:
Tool for searching for passages within a document
```bash
pdftotext TheEffectiveExecutive.pdf input_docs/TheEffectiveExecutive.txt
python -m steps.chunk_input # chunked input written to /chunked_input/
python -m observe.chunk_stats
python -m observe.view_random_chunks 0
python -m steps.create_lance_db
```
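
To illustrate the kind of work a chunking step does, here is a minimal sketch of one common strategy: fixed-size word windows with overlap. This is not the code in `steps.chunk_input` (which is not reproduced here); the function name `chunk_text` and its parameters `chunk_size` and `overlap` are hypothetical and chosen only for this example.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks of at most `chunk_size` words,
    where consecutive chunks share `overlap` words.

    Hypothetical helper for illustration; not taken from this repo.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk is a pure subset of the previous one.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_text(sample, chunk_size=4, overlap=2):
        print(chunk)
```

Overlap trades index size for recall: a passage that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some words twice.
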
Note about cached Hugging Face models: the following opens an interactive UI for deleting models that are no longer needed:
```bash
pip install "huggingface_hub[cli]"
huggingface-cli delete-cache
```

# TODO
- Investigate different chunking strategies
- Investigate ANN, indexing, distance metrics, etc. in LanceDB
- Implement batch data insert into semantic database
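
Two of the items above (distance metrics and batch insert) can be sketched in plain Python. This is only an illustration under stated assumptions, not code from this repo: `batched`, `l2_distance`, and `cosine_distance` are hypothetical helper names, and the sketch deliberately avoids calling any database API. In practice each yielded batch would presumably be passed to a single insert call on the vector table instead of inserting rows one at a time.

```python
import math
from itertools import islice
from typing import Iterable, Iterator, Sequence


def batched(rows: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield successive lists of at most `batch_size` rows.

    Hypothetical helper: each batch is meant to be handed to one
    bulk-insert call rather than inserting row by row.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(rows)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance, defined here as 1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    rows = [{"id": i, "text": f"chunk {i}"} for i in range(5)]
    for batch in batched(rows, batch_size=2):
        print([row["id"] for row in batch])
    print(l2_distance([0.0, 0.0], [3.0, 4.0]))
    print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

The two metrics can rank neighbours differently: L2 is sensitive to vector length, while cosine distance ignores it, so the choice matters mainly when embeddings are not normalised.
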