https://github.com/j-sephb-lt-n/semantic-search-engine
Tool for searching for passages within a document
- Host: GitHub
- URL: https://github.com/j-sephb-lt-n/semantic-search-engine
- Owner: J-sephB-lt-n
- License: gpl-3.0
- Created: 2024-06-27T12:36:55.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-07-14T21:01:09.000Z (4 months ago)
- Last Synced: 2024-07-14T22:19:31.077Z (4 months ago)
- Language: Python
- Size: 185 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# semantic-search-engine
I'm abandoning this repo in order to pursue this one instead:
Tool for searching for passages within a document
```bash
pdftotext TheEffectiveExecutive.pdf input_docs/TheEffectiveExecutive.txt
python -m steps.chunk_input # input written to /chunked_input/
python -m observe.chunk_stats
python -m observe.view_random_chunks 0
python -m steps.create_lance_db
```
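For orientation, here is a minimal sketch of what a chunking step could look like. It is not the code in `steps.chunk_input`; the function name `chunk_text`, the word-based splitting, and the `chunk_size` and `overlap` parameters are all assumptions made for illustration.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to `chunk_size` words.

    Consecutive chunks share `overlap` words, so a passage that straddles
    a chunk boundary still appears whole in at least one chunk.
    NOTE: hypothetical helper, not the repo's actual chunking logic.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the window has reached the end of the text,
        # otherwise the tail would be emitted again as a smaller chunk.
        if start + chunk_size >= len(words):
            break
    return chunks
```

A larger `overlap` gives more redundancy (and more rows to embed); `overlap=0` gives disjoint chunks.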
Note about cached huggingface models: the following opens up a UI for deleting models no longer needed:
```bash
pip install "huggingface_hub[cli]"
huggingface-cli delete-cache
```
# TODO
- Investigate different chunking strategies
- Investigate ANN, indexing, distance metrics etc. in lancedb
- Implement batch data insert into semantic database
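The batch insert item could be approached roughly as follows. This is only a sketch: `batched` and `insert_in_batches` are hypothetical helpers, and `insert_fn` stands in for whatever call actually writes rows to the database.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def batched(records: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield successive lists of at most `batch_size` records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:  # input exhausted
            return
        yield batch


def insert_in_batches(
    records: Iterable[dict],
    insert_fn: Callable[[list[dict]], object],
    batch_size: int = 1000,
) -> int:
    """Call `insert_fn` once per batch and return the number of batches written.

    NOTE: `insert_fn` is a placeholder for the real write call (assumption).
    """
    n_batches = 0
    for batch in batched(records, batch_size):
        insert_fn(batch)
        n_batches += 1
    return n_batches
```

Assuming `table` is an open LanceDB table, `insert_in_batches(rows, table.add)` would write one batch per call instead of one row per call. `rows` can be a generator, so the full set of embedded chunks never has to sit in memory at once.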