https://github.com/m-k-l-s/discworld-hex
Hex clusters Discworld's stories.
- Host: GitHub
- URL: https://github.com/m-k-l-s/discworld-hex
- Owner: m-k-l-s
- Created: 2022-02-05T17:19:29.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-04-11T13:51:40.000Z (about 3 years ago)
- Last Synced: 2025-02-20T18:17:24.574Z (4 months ago)
- Topics: discworld, faiss, sentence-transformers, transformers
- Language: Python
- Homepage:
- Size: 52.7 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Discworld Hex
Hex clusters Discworld's stories.
Clustering and search tool applied to plots of Discworld novels.
Currently, given an input sentence, it will find the most similar parts of Discworld books based on their plot summaries from Wikipedia.

This is just a tiny proof-of-concept of using [FAISS](https://github.com/facebookresearch/faiss) with transformer language models that could be easily extended to cover much larger datasets.
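To illustrate the idea, here is a minimal pure-Python sketch of exact nearest-neighbour search by inner product, which is what an exhaustive ("flat") FAISS index computes. It is not this project's code: `embed` is a toy bag-of-words stand-in for a transformer sentence encoder, and `FlatIndex` and the sample passages are hypothetical names and data made up for this example.

```python
import math
from collections import Counter


def embed(text: str) -> dict:
    """Toy stand-in for a transformer encoder: an L2-normalised bag of words."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {word: c / norm for word, c in counts.items()}


def inner_product(a: dict, b: dict) -> float:
    """Inner product of two sparse vectors; equals cosine similarity here,
    because both vectors are already L2-normalised."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


class FlatIndex:
    """Hypothetical exhaustive index: compares the query against every vector."""

    def __init__(self):
        self.vectors = []
        self.passages = []

    def add(self, passages):
        for passage in passages:
            self.passages.append(passage)
            self.vectors.append(embed(passage))

    def search(self, query: str, k: int = 2):
        """Return the k (score, passage) pairs most similar to the query."""
        q = embed(query)
        scored = [
            (inner_product(q, vector), passage)
            for vector, passage in zip(self.vectors, self.passages)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


# Made-up passages for illustration only (not taken from the project's data).
passages = [
    "a wizard drops his staff in the library",
    "the city watch chases a dragon across the city",
    "death takes a holiday on a farm",
]

index = FlatIndex()
index.add(passages)
results = index.search("a dragon attacks the city", k=2)
for score, passage in results:
    print(f"{score:.3f}  {passage}")
```

With a real sentence encoder the vectors would be dense and capture meaning rather than word overlap, but the search step keeps the same shape: embed the query, score it against the stored vectors, and return the top `k`.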
## Setup
Should work out of the box with `bash` and a couple of prerequisites:
- [conda](https://docs.conda.io/en/latest/miniconda.html)
- [poetry](https://python-poetry.org/docs/#installation)

```bash
( cd conda && source bootstrap.sh )
conda activate discworld-hex
poetry install
```

## Usage
TL;DR (when `poetry` is installed and the `discworld-hex` conda env is activated):
```bash
build
search
```

To only fetch data and build and export the index:
```bash
build
# is just a shortcut for:
poetry run build
```

To use the index to search:
```bash
search
# is just a shortcut for:
poetry run search
```

To run any Python script in this project:
```bash
poetry run python src/discworld_hex/any_file.py
```

To run all checks:
```bash
poetry run pre-commit
```

## TODO
### Functionality
(What the user would notice.)
- [ ] Allow custom `wikipedia` queries on the input (and thus custom libraries)
- [ ] Fine-tune (e.g., standard (masked) language modelling) on the specific subdomains
- [ ] Aggregate search results per-book
- [ ] Allow merging libraries
- [ ] Better CLI: allow changing `k`, passing in multiple sentences, etc. Either:
  - [ ] [`click`](https://github.com/pallets/click)ify and [`rich`](https://github.com/Textualize/rich)ify the interface
  - [ ] Alternatively, just make it into an API
- [ ] Support [other (faster, less accurate) indexes](https://github.com/facebookresearch/faiss/wiki/Faster-search)

### Maintenance
(What the user shouldn't notice.)
- [ ] Less redundant library serialization
- More tests
  - [ ] Rebuilding Library and the FAISS index