https://github.com/m-k-l-s/discworld-hex

# Discworld Hex

Hex clusters Discworld's stories.

Clustering and search tool applied to plots of Discworld novels.
Currently, given an input sentence, it will find the most similar parts of Discworld books based on their plot summaries from Wikipedia.

This is a tiny proof of concept for combining [FAISS](https://github.com/facebookresearch/faiss) with transformer language models; the same approach could easily be extended to much larger datasets.
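
The search idea described above can be illustrated with a dependency-free sketch. This is not the project's code: the `embed` and `top_k` helpers are hypothetical, word counts stand in for transformer sentence embeddings, and a plain linear scan stands in for the FAISS index. It only shows the underlying principle, which is to normalise the vectors and rank passages by inner product (cosine similarity).

```python
import math
from collections import Counter


def embed(text: str) -> dict[str, float]:
    """Toy stand-in for a sentence embedding: L2-normalised word counts."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()} if norm else {}


def top_k(query: str, passages: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Return the k passages most similar to the query, best first."""
    q = embed(query)
    scored = []
    for passage in passages:
        p = embed(passage)
        # The inner product of two unit vectors is their cosine similarity.
        score = sum(weight * p.get(word, 0.0) for word, weight in q.items())
        scored.append((score, passage))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Made-up example passages, not actual plot summaries.
passages = [
    "a wizard travels across the disc with a tourist",
    "the city watch investigates a dragon",
    "death takes an apprentice",
]
results = top_k("a tourist and a wizard", passages, k=2)
for score, passage in results:
    print(f"{score:.3f}  {passage}")
```

In the real tool the embeddings would come from a transformer model and the ranking from a FAISS index, which avoids rescanning every passage in Python for each query.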

## Setup

Should work out of the box with `bash` and a couple of prerequisites:
- [conda](https://docs.conda.io/en/latest/miniconda.html)
- [poetry](https://python-poetry.org/docs/#installation)

```bash
( cd conda && source bootstrap.sh )
conda activate discworld-hex
poetry install
```

## Usage

TL;DR (when `poetry` is installed and the `discworld-hex` conda env is activated):

```bash
build
search
```

To only fetch data and build and export the index:

```bash
build
# is just a shortcut for:
poetry run build
```

To use the index to search:

```bash
search
# is just a shortcut for:
poetry run search
```

To run any python script in this project:

```bash
poetry run python src/discworld_hex/any_file.py
```

To run all checks:

```bash
poetry run pre-commit
```

## TODO

### Functionality

(What the user would notice.)

- [ ] Allow custom `wikipedia` queries on the input (and thus custom libraries)
- [ ] Fine-tune (e.g., standard (masked) language modelling) on the specific subdomains
- [ ] Aggregate search results per-book
- [ ] Allow merging libraries
- [ ] Better CLI: allow changing `k`, passing in multiple sentences, etc., either:
  - [ ] [`click`](https://github.com/pallets/click)ify and [`rich`](https://github.com/Textualize/rich)ify the interface
  - [ ] Alternatively, just make it into an API
- [ ] Support [other (faster, less accurate) indexes](https://github.com/facebookresearch/faiss/wiki/Faster-search)

### Maintenance

(What the user shouldn't notice.)

- [ ] Less redundant library serialization
- [ ] More tests
- [ ] Rebuilding Library and the FAISS index