https://github.com/m-k-l-s/discworld-hex

Hex clusters Discworld's stories.

Host: GitHub
URL: https://github.com/m-k-l-s/discworld-hex
Owner: m-k-l-s
Created: 2022-02-05T17:19:29.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2022-04-11T13:51:40.000Z (about 3 years ago)
Last Synced: 2025-02-20T18:17:24.574Z (4 months ago)
Topics: discworld, faiss, sentence-transformers, transformers
Language: Python
Homepage:
Size: 52.7 KB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Discworld Hex

Hex clusters Discworld's stories.

Clustering and search tool applied to plots of Discworld novels.

Currently, given an input sentence, it will find the most similar parts of Discworld books based on their plot summaries from Wikipedia.

This is just a tiny proof of concept of using [FAISS](https://github.com/facebookresearch/faiss) with transformer language models; it could easily be extended to cover much larger datasets.
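To illustrate the idea behind the search step, here is a minimal, stdlib-only sketch of exhaustive nearest-neighbour search by cosine similarity. It is not this project's code: the `cosine` and `top_k` helpers, the 3-dimensional vectors, and the passage labels are all made up for illustration. In the real tool the vectors would come from a transformer model, and FAISS would do the lookup far more efficiently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, passages, k=2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine(query, vec), text) for text, vec in passages]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Hypothetical toy "embeddings"; real ones have hundreds of dimensions.
passages = [
    ("passage about wizards", [0.9, 0.1, 0.0]),
    ("passage about the City Watch", [0.1, 0.9, 0.1]),
    ("passage about Death", [0.0, 0.2, 0.9]),
]

# A query vector that points mostly in the "wizards" direction.
for score, text in top_k([0.8, 0.2, 0.1], passages):
    print(f"{score:.3f}  {text}")
```

The brute-force loop compares the query against every passage, which is fine for a handful of plot summaries but is exactly the part a dedicated index is meant to speed up on larger datasets.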

## Setup

Should work out of the box with `bash` and a couple of prerequisites:

- [conda](https://docs.conda.io/en/latest/miniconda.html)

- [poetry](https://python-poetry.org/docs/#installation)

```bash
( cd conda && source bootstrap.sh )
conda activate discworld-hex
poetry install
```

## Usage

TL;DR (when `poetry` is installed and the `discworld-hex` conda env is activated):

```bash
build
search
```

To fetch the data, then build and export the index (without searching):

```bash
build
# is just a shortcut for:
poetry run build
```

To use the index to search:

```bash
search
# is just a shortcut for:
poetry run search
```

To run any python script in this project:

```bash
poetry run python src/discworld_hex/any_file.py
```

To run all checks:

```bash
poetry run pre-commit
```

## TODO

### Functionality

(What the user would notice.)

- [ ] Allow custom `wikipedia` queries on the input (and thus custom libraries)

- [ ] Fine-tune (e.g., standard (masked) language modelling) on the specific subdomains

- [ ] Aggregate search results per-book

- [ ] Allow merging libraries

- [ ] Better CLI: allow changing `k`, passing in multiple sentences, etc., either:

    - [ ] [`click`](https://github.com/pallets/click)ify and [`rich`](https://github.com/Textualize/rich)ify the interface

    - [ ] Alternatively, just make it into an API

- [ ] Support [other (faster, less accurate) indexes](https://github.com/facebookresearch/faiss/wiki/Faster-search)
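
As a rough idea of what the "aggregate search results per-book" item might look like, here is a hypothetical sketch that collapses passage-level hits into one score per book by keeping each book's best-scoring passage. The `aggregate_per_book` function, the `(book, score)` hit format, and the sample values are assumptions made for illustration, not something the project currently provides. Taking the maximum is only one possible choice; a mean or a sum would rank books differently.

```python
def aggregate_per_book(hits):
    """Collapse (book, score) hits into one entry per book, best score first."""
    best = {}
    for book, score in hits:
        # Keep the highest score seen for each book (scores may be negative).
        if book not in best or score > best[book]:
            best[book] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical passage-level hits: several passages can share a book.
hits = [("Book A", 0.71), ("Book B", 0.65), ("Book A", 0.80), ("Book C", 0.40)]

for book, score in aggregate_per_book(hits):
    print(f"{score:.2f}  {book}")
```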

### Maintenance

(What the user shouldn't notice.)

- [ ] Less redundant library serialization

- More tests

    - [ ] Rebuilding Library and the FAISS index
