Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/artefactory/redis-team-thm

The ultimate CLI tool to help researchers and technical writers
https://github.com/artefactory/redis-team-thm

arxiv cli redis sentence-transformers

Last synced: 4 days ago
JSON representation

The ultimate CLI tool to help researchers and technical writers

Host: GitHub
URL: https://github.com/artefactory/redis-team-thm
Owner: artefactory
License: bsd-3-clause
Created: 2022-10-15T10:18:06.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2022-12-15T05:16:57.000Z (almost 2 years ago)
Last Synced: 2023-08-19T12:05:05.225Z (about 1 year ago)
Topics: arxiv, cli, redis, sentence-transformers
Language: Python
Homepage: https://artefactory.github.io/redis-team-THM/
Size: 3.95 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

        # Team THM Submission



     


    


    Tom, Henrique, Michel, Corentin


    Oct. - Nov. 2022


    https://artefactory.github.io/redis-team-THM/


    https://thm-cli.community.saturnenterprise.io/api/docs





This demo showcases the _vector search similarity_ feature of Redis Enterprise.

RediSearch enables developers to add documents and their embeddings indexes to the database, turning Redis into a vector database that can be used for modern data web applications.



    



See [Architecture](#architecture) to see how it works, and [User Workflow](#user-workflow) to see how it can be used.

------------------------------

## Table of Contents

- [Documentation](#documentation)

- [History of Changes](#history-of-changes)

- [Machine Setup](#machine-setup)

- [Architecture](#architecture)

- [User Workflow](#user-workflow)

- [Running The Application](#running-the-application)

- [Benchmarks](#benchmarks)

------------------------------

## Documentation

- [Basic Demo](https://docsearch.redisventures.com) | [GitHub](https://github.com/RedisVentures/redis-arXiv-search)

- [Redis Vector Similarity Search](https://redis.io/docs/stack/search/reference/vectors)

- [Huggingface Tokenizers + Models](https://huggingface.co/sentence-transformers)

- [Cornell University - arXiv dataset](https://www.kaggle.com/Cornell-University/arxiv), `arxiv-metadata-oai-snapshot.json` file is used

- [`FastAPI`](https://fastapi.tiangolo.com/), [`pydantic`](https://pydantic-docs.helpmanual.io/), [`redis-om`](https://redis.io/docs/stack/get-started/tutorials/stack-python/)

- [`redis`](https://redis.io/docs/stack/) see Vector database and JSON storage

## History of Changes

- 1/11 - Added a multi-category classifier, a Question Answering engine and a CLI HTTP client to the backend

- 31/10 - Draft blog posts and CLI ETL tool

- 30/10 - Refactored `RedisVentures/redis-arXiv-search` project

- 27/10 - Setup Redis Cloud Enterprise and Saturn Cloud accounts and [organized](https://github.com/orgs/artefactory/projects/7) within the team

- 15/10 - Added a blog based on [Pelican](https://getpelican.com)

- 15/10 - Added CI/CD script

- 15/10 - Forked from [`RedisVentures/redis-arXiv-search`](https://github.com/RedisVentures/redis-arXiv-search)

## Machine Setup

```sh

brew install yarn redis

pip install -r backend/requirements.txt

pip install -r scripts/requirements.txt

```

## Architecture

The user will perform searches to the Redis database through a REST API HTTP Server.

We wrote a small interactive CLI client tool that performs calls to the HTTP Server and returns papers matching the user queries.

```txt

                        writes pickle and loads index

+-------------------+      +----------------+

|                   |      |                |

|  Redis            +<-----+  ETL CLI       |

|                   |      |                |

+--------+----------+      +----------------+

         ^

         |  reads search index

+--------+----------+

|                   |

|  FastAPI          |

|                   |

+--------+----------+

         ^

         |  calls backend

+--------+----------+      +---------------------+

|                   |      |                     |

|  THM CLI          +----->+  arxiv.org          |

|                   |      |  wolfram.alpha.com  |

+-------------------+      +---------------------+

    researcher uses the THM CLI while writing research

```

## User Workflow

This CLI tool is a quick assistant for a researcher daily activities and helps him improves his efficiency.

It can be used with his text editor and browser and helps him in the process of:

- building bibliography in Markdown or BibTeX formats,

- checking the PDF papers using arXiv website,

- checking scientific facts on Wolfram Alpha website.

```mermaid

  graph TD;

      welcome_message-->choose_activity;

      welcome_message-->configure_parameters;

      choose_activity-->search_keywords;

      search_keywords-->Search_API;

      choose_activity-->search_similar_to;

      search_similar_to-->Search_API;

      choose_activity-->fetch_paper_details;

      fetch_paper_details-->Search_API;

      choose_activity-->ask_open_question;

      ask_open_question-->HugginFacePipeline;

      choose_activity-->find_formula;

      find_formula-->Wolfram_Alpha_API;

```

## Running The Application

### Backend Application

Setup your Redis Enterprise Cloud then,

```sh

cd backend/

./start.sh

open http://0.0.0.0:8080/api/docs

```

[Deploy on Saturn Cloud](https://app.community.saturnenterprise.io/dash/resources?recipeUrl=https://github.com/artefactory/redis-team-THM/blob/main/backend/.saturn/thm-backend-deployment-recipe.json)

### Train model

```sh

cd scripts/

pip install -r requirements.txt

bash retrain_model.sh

```

### THM CLI

```sh

cd scripts/

pip install -r requirements.txt

./thm-cli.py

```

### ETL CLI

```sh

cd scripts/

pip install -r requirements.txt

./pipeline.sh

```

### Blog

```sh

# To preview files locally

pelican blog/content && pelican --listen

# To publish on GitHub pages

make publish_blog

```

### Machine Learning Models

The project uses the [`UKPLab/sentence-transformers`](https://github.com/UKPLab/sentence-transformers) library to compute dense vector representations for sentences found in [Cornell's arXiv corpus](https://www.kaggle.com/Cornell-University/arxiv).

We found the following models interesting NLP models from the [leaderboard](https://huggingface.co/spaces/mteb/leaderboard) that community built.

- `sentence-transformers/all-mpnet-base-v2` has embeddings of size 768 and relative good performance

- `sentence-transformers/all-MiniLM-L6-v2`

- `sentence-transformers/all-MiniLM-L12-v2` has embedding of size 384 and interesting for development as performing inference is faster

We also used [`transformers.AutoModelForSequenceClassification`](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelforsequenceclassification) for the problem of multi-category classification.

For the problem of Question Answering we used [`distilbert-base-cased-distilled-squad`](https://huggingface.co/distilbert-base-cased-distilled-squad).

```mermaid

  graph TD;

      sentence-transformers/all-MiniLM-L12-v2-->THM_API;

      transformers.AutoModelForSequenceClassification-->THM_API;

      THM_API-->THM_CLI;

      distilbert-base-cased-distilled-squad-->THM_CLI;

```

## Benchmarks

See on our blog for the [benchmarks we did](https://artefactory.github.io/redis-team-THM/cost-stack.html) to evaluate the full solution.

## Contributions

Changes and improvements are welcome! Feel free to fork and open a pull request into `main`.