# arxivchat

Welcome to arXivchat!

arXivchat is LLM-based software that lets you talk about papers published on arXiv in a conversational way.
It works as a CLI tool, an API provider, and a ChatGPT plugin.

Made by [Forward Operators](https://fwdoperators.com). We work with some of the smartest people on LLM and ML-related projects.

You are more than welcome to contribute!

## Dependencies
- python >=3.10
- poetry
- chromadb
- langchain
- arxiv

## Architecture

![diagram](./images/diagram.png)

## Setup
Follow these steps to quickly set up and run the arXiv plugin:

- Install Python 3.10, if not already installed.

- Clone the repository: `git clone https://github.com/Forward-Operators/arxivchat.git`

- Navigate to the cloned repository directory: `cd /path/to/arxivchat`

- Install poetry: `pip install poetry`

- Create a new virtual environment with Python 3.10: `poetry env use python3.10`

- Activate the virtual environment: `poetry shell`

- Install app dependencies: `poetry install`

Set the required environment variables:

```bash
export DATABASE=
export OPENAI_API_KEY=

# Add the environment variables for your chosen vector DB.

# Pinecone
export PINECONE_API_KEY=
export PINECONE_ENVIRONMENT=
export PINECONE_INDEX=

# Qdrant
export QDRANT_URL=
export QDRANT_PORT=
export QDRANT_GRPC_PORT=
export QDRANT_API_KEY=
export QDRANT_COLLECTION=

# Chroma
export CHROMA_HOST=
export CHROMA_PORT=
export CHROMA_COLLECTION=

# Embeddings
export EMBEDDINGS=
export CUDA_ENABLED=  # needed for Hugging Face embeddings

```
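
How these variables get consumed is up to the app; as a rough illustration only (the real wiring lives in `app/tools/factory.py` and may differ, and the exact client calls depend on your `chromadb`/`pinecone-client` versions), a vector-store factory keyed on `DATABASE` could look like this:

```python
# Illustrative sketch, not the repo's actual factory code.
import os

def get_vectorstore(embeddings):
    """Pick a vector store based on the DATABASE env var."""
    db = os.environ["DATABASE"].lower()
    if db == "chroma":
        import chromadb
        from langchain.vectorstores import Chroma
        client = chromadb.HttpClient(host=os.environ["CHROMA_HOST"],
                                     port=int(os.environ["CHROMA_PORT"]))
        return Chroma(client=client,
                      collection_name=os.environ["CHROMA_COLLECTION"],
                      embedding_function=embeddings)
    if db == "pinecone":
        import pinecone
        from langchain.vectorstores import Pinecone
        pinecone.init(api_key=os.environ["PINECONE_API_KEY"],
                      environment=os.environ["PINECONE_ENVIRONMENT"])
        return Pinecone.from_existing_index(os.environ["PINECONE_INDEX"],
                                            embeddings)
    raise ValueError(f"Unsupported DATABASE: {db}")
```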

Run the API locally: `cd app/; gunicorn --worker-class uvicorn.workers.UvicornWorker --config ./gunicorn_conf.py main:app`

Access the API documentation at http://0.0.0.0:8000/docs and test the API endpoints.

## Ingesting
arXiv hosts a dataset of almost 2 million publications. Fetching bulk data from the arXiv website itself is against arXiv's ToS (it creates load on their servers).
Fortunately, good people from [Kaggle](https://kaggle.com), together with Cornell University, have created a publicly available dataset that you can use.
The dataset is freely available via Google Cloud Storage buckets and is updated weekly.

The main issue is: how do you get only a subset of the dataset if you don't want to ingest over 5 terabytes of PDF files?
The dataset is divided into per-year, per-month (`YYMM`) directories, so if you'd like to get all publications from September 2021, you can just run:
`gsutil cp -r gs://arxiv-dataset/arxiv/pdf/2109/ ./local_directory`

If you'd like to get the entire dataset:
`gsutil cp -r gs://arxiv-dataset/arxiv/pdf/ ./a_local_directory/`

But if you want only a subset (for a given category and date range), take a look at the `download.py` file; a sketch of that kind of filtering follows below.
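
As a hypothetical illustration of the idea behind `download.py` (the category, metadata filename, and `v1` version suffix below are assumptions, not the repo's actual values), you could filter the Kaggle metadata dump and copy only the matching PDFs:

```python
# Hypothetical subset download, sketched around the Kaggle metadata dump;
# see download.py for the real logic.
import json
import subprocess

CATEGORY = "cs.CL"  # example category (assumption)
YYMM = "2109"       # September 2021, matching the bucket layout

with open("arxiv-metadata-oai-snapshot.json") as f:
    for line in f:
        paper = json.loads(line)
        if CATEGORY in paper["categories"] and paper["id"].startswith(YYMM):
            # PDFs live under gs://arxiv-dataset/arxiv/pdf/<YYMM>/;
            # the v1 version suffix is an assumption.
            subprocess.run(
                ["gsutil", "cp",
                 f"gs://arxiv-dataset/arxiv/pdf/{YYMM}/{paper['id']}v1.pdf",
                 "./local_directory/"],
                check=True,
            )
```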

By default, the ingester expects these files to be at `/mnt/dataset/arxiv/pdf`, with all the PDF files there.

Check out and run `python script.py` to ingest the data. You can also enable debugging there if something doesn't work. A rough outline of what such an ingestion pass does is sketched below.
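
For orientation only, a minimal ingestion pass with the stack from the dependency list (LangChain + Chroma) might look roughly like this; the chunk sizes and collection name are assumptions, `PyPDFLoader` needs `pypdf` installed, and the repo's own script is the source of truth:

```python
# Minimal ingestion sketch, not the repo's actual script.
import glob

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

PDF_DIR = "/mnt/dataset/arxiv/pdf"  # default location the ingester expects

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
embeddings = OpenAIEmbeddings()  # reads OPENAI_API_KEY from the environment

docs = []
for path in glob.glob(f"{PDF_DIR}/**/*.pdf", recursive=True):
    docs.extend(splitter.split_documents(PyPDFLoader(path).load()))

# Persist the chunks into a Chroma collection ("arxiv" is an assumed name).
Chroma.from_documents(docs, embeddings, collection_name="arxiv")
```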

_TODO: maybe change this to directory loader_
_TODO: implement celery deployment and use worker for ingestion_

## Query
`python cli.py`
![cli.py](./images/cli.png "arxivchat CLI")

Ask a question about the topic you've fed into the database before. It returns source information as well and runs continuously.
Another option is to use the REST API (run `uvicorn main:app --reload --host 0.0.0.0 --port 8000` from the `app` directory; a sketch of calling it follows) or to use it as a ChatGPT plugin (after deployment).
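
As a hedged example of calling the local API with `requests` — the route name and payload shape here are guesses, so check http://0.0.0.0:8000/docs for the real schema:

```python
# Hypothetical query call; endpoint path and JSON fields are assumptions.
import requests

resp = requests.post(
    "http://0.0.0.0:8000/query",  # hypothetical route
    json={"question": "What is attention in transformers?"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```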

## Deployment
There are Terraform files in the `deployment` directory. Use the one that suits you best; each has a README file with instructions.
You can also just build a Docker image and run it wherever you want. The image is quite big, though.

### GCP
For now it can be deployed to Cloud Run using the Docker image, so it's an API-only deployment. Data ingestion must run on a different machine (GPU-enabled Compute Engine instances are recommended, especially if you'd like to use Hugging Face embeddings, and because you can mount the dataset from Google Cloud Storage directly using `gcsfuse`).
There is a potential [solution](https://cloud.google.com/run/docs/tutorials/network-filesystems-fuse) for using a GCS bucket with Cloud Run.
### Azure
For now it can be deployed as a Container App (an API-only deployment; you need a separate deployment for the ingester).

### AWS
AWS is not supported yet. Coming soon.

## Embeddings

### OpenAI
arxivchat uses `text-embedding-ada-002` for OpenAI by default; you can change that in `app/tools/factory.py`.

### HuggingFace
For now you can use any model that works with [`sentence_transformers`](https://huggingface.co/sentence-transformers).
You can change the model in `app/tools/factory.py`; a sketch of what that switch might look like follows.
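
As an illustrative sketch only (the actual selection logic lives in `app/tools/factory.py`, and the model name below is just an example), switching between the two backends with LangChain could look like:

```python
# Sketch of choosing an embeddings backend from the EMBEDDINGS env var.
import os

from langchain.embeddings import HuggingFaceEmbeddings, OpenAIEmbeddings

def get_embeddings():
    if os.environ.get("EMBEDDINGS", "openai").lower() == "huggingface":
        # Any sentence-transformers model should work here
        # (requires the sentence-transformers package).
        return HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-MiniLM-L6-v2")
    return OpenAIEmbeddings(model="text-embedding-ada-002")
```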

## TODO
- [ ] Automount gcs arxiv bucket on deployment.
- [ ] Option to use Azure OpenAI.
- [ ] AWS deployment
- [ ] Add tests
- [ ] Automate ingesting new publications
- [ ] Add more vector store options
- [ ] Add more embeddings options
- [ ] Support streaming responses
- [ ] Take embeddings model name from .env

## Issues & contribution
If you have any problems, please use GitHub issues to report them.

## Contributing
We'd love your help in making arXivchat even better! To contribute, please follow these steps:

- Fork the repo
- Create a new branch
- Commit your changes
- Push the branch to your fork
- Create a new Pull Request

## License
arXivchat is released under the MIT License.