https://github.com/jbellis/cassgpt
- Host: GitHub
- URL: https://github.com/jbellis/cassgpt
- Owner: jbellis
- Created: 2023-05-08T14:22:32.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2023-06-11T17:19:50.000Z (over 2 years ago)
- Last Synced: 2025-02-16T14:27:06.706Z (11 months ago)
- Language: Python
- Size: 39.1 KB
- Stars: 6
- Watchers: 2
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Q&A with ChatGPT enriched with YouTube transcriptions
This program uses OpenAI's GPT-3 to generate answers to your questions based on transcriptions of presentations on AI from YouTube.
## Dependencies
- OpenAI API key, stored in file `openai.key`
- Cassandra database supporting vector search. Currently that means you need to build and run
this branch: https://github.com/datastax/cassandra/tree/vsearch
  - TLDR:
    - `git clone git@github.com:datastax/cassandra.git --branch vsearch`
    - `ant realclean`
    - `ant jar -Duse.jdk11=true`
    - `bin/cassandra -f`
- JDK 11. _Exactly_ 11.
- You will be able to run cqlsh with vector support if you run `bin/cqlsh` from the cassandra source root
- You can install the Python dependencies for cassgpt by running
`pip install -r requirements.txt` from this source tree.
## Usage
`python gen-qa-openai.py [--load_data]`
Specifying `--load_data` will download the dataset, merge the transcriptions into larger chunks, generate embeddings for each chunk using OpenAI's `text-embedding-ada-002` model, and insert the chunks and embeddings into the database. This takes around twenty minutes and costs about $5 as of May 2023.
This only needs to be done once.
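The "merge the transcriptions into larger chunks" step can be pictured with a minimal sketch like the one below. It is not the code in `gen-qa-openai.py`: the `merge_snippets` name, the `video_id` and `text` field names, and the `max_chars` limit are all assumptions made for illustration.

```python
from itertools import groupby
from typing import Dict, Iterable, List


def merge_snippets(
    snippets: Iterable[Dict[str, str]], max_chars: int = 1000
) -> List[Dict[str, str]]:
    """Merge consecutive snippets of the same video into chunks of roughly max_chars.

    Hypothetical helper: the field names "video_id" and "text" are assumed,
    and snippets are expected to arrive already ordered by video.
    """
    chunks: List[Dict[str, str]] = []
    for video_id, group in groupby(snippets, key=lambda s: s["video_id"]):
        buffer: List[str] = []
        size = 0
        for snippet in group:
            text = snippet["text"].strip()
            # Flush the current chunk if this snippet would push it past the limit.
            if buffer and size + len(text) + 1 > max_chars:
                chunks.append({"video_id": video_id, "text": " ".join(buffer)})
                buffer, size = [], 0
            buffer.append(text)
            size += len(text) + 1
        # Emit whatever is left for this video.
        if buffer:
            chunks.append({"video_id": video_id, "text": " ".join(buffer)})
    return chunks
```

Each resulting chunk's `text` is what would then be embedded and stored alongside its vector. Larger chunks give the model more surrounding context per hit, at the price of less precise matches.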
Once the dataset is loaded, the program will prompt you for a question; it will find the most
relevant context from the transcriptions using Cassandra vector search, and feed the resulting
context + question to OpenAI to generate an answer to your query.
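To make the retrieve-then-ask flow concrete, here is a rough sketch in plain Python. The brute-force cosine ranking only stands in for Cassandra's vector search so the example runs without a database; the function names and the prompt wording are assumptions, not what the program actually sends to OpenAI.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    rows: List[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[str]:
    """Return the k chunk texts whose embeddings are closest to query_vec.

    Stand-in for the database's vector search: it scans every row in memory.
    """
    ranked = sorted(
        rows, key=lambda row: cosine_similarity(query_vec, row[1]), reverse=True
    )
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, contexts: List[str]) -> str:
    """Combine the retrieved chunks and the question into one prompt string."""
    context_block = "\n\n---\n\n".join(contexts)
    return (
        "Answer the question based on the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

In the real program, the question is embedded with the same model as the chunks, the database returns the nearest chunks, and a prompt along these lines goes to the completion model.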
The program assumes Cassandra is running on localhost; hack the source if it's somewhere else.
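If you do hack the source, one possible change is to read the contact points from the environment. This is a sketch only: the `CASSANDRA_HOSTS` variable and both function names are made up for illustration, and it assumes the scripts connect through the DataStax Python driver (`cassandra-driver`).

```python
import os
from typing import List


def cassandra_hosts(default: str = "127.0.0.1") -> List[str]:
    """Read a comma-separated host list from CASSANDRA_HOSTS (hypothetical variable)."""
    raw = os.environ.get("CASSANDRA_HOSTS", default)
    hosts = [h.strip() for h in raw.split(",") if h.strip()]
    # Fall back to the default if the variable is set but empty.
    return hosts or [default]


def connect():
    """Open a session against the configured hosts."""
    # Imported lazily so cassandra_hosts() can be used without the driver installed.
    from cassandra.cluster import Cluster

    cluster = Cluster(contact_points=cassandra_hosts())
    return cluster.connect()
```

With this in place, something like `CASSANDRA_HOSTS=10.0.0.5 python gen-qa-openai.py` would point the program at a remote node, provided the script is changed to call `connect()`.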
## Need to start over?
Instead of rebuilding the embeddings from scratch (slow!), dump them from Cassandra and
re-load them into a fresh database.
`python dump.py`
`python load.py`
These scripts also assume Cassandra is running on localhost.
---