Chat with local Wikipedia embeddings 📚
- Host: GitHub
- URL: https://github.com/deadbits/wikipedia-chat
- Owner: deadbits
- Created: 2023-11-14T16:17:57.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2023-11-14T19:20:43.000Z (almost 2 years ago)
- Last Synced: 2025-03-26T15:42:59.713Z (7 months ago)
- Topics: chainlit, cohere, embeddings, llm, openai, retrieval-augmented-generation, wikipedia, wikipedia-dump
- Language: Python
- Homepage:
- Size: 64.5 KB
- Stars: 4
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# 📚 wikichat
## 🌟 Overview
`wikichat` ingests [Cohere's multilingual Wikipedia embeddings](https://txt.cohere.com/embedding-archives-wikipedia/) into a Chroma vector database and provides a Chainlit web interface for retrieval-augmented generation over the data using `gpt-4-1106-preview`.
I wanted to explore the idea of maintaining a local copy of Wikipedia, and this seemed like a good entry point. Down the road I might update this code to regularly pull the [full Wikipedia dump](https://dumps.wikimedia.org/) and create the embeddings, instead of relying on Cohere's prebuilt embeddings. I went this route as a proof of concept, and as an excuse to try out [Chainlit](https://docs.chainlit.io/get-started/overview).
Based on [Wikipedia_Semantic_Search_With_Cohere_Embeddings_Archives.ipynb](https://github.com/cohere-ai/notebooks/blob/main/notebooks/Wikipedia_Semantic_Search_With_Cohere_Embeddings_Archives.ipynb)
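For orientation, the retrieval flow looks roughly like the sketch below: embed the question with Cohere, query the Chroma collection, and hand the retrieved passages to `gpt-4-1106-preview`. The persistence path, collection name, and prompt are illustrative assumptions here, not the exact values used in this repo.
```python
# Rough sketch of the retrieve-then-generate flow (not the repo's exact code).
# Assumed names: Chroma path "chromadb", collection "wikipedia".
import os

import chromadb
import cohere
from openai import OpenAI

co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])
oai = OpenAI()  # reads OPENAI_API_KEY from the environment
chroma = chromadb.PersistentClient(path="chromadb")       # assumed path
collection = chroma.get_collection("wikipedia")           # assumed collection name


def answer(question: str) -> str:
    # Embed the question with the same multilingual model used for the archive
    query_emb = co.embed(texts=[question], model="multilingual-22-12").embeddings[0]

    # Pull the closest Wikipedia passages from Chroma
    results = collection.query(query_embeddings=[query_emb], n_results=5)
    context = "\n\n".join(results["documents"][0])

    # Ask gpt-4-1106-preview to answer grounded in the retrieved passages
    completion = oai.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {"role": "system", "content": "Answer using only the provided Wikipedia context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content
```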
## 🛠 Installation
1. **Clone the Repository:**
```bash
git clone https://github.com/deadbits/wikipedia-chat.git
cd wikipedia-chat
```

2. **Set up a Python virtual environment:**
```bash
python3 -m venv venv
source venv/bin/activate
```

3. **Install Dependencies:**
```bash
pip install -r requirements.txt
```

## 📖 Usage
**Set Cohere and OpenAI API keys**
```bash
export OPENAI_API_KEY="...."
export COHERE_API_KEY="..."
```

### Ingest Data
* **Dataset:** [Cohere/wikipedia-22-12-simple-embeddings](https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings)
* **Rows:** `485,859`
* **Size:** `1.63` GB

Run `ingest.py` to download the Wikipedia embeddings dataset and load it into ChromaDB:
```bash
python ingest.py
```

The script adds records in batches of 100, but this will still take some time. The batch size could probably be increased.
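As a rough illustration of what that ingest loop looks like, the sketch below streams the dataset and adds rows to Chroma 100 at a time; the collection name and column handling are assumptions, not a copy of `ingest.py`.
```python
# Illustrative batched ingest into Chroma (not the repo's exact ingest.py).
# Assumed: collection "wikipedia", embedding column "emb", batch size 100.
import chromadb
from datasets import load_dataset

BATCH_SIZE = 100

chroma = chromadb.PersistentClient(path="chromadb")  # assumed persistence path
collection = chroma.get_or_create_collection("wikipedia")

# Stream the dataset so the full 1.63 GB never has to sit in memory at once
dataset = load_dataset(
    "Cohere/wikipedia-22-12-simple-embeddings", split="train", streaming=True
)

ids, texts, embeddings, metadatas = [], [], [], []
for row in dataset:
    ids.append(str(row["id"]))
    texts.append(row["text"])
    embeddings.append(row["emb"])
    metadatas.append({"title": row["title"]})

    if len(ids) == BATCH_SIZE:
        collection.add(ids=ids, documents=texts, embeddings=embeddings, metadatas=metadatas)
        ids, texts, embeddings, metadatas = [], [], [], []

# Flush the final partial batch
if ids:
    collection.add(ids=ids, documents=texts, embeddings=embeddings, metadatas=metadatas)
```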
### Web Interface
To launch the web interface, run the `chainlit_ui.py` script with Chainlit:
```bash
chainlit run chainlit_ui.py
```

**Chainlit interface**
