
# 📚 wikichat

## 🌟 Overview

`wikichat` ingests [Cohere's multilingual Wikipedia embeddings](https://txt.cohere.com/embedding-archives-wikipedia/) into a Chroma vector database and provides a Chainlit web interface for retrieval-augmented generation (RAG) over the data using `gpt-4-1106-preview`.
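At query time, the app embeds the user's question with the same Cohere model used for the corpus, retrieves the nearest passages from Chroma, and asks `gpt-4-1106-preview` to answer from them. A minimal sketch of that flow, assuming a `wikipedia` collection persisted under `chroma/` (the collection name, persist path, and `ask` helper are illustrative, not the repo's actual code):

```python
import os

import chromadb
import cohere
from openai import OpenAI

co = cohere.Client(os.environ["COHERE_API_KEY"])
oai = OpenAI()  # reads OPENAI_API_KEY from the environment

client = chromadb.PersistentClient(path="chroma")    # assumed persist path
collection = client.get_collection("wikipedia")      # assumed collection name

def ask(question: str, n_results: int = 5) -> str:
    # Embed the question with the same multilingual model used for the corpus
    query_emb = co.embed(texts=[question], model="multilingual-22-12").embeddings[0]

    # Pull the nearest Wikipedia passages out of Chroma
    results = collection.query(query_embeddings=[query_emb], n_results=n_results)
    context = "\n\n".join(results["documents"][0])

    # Answer with gpt-4-1106-preview, grounded in the retrieved passages
    response = oai.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```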

I wanted to explore the idea of maintaining a local copy of Wikipedia, and this seemed like a good entry point. Down the road I might update this code to regularly pull the [full Wikipedia dump](https://dumps.wikimedia.org/) and generate the embeddings myself, instead of relying on Cohere's prebuilt set. I went this route as a proof of concept, and as an excuse to try out [Chainlit](https://docs.chainlit.io/get-started/overview).

Based on [Wikipedia_Semantic_Search_With_Cohere_Embeddings_Archives.ipynb](https://github.com/cohere-ai/notebooks/blob/main/notebooks/Wikipedia_Semantic_Search_With_Cohere_Embeddings_Archives.ipynb)

## 🛠 Installation

1. **Clone the Repository:**
```bash
git clone https://github.com/deadbits/wikipedia-chat.git
cd wikipedia-chat
```

2. **Setup Python virtual environment:**
```bash
python3 -m venv venv
source venv/bin/activate
```

3. **Install Dependencies:**
```bash
pip install -r requirements.txt
```

## 📖 Usage

**Set Cohere and OpenAI API keys**
```bash
export OPENAI_API_KEY="...."
export COHERE_API_KEY="..."
```

### Ingest Data

* **Dataset:** [Cohere/wikipedia-22-12-simple-embeddings](https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings)
* **Rows:** `485,859`
* **Size:** `1.63` GB

Run `ingest.py` to download the Wikipedia embeddings dataset and load it into ChromaDB:

```bash
python ingest.py
```

The script adds records in batches of 100, so ingesting all ~486K rows still takes a while; increasing the batch size would likely speed things up.
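For reference, the batched ingest might look roughly like this (a sketch, assuming the same `wikipedia` collection as above; the column names `id`, `text`, and `emb` come from the dataset schema):

```python
import chromadb
from datasets import load_dataset

client = chromadb.PersistentClient(path="chroma")          # assumed persist path
collection = client.get_or_create_collection("wikipedia")  # assumed collection name

# Stream the prebuilt Cohere embeddings from Hugging Face
dataset = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train")

BATCH_SIZE = 100
ids, docs, embs = [], [], []

for row in dataset:
    ids.append(str(row["id"]))
    docs.append(row["text"])
    embs.append(row["emb"])
    if len(ids) == BATCH_SIZE:
        collection.add(ids=ids, documents=docs, embeddings=embs)
        ids, docs, embs = [], [], []

# Flush the final partial batch
if ids:
    collection.add(ids=ids, documents=docs, embeddings=embs)
```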

### Web Interface

To start the web interface, run `chainlit_ui.py` with the Chainlit CLI:

```bash
chainlit run chainlit_ui.py
```
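Under the hood, a Chainlit app is just a Python module with message handlers. A minimal sketch of what `chainlit_ui.py` might contain (illustrative only; `ask` is the hypothetical retrieval helper from the overview sketch, imported here from a hypothetical `rag` module):

```python
import chainlit as cl

from rag import ask  # hypothetical module exposing the ask() helper sketched above

@cl.on_message
async def on_message(message: cl.Message):
    # Run retrieval-augmented generation on the user's question and reply
    answer = ask(message.content)
    await cl.Message(content=answer).send()
```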

**Chainlit interface**

![Chainlit UI](data/chainlit.png)