https://github.com/eea/observia-chatbot-dataset
An evaluation dataset for RAG projects for Observia
- Host: GitHub
- URL: https://github.com/eea/observia-chatbot-dataset
- Owner: eea
- Created: 2024-11-11T15:38:54.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2026-01-23T13:23:39.000Z (5 months ago)
- Last Synced: 2026-01-24T05:32:27.227Z (5 months ago)
- Language: Jupyter Notebook
- Size: 741 KB
- Stars: 4
- Watchers: 3
- Forks: 0
- Open Issues: 0
# Golden Dataset generation process
## Step 0: install the dependencies
```bash
poetry install --no-root
poetry shell
```
## Step 1: download the Vespa database
This step connects to the PostgreSQL database, retrieves the list of all document IDs, then downloads all chunks for each document ID.
The database connection information is hardcoded in the script and may need adjusting for your environment.
```bash
python step1-download-vespa-database.py
```
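The two-phase download described above (list every document ID, then fetch the chunks for each ID) can be sketched as follows. This is a minimal illustration, not the contents of `step1-download-vespa-database.py`: the names `download_all`, `list_document_ids`, and `fetch_chunks`, as well as the one-JSON-file-per-document layout, are assumptions. The two callables stand in for the real PostgreSQL and Vespa queries.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, List


def download_all(
    list_document_ids: Callable[[], Iterable[str]],
    fetch_chunks: Callable[[str], List[dict]],
    out_dir: Path,
) -> int:
    """Write one JSON file of chunks per document ID; return the document count.

    Both callables are hypothetical placeholders for the database queries.
    Document IDs are assumed to be safe to use as file names.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    # Phase 1: enumerate every document ID.
    for doc_id in list_document_ids():
        # Phase 2: fetch all chunks belonging to that document.
        chunks = fetch_chunks(doc_id)
        target = out_dir / f"{doc_id}.json"
        target.write_text(
            json.dumps(chunks, ensure_ascii=False, indent=2), encoding="utf-8"
        )
        count += 1
    return count
```

Passing the queries in as functions keeps the loop independent of the hardcoded connection details, so it can be tried with stub data before pointing it at a real database.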
## Step 2: generate topic-based questions
```bash
python step2.0-topic-generation.py data-download/GS_CEMS/ datasets/GS_CEMS-topics.json 200
```
## Step 2.1: keep only English questions (optional)
```bash
python step2.1-filter-for-english.py datasets/GS_CEMS-topics.json datasets/GS_CEMS-topics-en.json
```
## Step 2.2: extract the questions to a new text file
```bash
python step2.1-extract-primary-questions.py datasets/GS_CEMS-topics-en.json datasets/GS_CEMS-questions.txt
```
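As a rough illustration of this step, the sketch below reads a topics file and writes one question per line. The JSON layout is not documented here, so the assumed structure (a list of objects with a `question` field) and the function name `extract_questions` are hypothetical. Adapt them to the actual output of step 2.0.

```python
import json
from pathlib import Path


def extract_questions(
    topics_path: Path, questions_path: Path, key: str = "question"
) -> int:
    """Copy the `key` field of each topic entry to a text file, one per line.

    Assumes the topics file is a JSON list of objects (hypothetical schema).
    Returns the number of questions written.
    """
    topics = json.loads(topics_path.read_text(encoding="utf-8"))
    questions = []
    for entry in topics:
        text = str(entry.get(key, "")).strip()
        if text:
            # Collapse internal whitespace so each question stays on one line.
            questions.append(" ".join(text.split()))
    questions_path.write_text(
        "".join(q + "\n" for q in questions), encoding="utf-8"
    )
    return len(questions)
```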
## Step 3: generate GoldenSet dataset
```bash
python step3.0-generate-danswer-dataset.py datasets/GS_CEMS-questions.txt datasets/GS_CEMS-goldenset.json
```
## Step 3.5: load multiple goldenset-style datasets into a local TruLens
```bash
python step3.5-dataset-to-virtual-trulens.py dataset.json datasets/GS_CEMS-goldenset.json
```
## Step 4: convert GoldenSet to Excel
```bash
python step4.0-dataset2xls.py datasets/GS_CEMS-goldenset.json datasets/GS_CEMS-goldenset.xls
```
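The dataset-specific steps above can be chained for a given dataset name. The sketch below only builds the command lines shown in this README, with `GS_CEMS` replaced by a parameter. The helper names `pipeline_commands` and `run_pipeline` are assumptions, as is the idea that other dataset names follow the same path pattern. Steps 1 and 3.5 are left out because their arguments do not follow that pattern.

```python
import subprocess
import sys
from typing import List


def pipeline_commands(name: str, last_arg: int = 200) -> List[List[str]]:
    """Return the script invocations for steps 2.0 to 4.0 for one dataset.

    `last_arg` mirrors the trailing 200 passed to step 2.0 in the README;
    its exact meaning is not documented there.
    """
    base = f"datasets/{name}"
    return [
        ["step2.0-topic-generation.py", f"data-download/{name}/",
         f"{base}-topics.json", str(last_arg)],
        ["step2.1-filter-for-english.py", f"{base}-topics.json",
         f"{base}-topics-en.json"],
        ["step2.1-extract-primary-questions.py", f"{base}-topics-en.json",
         f"{base}-questions.txt"],
        ["step3.0-generate-danswer-dataset.py", f"{base}-questions.txt",
         f"{base}-goldenset.json"],
        ["step4.0-dataset2xls.py", f"{base}-goldenset.json",
         f"{base}-goldenset.xls"],
    ]


def run_pipeline(name: str) -> None:
    """Run each step in order, stopping at the first failure."""
    for args in pipeline_commands(name):
        # sys.executable keeps the steps inside the active Poetry environment.
        subprocess.run([sys.executable, *args], check=True)
```

Calling `run_pipeline("GS_CEMS")` from the repository root would execute the same commands as the steps above, in the same order.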