https://github.com/rasdani/germanrag
GermanRAG - a German dataset for finetuning Retrieval Augmented Generation
- Host: GitHub
- URL: https://github.com/rasdani/germanrag
- Owner: rasdani
- License: apache-2.0
- Created: 2024-01-20T20:03:36.000Z
- Default Branch: main
- Last Pushed: 2024-02-04T17:43:09.000Z
- Last Synced: 2024-04-24T02:47:36.069Z
- Language: Python
- Size: 19.5 KB
- Stars: 6
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# GermanRAG 🇩🇪
This is the code used for creating [GermanRAG](https://huggingface.co/datasets/DiscoResearch/germanrag), a German dataset for finetuning LLMs on Retrieval Augmented Generation tasks (RAG).
## How to use
- Install the requirements with `pip install -r requirements.txt`.
- Generate `germandpr_subset.jsonl` with `python germandpr.py`
- Clone [Airoboros](https://github.com/jondurbin/airoboros), run `pip install -e .` there, and copy `germandpr_subset.jsonl` as well as `config_germanrag.yaml` into its root directory.
- Copy `airoboros/instructors/germanrag.py` and `airoboros/instructors/prompts/germanrag.txt` from this repo to the respective directories in Airoboros.
- Add `from airoboros.instructors.germanrag import generate as germanrag_generator` [here](https://github.com/jondurbin/airoboros/blob/29aa8c7864371b0603c33f5ad303887670f62b78/airoboros/self_instruct.py#L1000).
- Add `"germanrag": germanrag_generator` [here](https://github.com/jondurbin/airoboros/blob/29aa8c7864371b0603c33f5ad303887670f62b78/airoboros/self_instruct.py#L1024).
- Run `airoboros generate-instructions --config-path config_germanrag.yaml`
- Copy your generated `instructions.jsonl` back into this repo's root directory.
- Optional: Validate generations with `python validate_generations.py`.
- Run `python germanrag.py` to generate the final dataset.

## Room for improvement
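One open question listed in this section is how to collapse contexts that appear several times in GermanDPR. The following is a minimal sketch of the shortest/longest/first/random options, not the code used in this repo. The function name `collapse_contexts` and the record keys `"context"` and `"answer"` are assumptions made for illustration; the actual GermanDPR field names may differ.

```python
import random
from collections import defaultdict


def collapse_contexts(records, strategy="shortest", seed=0):
    """Keep exactly one record per distinct context.

    `records` is a list of dicts with the hypothetical keys
    "context" and "answer" (both strings).
    """
    # Group all records that share the same context text.
    groups = defaultdict(list)
    for record in records:
        groups[record["context"]].append(record)

    rng = random.Random(seed)  # seeded so the "random" strategy is reproducible
    pickers = {
        "shortest": lambda group: min(group, key=lambda r: len(r["answer"])),
        "longest": lambda group: max(group, key=lambda r: len(r["answer"])),
        "first": lambda group: group[0],
        "random": lambda group: rng.choice(group),
    }
    pick = pickers[strategy]  # raises KeyError for an unknown strategy

    # Dicts preserve insertion order, so the output follows the order in
    # which each context first appeared in `records`.
    return [pick(group) for group in groups.values()]
```

Calling `collapse_contexts(records, strategy="longest")` would keep, for each context, the record whose answer span has the most characters; ties go to the record that appears first.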
- Choose how to deduplicate/collapse the contexts in GermanDPR, i.e. on shortest, longest, first/random answer span.
- Fix the function for the three-sentence context window.
- Experimental/Optional: Finish chopping and mixing of contexts on chunk level.
- Add (true) negatives beyond hard negatives, by pairing with random/dissimilar contexts.
- Generalize to more datasets in SQuAD format.

## Acknowledgments
- The GermanRAG dataset is derived from [GermanDPR](https://www.deepset.ai/germanquad), see 'Acknowledgments' in the [dataset card](https://huggingface.co/datasets/DiscoResearch/germanrag#acknowledgements).
- Airoboros by [Jon Durbin](https://github.com/jondurbin), consider giving a [tip](https://github.com/jondurbin/airoboros/tree/29aa8c7864371b0603c33f5ad303887670f62b78?tab=readme-ov-file#support-the-work) ;)

## Collaborate
Feel free to open issues/PRs and come join us in our [Discord](https://discord.gg/A575uNaEfu)!

Check out our models at [DiscoResearch](https://huggingface.co/DiscoResearch) 🪩🧪.