https://github.com/smartcat-labs/srbedding
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/smartcat-labs/srbedding
- Owner: smartcat-labs
- Created: 2024-04-19T09:57:58.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-10-23T11:44:08.000Z (7 months ago)
- Last Synced: 2025-03-29T00:22:15.930Z (2 months ago)
- Language: Python
- Size: 67.2 MB
- Stars: 2
- Watchers: 4
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
README
# SRBedding
## Project setup
Before running the project, set up the environment:
`poetry shell`
`poetry update`

## Evaluation jupyter notebook
Inside the evaluation-pipetine folder, add a datasets folder and a results folder.
To load SQuAD-sr, add [squad-sr-lat.json](https://www.kaggle.com/datasets/aleksacvetanovic/squad-sr) to the datasets folder.
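The folder preparation described above can be sketched in Python. This is a minimal sketch, not part of the repository: the helper name `prepare_evaluation_dirs` is hypothetical, and it assumes that "results" refers to a folder sitting next to datasets.

```python
from pathlib import Path


def prepare_evaluation_dirs(root: Path) -> list[str]:
    """Create the datasets and results folders under `root`.

    Returns the names of required input files that are still missing.
    The helper and the results-folder layout are assumptions.
    """
    datasets = root / "datasets"
    results = root / "results"
    # exist_ok=True makes the call safe to repeat
    datasets.mkdir(parents=True, exist_ok=True)
    results.mkdir(parents=True, exist_ok=True)

    # squad-sr-lat.json has to be downloaded manually from Kaggle
    required = ["squad-sr-lat.json"]
    return [name for name in required if not (datasets / name).is_file()]
```

Calling `prepare_evaluation_dirs(Path("evaluation-pipetine"))` from the repository root returns an empty list once squad-sr-lat.json is in place.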
First, run make-evaluation-datasets.ipynb. This creates all the files needed.
Then run:
`cd evaluation-pipetine/`
`python evaluation-pipieline.py`

## Training dataset creation
Run the following commands to create the training dataset:
- `cd training_dataset`
- `python .\main_training.py`
- `python .\batch_loading.py`
The .parquet files will be saved in the datasets folder.

## Translating dataset
The folder translation_pipeline is used for translating [ms_marco](https://huggingface.co/datasets/microsoft/ms_marco) and [natural_questions](https://huggingface.co/datasets/google-research-datasets/natural_questions) from English to Serbian. Translated queries and contexts from these datasets will be used for evaluation.
Run the following commands:
- `cd translation_pipeline`
- `python .\sending_batch.py`
- `python .\processing_batch.py`

The folder translation_sts is used for translating one sentence pair from the [sts dataset](https://huggingface.co/datasets/mteb/stsbenchmark-sts) for the distillation evaluator.
Run the following commands:
- `cd translation_sts`
- `python .\sending_batch.py`
- `python .\processing_batch.py`
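
The commands above use Windows-style `.\` paths. As a platform-independent alternative, the two steps can be run in the listed order from Python. This is a minimal sketch: the helper `run_steps` is hypothetical and not part of the repository, and it assumes the scripts take no command-line arguments.

```python
import subprocess
import sys
from pathlib import Path


def run_steps(folder: Path, scripts: list[str]) -> None:
    """Run each script inside `folder`, in order, stopping at the first failure."""
    for script in scripts:
        # sys.executable reuses the interpreter of the active environment;
        # check=True raises CalledProcessError on a non-zero exit code
        subprocess.run([sys.executable, script], cwd=folder, check=True)
```

For example, `run_steps(Path("translation_sts"), ["sending_batch.py", "processing_batch.py"])` mirrors the commands listed for translation_sts, and the same call with `Path("translation_pipeline")` covers the other folder.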