https://github.com/undp-data/dsc-sdgi-corpus
Model benchmarks on SDGi Corpus, a multilingual dataset for text classification by Sustainable Development Goals.
dataset sustainable-development-goals text-classification
- Host: GitHub
- URL: https://github.com/undp-data/dsc-sdgi-corpus
- Owner: UNDP-Data
- License: agpl-3.0
- Created: 2024-04-20T17:41:59.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-04-25T13:10:55.000Z (about 2 years ago)
- Last Synced: 2025-01-16T04:43:35.045Z (over 1 year ago)
- Topics: dataset, sustainable-development-goals, text-classification
- Language: Python
- Homepage: https://huggingface.co/datasets/UNDP/sdgi-corpus
- Size: 27.3 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# dsc-sdgi-corpus

## Introduction
Model benchmarks on [SDGi Corpus](https://huggingface.co/datasets/UNDP/sdgi-corpus), a multilingual dataset for text
classification by Sustainable Development Goals.
## Getting Started
### Python Environment
The codebase has been developed and tested with Python `3.11`. To create a local Python environment, clone the repository
and run the following commands in the project directory:
```shell
python -m venv .venv/
source .venv/bin/activate
pip install -r requirements.txt
```
### Environment Variables
The following environment variables may need to be set in a `.env` file:
```shell
# Location for an MLflow database
MLFLOW_TRACKING_URI="sqlite:///mlruns.db"
# The variables below are only required for GPT experiments or OOD data
AZURE_OPENAI_API_KEY=""
AZURE_OPENAI_ENDPOINT=""
AZURE_OPENAI_EMBEDDING_MODEL=""
```
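Before launching a run, it can help to confirm that the variables above are actually present. The snippet below is a minimal sketch and not part of the repository: the helper name `missing_env_vars` and the split into "core" and "Azure" variables are assumptions made for illustration. It only inspects the current process environment and does not read the `.env` file itself.

```python
import os

# Variable names are taken from the `.env` example above. Grouping them into
# "core" and "Azure-only" sets is an assumption of this sketch.
CORE_VARS = ("MLFLOW_TRACKING_URI",)
AZURE_VARS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_EMBEDDING_MODEL",
)


def missing_env_vars(need_azure=False, environ=None):
    """Return the names of expected variables that are unset or empty."""
    env = os.environ if environ is None else environ
    required = CORE_VARS + (AZURE_VARS if need_azure else ())
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    # Pass need_azure=True when preparing OOD data or running GPT experiments.
    missing = missing_env_vars(need_azure=True)
    if missing:
        print("Missing or empty:", ", ".join(missing))
    else:
        print("All expected variables are set.")
```

An empty string counts as missing here, which matches the blank placeholders in the example file.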
## Running Experiments
To run out-of-domain (OOD) experiments, first prepare the OOD dataset using a function from `src`. This requires
access to Azure OpenAI and the environment variables mentioned above. To create and save the dataset, run:
```python
from src import prepare_ood_dataset
dataset = prepare_ood_dataset()
dataset.save_to_disk("data/sdg-meter")
```
To replicate the supervised results from the paper, run the shell script:
```shell
chmod 755 main.sh
./main.sh
```
If you prefer running individual experiments, you can use `main.py`:
```shell
python main.py --size s --language xx
```
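To run several individual experiments in a row, the `main.py` call above can be wrapped in a small loop. The following is a sketch under stated assumptions: `build_commands` is a hypothetical helper, not part of the repository, and the only values used are the `s` and `xx` shown above. Check `main.py` for the sizes and language codes it actually accepts.

```python
import itertools
import subprocess
import sys


def build_commands(sizes, languages):
    """Build one `main.py` command per (size, language) combination."""
    return [
        [sys.executable, "main.py", "--size", size, "--language", language]
        for size, language in itertools.product(sizes, languages)
    ]


if __name__ == "__main__":
    # Only the values shown in the README are used here; extend the lists
    # with whatever values main.py actually supports.
    for command in build_commands(sizes=["s"], languages=["xx"]):
        print("Running:", " ".join(command))
        subprocess.run(command, check=True)  # stop on the first failing run
```

Using `sys.executable` keeps the runs inside the active virtual environment.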
Results are saved to the local SQLite database specified in the environment variables. To view the results in MLflow, run:
```shell
mlflow ui --port 8080 --backend-store-uri sqlite:///mlruns.db
# open http://127.0.0.1:8080
```
## Contribute
If you have any questions or notice any issues, feel free to [open an issue](https://github.com/UNDP-Data/dsc-sdgi-corpus/issues).