https://github.com/klima7/pol-spider
Polish translation of spider dataset.
machine-learning polish spider text-to-sql text2sql translation
- Host: GitHub
- URL: https://github.com/klima7/pol-spider
- Owner: klima7
- Created: 2023-09-29T15:37:46.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-05-12T17:36:14.000Z (over 1 year ago)
- Last Synced: 2024-05-12T18:32:46.571Z (over 1 year ago)
- Topics: machine-learning, polish, spider, text-to-sql, text2sql, translation
- Language: Python
- Homepage:
- Size: 12.9 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Pol-Spider 🕷️
This repository provides Polish translations of the [Spider](https://yale-lily.github.io/spider), [CoSQL](https://yale-lily.github.io/cosql), [SParC](https://yale-lily.github.io/sparc), [Spider-DK](https://github.com/ygan/Spider-DK), and [Spider-Syn](https://github.com/ygan/Spider-Syn) datasets, along with code for several experiments.
📄 Associated master's thesis: [download link](https://github.com/klima7/Master-Thesis/releases/download/submit/master-thesis.pdf).
## Ready datasets
Polish translations are ready to download from [Hugging Face Datasets](https://huggingface.co/datasets/klima7/Pol-Spider/tree/main) 🤗
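For scripted downloads, the snippet below is a minimal sketch that fetches the whole dataset repository with the `huggingface_hub` package (an extra dependency, installable with `pip install huggingface_hub`). The function name `download_pol_spider` and the default target directory are illustrative choices, not part of this repository.

```python
from pathlib import Path

# Dataset repository linked above.
REPO_ID = "klima7/Pol-Spider"


def download_pol_spider(local_dir="pol-spider-data"):
    """Download every file of the dataset repository into local_dir.

    The target directory name is an arbitrary default for illustration.
    """
    # Imported lazily so this module loads even if huggingface_hub is missing.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",  # the repository is a dataset, not a model
        local_dir=local_dir,
    )
    return Path(path)
```

Calling `download_pol_spider()` requires network access and returns the local directory that holds the downloaded files.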
## Datasets synthesis
The `datasets` directory contains scripts for dataset synthesis.
### Setup environment
```bash
# clone repository
git clone https://github.com/klima7/pol-spider
cd pol-spider
# create environment (Python 3.10 assumed; use a version supported by requirements.txt)
conda create -n pol-spider python=3.10
conda activate pol-spider
pip install -r requirements.txt
# download spacy model
python -m spacy download xx_sent_ud_sm
```
Then download the original English databases from [here](https://huggingface.co/datasets/klima7/Pol-Spider/blob/main/_database.zip) and place them inside `datasets/components/database`.
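The download step can also be scripted. The sketch below uses only the Python standard library and assumes that the archive can be unpacked directly into `datasets/components/database`; the internal layout of `_database.zip` is not documented here, so check the result and move files if the archive contains an extra top-level folder. The function names are hypothetical helpers, and the URL is the direct-download (`resolve`) form of the link above.

```python
import zipfile
from pathlib import Path
from urllib.request import urlretrieve

# Direct-download form of the link above ("resolve" instead of "blob").
DATABASE_ZIP_URL = (
    "https://huggingface.co/datasets/klima7/Pol-Spider/resolve/main/_database.zip"
)
TARGET_DIR = Path("datasets/components/database")


def extract_archive(zip_path, target_dir):
    """Unpack zip_path into target_dir and return the top-level entry names."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
    return sorted(entry.name for entry in target_dir.iterdir())


def fetch_databases(target_dir=TARGET_DIR, zip_path="_database.zip"):
    """Download the archive (network access required) and unpack it."""
    urlretrieve(DATABASE_ZIP_URL, zip_path)
    return extract_archive(zip_path, target_dir)
```

Run `fetch_databases()` from the repository root so that the relative target path resolves correctly.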
### Example dataset synthesis
Synthesize a dataset named `pol-spider-en` based on samples from `spider`: translate the questions into Polish, apply the `context-curated` translation to schema names, and translate string literals in the SQL queries into Polish:
```bash
python datasets/scripts/synthesize.py spider pol-spider-en \
    --question-lang pl \
    --schema-translation context-curated \
    --query-lang pl \
    --with-db
```
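When several variants need to be synthesized, it can be convenient to build the command from Python. The sketch below is a thin, hypothetical wrapper around the invocation shown above; it uses only the arguments and flags that appear in that example and makes no assumption about which other option values `synthesize.py` accepts.

```python
import subprocess
import sys


def build_synthesize_command(source, name, question_lang,
                             schema_translation, query_lang, with_db=True):
    """Return the argument list for datasets/scripts/synthesize.py."""
    command = [
        sys.executable, "datasets/scripts/synthesize.py", source, name,
        "--question-lang", question_lang,
        "--schema-translation", schema_translation,
        "--query-lang", query_lang,
    ]
    if with_db:
        # Flag taken from the example above; it is passed without a value.
        command.append("--with-db")
    return command


def synthesize(**options):
    """Run the synthesis script; raises CalledProcessError on failure."""
    subprocess.run(build_synthesize_command(**options), check=True)
```

For instance, `synthesize(source="spider", name="pol-spider-en", question_lang="pl", schema_translation="context-curated", query_lang="pl")` reproduces the command above when run from the repository root.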
### Joining datasets
Create the `pol-spider` dataset by joining `pol-spider-en` and `pol-spider-pl`:
```bash
python datasets/scripts/join.py pol-spider pol-spider-en pol-spider-pl
```
## App
The `app` directory contains a Streamlit app that makes it easy to use the `C3SQL` and `RESDSQL` models.

### Starting app
To use the `RESDSQL` model, first download its weights from [Hugging Face](https://huggingface.co/klima7/Pol-Spider-App) 🤗 and place them inside `app/models`.
```bash
cd app
docker compose up --build
```
## Experiments
The `experiments` directory contains dockerized code for experiments with `RAT-SQL`, `BRIDGE`, `RESDSQL`, and `C3`.
## Evaluation
The `evaluation` directory contains code for calculating metrics.
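As an illustration of the kind of metric used on Spider-style benchmarks, the sketch below computes a simplified execution accuracy with `sqlite3`: a prediction counts as correct when it returns the same multiset of rows as the gold query. This is not the code from the `evaluation` directory; the function names are hypothetical, and the comparison deliberately ignores row order, so it does not check `ORDER BY` behavior.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_match(conn, predicted_sql, gold_sql):
    """True when both queries run and return the same rows (order ignored)."""
    gold = run_query(conn, gold_sql)
    predicted = run_query(conn, predicted_sql)
    return gold is not None and predicted == gold


def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs that match on execution."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)
```

A connection opened with `sqlite3.connect(path_to_database)` on one of the downloaded databases can be passed as `conn`.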