Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/google-research-datasets/QAmeleon

QAmeleon introduces synthetic multilingual QA data using PaLM, a 540B large language model. This dataset was generated by prompt tuning PaLM with only five examples per language. We use the synthetic data to finetune downstream QA models leading to improved accuracy in comparison to English-only and translation-based baselines.

Host: GitHub
URL: https://github.com/google-research-datasets/QAmeleon
Owner: google-research-datasets
Created: 2023-07-05T16:04:02.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2023-08-15T17:19:06.000Z (over 1 year ago)
Last Synced: 2024-08-01T13:26:59.753Z (7 months ago)
Size: 2.93 KB
Stars: 33
Watchers: 3
Forks: 5
Open Issues: 2
Metadata Files:
- Readme: README.md

README

QAmeleon introduces synthetic multilingual QA data in 8 languages, produced with PaLM-540B, a large language model. The dataset was generated by prompt tuning PaLM with only five examples per language. We use the synthetic data to fine-tune downstream QA models, leading to improved accuracy compared with English-only and translation-based baselines.

Data available at https://storage.googleapis.com/qameleon/qamelon_pt_accepted.csv 
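Once the file is downloaded, a per-language tally is a quick sanity check. The sketch below is only illustrative: the README does not document the CSV schema, so the header row and the column names `language`, `question`, and `answer` are assumptions, and the in-memory sample rows are made up. Inspect the real header and pass the actual column name.

```python
import csv
import io
from collections import Counter


def count_by_language(stream, language_column="language"):
    """Count rows per language code in a CSV text stream.

    Assumption: the CSV has a header row, and `language_column` names the
    column holding the language code. The default name is hypothetical.
    """
    reader = csv.DictReader(stream)
    if reader.fieldnames is None or language_column not in reader.fieldnames:
        raise KeyError(f"column {language_column!r} not found in {reader.fieldnames}")
    return Counter(row[language_column] for row in reader)


# Made-up rows standing in for the downloaded file; not real QAmeleon data.
sample = io.StringIO(
    "language,question,answer\n"
    "fi,q1,a1\n"
    "fi,q2,a2\n"
    "ko,q3,a3\n"
)
counts = count_by_language(sample)
```

For the real file, open it with `open(path, newline="", encoding="utf-8")` and pass the file object in place of `sample`.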

More details can be found in the [paper](https://arxiv.org/abs/2211.08264) which can be cited as follows:

```
@misc{agrawal2022qameleon,
      title={QAmeleon: Multilingual QA with Only 5 Examples},
      author={Priyanka Agrawal and Chris Alberti and Fantine Huot and Joshua Maynez and Ji Ma and Sebastian Ruder and Kuzman Ganchev and Dipanjan Das and Mirella Lapata},
      year={2022},
      eprint={2211.08264},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

This dataset contains a total of 47173 question-answer instances across 8 languages. The count per language is as follows.

| Language  | Count     |
|-----------|----------:|
| ar        | 6966      |
| bn        | 6084      |
| fi        | 5028      |
| id        | 6797      |
| ko        | 6471      |
| ru        | 5557      |
| sw        | 5597      |
| te        | 4673      |
| **Total** | **47173** |
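
As a small consistency check, the per-language counts in the table above can be summed and compared with the reported total. This is a minimal sketch; the variable names are illustrative and the numbers are copied from the table.

```python
# Per-language counts copied from the table above.
counts = {
    "ar": 6966,
    "bn": 6084,
    "fi": 5028,
    "id": 6797,
    "ko": 6471,
    "ru": 5557,
    "sw": 5597,
    "te": 4673,
}
reported_total = 47173

# The sum of the rows should match the reported total.
computed_total = sum(counts.values())
assert computed_total == reported_total

# Share of each language in the dataset, as a percentage.
shares = {lang: 100 * n / computed_total for lang, n in counts.items()}
```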

The QAmeleon dataset is released under the [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) license.