https://github.com/google-research-datasets/swim-ir
SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 languages, generated using PaLM 2 and summarize-then-ask prompting.
- Host: GitHub
- URL: https://github.com/google-research-datasets/swim-ir
- Owner: google-research-datasets
- Created: 2023-11-06T11:26:39.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2023-11-13T23:42:22.000Z (about 1 year ago)
- Last Synced: 2024-11-08T14:06:57.731Z (about 2 months ago)
- Topics: cross-lingual, datasets, deep-learning, information-retrieval, machine-learning, multilingual, natural-language-processing, neural-information-retrieval, nlp, training-data
- Homepage: https://arxiv.org/abs/2311.05800
- Size: 201 KB
- Stars: 44
- Watchers: 7
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
# SWIM-IR
Paper | Download | DataCard | Prompts
## Overview
SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training dataset consisting of 28 million query-passage pairs spanning 33 languages. Multilingual passages are sampled from Wikipedia and are paired with queries generated by PaLM-2 using a novel summarize-then-ask prompting (SAP) generation method.
Models trained on SWIM-IR achieve good performance on XOR-Retrieve (cross-lingual) and MIRACL (multilingual). SWIM-IR-based models achieved a new state of the art on XTREME-UP, a cross-lingual retrieval benchmark for under-represented and scarce-data languages.
## Announcements
- [Nov 2023] SWIM-IR v1.0 currently covers a portion (10 of the 18) of the MIRACL languages in the SWIM-IR monolingual set. The cross-lingual SWIM-IR data contains synthetic training pairs for all of the languages in XOR-Retrieve and XTREME-UP.
## Dataset Generation
!["Figure illustrating how the SWIM-IR dataset was created"](SWIM-IR-Diagram-Updated.drawio.png "SWIM-IR dataset creation.")
**Figure 1:** SWIM-IR dataset generation process. Sampled Wikipedia passages are provided to an LLM (PaLM-2) using the novel summarize-then-ask prompting (SAP) method.

The SWIM-IR dataset is generated by first sampling passages from Wikipedia. The passages are then provided to PaLM-2 along with a prompt that asks the model to summarize the passage. The model is then prompted to ask a question that can be answered by the passage. The end-to-end process is illustrated in Figure 1. Summarize-then-ask prompting (SAP) aids the model in generating good information seeking queries for each specific input passage.
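
The two-step flow described above can be sketched roughly as follows. This is an illustration only, not the generation code behind SWIM-IR: the function name `summarize_then_ask`, the `llm` callable, and the prompt wording are all hypothetical. The prompts actually given to PaLM-2 are the ones linked in the Prompts section.

```python
from typing import Callable, Dict


def summarize_then_ask(passage: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Sketch of summarize-then-ask: get a summary first, then a question.

    `llm` is a hypothetical callable that maps a prompt string to generated text.
    """
    # Step 1: ask the model to summarize the passage.
    # The wording below is invented for illustration.
    summary_prompt = (
        f"Passage: {passage}\n"
        "Summarize the passage above in one or two sentences."
    )
    summary = llm(summary_prompt).strip()

    # Step 2: ask for a question the passage can answer,
    # with the summary included as extra context.
    question_prompt = (
        f"Passage: {passage}\n"
        f"Summary: {summary}\n"
        "Ask one question that the passage above answers."
    )
    query = llm(question_prompt).strip()

    return {"summary": summary, "query": query, "text": passage}
```

Any text-generation client can be wrapped as `llm`; a stub that returns canned strings is enough to exercise the control flow without calling a model.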
## Download
The SWIM-IR dataset can be downloaded using the links below:
* [SWIM-IR v1.0](http://storage.googleapis.com/gresearch/swim-ir/swim_ir_v1.tar.gz) (Nov 9, 2023)
### Data Format
SWIM-IR is partitioned into three sections (directories): `cross_lingual`, `cross_lingual_ext`, and `monolingual`. The `cross_lingual` section contains training data that can be used for evaluation on XOR-Retrieve, while the `cross_lingual_ext` section can be used for evaluations on XTREME-UP. The `monolingual` section can be used for MIRACL evaluation.
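
As a small sketch of this layout, the mapping from section to target benchmark can be written down and used to locate files after extracting the archive. The helper name `list_section_files` is hypothetical, and the `.jsonl` file extension and the possibility of nested subdirectories are assumptions, not something documented here.

```python
from pathlib import Path
from typing import List

# Section directory -> benchmark it is intended for, as described above.
SECTION_TO_BENCHMARK = {
    "cross_lingual": "XOR-Retrieve",
    "cross_lingual_ext": "XTREME-UP",
    "monolingual": "MIRACL",
}


def list_section_files(root: str, section: str) -> List[Path]:
    """Return the data files of one section under an extracted archive root."""
    if section not in SECTION_TO_BENCHMARK:
        raise ValueError(f"unknown section: {section!r}")
    # Assumption: data files end in ".jsonl" and may sit in subdirectories.
    return sorted(Path(root, section).rglob("*.jsonl"))
```

If the extracted files use a different extension, only the glob pattern needs to change.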
Each section contains language-specific JSONL files with the fields `_id`, `lang`, `code`, `query`, `title`, and `text`. Synthetic questions generated by PaLM-2 about the passage are stored in the `query` field. The `text` field contains a sampled passage from Wikipedia, while `title` is the title of the passage's article.
For the `monolingual` data, `lang` is the language of both the query and passage, with the corresponding language code stored in `code` (e.g., 'fr'). For the `cross_lingual` and `cross_lingual_ext` data, the queries are in English, while `lang` and `code` indicate the language and language code of the passage.
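
A minimal sketch of reading such records, assuming each input line holds one JSON object with the six fields listed above. The names `parse_records` and `EXPECTED_FIELDS` are illustrative and not part of any SWIM-IR tooling.

```python
import json
from typing import Dict, Iterable, Iterator

# Field names taken from the description above.
EXPECTED_FIELDS = ("_id", "lang", "code", "query", "title", "text")


def parse_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per non-empty JSONL line, checking the expected fields."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [name for name in EXPECTED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing fields {missing}")
        yield record
```

Because an open text file is itself an iterable of lines, a file handle opened with `encoding="utf-8"` can be passed directly to `parse_records`, which streams records without loading the whole file into memory.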
Below is a JSON example from SWIM-IR for a question in Chinese, "托马斯·爱迪生在哪里发明了留声机?" [Where did Thomas Edison invent the phonograph?].
```javascript
{'_id': '10770836',
'lang': 'Chinese',
'code': 'zh',
'query': '托马斯·爱迪生在哪里发明了留声机?',
'title': 'Menlo Park, New Jersey',
'text': 'Menlo Park is an unincorporated community located \
within Edison Township in Middlesex County, New Jersey, United \
States. In 1876, Thomas Edison set up his home and research \
laboratory in Menlo Park, which at the time was the site of an \
unsuccessful real estate development named after the town of \
Menlo Park, California. While there, he earned the nickname \
"the Wizard of Menlo Park". The Menlo Park lab was significant \
in that it was one of the first laboratories to pursue practical \
commercial applications of research. It was in his Menlo Park \
laboratory that Thomas Edison invented the phonograph and developed'
}
```
Note that within the SWIM-IR dataset, JSON examples are stored as JSONL with one JSON example per line. Multiple line JSON is used above to make the example more readable.
### Prompts
Prompts given to PaLM-2 to generate the three parts of our dataset are provided below:
* [Cross-Lingual Prompts (XOR Retrieve)](XOR-Retrieve-prompts.csv)
* [Multi-Monolingual Prompts (MIRACL)](MIRACL-prompts.csv)
* [Cross-Lingual Extended Prompts (XTREME-UP)](xtreme-up-prompts.csv)
## Paper
SWIM-IR is described in detail in the paper [Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval](https://arxiv.org/abs/2311.05800) by Nandan Thakur, Jianmo Ni, Gustavo Hernández Ábrego, John Wieting, Jimmy Lin and Daniel Cer. Please cite our paper in research work that uses or discusses SWIM-IR.
### BibTeX
```bibtex
@article{swim-ir-dataset,
author = {Nandan Thakur and
Jianmo Ni and
Gustavo Hern\'andez \'Abrego$^\lozenge$ and
John Wieting and
Jimmy Lin and
Daniel Cer},
title = {Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval},
journal = {CoRR},
volume = {abs/2311.05800},
year = {2023},
url = {https://arxiv.org/abs/2311.05800},
eprinttype = {arXiv},
primaryClass={cs.IR},
eprint = {2311.05800},
}
```
## Contact
Questions about the SWIM-IR dataset can be asked by creating an issue on this repository or by sending them to
[email protected]
## License
The SWIM-IR dataset is licensed under CC BY-SA 4.0