Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/unicamp-dl/mMARCO
A multilingual version of MS MARCO passage ranking dataset
https://github.com/unicamp-dl/mMARCO
Last synced: 3 months ago
JSON representation
A multilingual version of MS MARCO passage ranking dataset
- Host: GitHub
- URL: https://github.com/unicamp-dl/mMARCO
- Owner: unicamp-dl
- License: apache-2.0
- Created: 2021-08-22T12:49:22.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2023-10-19T12:31:52.000Z (about 1 year ago)
- Last Synced: 2024-06-17T17:59:50.524Z (5 months ago)
- Language: Python
- Size: 69.3 KB
- Stars: 135
- Watchers: 6
- Forks: 9
- Open Issues: 3
-
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.bib
Awesome Lists containing this project
- StarryDivineSky - unicamp-dl/mMARCO
README
# mMARCO [](https://arxiv.org/abs/2108.13897)
**mMARCO** is a multilingual version of the MS MARCO passage ranking dataset.
For more information, checkout our paper:
* [**mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset**](https://arxiv.org/abs/2108.13897)We translate MS MARCO passage ranking dataset, a large-scale IR dataset comprising more than half million anonymized questions that were sampled from Bing's search query logs. **mMARCO** includes 14 languages (including the original English version).
All files, including the translated triples, collection, queries (training and validation) and run files, are available in [:hugs: Datasets](https://huggingface.co/datasets/unicamp-dl/mmarco).
```python
>>> dataset = load_dataset('unicamp-dl/mmarco', 'english')
>>> dataset['train'][1]
{'query': 'what fruit is native to australia', 'positive': 'Passiflora herbertiana. A rare passion fruit native to Australia. (...)'}
```**The old/deprecated version (v1) of mMARCO is available at [README_old.md](README_old.md)**
## Released Model Checkpoints
Our available fine-tuned models are:| Model | Description | EN | PT |
| :--- | :--- | :---: | :---: |
|[ptT5-base-pt-msmarco](https://huggingface.co/unicamp-dl/ptt5-base-pt-msmarco-100k-v2)| a [PTT5](https://github.com/unicamp-dl/PTT5) model fine-tuned on Portuguese MS MARCO | 0.200 | 0.299 |
|[ptT5-base-en-pt-msmarco](https://huggingface.co/unicamp-dl/ptt5-base-en-pt-msmarco-100k-v2) | a PTT5 model fine-tuned on English and Portuguese MS MARCO| 0.354 | 0.301 |
|[mT5-base-en-msmarco](https://huggingface.co/unicamp-dl/mt5-base-en-msmarco) |a [mT5](https://github.com/google-research/multilingual-t5) model fine-tuned on English MS MARCO | 0.371| 0.293 |
|[mT5-base-en-pt-msmarco](https://huggingface.co/unicamp-dl/mt5-base-en-pt-msmarco-v2) |a mT5 model fine-tuned on both English and Portuguese MS MARCO | 0.374 | **0.306** |
|[mT5-base-multi-msmarco](https://huggingface.co/unicamp-dl/mt5-base-mmarco-v2) |a mT5 model fine-tuned on mMARCO |0.366 | 0.302|
|[mMiniLM-en-msmarco](https://huggingface.co/unicamp-dl/mMiniLM-L6-v2-en-msmarco) |a [mMiniLM](https://github.com/microsoft/unilm/tree/master/minilm) model fine-tuned on English MS MARCO | **0.382** | 0.277 |
|[mMiniLM-en-pt-msmarco](https://huggingface.co/unicamp-dl/mMiniLM-L6-v2-en-pt-msmarco-v2) |a mMiniLM model fine-tuned on both English and Portuguese MS MARCO | 0.374 | 0.299|
|[mMiniLM-multi-msmarco](https://huggingface.co/unicamp-dl/mMiniLM-L6-v2-mmarco-v2) |a mMiniLM model fine-tuned on mMARCO | 0.366| 0.277|EN and PT columns refer to MRR@10 on the dev set of English and Portuguse MS MARCO, respectively.
## How To Translate
In order to allow other users to translate the MS MARCO passage ranking dataset to other languages (or a dataset of your own will), we provide the ```translate.py``` script. This script expects a .tsv file, in which each line follows a ```document_id \t document_text``` format.
```
python translate.py --model_name_or_path Helsinki-NLP/opus-mt-{src}-{tgt} --target_language tgt_code--input_file collection.tsv --output_dir translated_data/
```
After translating, it is necessary to reassemble the file, as the documents were split into sentences.
```
python create_translated_collection.py --input_file translated_data/translated_file --output_file translated_{tgt}_collection
```
Translating the entire passages collection of MS MARCO took about 80 hours using a Tesla V100.# BM25 Baseline for Portuguese
The steps reported here are the same used for any language from mMARCO.## Data Prep
Using [pygaggle](https://github.com/castorini/pygaggle) scripts, we convert the mMARCO Portuguese collection into JSON files:
```
python pygaggle/tools/scripts/msmarco/convert_collection_to_jsonl.py \
--collection-path path/to/portuguese_collection.tsv \
--output-folder collections/portuguese-msmarco-passage/collection_jsonl
```
## Indexing using [Pyserini](https://github.com/castorini/pyserini)
Now we can index the Portuguese collection using Pyserini:
```
python -m pyserini.index -collection JsonCollection \
-generator DefaultLuceneDocumentGenerator \
-threads 1 -input collections/portuguese-msmarco-passage/collection_jsonl/ \
-index indexes/portuguese-lucene-index-msmarco \
-storePositions -storeDocvectors -storeRaw -language pt
```
As the original English set, the built index should have 8,841,823 documents.## Retrieval
Using a pygaggle script, we select only the queries that are in the qrels file:
```
python pygaggle/tools/scripts/msmarco/filter_queries.py \
--qrels path/to/qrels.dev.small.tsv \
--queries path/to/portuguese_queries.dev.tsv \
--output collections/portuguese-msmarco-passage/portuguese_queries.dev.small.tsv
```
This script results a file with 6980 queries. Now we can retrieve from our index:
```
python -m pyserini.search --topics collections/portuguese-msmarco-passage/portuguese_queries.dev.small.tsv \
--index indexes/portuguese-lucene-index-msmarco \
--language portuguese \
--output runs/run.portuguese-msmarco-passage.dev.small.tsv \
--bm25 --output-format msmarco --hits 1000 --k1 0.82 --b 0.68
```
## Evaluation
Using the official MS MARCO evaluation script:
```
python pygaggle/tools/scripts/msmarco/msmarco_passage_eval.py \
path/to/qrels.dev.small.tsv runs/run.portuguese-msmarco-passage.dev.small.tsv
```
The output should be like:
```
#####################
MRR @10: 0.152
QueriesRanked: 6980
#####################
```## Re-ranking with mT5
Finally, we can re-rank our BM25 initial run using [mT5-base-multi-msmarco](https://huggingface.co/unicamp-dl/mt5-base-multi-msmarco) (or each one of the previous listed models):
```
python reranker.py --model_name_or_path=unicamp-dl/mt5-base-en-pt-msmarco-v2 \
--initial_run runs/run.portuguese-msmarco-passage.dev.small.tsv \
--corpus path/to/portuguese_collection.tsv \
--queries portuguese_queries.dev.small.tsv \
--output_run runs/run.mt5-reranked-portuguese-msmarco-passage.dev.small.tsv
```
Using the official MS MARCO evaluation script to evaluate the re-ranked results:
```
python pygaggle/tools/scripts/msmarco/msmarco_passage_eval.py \
path/to/qrels.dev.small.tsv runs/run.mt5-reranked-portuguese-msmarco-passage.dev.small.tsv
```
The output should be like:
```
#####################
MRR @10: 0.306
QueriesRanked: 6980
#####################
```## Training mMiniLM
An example of mMiniLM-based models training is provided in `train_minilm.py` script.```
python train_minilm.py --output_dir ./mminilm-pt --language portuguese
```
# How to CiteIf you extend or use this work, please cite the [paper][paper] where it was
introduced:```
@misc{bonifacio2021mmarco,
title={mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset},
author={Luiz Henrique Bonifacio and Vitor Jeronymo and Hugo Queiroz Abonizio and Israel Campiotti and Marzieh Fadaee and and Roberto Lotufo and Rodrigo Nogueira},
year={2021},
eprint={2108.13897},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```[paper]: https://arxiv.org/abs/2108.13897