{"id":13678213,"url":"https://github.com/unicamp-dl/mMARCO","last_synced_at":"2025-04-29T12:34:09.724Z","repository":{"id":44569816,"uuid":"398798410","full_name":"unicamp-dl/mMARCO","owner":"unicamp-dl","description":"A multilingual version of MS MARCO passage ranking dataset","archived":false,"fork":false,"pushed_at":"2023-10-19T12:31:52.000Z","size":71,"stargazers_count":140,"open_issues_count":3,"forks_count":9,"subscribers_count":6,"default_branch":"main","last_synced_at":"2024-08-02T13:21:28.195Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/unicamp-dl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.bib","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2021-08-22T12:49:22.000Z","updated_at":"2024-07-29T09:11:39.000Z","dependencies_parsed_at":"2024-01-14T15:21:49.052Z","dependency_job_id":"0ee3d0d2-b0d5-43bc-b50b-731a3a70b50d","html_url":"https://github.com/unicamp-dl/mMARCO","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/unicamp-dl%2FmMARCO","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/unicamp-dl%2FmMARCO/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/unicamp-dl%2FmMARCO/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/unicamp-dl%2FmMARCO/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/unicamp-dl","download_url":"https://codeload.github.com/unicamp-dl/mMARCO/t
ar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224173663,"owners_count":17268148,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T13:00:51.163Z","updated_at":"2024-11-11T20:31:12.622Z","avatar_url":"https://github.com/unicamp-dl.png","language":"Python","funding_links":[],"categories":["Python","NLP语料和数据集"],"sub_categories":["大语言对话模型及数据"],"readme":"# mMARCO [\u003cimg src=\"https://img.shields.io/badge/arXiv-2108.13897-b31b1b.svg\"\u003e](https://arxiv.org/abs/2108.13897)\n**mMARCO** is a multilingual version of the MS MARCO passage ranking dataset.\nFor more information, check out our paper:\n  * [**mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset**](https://arxiv.org/abs/2108.13897)\n\u003c!---\nThis repository presents a neural machine translation-based method for translating the MS MARCO passage ranking dataset.\nThe code available here is the same used in our paper [**mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset**](https://arxiv.org/abs/2108.13897).\n--\u003e\n\nWe translate the MS MARCO passage ranking dataset, a large-scale IR dataset comprising more than half a million anonymized questions that were sampled from Bing's search query logs. 
**mMARCO** covers 14 languages, including the original English version.\n\nAll files, including the translated triples, collection, queries (training and validation) and run files, are available in [:hugs: Datasets](https://huggingface.co/datasets/unicamp-dl/mmarco).\n\n```python\n\u003e\u003e\u003e from datasets import load_dataset\n\u003e\u003e\u003e dataset = load_dataset('unicamp-dl/mmarco', 'english')\n\u003e\u003e\u003e dataset['train'][1]\n{'query': 'what fruit is native to australia', 'positive': 'Passiflora herbertiana. A rare passion fruit native to Australia. (...)'}\n```\n\n**The old/deprecated version (v1) of mMARCO is available at [README_old.md](README_old.md)**\n\n## Released Model Checkpoints\nOur available fine-tuned models are:\n\n\n| Model | Description | EN | PT |\n| :--- | :--- | :---: | :---: |\n|[ptT5-base-pt-msmarco](https://huggingface.co/unicamp-dl/ptt5-base-pt-msmarco-100k-v2)| a [PTT5](https://github.com/unicamp-dl/PTT5) model fine-tuned on Portuguese MS MARCO | 0.200 | 0.299 |\n|[ptT5-base-en-pt-msmarco](https://huggingface.co/unicamp-dl/ptt5-base-en-pt-msmarco-100k-v2) | a PTT5 model fine-tuned on English and Portuguese MS MARCO| 0.354 | 0.301 |\n|[mT5-base-en-msmarco](https://huggingface.co/unicamp-dl/mt5-base-en-msmarco) |an [mT5](https://github.com/google-research/multilingual-t5) model fine-tuned on English MS MARCO | 0.371| 0.293 |\n|[mT5-base-en-pt-msmarco](https://huggingface.co/unicamp-dl/mt5-base-en-pt-msmarco-v2) |an mT5 model fine-tuned on both English and Portuguese MS MARCO | 0.374 | **0.306** |\n|[mT5-base-multi-msmarco](https://huggingface.co/unicamp-dl/mt5-base-mmarco-v2) |an mT5 model fine-tuned on mMARCO |0.366 | 0.302|\n|[mMiniLM-en-msmarco](https://huggingface.co/unicamp-dl/mMiniLM-L6-v2-en-msmarco) |an [mMiniLM](https://github.com/microsoft/unilm/tree/master/minilm) model fine-tuned on English MS MARCO | **0.382** | 0.277 |\n|[mMiniLM-en-pt-msmarco](https://huggingface.co/unicamp-dl/mMiniLM-L6-v2-en-pt-msmarco-v2) |an mMiniLM model fine-tuned on both English and 
Portuguese MS MARCO | 0.374 | 0.299|\n|[mMiniLM-multi-msmarco](https://huggingface.co/unicamp-dl/mMiniLM-L6-v2-mmarco-v2) |an mMiniLM model fine-tuned on mMARCO | 0.366| 0.277|\n\nThe EN and PT columns report MRR@10 on the dev sets of English and Portuguese MS MARCO, respectively.\n\n## How To Translate\nTo let other users translate the MS MARCO passage ranking dataset into other languages (or translate a dataset of their own), we provide the ```translate.py``` script. This script expects a .tsv file in which each line follows the ```document_id \\t document_text``` format.\n```\npython translate.py --model_name_or_path Helsinki-NLP/opus-mt-{src}-{tgt} --target_language tgt_code --input_file collection.tsv --output_dir translated_data/\n```\nAfter translating, the file must be reassembled, because the documents were split into sentences.\n```\npython create_translated_collection.py --input_file translated_data/translated_file --output_file translated_{tgt}_collection\n```\nTranslating the entire MS MARCO passage collection took about 80 hours on a Tesla V100.\n\n# BM25 Baseline for Portuguese\nThe steps reported here are the same for any language in mMARCO. 
\n\n## Data Prep\n\nUsing [pygaggle](https://github.com/castorini/pygaggle) scripts, we convert the mMARCO Portuguese collection into JSONL files:\n```\npython pygaggle/tools/scripts/msmarco/convert_collection_to_jsonl.py \\\n    --collection-path path/to/portuguese_collection.tsv \\\n    --output-folder collections/portuguese-msmarco-passage/collection_jsonl\n```\n## Indexing using [Pyserini](https://github.com/castorini/pyserini)\nNow we can index the Portuguese collection using Pyserini:\n```\npython -m pyserini.index -collection JsonCollection \\\n    -generator DefaultLuceneDocumentGenerator \\\n    -threads 1 -input collections/portuguese-msmarco-passage/collection_jsonl/ \\\n    -index indexes/portuguese-lucene-index-msmarco \\\n    -storePositions -storeDocvectors -storeRaw -language pt\n```\nAs with the original English set, the built index should contain 8,841,823 documents.\n\n## Retrieval\nUsing a pygaggle script, we select only the queries that are in the qrels file:\n```\npython pygaggle/tools/scripts/msmarco/filter_queries.py \\\n    --qrels path/to/qrels.dev.small.tsv \\\n    --queries path/to/portuguese_queries.dev.tsv \\\n    --output collections/portuguese-msmarco-passage/portuguese_queries.dev.small.tsv\n```\nThis script produces a file with 6980 queries. 
Now we can retrieve from our index:\n\n```\npython -m pyserini.search --topics collections/portuguese-msmarco-passage/portuguese_queries.dev.small.tsv \\\n    --index indexes/portuguese-lucene-index-msmarco \\\n    --language portuguese \\\n    --output runs/run.portuguese-msmarco-passage.dev.small.tsv \\\n    --bm25 --output-format msmarco --hits 1000 --k1 0.82 --b 0.68\n```\n## Evaluation\nWe evaluate the run using the official MS MARCO evaluation script:\n```\npython pygaggle/tools/scripts/msmarco/msmarco_passage_eval.py \\\n    path/to/qrels.dev.small.tsv runs/run.portuguese-msmarco-passage.dev.small.tsv\n```\nThe output should look like:\n```\n#####################\nMRR @10: 0.152\nQueriesRanked: 6980\n#####################\n```\n\n## Re-ranking with mT5\nFinally, we can re-rank our initial BM25 run using [mT5-base-en-pt-msmarco](https://huggingface.co/unicamp-dl/mt5-base-en-pt-msmarco-v2) (or any of the previously listed models):\n```\npython reranker.py --model_name_or_path=unicamp-dl/mt5-base-en-pt-msmarco-v2 \\\n    --initial_run runs/run.portuguese-msmarco-passage.dev.small.tsv \\\n    --corpus path/to/portuguese_collection.tsv \\\n    --queries portuguese_queries.dev.small.tsv \\\n    --output_run runs/run.mt5-reranked-portuguese-msmarco-passage.dev.small.tsv\n```\nWe then evaluate the re-ranked results using the official MS MARCO evaluation script:\n```\npython pygaggle/tools/scripts/msmarco/msmarco_passage_eval.py \\\n    path/to/qrels.dev.small.tsv runs/run.mt5-reranked-portuguese-msmarco-passage.dev.small.tsv\n```\nThe output should look like:\n```\n#####################\nMRR @10: 0.306\nQueriesRanked: 6980\n#####################\n```\n\n## Training mMiniLM\nAn example of training mMiniLM-based models is provided in the `train_minilm.py` script.\n\n```\npython train_minilm.py --output_dir ./mminilm-pt --language portuguese\n```\n\n# How to Cite\n\nIf you extend or use this work, please cite the [paper][paper] where it 
was\nintroduced:\n\n```\n@misc{bonifacio2021mmarco,\n      title={mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset},\n      author={Luiz Henrique Bonifacio and Vitor Jeronymo and Hugo Queiroz Abonizio and Israel Campiotti and Marzieh Fadaee and Roberto Lotufo and Rodrigo Nogueira},\n      year={2021},\n      eprint={2108.13897},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```\n\n[paper]: https://arxiv.org/abs/2108.13897\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funicamp-dl%2FmMARCO","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Funicamp-dl%2FmMARCO","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funicamp-dl%2FmMARCO/lists"}