{"id":13486477,"url":"https://github.com/google-research/xtreme","last_synced_at":"2025-04-04T07:07:51.455Z","repository":{"id":39373410,"uuid":"253934093","full_name":"google-research/xtreme","owner":"google-research","description":"XTREME is a benchmark for the evaluation of the cross-lingual generalization ability of pre-trained multilingual models that covers 40 typologically diverse languages and includes nine tasks.","archived":false,"fork":false,"pushed_at":"2023-01-04T11:37:25.000Z","size":452,"stargazers_count":642,"open_issues_count":29,"forks_count":110,"subscribers_count":20,"default_branch":"master","last_synced_at":"2025-03-28T06:08:19.298Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://sites.research.google/xtreme","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/google-research.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-04-07T23:17:25.000Z","updated_at":"2025-03-23T08:11:26.000Z","dependencies_parsed_at":"2023-02-02T10:46:52.561Z","dependency_job_id":null,"html_url":"https://github.com/google-research/xtreme","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Fxtreme","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Fxtreme/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Fxtreme/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Fxtreme/manifests","owner_url":"ht
tps://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/google-research","download_url":"https://codeload.github.com/google-research/xtreme/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247135144,"owners_count":20889421,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T18:00:46.885Z","updated_at":"2025-04-04T07:07:51.420Z","avatar_url":"https://github.com/google-research.png","language":"Python","readme":"# XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization\n\n[**Tasks**](#tasks-and-languages) | [**Download**](#download-the-data) |\n[**Baselines**](#build-a-baseline-system) |\n[**Leaderboard**](#leaderboard-submission) |\n[**Website**](https://sites.research.google/xtreme) |\n[**Paper**](https://arxiv.org/pdf/2003.11080.pdf) |\n[**Translations**](https://console.cloud.google.com/storage/browser/xtreme_translations)\n\nThis repository contains information about XTREME, code for downloading data, and\nimplementations of baseline systems for the benchmark.\n\n# Introduction\n\nThe Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark evaluates the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages (spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of syntax and semantics. 
The languages in XTREME are selected to maximize language diversity, coverage in existing tasks, and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil (spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the Niger-Congo languages Swahili and Yoruba, spoken in Africa.\n\nFor a full description of the benchmark, see [the paper](https://arxiv.org/abs/2003.11080).\n\n# Tasks and Languages\n\nThe tasks included in XTREME cover a range of standard paradigms in natural language processing, including sentence classification, structured prediction, sentence retrieval, and question answering. The full list of tasks can be seen in the image below.\n\n![The datasets used in XTREME](xtreme_score.png)\n\nTo be successful on the XTREME benchmark, models must learn representations that generalize across many tasks and languages. Each of the tasks covers a subset of the 40 languages included in XTREME (shown here with their ISO 639-1 codes): af, ar, bg, bn, de, el, en, es, et, eu, fa, fi, fr, he, hi, hu, id, it, ja, jv, ka, kk, ko, ml, mr, ms, my, nl, pt, ru, sw, ta, te, th, tl, tr, ur, vi, yo, and zh. The languages were selected from among the top 100 languages with the [most Wikipedia articles](https://meta.wikimedia.org/wiki/List_of_Wikipedias) to maximize language diversity, task coverage, and availability of training data. They include members of the Afro-Asiatic, Austro-Asiatic, Austronesian, Dravidian, Indo-European, Japonic, Kartvelian, Kra-Dai, Niger-Congo, Sino-Tibetan, Turkic, and Uralic language families as well as two isolates, Basque and Korean.\n\n# Download the data\n\nTo run experiments on XTREME, the first step is to install the dependencies. We assume you have installed [`anaconda`](https://www.anaconda.com/) and use Python 3.7+. 
The additional requirements, including `transformers`, `seqeval` (for sequence labelling evaluation), `tensorboardx`, `jieba`, `kytea`, and `pythainlp` (for text segmentation in Chinese, Japanese, and Thai), and `sacremoses`, can be installed by running the following script:\n```\nbash install_tools.sh\n```\n\nThe next step is to download the data. To this end, first create a `download` folder with ```mkdir -p download``` in the root of this project. You then need to manually download `panx_dataset` (for NER) from [here](https://www.amazon.com/clouddrive/share/d3KGCRCIYwhKJF0H3eWA26hjg2ZCRhjpEQtDL70FSBN) (note that it will download as `AmazonPhotos.zip`) to the `download` directory. Finally, run the following command to download the remaining datasets:\n```\nbash scripts/download_data.sh\n```\n\nNote that in order to prevent accidental evaluation on the test sets while running experiments,\nwe remove the labels of the test data during pre-processing and change the order of the test sentences\nfor cross-lingual sentence retrieval.\n\n# Build a baseline system\n\nThe evaluation setting in XTREME is zero-shot cross-lingual transfer from English. We fine-tune models that were pre-trained on multilingual data on the labelled data of each XTREME task in English. Each fine-tuned model is then applied to the test data of the same task in other languages to obtain predictions.\n\nFor every task, we provide a single script `scripts/train.sh` that fine-tunes pre-trained models implemented in the [Transformers](https://github.com/huggingface/transformers) repo. To fine-tune a different model, simply pass the corresponding `MODEL` argument to the script. The currently supported models are `bert-base-multilingual-cased`, `xlm-mlm-100-1280`, and `xlm-roberta-large`.\n\n## Universal dependencies part-of-speech tagging\n\nFor part-of-speech tagging, we use data from Universal Dependencies v2.5. 
You can fine-tune a pre-trained multilingual model on the English POS tagging data with the following command:\n```\nbash scripts/train.sh [MODEL] udpos\n```\n\n## Wikiann named entity recognition\n\nFor named entity recognition (NER), we use data from the Wikiann (panx) dataset. You can fine-tune a pre-trained multilingual model on the English NER data with the following command:\n```\nbash scripts/train.sh [MODEL] panx\n```\n\n## PAWS-X sentence classification\n\nFor sentence classification, we use the Cross-lingual Paraphrase Adversaries from Word Scrambling (PAWS-X) dataset. You can fine-tune a pre-trained multilingual model on the English PAWS data with the following command:\n```\nbash scripts/train.sh [MODEL] pawsx\n```\n\n## XNLI sentence classification\n\nThe second sentence classification dataset is the Cross-lingual Natural Language Inference (XNLI) dataset. You can fine-tune a pre-trained multilingual model on the English MNLI data with the following command:\n```\nbash scripts/train.sh [MODEL] xnli\n```\n\n## XQuAD, MLQA, TyDiQA-GoldP question answering\n\nFor question answering, we use the data from the XQuAD, MLQA, and TyDiQA-Gold Passage datasets.\nFor XQuAD and MLQA, the model should be trained on the English SQuAD training set. For TyDiQA-Gold Passage, the model is trained on the English TyDiQA-GoldP training set. Using the following command, you can first fine-tune a pre-trained multilingual model on the corresponding English training data and then obtain predictions on the test data of all tasks.\n```\nbash scripts/train.sh [MODEL] [xquad,mlqa,tydiqa]\n```\n\n## BUCC sentence retrieval\n\nFor cross-lingual sentence retrieval, we use the data from the Building and Using Parallel Corpora (BUCC) shared task. 
As the models are not trained for this task but the representations of the pre-trained models are directly used to obtain similarity judgements, you can directly apply the model to obtain predictions on the test data of the task:\n```\nbash scripts/train.sh [MODEL] bucc2018\n```\n\n## Tatoeba sentence retrieval\n\nThe second cross-lingual sentence retrieval dataset we use is the Tatoeba dataset. Similarly to BUCC, you can directly apply the model to obtain predictions on the test data of the task:\n```\nbash scripts/train.sh [MODEL] tatoeba\n```\n\n# Leaderboard Submission\n\n## Submissions\nTo submit your predictions to [**XTREME**](https://sites.research.google/xtreme), please create a single folder that contains 9 sub-folders named after all the tasks, i.e., `udpos`, `panx`, `xnli`, `pawsx`, `xquad`, `mlqa`, `tydiqa`, `bucc2018`, `tatoeba`. Inside each sub-folder, create a file containing the predicted labels of the test set for all languages. Name the file using the format `test-{language}.{extension}` where `language` indicates the 2-character language code, and `extension` is `json` for QA tasks and `tsv` for other tasks. You can see an example of the folder structure in `mock_test_data/predictions`.\n\n## Evaluation\nWe will compare your submissions with our label files using the following command:\n```\npython evaluate.py --prediction_folder [path] --label_folder [path]\n```\n\n# Translations\n\nAs part of training the translate-train and translate-test baselines, we have automatically translated the\nEnglish training sets to other languages and the test sets to English. 
Translations are available for\nthe following datasets: SQuAD v1.1 (only train and dev), MLQA, PAWS-X, TyDiQA-GoldP, XNLI, and XQuAD.\n\nFor PAWS-X and XNLI, the translations are in the following format:\nColumn 1 and Column 2: original sentence pairs\nColumn 3 and Column 4: translated sentence pairs\nColumn 5: label\n\nThis format makes it easy to associate the original data with their translations.\n\nFor XNLI and XQuAD, we have furthermore created pseudo test sets by automatically translating the English test set to the remaining\nlanguages in XTREME so that test data for all 40 languages is available. Note that\nthese translations are noisy and should not be treated as ground truth.\n\nAll translations are available [here](https://console.cloud.google.com/storage/browser/xtreme_translations).\n\n# Paper\n\nIf you use our benchmark or the code in this repo, please cite our paper `\cite{hu2020xtreme}`.\n```\n@article{hu2020xtreme,\n      author    = {Junjie Hu and Sebastian Ruder and Aditya Siddhant and Graham Neubig and Orhan Firat and Melvin Johnson},\n      title     = {XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization},\n      journal   = {CoRR},\n      volume    = {abs/2003.11080},\n      year      = {2020},\n      archivePrefix = {arXiv},\n      eprint    = {2003.11080}\n}\n```\nPlease consider including a note similar to the one below to make sure to cite all the individual datasets in your paper.\n\nWe experiment on the XTREME benchmark `\cite{hu2020xtreme}`, a composite benchmark for multilingual learning consisting of data from the XNLI `\cite{Conneau2018xnli}`, PAWS-X `\cite{Yang2019paws-x}`, UD-POS `\cite{nivre2018universal}`, Wikiann NER `\cite{Pan2017}`, XQuAD `\cite{artetxe2020cross}`, MLQA `\cite{Lewis2020mlqa}`, TyDiQA-GoldP `\cite{Clark2020tydiqa}`, BUCC 2018 `\cite{zweigenbaum2018overview}`, and Tatoeba `\cite{Artetxe2019massively}` tasks. 
We provide their BibTeX information as follows.\n```\n@inproceedings{Conneau2018xnli,\n    title = \"{XNLI}: Evaluating Cross-lingual Sentence Representations\",\n    author = \"Conneau, Alexis  and\n      Rinott, Ruty  and\n      Lample, Guillaume  and\n      Williams, Adina  and\n      Bowman, Samuel  and\n      Schwenk, Holger  and\n      Stoyanov, Veselin\",\n    booktitle = \"Proceedings of EMNLP 2018\",\n    year = \"2018\",\n    pages = \"2475--2485\",\n}\n\n@inproceedings{Yang2019paws-x,\n    title = \"{PAWS-X}: A Cross-lingual Adversarial Dataset for Paraphrase Identification\",\n    author = \"Yang, Yinfei  and\n      Zhang, Yuan  and\n      Tar, Chris  and\n      Baldridge, Jason\",\n    booktitle = \"Proceedings of EMNLP 2019\",\n    year = \"2019\",\n    pages = \"3685--3690\",\n}\n\n@article{nivre2018universal,\n  title={Universal Dependencies 2.2},\n  author={Nivre, Joakim and Abrams, Mitchell and Agi{\'c}, {\v{Z}}eljko and Ahrenberg, Lars and Antonsen, Lene and Aranzabe, Maria Jesus and Arutie, Gashaw and Asahara, Masayuki and Ateyah, Luma and Attia, Mohammed and others},\n  year={2018}\n}\n\n@inproceedings{Pan2017,\nauthor = {Pan, Xiaoman and Zhang, Boliang and May, Jonathan and Nothman, Joel and Knight, Kevin and Ji, Heng},\nbooktitle = {Proceedings of ACL 2017},\npages = {1946--1958},\ntitle = {{Cross-lingual name tagging and linking for 282 languages}},\nyear = {2017}\n}\n\n@inproceedings{artetxe2020cross,\nauthor = {Artetxe, Mikel and Ruder, Sebastian and Yogatama, Dani},\nbooktitle = {Proceedings of ACL 2020},\ntitle = {{On the Cross-lingual Transferability of Monolingual Representations}},\nyear = {2020}\n}\n\n@inproceedings{Lewis2020mlqa,\nauthor = {Lewis, Patrick and Oğuz, Barlas and Rinott, Ruty and Riedel, Sebastian and Schwenk, Holger},\nbooktitle = {Proceedings of ACL 2020},\ntitle = {{MLQA: Evaluating Cross-lingual Extractive Question Answering}},\nyear = {2020}\n}\n\n@inproceedings{Clark2020tydiqa,\nauthor = {Jonathan H. 
Clark and Eunsol Choi and Michael Collins and Dan Garrette and Tom Kwiatkowski and Vitaly Nikolaev and Jennimaria Palomaki},\nbooktitle = {Transactions of the Association for Computational Linguistics},\ntitle = {{TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages}},\nyear = {2020}\n}\n\n@inproceedings{zweigenbaum2018overview,\n  title={Overview of the third BUCC shared task: Spotting parallel sentences in comparable corpora},\n  author={Zweigenbaum, Pierre and Sharoff, Serge and Rapp, Reinhard},\n  booktitle={Proceedings of the 11th Workshop on Building and Using Comparable Corpora},\n  pages={39--42},\n  year={2018}\n}\n\n@article{Artetxe2019massively,\nauthor = {Artetxe, Mikel and Schwenk, Holger},\njournal = {Transactions of the Association for Computational Linguistics},\ntitle = {{Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond}},\nyear = {2019}\n}\n```\n","funding_links":[],"categories":["Uncategorized","A01_文本生成_文本对话","Shell","**Datasets**","Urdu Datasets"],"sub_categories":["Uncategorized","大语言对话模型及数据","Benchmarks","Cross-lingual Datasets"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-research%2Fxtreme","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoogle-research%2Fxtreme","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-research%2Fxtreme/lists"}