{"id":18927565,"url":"https://github.com/gentaiscool/miners","last_synced_at":"2025-07-31T00:42:10.855Z","repository":{"id":243938618,"uuid":"808884603","full_name":"gentaiscool/miners","owner":"gentaiscool","description":"MINERS ⛏️: The semantic retrieval benchmark for evaluating multilingual language models. (EMNLP 2024 Findings)","archived":false,"fork":false,"pushed_at":"2024-10-03T06:34:29.000Z","size":7112,"stargazers_count":13,"open_issues_count":0,"forks_count":6,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-15T16:26:01.281Z","etag":null,"topics":["benchmark","classification","deep-learning","deep-learning-models","efficient","generation","language-model","large-language-models","llm","machine-learning","miner","miners","ml","multilingual","nlp","retrieval","semantic-retrieval","sentence-transformers","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gentaiscool.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-06-01T04:18:30.000Z","updated_at":"2025-01-12T19:12:06.000Z","dependencies_parsed_at":"2025-04-15T13:42:28.593Z","dependency_job_id":"08b25473-93be-43e4-9a9e-d8f17f234d2e","html_url":"https://github.com/gentaiscool/miners","commit_stats":null,"previous_names":["gentaiscool/miners"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/gentaiscool/miners","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gentaiscool%2Fminers","tags_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gentaiscool%2Fminers/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gentaiscool%2Fminers/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gentaiscool%2Fminers/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gentaiscool","download_url":"https://codeload.github.com/gentaiscool/miners/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gentaiscool%2Fminers/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267967725,"owners_count":24173566,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-30T02:00:09.044Z","response_time":70,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","classification","deep-learning","deep-learning-models","efficient","generation","language-model","large-language-models","llm","machine-learning","miner","miners","ml","multilingual","nlp","retrieval","semantic-retrieval","sentence-transformers","transformers"],"created_at":"2024-11-08T11:19:34.676Z","updated_at":"2025-07-31T00:42:10.798Z","avatar_url":"https://github.com/gentaiscool.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MINERS \u003cimg src=\"assets/pickaxe.png\" width=\"30px\"\u003e: Multilingual Language Models as Semantic 
Retrievers\n![Pull Requests Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat) [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n\n⚡ Introducing the **MINERS benchmark**, designed to assess how well multilingual LMs perform semantic retrieval tasks, including bitext mining and classification through retrieval-augmented contexts, **without fine-tuning**. The framework evaluates the effectiveness of language models in retrieving samples across more than **200 diverse languages**, including low-resource languages in challenging **cross-lingual (XS)** and **code-switching (CS)** settings. The results show that performance competitive with state-of-the-art methods is achievable solely by retrieving semantically similar embeddings, without any fine-tuning.\n\nThe paper has been accepted at EMNLP 2024 Findings.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/pipeline.png\" width=\"100%\"\u003e\n\u003c/p\u003e\n\n## Table of Contents\n\n- [Paper](#-paper)\n- [Benchmark](#-benchmark)\n- [Environment Setup](#-environment-setup)\n- [Experiment Logs](#-experiment-logs)\n- [Running Experiments](#-running-experiments)\n\t- [Bitext Retrieval](#bitext-retrieval)\n\t- [Retrieval-based Classification](#retrieval-based-classification)\n\t- [ICL Classification](#icl-classification)\n- [Aggregating Experiment Results](#-aggregating-experiment-results)\n- [Visualizing the Embeddings](#-visualizing-the-embeddings)\n- [Models Support](#-models-support)\n- [How to Contribute?](#-how-to-contribute)\n- [On Progress](#on-progress)\n\n## 📜 Paper \nThis is the source code for the paper [[arXiv]](https://arxiv.org/abs/2406.07424).\n\nThis code is written in PyTorch. 
If you use any code or datasets from this toolkit in your research, please cite the associated paper.\n\u003cpre\u003e\n@article{winata2024miners,\n  title={MINERS: Multilingual Language Models as Semantic Retrievers},\n  author={Winata, Genta Indra and Zhang, Ruochen and Adelani, David Ifeoluwa},\n  journal={arXiv preprint arXiv:2406.07424},\n  year={2024}\n}\n\u003c/pre\u003e\n\n## 📊 Benchmark\nMINERS comprises **11** datasets: **7** multilingual and **4** code-switching datasets, covering more than **200 languages** and encompassing both parallel and classification formats. Parallel datasets are suited for bitext retrieval as they contain aligned multilingual content, facilitating bitext mining and machine translation tasks. Additionally, the classification datasets cover intent classification, sentiment analysis, and topic classification, which we use for the retrieval-based and ICL classification tasks.\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/dataset.png\" width=\"70%\"\u003e\n\u003c/p\u003e\n\nOur benchmark evaluates LMs on three tasks: bitext retrieval, retrieval-based classification, and ICL classification. The settings include **monolingual (Mono)**, **cross-lingual (XS)**, **code-switching (CS)**, and **cross-lingual code-switching (XS CS)**.\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/res_bitext_classification.png\" width=\"42%\"\u003e\n  \u003cimg src=\"assets/res_icl_v2.png\" width=\"53%\"\u003e\n\u003c/p\u003e\n\n## ⚡ Environment Setup\n```\npip install -r requirements.txt\n```\nIf you wish to use the APIs or models from OpenAI, Cohere, or Hugging Face, modify the `OPENAI_TOKEN`, `COHERE_TOKEN`, and `HF_TOKEN`. 
Note that most models on Hugging Face do not require the `HF_TOKEN`, which is specifically intended for the Llama and Gemma models.\n\nIf you wish to use Llama 3.1, you need to upgrade the `transformers` version:\n```\npip install transformers==4.44.2\n```\n\n## 📝 Experiment Logs\nIf you wish to get all results and prompt examples from our experiments, feel free to download them [here](https://drive.google.com/file/d/1yG4VQDClLAhlyGZNxrnByZbOdU2kaAAR/view?usp=drive_link) (~360MB).\n\n## 🧪 Running Experiments\nAll experiment results will be stored in the `logs/` directory. You can execute each experiment using the following commands:\n\n### Bitext Retrieval\n#### Cross-lingual setting\n```\n❱❱❱ python bitext.py --src_lang {src_lang} --dataset {dataset} --seed {seed} --cuda --model_checkpoint {model_checkpoint}\n❱❱❱ python bitext.py --src_lang de --dataset bucc --seed 42 --cuda --model_checkpoint sentence-transformers/LaBSE\n```\n\n#### Ensemble\nThe arguments are similar to those above, except we use `--model_checkpoints` and `--weights`:\n```\n❱❱❱ python bitext.py --src_lang {src_lang} --dataset {dataset} --seed {seed} --cuda --model_checkpoints {model_checkpoint1} {model_checkpoint2} {...} --weights {weight1} {weight2} {...}\n❱❱❱ python bitext.py --src_lang de --dataset bucc --seed 42 --cuda --model_checkpoints sentence-transformers/LaBSE intfloat/multilingual-e5-large --weights 0.25 0.75\n```\n\n### Retrieval-based Classification\n#### Monolingual setting\n```\n❱❱❱ python classification.py --dataset {dataset} --seed {seed} --cuda --model_checkpoint {model_checkpoint}\n❱❱❱ python classification.py --dataset nusax --seed 42 --cuda --model_checkpoint sentence-transformers/LaBSE\n```\n\n#### Cross-lingual setting\nAdd `--src_lang` and `--cross` to the command.\n```\n❱❱❱ python classification.py --src_lang {src_lang} --cross --dataset {dataset} --seed {seed} --cuda --model_checkpoint {model_checkpoint}\n❱❱❱ python classification.py --src_lang eng --cross --dataset nusax --seed 42 --cuda --model_checkpoint sentence-transformers/LaBSE\n```\n\n#### Ensemble\nThe arguments are similar to those above, 
except we use `--model_checkpoints` and `--weights`\n```\n❱❱❱ python classification.py --dataset {dataset} --seed {seed} --cuda --model_checkpoints {model_checkpoint1} {model_checkpoint2} {...} --weights {weight1} {weight2} {...}\n❱❱❱ python classification.py --dataset nusax --seed 42 --cuda --model_checkpoints sentence-transformers/LaBSE intfloat/multilingual-e5-large --weights 0.25 0.75\n```\n\n### ICL Classification\n#### Monolingual setting\n```\n❱❱❱ python icl.py --dataset {dataset} --seed 42 --instruction {instruction} --model_checkpoint {model} --gen_model_checkpoint {gen_model_checkpoint}  --cuda --load_in_8bit --k {k}\n❱❱❱ python icl.py --dataset nusax --seed 42 --instruction \"Generate a sentiment label for a given input.\\nPlease only output the label.\" --model_checkpoint sentence-transformers/LaBSE --gen_model_checkpoint meta-llama/Meta-Llama-3-8B-Instruct  --cuda --load_in_8bit --k 1\n```\n\n#### Cross-lingual setting\nAdd `--src_lang` and `--cross` to the command.\n```\n❱❱❱ python icl.py --src_lang {src_lang} --cross --dataset {dataset} --seed 42 --instruction {instruction} --model_checkpoint {model} --gen_model_checkpoint {gen_model_checkpoint}  --cuda --load_in_8bit --k {k}\n❱❱❱ python icl.py --src_lang eng --cross --dataset nusax --seed 42 --instruction \"Generate a sentiment label for a given input.\\nPlease only output the label.\" --model_checkpoint sentence-transformers/LaBSE --gen_model_checkpoint meta-llama/Meta-Llama-3-8B-Instruct  --cuda --load_in_8bit --k 1\n```\n\n## 📈 Aggregating Experiment Results\nAdd `--k` to modify the number of retrieved samples.\n```\n❱❱❱ python script/aggregate/aggregate_bitext_mining.py --k {k}\n❱❱❱ python script/aggregate/aggregate_classification.py --k {k}\n❱❱❱ python script/aggregate/aggregate_classification_cross.py --k {k}\n❱❱❱ python script/aggregate/aggregate_icl.py --k {k}\n❱❱❱ python script/aggregate/aggregate_icl_cross.py --k {k}\n❱❱❱ python script/aggregate/aggregate_icl_percentile.py --k 
{k}\n```\n\n## 🏞️ Visualizing the Embeddings\n```\n❱❱❱ python visualize.py --model_checkpoint {model_checkpoint} --dataset {dataset} --seed {seed} --cuda\n❱❱❱ python visualize.py --model_checkpoint sentence-transformers/LaBSE --dataset nusax --seed 42 --cuda\n```\n\n### Examples of the visualization by class labels: LaBSE (left) and XLM-R BASE (right)\n\u003cimg src=\"assets/scatter_plots/tsne_nusax_LaBSE_class.png\" width=\"35%\"\u003e \u003cimg src=\"assets/scatter_plots/tsne_nusax_xlm-roberta-base_class.png\" width=\"35%\"\u003e\n\n### Examples of the visualization by sample ID: LaBSE (left) and XLM-R BASE (right)\n\u003cimg src=\"assets/scatter_plots/tsne_nusax_LaBSE.png\" width=\"35%\"\u003e \u003cimg src=\"assets/scatter_plots/tsne_nusax_xlm-roberta-base.png\" width=\"35%\"\u003e\n\n## 💻 Models Support\nOur codebase supports the usage of multiple models for the experiments, providing flexibility for customization beyond the list shown below:\n### Encoder LMs and APIs\n#### Open-source LMs:\n- [sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE)\n- [sentence-transformers/use-cmlm-multilingual](https://huggingface.co/sentence-transformers/use-cmlm-multilingual)\n- [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base)\n- [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)\n- [sentence-transformers/paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2)\n- [microsoft/Multilingual-MiniLM-L12-H384](https://huggingface.co/microsoft/Multilingual-MiniLM-L12-H384)\n- [cis-lmu/glot500-base](https://huggingface.co/cis-lmu/glot500-base)\n- [FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base)\n- [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large)\n\n#### Commercial embedding APIs (last tested as of June 2024)\n- Cohere-Embedv3\n- OpenAI-Embedv3\n\n### 
Generative LMs:\n- BLOOMZ [bigscience/bloomz-560m](https://huggingface.co/bigscience/bloomz-560m) [bigscience/bloom-1b7](https://huggingface.co/bigscience/bloom-1b7) [bigscience/bloomz-3b](https://huggingface.co/bigscience/bloomz-3b)\n- mT0 [bigscience/mt0-xl](https://huggingface.co/bigscience/mt0-xl)\n- XGLM [facebook/xglm-564M](https://huggingface.co/facebook/xglm-564M) [facebook/xglm-2.9B](https://huggingface.co/facebook/xglm-2.9B)\n- Aya-23 [CohereForAI/aya-23-8B](https://huggingface.co/CohereForAI/aya-23-8B)\n- Aya-101 [CohereForAI/aya-101](https://huggingface.co/CohereForAI/aya-101)\n- Gemma 1.1 Instruct [google/gemma-1.1-7b-it](https://huggingface.co/google/gemma-1.1-7b-it)\n- Llama 3 8B Instruct [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)\n- Llama 3.1 8B Instruct [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)\n- GPT models (last tested as of June 2024)\n- Cohere Command R (last tested as of June 2024)\n\n\n## 🚀 How to Contribute?\nFeel free to create [an issue](https://github.com/gentaiscool/miners/issues/) if you have any questions, or open [a PR](https://github.com/gentaiscool/miners/pulls) to fix bugs or add improvements (e.g., new datasets or models).\n\nIf you are interested in creating an extension of this work, feel free to reach out to [us](mailto:gentaindrawinata@gmail.com)!\n\nSupport our open source effort ⭐\n\n## On Progress\nWe are improving the code to make it more user-friendly and customizable. We have created a new repository for implementing DistFuse, which is available at [https://github.com/gentaiscool/distfuse/](https://github.com/gentaiscool/distfuse/). You can install it by running `pip install distfuse`. 
Later, it will be integrated into this repository.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgentaiscool%2Fminers","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgentaiscool%2Fminers","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgentaiscool%2Fminers/lists"}