{"id":19406099,"url":"https://github.com/worldbank/gistembed","last_synced_at":"2025-04-24T09:31:03.343Z","repository":{"id":224635875,"uuid":"754251340","full_name":"worldbank/GISTEmbed","owner":"worldbank","description":"GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embeddings","archived":false,"fork":false,"pushed_at":"2024-03-06T04:04:24.000Z","size":1366,"stargazers_count":39,"open_issues_count":2,"forks_count":2,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-04-03T02:11:14.730Z","etag":null,"topics":["deep-learning","embedding-models","fine-tuning","huggingface","mteb","sentence-embeddings","sentence-transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/worldbank.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-02-07T17:34:23.000Z","updated_at":"2025-03-27T05:47:49.000Z","dependencies_parsed_at":"2024-03-06T05:23:59.046Z","dependency_job_id":"e3a36af4-7494-4d8c-91df-62f1a0fd5c67","html_url":"https://github.com/worldbank/GISTEmbed","commit_stats":null,"previous_names":["avsolatorio/gistembed","worldbank/gistembed"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/worldbank%2FGISTEmbed","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/worldbank%2FGISTEmbed/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/worldbank%2FGISTEmbed/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/worl
dbank%2FGISTEmbed/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/worldbank","download_url":"https://codeload.github.com/worldbank/GISTEmbed/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250600622,"owners_count":21456996,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","embedding-models","fine-tuning","huggingface","mteb","sentence-embeddings","sentence-transformers"],"created_at":"2024-11-10T11:41:02.483Z","updated_at":"2025-04-24T09:31:03.332Z","avatar_url":"https://github.com/worldbank.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# GISTEmbed\n\nThe GISTEmbed framework (Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning) introduces an innovative approach to dynamically mine training negatives within a batch, serving as contrastive samples for fine-tuning embedding models. At the core of GISTEmbed is the utilization of a guide model, which assesses the relevance of samples in the batch against a query-positive pair. This model ensures that only examples deemed irrelevant are selected as training negatives.\n\nThis methodology is particularly advantageous for fine-tuning smaller models, leading to notable improvements across a wide range of NLP tasks. 
By focusing on the in-sample selection of negatives, GISTEmbed addresses common challenges in contrastive learning, such as the efficient and effective identification of informative negative samples.\n\nCompared to traditional methods, which often rely on random or heuristic-based selection, GISTEmbed's guided approach ensures a higher quality of training negatives, contributing to more robust and generalizable embeddings.\n\n\u003cbr\u003e\n\u003cbr\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://github.com/avsolatorio/GISTEmbed/raw/main/img/GISTEmbed%20Model.png\" style=\"width:75%\"/\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n\u003cstrong\u003eGISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning\u003c/strong\u003e\n\u003cbr\u003e\n\u003ca href=\"https://arxiv.org/abs/2402.16829\" target=\"_blank\"\u003ePaper on ArXiv\u003c/a\u003e\n\u003c/p\u003e\n\u003cbr\u003e\n\n\nThe model does not require any instruction for generating embeddings. This means that queries for retrieval tasks can be directly encoded without crafting instructions.\n\n# Trained models\n\nWe have fine-tuned various models using the GISTEmbed framework. The models are available on the Hugging Face model hub:\n\n- [avsolatorio/GIST-large-Embedding-v0](https://huggingface.co/avsolatorio/GIST-large-Embedding-v0): The model fine-tuned using the GISTEmbed framework and the MEDI+MTEBcls dataset. The base model used is the [`BAAI/bge-large-en-v1.5`](https://huggingface.co/BAAI/bge-large-en-v1.5).\n- [avsolatorio/GIST-Embedding-v0](https://huggingface.co/avsolatorio/GIST-Embedding-v0): The model fine-tuned using the GISTEmbed framework and the MEDI+MTEBcls dataset. 
The base model used is the [`BAAI/bge-base-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5).\n- [avsolatorio/GIST-small-Embedding-v0](https://huggingface.co/avsolatorio/GIST-small-Embedding-v0): The model fine-tuned using the GISTEmbed framework and the MEDI+MTEBcls dataset. The base model used is the [`BAAI/bge-small-en-v1.5`](https://huggingface.co/BAAI/bge-small-en-v1.5).\n- [avsolatorio/GIST-all-MiniLM-L6-v2](https://huggingface.co/avsolatorio/GIST-all-MiniLM-L6-v2): The model fine-tuned using the GISTEmbed framework and the MEDI+MTEBcls dataset. The base model used is the [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).\n\n\n# Data\n\nThe dataset used is a compilation of the MEDI dataset and the MTEB Classification training dataset. Third-party datasets may be subject to additional terms and conditions under their associated licenses. A HuggingFace Dataset version of the compiled dataset, and the specific revision used to train the model, is available:\n\n- Dataset: [avsolatorio/medi-data-mteb_avs_triplets](https://huggingface.co/datasets/avsolatorio/medi-data-mteb_avs_triplets)\n- Revision: 238a0499b6e6b690cc64ea56fde8461daa8341bb\n\nThe dataset contains a `task_type` key which can be used to select only the mteb classification tasks (prefixed with `mteb_`).\n\nThe **MEDI Dataset** is published in the following paper: [One Embedder, Any Task: Instruction-Finetuned Text Embeddings](https://arxiv.org/abs/2212.09741).\n\nThe MTEB Benchmark results of the GIST embedding model, compared with the base model, suggest that the fine-tuning dataset has perturbed the model considerably, which resulted in significant improvements in certain tasks while adversely degrading performance in some.\n\nThe retrieval performance for the TRECCOVID task is of note. The fine-tuning dataset does not contain significant knowledge about COVID, which could have caused the observed performance degradation. 
We found some evidence, detailed in the paper, that thematic coverage of the fine-tuning data can affect downstream performance.\n\n# Usage\n\nThe model can be easily loaded using the Sentence Transformers library.\n\n```Python\nimport torch.nn.functional as F\nfrom sentence_transformers import SentenceTransformer\n\nrevision = None  # Replace with the specific revision to ensure reproducibility in  case the model is updated.\n\nmodel = SentenceTransformer(\"avsolatorio/GIST-Embedding-v0\", revision=revision)\n\ntexts = [\n    \"Illustration of the REaLTabFormer model. The left block shows the non-relational tabular data model using GPT-2 with a causal LM head. In contrast, the right block shows how a relational dataset's child table is modeled using a sequence-to-sequence (Seq2Seq) model. The Seq2Seq model uses the observations in the parent table to condition the generation of the observations in the child table. The trained GPT-2 model on the parent table, with weights frozen, is also used as the encoder in the Seq2Seq model.\",\n    \"Predicting human mobility holds significant practical value, with applications ranging from enhancing disaster risk planning to simulating epidemic spread. In this paper, we present the GeoFormer, a decoder-only transformer model adapted from the GPT architecture to forecast human mobility.\",\n    \"As the economies of Southeast Asia continue adopting digital technologies, policy makers increasingly ask how to prepare the workforce for emerging labor demands. 
However, little is known about the skills that workers need to adapt to these changes\"\n]\n\n# Compute embeddings\nembeddings = model.encode(texts, convert_to_tensor=True)\n\n# Compute cosine-similarity for each pair of sentences\nscores = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)\n\nprint(scores.cpu().numpy())\n```\n\n# Guided in-batch contrastive loss\n\nFor anyone interested in the technical implementation of GISTEmbed as a training mechanism, please refer to the loss computation implemented in the loss function [`guided_in_batch_contrastive_loss`](https://github.com/avsolatorio/GISTEmbed/blob/538e3d749b1944e8362c5566385111763866fa4c/gist_embed/trainer/loss.py#L599).\n\nThis loss function is subsequently used in the [`GISTTrainer`](https://github.com/avsolatorio/GISTEmbed/blob/538e3d749b1944e8362c5566385111763866fa4c/gist_embed/trainer/trainer.py#L127).\n\n\n# Reproducibility\n\nThis section outlines how to fine-tune models using the GISTEmbed framework. The following steps are necessary to reproduce the results:\n\n\nFirst, create a new conda environment and install poetry.\n\n```\nconda create -n GISTEmbed python=3.10\n\nconda activate GISTEmbed\n\npip install poetry\n```\n\nNext, clone the repository and install the dependencies.\n\n```\ngit clone https://github.com/avsolatorio/GISTEmbed.git\n\ncd GISTEmbed\n\npoetry install\n```\n\nTo reduce the likelihood of encountering issues and unexpected training runs, we set up a convention that validates the intended parameters and configurations.\n\nOne can refer to the [gist_embed/validator.py](gist_embed/validator.py) file to see the validation logic. Additional configurations must be registered in the validator to ensure that the intended parameters are correctly set.\n\nAfter registering the intended configurations, an experiment script can be created to fine-tune the model. 
See example: [experiments/01-600-11-1-2-2-0-0-cls-normed-384-512_run_finetune_experiment.sh](experiments/01-600-11-1-2-2-0-0-cls-normed-384-512_run_finetune_experiment.sh).\n\nDetails of the arguments used in the script can be found in the [gist_embed/trainer/arguments](gist_embed/trainer/arguments) file.\n\nTo run the experiment, simply execute the following command:\n\n```\nbash experiments/01-600-11-1-2-2-0-0-cls-normed-384-512_run_finetune_experiment.sh\n```\n\nThe script will execute the experiment and save the model to the specified output directory. There are configurations in the script that handle model checkpointing to the Hugging Face model hub. Be sure to change the `--callback_hub_organization \u003corganization\u003e` to the appropriate organization.\n\nThe script also uses WANDB for logging. Be sure to set the `WANDB_API_KEY` environment variable to enable logging to WANDB.\n\n# Base model\n\nWe have implemented some tricks on top of the (excellent!) Sentence Transformers library to support the GISTEmbed framework. One notable trick is supporting gradient checkpointing for training the models. This is particularly useful for training large models with limited GPU memory.\n\nSee the [gist_embed/base.py](gist_embed/base.py) file for the implementation details.\n\n# Training Parameters\n\nBelow are the training parameters used to fine-tune the model:\n\n```\nEpochs = 80\nWarmup ratio = 0.1\nLearning rate = 5e-6\nBatch size = 32\nCheckpoint step = 103500\nContrastive loss temperature = 0.01\n```\n\nSpecific training details and strategies will be published shortly.\n\n# Evaluation\n\nThe model was evaluated using the [MTEB Evaluation](https://huggingface.co/mteb) suite.\n\n\n# Citation\nPlease cite our work if you use GISTEmbed or the datasets we published in your projects or research.\n\n```\n@article{solatorio2024gistembed,\n    title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning},\n    author={Aivin V. 
Solatorio},\n    journal={arXiv preprint arXiv:2402.16829},\n    year={2024},\n    URL={https://arxiv.org/abs/2402.16829},\n    eprint={2402.16829},\n    archivePrefix={arXiv},\n    primaryClass={cs.LG}\n}\n```\n\n# Acknowledgements\n\nThis work is supported by the \"KCP IV - Exploring Data Use in the Development Economics Literature using Large Language Models (AI and LLMs)\" project funded by the [Knowledge for Change Program (KCP)](https://www.worldbank.org/en/programs/knowledge-for-change) of the World Bank - RA-P503405-RESE-TF0C3444.\n\nThe findings, interpretations, and conclusions expressed in this material are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.\n\nWe also send 🤗 to the HuggingFace 🤗, Sentence Transformers, PyTorch, and to all open-sourced projects for all the open-sourced software they release.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fworldbank%2Fgistembed","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fworldbank%2Fgistembed","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fworldbank%2Fgistembed/lists"}