https://github.com/amzn/informative-diverse-hard-negative-sampling

Experimental code for our paper on informative and diverse sampling of negative examples for dense retrieval

active-learning dense-retrieval information-retrieval negative-sampling

Host: GitHub
URL: https://github.com/amzn/informative-diverse-hard-negative-sampling
Owner: amzn
License: apache-2.0
Created: 2024-02-21T09:53:55.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-03-10T04:52:41.000Z (over 2 years ago)
Last Synced: 2025-04-30T14:26:27.886Z (about 1 year ago)
Topics: active-learning, dense-retrieval, information-retrieval, negative-sampling
Language: Python
Homepage: https://www.amazon.science/publications/indi-informative-and-diverse-sampling-for-dense-retrieval
Size: 6.85 MB
Stars: 3
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md

README

# InDi: Informative and Diverse Sampling for Dense Retrieval
InDi is an extension of the popular [Tevatron package](https://github.com/texttron/tevatron/) ([commit](https://github.com/texttron/tevatron/commit/b8f33900895930f9886012580e85464a5c1f7e9a)), adding a novel procedure for selecting negative samples.
Specifically, it draws on ideas from the Active Learning field to find samples about which the model is uncertain, while ensuring high diversity among the selected samples.
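
To make the "uncertain yet diverse" idea concrete, here is a minimal, generic active-learning sketch. It is not the procedure from the paper or from this repository: the function names, the uncertainty measure (closeness of a relevance probability to 0.5), and the greedy farthest-point step are all assumptions chosen only for illustration.

```python
import math


def uncertainty(prob):
    """Hypothetical uncertainty measure: 1.0 at prob=0.5, 0.0 at prob=0 or 1."""
    return 1.0 - abs(2.0 * prob - 1.0)


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_negatives(candidates, k, pool_size):
    """Pick k candidate ids that are uncertain and spread out in embedding space.

    candidates: list of (doc_id, relevance_prob, embedding) tuples.
    This is an illustrative sketch, not InDi's actual selection rule.
    """
    # Informative step: keep only the pool_size most uncertain candidates.
    pool = sorted(candidates, key=lambda c: uncertainty(c[1]), reverse=True)
    pool = pool[:pool_size]
    if not pool or k <= 0:
        return []

    # Diverse step: greedy farthest-point selection inside the pool.
    selected = [pool[0]]
    remaining = pool[1:]
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda c: min(euclidean(c[2], s[2]) for s in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return [c[0] for c in selected]
```

In this sketch, a confidently scored document is dropped in the first step no matter how far it lies from the others, so diversity is only enforced among samples the model is unsure about.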

## Instructions

### Prerequisite
To run InDi, you need the MS-MARCO corpus, a trained baseline Tevatron model (called S1), and vector embeddings for all documents in the corpus.
To obtain these, follow the instructions at https://github.com/texttron/tevatron/blob/main/examples/coCondenser-marco/README.md (up to and including https://github.com/texttron/tevatron/blob/main/examples/coCondenser-marco/README.md#search).
The following output files are generated in this process (and must appear in the `resources` directory):
* `corpus/*.json` - the tokenized MS-MARCO corpus.
* `encoding/*.pt` - the vector embeddings generated by the S1 model (of the MS-MARCO corpus).
* `qrels.train.tsv` - the QRels file of the training dataset.
* `scores/*.parquet` - dual-encoder scores (and, optionally, cross-encoder scores) for the top-200 documents retrieved by model S1. Each file contains the following columns: `qid`, `docid`, `de_score`, `ce_score` (optional).
* `train.query.txt` - the queries in the training dataset.
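
Before launching a run, it can help to confirm that the `resources` directory matches the layout listed above. The following is a small sketch using only the standard library; the helper names `missing_resources` and `missing_score_columns` are hypothetical and are not part of InDi or Tevatron.

```python
from pathlib import Path

# Glob patterns copied from the file list above, relative to `resources`.
REQUIRED_PATTERNS = [
    "corpus/*.json",
    "encoding/*.pt",
    "qrels.train.tsv",
    "scores/*.parquet",
    "train.query.txt",
]

# Columns the scores files must contain; `ce_score` is optional.
REQUIRED_SCORE_COLUMNS = {"qid", "docid", "de_score"}


def missing_resources(resources_dir):
    """Return the patterns that match no file under resources_dir."""
    root = Path(resources_dir)
    return [p for p in REQUIRED_PATTERNS if not any(root.glob(p))]


def missing_score_columns(columns):
    """Return the required score columns absent from the given column names."""
    return sorted(REQUIRED_SCORE_COLUMNS - set(columns))
```

Calling `missing_resources("resources")` returns an empty list when every expected file is present. `missing_score_columns` takes the column names of a scores file, however you choose to read them, and reports which required ones are absent.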

For evaluation, the QRels file must be downloaded from https://microsoft.github.io/msmarco/.

### Running
To execute InDi, run:
```
python -m active_learning.main_marco
```

## Citation
If you find InDi helpful, please consider citing our [paper](https://www.amazon.science/publications/indi-informative-and-diverse-sampling-for-dense-retrieval).
```
@article{cohen2024indi,
title={InDi: Informative and diverse sampling for dense retrieval},
author={Cohen, Nachshon and Indelman, Hedda Cohen and Fairstein, Yaron and Kushilevitz, Guy},
journal={ECIR},
year={2024}
}
```
