# When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Official repository for the paper [When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets](https://aclanthology.org/2024.findings-eacl.134/), including code to reproduce our experiments and links to the data generated by the models.

## Table of Contents

- [Overview](#overview)
- [Data](#data)
- [Requirements](#requirements)
- [Setup](#setup)
- [Reproduce](#reproduce)
  - [Reproduce Expansions Data](#reproduce-expansions-data)
  - [Reproduce Model Results Using Expansions](#reproduce-model-results-using-expansions)
- [License](#license)
- [Citing](#citing)

## Overview

This project presents a comprehensive study on generative query and document expansions across various methods, retrievers, and datasets. It aims to identify when these expansions fail and provide insights into improving information retrieval systems.

## Data

The generations from the models can be found at [orionweller/llm-based-expansions-generations](https://huggingface.co/datasets/orionweller/llm-based-expansions-generations), organized by dataset and expansion type.
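
To fetch these generations locally, you can clone the dataset repository from the Hugging Face Hub, for example with `git` (large files may require `git lfs`):

```
git clone https://huggingface.co/datasets/orionweller/llm-based-expansions-generations
```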

## Requirements

- Python 3.10
- conda
- OpenAI API key (for using OpenAI models)
- Together.ai or Anthropic API keys (if using their services)
- GPU (if using Llama for generation)
- pyserini (for BM25 results reproduction)
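
Before setup, a quick sanity check of the prerequisites can save time (a minimal sketch; adjust to your platform):

```
conda --version    # conda must be on PATH to create the environment
python3 --version  # the environment itself pins Python 3.10
nvidia-smi         # optional: confirms a visible GPU for Llama generation
```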

## Setup

1. Clone the repository:
```
git clone https://github.com/orionw/LM-expansions.git
cd LM-expansions
```

2. Install the correct Python environment:
```
conda env create --file=environment.yaml -y && conda activate expansions
```

3. Download the local data:
```
git clone https://huggingface.co/datasets/orionweller/llm-based-expansions-eval-datasets
```

This repository provides the evaluation datasets in a common format, including data not otherwise available on HuggingFace, such as `scifact_refute`. To reproduce the creation of `scifact_refute`, see `scripts/make_scifact_refute.py`.
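
Regenerating it is a single script invocation (a sketch, assuming the script needs no required arguments; check its help output for options):

```
python scripts/make_scifact_refute.py
```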

## Reproduce

### Reproduce Expansions Data

1. Set up your environment variables (e.g., `OPENAI_API_KEY`) if using OpenAI models.
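
For example, in your shell before running any scripts (`OPENAI_API_KEY` is named above; `TOGETHER_API_KEY` and `ANTHROPIC_API_KEY` are the variable names those providers' clients commonly read, so confirm against the scripts you run):

```
export OPENAI_API_KEY="sk-..."     # required for OpenAI models
export TOGETHER_API_KEY="..."      # only if using Together.ai
export ANTHROPIC_API_KEY="..."     # only if using Anthropic
```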

2. Create or modify a prompt config (examples are in `prompt_configs/*`), then run the generation script with a dataset name and config. For instance:
```
bash generate_expansions.sh scifact_refute prompt_configs/chatgpt_doc2query.jsonl
```

3. Adjust parameters as needed:
- `num_examples`: maximum number of instances to predict
- `temperature`: controls the randomness of predictions
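
For illustration only, a config entry might look like the JSONL sketch below; every field name here is hypothetical except `num_examples` and `temperature`, so treat the files in `prompt_configs/*` as the authoritative schema:

```
{"prompt": "Write a question answered by this document: {document}", "model": "gpt-3.5-turbo", "num_examples": 100, "temperature": 0.7}
```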

Note: If using Together.ai or Anthropic models, define their API keys accordingly (see the export example above). If generating with Llama, ensure a GPU is available.

### Reproduce Model Results Using Expansions

1. Run the model using `rerank.sh`. Two of its arguments control how expansions are applied; see the example below for the full argument list:
```
bash rerank.sh <"none" if not using document expansions, otherwise "replace" or "append" the document with the expansion> <"none" if not using query expansions, otherwise "replace" or "append" the query with the expansion>
```
```

Example:
```
bash rerank.sh "scifact_refute" "testing" 0 1 "none" "none" "llm-based-expansions-generations/scifact_refute/expansion_hyde_chatgpt64.jsonl" "replace" "contriever_msmarco" 10 100
```

2. Results will be written to a run file ending in `-run.txt` under the `results/` directory.

3. Evaluate the results:
```
bash evaluate.sh scifact_refute testing
```

To reproduce the top 1000 BM25 results:

1. Install `pyserini` following their [installation docs](https://github.com/castorini/pyserini/blob/master/docs/installation.md).
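
Their docs cover the full setup, including the Java prerequisite; the minimal route is a pip install (a sketch; pinned versions and extras may matter for reproducing exact scores):

```
pip install pyserini
```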

2. Run the BM25 retrieval:
```
bash make_bm25_run.sh
```

Example:
```
bash make_bm25_run.sh bm25 scifact_refute doc_id "title,text" query_id text
```

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Citing

If you found the code, data or paper useful, please cite:

```bibtex
@inproceedings{weller-etal-2024-generative,
    title = "When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets",
    author = "Weller, Orion and
      Lo, Kyle and
      Wadden, David and
      Lawrie, Dawn and
      Van Durme, Benjamin and
      Cohan, Arman and
      Soldaini, Luca",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2024",
    month = mar,
    year = "2024",
    address = "St. Julian{'}s, Malta",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-eacl.134",
    pages = "1987--2003",
}
```

This project also builds on many others (see the paper for a full list of references), including code from [TART](https://github.com/facebookresearch/tart/tree/main) and [InPars](https://github.com/zetaalphavector/InPars). Please check them out!