Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/orionw/LM-expansions
When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets
information-retrieval ir llm machine-learning nlp prompting

JSON representation
- Host: GitHub
- URL: https://github.com/orionw/LM-expansions
- Owner: orionw
- License: mit
- Created: 2024-01-31T22:29:16.000Z (11 months ago)
- Default Branch: master
- Last Pushed: 2024-06-29T22:12:36.000Z (6 months ago)
- Last Synced: 2024-12-30T01:35:35.640Z (9 days ago)
- Topics: information-retrieval, ir, llm, machine-learning, nlp, prompting
- Language: Python
- Homepage: https://arxiv.org/abs/2309.08541
- Size: 67.4 KB
- Stars: 5
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome_ai_agents - Lm-Expansions - When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets (Building / Datasets)
README
# When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
Official repository for the paper [When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets](https://aclanthology.org/2024.findings-eacl.134/) including code to reproduce and links to the data generated by the models.
## Table of Contents
- [Overview](#overview)
- [Data](#data)
- [Requirements](#requirements)
- [Setup](#setup)
- [Reproduce](#reproduce)
- [Reproduce Expansions Data](#reproduce-expansions-data)
- [Reproduce Model Results Using Expansions](#reproduce-model-results-using-expansions)
- [License](#license)
- [Citing](#citing)

## Overview
This project presents a comprehensive study on generative query and document expansions across various methods, retrievers, and datasets. It aims to identify when these expansions fail and provide insights into improving information retrieval systems.
## Data
The generations from the models can be found at [orionweller/llm-based-expansions-generations](https://huggingface.co/datasets/orionweller/llm-based-expansions-generations), organized by dataset and expansion type.
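The exact schema of these files is not documented here, but expansion generations are commonly stored as JSONL with one record per line. As an illustrative sketch (the `query_id` and `expansion` field names are assumptions; inspect the downloaded files for the real ones):

```python
import json
from collections import defaultdict

def load_generations(path):
    """Group generated expansions by query, assuming JSONL records
    with hypothetical `query_id` and `expansion` fields."""
    by_query = defaultdict(list)
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            by_query[record["query_id"]].append(record["expansion"])
    return dict(by_query)
```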
## Requirements
- Python 3.10
- conda
- OpenAI API key (for using OpenAI models)
- Together.ai or Anthropic API keys (if using their services)
- GPU (if using Llama for generation)
- pyserini (for reproducing BM25 results)

## Setup
1. Clone the repository:
```
git clone https://github.com/orionw/LM-expansions.git
cd LM-expansions
```

2. Create and activate the conda environment:
```
conda env create --file=environment.yaml -y && conda activate expansions
```

3. Download the local data:
```
git clone https://huggingface.co/datasets/orionweller/llm-based-expansions-eval-datasets
```

This dataset contains local data not otherwise available on Hugging Face, such as `scifact-refute`, along with other datasets converted to a common format. To reproduce the creation of `scifact-refute`, see `scripts/make_scifact_refute.py`.
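The layout of that common format is not specified here. Purely as an illustration, assuming a BEIR-style layout (`corpus.jsonl`, `queries.jsonl`, and a tab-separated qrels file — all names here are assumptions, not the repository's documented structure), a minimal loader might look like:

```python
import csv
import json
from pathlib import Path

def load_ir_dataset(dataset_dir):
    """Load an IR dataset assuming a hypothetical BEIR-style layout.

    Returns (corpus, queries, qrels); check the downloaded files for the
    actual file and field names each dataset uses.
    """
    root = Path(dataset_dir)
    corpus, queries, qrels = {}, {}, {}
    with open(root / "corpus.jsonl") as f:
        for line in f:
            doc = json.loads(line)
            corpus[doc["_id"]] = {"title": doc.get("title", ""), "text": doc["text"]}
    with open(root / "queries.jsonl") as f:
        for line in f:
            q = json.loads(line)
            queries[q["_id"]] = q["text"]
    with open(root / "qrels" / "test.tsv") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return corpus, queries, qrels
```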
## Reproduce
### Reproduce Expansions Data
1. Set up your environment variables (e.g., `OPENAI_API_KEY`) if using OpenAI models.
2. Create or modify a prompt config (examples are in `prompt_configs/*`), then run the generation script with it. For instance:
```
bash generate_expansions.sh scifact_refute prompt_configs/chatgpt_doc2query.jsonl
```

3. Adjust parameters as needed:
- `num_examples`: maximum number of instances to predict
- `temperature`: controls the randomness of predictions

Note: if using Together.ai or Anthropic models, set their API keys accordingly. For Llama generation, make sure a GPU is available.
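How `num_examples` and the generation call fit together can be sketched as follows. This is a simplified illustration, not the repository's actual code: `expand_queries` and `generate_fn` are hypothetical names, with `generate_fn` standing in for the real OpenAI/Together.ai/Anthropic/Llama call, where `temperature` would control sampling randomness.

```python
from itertools import islice

def expand_queries(queries, generate_fn, num_examples=None):
    """Run an expansion call over at most `num_examples` queries.

    `generate_fn` is a stand-in for the actual LLM API call;
    passing num_examples=None processes every query.
    """
    selected = islice(queries.items(), num_examples)
    return {qid: generate_fn(text) for qid, text in selected}

# Stub "LLM" for demonstration; a real run would call an API instead.
expansions = expand_queries(
    {"q1": "effects of aspirin", "q2": "protein folding"},
    generate_fn=lambda q: f"A document about {q}.",
    num_examples=1,
)
```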
### Reproduce Model Results Using Expansions
1. Run the model using the following command structure:
```
bash rerank.sh <"none" if not using document expansions, otherwise "replace" or "append" the document with the expansion> <"none" if not using query expansions, otherwise "replace" or "append" the query with the expansion>
```

Example:
```
bash rerank.sh "scifact_refute" "testing" 0 1 "none" "none" "llm-based-expansions-generations/scifact_refute/expansion_hyde_chatgpt64.jsonl" "replace" "contriever_msmarco" 10 100
```

2. Results will be written to `results///--run.txt`.
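The three expansion flag values (`none`, `replace`, `append`) can be pictured with the sketch below; this is an illustration of the idea, not the repository's actual implementation.

```python
def apply_expansion(text, expansion, mode):
    """Combine a query or document with its expansion.

    "none" leaves the text untouched, "replace" substitutes the
    expansion, and "append" concatenates the two.
    """
    if mode == "none":
        return text
    if mode == "replace":
        return expansion
    if mode == "append":
        return f"{text} {expansion}"
    raise ValueError(f"unknown expansion mode: {mode}")
```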
3. Evaluate the results:
```
bash evaluate.sh scifact_refute testing
```

To reproduce the top 1000 BM25 results:
1. Install `pyserini` following their [installation docs](https://github.com/castorini/pyserini/blob/master/docs/installation.md).
2. Run the BM25 retrieval:
```
bash make_bm25_run.sh
```

Example:
```
bash make_bm25_run.sh bm25 scifact_refute doc_id "title,text" query_id text
```

## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Citing
If you found the code, data, or paper useful, please cite:
```bibtex
@inproceedings{weller-etal-2024-generative,
    title = "When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets",
    author = "Weller, Orion and
      Lo, Kyle and
      Wadden, David and
      Lawrie, Dawn and
      Van Durme, Benjamin and
      Cohan, Arman and
      Soldaini, Luca",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2024",
    month = mar,
    year = "2024",
    address = "St. Julian{'}s, Malta",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-eacl.134",
    pages = "1987--2003",
}
```

This project also builds on many others (see the paper for a full list of references), including code from [TART](https://github.com/facebookresearch/tart/tree/main) and [InPars](https://github.com/zetaalphavector/InPars); please check them and the others out!