https://github.com/ghost---shadow/insquad

information-retrieval large-language-models submodular-optimization

Host: GitHub
URL: https://github.com/ghost---shadow/insquad
Owner: Ghost---Shadow
License: apache-2.0
Created: 2024-04-03T22:35:38.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2025-08-27T04:52:27.000Z (8 months ago)
Last Synced: 2025-08-27T11:54:59.517Z (7 months ago)
Topics: information-retrieval, large-language-models, submodular-optimization
Language: Python
Homepage:
Size: 592 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# InSQuaD: In-Context Learning for Efficient Retrieval via Submodular Mutual Information to Enforce Quality and Diversity

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/release/python-390/)
[![CI](https://github.com/Ghost---Shadow/InSQuaD/actions/workflows/python_ci.yml/badge.svg)](https://github.com/Ghost---Shadow/InSQuaD/actions/workflows/python_ci.yml)

InSQuaD is a research framework for efficient in-context learning that leverages submodular mutual information to optimize the quality-diversity tradeoff in example selection for large language models. This implementation supports various retrieval methods, subset selection strategies, and generative models for comprehensive evaluation across multiple datasets.

## 🚀 Features

- **Submodular Optimization**: Implementation of facility location and graph cut losses for quality-diversity tradeoffs
- **Multiple Retrieval Methods**: Support for semantic search models (MPNet, sentence transformers) and dense indexes (FAISS)
- **Diverse Datasets**: Pre-configured loaders for MRPC, SST, MNLI, DBPedia, RTE, HellaSwag, XSum, MultiWOZ, and GeoQ
- **Flexible Architecture**: Modular design supporting various generative models (OpenAI, HuggingFace transformers)
- **Comprehensive Evaluation**: Built-in metrics and analysis tools for experimental evaluation
- **Experiment Management**: YAML-based configuration system with Weights & Biases integration
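
The quality-diversity trade-off behind the first bullet can be illustrated with a greedy maximizer of a facility-location function. This is a minimal sketch under stated assumptions, not the repository's implementation: the helper names `facility_location` and `greedy_select` and the toy similarity matrix are hypothetical.

```python
from typing import List, Sequence


def facility_location(sim: Sequence[Sequence[float]], selected: List[int]) -> float:
    """f(S) = sum over items i of max_{j in S} sim[i][j]; 0 for the empty set."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy_select(sim: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Pick up to `budget` items, each time adding the largest marginal gain."""
    selected: List[int] = []
    candidates = set(range(len(sim)))
    for _ in range(min(budget, len(sim))):
        current = facility_location(sim, selected)
        best = max(
            sorted(candidates),  # sorted for deterministic tie-breaking
            key=lambda j: facility_location(sim, selected + [j]) - current,
        )
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical toy data: items 0 and 1 are near-duplicates, item 2 is distinct.
SIM = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.1],
    [0.2, 0.1, 1.0],
]

picked = greedy_select(SIM, budget=2)
print(picked)
```

On this toy matrix the second pick skips the near-duplicate item 1 in favor of the distinct item 2, because covering an already-covered region adds little marginal gain.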

## 📋 Requirements

- Python 3.9+
- CUDA-compatible GPU (recommended)
- API keys for OpenAI and Weights & Biases

## 🛠️ Installation

1. **Clone the repository**:
```bash
git clone https://github.com/Ghost---Shadow/InSQuaD.git
cd InSQuaD
```

2. **Create conda environment** (recommended):
```bash
conda create -n InSQuaD python=3.9 -y
conda activate InSQuaD
```

3. **Install dependencies**:
```bash
./devops/install.sh
```

4. **Set up environment variables**:
Create a `.env` file in the root directory with your API keys:
```bash
OPENAI_API_KEY=your_openai_key_here
WANDB_API_KEY=your_wandb_key_here
```
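
Before launching a run, it can help to confirm that both keys are visible to the process. The sketch below uses only the standard library and assumes the variables have already been exported or loaded from `.env`; the helper `missing_keys` is hypothetical and not part of the repository.

```python
import os
from typing import Iterable, List, Mapping

# Variable names taken from the .env example above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "WANDB_API_KEY")


def missing_keys(env: Mapping[str, str], required: Iterable[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required variable names that are unset or empty in `env`."""
    return [key for key in required if not env.get(key)]


if __name__ == "__main__":
    missing = missing_keys(os.environ)
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required API keys are set.")
```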

## 🚦 Quick Start

### Running Experiments

1. **Single experiment**:
```bash
python src/train.py experiments/tests/InSQuaD_test_experiment.yaml
```

2. **Full experiment suite**:
```bash
sh run_all_experiments.sh
```

3. **Offline evaluation**:
```bash
python src/offline_eval.py path/to/experiment/config.yaml
```

### Configuration

See `experiments/` directory for configuration examples.

## 🧪 Testing

Run the test suite to ensure everything is working correctly:

```bash
# Test everything (some tests may fail on Windows)
python -m unittest discover -s src -p "*_test.py"

# Test specific modules
python -m unittest discover -s src.dataloaders -p "*_test.py"
python -m unittest discover -s src.dense_indexes -p "*_test.py"
python -m unittest discover -s src.shortlist_strategies -p "*_test.py"
python -m unittest discover -s src.subset_selection_strategies -p "*_test.py"
```
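
The discovery commands above pick up any file whose name matches `*_test.py`. As a rough illustration, a new test module could look like the following; the file name and the test itself are hypothetical and only show the naming convention.

```python
# Hypothetical file: src/example_test.py (the name only needs to match "*_test.py")
import unittest


class ExampleTest(unittest.TestCase):
    def test_sorted_is_stable(self):
        # Python's sort keeps the original order of items with equal keys.
        pairs = [("b", 1), ("a", 1), ("c", 0)]
        ordered = sorted(pairs, key=lambda p: p[1])
        self.assertEqual(ordered, [("c", 0), ("b", 1), ("a", 1)])
```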

## 🔧 Development

### Code Formatting

Format code using Black:
```bash
black .
```

### Project Structure

```
src/
├── dataloaders/ # Dataset loading and preprocessing
├── dense_indexes/ # FAISS and other dense retrieval indexes
├── generative_models/ # LLM wrappers (OpenAI, HuggingFace)
├── losses/ # Submodular loss functions
├── semantic_search_models/ # Embedding models
├── shortlist_strategies/ # Example selection strategies
├── subset_selection_strategies/ # Submodular optimization
└── training_strategies/ # Training loops and algorithms
```

## 📊 Supported Datasets

- **MRPC**: Microsoft Research Paraphrase Corpus
- **SST**: Stanford Sentiment Treebank (binary and 5-class)
- **MNLI**: Multi-Genre Natural Language Inference
- **DBPedia**: Topic classification over DBpedia ontology classes
- **RTE**: Recognizing Textual Entailment
- **HellaSwag**: Commonsense reasoning
- **XSum**: Abstractive (extreme) summarization
- **MultiWOZ**: Task-oriented dialogue
- **GeoQ**: Geographic question answering

## 🤖 Supported Models

### Generative Models
- OpenAI GPT models (GPT-3.5, GPT-4)
- HuggingFace transformers (Gemma, T5, etc.)
- Custom model implementations

### Semantic Search Models
- MPNet (all-mpnet-base-v2)
- Sentence Transformers
- Custom embedding models
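
Conceptually, a semantic search model maps text to vectors, and a shortlist is the set of nearest neighbors of a query vector. The sketch below shows that step with cosine similarity in plain Python. The toy vectors stand in for real embeddings, the helpers `cosine` and `top_k` are hypothetical, and a real run would use an embedding model and a dense index rather than this brute-force scan.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], corpus: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k corpus vectors most similar to the query."""
    scores = [cosine(query, vec) for vec in corpus]
    return sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:k]


# Hypothetical 2-D "embeddings" for three candidate examples and one query.
CORPUS = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
QUERY = [1.0, 0.1]

shortlist = top_k(QUERY, CORPUS, k=2)
print(shortlist)
```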

## 📈 Results and Analysis

The framework includes comprehensive analysis tools:

- **Performance Tables**: Automated LaTeX table generation
- **Visualization**: Plotting utilities for results analysis
- **Statistical Analysis**: Confidence intervals and significance tests
- **Time Analysis**: Efficiency comparisons across methods

Results are automatically logged to Weights & Biases for easy tracking and comparison.
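
As an illustration of the kind of confidence interval mentioned above, the following sketch computes a percentile bootstrap interval for the mean of a handful of per-run scores. It is an assumption about one reasonable approach, not the repository's analysis code; the helper `bootstrap_ci` and the sample scores are hypothetical.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple


def bootstrap_ci(
    values: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)  # seeded for reproducibility
    n = len(values)
    means: List[float] = sorted(
        mean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical accuracies from five runs of the same configuration.
scores = [0.70, 0.72, 0.68, 0.75, 0.71]
low, high = bootstrap_ci(scores)
print(f"mean={mean(scores):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

With only a few runs the interval is wide, so differences between methods are worth reading against it rather than against the means alone.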

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

Please ensure your code follows the existing style and includes appropriate tests.

## 📄 License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

## 📚 Citation

If you use this code in your research, please cite:

```bibtex
@inproceedings{insquad2025,
title={InSQuaD: In-Context Learning for Efficient Retrieval via Submodular Mutual Information to Enforce Quality and Diversity},
author={Nanda, Souradeep and Majee, Anay and Iyer, Rishab Krishnan},
booktitle={Proceedings of the 2025 IEEE International Conference on Data Mining (ICDM)},
year={2025},
organization={IEEE},
url={https://github.com/Ghost---Shadow/InSQuaD}
}
```

## 🆘 Support

For questions, issues, or feature requests, please open an issue on GitHub or contact the maintainers.
