{"id":30616476,"url":"https://github.com/ghost---shadow/insquad","last_synced_at":"2025-08-30T09:40:51.185Z","repository":{"id":311861986,"uuid":"781713629","full_name":"Ghost---Shadow/InSQuaD","owner":"Ghost---Shadow","description":"InSQuaD is a research framework for efficient in-context learning that leverages submodular mutual information to optimize the quality-diversity tradeoff in example selection for large language models","archived":false,"fork":false,"pushed_at":"2025-08-27T04:52:27.000Z","size":606,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-08-27T11:54:59.517Z","etag":null,"topics":["information-retrieval","large-language-models","submodular-optimization"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Ghost---Shadow.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-04-03T22:35:38.000Z","updated_at":"2025-08-27T04:52:31.000Z","dependencies_parsed_at":"2025-08-27T11:55:07.025Z","dependency_job_id":"9b0fd820-7a96-4404-bfc9-8b17f1b3dff7","html_url":"https://github.com/Ghost---Shadow/InSQuaD","commit_stats":null,"previous_names":["ghost---shadow/insquad"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/Ghost---Shadow/InSQuaD","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ghost---Shadow%2FInSQuaD","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ghost---Shadow%2FInSQuaD/tags"
,"releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ghost---Shadow%2FInSQuaD/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ghost---Shadow%2FInSQuaD/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Ghost---Shadow","download_url":"https://codeload.github.com/Ghost---Shadow/InSQuaD/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ghost---Shadow%2FInSQuaD/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272833294,"owners_count":25000870,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-30T02:00:09.474Z","response_time":77,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["information-retrieval","large-language-models","submodular-optimization"],"created_at":"2025-08-30T09:40:46.335Z","updated_at":"2025-08-30T09:40:51.149Z","avatar_url":"https://github.com/Ghost---Shadow.png","language":"Python","readme":"# InSQuaD: In-Context Learning for Efficient Retrieval via Submodular Mutual Information to Enforce Quality and Diversity\n\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n[![Python 
3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/release/python-390/)\n[![CI](https://github.com/Ghost---Shadow/InSQuaD/actions/workflows/python_ci.yml/badge.svg)](https://github.com/Ghost---Shadow/InSQuaD/actions/workflows/python_ci.yml)\n\nInSQuaD is a research framework for efficient in-context learning that leverages submodular mutual information to optimize the quality-diversity tradeoff in example selection for large language models. This implementation supports various retrieval methods, subset selection strategies, and generative models for comprehensive evaluation across multiple datasets.\n\n## 🚀 Features\n\n- **Submodular Optimization**: Implementation of facility location and graph cut losses for quality-diversity tradeoffs\n- **Multiple Retrieval Methods**: Support for semantic search models (MPNet, sentence transformers) and dense indexes (FAISS)\n- **Diverse Datasets**: Pre-configured loaders for MRPC, SST, MNLI, DBPedia, RTE, HellaSwag, XSum, MultiWOZ, and GeoQ\n- **Flexible Architecture**: Modular design supporting various generative models (OpenAI, HuggingFace transformers)\n- **Comprehensive Evaluation**: Built-in metrics and analysis tools for experimental evaluation\n- **Experiment Management**: YAML-based configuration system with Weights \u0026 Biases integration\n\n## 📋 Requirements\n\n- Python 3.9+\n- CUDA-compatible GPU (recommended)\n- Required API keys (OpenAI, Weights \u0026 Biases)\n\n## 🛠️ Installation\n\n1. **Clone the repository**:\n   ```bash\n   git clone https://github.com/Ghost---Shadow/InSQuaD.git\n   cd InSQuaD\n   ```\n\n2. **Create conda environment** (recommended):\n   ```bash\n   conda create -n InSQuaD python=3.9 -y\n   conda activate InSQuaD\n   ```\n\n3. **Install dependencies**:\n   ```bash\n   ./devops/install.sh\n   ```\n\n4. 
**Set up environment variables**:\n   Create a `.env` file in the root directory with your API keys:\n   ```bash\n   OPENAI_API_KEY=your_openai_key_here\n   WANDB_API_KEY=your_wandb_key_here\n   ```\n\n## 🚦 Quick Start\n\n### Running Experiments\n\n1. **Single experiment**:\n   ```bash\n   python src/train.py experiments/tests/InSQuaD_test_experiment.yaml\n   ```\n\n2. **Full experiment suite**:\n   ```bash\n   sh run_all_experiments.sh\n   ```\n\n3. **Offline evaluation**:\n   ```bash\n   python src/offline_eval.py path/to/experiment/config.yaml\n   ```\n\n### Configuration\n\nSee `experiments/` directory for configuration examples.\n\n## 🧪 Testing\n\nRun the test suite to ensure everything is working correctly:\n\n```bash\n# Test everything (some tests may fail on Windows)\npython -m unittest discover -s src -p \"*_test.py\"\n\n# Test specific modules\npython -m unittest discover -s src.dataloaders -p \"*_test.py\"\npython -m unittest discover -s src.dense_indexes -p \"*_test.py\"\npython -m unittest discover -s src.shortlist_strategies -p \"*_test.py\"\npython -m unittest discover -s src.subset_selection_strategies -p \"*_test.py\"\n```\n\n## 🔧 Development\n\n### Code Formatting\n\nFormat code using Black:\n```bash\nblack .\n```\n\n### Project Structure\n\n```\nsrc/\n├── dataloaders/          # Dataset loading and preprocessing\n├── dense_indexes/        # FAISS and other dense retrieval indexes  \n├── generative_models/    # LLM wrappers (OpenAI, HuggingFace)\n├── losses/              # Submodular loss functions\n├── semantic_search_models/ # Embedding models\n├── shortlist_strategies/ # Example selection strategies\n├── subset_selection_strategies/ # Submodular optimization\n└── training_strategies/  # Training loops and algorithms\n```\n\n## 📊 Supported Datasets\n\n- **MRPC**: Microsoft Research Paraphrase Corpus\n- **SST**: Stanford Sentiment Treebank (binary and 5-class)\n- **MNLI**: Multi-Genre Natural Language Inference\n- **DBPedia**: Database entity 
classification\n- **RTE**: Recognizing Textual Entailment\n- **HellaSwag**: Commonsense reasoning\n- **XSum**: Abstractive summarization\n- **MultiWOZ**: Task-oriented dialogue\n- **GeoQ**: Geographic question answering\n\n## 🤖 Supported Models\n\n### Generative Models\n- OpenAI GPT models (GPT-3.5, GPT-4)\n- HuggingFace transformers (Gemma, T5, etc.)\n- Custom model implementations\n\n### Semantic Search Models\n- MPNet (all-mpnet-base-v2)\n- Sentence Transformers\n- Custom embedding models\n\n## 📈 Results and Analysis\n\nThe framework includes comprehensive analysis tools:\n\n- **Performance Tables**: Automated LaTeX table generation\n- **Visualization**: Plotting utilities for results analysis  \n- **Statistical Analysis**: Confidence intervals and significance tests\n- **Time Analysis**: Efficiency comparisons across methods\n\nResults are automatically logged to Weights \u0026 Biases for easy tracking and comparison.\n\n## 🤝 Contributing\n\n1. Fork the repository\n2. Create a feature branch (`git checkout -b feature/amazing-feature`)\n3. Commit your changes (`git commit -m 'Add some amazing feature'`)\n4. Push to the branch (`git push origin feature/amazing-feature`)\n5. 
Open a Pull Request\n\nPlease ensure your code follows the existing style and includes appropriate tests.\n\n## 📄 License\n\nThis project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.\n\n## 📚 Citation\n\nIf you use this code in your research, please cite:\n\n```bibtex\n@inproceedings{insquad2025,\n  title={InSQuaD: In-Context Learning for Efficient Retrieval via Submodular Mutual Information to Enforce Quality and Diversity},\n  author={Nanda, Souradeep and Majee, Anay and Iyer, Rishab Krishnan},\n  booktitle={Proceedings of the 2025 IEEE International Conference on Data Mining (ICDM)},\n  year={2025},\n  organization={IEEE},\n  url={https://github.com/Ghost---Shadow/InSQuaD}\n}\n```\n\n## 🆘 Support\n\nFor questions, issues, or feature requests, please open an issue on GitHub or contact the maintainers.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fghost---shadow%2Finsquad","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fghost---shadow%2Finsquad","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fghost---shadow%2Finsquad/lists"}