https://github.com/NoviScl/MoRE
- Host: GitHub
- URL: https://github.com/NoviScl/MoRE
- Owner: NoviScl
- Created: 2023-05-23T20:40:16.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2023-10-16T01:12:23.000Z (over 1 year ago)
- Last Synced: 2023-10-16T20:41:06.472Z (over 1 year ago)
- Language: Python
- Size: 882 KB
- Stars: 6
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- Awesome-LLM-Ensemble
README
# Getting MoRE out of Mixture of Language Model Reasoning Experts (EMNLP 2023 Findings)
This repository contains the code and data for running the experiments in our paper. See below for detailed instructions on running the code.
## Data
All the model prediction data can be downloaded from [this link](https://drive.google.com/file/d/1GYF-dq9N5XFd3w97AQO_fR5ArCQpwrpC/view?usp=sharing).
Once you download it, unzip it and put it under the `uniqa_predictions_final` folder. It contains two subsets: one for the dev set and one for the test set. All our evaluation results are based on the test sets. Each subset should contain the experts' (and the dataset-specific few-shot baseline's) predictions on all 12 datasets used in our paper.
## Training the Router
You can run `python3 feature_classifier.py` to train the random forest router and run inference to score all predictions. For ablations, set `agreement = False` to exclude the inter-expert agreement features, or set `qonly = True` to train a router that uses only the question features (see the paper for more details).
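Purely for illustration, here is a minimal sketch of what a random-forest router of this kind looks like. The feature names, shapes, and flag handling below are assumptions for the example, not the actual `feature_classifier.py` implementation.

```python
# Illustrative sketch only: train a random forest to predict whether an
# expert's answer is correct, from question features plus (optionally)
# inter-expert agreement features. All data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical training data.
question_feats = rng.normal(size=(1000, 64))           # e.g. question representations
agreement_feats = rng.integers(0, 2, size=(1000, 4))   # e.g. pairwise expert agreement
labels = rng.integers(0, 2, size=1000)                 # 1 = this expert's answer is correct

agreement = True   # set False to ablate the inter-expert agreement features
qonly = False      # set True to use question features only

if qonly or not agreement:
    X = question_feats
else:
    X = np.hstack([question_feats, agreement_feats])

router = RandomForestClassifier(n_estimators=100, random_state=0)
router.fit(X, labels)

# Router score for each (question, expert) pair = predicted probability of correctness.
scores = router.predict_proba(X)[:, 1]
```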
## Generalizability Evaluation
Once you run inference and save the router scores (which we already provide in `feature_classifiers`), you can run `python3 ensemble.py` to reproduce all the results reported in Table 1. The default method is `classifier`, which uses the router classifier's scores for answer selection; you can also set other methods for comparison.
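For illustration, the sketch below shows what the `classifier` selection strategy amounts to: for each question, return the answer from the expert with the highest router score. The dictionary layout and expert names are hypothetical, not the actual format used by `ensemble.py`.

```python
# Illustrative sketch only: score-based answer selection across experts.
def select_answers(router_scores, expert_answers):
    """router_scores: {expert: [score per question]},
    expert_answers: {expert: [answer per question]}."""
    experts = list(router_scores)
    n = len(next(iter(router_scores.values())))
    selected = []
    for i in range(n):
        best = max(experts, key=lambda e: router_scores[e][i])
        selected.append(expert_answers[best][i])
    return selected

# Hypothetical example with three experts and two questions:
scores = {"cot": [0.9, 0.2], "pot": [0.4, 0.7], "retrieval": [0.1, 0.3]}
answers = {"cot": ["42", "Paris"], "pot": ["41", "London"], "retrieval": ["40", "Rome"]}
print(select_answers(scores, answers))  # ['42', 'London']
```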
## Selective QA Evaluation
For the selective QA evaluation, run `python3 abstention.py`. You can score predictions with either MaxProb or the router's score by setting `method` accordingly, and choose the `metric` among `AUC`, `Cov@80`, and `Cov@90` in the `all_metric` function. Use the `ER_metric` function to compute effective reliability, which first searches for a threshold on the dev set.
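As a rough illustration, the sketch below computes a coverage-at-accuracy metric and effective reliability from per-question confidence scores (MaxProb or router scores) and 0/1 correctness labels; the exact definitions and interfaces in `abstention.py` may differ.

```python
# Illustrative sketch only: common selective-QA metrics from confidence scores.
import numpy as np

def coverage_at_accuracy(confidence, correct, target_acc=0.80):
    """Max fraction of questions that can be answered (highest-confidence
    first) while keeping accuracy >= target_acc (e.g. Cov@80)."""
    order = np.argsort(-np.asarray(confidence))
    correct = np.asarray(correct, dtype=float)[order]
    acc = np.cumsum(correct) / np.arange(1, len(correct) + 1)
    covered = np.where(acc >= target_acc)[0]
    return 0.0 if len(covered) == 0 else (covered[-1] + 1) / len(correct)

def effective_reliability(confidence, correct, threshold):
    """+1 for a correct answer, -1 for a wrong answer, 0 when abstaining
    (confidence below the threshold); averaged over all questions."""
    confidence = np.asarray(confidence)
    correct = np.asarray(correct, dtype=float)
    answered = confidence >= threshold
    return float(np.mean(np.where(answered, 2 * correct - 1, 0.0)))
```

In practice the threshold for effective reliability would be chosen on the dev set (the value that maximizes the metric there) and then applied to the test set.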
## Citation
```bibtex
@article{Si:Shi:Zhao:Zettlemoyer:Boyd-Graber-2023,
Title = {Getting \underline{MoRE} out of \underline{M}ixture \underline{o}f Language Model \underline{R}easoning \underline{E}xperts},
Author = {Chenglei Si and Weijia Shi and Chen Zhao and Luke Zettlemoyer and Jordan Lee Boyd-Graber},
Journal = {Findings of Empirical Methods in Natural Language Processing},
Year = {2023},
Location = {Singapore},
}
```

If you have any questions about the code or paper, feel free to email Chenglei ([email protected]).