https://github.com/NoviScl/MoRE
- Host: GitHub
- URL: https://github.com/NoviScl/MoRE
- Owner: NoviScl
- Created: 2023-05-23T20:40:16.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2023-10-16T01:12:23.000Z (over 1 year ago)
- Last Synced: 2023-10-16T20:41:06.472Z (over 1 year ago)
- Language: Python
- Size: 882 KB
- Stars: 6
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- Awesome-LLM-Ensemble
README
# Getting MoRE out of Mixture of Language Model Reasoning Experts (EMNLP 2023 Findings)
This repository contains the code and data for running the experiments in our paper. See below for detailed instructions on running the code.
## Data
All the model prediction data can be downloaded from [this link](https://drive.google.com/file/d/1GYF-dq9N5XFd3w97AQO_fR5ArCQpwrpC/view?usp=sharing).
Once you download it, unzip it and put it under the `uniqa_predictions_final` folder. It contains two subsets: one for the dev set and one for the test set. All our evaluation results are based on the test sets. Each subset should contain the experts' (and the dataset-specific few-shot baseline's) predictions on all 12 datasets used in our paper.
## Training the Router
You can run `python3 feature_classifier.py` to train the random forest router and run inference to score all predictions. For ablations, set `agreement = False` to exclude the inter-expert agreement features, or set `qonly = True` to train a router that uses only the question features (see the paper for more details).
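Purely for illustration, here is a minimal sketch of what a random-forest router of this kind looks like. The feature names, shapes, and flag handling below are assumptions for the example, not the actual `feature_classifier.py` implementation.

```python
# Illustrative sketch only: train a random forest to predict whether an
# expert's answer is correct, from question features plus (optionally)
# inter-expert agreement features. All data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical training data.
question_feats = rng.normal(size=(1000, 64))           # e.g. question representations
agreement_feats = rng.integers(0, 2, size=(1000, 4))   # e.g. pairwise expert agreement
labels = rng.integers(0, 2, size=1000)                 # 1 = this expert's answer is correct

agreement = True   # set False to ablate the inter-expert agreement features
qonly = False      # set True to use question features only

if qonly or not agreement:
    X = question_feats
else:
    X = np.hstack([question_feats, agreement_feats])

router = RandomForestClassifier(n_estimators=100, random_state=0)
router.fit(X, labels)

# Router score for each (question, expert) pair = predicted probability of correctness.
scores = router.predict_proba(X)[:, 1]
```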
## Generalizability Evaluation
Once you run inference and save the router scores (which we already provide in `feature_classifiers`), you can run `python3 ensemble.py` to reproduce all the results reported in Table 1. The default method is `classifier`, which uses the router classifier's scores for answer selection; you can also set other methods for comparison.
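For illustration, the sketch below shows what the `classifier` selection strategy amounts to: for each question, return the answer from the expert with the highest router score. The dictionary layout and expert names are hypothetical, not the actual format used by `ensemble.py`.

```python
# Illustrative sketch only: score-based answer selection across experts.
def select_answers(router_scores, expert_answers):
    """router_scores: {expert: [score per question]},
    expert_answers: {expert: [answer per question]}."""
    experts = list(router_scores)
    n = len(next(iter(router_scores.values())))
    selected = []
    for i in range(n):
        best = max(experts, key=lambda e: router_scores[e][i])
        selected.append(expert_answers[best][i])
    return selected

# Hypothetical example with three experts and two questions:
scores = {"cot": [0.9, 0.2], "pot": [0.4, 0.7], "retrieval": [0.1, 0.3]}
answers = {"cot": ["42", "Paris"], "pot": ["41", "London"], "retrieval": ["40", "Rome"]}
print(select_answers(scores, answers))  # ['42', 'London']
```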
## Selective QA Evaluation
For the selective QA evaluation, run `python3 abstention.py`. You can score predictions with either MaxProb or the router's score by setting `method` accordingly, and choose the `metric` among `AUC`, `Cov@80`, and `Cov@90` in the `all_metric` function. Use the `ER_metric` function to compute effective reliability, which first searches for a threshold on the dev set.
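As a rough illustration, the sketch below computes a coverage-at-accuracy metric and effective reliability from per-question confidence scores (MaxProb or router scores) and 0/1 correctness labels; the exact definitions and interfaces in `abstention.py` may differ.

```python
# Illustrative sketch only: common selective-QA metrics from confidence scores.
import numpy as np

def coverage_at_accuracy(confidence, correct, target_acc=0.80):
    """Max fraction of questions that can be answered (highest-confidence
    first) while keeping accuracy >= target_acc (e.g. Cov@80)."""
    order = np.argsort(-np.asarray(confidence))
    correct = np.asarray(correct, dtype=float)[order]
    acc = np.cumsum(correct) / np.arange(1, len(correct) + 1)
    covered = np.where(acc >= target_acc)[0]
    return 0.0 if len(covered) == 0 else (covered[-1] + 1) / len(correct)

def effective_reliability(confidence, correct, threshold):
    """+1 for a correct answer, -1 for a wrong answer, 0 when abstaining
    (confidence below the threshold); averaged over all questions."""
    confidence = np.asarray(confidence)
    correct = np.asarray(correct, dtype=float)
    answered = confidence >= threshold
    return float(np.mean(np.where(answered, 2 * correct - 1, 0.0)))
```

In practice the threshold for effective reliability would be chosen on the dev set (the value that maximizes the metric there) and then applied to the test set.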
## Citation
```bibtex
@article{Si:Shi:Zhao:Zettlemoyer:Boyd-Graber-2023,
Title = {Getting \underline{MoRE} out of \underline{M}ixture \underline{o}f Language Model \underline{R}easoning \underline{E}xperts},
Author = {Chenglei Si and Weijia Shi and Chen Zhao and Luke Zettlemoyer and Jordan Lee Boyd-Graber},
Journal = {Findings of Empirical Methods in Natural Language Processing},
Year = {2023},
Location = {Singapore},
}
```

If you have any questions about the code or paper, feel free to email Chenglei ([email protected]).