https://github.com/tanyuqian/cappy

NeurIPS 2023 - Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer
Host: GitHub
URL: https://github.com/tanyuqian/cappy
Owner: tanyuqian
License: apache-2.0
Created: 2023-11-11T00:37:28.000Z (almost 2 years ago)
Default Branch: master
Last Pushed: 2024-03-29T05:55:03.000Z (over 1 year ago)
Last Synced: 2025-03-28T01:53:36.129Z (7 months ago)
Language: Python
Homepage:
Size: 132 KB
Stars: 41
Watchers: 2
Forks: 4
Open Issues: 3
Metadata Files:
- Readme: README.md

# Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer

This repo contains the code for the following paper:

**Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer** \
Bowen Tan, Yun Zhu, Lijuan Liu, Eric Xing, Zhiting Hu, Jindong Chen \
NeurIPS 2023 \
[[arXiv]](https://arxiv.org/pdf/2311.06720.pdf)  [[Model Card (btan2/cappy-large)]](https://huggingface.co/btan2/cappy-large)

## Getting Started

* Cappy is a pretrained small scorer designed to enhance the performance and efficiency of multi-task LLMs.
* Cappy takes an instruction and a candidate response as input and produces a score between 0 and 1, an estimate of how correct the response is with respect to the instruction.
* With merely 360 million parameters, Cappy either functions independently on classification tasks or serves as an auxiliary component for LLMs, boosting their performance.
* Cappy also enables efficient integration of downstream supervision without requiring LLM finetuning or access to the LLM's parameters.
* Furthermore, Cappy can be combined with other LLM adaptations, including finetuning, in-context learning, and prompt tuning, for additional performance gains.

Now, Cappy can be loaded with `transformers` either as a Jax/Flax model or a PyTorch model.

### Jax/Flax

```python
from transformers import AutoTokenizer, FlaxAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('btan2/cappy-large')
cappy = FlaxAutoModelForSequenceClassification.from_pretrained('btan2/cappy-large')

# Raw string: the backslashes are part of the article text, not escape sequences.
instruction = r"""
What label best describes this news article?
Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.
"""
response = 'Business'

# Flax models take NumPy/JAX arrays, so request NumPy tensors rather than PyTorch ones.
inputs = tokenizer([(instruction, response), ], return_tensors='np')
score = cappy(**inputs).logits[0][0].item()
```

### PyTorch

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('btan2/cappy-large')
cappy = AutoModelForSequenceClassification.from_pretrained('btan2/cappy-large')

# Raw string: the backslashes are part of the article text, not escape sequences.
instruction = r"""
What label best describes this news article?
Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.
"""
response = 'Business'

# Tokenize the (instruction, response) pair and read the single regression logit as the score.
inputs = tokenizer([(instruction, response), ], return_tensors='pt')
score = cappy(**inputs).logits[0][0].item()
```
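Because Cappy scores one (instruction, response) pair at a time, it can act as a classifier on its own by scoring every candidate label and keeping the highest-scoring one. The sketch below illustrates that idea under stated assumptions: `rank_candidates` and `make_cappy_score_fn` are hypothetical helper names that are not part of this repo, and the candidate labels in the usage comment are made up for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def rank_candidates(
    instruction: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score every candidate response and return them sorted best-first."""
    scored = [(c, float(score_fn(instruction, c))) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def make_cappy_score_fn(model_name: str = 'btan2/cappy-large') -> Callable[[str, str], float]:
    """Hypothetical helper: wrap the PyTorch checkpoint as a (instruction, response) -> score function."""
    # Imported lazily so that rank_candidates can be used with any scorer.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    cappy = AutoModelForSequenceClassification.from_pretrained(model_name)
    cappy.eval()

    def score_fn(instruction: str, response: str) -> float:
        inputs = tokenizer([(instruction, response)], return_tensors='pt')
        with torch.no_grad():
            return cappy(**inputs).logits[0][0].item()

    return score_fn


# Usage (downloads the checkpoint on first call; labels are illustrative):
# score_fn = make_cappy_score_fn()
# ranked = rank_candidates(instruction, ['Business', 'Sports', 'World politics'], score_fn)
# best_label = ranked[0][0]
```

The same ranking step applies when the candidates are samples generated by an LLM rather than class labels.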

Below are the scripts to reproduce the experiments in the paper.

## Requirements

Cappy's pretraining and finetuning are both based on [Redco](https://github.com/tanyuqian/redco), a lightweight tool that automates distributed training on both GPUs and TPUs.

To install Redco:

```shell
pip install redco==0.4.13
```

Sometimes the Jax version needs to be adjusted for your device and environment. Here are some [instructions](https://github.com/tanyuqian/redco#adjust-jax--flax-versions).

To install the other requirements:

```shell
pip install -r requirements.txt
```

## Pretraining Cappy

Cappy's pretraining uses the code from [this example](https://github.com/tanyuqian/redco/tree/master/examples/classification_regression) in Redco. We will release Cappy's pretraining data soon.

## Evaluating Cappy on PromptSource (zero-shot)

### Download Test Data

Following the setting from the [OPT-IML paper](https://arxiv.org/pdf/2212.12017.pdf) (Section 5.2), we conduct zero-shot evaluation on 11 held-out classification tasks from PromptSource.

```shell
bash scripts/download_promptsource_test_data.sh
```

### Running Cappy

```shell
python cappy_promptsource.py --model_name_or_path btan2/cappy-large
```

### Results

|             | OPT 30B | OPT-IML 30B | OPT 175B | OPT-IML 175B | T0 11B | Cappy (ours, 0.36B) |
|------------:|:-------:|:-----------:|:--------:|:------------:|:------:|:-------------------:|
|     ANLI R1 |   33.7  |     37.1    |   34.1   |     42.2     |  42.1  |        34.3         |
|     ANLI R2 |   34.1  |     35.4    |   34.1   |     38.5     |  37.9  |        33.9         |
|     ANLI R3 |   34.7  |     36.6    |   34.7   |     39.6     |  39.7  |        34.7         |
|          CB |   24.6  |     43.2    |   38.9   |     56.4     |  58.5  |        59.4         |
|         RTE |   56.4  |     67.8    |   54.0   |     73.4     |  80.2  |        71.9         |
|  StoryCloze |   55.5  |     90.7    |   57.0   |     95.0     |  96.7  |        93.7         |
|         WSC |   43.5  |     58.2    |   51.0   |     59.2     |  58.6  |        63.8         |
|         WiC |   50.8  |     54.7    |   49.7   |     53.6     |  56.0  |        51.9         |
|  Winogrande |   50.2  |     53.4    |   50.1   |     56.6     |  62.5  |        51.7         |
|  WinoGender |   54.9  |     64.6    |   53.9   |     72.7     |  83.8  |        68.9         |
| Crows-Pairs |   85.5  |     22.3    |   85.5   |     34.4     |  24.0  |        57.8         |
| **Average** |   47.6  |     51.3    |   49.3   |     56.5     |  58.2  |        56.6         |

Baseline results come from the [OPT-IML paper](https://arxiv.org/pdf/2212.12017.pdf) (Section 5.2).

## Boosting FLAN-T5 with Cappy on Big-Bench Tasks

### Getting Big-Bench Tasks

We take all 45 generative tasks from Big-Bench in our experiment. The command below processes the tasks into `.jsonl` format.

```shell
python scripts/get_bigbench_data.py
```

The processed datasets can be found in `./bigbench_data`, where `./bigbench_data/subset_names.json` records all the task names.

### Getting FLAN-T5 Outputs

We collect generated outputs (as well as log-likelihoods on evaluation sets) from FLAN-T5 models (from `-small` to `-xxl`). They can be downloaded with

```shell
bash scripts/download_bigbench_flan_gens.sh
```

If you want to generate the outputs yourself and/or adjust the generation settings, the generation code below supports distributed inference across multiple GPUs (in case the model is too large to fit on a single GPU, e.g., `FLAN-T5-XXL (11B)`).

```shell
python scripts/bigbench_flan_generate.py \
  --model_name_or_path google/flan-t5-xl \
  --n_model_shards 4
```

where `--n_model_shards` refers to the number of shards you want to split the large model into (it's usually the number of GPUs on your device if it's not 1).

### Adapting Cappy to boost FLAN-T5

```shell
XLA_PYTHON_CLIENT_MEM_FRACTION=.95 python cappy_bigbench.py \
  --model_name_or_path btan2/cappy-large \
  --bigbench_subset_name auto_categorization \
  --bigbench_gen_model flan-t5-xxl \
  --train_size 102400
```

* `XLA_PYTHON_CLIENT_MEM_FRACTION=.95`: sets the fraction of GPU memory that Jax preallocates (useful when you run out of GPU memory); see [here](https://jax.readthedocs.io/en/latest/gpu_memory_allocation.html) for more details.

* `--bigbench_subset_name`: the name of subset from Big-Bench (see `./bigbench_data/subset_names.json` for all of them).

* `--bigbench_gen_model`: the FLAN model to be boosted. 

* `--train_size`: the target data size to construct for Cappy's finetuning on the task (collect FLAN outputs, and then truncate or repeat). 

See `def main(...)` in [cappy_bigbench.py](cappy_bigbench.py) for all the arguments.
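The "truncate or repeat" behavior described for `--train_size` can be pictured with a minimal sketch. This is an illustration under assumptions, not the code in `cappy_bigbench.py`: `resize_to_train_size` is a hypothetical name, and the actual script may, for example, shuffle before repeating.

```python
import itertools
from typing import List, Sequence, TypeVar

T = TypeVar('T')


def resize_to_train_size(examples: Sequence[T], train_size: int) -> List[T]:
    """Return exactly `train_size` examples by truncating or cycling through `examples`."""
    if not examples:
        raise ValueError('examples must be non-empty')
    if train_size < 0:
        raise ValueError('train_size must be non-negative')
    # cycle() repeats the data indefinitely; islice() cuts it at the target size,
    # which also covers the truncation case when train_size < len(examples).
    return list(itertools.islice(itertools.cycle(examples), train_size))


repeated = resize_to_train_size(['a', 'b', 'c'], 7)   # more than available: repeat
truncated = resize_to_train_size(['a', 'b', 'c'], 2)  # fewer than available: truncate
```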

Each subtask takes 40 minutes to run on a single A10G GPU. The result is logged in `./bigbench_cappy_results/{flan_model}/{subset_name}.json`.

To run all the Big-Bench subsets at once:

```shell
python scripts/run_cappy_bigbench.py --cuda_idx 0
```

### Results

To present the baseline results, run `python scripts/present_bigbench_baselines.py`.

To present Cappy's results on all 45 Big-Bench subtasks, run

`python scripts/present_cappy_bigbench_results.py --gen_model_name flan-t5-xxl`

The numbers reported in the paper were produced on TPU machines. Here we provide our reproduction results on A10G GPUs in `./bigbench_cappy_results`. The gap between them is slight (`ΔrougeL <= 0.8`).

|                      | flan-t5-small | flan-t5-base | flan-t5-large | flan-t5-xl  | flan-t5-xxl |
|----------------------|---------------|--------------|---------------|-------------|-------------|
| Beam Search (beam=4) | 16.4025       | 19.8594      | 23.4802       | 26.1177     | 29.6608     |
| Sampling             | 11.4317       | 15.7909      | 19.6248       | 23.2191     | 25.7273     |
| Temperature (t=0.9)  | 12.0126       | 17.0571      | 20.0481       | 24.2702     | 27.0985     |
| Topk (k=40)          | 11.5157       | 15.7481      | 19.7634       | 22.6692     | 25.8226     |
| Nucleus (p=0.95)     | 11.9171       | 16.6174      | 20.1986       | 24.1654     | 26.9036     |
| Self-Score (sum)     | 15.0806       | 20.711       | 24.1224       | 28.4665     | 32.0156     |
| Self-Score (mean)    | 16.4223       | 20.1317      | 23.7828       | 26.7694     | 30.246      |
| **Cappy (ours)**     | **23.6543**   | **27.6178**  | **30.3802**   | **33.2775** | **37.1678** |
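A minimal sketch of how candidate selection could differ across the last three rows, assuming that Self-Score (sum) and Self-Score (mean) rank the sampled candidates by the generating model's summed and averaged token log-likelihoods, while the Cappy row ranks them by an external scorer. The function names and the toy numbers are illustrative and are not taken from the repo or the paper.

```python
from typing import Sequence


def self_score(token_logprobs: Sequence[float], reduce: str = 'sum') -> float:
    """Aggregate a candidate's token log-likelihoods by sum or by mean."""
    if reduce not in ('sum', 'mean'):
        raise ValueError("reduce must be 'sum' or 'mean'")
    total = sum(token_logprobs)
    return total if reduce == 'sum' else total / len(token_logprobs)


def select_best(candidates: Sequence[str], scores: Sequence[float]) -> str:
    """Return the candidate with the highest score."""
    best_idx = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_idx]


# Toy data: two sampled candidates with made-up token log-likelihoods.
candidates = ['yes', 'yes, it is']
token_logprobs = [[-1.0], [-0.4, -0.4, -0.4]]

# Summing penalizes longer candidates; averaging normalizes by length.
by_sum = select_best(candidates, [self_score(lp, 'sum') for lp in token_logprobs])
by_mean = select_best(candidates, [self_score(lp, 'mean') for lp in token_logprobs])

# Hypothetical scores in [0, 1] from an external scorer such as Cappy.
external_scores = [0.2, 0.7]
by_scorer = select_best(candidates, external_scores)
```

With these toy numbers, the sum criterion keeps the shorter candidate and the mean criterion keeps the longer one, which shows why the two Self-Score rows can disagree.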

## Acknowledgement

*Cappy* is Mario's ally throughout Super Mario Odyssey and assists him in various ways. We thank Nintendo for the nice game!

![](imgs/cappy.jpg)