https://github.com/amzn/bayespe

Zero-shot and in-context learning classification with LLMs and uncertainty estimation using multiple prompts.
Host: GitHub
URL: https://github.com/amzn/bayespe
Owner: amzn
License: apache-2.0
Created: 2024-07-31T09:21:49.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-07-31T09:24:05.000Z (almost 2 years ago)
Last Synced: 2025-10-25T08:11:13.275Z (8 months ago)
Topics: bayesian, bayespe, llms, prompting, prompts, uncertainty-quantification
Language: Python
Homepage: https://www.amazon.science/publications/bayesian-prompt-ensembles-model-uncertainty-estimation-for-black-box-large-language-models
Size: 258 KB
Stars: 4
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md

          
## Description

This package implements the method described and evaluated in the paper "Bayesian Prompt Ensembles: Model Uncertainty Estimation for Black-Box Large Language Models".

Bayesian Prompt Ensembles (BayesPE) is a method to combine multiple semantically equivalent prompts to obtain well-calibrated output probabilities with Large Language Models. The package includes tools to i) perform classification through prompting with LLMs and ii) use the BayesPE approach to ensemble multiple prompts, improving calibration performance.

Below you will find a comprehensive tutorial divided into four self-contained parts:

1. **Zero-Shot Classification with an LLM:** Use an LLM to perform classification through prompting.

2. **Few-Shot Classification with an LLM:** Use an LLM to perform classification through in-context learning, providing a few labelled examples in the prompt.

3. **BayesPE for Zero-Shot Classification:** Use BayesPE to ensemble different semantically equivalent prompts to perform classification with an LLM.

4. **BayesPE for Few-Shot Classification:** Use BayesPE to ensemble different semantically equivalent prompts and in-context examples to perform classification with an LLM.

If you use this package, please cite our paper: https://www.amazon.science/publications/bayesian-prompt-ensembles-model-uncertainty-estimation-for-black-box-large-language-models

```bibtex
@article{tonolini2024bayesian,
  title={Bayesian prompt ensembles: Model uncertainty estimation for black-box large language models},
  author={Tonolini, Francesco and Massiah, Jordan and Aletras, Nikolaos and Kazai, Gabriella},
  journal={Association for Computational Linguistics},
  year={2024}
}
```

If you have questions or need help, don't hesitate to get in touch: tonolini@amazon.com

## Installation

Copy this package to where you need it, then do the following:

1) move to the package's directory

```console

cd BayesPE

```

2) create a Python environment

```console

conda create --name bayespe python=3.10

```

3) activate the environment

```console

conda activate bayespe

```

4) install requirements

```console

pip install -r requirements.txt

```

5) install Huggingface CLI

```console

pip install -U "huggingface_hub[cli]"

```

6) log in to Huggingface (for access to LLMs) and enter your token when prompted.

```console

huggingface-cli login

```

And you are good to go!

## Example 1: Zero-Shot Classification with an LLM

Here is a simple example of classifying text with an LLM

using the package.

#### Imports

General imports:

```python

import sys

import os

import pandas as pd

```

Add the src directory to the path:

```python

path_to_package = os.path.split(os.path.split(__file__)[0])[0]

sys.path.append(os.path.join(path_to_package, 'src'))

```

Import relevant classes and scripts from src:

```python

from llm_model import LLM  # class for LLM wrapper

from llm_classifier import LLMClassifier  # class for classifier using LLMs

import evaluation  # evaluation functions

```

#### Load Data

We will be using sentiment classification of Amazon reviews

for appliances, where reviews are to be classified as either

positive or negative:

```python

df = pd.read_csv('data/amazon_reviews/test.csv', sep='\t')  # pandas DataFrame containing text strings and integer labels

```

Let's take 200 examples to classify, including text inputs and

numeric ground truth labels to compare with after inference:

```python

n_test = 200

df_test = df[:n_test]  # test split

samples_test = df_test['text'].values  # text inputs

gt_labels_test = df_test['ground_truth_label'].values.astype(int)  # classes ground-truths as integers

```

#### LLM and Prompt Formatting

Now we can call the LLM wrapper class to load the LLM of choice from Huggingface. In this example we will use "mistralai/Mistral-7B-Instruct-v0.3", a 7B instruction fine-tuned model from Mistral AI:

```python

llm = LLM(model_name="mistralai/Mistral-7B-Instruct-v0.3", use_reduced_precision=True)

```

We have used the "use_reduced_precision=True" argument, which will load

the model at bfloat16 precision, reducing memory requirements and making

the model much faster to run. For better performance, but higher compute

and memory, you can set this parameter to "False" or leave it as default.
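As a rough back-of-the-envelope illustration of the memory saving, the sketch below estimates the storage needed for the weights alone at 32-bit and 16-bit precision. The parameter count is a nominal assumption for a "7B" model, full precision is assumed to mean 32-bit floats, and the estimate ignores activations and framework overhead.

```python
# Rough estimate of weight memory at different precisions.
# N_PARAMS is a nominal assumption for a "7B" model, not an exact count.
N_PARAMS = 7_000_000_000
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gb(n_params, dtype):
    """Memory needed to store the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print("{}: ~{:.0f} GB".format(dtype, weight_memory_gb(N_PARAMS, dtype)))
```

Under these assumptions, halving the bytes per parameter halves the memory needed for the weights.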

Now we need some formatting functions and wrapping text to construct our prompts and to look for the right words in the output. These are specific to the task and can be defined in a class or a separate script. This class/script must have the following objects:

```python

class PromptFormatting(object):
    def __init__(self):

        # 1) an instruction sentence (stored on the instance so the classifier can read it)
        self.INSTRUCTION = 'classify the sentiment of the Amazon review below into one of the following classes:'

        # 2) The words identifying the classes. In this case
        # 0 = negative and 1 = positive.
        self.CLASSES = [
            'negative',
            'positive'
        ]

        # 3) The list of options that will be given to the LLM
        # in the prompt (classes words in a numbered list)
        self.CLASSES_TEXT = '''1. {}
2. {}'''.format(self.CLASSES[0], self.CLASSES[1])

    def format_instruction(self, instruction):
        # 4) function which, given the instruction sentence,
        # will put it together with the options list
        prompt = '''{}
{}
'''.format(instruction, self.CLASSES_TEXT)
        return prompt

    def format_content(self, content):
        # 5) formatting the text to be classified with a header and
        # the prompt to answer with one of the options. In this
        # case, the inputs are reviews.
        prompt = '''review: {}
the review is '''.format(content)
        return prompt


prompt_formatting = PromptFormatting()

 ```

You can play around with the objects in the 

class above to construct your prompts differently.

You can use this general format for any task.

Now we initialise the LLM classifier, which we can use to infer class

probabilities leveraging the LLM for prompting:

```python

classifier = LLMClassifier(model=llm, prompt_formatting=prompt_formatting)

```

The LLMClassifier class has a function to print out what the prompts will

look like and make sure it all looks ok:

```python

classifier.print_prompt_example()

```

This will return:

```console

classify the sentiment of the Amazon review below into one of the following classes:

1. negative

2. positive

review: 

the review is 

```

#### Classify Examples

Now that we have our prompts and our LLM ready, we can run classification

on our set of 200 examples. The function "soft_labels_batch" will run classification

using the LLM for all inputs in the list "input_texts" and return class probabilities:

```python

output_probs = classifier.soft_labels_batch(input_texts=samples_test)

```

"output_probs" is a 2D n_samples x n_classes array containing the predicted class probabilities.

We can have a look at a few examples:

```python

print(output_probs[:10, :])

```

This returns something similar to the following:

```console

[[9.97527377e-01 2.47262316e-03]

 [4.13993755e-08 9.99999959e-01]

 [3.05902227e-07 9.99999694e-01]

 [1.12535162e-07 9.99999887e-01]

 [9.93307149e-01 6.69285092e-03]

 [9.82013790e-01 1.79862100e-02]

 [9.97527377e-01 2.47262316e-03]

 [1.12535162e-07 9.99999887e-01]

 [9.82013790e-01 1.79862100e-02]

 [3.05902227e-07 9.99999694e-01]]

```

This output is an array giving, for each input sample, the probability of each of the two classes (negative and positive) as inferred by the LLM.
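If you need hard predictions rather than probabilities, you can take the most probable class per row. The sketch below uses a small made-up array in the same n_samples x n_classes layout; the variable names are illustrative and not part of the package.

```python
import numpy as np

# Toy probabilities in the same n_samples x n_classes layout as output_probs.
toy_probs = np.array([
    [0.9975, 0.0025],
    [0.0001, 0.9999],
    [0.4000, 0.6000],
])
classes = ['negative', 'positive']  # index 0 = negative, 1 = positive

predicted_idx = np.argmax(toy_probs, axis=1)  # most probable class per row
predicted_names = [classes[i] for i in predicted_idx]
confidence = toy_probs.max(axis=1)  # probability of the predicted class

print(predicted_idx)
print(predicted_names)
```

The `confidence` values are the quantities that calibration metrics such as ECE compare against accuracy.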

#### Evaluate

Now we can test performance, using the evaluation scripts. For example,

we can look at f1-score for classification performance and ECE for calibration:

```python

f1_score = evaluation.compute_metric(gt_labels_test, output_probs, metric='f1')

ece = evaluation.compute_metric(gt_labels_test, output_probs, metric='ece')

print('f1-score: {}, ECE: {}'.format(f1_score, ece))

```

This will return something similar to:

```console

f1-score: 0.8897243107769424, ECE: 0.08265417069196701

```

With the "compute_metric" function you can compute the following metrics:

| metric | returns |
|:--------|:---------|
| 'f1' | macro f1-score |
| 'acc' | classification accuracy |
| 'nll' | negative log-likelihood |
| 'auc' | ROC-AUC score |
| 'ece' | expected calibration error (ECE) |
| 'mce' | maximum calibration error (MCE) |
| 'brier' | Brier score |
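To give an intuition for what ECE measures, here is a minimal sketch of a common binned definition: samples are grouped by confidence, and the gap between average confidence and accuracy in each bin is averaged, weighted by bin size. This is an illustration only; the function name, the number of bins and the binning scheme are assumptions, and `evaluation.compute_metric` may compute it differently.

```python
import numpy as np


def expected_calibration_error(labels, probs, n_bins=10):
    """Binned ECE: weighted average gap between confidence and accuracy."""
    labels = np.asarray(labels)
    probs = np.asarray(probs, dtype=float)
    confidences = probs.max(axis=1)       # probability of the predicted class
    predictions = probs.argmax(axis=1)    # predicted class index
    correct = (predictions == labels).astype(float)

    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for low, high in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > low) & (confidences <= high)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap    # weight by the fraction of samples in the bin
    return ece


# Made-up labels and probabilities, for illustration only.
toy_labels = [0, 1, 1, 0]
toy_probs = [[0.95, 0.05], [0.15, 0.85], [0.65, 0.35], [0.75, 0.25]]
print(expected_calibration_error(toy_labels, toy_probs))
```

A perfectly calibrated classifier would have confidence equal to accuracy in every bin, giving an ECE of zero; lower is better.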

## Example 2: Few-Shot Classification with an LLM

This example performs the same classification as Example 1, but provides the LLM with some labelled samples in the prompt. This strategy is referred to as few-shot classification or in-context learning.

#### Imports

General imports:

```python

import sys

import os

import pandas as pd

```

Add the src directory to the path:

```python

path_to_package = os.path.split(os.path.split(__file__)[0])[0]

sys.path.append(os.path.join(path_to_package, 'src'))

```

Import relevant classes and scripts from src:

```python

from llm_model import LLM  # class for LLM wrapper

from llm_classifier import LLMClassifier  # class for classifier using LLMs

import evaluation  # evaluation functions

```

#### Load Data

We will be using sentiment classification of Amazon reviews

for appliances, where reviews are to be classified as either

positive or negative:

```python

df = pd.read_csv('data/amazon_reviews/test.csv', sep='\t')  # pandas DataFrame containing text strings and integer labels

```

Let's take 200 examples to classify, including text inputs and

numeric ground truth labels to compare with after inference:

```python

n_test = 200

df_test = df[:n_test]  # test split

samples_test = df_test['text'].values  # text inputs

gt_labels_test = df_test['ground_truth_label'].values.astype(int)  # classes ground-truths as integers

```

We will also take 5 examples and associated labels to form a few-shot prompt, giving the LLM some examples

of the task we want it to perform:

```python

n_in_context = 5  # number of in-context examples to give in the prompt

df_in_context = df[n_test:n_test+n_in_context]  # in-context examples

samples_in_context = df_in_context['text'].values  # text inputs

gt_labels_in_context = df_in_context['ground_truth_label'].values.astype(int)  # classes outputs as integers

```

#### LLM and Prompt Formatting

Now we can call the LLM wrapper class to load the LLM of choice from Huggingface. In this example we will use "mistralai/Mistral-7B-Instruct-v0.3", a 7B instruction fine-tuned model from Mistral AI:

```python

llm = LLM(model_name="mistralai/Mistral-7B-Instruct-v0.3", use_reduced_precision=True)

```

We have used the "use_reduced_precision=True" argument, which will load

the model at bfloat16 precision, reducing memory requirements and making

the model much faster to run. For better performance, but higher compute

and memory, you can set this parameter to "False" or leave it as default.

Now we need some formatting functions and wrapping text to construct our prompts and to look for the right words in the output. These are specific to the task and can be defined in a class or a separate script. This class/script must have the following objects:

```python

class PromptFormatting(object):
    def __init__(self):

        # 1) an instruction sentence (stored on the instance so the classifier can read it)
        self.INSTRUCTION = 'classify the sentiment of the Amazon review below into one of the following classes:'

        # 2) The words identifying the classes. In this case
        # 0 = negative and 1 = positive.
        self.CLASSES = [
            'negative',
            'positive'
        ]

        # 3) The list of options that will be given to the LLM
        # in the prompt (classes words in a numbered list)
        self.CLASSES_TEXT = '''1. {}
2. {}'''.format(self.CLASSES[0], self.CLASSES[1])

    def format_instruction(self, instruction):
        # 4) function which, given the instruction sentence,
        # will put it together with the options list
        prompt = '''{}
{}
'''.format(instruction, self.CLASSES_TEXT)
        return prompt

    def format_content(self, content):
        # 5) formatting the text to be classified with a header and
        # the prompt to answer with one of the options. In this
        # case, the inputs are reviews.
        prompt = '''review: {}
the review is '''.format(content)
        return prompt


prompt_formatting = PromptFormatting()

 ```

You can play around with the objects in the 

class above to construct your prompts differently.

You can use this general format for any task.

Now we initialise the LLM classifier, which we can use to infer class

probabilities leveraging the LLM for prompting:

```python

classifier = LLMClassifier(model=llm, prompt_formatting=prompt_formatting)

```

The LLMClassifier class has a function to print out what the prompts will

look like and make sure it all looks ok. We can call this function with the in-context

examples and labels as arguments to see the resulting prompt that is given to the LLM:

```python

classifier.print_prompt_example(input_examples=samples_in_context, labels_examples=gt_labels_in_context)

```

This will return:

```console

EXAMPLE 1:

classify the sentiment of the Amazon review below into one of the following classes:

1. negative

2. positive

review: Installed this in my fridge, resettled the light and still shines red. Water come so out just fine, just not sure if it's our fridge or the filter.

the review is negative

EXAMPLE 2:

classify the sentiment of the Amazon review below into one of the following classes:

1. negative

2. positive

review: It had a decent size dent in the door.

the review is negative

EXAMPLE 3:

classify the sentiment of the Amazon review below into one of the following classes:

1. negative

2. positive

review: Good

the review is positive

EXAMPLE 4:

classify the sentiment of the Amazon review below into one of the following classes:

1. negative

2. positive

review: This is a perfect replacement for our KitchenAid utensil rack that had several holes in the bottom.

the review is positive

EXAMPLE 5:

classify the sentiment of the Amazon review below into one of the following classes:

1. negative

2. positive

review: I ordered one before this and it worked as good as the original factory one.  I will continue to buy from this company

the review is positive

EXAMPLE 6:

classify the sentiment of the Amazon review below into one of the following classes:

1. negative

2. positive

review: 

the review is 

```

The prompt above lists five examples where we have provided the correct answer. We then initiate a sixth example, where the test sample will be inserted after "review:" and the LLM will choose the class after "the review is". This will be automatically applied to all test examples during inference (see below).
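To make the layout of the few-shot prompt concrete, the sketch below assembles a prompt with the same structure from plain strings. The helper `build_few_shot_prompt` is hypothetical and only mirrors the printed format; it is not how `LLMClassifier` builds prompts internally, and the test review is made up.

```python
def build_few_shot_prompt(instruction, classes, examples, labels, test_text):
    """Assemble a prompt in the layout printed above (hypothetical helper)."""
    options = '\n'.join('{}. {}'.format(i + 1, c) for i, c in enumerate(classes))
    header = '{}\n{}\n'.format(instruction, options)
    blocks = []
    for k, (text, label) in enumerate(zip(examples, labels), start=1):
        # labelled in-context example: the class word is filled in
        blocks.append('EXAMPLE {}:\n{}review: {}\nthe review is {}'.format(
            k, header, text, classes[label]))
    # the final block leaves the class open for the LLM to complete
    blocks.append('EXAMPLE {}:\n{}review: {}\nthe review is '.format(
        len(examples) + 1, header, test_text))
    return '\n'.join(blocks)


prompt = build_few_shot_prompt(
    instruction='classify the sentiment of the Amazon review below into one of the following classes:',
    classes=['negative', 'positive'],
    examples=['It had a decent size dent in the door.', 'Good'],
    labels=[0, 1],
    test_text='Works as expected.',  # made-up test review
)
print(prompt)
```

The prompt deliberately ends right after "the review is " so that the next words the LLM produces can be matched against the class words.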

#### Classify Examples

Now that we have our prompts and our LLM ready, we can run classification

on our set of 200 examples. The function "soft_labels_batch" will run classification

using the LLM for all inputs in the list "input_texts", using provided in-context examples

and labels to construct the prompt. The output will be class probabilities:

```python

output_probs = classifier.soft_labels_batch(input_texts=samples_test, input_examples=samples_in_context, labels_examples=gt_labels_in_context)

```

"output_probs" is a 2D n_samples x n_classes array containing the predicted class probabilities.

We can have a look at a few examples:

```python

print(output_probs[:10, :])

```

This returns something similar to the following:

```console

[[9.99664650e-01 3.35350130e-04]

 [3.05902227e-07 9.99999694e-01]

 [8.31528028e-07 9.99999168e-01]

 [8.31528028e-07 9.99999168e-01]

 [9.99876605e-01 1.23394576e-04]

 [9.99088949e-01 9.11051194e-04]

 [9.99876605e-01 1.23394576e-04]

 [2.26032430e-06 9.99997740e-01]

 [9.99088949e-01 9.11051194e-04]

 [2.26032430e-06 9.99997740e-01]]

```

This output is an array giving, for each input sample, the probability of each of the two classes (negative and positive) as inferred by the LLM.

#### Evaluate

Now we can test performance, using the evaluation scripts. For example,

we can look at f1-score for classification performance and ECE for calibration:

```python

f1_score = evaluation.compute_metric(gt_labels_test, output_probs, metric='f1')

ece = evaluation.compute_metric(gt_labels_test, output_probs, metric='ece')

print('f1-score: {}, ECE: {}'.format(f1_score, ece))

```

This will return something similar to:

```console

f1-score: 0.934998374959374, ECE: 0.06773155927658081

```

## Example 3: BayesPE for Zero-Shot Classification

In this example we will show how to use BayesPE to combine multiple prompt instructions and

improve calibration of the resulting classification. BayesPE learns how "good" each

instruction is with a labelled validation set and weights them accordingly. At inference

time, we can set a budget of forward passes through the LLM to balance performance and 

cost. For example, setting the budget to 1 will simply choose the best performing prompt and

run classification with it.
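As a minimal sketch of the ensembling idea, the snippet below combines the class probabilities produced by several prompts using a weighted average. The function name, the made-up numbers and the choice of a linear average are assumptions for illustration; refer to the paper and the `BayesPE` class for the actual procedure.

```python
import numpy as np


def ensemble_probs(probs_per_prompt, weights):
    """Weighted average of per-prompt class probabilities.

    probs_per_prompt: array of shape (n_prompts, n_samples, n_classes)
    weights: array of shape (n_prompts,), non-negative
    """
    probs = np.asarray(probs_per_prompt, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalise so the weights sum to 1
    return np.tensordot(w, probs, axes=1)  # -> (n_samples, n_classes)


# Made-up probabilities from two prompts for two samples.
toy = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # prompt 1
    [[0.7, 0.3], [0.4, 0.6]],  # prompt 2
])
combined = ensemble_probs(toy, weights=[0.75, 0.25])
print(combined)
```

Because each prompt's output rows sum to 1 and the weights sum to 1, the combined rows are valid probability distributions too.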

#### Imports

General imports:

```python

import sys

import os

import pandas as pd

```

Add the src directory to the path:

```python

path_to_package = os.path.split(os.path.split(__file__)[0])[0]

sys.path.append(os.path.join(path_to_package, 'src'))

```

Import relevant classes and scripts from src:

```python

from bpe import BayesPE  # the BayesPE class

import evaluation  # evaluation functions

```

#### Load Data

We will be using sentiment classification of Amazon reviews

for appliances, where reviews are to be classified as either

positive or negative:

```python

df = pd.read_csv('data/amazon_reviews/test.csv', sep='\t')  # pandas DataFrame containing text strings and integer labels

```

We will take 100 examples for validation and 200 examples for testing.

Both will include text inputs and numeric ground truth labels. For the test set,

the ground-truth labels will be used for evaluation.

```python

# Validation set

n_val = 100

df_val = df[:n_val]  # validation split

samples_val = df_val['text'].values  # text inputs

gt_labels_val = df_val['ground_truth_label'].values.astype(int)  # classes outputs as integers

# Test set

n_test = 200

df_test = df[n_val:n_val+n_test]  # test split

samples_test = df_test['text'].values  # text inputs

gt_labels_test = df_test['ground_truth_label'].values.astype(int)  # classes outputs as integers

```

#### Prompt Formatting and Instructions

We need some formatting functions and wrapping text to construct our prompts and to look for the right words in the output. These are specific to the task and can be defined in a class or a separate script. This class/script must have the following objects:

```python

class PromptFormatting(object):
    def __init__(self):

        # 1) an instruction sentence (stored on the instance so the classifier can read it)
        self.INSTRUCTION = 'classify the sentiment of the Amazon review below into one of the following classes:'

        # 2) The words identifying the classes. In this case
        # 0 = negative and 1 = positive.
        self.CLASSES = [
            'negative',
            'positive'
        ]

        # 3) The list of options that will be given to the LLM
        # in the prompt (classes words in a numbered list)
        self.CLASSES_TEXT = '''1. {}
2. {}'''.format(self.CLASSES[0], self.CLASSES[1])

    def format_instruction(self, instruction):
        # 4) function which, given the instruction sentence,
        # will put it together with the options list
        prompt = '''{}
{}
'''.format(instruction, self.CLASSES_TEXT)
        return prompt

    def format_content(self, content):
        # 5) formatting the text to be classified with a header and
        # the prompt to answer with one of the options. In this
        # case, the inputs are reviews.
        prompt = '''review: {}
the review is '''.format(content)
        return prompt


prompt_formatting = PromptFormatting()

 ```

You can play around with the objects in the 

class above to construct your prompts differently.

You can use this general format for any task.

Next, we need to define the different prompt instructions we are going to ensemble with BayesPE. These are semantically equivalent instructions for the task at hand, stored in a list of strings. In our paper, we investigated many strategies to automatically generate these. In this tutorial we will manually define them. Let's make 9:

```python

instructions = [

'classify the sentiment of the Amazon review below into one of the following classes:',

'Categorize the sentiment of the Amazon review provided into one of the following classes:',

'Categorize the sentiment of the Amazon review provided into one of the given classes:',

'Determine the sentiment category of the given Amazon review by classifying it into one of the following classes:',

'Classify the sentiment of the given Amazon review into one of the following categories:',

'Assign the sentiment of the Amazon review provided to one of the given categories:',

'Categorize the sentiment of the provided Amazon review into one of the following classes:',

'Determine the sentiment category that best corresponds to the Amazon review provided amongst the following options:',

'Classify the sentiment expressed in the Amazon review below into one of the following categories:'

]

 ```

Each of these will take the place of PromptFormatting.INSTRUCTION when iteratively running the LLM to form the ensemble.

#### Initialising and Optimising BayesPE

With the prompt formatting and our ensemble of instructions ready, we can initialise the BayesPE

classifier and optimise the ensemble weights with the validation set. First, we initialise

the BayesPE class:

```python

bayespe_classifier = BayesPE(model_name="mistralai/Mistral-7B-Instruct-v0.3", prompt_formatting=prompt_formatting, instructions=instructions, use_reduced_precision=True)

```

The BayesPE class takes as arguments the huggingface name of the underlying LLM to

be used (in this case Mistral-7b-Instruct), the prompt formatting class or script, the list of semantically equivalent 

instructions and, optionally, a boolean argument indicating whether to load the model

at reduced precision for efficiency (set to 'True' in this example). There are a few additional 

optional arguments (see doc string for details).

Similarly to the LLMClassifier class, the BayesPE class has a function to print out what

the prompts will look like and make sure it all looks ok:

```python

bayespe_classifier.print_prompt_example()

```

This will return the prompt that will be used for the LLM, using the first instruction

in the list:

```console

classify the sentiment of the Amazon review below into one of the following classes:

1. negative

2. positive

review: 

the review is 

```

If the prompt looks ok, we can now run the LLM with all instructions on the validation set

and optimise the BayesPE prompts' weights. This is done by simply running the following

function:

```python

bayespe_classifier.optimise_weights(samples_val, gt_labels_val)

```

The above optimises the weights to assign to each instruction when running inference

using the validation samples and associated labels.

#### Inference with BayesPE

Now that the weights are optimised, we can use BayesPE to infer class probabilities for

test examples. We can decide our budget of LLM forward passes, up to the maximum available

instructions (in this case 9). BayesPE will start by using the most important instructions,

according to the optimised weights, and progressively work its way down. For example, if we set the forward

passes to 1, BayesPE will run once with the best instruction only. Let's try with 5:
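The budget mechanism can be pictured as keeping only the highest-weighted instructions. The sketch below is an illustration under assumptions: the helper name and the weights are made up, and renormalising the kept weights so they sum to 1 is our assumption rather than something stated above.

```python
import numpy as np


def select_for_budget(weights, n_forward_passes):
    """Pick the highest-weighted instructions and renormalise their weights."""
    w = np.asarray(weights, dtype=float)
    order = np.argsort(w)[::-1]  # indices from largest to smallest weight
    chosen = order[:n_forward_passes]
    return chosen, w[chosen] / w[chosen].sum()


toy_weights = [0.05, 0.30, 0.10, 0.40, 0.15]  # made-up weights for 5 instructions
chosen, renormalised = select_for_budget(toy_weights, n_forward_passes=2)
print(chosen, renormalised)
```

With a budget of 1, this reduces to running the single best instruction with weight 1.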

```python

output_probs = bayespe_classifier.forward(samples_test, n_forward_passes=5)

```

"output_probs" is a 2D n_samples x n_classes array containing the predicted class probabilities.

We can have a look at a few examples:

```python

print(output_probs[:10, :])

```

This returns something similar to the following:

```console

[[7.32112607e-01 2.67887438e-01]

 [9.96170234e-01 3.82981073e-03]

 [1.01965173e-05 9.99989848e-01]

 [1.14533497e-05 9.99988591e-01]

 [1.39176421e-04 9.99860868e-01]

 [1.11139489e-05 9.99988931e-01]

 [8.41226263e-04 9.99158818e-01]

 [7.84371738e-01 2.15628307e-01]

 [1.15778909e-03 9.98842256e-01]

 [5.90006247e-05 9.99941044e-01]]

```

This output is an array giving, for each input sample, the probability of each of the two classes (negative and positive) as inferred by the LLM.

#### Evaluate

Now we can test performance, using the evaluation scripts. For example,

we can look at f1-score for classification performance and ECE for calibration:

```python

f1_score = evaluation.compute_metric(gt_labels_test, output_probs, metric='f1')

ece = evaluation.compute_metric(gt_labels_test, output_probs, metric='ece')

print('f1-score: {}, ECE: {}'.format(f1_score, ece))

```

This will return something similar to:

```console

f1-score: 0.8996386993175431, ECE: 0.07812481373548508

```

#### Save and Re-Load the BayesPE Weights

You can save the BayesPE weights after optimising them with the following function:

```python

bayespe_classifier.save_weights(save_dir='saved_weights/ensemble_weights')

```

This will save the weights as a Pickle object in the specified directory. 

Similarly, re-load weights saved in a given directory with:

```python

bayespe_classifier.load_weights(load_dir='saved_weights/ensemble_weights')

```
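For readers unfamiliar with Pickle, the round trip looks roughly like the sketch below, which uses Python's standard `pickle` module on a made-up weight array. The file name is hypothetical and the snippet does not reproduce what `save_weights` and `load_weights` write or read internally.

```python
import os
import pickle
import tempfile

import numpy as np

weights = np.array([0.5, 0.3, 0.2])  # made-up ensemble weights

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'ensemble_weights.pkl')  # hypothetical file name
    with open(path, 'wb') as f:
        pickle.dump(weights, f)  # serialise the array to disk
    with open(path, 'rb') as f:
        restored = pickle.load(f)  # read it back

print(np.array_equal(weights, restored))
```

As with any Pickle file, only load weights from sources you trust, since unpickling can execute arbitrary code.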

## Example 4: BayesPE for Few-Shot Classification

In this example we will show how to use BayesPE to combine multiple prompt instructions and

improve calibration of the resulting classification, similarly to example 3. However, we will

use BayesPE for in-context learning, providing the LLM with some labelled examples in the prompt.

#### Imports

General imports:

```python

import sys

import os

import pandas as pd

```

Add the src directory to the path:

```python

path_to_package = os.path.split(os.path.split(__file__)[0])[0]

sys.path.append(os.path.join(path_to_package, 'src'))

```

Import relevant classes and scripts from src:

```python

from bpe import BayesPE  # the BayesPE class

import evaluation  # evaluation functions

```

#### Load Data

We will be using sentiment classification of Amazon reviews

for appliances, where reviews are to be classified as either

positive or negative:

```python

df = pd.read_csv('data/amazon_reviews/test.csv', sep='\t')  # pandas DataFrame containing text strings and integer labels

```

We will take 100 examples for validation and 200 examples for testing.

Both will include text inputs and numeric ground truth labels. For the test set,

the ground-truth labels will be used for evaluation.

```python

# Validation set

n_val = 100

df_val = df[:n_val]  # validation split

samples_val = df_val['text'].values  # text inputs

gt_labels_val = df_val['ground_truth_label'].values.astype(int)  # classes outputs as integers

# Test set

n_test = 200

df_test = df[n_val:n_val+n_test]  # test split

samples_test = df_test['text'].values  # text inputs

gt_labels_test = df_test['ground_truth_label'].values.astype(int)  # classes outputs as integers

```

#### Prompts and In-Context Examples

We need some formatting functions and wrapping text to construct our prompts and to look for the right words in the output. These are specific to the task and can be defined in a class or a separate script. This class/script must have the following objects:

```python

class PromptFormatting(object):
    def __init__(self):

        # 1) an instruction sentence (stored on the instance so the classifier can read it)
        self.INSTRUCTION = 'classify the sentiment of the Amazon review below into one of the following classes:'

        # 2) The words identifying the classes. In this case
        # 0 = negative and 1 = positive.
        self.CLASSES = [
            'negative',
            'positive'
        ]

        # 3) The list of options that will be given to the LLM
        # in the prompt (classes words in a numbered list)
        self.CLASSES_TEXT = '''1. {}
2. {}'''.format(self.CLASSES[0], self.CLASSES[1])

    def format_instruction(self, instruction):
        # 4) function which, given the instruction sentence,
        # will put it together with the options list
        prompt = '''{}
{}
'''.format(instruction, self.CLASSES_TEXT)
        return prompt

    def format_content(self, content):
        # 5) formatting the text to be classified with a header and
        # the prompt to answer with one of the options. In this
        # case, the inputs are reviews.
        prompt = '''review: {}
the review is '''.format(content)
        return prompt


prompt_formatting = PromptFormatting()

 ```

You can play around with the objects in the 

class above to construct your prompts differently.

You can use this general format for any task.

Next, we need to define the different prompt instructions we are going to ensemble with BayesPE. These are semantically equivalent instructions for the task at hand, stored in a list of strings. In our paper, we investigated many strategies to automatically generate these. In this tutorial we will manually define them. Let's make 9:

```python

instructions = [

'classify the sentiment of the Amazon review below into one of the following classes:',

'Categorize the sentiment of the Amazon review provided into one of the following classes:',

'Categorize the sentiment of the Amazon review provided into one of the given classes:',

'Determine the sentiment category of the given Amazon review by classifying it into one of the following classes:',

'Classify the sentiment of the given Amazon review into one of the following categories:',

'Assign the sentiment of the Amazon review provided to one of the given categories:',

'Categorize the sentiment of the provided Amazon review into one of the following classes:',

'Determine the sentiment category that best corresponds to the Amazon review provided amongst the following options:',

'Classify the sentiment expressed in the Amazon review below into one of the following categories:'

]

 ```

Each of these will take the place of PromptFormatting.INSTRUCTION when iteratively running the LLM to form the ensemble.

As we are performing classification with in-context learning, each instruction will need a

set of labelled examples to provide to the LLM. These can be defined for each instruction in

different ways. In this tutorial, we are simply going to use different random examples for

each instruction. We will take 5 examples for each instruction:

```python

n_in_context = 5  # number of in-context examples to use

for i in range(len(instructions)):  # for each instruction in the instructions list

    df_in_context = df[n_val+n_test+i*n_in_context:n_val+n_test+(i+1)*n_in_context]  # take 5 in-context examples

    samples_in_context_i = df_in_context[constants.TEXT].values  # 5 text inputs

    gt_labels_in_context_i = df_in_context[constants.GROUND_TRUTH_LABEL].values.astype(int)  # 5 classes outputs as integers

    

    # concatenate over the iterations to form 2D arrays of input texts and labels

    if i==0:

        samples_in_context = np.expand_dims(samples_in_context_i, axis=1)

        gt_labels_in_context = np.expand_dims(gt_labels_in_context_i, axis=1)

    else:

        samples_in_context = np.concatenate((samples_in_context, np.expand_dims(samples_in_context_i, axis=1)), axis=1)

        gt_labels_in_context = np.concatenate((gt_labels_in_context, np.expand_dims(gt_labels_in_context_i, axis=1)), axis=1)

 ```

The result is two 2D arrays, one of strings containing the input texts and one of integers containing the class labels, each of size n_in_context x n_instructions.

This is the format in which BayesPE accepts in-context examples.
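As a quick check of the expected layout, the following sketch builds arrays of the same shape with `np.stack` instead of repeated concatenation. The placeholder texts and labels are invented for illustration and stand in for the rows taken from the dataframe.

```python
import numpy as np

n_in_context = 5      # examples per instruction
n_instructions = 9    # size of the ensemble

# Placeholder data standing in for the rows sliced from the dataframe.
texts = np.array([f"review {k}" for k in range(n_in_context * n_instructions)])
labels = np.arange(n_in_context * n_instructions) % 2

# One column per instruction: column i holds a disjoint block of n_in_context rows.
text_columns = [texts[i * n_in_context:(i + 1) * n_in_context] for i in range(n_instructions)]
label_columns = [labels[i * n_in_context:(i + 1) * n_in_context] for i in range(n_instructions)]

samples_in_context = np.stack(text_columns, axis=1)
gt_labels_in_context = np.stack(label_columns, axis=1)

print(samples_in_context.shape, gt_labels_in_context.shape)
```

Column `i` of each array holds the examples for instruction `i`, so both arrays have `n_in_context` rows and `n_instructions` columns.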

#### Initialising and Optimising BayesPE

With the prompt formatting and our ensemble of instructions ready, we can initialise the BayesPE

classifier and optimise the ensemble weights with the validation set. First, we initialise

the BayesPE class:

```python

bayespe_classifier = BayesPE(model_name="mistralai/Mistral-7B-Instruct-v0.3", prompt_formatting=prompt_formatting, instructions=instructions, few_shot_texts_sets=samples_in_context, few_shot_labels_sets=gt_labels_in_context, use_reduced_precision=True)

```

The BayesPE class takes as arguments the Hugging Face name of the underlying LLM to be used (in this case Mistral-7B-Instruct), the prompt formatting class or script and the list of semantically equivalent instructions. As we are performing in-context learning, we have also provided the 2D arrays 'few_shot_texts_sets' and 'few_shot_labels_sets', containing the sets of text inputs and labels respectively for each instruction in the ensemble. Optionally, we can pass a boolean argument indicating whether to load the model at reduced precision for efficiency (set to 'True' in this example). There are a few additional optional arguments (see the docstring for details).

Similarly to the LLMClassifier class, the BayesPE class has a function to print out what

the prompts will look like and make sure it all looks ok:

```python

bayespe_classifier.print_prompt_example()

```

This will print an example of the prompt that will be given to the LLM, using the first instruction in the list and the first set of in-context examples. The final, empty example is where the text to classify is inserted:

```console

EXAMPLE 1:

classify the sentiment of the Amazon review below into one of the following classes:

1. negative

2. positive

review: This is a mixed review. .. When I got the ice maker I was in love. I LOVE ice.. and it was making ice like a champ for about one month then slowly it started making half the cubes.. then 4 cubes.. then very thin see through cubes... to none. I will however say that the company has been very receptive to my returning it to be repaired. .. returning is always a pain in the butt and it seems so that a brandy new product should not be having any problems. Will letchu know how the "repair" turns out.

the review is negative

EXAMPLE 2:

classify the sentiment of the Amazon review below into one of the following classes:

1. negative

2. positive

review: I bought this in Feb 2016, so I have used it for a good 14 months now. The first problem is the oven does not always stay on after lighting. This is very irritating when you when you "think" you are pre-heating the oven and it is actually not on! Secondly, there is only one high temp burner, so forget about cooking a pot of water for pasta AND something else at the same time. Thirdly, the knobs are very cheap and easily moved so if you set an oven temperature and bump into the knob, it may no longer be set at the desired temperature. Finally, one of the burner knobs just broke. So....good luck if you buy this oven and expect to cook!

the review is negative

EXAMPLE 3:

classify the sentiment of the Amazon review below into one of the following classes:

1. negative

2. positive

review: I got what I thought was a great deal.  It was only used a couple of months and the people "remodeled" so they upgraded to a larger unit. Yeah.  First the doors just don't like to be shut.  That's why GE put a buzzer on it.  Second the drain gets plugged and it is a bear to remove the freezer drawers and the interior freezer back panel to clean it out.  Why hide it behind a panel?  It's noisy, has cheap refrigerator drawers.  The only thing good is it looks nice.

GE lost me as a customer for life after this.

the review is negative

EXAMPLE 4:

classify the sentiment of the Amazon review below into one of the following classes:

1. negative

2. positive

review: Ice maker did not work. Just kept leaking water all over the floor. leaked an entire 5 gallon jug in just a few hours.

the review is negative

EXAMPLE 5:

classify the sentiment of the Amazon review below into one of the following classes:

1. negative

2. positive

review: The packaging is very different from the one I bought from Home Depot.

the review is negative

EXAMPLE 6:

classify the sentiment of the Amazon review below into one of the following classes:

1. negative

2. positive

review: 

the review is 

```

If the prompt looks ok, we can now run the LLM with all instructions on the validation set

and optimise the BayesPE prompts' weights. This is done by simply running the following

function:

```python

bayespe_classifier.optimise_weights(samples_val, gt_labels_val)

```

Using the validation samples and their labels, this optimises the weight assigned to each instruction when running inference.

#### Inference with BayesPE

Now that the weights are optimised, we can use BayesPE to infer class probabilities for

test examples. We can decide our budget of LLM forward passes, up to the maximum available

instructions (in this case 9). BayesPE will start by using the most important instructions,

according to the optimised weights, and progressively work its way down. For example, if we set the forward

passes to 1, BayesPE will run once with the best instruction only. Let's try with 5:

```python

output_probs = bayespe_classifier.forward(samples_test, n_forward_passes=5)

```

"output_probs" is a 2D n_samples x n_classes array containing the predicted class probabilities.

We can have a look at a few examples:

```python

print(output_probs[:10, :])

```

This returns something similar to the following:

```console

[[9.55111911e-01 4.48880816e-02]

 [9.99915070e-01 8.49220944e-05]

 [1.29251932e-05 9.99987067e-01]

 [5.33277146e-05 9.99946665e-01]

 [3.05689604e-05 9.99969424e-01]

 [3.26815948e-05 9.99967311e-01]

 [1.08215687e-05 9.99989171e-01]

 [9.18138780e-01 8.18612125e-02]

 [1.31488799e-01 8.68511194e-01]

 [1.51307649e-05 9.99984862e-01]]

```

Each row of this array contains the probabilities of the two classes (negative and positive) inferred by the LLM for one input sample.
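To make the forward-pass budget more concrete, here is a minimal NumPy sketch of combining per-instruction probabilities using only the highest-weighted instructions. It is an illustration, not the package's implementation: the function `ensemble_top_k`, the renormalisation of the selected weights and the toy numbers are all assumptions.

```python
import numpy as np

def ensemble_top_k(probs_per_instruction, weights, n_forward_passes):
    """Weighted average of class probabilities over the top-weighted instructions.

    probs_per_instruction: shape (n_instructions, n_samples, n_classes)
    weights: shape (n_instructions,)
    """
    top = np.argsort(weights)[::-1][:n_forward_passes]  # indices of the largest weights
    w = weights[top] / weights[top].sum()                # renormalise the selected weights
    return np.tensordot(w, probs_per_instruction[top], axes=1)  # (n_samples, n_classes)

# Toy example: 3 instructions, 2 samples, 2 classes.
weights = np.array([0.5, 0.1, 0.4])
probs = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.6, 0.4], [0.5, 0.5]],
    [[0.7, 0.3], [0.1, 0.9]],
])

combined = ensemble_top_k(probs, weights, n_forward_passes=2)
predicted = combined.argmax(axis=1)  # hard class predictions from the probabilities
print(combined, predicted)
```

With a budget of 2, only the first and third instructions contribute here, since they carry the largest weights.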

#### Evaluate

Now we can test performance using the evaluation scripts. For example, we can look at the F1-score for classification performance and the expected calibration error (ECE) for calibration:

```python

f1_score = evaluation.compute_metric(gt_labels_test, output_probs, metric='f1')

ece = evaluation.compute_metric(gt_labels_test, output_probs, metric='ece')

print('f1-score: {}, ECE: {}'.format(f1_score, ece))

```

This will return something similar to:

```console

f1-score: 0.9368717948717948, ECE: 0.04805548116564751

```
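For intuition about what ECE measures, here is a minimal sketch of the common binned definition. It is not necessarily what `evaluation.compute_metric` computes: the function name, the choice of 10 equal-width bins and the toy data are assumptions made for illustration.

```python
import numpy as np

def expected_calibration_error(gt_labels, probs, n_bins=10):
    """Binned ECE: bin-size-weighted mean of |accuracy - confidence|."""
    gt_labels = np.asarray(gt_labels)
    probs = np.asarray(probs)
    confidence = probs.max(axis=1)                               # confidence of the predicted class
    correct = (probs.argmax(axis=1) == gt_labels).astype(float)  # 1 if the prediction is right
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lower, upper in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lower) & (confidence <= upper)
        if in_bin.any():
            # in_bin.mean() is the fraction of samples falling in this bin
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidence[in_bin].mean())
    return ece

toy_probs = np.array([[0.95, 0.05], [0.85, 0.15], [0.25, 0.75], [0.65, 0.35]])
toy_labels = np.array([0, 0, 1, 1])
toy_ece = expected_calibration_error(toy_labels, toy_probs)
print(toy_ece)
```

A lower value means the predicted confidences track the observed accuracy more closely.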

#### Save and Re-Load the BayesPE Weights

You can save the BayesPE weights after optimising them with the following function:

```python

bayespe_classifier.save_weights(save_dir='saved_weights/ensemble_weights')

```

This will save the weights as a pickle file in the specified directory.

Similarly, re-load weights saved in a given directory with:

```python

bayespe_classifier.load_weights(load_dir='saved_weights/ensemble_weights')

```