{"id":19646579,"url":"https://github.com/amzn/bayespe","last_synced_at":"2026-02-18T13:33:22.227Z","repository":{"id":252575728,"uuid":"836161725","full_name":"amzn/BayesPE","owner":"amzn","description":"Zero-shot and in-context learning classification with LLMs and uncertainty estimation using multiple prompts.","archived":false,"fork":false,"pushed_at":"2024-07-31T09:24:05.000Z","size":264,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-25T08:11:13.275Z","etag":null,"topics":["bayesian","bayespe","llms","prompting","prompts","uncertainty-quantification"],"latest_commit_sha":null,"homepage":"https://www.amazon.science/publications/bayesian-prompt-ensembles-model-uncertainty-estimation-for-black-box-large-language-models","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/amzn.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-31T09:21:49.000Z","updated_at":"2025-08-15T08:25:24.000Z","dependencies_parsed_at":"2024-08-10T21:47:41.167Z","dependency_job_id":"f8887567-f47a-4184-8e04-ca1aa3e241e3","html_url":"https://github.com/amzn/BayesPE","commit_stats":null,"previous_names":["amzn/bayespe"],"tags_count":0,"template":false,"template_full_name":"amazon-archives/__template_Apache-2.0","purl":"pkg:github/amzn/BayesPE","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amzn%2FBayesPE","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amzn%2FBayesPE/tags
","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amzn%2FBayesPE/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amzn%2FBayesPE/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/amzn","download_url":"https://codeload.github.com/amzn/BayesPE/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amzn%2FBayesPE/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29580808,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-18T08:38:15.585Z","status":"ssl_error","status_checked_at":"2026-02-18T08:38:14.917Z","response_time":162,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bayesian","bayespe","llms","prompting","prompts","uncertainty-quantification"],"created_at":"2024-11-11T14:39:19.603Z","updated_at":"2026-02-18T13:33:17.212Z","avatar_url":"https://github.com/amzn.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n## Description\n\nThis package implements the method described and evaluated in the paper \n\"Bayesian Prompt Ensembles: Model Uncertainty Estimation for Black-Box Large Language Models\".\nBayesian Prompt Ensembles (BayesPE) is a method to combine multiple semantically equivalent \nprompts to obtain well-calibrated output probabilities with Large Language Models. 
\nThe package includes tools to i) perform classification through prompting with LLMs and \nii) use the BayesPE approach to ensemble multiple prompts, improving calibration performance.\nBelow you will find a comprehensive tutorial divided into four self-contained parts:\n1. **Zero-Shot Classification with an LLM:** Use an LLM to perform classification through prompting.\n2. **Few-Shot Classification with an LLM:** Use an LLM to perform classification through in-context learning, providing a few labelled examples in the prompt.\n3. **BayesPE for Zero-Shot Classification:** Use BayesPE to ensemble different semantically equivalent prompts to perform classification with an LLM.\n4. **BayesPE for Few-Shot Classification:** Use BayesPE to ensemble different semantically equivalent prompts and in-context examples to perform classification with an LLM.\n\nIf you use this package, please cite our paper: https://www.amazon.science/publications/bayesian-prompt-ensembles-model-uncertainty-estimation-for-black-box-large-language-models\n\n```console\n@article{tonolini2024bayesian,\n  title={Bayesian prompt ensembles: Model uncertainty estimation for black-box large language models},\n  author={Tonolini, Francesco and Massiah, Jordan and Aletras, Nikolaos and Kazai, Gabriella},\n  journal={Association for Computational Linguistics},\n  year={2024}\n}\n```\n\nIf you have questions or need help, don't hesitate to get in touch: tonolini@amazon.com\n\n## Installation\n\nCopy this package to where you need it, then do the following:\n1) move to the package's directory\n```console\ncd BayesPE\n```\n\n2) create a Python environment\n```console\nconda create --name bayespe python=3.10\n```\n\n3) activate the environment\n```console\nsource activate bayespe\n```\n\n4) install requirements\n```console\npip install -r requirements.txt\n```\n\n5) install Huggingface CLI\n```console\npip install -U \"huggingface_hub[cli]\"\n```\n\n6) login to Huggingface (for access to LLMs) and enter your 
token.\n```console\nhuggingface-cli login\n```\n\n\n\nAnd you are good to go!\n\n## Example 1: Zero-Shot Classification with an LLM\n\nHere is a simple example of classifying text with an LLM\nusing the package.\n\n#### Imports\n\nGeneral imports:\n\n```python\nimport sys\nimport os\nimport pandas as pd\n```\nAdd the src directory to the path:\n```python\npath_to_package = os.path.split(os.path.split(__file__)[0])[0]\nsys.path.append(os.path.join(path_to_package, 'src'))\n```\nImport relevant classes and scripts from src:\n```python\nfrom llm_model import LLM  # class for LLM wrapper\nfrom llm_classifier import LLMClassifier  # class for classifier using LLMs\nimport evaluation  # evaluation functions\n```\n\n#### Load Data\n\nWe will be using sentiment classification of Amazon reviews\nfor appliances, where reviews are to be classified as either\npositive or negative:\n\n```python\ndf = pd.read_csv('data/amazon_reviews/test.csv', sep='\\t')  # pandas DataFrame containing text strings and integer labels\n```\n\nLet's take 200 examples to classify, including text inputs and\nnumeric ground truth labels to compare with after inference:\n```python\nn_test = 200\ndf_test = df[:n_test]  # test split\nsamples_test = df_test['text'].values  # text inputs\ngt_labels_test = df_test['ground_truth_label'].values.astype(int)  # classes ground-truths as integers\n```\n#### LLM and Prompt Formatting\n\nNow we can call the LLM wrapper class to load the LLM of choice from\nHuggingface. In this example we will use \"mistralai/Mistral-7B-Instruct-v0.3\";\na 7b instruction fine-tuned model from Mistral AI:\n```python\nllm = LLM(model_name=\"mistralai/Mistral-7B-Instruct-v0.3\", use_reduced_precision=True)\n```\nWe have used the \"use_reduced_precision=True\" argument, which will load\nthe model at bfloat16 precision, reducing memory requirements and making\nthe model much faster to run. 
For better predictive performance, at the cost of more compute\nand memory, you can set this parameter to \"False\" or leave it at its default.\n\n\nNow we need to make some formatting functions and wrapping text to construct our\nprompts and look for the right words at the output. These are\nspecific to the task and can be defined in a class or a separate\nscript. This class/script must have the following objects:\n```python\nclass PromptFormatting(object):\n    def __init__(self):\n        \n        # 1) an instruction sentence\n        self.INSTRUCTION = 'classify the sentiment of the Amazon review below into one of the following classes:'\n        \n        # 2) The words identifying the classes. In this case\n        # 0 = negative and 1 = positive.\n        self.CLASSES = [\n            'negative',\n            'positive'\n        ]\n        \n        # 3) The list of options that will be given to the LLM\n        # in the prompt (classes words in a numbered list)\n        self.CLASSES_TEXT = '''1. {}\n2. {}'''.format(self.CLASSES[0], self.CLASSES[1])\n\n    def format_instruction(self, instruction):\n        # 4) function which, given the instruction sentence, \n        # will put it together with the options list\n        prompt = '''{}\n{}\n'''.format(instruction, self.CLASSES_TEXT)\n        return prompt\n    \n    def format_content(self, content):\n        # 5) formatting the text to be classified with a header and\n        # the prompt to answer with one of the options. 
In this\n        # case, the inputs are reviews.\n        prompt = '''review: {}\nthe review is '''.format(content)\n        return prompt\n\nprompt_formatting = PromptFormatting()\n ```\nYou can play around with the objects in the \nclass above to construct your prompts differently.\nYou can use this general format for any task.\n\nNow we initialise the LLM classifier, which we can use to infer class\nprobabilities leveraging the LLM for prompting:\n```python\nclassifier = LLMClassifier(model=llm, prompt_formatting=prompt_formatting)\n```\n\nThe LLMClassifier class has a function to print out what the prompts will\nlook like and make sure it all looks ok:\n```python\nclassifier.print_prompt_example()\n```\nThis will return:\n```console\nclassify the sentiment of the Amazon review below into one of the following classes:\n1. negative\n2. positive\n\nreview: \u003cTEXT_IN\u003e\nthe review is \u003cLABEL_OUT\u003e\n```\n\n#### Classify Examples\n\nNow that we have our prompts and our LLM ready, we can run classification\non our set of 200 examples. 
The function \"soft_labels_batch\" will run classification\nusing the LLM for all inputs in the list \"input_texts\" and return class probabilities:\n```python\noutput_probs = classifier.soft_labels_batch(input_texts=samples_test)\n```\n\"output_probs\" is a 2D n_samples x n_classes array containing the predicted class probabilities.\nWe can have a look at a few examples:\n```python\nprint(output_probs[:10, :])\n```\nThis returns something similar to the following:\n```console\n[[9.97527377e-01 2.47262316e-03]\n [4.13993755e-08 9.99999959e-01]\n [3.05902227e-07 9.99999694e-01]\n [1.12535162e-07 9.99999887e-01]\n [9.93307149e-01 6.69285092e-03]\n [9.82013790e-01 1.79862100e-02]\n [9.97527377e-01 2.47262316e-03]\n [1.12535162e-07 9.99999887e-01]\n [9.82013790e-01 1.79862100e-02]\n [3.05902227e-07 9.99999694e-01]]\n```\nThis output is an array of probability of each of the two classes (negative and positive)\nfor each input sample inferred by the LLM.\n\n#### Evaluate\n\nNow we can test performance, using the evaluation scripts. 
For example,\nwe can look at f1-score for classification performance and ECE for calibration:\n```python\nf1_score = evaluation.compute_metric(gt_labels_test, output_probs, metric='f1')\nece = evaluation.compute_metric(gt_labels_test, output_probs, metric='ece')\nprint('f1-score: {}, ECE: {}'.format(f1_score, ece))\n```\nThis will return something similar to:\n```console\nf1-score: 0.8897243107769424, ECE: 0.08265417069196701\n```\n\nWith the \"compute_metric\" function you can compute the following metrics:\n\n| metric | returns |\n|:--------|:---------|\n| 'f1' | macro f1-score |\n| 'acc' | classification accuracy |\n| 'nll' | negative log-likelihood |\n| 'auc' | ROC-AUC score |\n| 'ece' | expected calibration error (ECE) |\n| 'mce' | maximum calibration error (MCE) |\n| 'brier' | Brier score |\n\n## Example 2: Few-Shot Classification with an LLM\n\nThis example performs the same classification of example 1, but providing the LLM with some labelled\nsamples in the prompt. This strategy is referred to as few-shot classification or in-context learning.\n\n#### Imports\n\nGeneral imports:\n\n```python\nimport sys\nimport os\nimport pandas as pd\n```\nAdd the src directory to the path:\n```python\npath_to_package = os.path.split(os.path.split(__file__)[0])[0]\nsys.path.append(os.path.join(path_to_package, 'src'))\n```\nImport relevant classes and scripts from src:\n```python\nfrom llm_model import LLM  # class for LLM wrapper\nfrom llm_classifier import LLMClassifier  # class for classifier using LLMs\nimport evaluation  # evaluation functions\n```\n\n#### Load Data\n\nWe will be using sentiment classification of Amazon reviews\nfor appliances, where reviews are to be classified as either\npositive or negative:\n\n```python\ndf = pd.read_csv('data/amazon_reviews/test.csv', sep='\\t')  # pandas DataFrame containing text strings and integer labels\n```\n\nLet's take 200 examples to classify, including text inputs and\nnumeric ground truth labels to compare with after 
inference:\n```python\nn_test = 200\ndf_test = df[:n_test]  # test split\nsamples_test = df_test['text'].values  # text inputs\ngt_labels_test = df_test['ground_truth_label'].values.astype(int)  # classes ground-truths as integers\n```\nWe will also take 5 examples and associated labels to form a few-shot prompt, giving the LLM some examples\nof the task we want it to perform:\n```python\nn_in_context = 5  # number of in-context examples to give in the prompt\ndf_in_context = df[n_test:n_test+n_in_context]  # in-context examples\nsamples_in_context = df_in_context['text'].values  # text inputs\ngt_labels_in_context = df_in_context['ground_truth_label'].values.astype(int)  # classes outputs as integers\n```\n\n#### LLM and Prompt Formatting\n\nNow we can call the LLM wrapper class to load the LLM of choice from\nHuggingface. In this example we will use \"mistralai/Mistral-7B-Instruct-v0.3\";\na 7b instruction fine-tuned model from Mistral AI:\n```python\nllm = LLM(model_name=\"mistralai/Mistral-7B-Instruct-v0.3\", use_reduced_precision=True)\n```\nWe have used the \"use_reduced_precision=True\" argument, which will load\nthe model at bfloat16 precision, reducing memory requirements and making\nthe model much faster to run. For better predictive performance, at the cost of more compute\nand memory, you can set this parameter to \"False\" or leave it at its default.\n\n\nNow we need to make some formatting functions and wrapping text to construct our\nprompts and look for the right words at the output. These are\nspecific to the task and can be defined in a class or a separate\nscript. This class/script must have the following objects:\n```python\nclass PromptFormatting(object):\n    def __init__(self):\n        \n        # 1) an instruction sentence\n        self.INSTRUCTION = 'classify the sentiment of the Amazon review below into one of the following classes:'\n        \n        # 2) The words identifying the classes. 
In this case\n        # 0 = negative and 1 = positive.\n        self.CLASSES = [\n            'negative',\n            'positive'\n        ]\n        \n        # 3) The list of options that will be given to the LLM\n        # in the prompt (classes words in a numbered list)\n        self.CLASSES_TEXT = '''1. {}\n2. {}'''.format(self.CLASSES[0], self.CLASSES[1])\n\n    def format_instruction(self, instruction):\n        # 4) function which, given the instruction sentence, \n        # will put it together with the options list\n        prompt = '''{}\n{}\n'''.format(instruction, self.CLASSES_TEXT)\n        return prompt\n    \n    def format_content(self, content):\n        # 5) formatting the text to be classified with a header and\n        # the prompt to answer with one of the options. In this\n        # case, the inputs are reviews.\n        prompt = '''review: {}\nthe review is '''.format(content)\n        return prompt\n\nprompt_formatting = PromptFormatting()\n ```\nYou can play around with the objects in the \nclass above to construct your prompts differently.\nYou can use this general format for any task.\n\nNow we initialise the LLM classifier, which we can use to infer class\nprobabilities leveraging the LLM for prompting:\n```python\nclassifier = LLMClassifier(model=llm, prompt_formatting=prompt_formatting)\n```\n\nThe LLMClassifier class has a function to print out what the prompts will\nlook like and make sure it all looks ok. We can call this function with the in-context\nexamples and labels as arguments to see the resulting prompt that is given to the LLM:\n```python\nclassifier.print_prompt_example(input_examples=samples_in_context, labels_examples=gt_labels_in_context)\n```\nThis will return:\n```console\nEXAMPLE 1:\nclassify the sentiment of the Amazon review below into one of the following classes:\n1. negative\n2. positive\n\nreview: Installed this in my fridge, resettled the light and still shines red. 
Water come so out just fine, just not sure if it's our fridge or the filter.\nthe review is negative\n\nEXAMPLE 2:\nclassify the sentiment of the Amazon review below into one of the following classes:\n1. negative\n2. positive\n\nreview: It had a decent size dent in the door.\nthe review is negative\n\nEXAMPLE 3:\nclassify the sentiment of the Amazon review below into one of the following classes:\n1. negative\n2. positive\n\nreview: Good\nthe review is positive\n\nEXAMPLE 4:\nclassify the sentiment of the Amazon review below into one of the following classes:\n1. negative\n2. positive\n\nreview: This is a perfect replacement for our KitchenAid utensil rack that had several holes in the bottom.\nthe review is positive\n\nEXAMPLE 5:\nclassify the sentiment of the Amazon review below into one of the following classes:\n1. negative\n2. positive\n\nreview: I ordered one before this and it worked as good as the original factory one.  I will continue to buy from this company\nthe review is positive\n\nEXAMPLE 6:\nclassify the sentiment of the Amazon review below into one of the following classes:\n1. negative\n2. positive\n\nreview: \u003cTEXT_IN\u003e\nthe review is \u003cLABEL_OUT\u003e\n```\nThe prompt above lists five examples for which we have provided the correct answer. We then start a sixth\nexample, where the test sample is inserted at \u003cTEXT_IN\u003e and the LLM chooses the class at \u003cLABEL_OUT\u003e.\nThis is applied automatically to all test examples during inference (see below).\n\n#### Classify Examples\n\nNow that we have our prompts and our LLM ready, we can run classification\non our set of 200 examples. The function \"soft_labels_batch\" will run classification\nusing the LLM for all inputs in the list \"input_texts\", using the provided in-context examples\nand labels to construct the prompt. 
The output will be class probabilities:\n```python\noutput_probs = classifier.soft_labels_batch(input_texts=samples_test, input_examples=samples_in_context, labels_examples=gt_labels_in_context)\n```\n\"output_probs\" is a 2D n_samples x n_classes array containing the predicted class probabilities.\nWe can have a look at a few examples:\n```python\nprint(output_probs[:10, :])\n```\nThis returns something similar to the following:\n```console\n[[9.99664650e-01 3.35350130e-04]\n [3.05902227e-07 9.99999694e-01]\n [8.31528028e-07 9.99999168e-01]\n [8.31528028e-07 9.99999168e-01]\n [9.99876605e-01 1.23394576e-04]\n [9.99088949e-01 9.11051194e-04]\n [9.99876605e-01 1.23394576e-04]\n [2.26032430e-06 9.99997740e-01]\n [9.99088949e-01 9.11051194e-04]\n [2.26032430e-06 9.99997740e-01]]\n```\nThis output is an array of probability of each of the two classes (negative and positive)\nfor each input sample inferred by the LLM.\n\n#### Evaluate\n\nNow we can test performance, using the evaluation scripts. For example,\nwe can look at f1-score for classification performance and ECE for calibration:\n```python\nf1_score = evaluation.compute_metric(gt_labels_test, output_probs, metric='f1')\nece = evaluation.compute_metric(gt_labels_test, output_probs, metric='ece')\nprint('f1-score: {}, ECE: {}'.format(f1_score, ece))\n```\nThis will return something similar to:\n```console\nf1-score: 0.934998374959374, ECE: 0.06773155927658081\n```\n\n## Example 3: BayesPE for Zero-Shot Classification\n\nIn this example we will show how to use BayesPE to combine multiple prompt instructions and\nimprove calibration of the resulting classification. BayesPE learns how \"good\" each\ninstruction is with a labelled validation set and weights them accordingly. At inference\ntime, we can set a budget of forward passes through the LLM to balance performance and \ncost. 
For example, setting the budget to 1 will simply choose the best performing prompt and\nrun classification with it.\n\n#### Imports\n\nGeneral imports:\n\n```python\nimport sys\nimport os\nimport pandas as pd\n```\nAdd the src directory to the path:\n```python\npath_to_package = os.path.split(os.path.split(__file__)[0])[0]\nsys.path.append(os.path.join(path_to_package, 'src'))\n```\nImport relevant classes and scripts from src:\n```python\nfrom bpe import BayesPE  # the BayesPE class\nimport evaluation  # evaluation functions\n```\n\n#### Load Data\n\nWe will be using sentiment classification of Amazon reviews\nfor appliances, where reviews are to be classified as either\npositive or negative:\n\n```python\ndf = pd.read_csv('data/amazon_reviews/test.csv', sep='\\t')  # pandas DataFrame containing text strings and integer labels\n```\n\nWe will take 100 examples for validation and 200 examples for testing.\nBoth will include text inputs and numeric ground truth labels. For the test set,\nthe ground-truth labels will be used for evaluation.\n```python\n# Validation set\nn_val = 100\ndf_val = df[:n_val]  # validation split\nsamples_val = df_val['text'].values  # text inputs\ngt_labels_val = df_val['ground_truth_label'].values.astype(int)  # classes outputs as integers\n# Test set\nn_test = 200\ndf_test = df[n_val:n_val+n_test]  # test split\nsamples_test = df_test['text'].values  # text inputs\ngt_labels_test = df_test['ground_truth_label'].values.astype(int)  # classes outputs as integers\n```\n\n#### Prompt Formatting and Instructions\n\nWe need to make some formatting functions and wrapping text to construct our\nprompts and look for the right words at the output. These are\nspecific to the task and can be defined in a class or a separate\nscript. 
This class/script must have the following objects:\n```python\nclass PromptFormatting(object):\n    def __init__(self):\n        \n        # 1) an instruction sentence\n        self.INSTRUCTION = 'classify the sentiment of the Amazon review below into one of the following classes:'\n        \n        # 2) The words identifying the classes. In this case\n        # 0 = negative and 1 = positive.\n        self.CLASSES = [\n            'negative',\n            'positive'\n        ]\n        \n        # 3) The list of options that will be given to the LLM\n        # in the prompt (classes words in a numbered list)\n        self.CLASSES_TEXT = '''1. {}\n2. {}'''.format(self.CLASSES[0], self.CLASSES[1])\n\n    def format_instruction(self, instruction):\n        # 4) function which, given the instruction sentence, \n        # will put it together with the options list\n        prompt = '''{}\n{}\n'''.format(instruction, self.CLASSES_TEXT)\n        return prompt\n    \n    def format_content(self, content):\n        # 5) formatting the text to be classified with a header and\n        # the prompt to answer with one of the options. In this\n        # case, the inputs are reviews.\n        prompt = '''review: {}\nthe review is '''.format(content)\n        return prompt\n\nprompt_formatting = PromptFormatting()\n```\nYou can play around with the objects in the \nclass above to construct your prompts differently.\nYou can use this general format for any task.\n\nNext, we need to define the different prompt instructions we are going to ensemble with\nBayesPE. These are semantically equivalent instructions for the task at hand, stored in a list of strings. In our paper,\nwe investigated many strategies to automatically generate these. In this tutorial we will manually \ndefine them. 
Let's make 9:\n```python\ninstructions = [\n'classify the sentiment of the Amazon review below into one of the following classes:',\n'Categorize the sentiment of the Amazon review provided into one of the following classes:',\n'Categorize the sentiment of the Amazon review provided into one of the given classes:',\n'Determine the sentiment category of the given Amazon review by classifying it into one of the following classes:',\n'Classify the sentiment of the given Amazon review into one of the following categories:',\n'Assign the sentiment of the Amazon review provided to one of the given categories:',\n'Categorize the sentiment of the provided Amazon review into one of the following classes:',\n'Determine the sentiment category that best corresponds to the Amazon review provided amongst the following options:',\n'Classify the sentiment expressed in the Amazon review below into one of the following categories:'\n]\n```\nEach of these will take the place of PromptFormatting.INSTRUCTION when iteratively running \nthe LLM to form the ensemble.\n\n#### Initialising and Optimising BayesPE\n\nWith the prompt formatting and our ensemble of instructions ready, we can initialise the BayesPE\nclassifier and optimise the ensemble weights with the validation set. First, we initialise\nthe BayesPE class:\n```python\nbayespe_classifier = BayesPE(model_name=\"mistralai/Mistral-7B-Instruct-v0.3\", prompt_formatting=prompt_formatting, instructions=instructions, use_reduced_precision=True)\n```\nThe BayesPE class takes as arguments the Huggingface name of the underlying LLM to\nbe used (in this case Mistral-7b-Instruct), the prompt formatting class or script, the list of semantically equivalent \ninstructions and, optionally, a boolean argument indicating whether to load the model\nat reduced precision for efficiency (set to 'True' in this example). 
There are a few additional \noptional arguments (see doc string for details).\n\nSimilarly to the LLMClassifier class, the BayesPE class has a function to print out what\nthe prompts will look like and make sure it all looks ok:\n```python\nbayespe_classifier.print_prompt_example()\n```\nThis will return the prompt that will be used for the LLM, using the first instruction\nin the list:\n```console\nclassify the sentiment of the Amazon review below into one of the following classes:\n1. negative\n2. positive\n\nreview: \u003cSAMPLE_IN\u003e\nthe review is \u003cLABEL_OUT\u003e\n```\nIf the prompt looks ok, we can now run the LLM with all instructions on the validation set\nand optimise the BayesPE prompts' weights. This is done by simply running the following\nfunction:\n```python\nbayespe_classifier.optimise_weights(samples_val, gt_labels_val)\n```\nThe above optimises the weights to assign to each instruction when running inference\nusing the validation samples and associated labels.\n\n#### Inference with BayesPE\n\nNow that the weights are optimised, we can use BayesPE to infer class probabilities for\ntest examples. We can decide our budget of LLM forward passes, up to the maximum available\ninstructions (in this case 9). BayesPE will start by using the most important instructions,\naccording to the optimised weights, and progressively work its way down. For example, if we set the forward\npasses to 1, BayesPE will run once with the best instruction only. 
Let's try with 5:\n```python\noutput_probs = bayespe_classifier.forward(samples_test, n_forward_passes=5)\n```\n\"output_probs\" is a 2D n_samples x n_classes array containing the predicted class probabilities.\nWe can have a look at a few examples:\n```python\nprint(output_probs[:10, :])\n```\nThis returns something similar to the following:\n```console\n[[7.32112607e-01 2.67887438e-01]\n [9.96170234e-01 3.82981073e-03]\n [1.01965173e-05 9.99989848e-01]\n [1.14533497e-05 9.99988591e-01]\n [1.39176421e-04 9.99860868e-01]\n [1.11139489e-05 9.99988931e-01]\n [8.41226263e-04 9.99158818e-01]\n [7.84371738e-01 2.15628307e-01]\n [1.15778909e-03 9.98842256e-01]\n [5.90006247e-05 9.99941044e-01]]\n```\nThis output is an array of probability of each of the two classes (negative and positive)\nfor each input sample inferred by the LLM.\n\n#### Evaluate\n\nNow we can test performance, using the evaluation scripts. For example,\nwe can look at f1-score for classification performance and ECE for calibration:\n```python\nf1_score = evaluation.compute_metric(gt_labels_test, output_probs, metric='f1')\nece = evaluation.compute_metric(gt_labels_test, output_probs, metric='ece')\nprint('f1-score: {}, ECE: {}'.format(f1_score, ece))\n```\nThis will return something similar to:\n```console\nf1-score: 0.8996386993175431, ECE: 0.07812481373548508\n```\n\n\n#### Save and Re-Load the BayesPE Weights\n\nYou can save the BayesPE weights after optimising them with the following function:\n```python\nbayespe_classifier.save_weights(save_dir='saved_weights/ensemble_weights')\n```\nThis will save the weights as a Pickle object in the specified directory. 
\nSimilarly, re-load weights saved in a given directory with:\n```python\nbayespe_classifier.load_weights(load_dir='saved_weights/ensemble_weights')\n```\n\n## Example 4: BayesPE for Few-Shot Classification\n\nIn this example we will show how to use BayesPE to combine multiple prompt instructions and\nimprove calibration of the resulting classification, similarly to example 3. However, we will\nuse BayesPE for in-context learning, providing the LLM with some labelled examples in the prompt.\n\n#### Imports\n\nGeneral imports:\n\n```python\nimport sys\nimport os\nimport pandas as pd\n```\nAdd the src directory to the path:\n```python\npath_to_package = os.path.split(os.path.split(__file__)[0])[0]\nsys.path.append(os.path.join(path_to_package, 'src'))\n```\nImport relevant classes and scripts from src:\n```python\nfrom bpe import BayesPE  # the BayesPE class\nimport evaluation  # evaluation functions\n```\n\n#### Load Data\n\nWe will be using sentiment classification of Amazon reviews\nfor appliances, where reviews are to be classified as either\npositive or negative:\n\n```python\ndf = pd.read_csv('data/amazon_reviews/test.csv', sep='\\t')  # pandas DataFrame containing text strings and integer labels\n```\n\nWe will take 100 examples for validation and 200 examples for testing.\nBoth will include text inputs and numeric ground truth labels. 
For the test set,\nthe ground-truth labels will be used for evaluation.\n```python\n# Validation set\nn_val = 100\ndf_val = df[:n_val]  # validation split\nsamples_val = df_val['text'].values  # text inputs\ngt_labels_val = df_val['ground_truth_label'].values.astype(int)  # classes outputs as integers\n# Test set\nn_test = 200\ndf_test = df[n_val:n_val+n_test]  # test split\nsamples_test = df_test['text'].values  # text inputs\ngt_labels_test = df_test['ground_truth_label'].values.astype(int)  # classes outputs as integers\n```\n\n#### Prompts and In-Context Examples\n\nWe need to make some formatting functions and wrapping text to construct our\nprompts and look for the right words at the output. These are\nspecific to the task and can be defined in a class or a separate\nscript. This class/script must have the following objects:\n```python\nclass PromptFormatting(object):\n    def __init__(self):\n        \n        # 1) an instruction sentence\n        self.INSTRUCTION = 'classify the sentiment of the Amazon review below into one of the following classes:'\n        \n        # 2) The words identifying the classes. In this case\n        # 0 = negative and 1 = positive.\n        self.CLASSES = [\n            'negative',\n            'positive'\n        ]\n        \n        # 3) The list of options that will be given to the LLM\n        # in the prompt (classes words in a numbered list)\n        self.CLASSES_TEXT = '''1. {}\n2. {}'''.format(self.CLASSES[0], self.CLASSES[1])\n\n    def format_instruction(self, instruction):\n        # 4) function which, given the instruction sentence, \n        # will put it together with the options list\n        prompt = '''{}\n{}\n'''.format(instruction, self.CLASSES_TEXT)\n        return prompt\n    \n    def format_content(self, content):\n        # 5) formatting the text to be classified with a header and\n        # the prompt to answer with one of the options. 
In this\n        # case, the inputs are reviews.\n        prompt = '''review: {}\nthe review is '''.format(content)\n        return prompt\n\nprompt_formatting = PromptFormatting()\n```\nYou can play around with the objects in the \nclass above to construct your prompts differently.\nYou can use this general format for any task.\n\nNext, we need to define the different prompt instructions we are going to ensemble with\nBayesPE. These are semantically equivalent instructions for the task at hand, stored in a list of strings. In our paper,\nwe investigated many strategies to automatically generate these. In this tutorial we will manually \ndefine them. Let's make 9:\n```python\ninstructions = [\n    'classify the sentiment of the Amazon review below into one of the following classes:',\n    'Categorize the sentiment of the Amazon review provided into one of the following classes:',\n    'Categorize the sentiment of the Amazon review provided into one of the given classes:',\n    'Determine the sentiment category of the given Amazon review by classifying it into one of the following classes:',\n    'Classify the sentiment of the given Amazon review into one of the following categories:',\n    'Assign the sentiment of the Amazon review provided to one of the given categories:',\n    'Categorize the sentiment of the provided Amazon review into one of the following classes:',\n    'Determine the sentiment category that best corresponds to the Amazon review provided amongst the following options:',\n    'Classify the sentiment expressed in the Amazon review below into one of the following categories:'\n]\n```\nEach of these will take the place of PromptFormatting.INSTRUCTION when iteratively running \nthe LLM to form the ensemble.\n\nAs we are performing classification with in-context learning, each instruction will need a\nset of labelled examples to provide to the LLM. These can be defined for each instruction in\ndifferent ways. 
In this tutorial, we are simply going to use a different set of examples for\neach instruction, taken from the rows that follow the validation and test splits. We will take 5 examples for each instruction:\n```python\nimport numpy as np\n\nn_in_context = 5  # number of in-context examples to use\nfor i in range(len(instructions)):  # for each instruction in the instructions list\n    df_in_context = df[n_val+n_test+i*n_in_context:n_val+n_test+(i+1)*n_in_context]  # take 5 in-context examples\n    samples_in_context_i = df_in_context['text'].values  # 5 text inputs\n    gt_labels_in_context_i = df_in_context['ground_truth_label'].values.astype(int)  # 5 class labels as integers\n    \n    # concatenate over the iterations to form 2D arrays of input texts and labels\n    if i == 0:\n        samples_in_context = np.expand_dims(samples_in_context_i, axis=1)\n        gt_labels_in_context = np.expand_dims(gt_labels_in_context_i, axis=1)\n    else:\n        samples_in_context = np.concatenate((samples_in_context, np.expand_dims(samples_in_context_i, axis=1)), axis=1)\n        gt_labels_in_context = np.concatenate((gt_labels_in_context, np.expand_dims(gt_labels_in_context_i, axis=1)), axis=1)\n```\nThe result is two 2D arrays, one of strings containing input texts and \none of integers containing class labels, each of size n_in_context x n_instructions.\nThis is the format in which BayesPE accepts in-context examples.\n\n#### Initialising and Optimising BayesPE\n\nWith the prompt formatting and our ensemble of instructions ready, we can initialise the BayesPE\nclassifier and optimise the ensemble weights with the validation set. 
First, we initialise\nthe BayesPE class:\n```python\nbayespe_classifier = BayesPE(\n    model_name=\"mistralai/Mistral-7B-Instruct-v0.3\",\n    prompt_formatting=prompt_formatting,\n    instructions=instructions,\n    few_shot_texts_sets=samples_in_context,\n    few_shot_labels_sets=gt_labels_in_context,\n    use_reduced_precision=True,\n)\n```\nThe BayesPE class takes as arguments the Hugging Face name of the underlying LLM to\nbe used (in this case Mistral-7B-Instruct), the prompt formatting class or script, and the list of semantically equivalent \ninstructions. As we are performing in-context learning, we have also provided the 2D arrays \n'few_shot_texts_sets' and 'few_shot_labels_sets', containing sets of text inputs and labels respectively\nfor each instruction in the ensemble. Optionally, we can define a boolean argument indicating whether to load the model\nat reduced precision for efficiency (set to 'True' in this example). There are a few additional \noptional arguments (see the docstring for details).\n\nSimilarly to the LLMClassifier class, the BayesPE class has a function that prints out what\nthe prompts will look like, so you can check that everything looks ok:\n```python\nbayespe_classifier.print_prompt_example()\n```\nThis will print an example of the prompt that will be given to the LLM, using the first instruction\nin the list and the first set of in-context examples:\n```console\nEXAMPLE 1:\nclassify the sentiment of the Amazon review below into one of the following classes:\n1. negative\n2. positive\n\nreview: This is a mixed review. .. When I got the ice maker I was in love. I LOVE ice.. and it was making ice like a champ for about one month then slowly it started making half the cubes.. then 4 cubes.. then very thin see through cubes... to none. I will however say that the company has been very receptive to my returning it to be repaired. .. returning is always a pain in the butt and it seems so that a brandy new product should not be having any problems. 
Will letchu know how the \"repair\" turns out.\nthe review is negative\n\nEXAMPLE 2:\nclassify the sentiment of the Amazon review below into one of the following classes:\n1. negative\n2. positive\n\nreview: I bought this in Feb 2016, so I have used it for a good 14 months now. The first problem is the oven does not always stay on after lighting. This is very irritating when you when you \"think\" you are pre-heating the oven and it is actually not on! Secondly, there is only one high temp burner, so forget about cooking a pot of water for pasta AND something else at the same time. Thirdly, the knobs are very cheap and easily moved so if you set an oven temperature and bump into the knob, it may no longer be set at the desired temperature. Finally, one of the burner knobs just broke. So....good luck if you buy this oven and expect to cook!\nthe review is negative\n\nEXAMPLE 3:\nclassify the sentiment of the Amazon review below into one of the following classes:\n1. negative\n2. positive\n\nreview: I got what I thought was a great deal.  It was only used a couple of months and the people \"remodeled\" so they upgraded to a larger unit. Yeah.  First the doors just don't like to be shut.  That's why GE put a buzzer on it.  Second the drain gets plugged and it is a bear to remove the freezer drawers and the interior freezer back panel to clean it out.  Why hide it behind a panel?  It's noisy, has cheap refrigerator drawers.  The only thing good is it looks nice.\n\nGE lost me as a customer for life after this.\nthe review is negative\n\nEXAMPLE 4:\nclassify the sentiment of the Amazon review below into one of the following classes:\n1. negative\n2. positive\n\nreview: Ice maker did not work. Just kept leaking water all over the floor. leaked an entire 5 gallon jug in just a few hours.\nthe review is negative\n\nEXAMPLE 5:\nclassify the sentiment of the Amazon review below into one of the following classes:\n1. negative\n2. 
positive\n\nreview: The packaging is very different from the one I bought from Home Depot.\nthe review is negative\n\nEXAMPLE 6:\nclassify the sentiment of the Amazon review below into one of the following classes:\n1. negative\n2. positive\n\nreview: \u003cSAMPLE_IN\u003e\nthe review is \u003cLABEL_OUT\u003e\n```\nIf the prompt looks ok, we can now run the LLM with all instructions on the validation set\nand optimise the BayesPE prompts' weights. This is done by simply running the following\nfunction:\n```python\nbayespe_classifier.optimise_weights(samples_val, gt_labels_val)\n```\nThe above uses the validation samples and their labels to optimise the weight\nassigned to each instruction at inference time.\n\n#### Inference with BayesPE\n\nNow that the weights are optimised, we can use BayesPE to infer class probabilities for\ntest examples. We can decide our budget of LLM forward passes, up to the number of available\ninstructions (in this case 9). BayesPE will start by using the most important instructions,\naccording to the optimised weights, and progressively work its way down. For example, if we set the number of forward\npasses to 1, BayesPE will run once with the highest-weighted instruction only. 
Let's try with 5:\n```python\noutput_probs = bayespe_classifier.forward(samples_test, n_forward_passes=5)\n```\n\"output_probs\" is a 2D n_samples x n_classes array containing the predicted class probabilities.\nWe can have a look at a few examples:\n```python\nprint(output_probs[:10, :])\n```\nThis prints something similar to the following:\n```console\n[[9.55111911e-01 4.48880816e-02]\n [9.99915070e-01 8.49220944e-05]\n [1.29251932e-05 9.99987067e-01]\n [5.33277146e-05 9.99946665e-01]\n [3.05689604e-05 9.99969424e-01]\n [3.26815948e-05 9.99967311e-01]\n [1.08215687e-05 9.99989171e-01]\n [9.18138780e-01 8.18612125e-02]\n [1.31488799e-01 8.68511194e-01]\n [1.51307649e-05 9.99984862e-01]]\n```\nEach row of this array holds the probabilities of the two classes (negative and positive)\nthat the LLM ensemble inferred for one input sample.\n\n#### Evaluate\n\nNow we can test performance, using the evaluation scripts. For example,\nwe can look at the F1-score for classification performance and the expected calibration error (ECE) for calibration:\n```python\nf1_score = evaluation.compute_metric(gt_labels_test, output_probs, metric='f1')\nece = evaluation.compute_metric(gt_labels_test, output_probs, metric='ece')\nprint('f1-score: {}, ECE: {}'.format(f1_score, ece))\n```\nThis will print something similar to:\n```console\nf1-score: 0.9368717948717948, ECE: 0.04805548116564751\n```\n\n#### Save and Re-Load the BayesPE Weights\n\nYou can save the BayesPE weights after optimising them with the following function:\n```python\nbayespe_classifier.save_weights(save_dir='saved_weights/ensemble_weights')\n```\nThis will save the weights as a pickle file in the specified directory. 
\nSimilarly, re-load weights saved in a given directory with:\n```python\nbayespe_classifier.load_weights(load_dir='saved_weights/ensemble_weights')\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famzn%2Fbayespe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Famzn%2Fbayespe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famzn%2Fbayespe/lists"}