{"id":18358397,"url":"https://github.com/wiseodd/lapeft-bayesopt","last_synced_at":"2025-04-06T13:31:34.818Z","repository":{"id":221459003,"uuid":"748397200","full_name":"wiseodd/lapeft-bayesopt","owner":"wiseodd","description":"Discrete Bayesian optimization with LLMs, PEFT finetuning methods, and the Laplace approximation.","archived":false,"fork":false,"pushed_at":"2024-07-30T14:03:19.000Z","size":5088,"stargazers_count":17,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-04T14:49:09.420Z","etag":null,"topics":["bayesian-optimization","laplace-approximation","llm","peft"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wiseodd.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-25T22:03:00.000Z","updated_at":"2025-03-14T08:45:22.000Z","dependencies_parsed_at":null,"dependency_job_id":"eb96f0e7-9379-4a1b-a92f-891e52c8eac1","html_url":"https://github.com/wiseodd/lapeft-bayesopt","commit_stats":null,"previous_names":["wiseodd/lapeft-bayesopt"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wiseodd%2Flapeft-bayesopt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wiseodd%2Flapeft-bayesopt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wiseodd%2Flapeft-bayesopt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wiseodd%2Flapeft-baye
sopt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wiseodd","download_url":"https://codeload.github.com/wiseodd/lapeft-bayesopt/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247488488,"owners_count":20946954,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bayesian-optimization","laplace-approximation","llm","peft"],"created_at":"2024-11-05T22:17:44.048Z","updated_at":"2025-04-06T13:31:31.762Z","avatar_url":"https://github.com/wiseodd.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# `lapeft-bayesopt`: Discrete Bayesian Optimization with LLM + PEFT + Laplace Approximation\n\nThis is the accompanying library of the paper [_A Sober Look at LLMs for Material Discovery: Are They Actually Good for Bayesian Optimization Over Molecules?_](https://arxiv.org/abs/2402.05015).\n\n\u003e [!TIP]\n\u003e If you are looking for the experiment code, check out the sister repo: \u003chttps://github.com/wiseodd/llm-chem-sober-look\u003e.\n\n\u003e [!IMPORTANT]\n\u003e If you use this library, please cite using the following bib entry.\n\n```\n@inproceedings{kristiadi2024sober,\n  title={A Sober Look at {LLMs} for Material Discovery: {A}re They Actually Good for {B}ayesian Optimization Over Molecules?},\n  author={Kristiadi, Agustinus and Strieth-Kalthoff, Felix and Skreta, Marta and Poupart, Pascal and Aspuru-Guzik, Al\\'{a}n and Pleiss, Geoff},\n  booktitle={ICML},\n  year={2024}\n}\n```\n\n## Table of Contents\n\n1. [Setup](#setup)\n2. 
[Warmup: Using LLMs as _fixed_ feature extractors](#fixed-feature)\n3. [Using finetuned LLMs as surrogates](#finetuning)\n\n\u003ca id=\"setup\"\u003e\u003c/a\u003e\n\n## Setup\n\n\u003e [!IMPORTANT]\n\u003e Note that the ordering below is important.\n\n1. Install PyTorch (with CUDA; version 2+ is supported): \u003chttps://pytorch.org/get-started/locally/\u003e\n2. Install laplace-torch (not from pip!): `pip install git+https://github.com/aleximmer/Laplace.git@0.2`\n3. Clone and install this repo:\n\n```\ngit clone git@github.com:wiseodd/lapeft-bayesopt.git\ncd lapeft-bayesopt\npip install -e .\n```\n\n\u003ca id=\"fixed-feature\"\u003e\u003c/a\u003e\n\n## Warmup: Using LLMs as _fixed_ feature extractors\n\n**Full example:** `examples/run_fixed_features.py`\n\nThe simplest way to incorporate LLMs into BO surrogates is by viewing them as fixed feature extractors.\nGiven a data point $x \\in \\mathcal{D}\\_\\mathrm{cand}$ from the pool of candidates $\\mathcal{D}\\_\\mathrm{cand} = \\\\{x_1, \\dots, x_n\\\\}$ in which we want to find the best one, we wrap it in a textual prompt $c(x)$, do a forward pass through the LLM, and take the last transformer embedding, which has shape `(seq_len, embd_dim)`.\nThen, we aggregate it, e.g. by averaging over the sequence dimension, to get a feature vector $h(x)$ with shape `(embd_dim,)`.\nDoing this for all $x$'s, we can then do the usual discrete BO loop with standard surrogate functions like GPs or Bayesian NNs over the candidates $\\mathcal{D}_\\mathrm{cand}$.\n\nThis package provides an easy way to do the transformation from $x$ to $h(x)$.\nHere are the steps:\n\n1. We assume that your dataset is a pandas dataframe. Inherit `lapeft_bayesopt.problems.DataProcessor`. 
Example (from `examples/data_processor.py`):\n\n```python\nfrom typing import List\n\nfrom lapeft_bayesopt.problems import DataProcessor\n\nclass RedoxDataProcessor(DataProcessor):\n    \"\"\"\n    Pandas dataframe spec:\n    ----------------------\n    RangeIndex: 1407 entries, 0 to 1406\n    Data columns (total 7 columns):\n    #   Column                 Non-Null Count  Dtype\n    --  ------                 --------------  -----\n    0   Entry Number           1407 non-null   int64\n    1   File Name              1407 non-null   object\n    2   SMILES                 1407 non-null   object\n    3   Ered                   1407 non-null   float64\n    4   HOMO                   1407 non-null   float64\n    5   Gsol                   1407 non-null   float64\n    6   Absorption Wavelength  1407 non-null   float64\n    dtypes: float64(4), int64(1), object(2)\n    memory usage: 77.1+ KB\n\n    Objective: Minimize Ered over a list of molecules in SMILES.\n    \"\"\"\n    def __init__(self, prompt_builder, tokenizer):\n        # num_outputs = 1 since this is a single-objective problem\n        super().__init__(prompt_builder=prompt_builder, num_outputs=1, tokenizer=tokenizer)\n\n        # We must specify the four properties below\n        # -----------------------------------------\n        # `x_col` is the column name of your x\n        self.x_col = 'SMILES'\n        # `target_col` is the pandas column name of the property that we want to optimize\n        self.target_col = 'Ered'\n        # `obj_str` is the textual description of that property (useful for prompting the LLM later)\n        self.obj_str = 'redox potential'\n        # `maximization` is whether we want to maximize or minimize the property\n        self.maximization = False\n\n    def _get_columns_to_remove(self) -\u003e List[str]:\n        # List all columns of your dataset! 
We will remove all of them after we preprocess the dataset using Huggingface (we only need the resulting `input_ids` and `labels`)\n        return ['Entry Number', 'File Name', 'SMILES', 'HOMO', 'Ered', 'Gsol', 'Absorption Wavelength']\n```\n\n2. Next, we create prompting schemes $c(x)$. Here's an example (from `examples/prompting.py`):\n\n```python\nfrom lapeft_bayesopt.problems.prompting import PromptBuilder\n\nclass MyPromptBuilder(PromptBuilder):\n    def __init__(self, kind: str):\n        self.kind = kind\n\n    def get_prompt(self, smiles_str: str, obj_str: str) -\u003e str:\n        if self.kind == 'completion':\n            return f'The estimated {obj_str} of the molecule {smiles_str} is: '\n        elif self.kind == 'just-smiles':\n            return smiles_str\n        else:\n            # Unknown prompt kind: raise instead of returning the exception class\n            raise NotImplementedError\n```\n\n3. Then, we need the LLM feature extractor itself. This package has some ready-made ones, e.g., `lapeft_bayesopt.foundation_models.t5.T5Regressor`. Feel free to follow that example to create your own.\n\n4. Then, we can start extracting the LLM features from $\\mathcal{D}_\\mathrm{cand}$. See the `load_features` method in `examples/run_fixed_features.py`.\n\n5. Finally, we can do the discrete BO loop using those features (cache provided in `examples/data/cache`). See `examples/run_fixed_features.py` for a complete, self-contained example. Note that, at this point, we can use any BO algorithm and surrogate function. E.g. 
we can just use BoTorch for the BO loop.\n\n\u003ca id=\"finetuning\"\u003e\u003c/a\u003e\n\n## Using finetuned LLMs as surrogates\n\n**Note:** Please check the previous section first since we will reuse some objects here.\n\n**Full example:** `examples/run_finetuning.py`\n\nWe can go one step further by making $h(x)$ trainable.\nThis can be done by attaching a PEFT method (LoRA, PrefixTuning, Adapter, etc) to the frozen LLM.\nThen, we train and do a Laplace approximation on the PEFT's and regression head's weights.\n\nTo do this, we can use the surrogates provided in `lapeft_bayesopt.surrogates`.\nCurrently LoRA is supported and it is very easy to support other PEFT methods, using Huggingface's `peft` library.\n\nSee `examples/run_finetuning.py` for an example. It's actually quite simple to use!\n\n1. First, define the base PEFT-infused LLM in a function so that it is freshly initialized at each call. (Useful since at each BO iteration, the surrogate model is retrained.)\n\n```python\ndef get_model():\n    # Load a foundation model with a regression head attached\n    model = T5Regressor(\n        kind='GT4SD/multitask-text-and-chemistry-t5-base-augm',\n        tokenizer=tokenizer\n    )\n\n    # Attach LoRA or any other PEFT on the foundation model\n    target_modules = ['q', 'v']\n    config = LoraConfig(\n        r=4,\n        lora_alpha=16,\n        target_modules=target_modules,\n        lora_dropout=0.1,\n        bias='none',\n        # This is necessary so that the regression head is also trained\n        modules_to_save=['head'],\n    )\n    lora_model = get_peft_model(model, config)\n\n    # For some reason HF's peft duplicates the head. So we need to \"detach\" the one that is unused\n    for p in lora_model.base_model.head.original_module.parameters():\n        p.requires_grad = False\n\n    return lora_model\n```\n\n2. Then, we can configure the training and the Laplace approximation of this PEFT surrogate. 
For full options, check out `lapeft_bayesopt.utils.configs.LaplaceConfig`.\n\n```python\n# Config for the Laplace approx over PEFT\nconfig = LaplaceConfig(\n    noise_var=0.001,\n    hess_factorization='kron',\n    subset_of_weights='all',\n    marglik_mode='posthoc',\n    prior_prec_structure='layerwise'\n)\n```\n\n3. We then initialize a training dataset, e.g. via random sampling from the candidate set (a pandas dataframe). The training dataset should be a list of pandas rows or dicts. Don't forget to remove each sampled row from the original dataset $\\mathcal{D}_\\mathrm{cand}$. We can use `lapeft_bayesopt.utils.helpers.pop_df()` to do so.\n\n```python\ndataset_train = []\nwhile len(dataset_train) \u003c n_init_data:\n    idx = np.random.randint(len(pd_dataset))\n    # Make sure that the optimum is not included\n    if pd_dataset.loc[idx][OBJ_COL] \u003e= ground_truth_max:\n        continue\n    dataset_train.append(helpers.pop_df(pd_dataset, idx))\n```\n\n4. Next, create the surrogate.\n\n```python\n# Create the surrogate model based on the LLM+PEFT regressor above\nmodel = LAPEFTBayesOptLoRA(\n    get_model, dataset_train, data_processor, laplace_config=config\n)\n```\n\n5. At each BO iteration, we preprocess the candidate $x$'s and infer the posterior mean and variance of the Laplace approximation over PEFT (LAPEFT!) to compute the acquisition function. Here we use approximate Thompson sampling, but other acquisition functions such as EI can of course be used.\n\n```python\n# Preprocess D_cand (`pd_dataset`) so that we can make predictions over it\ndataloader = data_processor.get_dataloader(pd_dataset, batch_size=16, shuffle=False)\n\n# Make prediction over D_cand, get means and vars, compute the acqf\nacq_vals = []\nfor data in dataloader:\n    posterior = model.posterior(data)\n    f_mean, f_var = posterior.mean, posterior.variance\n    acq_vals.append(thompson_sampling(f_mean, f_var))\nacq_vals = torch.cat(acq_vals, dim=0).cpu().squeeze()\n```\n\n6. 
We pick the $x$ that maximizes the acquisition function and remove it from the candidate set. That $x$ is represented by a pandas row that we popped from the candidate set (a pandas dataframe).\n\n```python\n# Pick an x (a row in the current pandas dataset) that maximizes the acquisition\nidx_best = torch.argmax(acq_vals).item()\nnew_data = helpers.pop_df(pd_dataset, idx_best)\n\n# Update the current best y\nif new_data[OBJ_COL] \u003e best_y:\n    best_y = new_data[OBJ_COL]\n```\n\n7. Finally, we just feed this new data point to the surrogate object. It will append it to the training set and retrain the surrogate for the next iteration.\n\n```python\n# Update surrogate using the new data point\nmodel = model.condition_on_observations(new_data)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwiseodd%2Flapeft-bayesopt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwiseodd%2Flapeft-bayesopt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwiseodd%2Flapeft-bayesopt/lists"}