{"id":18439446,"url":"https://github.com/idiap/nvib","last_synced_at":"2025-06-16T19:34:02.396Z","repository":{"id":144962436,"uuid":"598196444","full_name":"idiap/nvib","owner":"idiap","description":null,"archived":false,"fork":false,"pushed_at":"2025-04-15T13:20:16.000Z","size":661,"stargazers_count":8,"open_issues_count":1,"forks_count":4,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-15T14:29:16.275Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/idiap.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSES/GPL-3.0-only.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-02-06T15:52:52.000Z","updated_at":"2025-04-15T13:20:20.000Z","dependencies_parsed_at":"2024-11-06T06:27:42.433Z","dependency_job_id":"3a6d97ae-5694-4e75-b749-7fdd78718a2b","html_url":"https://github.com/idiap/nvib","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/idiap/nvib","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idiap%2Fnvib","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idiap%2Fnvib/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idiap%2Fnvib/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idiap%2Fnvib/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/idiap","download_url":"https://codeload.github.com/idiap/nvib/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/idiap%2Fnvib/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260224272,"owners_count":22977370,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T06:24:49.437Z","updated_at":"2025-06-16T19:34:02.376Z","avatar_url":"https://github.com/idiap.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"..\n.. SPDX-FileCopyrightText: Copyright © 2023 Idiap Research Institute \u003ccontact@idiap.ch\u003e\n..\n.. SPDX-FileContributor: Fabio J Fehr \u003cfabio.fehr@idiap.ch\u003e\n..\n.. SPDX-License-Identifier: GPL-3.0-only\n..\n\n================================================================================================================\nNonparametric Variational Information Bottleneck (NVIB)\n================================================================================================================\n\n.. image:: figures/nvib_denoising.png\n\n\nThe NVIB Python package contains the NVIB layer and the Denoising attention module. It accompanies the following papers:\n\n- [Tag v4.0] Coming soon! 
`Fine-Tuning Pretrained Models with NVIB for Improved Generalisation \u003chttps://openreview.net/forum?id=eX0VFgG4IS\u003e`_ (ICLR 2025)\n- [Tag v3.0] `Nonparametric Variational Regularisation of Pretrained Transformers \u003chttps://openreview.net/forum?id=Zu8OWNUC0u#discussion\u003e`_ (COLM 2024)\n- [Tag v2.0] `Learning to Abstract with Nonparametric Variational Information Bottleneck \u003chttps://openreview.net/forum?id=vU0KbvQ91x\u003e`_ (EMNLP 2023)\n- [Tag v1.0] `A Variational AutoEncoder for Transformers with Nonparametric Variational Information Bottleneck \u003chttps://openreview.net/forum?id=6QkjC_cs03X\u003e`_ (ICLR 2023)\n\nPlease cite the original authors for their work in any publication(s) that uses this work:\n\n.. code:: bib\n\n    @inproceedings{\n    fehr2025finetuning,\n    title={Fine-Tuning Pretrained Models with {NVIB} for Improved Generalisation},\n    author={Fabio James Fehr and Alina Elena Baia and Xiaoguang Chang and Andrei Catalin Coman and Karl El Hajal and Dina El Zein and Shashi Kumar and Juan Pablo Zuluaga Gomez and Andrea Cavallaro and Damien Teney and James Henderson},\n    booktitle={Workshop on Spurious Correlation and Shortcut Learning: Foundations and Solutions},\n    year={2025},\n    url={https://openreview.net/forum?id=eX0VFgG4IS}\n    }\n\n    @inproceedings{fehr2024nonparametric,\n    title={Nonparametric Variational Regularisation of Pretrained Transformers},\n    author={Fabio James Fehr and James Henderson},\n    booktitle={First Conference on Language Modeling},\n    year={2024},\n    url={https://openreview.net/forum?id=Zu8OWNUC0u}\n    }\n\n    @inproceedings{behjati2023learning,\n    title={Learning to Abstract with Nonparametric Variational Information Bottleneck},\n    author={Melika Behjati and Fabio James Fehr and James Henderson},\n    booktitle={Findings of the Association for Computational Linguistics: EMNLP 2023},\n    year={2023},\n    url={https://openreview.net/forum?id=vU0KbvQ91x}\n    }\n\n    
@inproceedings{henderson23_nvib,\n    author    = {James Henderson and Fabio James Fehr},\n    title     = {{A VAE for Transformers with Nonparametric Variational Information Bottleneck}},\n    year      = {2023},\n    booktitle = {International Conference on Learning Representations},\n    url={https://openreview.net/forum?id=6QkjC_cs03X}\n    }\n\n\n\nDescription\n------------\n\nThe NVIB project contains the NVIB layer and the Denoising attention functions for training and evaluation time.\n\n\nRequirements\n-------------\n\n- Python 3.10\n- PyTorch 2.0.0\n- pytest\n\n\nInstallation\n------------\n\nClone this repository, activate your environment, and install the package locally:\n\n.. code:: bash\n\n    git clone https://gitlab.idiap.ch/ffehr/nvib.git\n    pip install nvib/.\n\nOr use the environment.yml file to create a new conda environment:\n\n.. code:: bash\n\n    conda env create -f environment.yml\n    conda activate nvib\n\n\nTesting\n----------------\n\nWe test that the NVIB layer and the Denoising attention functions are equivalent to scaled dot-product\nattention.\n\nTo run the tests, run the following command:\n\n.. code:: bash\n\n    pytest\n\n\nProject status\n----------------\n\nDevelopment is ongoing, with the following planned:\n\n- Pretrained model implementations with NVIB and denoising attention\n- Causal self attention\n- CUDA kernels for NVIB\n\n\nPython Usage\n-------------------\n\nImport the package and its components:\n\n.. code:: python\n\n    from nvib.nvib_layer import Nvib\n\n\nFor running the following examples:\n\n.. 
code:: python\n\n    # Shared setup for the examples below\n    import torch\n    import torch.nn as nn\n    torch.manual_seed(42)\n\n    # Source length, target length, batch size, embedding size, attention heads\n    Ns, Nt, B, P, nheads = 10, 6, 2, 512, 8\n    number_samples = 3\n    encoder_output = torch.rand(B, Ns, P)\n    # Boolean padding masks: True masks the token\n    src_key_padding_mask = torch.zeros((B, Ns), dtype=torch.bool)\n    tgt = torch.rand(B, Nt, P)\n    tgt_key_padding_mask = torch.zeros((B, Nt), dtype=torch.bool)\n    memory_key_padding_mask = torch.zeros((number_samples, Ns), dtype=torch.bool)\n    device = \"cpu\"\n\n\nNonparametric Variational Information Bottleneck\n-------------------------------------------------\n\nInitialise the NVIB layer (source length = :math:`N_s`, embedding size = :math:`P`, batch size = :math:`B`).\n\n- `size_in` The embedding size input\n- `size_out` The embedding size output (typically the same)\n- `prior_mu` Prior for Gaussian means :math:`\\mu^p`\n- `prior_var` Prior for Gaussian variance :math:`(\\sigma^2)^p`\n- `prior_log_alpha` Logged prior for Dirichlet pseudo-counts :math:`\\alpha_0^p`\n- `prior_log_alpha_stdev` Logged standard deviation for the prior for Dirichlet pseudo-counts :math:`\\alpha_0^p`\n- `delta` Conditional prior :math:`\\alpha^\\Delta`\n- `kappa` Number of samples per component :math:`\\kappa^\\Delta`\n- `nheads` Number of heads for the attention module\n- `alpha_tau` Temperature parameter for the Dirichlet distribution where 0 is the posterior and 1 is the prior\n- `stdev_tau` Temperature parameter for the Gaussian standard deviation where 0 is the posterior and 1 is the prior\n- `mu_tau` Temperature parameter for the Gaussian mean where 0 is the posterior and 1 is the prior\n\n\n**Note:** The output size in training will always be :math:`(N_s+1) \\times \\kappa^\\Delta`, as it includes the prior :math:`(+1)` and draws\n:math:`\\kappa^\\Delta` samples per component in training. At evaluation time we only use the means, so the size is only :math:`N_s+1`.\n\n\n.. 
code:: python\n\n    nvib_layer = Nvib(size_in=P,\n                  size_out=P,\n                  prior_mu=None,\n                  prior_var=None,\n                  prior_log_alpha=None,\n                  prior_log_alpha_stdev=None,\n                  delta=1,\n                  kappa=1,\n                  nheads=nheads,\n                  alpha_tau=None,\n                  stdev_tau=None,\n                  mu_tau=None,\n                  )\n\nRun the forward pass of the layer with `encoder_output` of size :math:`(B, N_s, P)` and a boolean mask of size :math:`(B, N_s)`, where True masks the\ntoken. In self-attention layers we can also pass the `alpha_skip` parameter, which accumulates the :math:`\\alpha` from the previous layer:\n\n\n.. code:: python\n\n    # Initial layer\n    latent_dict_0 = nvib_layer(encoder_output, src_key_padding_mask, alpha_skip=None)\n\n    # Subsequent layers\n    latent_dict_1 = nvib_layer(encoder_output, src_key_padding_mask, alpha_skip=latent_dict_0['alpha'])\n\n\nThe dictionary returned is of the form:\n\n`{z, pi, memory_key_padding_mask, mu, logvar, alpha, avg_num_vec, avg_prop_vec, avg_alpha0}`\n\nwhere `z` is a tuple containing the `(z, pi, mu, logvar)` variables. This tuple is what is passed to\nthe `DenoisingMultiheadAttention` forward function so that it can access the parameters.\n\n- The `z` within the tuple holds the Gaussian component vectors. :math:`(B, (N_s+1) \\times \\kappa^\\Delta, P)`\n- `alpha` holds the pseudo-counts. :math:`(B, (N_s+1) \\times \\kappa^\\Delta, 1)`\n- `pi` is the Dirichlet probability reparameterised from the pseudo-counts. :math:`(B, (N_s+1) \\times \\kappa^\\Delta, 1)`\n- `mu` holds the means of the Gaussian components. :math:`(B, (N_s+1) \\times \\kappa^\\Delta, P)`\n- `logvar` is the logged variance of the Gaussian components. :math:`(B, (N_s+1) \\times \\kappa^\\Delta, P)`\n- `memory_key_padding_mask` is the encoder's boolean attention mask. 
:math:`(B, (N_s+1) \\times \\kappa^\\Delta)`\n- `avg_num_vec` is the number of non-zero pseudo-counts averaged over the batch (used for logging)\n- `avg_prop_vec` is the proportion of non-zero pseudo-counts averaged over the batch (used for logging)\n- `avg_alpha0` is the sum of pseudo-counts averaged over the batch (used for logging)\n\nSampling can be done as follows, with an integer `number_samples` (treated as the batch size) and a boolean mask of size :math:`(B, N_s)`, where\nTrue masks the token.\nThis mask is built with :math:`N_s` being the largest size you wish to sample, and the lengths can be predetermined by the user.\n\n\n.. code:: python\n\n    z = nvib_layer.sample(number_samples, memory_key_padding_mask, device)\n\n\nDenoising Attention\n---------------------\n\nDenoising attention can be used for self attention or cross attention. The forward function is the same for both.\n\n\n.. code:: python\n\n    from nvib.denoising_attention import DenoisingMultiheadAttention\n\n\nCross Attention\n===============\n\nThis duplicates and augments the `multi_head_attention_forward` function and `MultiheadAttention` class from PyTorch.\n\n.. code:: python\n\n    decoder_layer = nn.TransformerDecoderLayer(d_model=P,\n                                            dim_feedforward=4*P,\n                                            nhead=nheads,\n                                            dropout=0.1,\n                                            batch_first=True\n                                            )\n\n    transformer_decoder = nn.TransformerDecoder(decoder_layer,\n                                                num_layers=nheads)\n\nReplace the cross-attention module of each decoder layer, which interfaces the encoder and decoder, with Denoising Attention:\n\n\n.. 
code:: python\n\n    # Swap the cross-attention module of every decoder layer\n    for layer in transformer_decoder.layers:\n        layer.multihead_attn = DenoisingMultiheadAttention(embed_dim=P,\n                                                        num_heads=nheads,\n                                                        dropout=0.1,\n                                                        bias=False,\n                                                        batch_first=True\n                                                        )\n\n\nNow run the forward pass of this decoder.\n\n**Note:** It assumes that the keys and values from the encoder output are a\ntuple `(z, pi, mu, logvar)` where the `z` within the tuple was the original input.\n\n\n.. code:: python\n\n    output = transformer_decoder(tgt=tgt,\n                                memory=latent_dict_0[\"z\"],\n                                tgt_key_padding_mask=tgt_key_padding_mask,\n                                memory_key_padding_mask=latent_dict_0[\"memory_key_padding_mask\"])\n\n\nSelf Attention\n===============\n\nHere is a visualisation of a self attention layer with the NVIB layer. The embeddings first pass through the NVIB layer and then the denoising attention layer\nwithin each transformer block.\n\n.. image:: figures/NVIBSaTransformer.png\n\n**Note:** The query comes from our original output, and the key and value come from the NVIB layer. This maintains the idea of query denoising in self attention.\n\n\nKL functions\n--------------\n\nA simple implementation of the KL divergence between univariate Gaussian tensors, augmented with weights from our\npseudo-counts :math:`\\alpha` (see paper for more details).\n\n.. code:: python\n\n    kl_g = nvib_layer.kl_gaussian(**latent_dict_0)\n\nwhere `mu`, `logvar`, `alpha` and the `memory_key_padding_mask` come from the NVIB layer's latent dictionary, and the priors and number of\nsamples :math:`\\kappa^\\Delta` are set. 
The output is a KL loss of dimension (B).\n\nThe KL divergence between Dirichlet components (see paper for more details).\n\n.. code:: python\n\n    kl_d = nvib_layer.kl_dirichlet(**latent_dict_0)\n\nwhere `alpha` and the `memory_key_padding_mask` come from the NVIB layer's latent dictionary, and the priors and number of\nsamples :math:`\\kappa^\\Delta` are set. The output is a KL loss of dimension (B).\n\n\nRepository Structure\n-----------------------------\n\n.. code:: bash\n\n    .\n    ├── figures\n    │   ├── nvib_denoising.png\n    │   └── NVIBSaTransformer.png\n    ├── LICENSE\n    ├── nvib\n    │   ├── __init__.py\n    │   ├── denoising_attention.py\n    │   └── nvib_layer.py\n    ├── README.rst\n    ├── setup.py\n    └── tests\n        ├── __init__.py\n        ├── test_denoising_attention.py\n        ├── test_nvib_layer.py\n        ├── test_memory_and_compute.py\n        ├── test_matrix_multiplication.py\n        └── test_speed_memory.py\n\n\nContact\n---------\n\nFor questions or to report issues with this software package, kindly contact the author_.\n\n.. _author: fabio.fehr@idiap.ch\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fidiap%2Fnvib","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fidiap%2Fnvib","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fidiap%2Fnvib/lists"}