{"id":15066194,"url":"https://github.com/krishnap25/mauve","last_synced_at":"2025-04-12T21:31:15.485Z","repository":{"id":43215933,"uuid":"339517981","full_name":"krishnap25/mauve","owner":"krishnap25","description":"Package to compute Mauve, a similarity score between neural text and human text. Install with `pip install mauve-text`.","archived":false,"fork":false,"pushed_at":"2024-07-12T06:52:55.000Z","size":4571,"stargazers_count":286,"open_issues_count":1,"forks_count":25,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-04T01:09:41.633Z","etag":null,"topics":["deep-learning","huggingface-transformers","nlp","pytorch","text-generation"],"latest_commit_sha":null,"homepage":"https://krishnap25.github.io/mauve/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/krishnap25.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-02-16T20:10:39.000Z","updated_at":"2025-04-01T13:54:21.000Z","dependencies_parsed_at":"2022-08-20T13:30:59.004Z","dependency_job_id":"489a5bbc-4add-42d4-97b1-0e1fccc21c84","html_url":"https://github.com/krishnap25/mauve","commit_stats":{"total_commits":17,"total_committers":4,"mean_commits":4.25,"dds":"0.23529411764705888","last_synced_commit":"91ae06d8c28e5a200bcc83bb444c29fd5ed9bdf7"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishnap25%2Fmauve","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishnap25%2Fmauve/tags","releases_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishnap25%2Fmauve/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishnap25%2Fmauve/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/krishnap25","download_url":"https://codeload.github.com/krishnap25/mauve/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248634970,"owners_count":21137152,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","huggingface-transformers","nlp","pytorch","text-generation"],"created_at":"2024-09-25T01:03:30.252Z","updated_at":"2025-04-12T21:31:15.416Z","avatar_url":"https://github.com/krishnap25.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# MAUVE\n\nThis is a library built on PyTorch and HuggingFace Transformers to measure the gap between neural text and human text\nwith the MAUVE measure, \nintroduced  in [this NeurIPS 2021 paper](https://arxiv.org/pdf/2102.01454.pdf) (Outstanding Paper Award) and [this JMLR 2023 paper](https://arxiv.org/pdf/2212.14578.pdf).\n\n\nMAUVE is a measure of the gap between neural text and human text. It is computed using the Kullback–Leibler (KL) divergences between the two text distributions in a quantized embedding space of a large language model. 
MAUVE can identify differences in quality arising from model sizes and decoding algorithms.\n\n### [Documentation Link](https://krishnap25.github.io/mauve/)\n\n### _New: MAUVE is available via [HuggingFace Evaluate](https://huggingface.co/spaces/evaluate-metric/mauve)!_\n\n\n**Features**:\n- MAUVE with quantization using *k*-means. \n- Adaptive selection of *k*-means hyperparameters. \n- Compute MAUVE using pre-computed GPT-2 features (i.e., terminal hidden state), \n    or featurize raw text using HuggingFace transformers + PyTorch.\n- MAUVE can also be used for other modalities (e.g. images or audio): pass in pre-computed feature embeddings to our API.\n\nFurther details can be found below.\n\nFor scripts to reproduce the experiments in the paper, please see \n[this repository](https://github.com/krishnap25/mauve-experiments).\n\n## Installation\n\nFor a direct install, run this command from your terminal:\n```\npip install mauve-text\n``` \nIf you wish to edit or contribute to MAUVE, you should install from source\n```\ngit clone git@github.com:krishnap25/mauve.git\ncd mauve\npip install -e .\n``` \nSome functionality requires more packages. Please see the requirements below.\n\n## Requirements\nThe installation command above installs the main requirements, which are:\n- `numpy\u003e=1.18.1`\n- `scikit-learn\u003e=0.22.1`\n- `faiss-cpu\u003e=1.7.0`\n- `tqdm\u003e=4.40.0`\n\nIn addition, if you wish to use featurization within MAUVE, you need to manually install:\n- `torch\u003e=1.1.0`: [Instructions](https://pytorch.org/get-started/locally/)\n- `transformers\u003e=3.2.0`:  Simply run `pip install transformers` after PyTorch has been installed \n    ([Detailed Instructions](https://huggingface.co/transformers/installation.html))\n\n\n\n## Quick Start\nLet `p_text` and `q_text` each be a list of strings, where each string is a complete generation (including context). 
\nFor best practice, MAUVE needs at least a few thousand generations each for `p_text` and `q_text`\n(the paper uses 5000 each).\nFor our demo, we use 100 generations each for fast running time.\n\nTo demonstrate the functionalities of this package on some real data, \nthis repository provides some functionalities to\ndownload and use sample data in the `./examples` folder\n(these are not a part of the MAUVE package, you need to clone the repository for these).\n\nLet us download some Amazon product reviews as well as\nmachine generations, provided by the \n[GPT-2 output dataset repo](https://github.com/openai/gpt-2-output-dataset)\n by running this command in our shell (downloads ~17M in size):\n```bash\npython examples/download_gpt2_dataset.py\n\n```\nThe data is downloaded into the `./data` folder. \nWe can load the data (100 samples out of the available 5000) in Python as \n```python\nfrom examples import load_gpt2_dataset\np_text = load_gpt2_dataset('data/amazon.valid.jsonl', num_examples=100) # human\nq_text = load_gpt2_dataset('data/amazon-xl-1542M.valid.jsonl', num_examples=100) # machine\n```\n\nWe can now compute MAUVE as follows\n(note that this requires installation of [PyTorch](https://pytorch.org) \nand HF [Transformers](https://huggingface.co/transformers)). \n```python\nimport mauve \n\n# call mauve.compute_mauve using raw text on GPU 0; each generation is truncated to 256 tokens\nout = mauve.compute_mauve(p_text=p_text, q_text=q_text, device_id=0, max_text_length=256, verbose=False)\nprint(out.mauve) # prints 0.9917\n```\nThis first downloads the GPT-2 large tokenizer and pre-trained model (if you do not have them downloaded already). \nEven if you have the model offline, it takes up to 30 seconds to load the model the first time. \n`out` now contains the fields:\n- `out.mauve`: MAUVE score, a number between 0 and 1. Larger values indicate that P and Q are closer.\n- `out.frontier_integral`: Frontier Integral, a number between 0 and 1. 
Smaller values indicate that P and Q are closer.\n- `out.mauve_star` and `out.frontier_integral_star`: their corresponding versions computed with Krichevsky-Trofimov smoothing. See [this JMLR 2023 paper](https://arxiv.org/pdf/2212.14578.pdf) on why this could be preferable.\n- `out.divergence_curve`: a `numpy.ndarray` of shape (m, 2); plot it with matplotlib to view the divergence curve\n- `out.p_hist`: a discrete distribution, which is a quantized version of the text distribution `p_text`\n- `out.q_hist`: same as above, but with `q_text`  \n\nYou can plot the divergence curve using\n```python\n# Make sure matplotlib is installed in your environment\nimport matplotlib.pyplot as plt  \nplt.plot(out.divergence_curve[:, 1], out.divergence_curve[:, 0])\n```\n\n## Other Ways of Using MAUVE \nFor each text (in both `p_text` and `q_text`), \nMAUVE internally uses the terminal hidden state from GPT-2 large as a feature representation. Of course, more recent LLMs can also be used. Generally, the better the feature embeddings, the better the performance of MAUVE.\n\nThere are multiple ways to use this package. 
For instance, you can use cached hidden states directly\n(this does not require PyTorch and HF Transformers to be installed): \n```python\n# call mauve.compute_mauve using features obtained directly\n# p_feats and q_feats are `np.ndarray`s of shape (n, dim)\n# we use a synthetic example here\nimport numpy as np\np_feats = np.random.randn(100, 1024)  # feature dimension = 1024\nq_feats = np.random.randn(100, 1024)\nout = mauve.compute_mauve(p_features=p_feats, q_features=q_feats)\n```\nNote that this API can be used to evaluate other modalities such as images or audio with MAUVE.\n\n\nYou can also compute MAUVE using the tokenized (BPE) representation using the GPT-2 vocabulary \n(e.g., obtained from using an explicit call to `transformers.GPT2Tokenizer`).\n```python\n# call mauve.compute_mauve using tokens on GPU 1\n# p_toks, q_toks are each a list of LongTensors of shape [1, length]\n# we use synthetic examples here\nimport torch\np_toks = [torch.LongTensor(np.random.choice(50257, size=(1, 32), replace=True)) for _ in range(100)]\nq_toks = [torch.LongTensor(np.random.choice(50257, size=(1, 32), replace=True)) for _ in range(100)]\nout = mauve.compute_mauve(p_tokens=p_toks, q_tokens=q_toks, device_id=1, max_text_length=1024)\n```\nTo view the progress messages, pass in the argument `verbose=True` to `mauve.compute_mauve`.\nYou can also use different forms as inputs for `p` and `q`, e.g., \n`p` via `p_text` and `q` via `q_features`. 
\n\n## Available Options\n`mauve.compute_mauve` takes the following arguments\n- `p_features`: `numpy.ndarray` of shape (n, d), where n is the number of generations\n- `q_features`: `numpy.ndarray` of shape (n, d), where n is the number of generations\n- `p_tokens`: list of length n, each entry is torch.LongTensor of shape (1, length); length can vary between generations\n- `q_tokens`: list of length n, each entry is torch.LongTensor of shape (1, length); length can vary between generations\n- `p_text`: list of length n, each entry is a string\n- `q_text`: list of length n, each entry is a string\n- `num_buckets`: the size of the histogram to quantize P and Q. Options: 'auto' (default) or an integer\n- `pca_max_data`: the number of data points to use for PCA dimensionality reduction prior to clustering. If `-1`, use all the data. Default -1\n- `kmeans_explained_var`: amount of variance of the data to keep in dimensionality reduction by PCA. Default 0.9\n- `kmeans_num_redo`: number of times to redo k-means clustering (the best objective is kept). Default 5\n- `kmeans_max_iter`: maximum number of k-means iterations. Default 500\n- `featurize_model_name`: name of the model from which features are obtained. Default `'gpt2-large'`\n    Use one of `['gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl']`.\n- `device_id`: Device for featurization. Supply a GPU id (e.g. 0 or 3) to use GPU. If no GPU with this id is found, use CPU\n- `max_text_length`: maximum number of tokens to consider. Default 1024\n- `divergence_curve_discretization_size`: Number of points to consider on the divergence curve. Default 25\n- `mauve_scaling_factor`: \"c\" from the paper. 
Default 5.\n- `verbose`: If True (default), print running time updates\n- `seed`: random seed to initialize *k*-means cluster assignments.\n- `batch_size`: Batch size for feature extraction.\n\nNote: `p` and `q` can be of different lengths, but it is\nrecommended that they are the same length.\n\n## Contact\nThe best way to contact the authors in case of any questions or clarifications (about the package or the paper) is by raising an issue on GitHub.\nWe are not able to respond to queries over email. \n\n## Contributing\nIf you find any bugs, please raise an issue on GitHub. \nIf you would like to contribute, please submit a pull request.\nWe encourage and highly value community contributions.\n\nSome features which would be good to have are:\n- featurization in HuggingFace Transformers with a JAX backend.\n    \n## Best Practices for MAUVE\nMAUVE is quite different from most metrics in common use, so here are a few guidelines on proper usage of MAUVE:\n1. *Relative comparisons*: \n    - We find that MAUVE is best suited for relative comparisons while \n    the absolute MAUVE score is less meaningful. \n    - For instance if we wish to find which of `model1` and `model2` are better at generating \n    the human distribution, we can compare `MAUVE(text_model1, text_human)` and `MAUVE(text_model2, text_human)`.\n    - The absolute number  `MAUVE(text_model1, text_human)` can vary based on the hyperparameters selected below, \n        but the relative trends remain the same.\n    - One must ensure that the hyperparameters are exactly the same for \n        the MAUVE scores under comparison.\n    - Some hyperparameters are described below. \n2. *Number of generations*: \n    - MAUVE computes the similarity between two *distributions*. \n    - Therefore, each distribution must contain at least\n    a few thousand samples (we use 5000 each). 
MAUVE with a smaller number of samples is biased towards optimism\n    (that is, MAUVE typically goes down as the number of samples increases) \n    and exhibits a larger standard deviation between runs.\n3. *Number of clusters (discretization size)*: \n    - We take `num_buckets` to be 0.1 * the number of samples. \n    - The performance of MAUVE is quite robust to this, provided the number of generations is not too small. \n4. *MAUVE is too large or too small*:\n    - The parameter `mauve_scaling_factor` controls the absolute value of the MAUVE score,\n        without changing the relative ordering between various methods. \n        The main purpose of this parameter is to help with interpretability.  \n    - If you find that all your methods get a very high MAUVE score (e.g., 0.995, 0.994),\n        try increasing the value of `mauve_scaling_factor`.\n        (note: this also increases the per-run standard deviation of MAUVE). \n    - If you find that all your methods get a very low MAUVE score (e.g. \u003c 0.4), then \n        try decreasing the value of `mauve_scaling_factor`.\n5. *MAUVE takes too long to run*: \n    - You can also try reducing the number of clusters using the argument `num_buckets`. The\n        clustering algorithm's run time scales as the square of the number of clusters. \n        Once the number of clusters exceeds 500, the clustering really starts to slow down. \n        In this case, it could be helpful to set the number of clusters to 500\n        by overriding the default (which is `num_data_points / 10`, so use this when the number of \n        samples for each of p and q is over 5000).\n    - In this case, try reducing the clustering hyperparameters: \n        set `kmeans_num_redo` to `1`, and if this does not work, `kmeans_max_iter` to `100`.\n        This enables the clustering to run faster at the cost of returning a worse clustering. \n        \n6. 
**MAUVE's variance is large relative to the differences we try to quantify**: \n    - We observed that it is quite easy to capture basic errors with MAUVE but much harder to quantify subtle errors (e.g., when trying to improve over nucleus sampling).\n    - To measure subtle differences with confidence, the best solution is to use better embeddings, if you have access to them.\n    - You might also want to consider more random runs to reduce the variance: more k-means seeds (cheapest in terms of compute), more generation seeds (for sampling-based algorithms), or a larger number of text samples.\n    \n## Citation\nIf you find this package useful, or you use it in your research, please cite the following papers:\n```\n@article{pillutla-etal:mauve:jmlr2023,\n  title={{MAUVE Scores for Generative Models: Theory and Practice}},\n  author={Pillutla, Krishna and Liu, Lang and Thickstun, John and Welleck, Sean and Swayamdipta, Swabha and Zellers, Rowan and Oh, Sewoong and Choi, Yejin and Harchaoui, Zaid},\n  journal={JMLR},\n  year={2023}\n}\n\n@inproceedings{pillutla-etal:mauve:neurips2021,\n  title={MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers},\n  author={Pillutla, Krishna and Swayamdipta, Swabha and Zellers, Rowan and Thickstun, John and Welleck, Sean and Choi, Yejin and Harchaoui, Zaid},\n  booktitle = {NeurIPS},\n  year      = {2021}\n}\n\n@inproceedings{liu-etal:mauve-theory:neurips2021,\n  title={{Divergence Frontiers for Generative Models: Sample Complexity, Quantization Effects, and Frontier Integrals}},\n  author={Liu, Lang and Pillutla, Krishna and Welleck, Sean and Oh, Sewoong and Choi, Yejin and Harchaoui, Zaid},\n  booktitle={NeurIPS},\n  year={2021}\n}\n\n```\n    \n## Acknowledgements\nThis work was supported by NSF DMS-2134012, NSF CCF-2019844, NSF DMS-2023166, the DARPA MCS program through NIWC Pacific (N66001-19-2-4031), the CIFAR \"Learning in Machines \u0026 Brains\" program, a Qualcomm 
Innovation Fellowship, and faculty research awards.\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkrishnap25%2Fmauve","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkrishnap25%2Fmauve","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkrishnap25%2Fmauve/lists"}