{"id":30762856,"url":"https://github.com/interactiveaudiolab/ppgs","last_synced_at":"2025-09-04T15:50:03.233Z","repository":{"id":224871966,"uuid":"523825813","full_name":"interactiveaudiolab/ppgs","owner":"interactiveaudiolab","description":"High-Fidelity Neural Phonetic Posteriorgrams","archived":false,"fork":false,"pushed_at":"2025-02-23T04:31:17.000Z","size":103380,"stargazers_count":112,"open_issues_count":2,"forks_count":8,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-07-01T23:36:35.258Z","etag":null,"topics":["distance","intelligibility","phonemes","posteriorgram","pronunciation","speech"],"latest_commit_sha":null,"homepage":"https://maxrmorrison.com/sites/ppgs","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/interactiveaudiolab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-08-11T18:18:41.000Z","updated_at":"2025-05-12T02:00:11.000Z","dependencies_parsed_at":"2024-03-14T14:50:08.027Z","dependency_job_id":"64fd33de-72da-446e-8595-f3ba3ca7cf98","html_url":"https://github.com/interactiveaudiolab/ppgs","commit_stats":null,"previous_names":["interactiveaudiolab/ppgs"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/interactiveaudiolab/ppgs","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/interactiveaudiolab%2Fppgs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/interactiveaudiolab%2Fppgs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/interactiveaudiolab%2Fppgs/rel
eases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/interactiveaudiolab%2Fppgs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/interactiveaudiolab","download_url":"https://codeload.github.com/interactiveaudiolab/ppgs/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/interactiveaudiolab%2Fppgs/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273633414,"owners_count":25140774,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-04T02:00:08.968Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["distance","intelligibility","phonemes","posteriorgram","pronunciation","speech"],"created_at":"2025-09-04T15:49:57.943Z","updated_at":"2025-09-04T15:50:03.214Z","avatar_url":"https://github.com/interactiveaudiolab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003eHigh-Fidelity Neural Phonetic Posteriorgrams\u003c/h1\u003e\n\u003cdiv align=\"center\"\u003e\n\n[![PyPI](https://img.shields.io/pypi/v/ppgs.svg)](https://pypi.python.org/pypi/ppgs)\n[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)\n[![Downloads](https://static.pepy.tech/badge/ppgs)](https://pepy.tech/project/ppgs)\n\nTraining, evaluation, and 
inference of neural phonetic posteriorgrams (PPGs) in PyTorch\n\n[[Paper]](https://www.maxrmorrison.com/pdfs/churchwell2024high.pdf) [[Website]](https://www.maxrmorrison.com/sites/ppgs/)\n\u003c/div\u003e\n\n\n## Table of contents\n\n- [Installation](#installation)\n- [Inference](#inference)\n    * [Application programming interface (API)](#application-programming-interface-api)\n        * [`ppgs.from_audio`](#ppgsfrom_audio)\n        * [`ppgs.from_file`](#ppgsfrom_file)\n        * [`ppgs.from_file_to_file`](#ppgsfrom_file_to_file)\n        * [`ppgs.from_files_to_files`](#ppgsfrom_files_to_files)\n    * [Command-line interface (CLI)](#command-line-interface-cli)\n- [Distance](#distance)\n- [Interpolate](#interpolate)\n- [Edit](#edit)\n    * [`ppgs.edit.grid.constant`](#ppgseditgridconstant)\n    * [`ppgs.edit.grid.from_alignments`](#ppgseditgridfrom_alignments)\n    * [`ppgs.edit.grid.of_length`](#ppgseditgridof_length)\n    * [`ppgs.edit.grid.sample`](#ppgseditgridsample)\n    * [`ppgs.edit.reallocate`](#ppgseditreallocate)\n    * [`ppgs.edit.regex`](#ppgseditregex)\n    * [`ppgs.edit.shift`](#ppgseditshift)\n    * [`ppgs.edit.swap`](#ppgseditswap)\n- [Sparsify](#sparsify)\n- [Training](#training)\n    * [Download](#download)\n    * [Preprocess](#preprocess)\n    * [Partition](#partition)\n    * [Train](#train)\n    * [Monitor](#monitor)\n    * [Evaluate](#evaluate)\n- [Citation](#citation)\n\n\n## Installation\n\nAn inference-only installation with our best model is pip-installable:\n\n`pip install ppgs`\n\nTo perform training, install the training dependencies and FFmpeg.\n\n```bash\npip install ppgs[train]\nconda install -c conda-forge ffmpeg\n```\n\nIf you wish to use the Charsiu representation, download the code,\ninstall both inference and training dependencies, and install\nCharsiu as a Git submodule.\n\n```bash\n# Clone\ngit clone git@github.com:interactiveaudiolab/ppgs\ncd ppgs/\n\n# Install dependencies\npip install -e .[train]\nconda install -c conda-forge 
ffmpeg\n\n# Download Charsiu\ngit submodule init\ngit submodule update\n```\n\n\n## Inference\n\n```python\nimport ppgs\n\n# Load speech audio at correct sample rate\naudio = ppgs.load.audio(audio_file)\n\n# Choose a gpu index to use for inference. Set to None to use cpu.\ngpu = 0\n\n# Infer PPGs (named `ppg` so the `ppgs` module is not shadowed)\nppg = ppgs.from_audio(audio, ppgs.SAMPLE_RATE, gpu=gpu)\n```\n\n\n### Application programming interface (API)\n\n#### `ppgs.from_audio`\n\n```python\ndef from_audio(\n    audio: torch.Tensor,\n    sample_rate: Union[int, float],\n    representation: str = ppgs.REPRESENTATION,\n    checkpoint: Optional[Union[str, bytes, os.PathLike]] = None,\n    gpu: Optional[int] = None\n) -\u003e torch.Tensor:\n    \"\"\"Infer ppgs from audio\n\n    Arguments\n        audio\n            Batched audio to process\n            shape=(batch, 1, samples)\n        sample_rate\n            Audio sampling rate\n        representation\n            The representation to use, 'mel' and 'w2v2fb' are currently supported\n        checkpoint\n            The checkpoint file\n        gpu\n            The index of the GPU to use for inference\n\n    Returns\n        ppgs\n            Phonetic posteriorgrams\n            shape=(batch, len(ppgs.PHONEMES), frames)\n    \"\"\"\n```\n\n\n#### `ppgs.from_file`\n\n```python\ndef from_file(\n    file: Union[str, bytes, os.PathLike],\n    representation: str = ppgs.REPRESENTATION,\n    checkpoint: Optional[Union[str, bytes, os.PathLike]] = None,\n    gpu: Optional[int] = None\n) -\u003e torch.Tensor:\n    \"\"\"Infer ppgs from an audio file\n\n    Arguments\n        file\n            The audio file\n        representation\n            The representation to use, 'mel' and 'w2v2fb' are currently supported\n        checkpoint\n            The checkpoint file\n        gpu\n            The index of the GPU to use for inference\n\n    Returns\n        ppgs\n            Phonetic posteriorgram\n            shape=(len(ppgs.PHONEMES), frames)\n    \"\"\"\n```\n\n\n#### 
`ppgs.from_file_to_file`\n\n```python\ndef from_file_to_file(\n    audio_file: Union[str, bytes, os.PathLike],\n    output_file: Union[str, bytes, os.PathLike],\n    representation: str = ppgs.REPRESENTATION,\n    checkpoint: Optional[Union[str, bytes, os.PathLike]] = None,\n    gpu: Optional[int] = None\n) -\u003e None:\n    \"\"\"Infer ppg from an audio file and save to a torch tensor file\n\n    Arguments\n        audio_file\n            The audio file\n        output_file\n            The .pt file to save PPGs\n        representation\n            The representation to use, 'mel' and 'w2v2fb' are currently supported\n        checkpoint\n            The checkpoint file\n        gpu\n            The index of the GPU to use for inference\n    \"\"\"\n```\n\n\n#### `ppgs.from_files_to_files`\n\n```python\ndef from_files_to_files(\n    audio_files: List[Union[str, bytes, os.PathLike]],\n    output_files: List[Union[str, bytes, os.PathLike]],\n    representation: str = ppgs.REPRESENTATION,\n    checkpoint: Optional[Union[str, bytes, os.PathLike]] = None,\n    num_workers: int = 0,\n    gpu: Optional[int] = None,\n    max_frames: int = ppgs.MAX_INFERENCE_FRAMES\n) -\u003e None:\n    \"\"\"Infer ppgs from audio files and save to torch tensor files\n\n    Arguments\n        audio_files\n            The audio files\n        output_files\n            The .pt files to save PPGs\n        representation\n            The representation to use, 'mel' and 'w2v2fb' are currently supported\n        checkpoint\n            The checkpoint file\n        num_workers\n            Number of CPU threads for multiprocessing\n        gpu\n            The index of the GPU to use for inference\n        max_frames\n            The maximum number of frames on the GPU at once\n    \"\"\"\n```\n\n\n### Command-line interface (CLI)\n\n```\nusage: python -m ppgs\n    [-h]\n    [--audio_files AUDIO_FILES [AUDIO_FILES ...]]\n    [--output_files OUTPUT_FILES [OUTPUT_FILES ...]]\n    [--representation 
REPRESENTATION]\n    [--checkpoint CHECKPOINT]\n    [--num-workers NUM_WORKERS]\n    [--gpu GPU]\n    [--max-frames MAX_FRAMES]\n\narguments:\n    --audio_files AUDIO_FILES [AUDIO_FILES ...]\n        Paths to input audio files\n    --output_files OUTPUT_FILES [OUTPUT_FILES ...]\n        The one-to-one corresponding output files\n\noptional arguments:\n    -h, --help\n        Show this help message and exit\n    --representation REPRESENTATION\n        Representation to use for inference\n    --checkpoint CHECKPOINT\n        The checkpoint file\n    --num-workers NUM_WORKERS\n        Number of CPU threads for multiprocessing\n    --gpu GPU\n        The index of the GPU to use for inference. Defaults to CPU.\n    --max-frames MAX_FRAMES\n        Maximum number of frames in a batch\n```\n\n\n## Distance\n\nTo compute the proposed normalized Jensen-Shannon divergence pronunciation\ndistance between two PPGs, use `ppgs.distance()`.\n\n```python\ndef distance(\n    ppgX: torch.Tensor,\n    ppgY: torch.Tensor,\n    reduction: str = 'mean',\n    normalize: bool = True,\n    exponent: float = ppgs.SIMILARITY_EXPONENT\n) -\u003e torch.Tensor:\n    \"\"\"Compute the pronunciation distance between two aligned PPGs\n\n    Arguments\n        ppgX\n            Input PPG X\n            shape=(len(ppgs.PHONEMES), frames)\n        ppgY\n            Input PPG Y to compare with PPG X\n            shape=(len(ppgs.PHONEMES), frames)\n        reduction\n            Reduction to apply to the output. 
One of ['mean', 'none', 'sum'].\n        normalize\n            Apply similarity-based normalization\n        exponent\n            Similarity exponent\n\n    Returns\n        Normalized Jensen-Shannon divergence between PPGs\n    \"\"\"\n```\n\n\n## Interpolate\n\n```python\ndef interpolate(\n    ppgX: torch.Tensor,\n    ppgY: torch.Tensor,\n    interp: Union[float, torch.Tensor]\n) -\u003e torch.Tensor:\n    \"\"\"Linear interpolation\n\n    Arguments\n        ppgX\n            Input PPG X\n            shape=(len(ppgs.PHONEMES), frames)\n        ppgY\n            Input PPG Y\n            shape=(len(ppgs.PHONEMES), frames)\n        interp\n            Interpolation values\n            scalar float OR shape=(frames,)\n\n    Returns\n        Interpolated PPGs\n        shape=(len(ppgs.PHONEMES), frames)\n    \"\"\"\n```\n\n\n## Edit\n\n```python\nimport ppgs\n\n# Get PPGs to edit\nppg = ppgs.from_file(audio_file, gpu=gpu)\n\n# Constant-ratio time-stretching (slowing down)\ngrid = ppgs.edit.grid.constant(ppg, ratio=0.8)\nslow = ppgs.edit.grid.sample(ppg, grid)\n\n# Stretch to a desired length (e.g., 100 frames)\ngrid = ppgs.edit.grid.of_length(ppg, 100)\nfixed = ppgs.edit.grid.sample(ppg, grid)\n```\n\n\n### `ppgs.edit.grid.constant`\n\n```python\ndef constant(ppg: torch.Tensor, ratio: float) -\u003e torch.Tensor:\n    \"\"\"Create a grid for constant-ratio time-stretching\n\n    Arguments\n        ppg\n            Input PPG\n        ratio\n            Time-stretching ratio; lower is slower\n\n    Returns\n        Constant-ratio grid for time-stretching ppg\n    \"\"\"\n```\n\n\n### `ppgs.edit.grid.from_alignments`\n\n```python\ndef from_alignments(\n    source: pypar.Alignment,\n    target: pypar.Alignment,\n    sample_rate: int = ppgs.SAMPLE_RATE,\n    hopsize: int = ppgs.HOPSIZE\n) -\u003e torch.Tensor:\n    \"\"\"Create time-stretch grid to convert source alignment to target\n\n    Arguments\n        source\n            Forced alignment of PPG to stretch\n        
target\n            Forced alignment of target PPG\n        sample_rate\n            Audio sampling rate\n        hopsize\n            Hopsize in samples\n\n    Returns\n        Grid for time-stretching source PPG\n    \"\"\"\n```\n\n\n### `ppgs.edit.grid.of_length`\n\n```python\ndef of_length(ppg: torch.Tensor, length: int) -\u003e torch.Tensor:\n    \"\"\"Create time-stretch grid to resample PPG to a specified length\n\n    Arguments\n        ppg\n            Input PPG\n        length\n            Target length\n\n    Returns\n        Grid of specified length for time-stretching ppg\n    \"\"\"\n```\n\n\n### `ppgs.edit.grid.sample`\n\n```python\ndef grid_sample(ppg: torch.Tensor, grid: torch.Tensor) -\u003e torch.Tensor:\n    \"\"\"Grid-based PPG interpolation\n\n    Arguments\n        ppg\n            Input PPG\n        grid\n            Grid of desired length; each item is a float-valued index into ppg\n\n    Returns\n        Interpolated PPG\n    \"\"\"\n```\n\n\n### `ppgs.edit.reallocate`\n\n```python\ndef reallocate(\n    ppg: torch.Tensor,\n    source: str,\n    target: str,\n    value: Optional[float] = None\n) -\u003e torch.Tensor:\n    \"\"\"Reallocate probability from source phoneme to target phoneme\n\n    Arguments\n        ppg\n            Input PPG\n            shape=(len(ppgs.PHONEMES), frames)\n        source\n            Source phoneme\n        target\n            Target phoneme\n        value\n            Max amount to reallocate. 
If None, reallocates all probability.\n\n    Returns\n        Edited PPG\n    \"\"\"\n```\n\n\n### `ppgs.edit.regex`\n\n```python\ndef regex(\n    ppg: torch.Tensor,\n    source_phonemes: List[str],\n    target_phonemes: List[str]\n) -\u003e torch.Tensor:\n    \"\"\"Regex match and replace (via swap) for phoneme sequences\n\n    Arguments\n        ppg\n            Input PPG\n            shape=(len(ppgs.PHONEMES), frames)\n        source_phonemes\n            Source phoneme sequence\n        target_phonemes\n            Target phoneme sequence\n\n    Returns\n        Edited PPG\n    \"\"\"\n```\n\n\n### `ppgs.edit.shift`\n\n```python\ndef shift(ppg: torch.Tensor, phoneme: str, value: float):\n    \"\"\"Shift probability of a phoneme and reallocate proportionally\n\n    Arguments\n        ppg\n            Input PPG\n            shape=(len(ppgs.PHONEMES), frames)\n        phoneme\n            Input phoneme\n        value\n            Maximal shift amount\n\n    Returns\n        Edited PPG\n    \"\"\"\n```\n\n\n### `ppgs.edit.swap`\n\n```python\ndef swap(ppg: torch.Tensor, phonemeA: str, phonemeB: str) -\u003e torch.Tensor:\n    \"\"\"Swap the probabilities of two phonemes\n\n    Arguments\n        ppg\n            Input PPG\n            shape=(len(ppgs.PHONEMES), frames)\n        phonemeA\n            Input phoneme A\n        phonemeB\n            Input phoneme B\n\n    Returns\n        Edited PPG\n    \"\"\"\n```\n\n\n## Sparsify\n\n```python\ndef sparsify(\n    ppg: torch.Tensor,\n    method: str = 'percentile',\n    threshold: torch.Tensor = torch.Tensor([0.85])\n) -\u003e torch.Tensor:\n    \"\"\"Make phonetic posteriorgrams sparse\n\n    Arguments\n        ppg\n            Input PPG\n            shape=(batch, len(ppgs.PHONEMES), frames)\n        method\n            Sparsification method. 
One of ['constant', 'percentile', 'topk'].\n        threshold\n            In [0, 1] for 'constant' and 'percentile'; integer \u003e 0 for 'topk'.\n\n    Returns\n        Sparse phonetic posteriorgram\n        shape=(batch, len(ppgs.PHONEMES), frames)\n    \"\"\"\n```\n\n\n## Training\n\n### Download\n\nDownloads, unzips, and formats datasets. Stores datasets in `data/datasets/`.\nStores formatted datasets in `data/cache/`.\n\n**N.B.** Common Voice and TIMIT cannot be automatically downloaded. You must\nmanually download the tarballs and place them in `data/sources/commonvoice`\nor `data/sources/timit`, respectively, prior to running the following.\n\n```bash\npython -m ppgs.data.download --datasets \u003cdatasets\u003e\n```\n\n\n### Preprocess\n\nPrepares representations for training. Representations are stored\nin `data/cache/`.\n\n```\npython -m ppgs.preprocess \\\n   --datasets \u003cdatasets\u003e \\\n   --representatations \u003crepresentations\u003e \\\n   --gpu \u003cgpu\u003e \\\n   --num-workers \u003cworkers\u003e\n```\n\n\n### Partition\n\nPartitions a dataset. You should not need to run this, as the partitions\nused in our work are provided for each dataset in\n`ppgs/assets/partitions/`.\n\n```\npython -m ppgs.partition --datasets \u003cdatasets\u003e\n```\n\n\n### Train\n\nTrains a model. 
Checkpoints and logs are stored in `runs/`.\n\n```\npython -m ppgs.train --config \u003cconfig\u003e --dataset \u003cdataset\u003e --gpu \u003cgpu\u003e\n```\n\nIf the config file has been previously run, the most recent checkpoint will\nautomatically be loaded and training will resume from that checkpoint.\n\n\n### Monitor\n\nYou can monitor training via `tensorboard`.\n\n```\ntensorboard --logdir runs/ --port \u003cport\u003e --load_fast true\n```\n\nTo use the `torchutil` notification system to receive notifications for long\njobs (download, preprocess, train, and evaluate), set the\n`PYTORCH_NOTIFICATION_URL` environment variable to a supported webhook as\nexplained in [the Apprise documentation](https://pypi.org/project/apprise/).\n\n\n### Evaluate\n\nPerforms objective evaluation of phoneme accuracy. Results are stored\nin `eval/`.\n\n```\npython -m ppgs.evaluate \\\n    --config \u003cname\u003e \\\n    --datasets \u003cdatasets\u003e \\\n    --checkpoint \u003ccheckpoint\u003e \\\n    --gpu \u003cgpu\u003e\n```\n\n\n## Citation\n\n### IEEE\nC. Churchwell, M. Morrison, and B. Pardo, \"High-Fidelity Neural Phonetic Posteriorgrams,\"\nICASSP 2024 Workshop on Explainable Machine Learning for Speech and Audio, April 2024.\n\n\n### BibTeX\n\n```\n@inproceedings{churchwell2024high,\n    title={High-Fidelity Neural Phonetic Posteriorgrams},\n    author={Churchwell, Cameron and Morrison, Max and Pardo, Bryan},\n    booktitle={ICASSP 2024 Workshop on Explainable Machine Learning for Speech and Audio},\n    month={April},\n    year={2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finteractiveaudiolab%2Fppgs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Finteractiveaudiolab%2Fppgs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finteractiveaudiolab%2Fppgs/lists"}