{"id":18750360,"url":"https://github.com/ntia/alignnet","last_synced_at":"2025-09-01T10:31:03.632Z","repository":{"id":247675183,"uuid":"800195387","full_name":"NTIA/alignnet","owner":"NTIA","description":"Train no-reference speech quality estimators with multiple datasets via learned, per-dataset alignments.","archived":false,"fork":false,"pushed_at":"2025-05-13T16:55:06.000Z","size":52205,"stargazers_count":17,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-05-13T17:37:45.520Z","etag":null,"topics":["corpus-effect","listening-experiment","machine-learning","no-reference-audio-quality-assessment","speech-quality","speech-quality-evaluation","subjective-tests"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NTIA.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-13T22:06:57.000Z","updated_at":"2025-03-17T08:09:03.000Z","dependencies_parsed_at":"2024-11-07T17:11:49.578Z","dependency_job_id":"f80b54ea-b0ab-4092-af42-7a9e3fa98d62","html_url":"https://github.com/NTIA/alignnet","commit_stats":null,"previous_names":["ntia/alignnet"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/NTIA/alignnet","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NTIA%2Falignnet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NTIA%2Falignnet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NTIA%2Falignnet/releases","mani
fests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NTIA%2Falignnet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NTIA","download_url":"https://codeload.github.com/NTIA/alignnet/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NTIA%2Falignnet/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273107693,"owners_count":25046956,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-01T02:00:09.058Z","response_time":120,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["corpus-effect","listening-experiment","machine-learning","no-reference-audio-quality-assessment","speech-quality","speech-quality-evaluation","subjective-tests"],"created_at":"2024-11-07T17:11:32.118Z","updated_at":"2025-09-01T10:31:00.067Z","avatar_url":"https://github.com/NTIA.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Dataset Alignment\n[![DOI](https://zenodo.org/badge/800195387.svg)](https://zenodo.org/doi/10.5281/zenodo.12734153)\n\nThis code corresponds to the paper \"AlignNet: Learning dataset score alignment functions to enable better training of speech quality estimators,\" by Jaden Pieper, Stephen D. Voran, to appear in Proc. 
Interspeech 2024 and with [preprint available here](https://arxiv.org/abs/2406.10205).\n\nWhen training a no-reference (NR) speech quality estimator, multiple datasets provide more information and can thus lead to better training. But they often are inconsistent in the sense that they use different subjective testing scales, or the exact same scale is used differently by test subjects due to the corpus effect.\nAlignNet improves the training of NR speech quality estimators with multiple, independent datasets. AlignNet uses an AudioNet to generate intermediate score estimates before using the Aligner to map intermediate estimates to the appropriate score range.\nAlignNet is intentionally designed to be independent of the choice of AudioNet.\n\nThis repository contains implementations of two different AudioNet choices: [MOSNet](https://arxiv.org/abs/1904.08352) and a simple example of a novel multi-scale convolution approach. \n\nMOSNet demonstrates a network that takes the STFT of an audio signal as its input, and the multi-scale convolution network is provided primarily as an example of a network that takes raw audio as an input.\n\n# Installation\n## Dependencies\nThere are two included environment files. `environment.yml` has the dependencies required to train with alignnet but does not impose version requirements. It is thus susceptible to issues in the future if packages deprecate methods or have major backwards compatibility breaks. On the other hand, `environment-paper.yml` contains the exact versions of the packages that were used for all the results reported in our paper. 
\n\nCreate and activate the `alignnet` environment.\n```\nconda env create -f environment.yml\nconda activate alignnet\n```\n\n## Installing alignnet package\n```\npip install .\n```\n\n# Preparing data for training\nWhen training with multiple datasets, first format them consistently so they can all be loaded in the same way.\nFor each dataset, create a csv that has the subjective score in a column called `MOS` and the path to the audio file in a column called `audio_path`.\n\nIf your `audio_net` model requires transformed data, you can transform it prior to training with `pretransform_data.py` (see `python pretransform_data.py --help` for more information) and store the paths to the transformed representation files in a column of the csv, for example `transform_path`. For example, MOSNet uses the STFT of audio as an input. For more efficient training, we recommend pretransforming the audio into STFT representations, saving them, and including a column called `stft_path` in the csv.\nWhatever column name you choose, it must match the value of `data.pathcol`.\nFor examples, see [MOSNet](alignnet/config/models/pretrain-MOSNet.yaml) or [MultiScaleConvolution](alignnet/config/models/pretrain-msc.yaml).\n\n\nFor each dataset, split the data into training, validation, and testing portions with\n```\npython split_labeled_data.py /path/to/data/file.csv --output-dir /datasetX/splits/path\n```\nThis generates `train.csv`, `valid.csv`, and `test.csv` in `/datasetX/splits/path`.\nAdditional options for splitting can be seen via `python split_labeled_data.py --help`, including creating multiple independent splits and changing the amount of data placed into each split.\n\n# Training with AlignNet\nTraining runs are configured via [Hydra](https://hydra.cc/docs/intro/).\nBasic examples of configuration files can be found in [alignnet/config/models](alignnet/config/models).\n\nSome basic training help can be found with\n\n```\npython train.py --help\n```\n\nTo see an 
example config file and all the overrideable parameters for training MOSNet with AlignNet, run\n```\npython train.py --config-dir alignnet/config/models --config-name=alignnet-MOSNet --cfg job\n```\nHere the `--cfg job` shows the configuration for this job without running the code.\n\nIf you are not training with a [clearML](https://clear.ml/) server, be sure to set `logging=none`.\nTo change the number of workers used for data loading, override the `data.num_workers` parameter, which defaults to 6.\n\nAs an example, and to confirm you have appropriately overridden these parameters, you could run \n```\npython train.py logging=none data.num_workers=4 --config-dir alignnet/config/models --config-name=alignnet-MOSNet --cfg job\n```\n\n### Pretraining MOSNet on a dataset\nIn order to pretrain on a dataset you run\n```\npython path/to/alignnet/train.py \\\ndata.data_dirs=[/absolute/path/datasetX/splits/path] \\\n--config-dir path/to/alignnet/alignnet/config/models/ --config-name pretrain-MOSNet.yaml\n```\nWhere `/absolute/path/datasetX/splits/path` contains `train.csv`, `valid.csv`, and `test.csv` for that dataset.\n\n### Training MOSNet with AlignNet\n```\npython path/to/alignnet/train.py \\\ndata.data_dirs=[/absolute/path/dataset1/splits/path,/absolute/path/dataset2/splits/path] \\\n--config-dir path/to/alignnet/alignnet/config/models/ --config-name alignnet-MOSNet.yaml\n```\n\n### Training MOSNet with AlignNet and MDF\n```\npython path/to/alignnet/train.py \\\ndata.data_dirs=[/absolute/path/dataset1/splits/path,/absolute/path/dataset2/splits/path] \\\nfinetune.restore_file=/absolute/path/to/alignnet/pretrained/model \\\n--config-dir path/to/alignnet/alignnet/config/models/ --config-name alignnet-MOSNet.yaml\n```\n\n### Training MOSNet in conventional way\nMultiple datasets, no alignment.\n```\npython path/to/alignnet/train.py \\\nproject.task=Conventional-MOSNet \\\ndata.data_dirs=[/absolute/path/dataset1/splits/path,/absolute/path/dataset2/splits/path] 
\\\n--config-dir path/to/alignnet/alignnet/config/models/ --config-name pretrain-MOSNet.yaml\n```\n\n## Examples\n### Training MOSNet with AlignNet and MDF, starting from a MOSNet pretrained on the Tencent dataset\n```\npython path/to/alignnet/train.py \\\ndata.data_dirs=[/absolute/path/dataset1/splits/path,/absolute/path/dataset2/splits/path] \\\nfinetune.restore_file=/absolute/path/to/alignnet/trained_models/pretrained-MOSNet-tencent \\\n--config-dir path/to/alignnet/alignnet/config/models/ --config-name alignnet-MOSNet.yaml\n```\n\n### MultiScaleConvolution example\nTraining NR speech quality estimators with AlignNet is intentionally designed to be agnostic to the choice of AudioNet.\nTo demonstrate this, we include code for a rudimentary network that takes in raw audio as an input and trains separate convolutional networks on multiple time scales that are then aggregated into a single network component.\nThis network is defined as `alignnet.MultiScaleConvolution` and can be trained via:\n```\npython path/to/alignnet/train.py \\\ndata.data_dirs=[/absolute/path/dataset1/splits/path,/absolute/path/dataset2/splits/path] \\\n--config-dir path/to/alignnet/alignnet/config/models/ --config-name alignnet-msc.yaml\n```\n\n# Using AlignNet models at inference\nTrained AlignNet models can easily be used at inference via the CLI built into `inference.py`.\nSome basic help can be seen via\n```\npython inference.py --help\n```\n\nIn general, three overrides must be set:\n* `model.path` - path to a trained model\n* `data.data_files` - list containing absolute paths to csv files that list audio files to perform inference on.\n* `output.file` - path to file where inference output will be stored.\n\nAfter running inference, a csv will be created at `output.file` with the following columns:\n* `file` - filenames where audio was loaded from\n* `estimate` - estimate generated by the model\n* `dataset` - index listing which file from `data.data_files` this file belongs to.\n* `AlignNet 
dataset index` - index listing which dataset within the model the scores come from. This will be the same for every file in the csv. The default dataset will always be the reference dataset, but this can be overridden via `model.dataset_index`.\n\nFor example, to run inference using the included AlignNet model trained on the smaller datasets, one would run\n```\npython inference.py \\\ndata.data_files=[/absolute/path/to/inference/data1.csv,/absolute/path/to/inference/data2.csv] \\\nmodel.path=trained_models/alignnet_mdf-MOSNet-small_data \\\noutput.file=estimations.csv\n```\n\n\n# Gathering datasets used in 2024 Conference Paper\nHere are links and references to help with locating the data we have used in the paper.\n\n* [Blizzard 2021](https://www.cstr.ed.ac.uk/projects/blizzard/data.html)\n  * Z.-H. Ling, X. Zhou, and S. King, \"The Blizzard challenge 2021,\" in Proc. Blizzard Challenge Workshop, 2021.\n* [Blizzard 2008](https://www.cstr.ed.ac.uk/projects/blizzard/data.html)\n  * V. Karaiskos, S. King, R. A. J. Clark, and C. Mayo, \"The Blizzard challenge 2008,\" in Proc. Blizzard Challenge Workshop, 2008.\n* [FFTNet](https://gfx.cs.princeton.edu/pubs/Jin_2018_FAR/clips/)\n  * Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu, \"FFTNet: a real-time speaker-dependent neural vocoder,\" in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 2018.\n* [NOIZEUS](https://ecs.utdallas.edu/loizou/speech/noizeus/)\n  * Y. Hu and P. Loizou, \"Subjective comparison of speech enhancement algorithms,\" in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 2006.\n* [VoiceMOS Challenge 2022](https://codalab.lisn.upsaclay.fr/competitions/695)\n  * W. C. Huang, E. Cooper, Y. Tsao, H.-M. Wang, T. Toda, and J. Yamagishi, \"The VoiceMOS Challenge 2022,\" in Proc. Interspeech 2022, 2022, pp. 4536–4540.\n* [Tencent](https://github.com/ConferencingSpeech/ConferencingSpeech2022)\n  * G. Yi, W. Xiao, Y. Xiao, B. Naderi, S. Möller, W. 
Wardah, G. Mittag, R. Cutler, Z. Zhang, D. S. Williamson, F. Chen, F. Yang, and S. Shang, \"ConferencingSpeech 2022 Challenge: Non-intrusive objective speech quality assessment challenge for online conferencing applications,\" in Proc. Interspeech, 2022, pp. 3308–3312.\n* [NISQA](https://github.com/gabrielmittag/NISQA/wiki/NISQA-Corpus)\n  * G. Mittag, B. Naderi, A. Chehadi, and S. Möller, \"NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets,\" in Proc. Interspeech, 2021, pp. 2127–2131.\n* [Voice Conversion Challenge 2018](https://datashare.ed.ac.uk/handle/10283/3257)\n  * J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, \"The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods,\" in Proc. Speaker Odyssey, 2018.\n* [Indiana U. MOS](https://github.com/ConferencingSpeech/ConferencingSpeech2022)\n  * X. Dong and D. S. Williamson, \"A pyramid recurrent network for predicting crowdsourced speech-quality ratings of real-world signals,\" in Proc. Interspeech, 2020.\n* [PSTN](https://github.com/ConferencingSpeech/ConferencingSpeech2022)\n  * G. Mittag, R. Cutler, Y. Hosseinkashi, M. Revow, S. Srinivasan, N. Chande, and R. Aichner, \"DNN no-reference PSTN speech quality prediction,\" in Proc. Interspeech, 2020.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fntia%2Falignnet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fntia%2Falignnet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fntia%2Falignnet/lists"}