{"id":15008856,"url":"https://github.com/nttcslab-sp/kaldiio","last_synced_at":"2025-05-15T00:08:57.510Z","repository":{"id":55875956,"uuid":"147448065","full_name":"nttcslab-sp/kaldiio","owner":"nttcslab-sp","description":"A pure python module for reading and writing kaldi ark files","archived":false,"fork":false,"pushed_at":"2025-03-06T15:20:42.000Z","size":339,"stargazers_count":256,"open_issues_count":4,"forks_count":36,"subscribers_count":11,"default_branch":"master","last_synced_at":"2025-04-03T12:06:42.913Z","etag":null,"topics":["file-formats","fileio","kaldi","pure-python","python","python2","python3","speech-recognition"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nttcslab-sp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-09-05T02:27:00.000Z","updated_at":"2025-03-20T02:21:00.000Z","dependencies_parsed_at":"2024-06-18T14:05:58.945Z","dependency_job_id":"2249d429-4888-4dff-8792-6bf8feab7676","html_url":"https://github.com/nttcslab-sp/kaldiio","commit_stats":{"total_commits":227,"total_committers":9,"mean_commits":25.22222222222222,"dds":"0.23788546255506604","last_synced_commit":"28a6f0af573c87a409ec0e0d64e5cdd520ebd308"},"previous_names":[],"tags_count":18,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nttcslab-sp%2Fkaldiio","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nttcslab-sp%2Fkaldiio/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/nttcslab-sp%2Fkaldiio/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nttcslab-sp%2Fkaldiio/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nttcslab-sp","download_url":"https://codeload.github.com/nttcslab-sp/kaldiio/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248354185,"owners_count":21089771,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["file-formats","fileio","kaldi","pure-python","python","python2","python3","speech-recognition"],"created_at":"2024-09-24T19:21:05.623Z","updated_at":"2025-04-11T06:28:46.235Z","avatar_url":"https://github.com/nttcslab-sp.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Kaldiio\n[![pypi](https://img.shields.io/pypi/v/kaldiio.svg)](https://pypi.python.org/pypi/kaldiio)\n[![Supported Python versions](https://img.shields.io/pypi/pyversions/kaldiio.svg)](https://pypi.python.org/pypi/kaldiio)\n[![codecov](https://codecov.io/gh/nttcslab-sp/kaldiio/branch/master/graph/badge.svg)](https://codecov.io/gh/nttcslab-sp/kaldiio)\n\nA pure python module for reading and writing kaldi ark files\n\n- [Introduction](#introduction)\n    - [What is this? 
What are `ark` and `scp`?](#what-is-this-what-are-ark-and-scp)\n    - [Features](#features)\n    - [Similar projects](#similar-projects)\n- [Install](#install)\n- [Usage](#usage)\n    - [ReadHelper](#readhelper)\n    - [WriteHelper](#writehelper)\n- [More low level API](#more-low-level-api)\n\n## Introduction\n### What is this? What are `ark` and `scp`?\n`kaldiio` is an IO utility implemented in pure Python for the file formats used in [kaldi](https://github.com/kaldi-asr/kaldi), namely `ark` and `scp`. `ark` and `scp` are used to archive objects defined in Kaldi, typically Kaldi Matrix objects.\n\nIn this section, we describe the basic concepts of `ark` and `scp`. For more detail about file IO in `Kaldi-asr`, see http://kaldi-asr.org/doc/io.html\n\n\n#### Basics of File IO in kaldi: Ark and copy-feats\n`ark` is an archive format that can store any `Kaldi objects`. This library mainly supports `KaldiMatrix/KaldiVector`.\nThis is an example of an ark file of KaldiMatrix: [ark file](tests/arks/test.ark)\n\nIf you have `Kaldi`, you can convert it to text format as follows:\n\n```bash\n# copy-feats \u003cread-specifier\u003e \u003cwrite-specifier\u003e\ncopy-feats ark:test.ark ark,t:text.ark\n```\n\n\n`copy-feats` is designed to work well with the unix command line:\n\n1. `ark` can be read from and written to a unix pipe.\n\n        cat test.ark | copy-feats ark:- ark,t:- | less # Show the contents in the ark\n    `-` indicates the standard input or standard output stream.\n1. A unix command can be used as a `read-specifier` or `write-specifier`\n\n        copy-feats ark:'gunzip -c some.ark.gz |' ark:some.ark\n\n#### Scp file\n`scp` is a text file such as:\n\n```\nuttid1 /some/where/feats.ark:123\nuttid2 /some/where/feats.ark:156\nuttid3 /some/where/feats.ark:245\n```\nThe first column, `uttid1`, is the utterance id, and the second, `/some/where/feats.ark:123`, is the file path of a matrix/vector in Kaldi format.  
The number after the colon is the starting address (byte offset) of the object in the file.\n\n`scp` looks like a very simple format, but it has several powerful features.\n\n1. Mutual conversion between `ark` and `scp`\n\n        copy-feats scp:foo.scp ark:foo.ark  # scp -\u003e ark\n        copy-feats ark:foo.ark ark,scp:bar.ark,bar.scp  # ark -\u003e ark,scp\n\n1. A unix command can be used instead of a direct file path\n\n    For example, the following scp file can also be used.\n\n        uttid1 cat /some/where/feats1.mat |\n        uttid2 cat /some/where/feats2.mat |\n        uttid3 cat /some/where/feats3.mat |\n\n#### wav.scp\n`wav.scp` is a `scp` that describes wave file paths.\n\n```\nuttid1 /some/path/a.wav\nuttid2 /some/path/b.wav\nuttid3 /some/path/c.wav\n```\n\nLike a normal scp file, `wav.scp` can also embed unix commands. This is often used for converting file formats in kaldi recipes.\n\n```\nuttid1 sph2pipe -f wav /some/path/a.wv1 |\nuttid2 sph2pipe -f wav /some/path/b.wv1 |\nuttid3 sph2pipe -f wav /some/path/c.wv1 |\n```\n\n### Features\nKaldiio supports:\n\n- Read/Write for archive formats: ark, scp\n  - Binary/Text - Float/Double Matrix: DM, FM\n  - Binary/Text - Float/Double Vector: DV, FV\n  - Compressed Matrix for loading: CM, CM2, CM3\n  - Compressed Matrix for writing: all `compression_method` values are supported: 1,2,3,4,5,6,7\n  - Binary/Text for Int-vector, typically used for `ali` files.\n- Read/Write via a pipe: e.g. \"ark: cat feats.ark |\"\n- Read wav.scp / wav.ark\n- (New!) Some extended ark formats that are **not supported** by Kaldi itself.\n  - The ark file for numpy, pickle, wav, flac files.\n\nThe following are **not supported**:\n\n- Writing to an existing scp file\n- NNet2/NNet3 egs\n- Lattice file\n\n### Similar projects\n- Python-C++ binding\n   - https://github.com/pykaldi/pykaldi\n      - Looks great. 
I recommend pykaldi if you aren't particular about pure python.\n   - https://github.com/janchorowski/kaldi-python/\n      - Possibly no longer actively maintained.\n   - https://github.com/t13m/kaldi-readers-for-tensorflow\n      - Ark reader for tensorflow\n   - https://github.com/csukuangfj/kaldi_native_io\n      - Implemented in C++\n      - Has an interface for Python\n      - Supports all types of `rspecifier` and `wspecifier`\n      - Has a uniform interface for writing, sequential reading, and random access reading\n      - `pip install kaldi_native_io`\n- Pure Python\n   - https://github.com/vesis84/kaldi-io-for-python\n      - `kaldiio` is based on this module, but `kaldiio` supports more features.\n   - https://github.com/funcwj/kaldi-python-io\n      - Python\u003e=3.6. `nnet3-egs` is also supported.\n\n## Install\n\n```bash\npip install kaldiio\n```\n\n## Usage\n`kaldiio` does not provide a separate API for each kind of Kaldi object:\n`Kaldi-Matrix` and `Kaldi-Vector`, whether binary or text, compressed or not,\ncan all be handled by the same API.\n\n### ReadHelper\n`ReadHelper` supports sequential access to `scp` or `ark`. If you need random access, use `kaldiio.load_scp`.\n\n\n- Read matrix-scp\n\n```python\nfrom kaldiio import ReadHelper\nwith ReadHelper('scp:file.scp') as reader:\n    for key, numpy_array in reader:\n        ...\n```\n\n\n- Read gzipped ark\n\n```python\nfrom kaldiio import ReadHelper\nwith ReadHelper('ark: gunzip -c file.ark.gz |') as reader:\n    for key, numpy_array in reader:\n        ...\n\n# Ali file\nwith ReadHelper('ark: gunzip -c exp/tri3_ali/ali.*.gz |') as reader:\n    for key, numpy_array in reader:\n        ...\n```\n\n\n- Read wav.scp\n\n```python\nfrom kaldiio import ReadHelper\nwith ReadHelper('scp:wav.scp') as reader:\n    for key, (rate, numpy_array) in reader:\n        ...\n```\n\n  - v2.11.0: Removed `wav` option. 
You can load `wav.scp` without any additional argument.\n\n- Read wav.scp with segments\n\n```python\nfrom kaldiio import ReadHelper\nwith ReadHelper('scp:wav.scp', segments='segments') as reader:\n    for key, (rate, numpy_array) in reader:\n        ...\n```\n\n- Read from stdin\n\n```python\nfrom kaldiio import ReadHelper\nwith ReadHelper('ark:-') as reader:\n    for key, numpy_array in reader:\n        ...\n```\n\n### WriteHelper\n- Write matrices and vectors to an ark with scp\n\n```python\nimport numpy\nfrom kaldiio import WriteHelper\nwith WriteHelper('ark,scp:file.ark,file.scp') as writer:\n    for i in range(10):\n        writer(str(i), numpy.random.randn(10, 10))\n        # The following is equivalent\n        # writer[str(i)] = numpy.random.randn(10, 10)\n```\n\n- Write compressed matrices\n\n```python\nimport numpy\nfrom kaldiio import WriteHelper\nwith WriteHelper('ark:file.ark', compression_method=2) as writer:\n    for i in range(10):\n        writer(str(i), numpy.random.randn(10, 10))\n```\n\n- Write matrices in text\n\n```python\nimport numpy\nfrom kaldiio import WriteHelper\nwith WriteHelper('ark,t:file.ark') as writer:\n    for i in range(10):\n        writer(str(i), numpy.random.randn(10, 10))\n```\n\n- Write a gzipped ark\n\n```python\nimport numpy\nfrom kaldiio import WriteHelper\nwith WriteHelper('ark:| gzip -c \u003e file.ark.gz') as writer:\n    for i in range(10):\n        writer(str(i), numpy.random.randn(10, 10))\n```\n- Write matrices to stdout\n\n```python\nimport numpy\nfrom kaldiio import WriteHelper\nwith WriteHelper('ark:-') as writer:\n    for i in range(10):\n        writer(str(i), numpy.random.randn(10, 10))\n```\n\n\n- (New!) 
Extended ark format using numpy, pickle, soundfile\n\n```python\nimport numpy\nfrom kaldiio import WriteHelper\n\n# NPY ARK\nwith WriteHelper('ark:-', write_function=\"numpy\") as writer:\n    writer(\"foo\", numpy.random.randn(10, 10))\n\n# PICKLE ARK\nwith WriteHelper('ark:-', write_function=\"pickle\") as writer:\n    writer(\"foo\", numpy.random.randn(10, 10))\n\n# FLAC ARK\nwith WriteHelper('ark:-', write_function=\"soundfile_flac\") as writer:\n    writer(\"foo\", numpy.random.randn(1000))\n```\n\nNote that `soundfile` is an optional module and you need to install it to use this feature.\n\n```sh\npip install soundfile\n```\n\n## More low level API\n`WriteHelper` and `ReadHelper` are high-level wrappers around the following API that add support for Kaldi-style arguments.\n\n### load_ark\n\n```python\nimport kaldiio\n\nd = kaldiio.load_ark('a.ark')  # d is a generator object\nfor key, numpy_array in d:\n    ...\n\n# === load_ark can accept a file descriptor, too\nwith open('a.ark', 'rb') as fd:\n    for key, numpy_array in kaldiio.load_ark(fd):\n        ...\n\n# === Use with open_like_kaldi\nfrom kaldiio import open_like_kaldi\nwith open_like_kaldi('gunzip -c file.ark.gz |', 'rb') as f:\n    for key, numpy_array in kaldiio.load_ark(f):\n        ...\n```\n\n- `load_ark` can load arks of both matrices and vectors, in either text or binary format.\n\n### load_scp\n`load_scp` creates a \"lazy dict\", i.e.\nthe data are loaded into memory only when an element is accessed.\n\n```python\nimport kaldiio\n\nd = kaldiio.load_scp('a.scp')\nfor key in d:\n    numpy_array = d[key]\n\n\nwith open('a.scp') as fd:\n    kaldiio.load_scp(fd)\n\nd = kaldiio.load_scp('data/train/wav.scp', segments='data/train/segments')\nfor key in d:\n    rate, numpy_array = d[key]\n```\n\nThe object created by `load_scp` is dict-like, so it provides the usual `dict` methods.\n\n```python\nimport kaldiio\nd = kaldiio.load_scp('a.scp')\nd.keys()\nd.items()\nd.values()\n'uttid' in d\nd.get('uttid')\n```\n\n### 
load_scp_sequential (from v2.13.0)\n\n`load_scp_sequential` creates a generator, in the same way as `load_ark`.\nIf you don't need random access to individual elements\nand only want to iterate over the whole data,\nthen this method may be faster than `load_scp`.\n\n```python\nimport kaldiio\nd = kaldiio.load_scp_sequential('a.scp')\nfor key, numpy_array in d:\n    ...\n```\n\n### load_wav_scp\n```python\nimport kaldiio\nd = kaldiio.load_scp('wav.scp')\nfor key in d:\n    rate, numpy_array = d[key]\n\n# Supporting \"segments\"\nd = kaldiio.load_scp('data/train/wav.scp', segments='data/train/segments')\nfor key in d:\n    rate, numpy_array = d[key]\n```\n\n- v2.11.0: `load_wav_scp` is deprecated now. Use `load_scp`.\n\n### load_mat\n```python\nimport kaldiio\nnumpy_array = kaldiio.load_mat('a.mat')\nnumpy_array = kaldiio.load_mat('a.ark:1134')  # Seek and load\n\n# If the file is wav, gets Tuple[int, numpy.ndarray]\nrate, numpy_array = kaldiio.load_mat('a.wav')\n```\n- `load_mat` can load kaldi-matrix, kaldi-vector, and wave\n\n### save_ark\n```python\nimport kaldiio\n# === Create ark file from numpy\nkaldiio.save_ark('b.ark', {'key': numpy_array, 'key2': numpy_array2})\n# Create ark with scp file, too\nkaldiio.save_ark('b.ark', {'key': numpy_array, 'key2': numpy_array2},\n                 scp='b.scp')\n\n# === Write arrays to sys.stdout\nimport sys\nkaldiio.save_ark(sys.stdout, {'key': numpy_array})\n\n# === Write arrays incrementally, key by key\n# generate a.ark\nkaldiio.save_ark('a.ark', {'key': numpy_array, 'key2': numpy_array2})\n# From here on, a.ark is opened in 'a' (append) mode.\nkaldiio.save_ark('a.ark', {'key3': numpy_array3}, append=True)\n\n\n# === Use with open_like_kaldi\nfrom kaldiio import open_like_kaldi\nwith open_like_kaldi('| gzip -c \u003e a.ark.gz', 'w') as f:\n    kaldiio.save_ark(f, {'key': numpy_array})\n    kaldiio.save_ark(f, {'key2': numpy_array2})\n```\n### save_mat\n```python\nimport kaldiio\n# array.ndim must be 1 or 2\nkaldiio.save_mat('a.mat', numpy_array)\n```\n- `save_mat` can save both kaldi-matrix and 
kaldi-vector\n\n\n### open_like_kaldi\n\n``kaldiio.open_like_kaldi`` is a useful tool if you are familiar with Kaldi. It behaves as follows:\n\n```python\nfrom kaldiio import open_like_kaldi\nwith open_like_kaldi('echo -n hello |', 'r') as f:\n    assert f.read() == 'hello'\nwith open_like_kaldi('| cat \u003e out.txt', 'w') as f:\n    f.write('hello')\nwith open('out.txt', 'r') as f:\n    assert f.read() == 'hello'\n\nimport sys\nwith open_like_kaldi('-', 'r') as f:\n    assert f is sys.stdin\nwith open_like_kaldi('-', 'w') as f:\n    assert f is sys.stdout\n```\n\nFor example, if you have gzipped alignment files, you can load them as follows:\n\n```python\nfrom kaldiio import open_like_kaldi, load_ark\nwith open_like_kaldi('gunzip -c exp/tri3_ali/ali.*.gz |', 'rb') as f:\n    # The alignment format is an ark of IntVector\n    g = load_ark(f)\n    for k, numpy_array in g:\n        ...\n```\n\n### parse_specifier\n\n```python\nfrom kaldiio import parse_specifier, open_like_kaldi, load_ark\nrspecifier = 'ark:gunzip -c file.ark.gz |'\nspec_dict = parse_specifier(rspecifier)\n# spec_dict = {'ark': 'gunzip -c file.ark.gz |'}\n\nwith open_like_kaldi(spec_dict['ark'], 'rb') as fark:\n    for key, numpy_array in load_ark(fark):\n        ...\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnttcslab-sp%2Fkaldiio","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnttcslab-sp%2Fkaldiio","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnttcslab-sp%2Fkaldiio/lists"}