https://github.com/mlcommons/peoples-speech

The People’s Speech Dataset

Host: GitHub
URL: https://github.com/mlcommons/peoples-speech
Owner: mlcommons
License: apache-2.0
Created: 2021-02-11T23:16:38.000Z (over 5 years ago)
Default Branch: main
Last Pushed: 2024-01-11T19:14:32.000Z (over 2 years ago)
Last Synced: 2025-06-09T14:12:56.820Z (about 1 year ago)
Language: Jupyter Notebook
Homepage: https://mlcommons.org/en/peoples-speech/
Size: 141 MB
Stars: 104
Watchers: 15
Forks: 12
Open Issues: 37
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.md
- Codeowners: .github/CODEOWNERS

Awesome Lists containing this project

Awesome-SpeechLM-Survey - People dataset - Training | 30k | 2021 | (Popular Training Datasets / Mixed Tokenizers)

README

# People's Speech Data Pipelines

## Installation

```
# libprotobuf-dev is an onnx dependency, transitively brought in by nemo.
sudo apt-get install git-lfs sox ffmpeg

# Set up a virtual environment of some sort
pip install numpy Cython
python setup.py develop

# Copy the bundled jars into the jars directory of the installed pyspark package
cp galvasr2/*.jar "$(python -c "import pyspark; print(pyspark.__path__[0])")/jars"
```
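Before running the Python steps, it can help to confirm that the system tools installed by `apt-get` are actually on `PATH`. The following is a minimal sketch, not part of this repository: `check_tools` is a hypothetical helper name, and the tool list simply mirrors the packages named in the install command.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): report which of the
# given commands are missing from PATH.
# Returns 0 if every command is found, 1 if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# The three packages from the apt-get line; the git-lfs package
# provides an executable named git-lfs.
if check_tools git-lfs sox ffmpeg; then
    echo "all system tools found"
else
    echo "install the missing tools before continuing"
fi
```

The check only prints a message and does not abort, so it is safe to run on a machine where the tools are not installed yet.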

Run the forced alignment pipeline:

```
python galvasr2/align/spark/align_cuda_decoder.py --stage=0
```
