Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mahtafetrat/virgoolinformal-speech-dataset
A dataset of informal Persian audio and text chunks, along with a fully open processing pipeline, suitable for ASR and TTS tasks. Created from crawled content on virgool.io.
asr asr-evaluation forced-alignment persian persian-speech-corpus persian-speech-dataset persian-speech-recognition persian-text-to-speech speech-data-collection speech-dataset speech-processing tts
Last synced: 8 days ago
- Host: GitHub
- URL: https://github.com/mahtafetrat/virgoolinformal-speech-dataset
- Owner: MahtaFetrat
- License: mit
- Created: 2024-06-03T16:59:11.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-09-13T05:43:23.000Z (2 months ago)
- Last Synced: 2024-09-13T16:58:07.260Z (2 months ago)
- Topics: asr, asr-evaluation, forced-alignment, persian, persian-speech-corpus, persian-speech-dataset, persian-speech-recognition, persian-text-to-speech, speech-data-collection, speech-dataset, speech-processing, tts
- Language: Jupyter Notebook
- Homepage:
- Size: 508 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# VirgoolInformal-Speech-Dataset
This repository contains a dataset of informal Persian audio and text chunks suitable for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) tasks. The dataset was created by crawling informal Persian text from [virgool.io](https://virgool.io), recording its spoken form, and processing the raw audio and text files into smaller, equivalent chunks.
## Dataset Description
The dataset includes:
- Raw audio files recorded from the crawled text.
- Raw text files crawled from the blog.
- Processed audio and text chunks, aligned for ASR and TTS tasks.
- A fully open processing pipeline documented in a Jupyter Notebook, detailing each step from raw data to processed output.
### Raw Data
The raw data consists of:
- Audio files in their original format.
- Text files in their original format.
### Processed Data
The processed data consists of:
- Audio files converted from m4a to mono mp3 format.
- Text files normalized, cleaned, and tokenized into sentences.
- Aligned audio-text chunks created using the forced alignment tool Aeneas.
## Processing Notebook
The processing of the raw data is documented in a Jupyter Notebook, which includes the following steps:
1. **Audio Processing**: Converting audio files from m4a to mono mp3.
2. **Text Processing**: Normalizing text, removing and substituting symbols, removing links and references, converting numbers to their spoken format, and removing extra spaces.
3. **Sentence Tokenization**: Splitting text files into sentences using a custom sentence tokenization script.
4. **Forced Alignment**: Creating aligned audio-text chunks using Aeneas.
### Running the Notebook
To run the processing notebook, place the raw data files into a folder named `raw-data` in the root directory. The processed audio and text files will be output to a directory named `processed-data`, and the forced alignment results will be written to `forced-aligned-data`.
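As a minimal sketch of the folder setup and the audio-conversion step (assuming `ffmpeg` is installed; the exact commands live in the notebook itself):

```python
import shutil
import subprocess
from pathlib import Path

# Create the folders the notebook expects (names taken from the README).
for name in ("raw-data", "processed-data", "forced-aligned-data"):
    Path(name).mkdir(exist_ok=True)

# Step 1 sketch: convert each raw m4a file to mono mp3 with ffmpeg
# ("-ac 1" selects one audio channel); skipped gracefully if ffmpeg is absent.
if shutil.which("ffmpeg"):
    for src in Path("raw-data").glob("*.m4a"):
        dst = Path("processed-data") / (src.stem + ".mp3")
        subprocess.run(["ffmpeg", "-y", "-i", str(src), "-ac", "1", str(dst)],
                       check=True)
```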
For detailed instructions on environment setup, please refer to [the processing notebook](https://github.com/MahtaFetrat/VirgoolInformal-Speech-Dataset/blob/main/VirgoolInformal_Dataset_Processing.ipynb).
You can view and run [the processing notebook in Google Colab](https://colab.research.google.com/drive/1AjvrRisJYdqvNdSDKdSWfxge6S29mavm?usp=sharing).
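The text-processing step (normalizing, removing links and references, collapsing extra spaces) can be sketched roughly as follows; the regex rules here are illustrative assumptions, not the notebook's exact ones:

```python
import re

def normalize_text(text: str) -> str:
    """Rough sketch of the cleanup described in step 2 (illustrative rules)."""
    # Remove links.
    text = re.sub(r"https?://\S+", " ", text)
    # Remove bracketed references such as [1].
    text = re.sub(r"\[\d+\]", " ", text)
    # Substitute a symbol with a spoken-friendly form (illustrative example).
    text = text.replace("%", " درصد ")  # Persian for "percent"
    # Drop stray formatting symbols.
    text = re.sub(r"[#*_=~|<>]", " ", text)
    # Collapse extra whitespace.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_text("See https://virgool.io [1] it's 50% done"))
```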
## Usage
The dataset can be used to train Persian ASR and TTS models, specifically tailored for informal Persian speech. Additionally, it can be utilized to evaluate ASR models in terms of Character Error Rate (CER).
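CER is the character-level edit distance between a model's transcript and the reference, divided by the reference length. A minimal pure-Python sketch (not the repository's own evaluation code):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance over reference length."""
    r, h = list(reference), list(hypothesis)
    # Classic dynamic-programming edit distance.
    prev = list(range(len(h) + 1))
    for i, rc in enumerate(r, 1):
        curr = [i]
        for j, hc in enumerate(h, 1):
            cost = 0 if rc == hc else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(r), 1)

print(cer("سلام دنیا", "سلام دنیـا"))  # hypothesis has one inserted character
```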
## License
The code in this project is licensed under the MIT License, and the data is released under the CC0 license.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.