{"id":27442957,"url":"https://github.com/pr0mila/parquettohuggingface","last_synced_at":"2025-04-15T01:17:50.717Z","repository":{"id":287853991,"uuid":"965922176","full_name":"pr0mila/ParquetToHuggingFace","owner":"pr0mila","description":"ParquetToHuggingFace processes raw audio data, converts it into Parquet files, and uploads them to Hugging Face. The README explains how to set up the environment, configure paths, and run the scripts to generate and upload the data.","archived":false,"fork":false,"pushed_at":"2025-04-14T13:03:36.000Z","size":1589,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-15T01:17:33.219Z","etag":null,"topics":["audio-dataset","huggingface","huggingface-datasets","pandas","parquet","parquet-generator","python3","speech-data"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pr0mila.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-14T06:01:51.000Z","updated_at":"2025-04-14T16:09:33.000Z","dependencies_parsed_at":"2025-04-14T10:33:20.313Z","dependency_job_id":null,"html_url":"https://github.com/pr0mila/ParquetToHuggingFace","commit_stats":null,"previous_names":["pr0mila/parquettohuggingface"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pr0mila%2FParquetToHuggingFace","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pr0mila%2FParquetToHuggingFace/
tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pr0mila%2FParquetToHuggingFace/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pr0mila%2FParquetToHuggingFace/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pr0mila","download_url":"https://codeload.github.com/pr0mila/ParquetToHuggingFace/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248986314,"owners_count":21194025,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio-dataset","huggingface","huggingface-datasets","pandas","parquet","parquet-generator","python3","speech-data"],"created_at":"2025-04-15T01:17:50.138Z","updated_at":"2025-04-15T01:17:50.704Z","avatar_url":"https://github.com/pr0mila.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ParquetToHuggingFace :package:\n\nThis project processes audio data and creates Parquet files for uploading to Hugging Face. It uses the following two main scripts:\n\n- **`create_parquet.py`**: This script is used to create Parquet files from raw audio data. :musical_note:\n- **`upload_to_huggingface.py`**: This script uploads the Parquet files to Hugging Face, where they can be stored and shared. 
:cloud:\n\n### Dataset Example: [MediBeng Dataset on Hugging Face](https://huggingface.co/datasets/pr0mila-gh0sh/MediBeng) :floppy_disk:\n\nI followed the steps in this README to create the **MediBeng** dataset, which contains audio data along with their transcriptions, and uploaded it to Hugging Face. You can explore the dataset [here](https://huggingface.co/datasets/pr0mila-gh0sh/MediBeng).\n\n## Table of Contents\n- [1. Cloning the Repository](#1-cloning-the-repository) :book:\n- [2. Setting Up the Conda Environment](#2-setting-up-the-conda-environment) :wrench:\n- [3. Installing Dependencies](#3-installing-dependencies) :floppy_disk:\n- [4. Setting Up Hugging Face Token](#4-setting-up-hugging-face-token) :lock:\n- [5. Configuring the `config.yaml`](#5-configuring-the-configyaml) :gear:\n- [6. Data Setup](#6-data-setup) :file_folder:\n- [7. Running the Scripts](#7-running-the-scripts) :rocket:\n- [8. How the Code Works](#8-how-the-code-works) :memo:\n\n## 1. Cloning the Repository\n\nFirst, clone the repository to your local machine using the following command:\n\n```bash\ngit clone https://github.com/pr0mila/ParquetToHuggingFace.git\ncd ParquetToHuggingFace\n```\n\n## 2. Setting Up the Conda Environment\n\nCreate a new Conda environment to run the project:\n\n```bash\nconda create --name audio-parquet python=3.9\nconda activate audio-parquet\n```\n\n## 3. Installing Dependencies\n\nInstall the necessary dependencies by using the `requirements.txt` file:\n\n```bash\npip install -r requirements.txt\n```\nOr install the packages directly:\n\n```bash\npip install pandas soundfile numpy librosa datasets\n```\n\nEither command installs the libraries and packages required to run the project.\n\n## 4. Setting Up Hugging Face Token\n\nYou need to set your Hugging Face token as an environment variable to upload data to Hugging Face. 
Run the following command in your terminal (replace `your_token_here` with your actual token):\n\n```bash\nexport HUGGINGFACE_TOKEN='your_token_here'\n```\n\nYou can find your Hugging Face token by visiting [Hugging Face - Account Settings](https://huggingface.co/settings/tokens).\n\nTo ensure the token persists across sessions, you can add the `export` command to your shell's configuration file (e.g., `~/.bashrc` or `~/.zshrc`).\n\n## 5. Configuring the `config.yaml`\n\nThe `config.yaml` file stores the configuration for the paths and Hugging Face repository settings.\n\nMake sure to update the `config.yaml` according to your local setup. Example:\n\n```yaml\npaths:\n  base_data_directory: \"/path/to/your/raw/data\"\n  output_directory: \"/path/to/store/parquet\"\n\nhuggingface:\n  repo_id: \"your_username/your_dataset_name\"\n  token_env_var: \"HUGGINGFACE_TOKEN\"\n```\n\n- `base_data_directory`: Path to the directory that contains the raw audio files and CSV files (the `raw_data` directory).\n- `output_directory`: Path to the directory where the Parquet files will be saved (the `processed_data` directory).\n- `repo_id`: The ID of the Hugging Face repository to which you want to upload the dataset.\n\n## 6. Data Setup\n\nPlace your raw audio data and its corresponding CSV file into the `raw_data` directory. The audio files should be in a format that the `create_parquet.py` script can read (e.g., `.wav` files).\n\nYour directory structure should look like this:\n\n```\nParquetToHuggingFace/\n├── data/\n│   ├── raw_data/\n│   │   ├── test/\n│   │   └── train/\n│   └── processed_data/\n├── config.yaml\n└── src/\n    ├── create_parquet.py\n    └── upload_to_huggingface.py\n```\n\n## 7. Running the Scripts\n\n### Step 1: Update Config Path in `main.py`\n\nBefore running the scripts, make sure to update the path of your `config.yaml` file in `src/main.py` to reflect your local configuration. 
This will ensure the scripts use the correct settings.\n\n### Step 2: Run `main.py` for Final Output\n\nOnce the config path is updated, run the `main.py` file to generate the final output:\n\n```bash\npython3 src/main.py\n```\n\n## 8. How the Code Works\n\n### `create_parquet.py`:\n- The `create_parquet.py` script processes raw audio data and its corresponding CSV file (which contains transcription and translation).\n- It calculates pitch statistics (mean and standard deviation) for each audio file.\n- The processed data, including audio, transcription, translation, and pitch statistics, is then saved as Parquet files in the `processed_data` directory.\n\n### `upload_to_huggingface.py`:\n- The `upload_to_huggingface.py` script logs you into Hugging Face using the token set in your environment.\n- It checks whether the repository exists or needs to be created on Hugging Face.\n- Finally, it uploads the Parquet files from the `processed_data` directory to your Hugging Face repository.\n\n\n---\n\n### Final Outcome:\n\n\n![View of Final Outcome](parquettohuggingface.png)\n\nOnce the scripts run successfully, your data is stored on Hugging Face as Parquet files, ready to share and use for machine learning or research.\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpr0mila%2Fparquettohuggingface","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpr0mila%2Fparquettohuggingface","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpr0mila%2Fparquettohuggingface/lists"}