{"id":24497446,"url":"https://github.com/arnie1x/massive-dataset-lab","last_synced_at":"2026-05-07T17:40:29.109Z","repository":{"id":196485115,"uuid":"696254639","full_name":"Arnie1x/massive-dataset-lab","owner":"Arnie1x","description":"This project involves basic data manipulation with JSON files, focusing on tasks related to data processing of the MASSive dataset and file management.","archived":false,"fork":false,"pushed_at":"2023-10-02T13:42:47.000Z","size":25,"stargazers_count":0,"open_issues_count":0,"forks_count":3,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-17T17:51:11.917Z","etag":null,"topics":["data-science","numpy","pandas","pandas-dataframe","python3"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Arnie1x.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-09-25T11:56:53.000Z","updated_at":"2023-10-02T13:05:31.000Z","dependencies_parsed_at":null,"dependency_job_id":"7aeb1cfd-0bfe-4e98-b6e4-d9a1a2ac0d3f","html_url":"https://github.com/Arnie1x/massive-dataset-lab","commit_stats":null,"previous_names":["arnie1x/massive-dataset-lab"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Arnie1x/massive-dataset-lab","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Arnie1x%2Fmassive-dataset-lab","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Arnie1x%2Fmassive-dataset-lab/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Arnie1x%2Fmassive-
dataset-lab/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Arnie1x%2Fmassive-dataset-lab/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Arnie1x","download_url":"https://codeload.github.com/Arnie1x/massive-dataset-lab/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Arnie1x%2Fmassive-dataset-lab/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32749538,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-07T02:14:30.463Z","status":"ssl_error","status_checked_at":"2026-05-07T02:14:29.405Z","response_time":62,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","numpy","pandas","pandas-dataframe","python3"],"created_at":"2025-01-21T21:33:49.468Z","updated_at":"2026-05-07T17:40:29.087Z","avatar_url":"https://github.com/Arnie1x.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Working with Python: Assessment Test\n\n## Introduction\n\nThis project involves basic data manipulation with JSON files, focusing on tasks related to data processing of the MASSive dataset and file management. 
\n\n## Table of Contents\n\n- [Project Tasks/Features](#features)\n   - [Question 1 - Python3 Development Environment](#question-1-python3-development-environment)\n   - [Question 2 - Working with Files](#question-2-working-with-files)\n- [Installation](#installation)\n   - [Pre-requisites](#pre-requisites)\n   - [Installation Instructions](#installation-instructions)\n\n## Project Tasks/Features\u003ca name=\"features\"\u003e\u003c/a\u003e\n### Question 1 - Python3 Development Environment\u003ca name=\"question-1-python3-development-environment\"\u003e\u003c/a\u003e\nIn this section, you will set up the Python3 development environment and process the MASSIVE Dataset:\n\n**Task 1**: Set up a Python3 development environment and install necessary dependencies.\n**Task 2**: Create a project structure similar to PyCharm and import the dataset.\n**Task 3**: Generate \"en-xx.xlsx\" files for all languages, using id, utt, and annot_utt fields.\n**Task 4**: Avoid using recursive algorithms with high time complexity.\n**Task 5**: Refer to Flags for running the solution on generator.sh files.\n\n### Question 2 - Working with Files\u003ca name=\"question-2-working-with-files\"\u003e\u003c/a\u003e\nIn this section, you will work with JSON files and manage your project:\n\n**Task 1**: Generate separate JSONL files for English (en), Swahili (sw), and German (de) for test, train, and dev data sets.\n**Task 2**: Create a large JSON file that includes all translations from English (en) to other languages (xx) for the train sets, including id and utt fields.\n**Task 3**: Ensure the JSON file structure is pretty-printed.\n**Task 4**: Upload all generated files to your Google Drive Backup Folder.\n\n## Installation\u003ca name=\"installation\"\u003e\u003c/a\u003e\n\n### Pre-requisites\u003ca name=\"pre-requisites\"\u003e\u003c/a\u003e\n\nBefore you begin, make sure you have the following pre-requisites installed on your system:\n\n- [Python3 Development 
Environment](https://www.python.org/)\n\n### Installation Instructions\u003ca name=\"installation-instructions\"\u003e\u003c/a\u003e\n\n1. Clone this repository to your local machine:\n\n   ```bash\n   git clone https://github.com/Arnie1x/massive-dataset-lab.git\n   cd massive-dataset-lab\n   ```\n2. Set up a virtual environment:\n   ```bash\n   virtualenv venv\n   ```\n3. Import the MASSIVE dataset into the dataset folder.\n   The MASSIVE dataset can be found [here](https://github.com/alexa/massive/) together with the installation instructions.\n\n4. Install the dependencies required to run the project:\n   ```bash\n   python -m pip install -r requirements.txt\n   ```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farnie1x%2Fmassive-dataset-lab","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farnie1x%2Fmassive-dataset-lab","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farnie1x%2Fmassive-dataset-lab/lists"}