{"id":24545046,"url":"https://github.com/tkxwaweru/python_data_manipulation","last_synced_at":"2026-01-11T05:45:09.024Z","repository":{"id":196607155,"uuid":"696662589","full_name":"tkxwaweru/python_data_manipulation","owner":"tkxwaweru","description":"Manipulating the MASSIVE dataset using python","archived":false,"fork":false,"pushed_at":"2023-10-09T12:51:59.000Z","size":129,"stargazers_count":0,"open_issues_count":0,"forks_count":3,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-02-07T08:48:47.898Z","etag":null,"topics":["data","dataanalysis","excel","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tkxwaweru.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-09-26T07:41:12.000Z","updated_at":"2023-11-28T19:01:08.000Z","dependencies_parsed_at":null,"dependency_job_id":"b155f97a-10cc-4817-9368-a5e630755939","html_url":"https://github.com/tkxwaweru/python_data_manipulation","commit_stats":null,"previous_names":["tkxwaweru/cat-1","tkxwaweru/python_data_manipulation"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tkxwaweru%2Fpython_data_manipulation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tkxwaweru%2Fpython_data_manipulation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tkxwaweru%2Fpython_data_manipulation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tkxw
aweru%2Fpython_data_manipulation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tkxwaweru","download_url":"https://codeload.github.com/tkxwaweru/python_data_manipulation/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246651240,"owners_count":20811990,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","dataanalysis","excel","python"],"created_at":"2025-01-22T21:17:46.928Z","updated_at":"2026-01-11T05:45:05.710Z","avatar_url":"https://github.com/tkxwaweru.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003clink rel=\"stylesheet\" href=\"https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.3/css/all.min.css\"\u003e\n\n# Manipulating the MASSIVE dataset using python\n\n## Quick Links\n\n- [Introduction](https://github.com/tkxwaweru/cat-1#introduction)\n\n- [Project installation](https://github.com/tkxwaweru/cat-1#project-installation)\n\n- [Running the project](https://github.com/tkxwaweru/cat-1#running-the-project)\n\n- [Disclaimer](https://github.com/tkxwaweru/cat-1#disclaimer)\n\n- [Output files](https://github.com/tkxwaweru/cat-1#output-files)\n\n## Introduction\n\nThis repository contains Python code that works with the MASSIVE dataset by Amazon. MASSIVE is a parallel dataset of more than 1 million utterances across 52 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation. 
Utterances span 60 intents and include 55 slot types.\n\nIn this project, the dataset's files, which originally come in the .jsonl format, are converted to Excel-readable .xlsx files. The data is also manipulated to generate new .jsonl files and a large .json file showing some translations of utterances from the train partition.\n\nYou can read more about the dataset [here](https://github.com/alexa/massive#readme).\n\n## Project installation\n\n1. Open your terminal and create a virtual Python environment to hold the dependencies required to run this project. The project was created using Python 3.11.5, which can be installed automatically when working with Anaconda environments or downloaded directly from [here](https://www.python.org/ftp/python/3.11.5/python-3.11.5-amd64.exe).\n\n   If you prefer to use Python's venv facility:\n\n   ```\n   python3 -m venv environment_name\n   ```\n\n   You can read more on working with Python and pip [here](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/).\n\n   If you prefer to use Anaconda:\n\n   ```\n   conda create -n environment_name\n   ```\n\n   You can read more on working with Anaconda [here](https://docs.anaconda.com/free/navigator/tutorials/index.html).\n\n   After cloning the repository (step 2), you can use pip to install all the project's dependencies into your environment:\n\n   ```\n   pip install -r requirements.txt\n   ```\n\n2. Fork and clone this repository.\n\n   Run the following command in your terminal to clone the forked repository:\n\n   ```{code}\n   git clone \u003crepository link\u003e \u003cfolder name\u003e\n   ```\n\n3. Download the MASSIVE dataset. Version 1.1 of the dataset, which was used for this project, can be downloaded \u003ca href=\"https://amazon-massive-nlu-dataset.s3.amazonaws.com/amazon-massive-dataset-1.1.tar.gz\"\u003ehere\u003c/a\u003e. 
You can use [WinRAR](https://www.win-rar.com/fileadmin/winrar-versions/winrar/winrar-x64-623.exe) to extract the compressed folder.\n\n4. Copy the data folder from the extracted archive into the src folder of your local repository.\n\n   The resulting path should look something like this:\n\n   ```{code}\n   C:\\Users\\username\\my_project\\src\\data\n   ```\n\n5. Install Git Bash, which is usually included with the Git installation. You can download Git from [here](https://git-scm.com/downloads).\n\n## Running the project\n\nUpon completing the project installation steps:\n\n1. Open your Git Bash terminal and navigate to the project's src folder.\n\n2. Run the following commands to execute the Bash script and generate the project's output files.\n\n   To make the Bash script executable:\n\n   ```{code}\n   chmod +x generator.sh\n   ```\n\n   To run the Bash script and generate the project's output:\n\n   ```{code}\n   ./generator.sh\n   ```\n\n## Disclaimer\n\nBecause a large number of files are processed and generated, producing the output could take a few minutes.\n\n## Output files\n\nThe project's output files were backed up on Google Drive and can be accessed [here](https://drive.google.com/drive/folders/12aCT8Q7ztFkNASuDG5hMLCD6DZJMv5-G?usp=sharing).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftkxwaweru%2Fpython_data_manipulation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftkxwaweru%2Fpython_data_manipulation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftkxwaweru%2Fpython_data_manipulation/lists"}