https://github.com/mikemwai/massive
Amazon massive dataset processing project
https://github.com/mikemwai/massive
data-structures google-drive-api
Last synced: 5 months ago
JSON representation
Amazon massive dataset processing project
- Host: GitHub
- URL: https://github.com/mikemwai/massive
- Owner: mikemwai
- License: mit
- Created: 2023-09-26T08:53:03.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-12-09T09:22:59.000Z (over 2 years ago)
- Last Synced: 2023-12-09T10:27:29.061Z (over 2 years ago)
- Topics: data-structures, google-drive-api
- Language: Python
- Homepage:
- Size: 109 MB
- Stars: 1
- Watchers: 1
- Forks: 4
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Massive
This project aims to process the [massive](https://huggingface.co/datasets/AmazonScience/massive) dataset, focusing on generating language-specific files, such as en-xx.xlsx, for multiple languages and creating separate JSONL files for English (en), Swahili (sw), and German (de) with test, train, and dev data. Additionally, it will generate a single JSON file that contains translations from English to all languages with id and utt for the training sets. This project is designed to efficiently handle the dataset without using recursive algorithms to avoid potential memory and time complexity issues.
## Prerequisites
- Python version 3.11.5
- PyCharm version 2023.2.1
## Installation
1. Clone the repository on your local machine:
```sh
git clone https://github.com/mikemwai/massive.git
```
2. Navigate to the project directory and create a virtual environment on your local machine through the command line:
```sh
py -m venv myenv
```
3. Activate your virtual environment:
- On Windows:
```sh
myenv\Scripts\activate
```
- On Mac:
```sh
source myenv/bin/activate
```
4. Install project dependencies on your virtual environment:
```sh
pip install -r requirements.txt
```
5. Extract the [dataset folder](dataset.rar) on your project folder. Use winrar to extract.
## Usage
Run the project on the IDE terminal:
```sh
python main.py generate_excel_files separate_files train_translations
```
## Contributions
If you'd like to contribute to this project:
- Please fork the repository
- Create a new branch for your changes
- Submit a [pull request](https://github.com/mikemwai/massive/pulls)
Contributions, bug reports, and feature requests are welcome!
## Issues
If you have any issues with the project, feel free to open up an [issue](https://github.com/mikemwai/massive/issues).
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.