{"id":22274031,"url":"https://github.com/josedanielchg/efficient-data-storage-for-predictive-modeling","last_synced_at":"2026-04-28T22:35:03.416Z","repository":{"id":264671745,"uuid":"893720529","full_name":"josedanielchg/Efficient-Data-Storage-for-Predictive-Modeling","owner":"josedanielchg","description":"DataCamp project from the Associate Data Scientist track, focusing on optimizing dataset storage by transforming data types and filtering. Prepares data for efficient machine learning workflows","archived":false,"fork":false,"pushed_at":"2024-11-25T17:03:46.000Z","size":2336,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-25T16:52:25.580Z","etag":null,"topics":["cleaning-dataset","data-analysis","jupyter-notebook","python"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/josedanielchg.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-25T04:55:45.000Z","updated_at":"2024-11-25T17:15:39.000Z","dependencies_parsed_at":"2024-11-25T17:54:09.910Z","dependency_job_id":null,"html_url":"https://github.com/josedanielchg/Efficient-Data-Storage-for-Predictive-Modeling","commit_stats":null,"previous_names":["josedanielchg/efficient-data-storage-for-predictive-modeling"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/josedanielchg/Efficient-Data-Storage-for-Predictive-Modeling","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jose
danielchg%2FEfficient-Data-Storage-for-Predictive-Modeling","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josedanielchg%2FEfficient-Data-Storage-for-Predictive-Modeling/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josedanielchg%2FEfficient-Data-Storage-for-Predictive-Modeling/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josedanielchg%2FEfficient-Data-Storage-for-Predictive-Modeling/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/josedanielchg","download_url":"https://codeload.github.com/josedanielchg/Efficient-Data-Storage-for-Predictive-Modeling/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josedanielchg%2FEfficient-Data-Storage-for-Predictive-Modeling/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32402671,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-28T19:38:08.556Z","status":"ssl_error","status_checked_at":"2026-04-28T19:37:55.688Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cleaning-dataset","data-analysis","jupyter-notebook","python"],"created_at":"2024-12-03T13:17:33.875Z","updated_at":"2026-04-28T22:35:03.401Z","avatar_url":"https://github.com/josedanielchg.png","language":"Jupyter 
Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Efficient Data Storage for Predictive Modeling\n\nThis project, part of the [*Associate Data Scientist in Python*](https://app.datacamp.com/learn/career-tracks/associate-data-scientist-in-python) from DataCamp, involves optimizing the storage of a dataset from *Training Data Ltd.*, a leading online data science training provider. The goal is to preprocess and transform the dataset into an efficient format to facilitate faster and more scalable machine learning model development.\n\n## Project Description\n\nLarge datasets can significantly slow down machine learning pipelines. In this project, the dataset `customer_train.csv` is optimized for efficient storage by converting data types and filtering relevant data. The processed dataset will eventually be used to predict whether students are seeking new job opportunities, aiding recruiters in targeting potential candidates.\n\n## Dataset Overview\n\nThe dataset contains anonymized student information and includes the following columns:\n\n| Column                   | Description                                                                      |\n|--------------------------|----------------------------------------------------------------------------------|\n| `student_id`             | A unique ID for each student.                                                   |\n| `city`                   | A code for the city the student lives in.                                       |\n| `city_development_index` | A scaled development index for the city.                                        |\n| `gender`                 | The student's gender.                                                           |\n| `relevant_experience`    | Indicates if the student has relevant work experience.                          |\n| `enrolled_university`    | The type of university course enrolled in (if any).                             
|\n| `education_level`        | The student's education level.                                                  |\n| `major_discipline`       | The educational discipline of the student.                                      |\n| `experience`             | The student's total work experience (in years).                                 |\n| `company_size`           | The size of the student's current employer.                                     |\n| `company_type`           | The type of company employing the student.                                      |\n| `last_new_job`           | The number of years between the student's current and previous jobs.            |\n| `training_hours`         | The number of hours of training completed.                                      |\n| `job_change`             | Indicates if the student is looking for a new job (`1` = Yes, `0` = No).        |\n\n## Project Objectives\n\n1. **Optimize Data Types**:\n   - Convert columns with two-factor categories into Booleans.\n   - Convert integer columns to 32-bit integers for memory efficiency.\n   - Convert floating-point columns to 16-bit floats.\n   - Convert nominal categorical data to the `category` type.\n   - Convert ordinal categorical data to ordered categories based on natural order.\n\n2. **Filter Relevant Data**:\n   - Retain only students with 10+ years of experience at companies with at least 1,000 employees.\n\n3. **Assess Memory Usage**:\n   - Compare memory usage of the original and transformed DataFrames using `.info()` and `.memory_usage()` methods.\n\n## Results\n\nThe preprocessing steps applied to the dataset resulted in significant memory optimization, even before filtering by experience and company size. 
Below is a comparison of memory usage for each column before and after transformation:\n\n| Column Name            | Records Count | Original Dtype | Original Memory (bytes) | Transformed Dtype | Transformed Memory (bytes) | Memory Reduction (%) |\n|-------------------------|---------------|----------------|--------------------------|--------------------|----------------------------|-----------------------|\n| `student_id`           | 19,158        | int64          | 153,264                  | int32              | 76,632                     | **50.00**                |\n| `city`                 | 19,158        | object         | 1,235,888                | category           | 31,246                     | **97.47**                |\n| `city_development_index` | 19,158      | float64        | 153,264                  | float16            | 38,316                     | **75.00**                |\n| `gender`               | 19,158        | object         | 1,040,573                | category           | 19,452                     | **98.13**                |\n| `relevant_experience`  | 19,158        | object         | 1,527,274                | bool               | 19,158                     | **98.75**                |\n| `enrolled_university`  | 19,158        | object         | 1,341,257                | category           | 19,482                     | **98.55**                |\n| `education_level`      | 19,158        | object         | 1,231,558                | category           | 19,658                     | **98.40**                |\n| `major_discipline`     | 19,158        | object         | 1,095,945                | category           | 19,718                     | **98.20**                |\n| `experience`           | 19,158        | object         | 1,121,964                | category           | 21,004                     | **98.13**                |\n| `company_size`         | 19,158        | object         | 1,023,519                | category        
   | 19,965                     | **98.05**                |\n| `company_type`         | 19,158        | object         | 1,047,279                | category           | 19,733                     | **98.12**                |\n| `last_new_job`         | 19,158        | object         | 1,113,264                | category           | 19,683                     | **98.23**                |\n| `training_hours`       | 19,158        | int64          | 153,264                  | int32              | 76,632                     | **50.00**                |\n| `job_change`           | 19,158        | float64        | 153,264                  | bool               | 19,158                     | **87.50**                |\n\n### Key Observations:\n- **Overall Memory Reduction**: Ten of the 14 columns achieved memory savings above 97%, all of them object columns converted to `category` or `bool`. Reducing numeric precision saved less: 50% for int64 to int32 and 75% for float64 to float16.\n- **Most Significant Savings**:\n  - `relevant_experience`: 98.75% memory reduction.\n  - `enrolled_university`: 98.55% memory reduction.\n  - `education_level`: 98.40% memory reduction.\n- **Categorical Conversions**: Transforming columns with nominal and ordinal data into categories drastically reduced memory usage. Even `city`, which has the largest transformed footprint among the categorical columns (31,246 bytes), shrank by 97.47%.\n\nThese transformations show how much efficient storage techniques can save, preparing the dataset for the filtering and analysis steps that follow.\n\n\n## Requirements\n\n- Python 3.x\n- pandas library\n\n## Installation\n\n1. **Clone the repository**:\n   ```bash\n   git clone https://github.com/josedanielchg/Efficient-Data-Storage-for-Predictive-Modeling.git\n   ```\n\n2. **Navigate to the project directory**:\n   ```bash\n   cd Efficient-Data-Storage-for-Predictive-Modeling\n   ```\n\n3. 
**Install required dependencies**:\n   ```bash\n   pip install pandas\n   ```\n\n## Contributing\n\nContributions are welcome! Feel free to open an issue or submit a pull request for improvements or new features.\n\n## License\n\nThis project is licensed under the MIT License.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjosedanielchg%2Fefficient-data-storage-for-predictive-modeling","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjosedanielchg%2Fefficient-data-storage-for-predictive-modeling","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjosedanielchg%2Fefficient-data-storage-for-predictive-modeling/lists"}