{"id":16735211,"url":"https://github.com/ms8909/dptron","last_synced_at":"2025-04-10T12:20:39.241Z","repository":{"id":40974681,"uuid":"220053890","full_name":"ms8909/dptron","owner":"ms8909","description":"mltrons dptron: Dirty Data in, Clean Data Out!","archived":false,"fork":false,"pushed_at":"2022-11-11T07:51:36.000Z","size":79214,"stargazers_count":4,"open_issues_count":5,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-24T11:07:53.629Z","etag":null,"topics":["data","dataprep","datapreparation","datascience","datascience-machinelearning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ms8909.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-11-06T17:34:08.000Z","updated_at":"2022-08-08T04:40:38.000Z","dependencies_parsed_at":"2022-09-26T17:21:04.384Z","dependency_job_id":null,"html_url":"https://github.com/ms8909/dptron","commit_stats":null,"previous_names":["ms8909/mltrons-auto-data-prep"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ms8909%2Fdptron","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ms8909%2Fdptron/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ms8909%2Fdptron/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ms8909%2Fdptron/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ms8909","download_url":"https://codeload.github.com/ms8909/dptron/tar.gz/refs/heads/master","host":{"name":"GitHub","u
rl":"https://github.com","kind":"github","repositories_count":248217178,"owners_count":21066633,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","dataprep","datapreparation","datascience","datascience-machinelearning"],"created_at":"2024-10-13T00:05:18.670Z","updated_at":"2025-04-10T12:20:39.222Z","avatar_url":"https://github.com/ms8909.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# mltrons dptron: Dirty Data in, Clean Data Out!\nhttps://pypi.org/project/mltronsAutoDataPrep/\n\n\n## Introduction\n\nData is the most important element of data analysis. Real-world data is unclean: it has spelling errors, missing values, formatting issues, and skewness, and it often lacks encoding or aggregation, which makes cleaning it the most time-consuming \u0026 cumbersome task for analysts \u0026 scientists. Because most scientists spend around 80% of their time cleaning \u0026 preparing data, we’re introducing dptron to make that process much easier and faster!\n\nDptron is an in-memory platform built for distributed \u0026 scalable data cleaning \u0026 preparation. Dptron is written in Python and is built on PySpark to deal with large amounts of data seamlessly. It uses machine learning and deep learning algorithms to perform important data cleaning \u0026 preparation steps automatically. Dptron is extensible so that developers, analysts \u0026 scientists can streamline the process of data cleaning \u0026 preparation for better decision making while becoming more productive. 
\n\nDecision making is better \u0026 easier if the data is clean; otherwise it’s garbage in, garbage out. \n\n\n## Important Features\n\n- Supports connection with AWS S3\n- Supports up to 10TB of data\n- Treats spelling mistakes and other inconsistencies in URLs\n- Detects \u0026 treats skewness in data\n- Feature engineering for time variables\n- Treats \u0026 fills NULL values by using deep learning (next iteration)\n- Treats spelling mistakes and other inconsistencies in other variables (next iteration)\n\n\n## GETTING STARTED WITH DPTRON - AUTO DATA PREP\n\n### Installing on macOS\nOpen up your terminal and install Java 8, which is required for PySpark:\n```sh\nbrew cask install adoptopenjdk/openjdk/adoptopenjdk8\n```\nAfter installing Java 8, list the installed Java versions so you can set it as your default:\n```sh\n/usr/libexec/java_home -V\n```\nThis will output something like the following:\n\n```\nMatching Java Virtual Machines (3):\n1.8.0_05, x86_64:   \"Java SE 8\" /Library/Java/JavaVirtualMachines/jdk1.8.0_05.jdk/Contents/Home\n1.6.0_65-b14-462, x86_64:   \"Java SE 6\" /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home\n1.6.0_65-b14-462, i386: \"Java SE 6\" /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home\n/Library/Java/JavaVirtualMachines/jdk1.8.0_05.jdk/Contents/Home\n```\n\nPick the version you want to be the default (here 1.8), then:\n```sh\nexport JAVA_HOME=$(/usr/libexec/java_home -v 1.8)\n```\n\nAfter you've successfully installed Java 8, install dptron with the following command: \n```sh\npip install mltronsAutoDataPrep\n```\n\n### Installing on Windows\n\nIt's important that you replace every path that includes the folder \"Program Files\" or \"Program Files (x86)\" with its short form, to avoid problems while running Spark.\n\nIf you already have Java installed, you still need to fix the JAVA_HOME and PATH variables. To do that, you need to:\n\n**1. Rename \"Program Files\" with \"Progra~1\"**\n\n**2. 
Rename \"Program Files (x86)\" with \"Progra~2\"**\n```\nExample: \"C:\\Program Files\\Java\\jdk1.8.0_161\" --\u003e \"C:\\Progra~1\\Java\\jdk1.8.0_161\"\n```\nAfter renaming, make sure you have Java 8 installed and the environment variables correctly defined:\n\n**3. Download JDK 8 from [Java's official website](https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)**\n\nAfter installing JDK 8, set the following environment variables:\n\n**4. JAVA_HOME = C:\\Progra~1\\Java\\jdk1.8.0_161**\n\n**5. PATH += C:\\Progra~1\\Java\\jdk1.8.0_161\\bin**\n\nAfter you've successfully installed and configured Java 8, install dptron with the following command: \n```sh\npip install mltronsAutoDataPrep\n```\n\n\n## Using dptron\n\n\n### 1. Reading data\n\n- **address** path of the file\n\n- **local** where the file is located (local PC or S3 bucket)\n\n- **file_format** format of the file (csv, excel, parquet)\n\n- **s3** S3 bucket credentials (applicable only if the data is in an S3 bucket)\n\n\n```python\nfrom mltronsAutoDataPrep.lib.v2.Operations.readfile import ReadFile as rf\n\n# Read a local CSV file; s3 stays empty because no bucket is involved\nres = rf.read(address=\"test.csv\", local=\"yes\", file_format=\"csv\", s3={})\n```\n\n\n\n### 2. Drop Features with Null values above a threshold\n\n- Provide a dataframe and a threshold for null values \n\n- Returns the list of columns whose null values exceed the threshold\n\n```python\nfrom mltronsAutoDataPrep.lib.v2.Operations.readfile import ReadFile as rf\nfrom mltronsAutoDataPrep.lib.v2.Middlewares.drop_col_with_null_val import DropNullValueCol\n\nres = rf.read(\"test.csv\", file_format='csv')\n\n# Collect the columns whose null values exceed the threshold, then drop them\ndrop_col = DropNullValueCol()\ncolumns_to_drop = drop_col.delete_var_with_null_more_than(res, threshold=30)\ndf = res.drop(*columns_to_drop)\n```\n\n\n### 3. 
Drop Features containing the same value \n\n- Provide a dataframe \n\n- Returns the list of columns that contain the same value throughout\n\n```python\nfrom mltronsAutoDataPrep.lib.v2.Middlewares.drop_col_with_same_val import DropSameValueColumn\n\n# res is the dataframe read in step 1\ndrop_same_val_col = DropSameValueColumn()\ncolumns_to_drop = drop_same_val_col.delete_same_val_com(res)\ndf = res.drop(*columns_to_drop)\n```\n\n### 4. Clean URL Features\n\n- Automatically detects features containing URLs\n\n- Uses a pipeline structure to clean the URLs with **NLP** techniques\n\n```python\nfrom mltronsAutoDataPrep.lib.v2.Pipelines.etl_pipeline import EtlPipeline\n\netl_pipeline = EtlPipeline()\netl_pipeline.custom_url_transformer(res)\nres = etl_pipeline.transform(res)\n```\n\n\n### 5. Split Date Time features\n\n- Automatically detects features containing date/time\n\n- Splits date/time into multiple useful features (day, month, year, etc.)\n\n\n```python\nfrom mltronsAutoDataPrep.lib.v2.Pipelines.etl_pipeline import EtlPipeline\n\netl_pipeline = EtlPipeline()\netl_pipeline.custom_date_transformer(res)\nres = etl_pipeline.transform(res)\n```\n\n\n### 6. Filling Missing Values \n\n- Missing values are filled using deep learning techniques\n\n\n```python\nfrom mltronsAutoDataPrep.lib.v2.Pipelines.etl_pipeline import EtlPipeline\n\netl_pipeline = EtlPipeline()\netl_pipeline.custom_filling_missing_val(res)\nres = etl_pipeline.transform(res)\n```\n\n\n### 7. Removing Skewness from features\n\n\n- Automatically detects which columns are skewed\n\n- Minimizes skewness using statistical methods\n\n```python\nfrom mltronsAutoDataPrep.lib.v2.Pipelines.etl_pipeline import EtlPipeline\n\netl_pipeline = EtlPipeline()\netl_pipeline.custom_skewness_transformer(res)\nres = etl_pipeline.transform(res)\n```\n\n\n### 8. 
Remove Spelling mistakes \n\n- Provide the list of features that contain spelling mistakes\n\n```python\nfrom mltronsAutoDataPrep.lib.v2.Pipelines.etl_pipeline import EtlPipeline\n\netl_pipeline = EtlPipeline()\netl_pipeline.custom_spell_transformer(res, ['col1', 'col2'])\nres2 = etl_pipeline.transform(res)\n```\n\n\n\n## Dependencies\n- [Java 8](https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)\n- [PySpark](https://spark.apache.org/docs/latest/api/python/index.html)\n- [NumPy](https://www.numpy.org)\n- [pandas](https://pandas.pydata.org)\n- [python-dateutil](https://labix.org/python-dateutil) \n- [pytz](https://pythonhosted.org/pytz)\n- See the full list of dependencies [here](https://github.com/ms8909/mltrons-auto-data-prep/blob/master/requirements.txt)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fms8909%2Fdptron","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fms8909%2Fdptron","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fms8909%2Fdptron/lists"}