{"id":13767220,"url":"https://github.com/kjam/data-cleaning-101","last_synced_at":"2026-01-31T14:09:39.986Z","repository":{"id":64649671,"uuid":"89218319","full_name":"kjam/data-cleaning-101","owner":"kjam","description":"Data Cleaning Libraries with Python","archived":false,"fork":false,"pushed_at":"2023-09-15T08:50:33.000Z","size":7113,"stargazers_count":281,"open_issues_count":2,"forks_count":174,"subscribers_count":23,"default_branch":"master","last_synced_at":"2024-11-17T02:34:28.327Z","etag":null,"topics":["data-validation","data-wrangling","python","teaching"],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kjam.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2017-04-24T08:53:31.000Z","updated_at":"2024-10-12T06:05:02.000Z","dependencies_parsed_at":"2024-01-25T23:07:22.049Z","dependency_job_id":"7d7f5150-943e-4654-93cb-9976b0c1f405","html_url":"https://github.com/kjam/data-cleaning-101","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kjam%2Fdata-cleaning-101","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kjam%2Fdata-cleaning-101/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kjam%2Fdata-cleaning-101/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kjam%2Fdata-cleaning-101/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kjam","download_url":"https://codeload.
github.com/kjam/data-cleaning-101/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253492529,"owners_count":21916959,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-validation","data-wrangling","python","teaching"],"created_at":"2024-08-03T16:01:06.306Z","updated_at":"2026-01-31T14:09:39.979Z","avatar_url":"https://github.com/kjam.png","language":"Jupyter Notebook","funding_links":[],"categories":["amazing insight and delivery"],"sub_categories":["cleaning"],"readme":"## Data Cleaning 101 \n\nWelcome to the code repository for Practical Data Cleaning with Python! This is a two-day training offered through Safari with O'Reilly Media. You can sign up by searching for the course on Safari.\n\nThis course aims to give you a practical overview of data cleaning and validation libraries and methods in Python. Since we only have 6 hours, it can't go massively in-depth into any one library or tool, but I have tried to include useful tools I have found in my work and incorporate a mixture of the munging and testing I have seen in my own and others' workflows. \n\nIf you have a suggestion for another library or additional topic, feel free to drop me a line :)\n\n### Installation\n\nThese lessons have been tested with Python 3.4 and Python 3.6 and primarily use the latest release of each library, except where versions are pinned. 
You can likely run most of the code with older releases, but if you run into an issue, try upgrading the library in question first.\n\n```pip install -r install_reqs.txt```\n\n\nI believe this will also work with Conda, although I am less familiar with Conda so please report issues! (special thanks to @blue_hacker for this fix!)\n\n```\n$ conda create -n dataclean --copy python=3.6\n$ source activate dataclean\n$ pip install -r install_reqs.txt\n```\n\nIn addition, you will need to install [sqlite3](https://www.sqlite.org/) or make changes to the second day case study with a connection string to your database of choice. [more info](https://dataset.readthedocs.io/en/latest/quickstart.html#connecting-to-a-database)\n\nIf you want to visualize graphs using Dask, you will need to install [Graphviz](http://www.graphviz.org/), which has special requirements on all platforms. For Linux, it is usually available via the system package library (apt, yum). For other platforms, you might need to use a special installer. It is also [available via conda install graphviz](https://anaconda.org/anaconda/graphviz) and [pip install graphviz](https://pypi.python.org/pypi/graphviz), but these might not include all necessary dependencies for your OS. For best results, search for your OS and \"install graphviz and dependencies\" and follow a recent article on setup.\n\n### Repository structure\n\nEach day coincides with a particular notebook folder. For day one, we will use `cleaning-notebooks`. Day two will focus on `validation-notebooks`. The `data` folder holds data we will use throughout the course. The `queue_example.py` file is used in the day two case study.\n\n\n### Python2 v. Python3\n\nThis repository has been built with Python 3. If you are using Python 2 and need help porting some logic or finding alternatives, please let me know and I will try to help. :)\n\n### Corrections?\n\nIf you find any issues in these code examples, feel free to submit an Issue or Pull Request. 
I appreciate your input!\n\n### Questions?\n\nReach out to @kjam on Twitter or GitHub. @kjam is also often on freenode. :)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkjam%2Fdata-cleaning-101","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkjam%2Fdata-cleaning-101","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkjam%2Fdata-cleaning-101/lists"}