https://github.com/paulbradshaw/cleaning

Resources on dealing with dirty data problems and using Open Refine

cleaning dirtydata openrefine

Host: GitHub
URL: https://github.com/paulbradshaw/cleaning
Owner: paulbradshaw
Created: 2018-04-13T06:16:33.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2023-10-30T11:52:09.000Z (over 2 years ago)
Last Synced: 2025-04-19T14:11:27.660Z (over 1 year ago)
Topics: cleaning, dirtydata, openrefine
Size: 7.14 MB
Stars: 6
Watchers: 1
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md

# Dealing with dirty data problems (and using Open Refine)

This repo contains resources for dealing with dirty data problems, in particular using the tool Open Refine.

## Data cleaning tools

* [Open Refine](http://openrefine.org/download.html) 

* [Crosswalker](https://crosswalker.washingtonpost.com) - allows you to match datasets where the matching columns may not match entirely

* [Dataproofer](http://dataproofer.org/)

* [Gender API](https://gender-api.com) and [Genderize.io](https://Genderize.io) (also available as [an R package](https://cran.r-project.org/web/packages/genderizeR/index.html)) can be used to classify names by gender

* [Parserator](https://parserator.datamade.us/) can be used to split addresses

* The [tidyverse](https://www.tidyverse.org/) in R: a collection of libraries for making data 'tidy'. See also the [janitor](https://towardsdatascience.com/cleaning-and-exploring-data-with-the-janitor-package-ee4a3edf085e) package

* Python has [pandas](https://pandas.pydata.org/), and [pyjanitor](https://github.com/pyjanitor-devs/pyjanitor)

* [Workbench](https://workbenchdata.com/) (no longer supported but can be run using Docker) - includes much of Refine's functionality including clustering

* [Data Wrangler](http://vis.stanford.edu/wrangler/) (no longer supported)

* I keep a [list of data cleaning tools on Pinboard](https://pinboard.in/u:paulbradshaw/t:cleaning+tools)
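
Several of the tools above offer clustering to group variant spellings of the same thing. As a rough illustration of the idea, here is a minimal Python sketch of a 'key collision' fingerprint: each value is reduced to a normalised key, and values that share a key are grouped. This is a simplified approximation written for this README, not the exact code used by Open Refine or Workbench, and the function names and sample values are made up.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Simplified key-collision fingerprint (an approximation, not any tool's exact code)."""
    # Decompose accented characters and drop anything that is not plain ASCII
    ascii_text = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    # Trim, lowercase and remove punctuation
    cleaned = ascii_text.strip().lower().translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, de-duplicate the tokens and sort them
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values by fingerprint, keeping only groups with more than one distinct spelling."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(members) for key, members in groups.items() if len(members) > 1}


# Hypothetical sample values
names = [
    "Birmingham City Council",
    "birmingham city council.",
    "Council, Birmingham City",
    "Leeds City Council",
]
clusters = cluster(names)
```

Here the three Birmingham variants end up in one cluster and the Leeds entry is left alone. Always review clusters by hand before merging: values that share a key are not guaranteed to mean the same thing.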

## Typical data problems

* Numbers/dates treated as strings (often because of currency or percentage signs, or even spaces - try find and replace)

* Strings treated as numbers: e.g. company ‘numbers’, phone numbers and codes often have leading zeroes removed even though those zeroes are an integral part of the code.

* [Numbers and units combined in sentence structures](https://github.com/paulbradshaw/cleaning/blob/master/sentencestonumbers.md)

* Combined data (addresses)

* Different data in one column (country, region and authority, for example, with spaces or formatting used to indicate the difference)

* Variant spellings

* Inconsistently entered info (e.g. £5k vs £5,000)

* Different terms for same thing

* Mistypings - missing decimals etc.

* Merged cells

* Empty rows

* Headings across multiple rows

* Converted PDFs

* Missing information

* Duplicate information

* Format

* Need to extract information - e.g. first name/surname; street name/region; year/month

* Need to classify information - e.g. male vs female name
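
Two of the problems above (numbers stored as strings, and inconsistently entered amounts such as £5k vs £5,000) can be sketched in Python as follows. This is only an illustration under stated assumptions: the list of currency symbols, the suffixes `k`, `m` and `bn`, and the helper names `parse_amount` and `pad_code` are all my own choices, so adjust them to the data you actually have.

```python
import re
from decimal import Decimal

# Multipliers for the shorthand suffixes this sketch recognises (an assumption)
SUFFIXES = {"k": Decimal(1_000), "m": Decimal(1_000_000), "bn": Decimal(1_000_000_000)}


def parse_amount(text: str) -> Decimal:
    """Convert strings such as '£5k', '£5,000' or ' 5 000 ' into a Decimal."""
    # Remove currency symbols, thousands separators and any spaces
    cleaned = re.sub(r"[£$€,\s]", "", text).lower()
    match = re.fullmatch(r"(-?\d+(?:\.\d+)?)(k|m|bn)?", cleaned)
    if match is None:
        # Fail loudly rather than guess at values we do not understand
        raise ValueError(f"Unrecognised amount: {text!r}")
    number, suffix = match.groups()
    return Decimal(number) * SUFFIXES.get(suffix, Decimal(1))


def pad_code(value, width: int) -> str:
    """Restore leading zeroes on a code that was wrongly stored as a number."""
    return str(value).strip().zfill(width)
```

For example, `parse_amount("£5k")` and `parse_amount("£5,000")` both give the same value, and `pad_code(1234567, 8)` gives `"01234567"`. The width of 8 is just an example: check the correct length for the code you are working with.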

I keep a series of bookmarked materials on cleaning using Pinboard at https://pinboard.in/u:paulbradshaw/t:cleaning

## Examples of dirty data

See the [dirtydata](https://github.com/paulbradshaw/cleaning/tree/master/dirtydata) folder in this repo for examples of dirty data.

[This sample dirty dataset can be used for basic data cleaning in Open Refine](https://docs.google.com/spreadsheets/d/1CDWBeqpUTBd1TkmDz_M6UGRWdHgU7LOcoiGRTvIttKA/edit?usp=sharing#gid=0)

[The European Investment Bank database](http://www.eib.org/en/projects/loan/list/index.htm) can be downloaded (look for *Export to Excel* near the bottom) and provides a useful example of data where dates are formatted as strings.
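
When dates arrive as strings, one approach is to try a list of likely formats until one fits. The Python sketch below assumes three formats that I have picked for illustration; I have not checked them against the EIB export, so inspect the file and edit `DATE_FORMATS` to match.

```python
from datetime import date, datetime

# Candidate formats, tried in order. These are assumptions: edit to match your file.
DATE_FORMATS = ("%d/%m/%Y", "%d.%m.%Y", "%Y-%m-%d")


def parse_date(text: str) -> date:
    """Turn a date stored as a string into a real date object."""
    cleaned = text.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            # This format did not fit, so try the next one
            continue
    raise ValueError(f"No known date format matches {text!r}")
```

The order of the formats matters: a value like `03/04/2019` is read here as 3 April (day first), so put a month-first format at the front if that is what your source uses.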

I also bookmark examples of dirty data at https://pinboard.in/u:paulbradshaw/t:dirtydata

For working with XML files try the ones that can be [downloaded from the Food Standards Agency API page](http://ratings.food.gov.uk/open-data/en-GB)
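
If you want to try the same conversion in code, the sketch below flattens repeated XML elements into rows using Python's standard library. The sample XML and its element names are invented for illustration and may not match the structure of the real files, so open a download and check the tag names first.

```python
import xml.etree.ElementTree as ET

# Made-up sample: the real files may use different element names
SAMPLE_XML = """<FHRSEstablishment>
  <EstablishmentCollection>
    <EstablishmentDetail>
      <BusinessName>Example Cafe</BusinessName>
      <RatingValue>5</RatingValue>
    </EstablishmentDetail>
    <EstablishmentDetail>
      <BusinessName>Sample Takeaway</BusinessName>
    </EstablishmentDetail>
  </EstablishmentCollection>
</FHRSEstablishment>"""


def xml_to_rows(xml_text, record_tag, fields):
    """Return one dictionary per record element, with one key per requested field."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter(record_tag):
        # findtext returns the default when a child element is missing
        rows.append({field: record.findtext(field, default="") for field in fields})
    return rows


rows = xml_to_rows(SAMPLE_XML, "EstablishmentDetail", ["BusinessName", "RatingValue"])
```

Missing elements become empty strings rather than errors, which makes gaps in the data easy to spot once the rows are in a spreadsheet.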

For JSON files try [petition.parliament.uk](https://petition.parliament.uk/) - go to any petition and look for the JSON link at the bottom of the page.
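
A similar sketch for JSON: nested data usually has one branch that holds the repeating items you want as rows. The sample below is invented, and the key names (`data`, `attributes`, `action`, `signatures_by_country`) are assumptions about the structure, so compare them with a real petition's JSON before relying on this.

```python
import json

# Made-up sample: check the key names against a real petition's JSON
SAMPLE_JSON = """{
  "data": {
    "attributes": {
      "action": "Example petition",
      "signatures_by_country": [
        {"name": "France", "code": "FR", "signature_count": 12},
        {"name": "United Kingdom", "code": "GB", "signature_count": 3456}
      ]
    }
  }
}"""


def json_to_rows(json_text):
    """Return one row per country, repeating the petition title on each row."""
    attributes = json.loads(json_text)["data"]["attributes"]
    return [
        {"petition": attributes["action"], **country}
        for country in attributes.get("signatures_by_country", [])
    ]


rows = json_to_rows(SAMPLE_JSON)
```

Each row repeats the petition title alongside the country's name, code and signature count, which is the flat shape a spreadsheet expects.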

## Tutorials: Open Refine

A [series of introductory guides to Open Refine can be found in the GitHub repo for one of my modules at Birmingham City University here](https://github.com/paulbradshaw/MED7369-Specialist-Investigative-Journalism/tree/master/cleaning)

* [Open Refine's 'common transforms' features for basic cleaning](https://onlinejournalismblog.com/2011/07/05/cleaning-data-using-google-refine-a-quick-guide-2/)

* [How to: clean up spreadsheet headings that run across multiple rows using Open Refine](https://onlinejournalismblog.com/2013/11/13/how-to-clean-up-spreadsheet-headings-that-run-across-multiple-rows-using-open-refine/)

* [Converting JSON or XML into spreadsheets using Open Refine](https://onlinejournalismblog.com/2015/10/21/how-to-convert-xml-or-json-into-spreadsheets-using-open-refine/)

* Tony Hirst explains [Data Shaping in Google Refine – Generating New Rows from Multiple Values in a Single Column](https://onlinejournalismblog.com/2012/07/30/data-shaping-in-google-refine-generating-new-rows-from-multiple-values-in-a-single-column/)

* I've written about [Scraping data with Google Refine](https://onlinejournalismblog.com/2012/01/13/sftw-scraping-data-with-google-refine/)

* Former MA Online Journalism student Ion Mates explains how he used some of Refine's more advanced functionality, and regex, in his post on [cleaning a converted PDF](https://onlinejournalismblog.com/2015/04/07/how-to-clean-a-converted-pdf-using-open-refine/)

* Former MA Online Journalism student Cristian Giuletti explains [How to: combine multiple rows in a dataset where text is split across them](https://onlinejournalismblog.com/2014/05/30/how-to-combine-multiple-rows-in-a-dataset-where-text-is-split-across-them-open-refine/)

* More at https://onlinejournalismblog.com/tag/google-refine/page/2/

## Tutorials: Cleaning in spreadsheets

* [Grabbing or checking the first, middle or last part of a piece of information: RIGHT, LEFT and MID](https://github.com/paulbradshaw/cleaning/blob/master/rightleft.md)

* [What day did that date fall on? Which year was the worst? Extracting days, months and years from full dates](https://github.com/paulbradshaw/cleaning/blob/master/dates.md)

* [Converting sentences to numbers using a spreadsheet](https://github.com/paulbradshaw/cleaning/blob/master/sentencestonumbers.md)
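
The tutorials above use spreadsheet functions. If you later move to code, the same ideas translate roughly as in the Python sketch below. The helper names are my own and the sample values are made up; this is not a translation of the tutorials' formulas, just the equivalent operations.

```python
import re
from datetime import date

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")


def left(text: str, n: int) -> str:
    """Like LEFT: the first n characters."""
    return text[:n]


def right(text: str, n: int) -> str:
    """Like RIGHT: the last n characters."""
    return text[-n:] if n > 0 else ""


def mid(text: str, start: int, n: int) -> str:
    """Like MID: n characters from a 1-based start position, as in a spreadsheet."""
    return text[start - 1:start - 1 + n]


def date_parts(d: date) -> dict:
    """Pull the year, month and day of the week out of a full date."""
    return {"year": d.year, "month": d.month, "weekday": WEEKDAYS[d.weekday()]}


def first_number(sentence: str):
    """Extract the first number found in a sentence, or None if there is none."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", sentence)
    return float(match.group().replace(",", "")) if match else None
```

For example, `left("B15 3TN", 3)` gives `"B15"`, `date_parts(date(2018, 4, 13))` reports a Friday, and `first_number("around 1,250 incidents")` gives `1250.0`.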

## Tutorials: Cleaning in R ([dedicated folder here](https://github.com/paulbradshaw/cleaning/tree/master/cleaning_r))

* [Cleaning and reshaping data on firefighters](https://github.com/paulbradshaw/cleaning/blob/master/cleaning_r/reshaping_fire.md)

* [Cleaning data while importing](https://github.com/paulbradshaw/cleaning/blob/master/cleaning_r/import_cleaning.md)

* [Reshaping wide to long](https://github.com/paulbradshaw/cleaning/blob/master/cleaning_r/police_reshape_long.md): police misconduct data

* [Reshaping long to wide and filtering data](https://github.com/paulbradshaw/cleaning/blob/master/cleaning_r/reshape.md) and calculating year on year changes and percentages

* Emily Halford's [guide to cleaning with the Janitor package](https://towardsdatascience.com/cleaning-and-exploring-data-with-the-janitor-package-ee4a3edf085e)

* Matt Waite's Gitbook on data journalism, which [covers cleaning in chapter 10](http://mattwaite.github.io/datajournalism/data-cleaning-part-i-data-smells.html)...

* ...[Janitor in chapter 11](http://mattwaite.github.io/datajournalism/data-cleaning-part-ii-janitor.html) and

* ...[Refinr, Open Refine in R in chapter 12](http://mattwaite.github.io/datajournalism/data-cleaning-part-iii-open-refine.html)
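
The tutorials above do their reshaping in R. To show the underlying idea of 'wide to long' without any particular library, here is a plain Python sketch. The function name, column names and figures are all invented for illustration and are not taken from the police or firefighter data.

```python
def wide_to_long(rows, id_field, var_name="year", value_name="value"):
    """Turn one row per entity (a column per year) into one row per entity and year."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                # The identifier is repeated on every output row, not unpivoted
                continue
            long_rows.append({id_field: row[id_field], var_name: column, value_name: value})
    return long_rows


# Made-up wide data: one column per year
wide = [
    {"force": "Force A", "2016": 10, "2017": 12},
    {"force": "Force B", "2016": 7, "2017": 5},
]
long_rows = wide_to_long(wide, "force")
```

Two wide rows with two year columns each become four long rows, each holding a force, a year and a value. In practice a library function (such as those in the tidyverse or pandas) is the better choice. This version is only meant to make the transformation visible.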
