{"id":20285438,"url":"https://github.com/mramshaw/data-cleaning","last_synced_at":"2025-08-21T03:32:43.662Z","repository":{"id":44611522,"uuid":"128695871","full_name":"mramshaw/Data-Cleaning","owner":"mramshaw","description":"Data Cleaning with Python","archived":false,"fork":false,"pushed_at":"2024-06-18T19:02:30.000Z","size":1225,"stargazers_count":41,"open_issues_count":9,"forks_count":14,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-12-08T16:52:58.656Z","etag":null,"topics":["data-cleaning","data-munging","data-wrangling","numpy","pandas","python","python3"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mramshaw.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-04-09T01:04:49.000Z","updated_at":"2024-11-17T22:38:41.000Z","dependencies_parsed_at":"2024-11-14T14:27:40.560Z","dependency_job_id":"77fde4bf-a593-4540-9dd6-703015e5182b","html_url":"https://github.com/mramshaw/Data-Cleaning","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mramshaw%2FData-Cleaning","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mramshaw%2FData-Cleaning/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mramshaw%2FData-Cleaning/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mramshaw%2FData-Cleaning/manifests","owner_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/owners/mramshaw","download_url":"https://codeload.github.com/mramshaw/Data-Cleaning/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230487843,"owners_count":18233865,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-cleaning","data-munging","data-wrangling","numpy","pandas","python","python3"],"created_at":"2024-11-14T14:26:42.555Z","updated_at":"2024-12-19T19:07:47.875Z","avatar_url":"https://github.com/mramshaw.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Cleaning with NumPy and Pandas\n\n[![Known Vulnerabilities](http://snyk.io/test/github/mramshaw/Data-Cleaning/badge.svg?style=plastic\u0026targetFile=requirements.txt)](http://snyk.io/test/github/mramshaw/Data-Cleaning?style=plastic\u0026targetFile=requirements.txt)\n\n\u003e let’s be honest, the vast majority of time a data scientist spends is not doing all the really cool modeling that we all wanna do,\n\u003e it’s doing the data prep, the manipulation, reporting, graphing… That’s 80%-90% of the job now.\n\n    Jared Lander - http://changelog.com/practicalai/7\n\nShamelessly stolen from the [CrowdFlower 2016 survey](http://visit.figure-eight.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf):\n\n![What Data Scientists do](images/What_data_scientists_do.png)\n\n\u003e The things data scientists do most are the things they enjoy least.\n\nFrom the same survey:\n\n![What Data Scientists enjoy least](images/What_data_scientists_enjoy_least.png)\n\n[Note that the above 
graphics are based upon a __2016__ survey.]\n\nAt meetups, I have heard at least one data scientist say that most of their time is\nspent cleaning data, so when I ran across this great\n[RealPython article](http://realpython.com/python-data-cleaning-numpy-pandas/)\nI decided to try it out (the article suggests about 80% of a data scientist's\n time is spent cleaning data).\n\nThe recommendation is to use Jupyter notebooks but I chose to use IPython.\n\nI also created a batch version for fun: [winston_wolfe.py](winston_wolfe.py)\n\n## Other Terms\n\n___Data Cleaning___ is also referred to as ___Data Wrangling___,\n___Data Munging___, ___Data Janitor Work___ and ___Data Preparation___.\nAll of these refer to preparing data for ingestion into a data processing\nstream of some kind. Computers are very intolerant of format differences,\nso all of the data must be reformatted to conform to a standard\n(or \"clean\") format. Missing data and partial datasets can be\nproblematic, so an initial goal is to identify data deficiencies\nbefore they lead to spurious results.\n\n[Sometimes it is not mentioned at all, merely ___implied___.\n It is generally not possible to carry out an __ETL__ (Extract,\n Transform and Load) job without doing at least ___some___\n data cleaning. If you are asked for a time estimate for an ETL\n job, remember to factor in time for data examination \u0026 data\n cleaning, not to mention deciding how to handle [outliers](http://en.wikipedia.org/wiki/Outlier)\n (drop or not? if so, what is a good cutoff point? etc.).]\n\nOther requirements may include ___normalizing___ data sets,\nwhich generally means scaling the data to values between 0 and 1\n(this enables certain types of numerical analysis).\n\nThe end result may sometimes be referred to as ___tidy data___.\nHowever, it is important to remember that data cleaning is not\nalways a one-time task. 
The further use of any given dataset\nmay well highlight details that need further cleaning.\n\n## Exploration\n\nLets start with our first dataset.\n\nThe first thing is to have a look at the data. Here we will use the `head()`\ncommand to inspect the first 5 records of our input file (`head` is an old\n\\*nix command meaning show the ___head___ of the specified file; and the `\\`\ncharacter has long been used in \\*nix as a continuation character; here the\ndata columns are broken up so as to not overflow the available screen width):\n\n``` Python\n\u003e\u003e\u003e import pandas as pd\n\u003e\u003e\u003e import numpy as np\n\u003e\u003e\u003e df = pd.read_csv('Datasets/BL-Flickr-Images-Book.csv')\n\u003e\u003e\u003e df.head()\n   Identifier             Edition Statement      Place of Publication  \\\n0         206                           NaN                    London   \n1         216                           NaN  London; Virtue \u0026 Yorston   \n2         218                           NaN                    London   \n3         472                           NaN                    London   \n4         480  A new edition, revised, etc.                    London   \n\n  Date of Publication              Publisher  \\\n0         1879 [1878]       S. Tinsley \u0026 Co.   \n1                1868           Virtue \u0026 Co.   \n2                1869  Bradbury, Evans \u0026 Co.   \n3                1851          James Darling   \n4                1857   Wertheim \u0026 Macintosh   \n\n                                               Title     Author  \\\n0                  Walter Forbes. [A novel.] By A. A      A. A.   \n1  All for Greed. [A novel. The dedication signed...  A., A. A.   \n2  Love the Avenger. By the author of “All for Gr...  A., A. A.   \n3  Welsh Sketches, chiefly ecclesiastical, to the...  A., E. S.   \n4  [The World in which I live, and my place in it...  A., E. S.   
\n\n                                   Contributors  Corporate Author  \\\n0                               FORBES, Walter.               NaN   \n1  BLAZE DE BURY, Marie Pauline Rose - Baroness               NaN   \n2  BLAZE DE BURY, Marie Pauline Rose - Baroness               NaN   \n3                   Appleyard, Ernest Silvanus.               NaN   \n4                           BROOME, John Henry.               NaN   \n\n   Corporate Contributors Former owner  Engraver Issuance type  \\\n0                     NaN          NaN       NaN   monographic   \n1                     NaN          NaN       NaN   monographic   \n2                     NaN          NaN       NaN   monographic   \n3                     NaN          NaN       NaN   monographic   \n4                     NaN          NaN       NaN   monographic   \n\n                                          Flickr URL  \\\n0  http://www.flickr.com/photos/britishlibrary/ta...   \n1  http://www.flickr.com/photos/britishlibrary/ta...   \n2  http://www.flickr.com/photos/britishlibrary/ta...   \n3  http://www.flickr.com/photos/britishlibrary/ta...   \n4  http://www.flickr.com/photos/britishlibrary/ta...   \n\n                            Shelfmarks  \n0    British Library HMNTS 12641.b.30.  \n1    British Library HMNTS 12626.cc.2.  \n2    British Library HMNTS 12625.dd.1.  \n3  British Library HMNTS 10369.bbb.15.  \n4     British Library HMNTS 9007.d.28.  \n\u003e\u003e\u003e\n```\n\n[To see more rows, use `head(10)` instead.]\n\n## Disposal\n\nNow let's get rid of any columns we don't need:\n\n``` Python\n\u003e\u003e\u003e to_drop = ['Edition Statement',\n...            'Corporate Author',\n...            'Corporate Contributors',\n...            'Former owner',\n...            'Engraver',\n...            'Contributors',\n...            'Issuance type',\n...            
'Shelfmarks']\n\u003e\u003e\u003e df.drop(to_drop, inplace=True, axis=1)\n\u003e\u003e\u003e df.head()\n   Identifier      Place of Publication Date of Publication  \\\n0         206                    London         1879 [1878]   \n1         216  London; Virtue \u0026 Yorston                1868   \n2         218                    London                1869   \n3         472                    London                1851   \n4         480                    London                1857   \n\n               Publisher                                              Title  \\\n0       S. Tinsley \u0026 Co.                  Walter Forbes. [A novel.] By A. A   \n1           Virtue \u0026 Co.  All for Greed. [A novel. The dedication signed...   \n2  Bradbury, Evans \u0026 Co.  Love the Avenger. By the author of “All for Gr...   \n3          James Darling  Welsh Sketches, chiefly ecclesiastical, to the...   \n4   Wertheim \u0026 Macintosh  [The World in which I live, and my place in it...   \n\n      Author                                         Flickr URL  \n0      A. A.  http://www.flickr.com/photos/britishlibrary/ta...  \n1  A., A. A.  http://www.flickr.com/photos/britishlibrary/ta...  \n2  A., A. A.  http://www.flickr.com/photos/britishlibrary/ta...  \n3  A., E. S.  http://www.flickr.com/photos/britishlibrary/ta...  \n4  A., E. S.  http://www.flickr.com/photos/britishlibrary/ta...  \n\u003e\u003e\u003e\n```\n\nAnd now we are down to seven columns (from the original fifteen).\n\nThe article suggests that if you know in advance which columns you’d like to use,\nanother option is to pass them to the `usecols` argument of `pd.read_csv`.\n\n## Indexing\n\nWhile we are now down to seven columns, if we were to write this file now we would\nsee that `pandas` has prepended an index to each and every entry. 
It looks like\nthe `Identifier` column is unique, let's check this:\n\n``` Python\n\u003e\u003e\u003e df['Identifier'].is_unique\nTrue\n\u003e\u003e\u003e\n```\n\nOkay, it looks like we can use this column as an index:\n\n``` Python\n\u003e\u003e\u003e df = df.set_index('Identifier')\n\u003e\u003e\u003e df.head()\n                Place of Publication Date of Publication  \\\nIdentifier                                                 \n206                           London         1879 [1878]   \n216         London; Virtue \u0026 Yorston                1868   \n218                           London                1869   \n472                           London                1851   \n480                           London                1857   \n\n                        Publisher  \\\nIdentifier                          \n206              S. Tinsley \u0026 Co.   \n216                  Virtue \u0026 Co.   \n218         Bradbury, Evans \u0026 Co.   \n472                 James Darling   \n480          Wertheim \u0026 Macintosh   \n\n                                                        Title     Author  \\\nIdentifier                                                                 \n206                         Walter Forbes. [A novel.] By A. A      A. A.   \n216         All for Greed. [A novel. The dedication signed...  A., A. A.   \n218         Love the Avenger. By the author of “All for Gr...  A., A. A.   \n472         Welsh Sketches, chiefly ecclesiastical, to the...  A., E. S.   \n480         [The World in which I live, and my place in it...  A., E. S.   \n\n                                                   Flickr URL  \nIdentifier                                                     \n206         http://www.flickr.com/photos/britishlibrary/ta...  \n216         http://www.flickr.com/photos/britishlibrary/ta...  \n218         http://www.flickr.com/photos/britishlibrary/ta...  \n472         http://www.flickr.com/photos/britishlibrary/ta...  
\n480         http://www.flickr.com/photos/britishlibrary/ta...  \n\u003e\u003e\u003e\n```\n\nYep, that works.\n\nThe article now points out that Pandas Indexes do not make any guarantees of uniqueness,\nalthough many indexing and merging operations will run faster if the Index __is__ unique.\n\nWe can now use `loc[]` to do key-based locating:\n\n``` Python\n\u003e\u003e\u003e df.loc[216]\nPlace of Publication                             London; Virtue \u0026 Yorston\nDate of Publication                                                  1868\nPublisher                                                    Virtue \u0026 Co.\nTitle                   All for Greed. [A novel. The dedication signed...\nAuthor                                                          A., A. A.\nFlickr URL              http://www.flickr.com/photos/britishlibrary/ta...\nName: 216, dtype: object\n\u003e\u003e\u003e\n```\n\nOr we could use `iloc[]` to access our entries by integer position (instead of by key);\nposition 1 is the second row:\n\n``` Python\n\u003e\u003e\u003e df.iloc[1]\nPlace of Publication                             London; Virtue \u0026 Yorston\nDate of Publication                                                  1868\nPublisher                                                    Virtue \u0026 Co.\nTitle                   All for Greed. [A novel. The dedication signed...\nAuthor                                                          A., A. 
A.\nFlickr URL              http://www.flickr.com/photos/britishlibrary/ta...\nName: 216, dtype: object\n\u003e\u003e\u003e\n```\n\nWe could also have set our index __in-place__:\n\n``` Python\ndf.set_index('Identifier', inplace=True)\n```\n\nInstead of:\n\n``` Python\n\u003e\u003e\u003e df = df.set_index('Identifier')\n```\n\n## Cleaning up data fields\n\nLet's see what datatypes we have:\n\n``` Python\n\u003e\u003e\u003e df.get_dtype_counts()\nobject    6\ndtype: int64\n\u003e\u003e\u003e\n```\n\n[`get_dtype_counts()` was removed in pandas 1.0; on newer versions use\n `df.dtypes.value_counts()` instead.]\n\nOkay, so let's check for formatting issues:\n\n``` Python\n\u003e\u003e\u003e df.loc[1905:, 'Date of Publication'].head(10)\nIdentifier\n1905           1888\n1929    1839, 38-54\n2836           1897\n2854           1865\n2956        1860-63\n2957           1873\n3017           1866\n3131           1899\n4598           1814\n4884           1820\nName: Date of Publication, dtype: object\n\u003e\u003e\u003e\n```\n\nSo we will need to clean up 'Date of Publication'. We will\nuse a regular expression to extract the leading four-digit year:\n\n``` Python\n\u003e\u003e\u003e regex = r'^(\\d{4})'\n\u003e\u003e\u003e extr = df['Date of Publication'].str.extract(regex, expand=False)\n\u003e\u003e\u003e extr.head()\nIdentifier\n206    1879\n216    1868\n218    1869\n472    1851\n480    1857\nName: Date of Publication, dtype: object\n\u003e\u003e\u003e\n```\n\nNow let's convert these to a numeric type and copy them back:\n\n``` Python\n\u003e\u003e\u003e df['Date of Publication'] = pd.to_numeric(extr)\n\u003e\u003e\u003e df['Date of Publication'].dtype\ndtype('float64')\n\u003e\u003e\u003e\n```\n\nThe result is `float64` rather than an integer type because entries with no leading\nfour-digit year come back from `str.extract` as `NaN`, and `NaN` is a float.\nNote that floats have a decimal portion, which can look a little weird:\n\n``` Python\n\u003e\u003e\u003e df['Date of Publication'].head()\nIdentifier\n206    1879.0\n216    1868.0\n218    1869.0\n472    1851.0\n480    1857.0\nName: Date of Publication, dtype: float64\n\u003e\u003e\u003e\n```\n\nSo far we have just used `pandas`; let's move on to using `numpy`.\n\n## More cleaning up data 
fields (this time with `numpy`)\n\nLets have a look at our 'Place of Publication':\n\n``` Python\n\u003e\u003e\u003e df['Place of Publication'].head(10)\nIdentifier\n206                                  London\n216                London; Virtue \u0026 Yorston\n218                                  London\n472                                  London\n480                                  London\n481                                  London\n519                                  London\n667     pp. 40. G. Bryan \u0026 Co: Oxford, 1898\n874                                 London]\n1143                                 London\nName: Place of Publication, dtype: object\n\u003e\u003e\u003e\n```\n\nLets see if we can isolate London:\n\n``` Python\n\u003e\u003e\u003e pub = df['Place of Publication']\n\u003e\u003e\u003e london = pub.str.contains('London')\n\u003e\u003e\u003e london[:5]\nIdentifier\n206    True\n216    True\n218    True\n472    True\n480    True\nName: Place of Publication, dtype: bool\n\u003e\u003e\u003e\n```\n\nLets add Oxford and clean them both up:\n\n``` Python\n\u003e\u003e\u003e oxford = pub.str.contains('Oxford')\n\u003e\u003e\u003e df['Place of Publication'] = np.where(london, 'London',\n...                                       np.where(oxford, 'Oxford',\n...                                                
pub.str.replace('-', ' ')))\n\u003e\u003e\u003e df['Place of Publication'].head()\nIdentifier\n206    London\n216    London\n218    London\n472    London\n480    London\nName: Place of Publication, dtype: object\n\u003e\u003e\u003e\n```\n\nWe *could* clean up leading and trailing whitespace with something like the following\n(the `.str` accessor leaves missing values as `NaN`, whereas `map(str.strip)` would\nraise a `TypeError` on them):\n\n``` Python\ndf[\"Publisher\"] = df[\"Publisher\"].str.strip()\n```\n\nBut it is simpler still to strip leading whitespace on ingress (note that\n`skipinitialspace` only skips spaces that follow a delimiter, so trailing\nwhitespace would still need `str.strip()`):\n\n``` Python\ndf = pd.read_csv('Datasets/BL-Flickr-Images-Book.csv', skipinitialspace=True)\n```\n\n## Destructuring data\n\nOn to the second dataset.\n\nThe `university_towns.txt` dataset is heavily-structured: each state line\n(marked with `[edit]`) is followed by the university towns in that state.\n\n```\n$ head Datasets/university_towns.txt\nAlabama[edit]\nAuburn (Auburn University)[1]\nFlorence (University of North Alabama)\nJacksonville (Jacksonville State University)[2]\nLivingston (University of West Alabama)[2]\nMontevallo (University of Montevallo)[2]\nTroy (Troy University)[2]\nTuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]\nTuskegee (Tuskegee University)[5]\nAlaska[edit]\n$\n```\n\nWe can destructure it as follows:\n\n``` Python\n\u003e\u003e\u003e university_towns = []\n\u003e\u003e\u003e with open('Datasets/university_towns.txt') as file:\n...     for line in file:\n...         if '[edit]' in line:\n...             # Remember this `state` until the next is found\n...             state = line\n...         else:\n...             # Otherwise, we have a city; keep `state` as last-seen\n...             university_towns.append((state, line))\n... 
\n\u003e\u003e\u003e university_towns[:5]\n[('Alabama[edit]\\n', 'Auburn (Auburn University)[1]\\n'), ('Alabama[edit]\\n', 'Florence (University of North Alabama)\\n'), ('Alabama[edit]\\n', 'Jacksonville (Jacksonville State University)[2]\\n'), ('Alabama[edit]\\n', 'Livingston (University of West Alabama)[2]\\n'), ('Alabama[edit]\\n', 'Montevallo (University of Montevallo)[2]\\n')]\n\u003e\u003e\u003e\n```\n\nAnd now we can create a dataframe:\n\n``` Python\n\u003e\u003e\u003e towns_df = pd.DataFrame(university_towns,\n...                         columns=['State', 'RegionName'])\n\u003e\u003e\u003e towns_df.head()\n             State                                         RegionName\n0  Alabama[edit]\\n                    Auburn (Auburn University)[1]\\n\n1  Alabama[edit]\\n           Florence (University of North Alabama)\\n\n2  Alabama[edit]\\n  Jacksonville (Jacksonville State University)[2]\\n\n3  Alabama[edit]\\n       Livingston (University of West Alabama)[2]\\n\n4  Alabama[edit]\\n         Montevallo (University of Montevallo)[2]\\n\n\u003e\u003e\u003e\n```\n\nLets create a function to clean up our data cells:\n\n``` Python\n\u003e\u003e\u003e def get_citystate(item):\n...     if ' (' in item:\n...         return item[:item.find(' (')]\n...     elif '[' in item:\n...         return item[:item.find('[')]\n...     else:\n...         return item\n... 
\n\u003e\u003e\u003e towns_df = towns_df.applymap(get_citystate)\n\u003e\u003e\u003e towns_df.head()\n     State    RegionName\n0  Alabama        Auburn\n1  Alabama      Florence\n2  Alabama  Jacksonville\n3  Alabama    Livingston\n4  Alabama    Montevallo\n\u003e\u003e\u003e\n```\n\n[`applymap()` was deprecated in pandas 2.1 in favour of the equivalent `DataFrame.map()`.]\n\n[We could also do something about quotes (`\"`) but that would expose the embedded commas (`,`)\n and this would be a bigger problem.]\n\nThe article points out that `applymap()` calls a Python function on every cell, which can have\na significant performance impact on larger datasets, and that if performance ___is___ a\nconsideration, this type of thing should be done with vectorized `numpy` operations instead\n(in general, equivalent operations in numpy will significantly out-perform native Python).\n\n## Dropping rows and renaming columns\n\nAnd now our third dataset. Let's see what we've got:\n\n```\n$ head -n 5 Datasets/olympics.csv\n0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15\n,? Summer,01 !,02 !,03 !,Total,? Winter,01 !,02 !,03 !,Total,? Games,01 !,02 !,03 !,Combined total\nAfghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2\nAlgeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15\nArgentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70\n$\n```\n\nLet's read it in:\n\n``` Python\n\u003e\u003e\u003e olympics_df = pd.read_csv('Datasets/olympics.csv')\n\u003e\u003e\u003e olympics_df.head()\n                   0         1     2     3     4      5         6     7     8  \\\n0                NaN  ? Summer  01 !  02 !  03 !  Total  ? Winter  01 !  02 !   \n1  Afghanistan (AFG)        13     0     0     2      2         0     0     0   \n2      Algeria (ALG)        12     5     2     8     15         3     0     0   \n3    Argentina (ARG)        23    18    24    28     70        18     0     0   \n4      Armenia (ARM)         5     1     2     9     12         6     0     0   \n\n      9     10       11    12    13    14              15  \n0  03 !  Total  ? Games  01 !  02 !  03 !  
Combined total  \n1     0      0       13     0     0     2               2  \n2     0      0       15     5     2     8              15  \n3     0      0       41    18    24    28              70  \n4     0      0       11     1     2     9              12  \n\u003e\u003e\u003e\n```\n\nThat first row looks pretty useless. So maybe drop it:\n\n``` Python\n\u003e\u003e\u003e olympics_df = pd.read_csv('Datasets/olympics.csv', header=1)\n\u003e\u003e\u003e olympics_df.head()\n                Unnamed: 0  ? Summer  01 !  02 !  03 !  Total  ? Winter  \\\n0        Afghanistan (AFG)        13     0     0     2      2         0   \n1            Algeria (ALG)        12     5     2     8     15         3   \n2          Argentina (ARG)        23    18    24    28     70        18   \n3            Armenia (ARM)         5     1     2     9     12         6   \n4  Australasia (ANZ) [ANZ]         2     3     4     5     12         0   \n\n   01 !.1  02 !.1  03 !.1  Total.1  ? Games  01 !.2  02 !.2  03 !.2  \\\n0       0       0       0        0       13       0       0       2   \n1       0       0       0        0       15       5       2       8   \n2       0       0       0        0       41      18      24      28   \n3       0       0       0        0       11       1       2       9   \n4       0       0       0        0        2       3       4       5   \n\n   Combined total  \n0               2  \n1              15  \n2              70  \n3              12  \n4              12  \n\u003e\u003e\u003e\n```\n\nThat's better. Now lets rename some columns:\n\n``` Python\n\u003e\u003e\u003e new_names = {'Unnamed: 0': 'Country',\n...              '? Summer': 'Summer Olympics',\n...              '01 !': 'Gold',\n...              '02 !': 'Silver',\n...              '03 !': 'Bronze',\n...              '? Winter': 'Winter Olympics',\n...              '01 !.1': 'Gold.1',\n...              '02 !.1': 'Silver.1',\n...              '03 !.1': 'Bronze.1',\n...              '? 
Games': '# Games',\n...              '01 !.2': 'Gold.2',\n...              '02 !.2': 'Silver.2',\n...              '03 !.2': 'Bronze.2'}\n\u003e\u003e\u003e olympics_df.rename(columns=new_names, inplace=True)\n\u003e\u003e\u003e olympics_df.head()\n                   Country  Summer Olympics  Gold  Silver  Bronze  Total  \\\n0        Afghanistan (AFG)               13     0       0       2      2   \n1            Algeria (ALG)               12     5       2       8     15   \n2          Argentina (ARG)               23    18      24      28     70   \n3            Armenia (ARM)                5     1       2       9     12   \n4  Australasia (ANZ) [ANZ]                2     3       4       5     12   \n\n   Winter Olympics  Gold.1  Silver.1  Bronze.1  Total.1  # Games  Gold.2  \\\n0                0       0         0         0        0       13       0   \n1                3       0         0         0        0       15       5   \n2               18       0         0         0        0       41      18   \n3                6       0         0         0        0       11       1   \n4                0       0         0         0        0        2       3   \n\n   Silver.2  Bronze.2  Combined total  \n0         0         2               2  \n1         2         8              15  \n2        24        28              70  \n3         2         9              12  \n4         4         5              12  \n\u003e\u003e\u003e\n```\n\nMuch better! 
Note that we have kept numeric suffixes on our repeated columns (Gold,\nSilver, Bronze) so that each column name is unique within the dataframe.\n\nAnd thus ends the tutorial on cleaning data with Python.\n\n## Python and pydoc\n\nSome great stuff on documenting Python code here:\n\n    http://realpython.com/documenting-python-code/\n\nLet's clean up the code comments so that `pydoc` displays cleanly:\n\n```\nHelp on module winston_wolfe:\n\nNAME\n    winston_wolfe - A quick and dirty 'cleaner' for some data files.\n\nFILE\n    /home/owner/Documents/Python/Data Cleaning/winston_wolfe.py\n\nDESCRIPTION\n    Three datasets will be cleaned, with cells reformatted as needed.\n\nFUNCTIONS\n    get_citystate(item)\n        A function to clean up data cells.\n\nDATA\n    DF =            Place of Publication  Date of Publica...s/britishlibra...\n    EXTRACT = Identifier\n    206        1879\n    216        1868\n    218  ... Date of ...\n    LONDON = Identifier\n    206         True\n    216         True\n    218...: Place of...\n    NEW_NAMES = {'01 !': 'Gold', '01 !.1': 'Gold.1', '01 !.2': 'Gold.2', '...\n    OLYMPICS_DF =                                           Countr...  607...\n    OXFORD = Identifier\n    206        False\n    216        False\n    218...: Place of...\n    PUB = Identifier\n    206                     London\n    216   ...Place of Publ...\n    TOWNS_DF =              State                    RegionName...        
...\n    TO_DROP = ['Edition Statement', 'Corporate Author', 'Corporate Contrib...\n    UNIVERSITY_TOWNS = [('Alabama[edit]\\n', 'Auburn (Auburn University)[1]...\n    line = 'Laramie (University of Wyoming)[5]\\n'\n    state = 'Wyoming[edit]\\n'\n    towns = \u003cclosed file 'Datasets/university_towns.txt', mode 'r'\u003e\n```\n\n## Reference\n\nTidy Data\n\n```\n@Article{tidy-data,\n  author = {Hadley Wickham},\n  issue = {10},\n  journal = {The Journal of Statistical Software},\n  selected = {TRUE},\n  title = {Tidy data},\n  url = {http://www.jstatsoft.org/v59/i10/},\n  volume = {59},\n  year = {2014},\n  bdsk-url-1 = {http://www.jstatsoft.org/v59/i10/},\n}\n```\n\nread_csv\n\n    http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv\n\nread_pickle\n\n    http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_pickle.html#pandas.read_pickle\n\ndrop\n\n    http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html#pandas.DataFrame.drop\n\nto_csv\n\n    http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.to_csv.html#pandas.Series.to_csv\n\nto_pickle\n\n    http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_pickle.html#pandas.DataFrame.to_pickle\n\n## To Do\n\n- [x] Rephrase doc comments to conform to `pydocstyle`\n- [x] Add survey results from [CrowdFlower 2016 survey](http://visit.figure-eight.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf)\n- [ ] Pickle everything instead of writing output files\n\n## Credits\n\nInspired by this great tutorial:\n\n    http://realpython.com/python-data-cleaning-numpy-pandas/\n\nI have been really impressed by the quality of the Real Python 
tutorials.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmramshaw%2Fdata-cleaning","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmramshaw%2Fdata-cleaning","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmramshaw%2Fdata-cleaning/lists"}