{"id":18704891,"url":"https://github.com/xiaohk/irs990","last_synced_at":"2025-11-09T05:30:29.516Z","repository":{"id":134719769,"uuid":"73587550","full_name":"xiaohk/IRS990","owner":"xiaohk","description":"📶 Play around with the public IRS 990 tax form data","archived":false,"fork":false,"pushed_at":"2016-11-24T21:21:42.000Z","size":9424,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-12-28T06:27:04.336Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/xiaohk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-11-13T00:53:39.000Z","updated_at":"2017-02-27T15:55:17.000Z","dependencies_parsed_at":"2023-06-02T18:00:29.283Z","dependency_job_id":null,"html_url":"https://github.com/xiaohk/IRS990","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xiaohk%2FIRS990","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xiaohk%2FIRS990/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xiaohk%2FIRS990/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xiaohk%2FIRS990/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/xiaohk","download_url":"https://codeload.github.com/xiaohk/IRS990/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://
github.com","kind":"github","repositories_count":239567843,"owners_count":19660559,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-07T12:09:00.649Z","updated_at":"2025-11-09T05:30:29.449Z","avatar_url":"https://github.com/xiaohk.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"*Download or click the Jupyter notebook [`IRS990.ipynb`](https://github.com/xiaohk/IRS990/blob/master/IRS990.ipynb) to see the summary in a better format*\n\n## Approaches\nAfter getting the data, I did some research on Form 990. I decided to do a mini-project studying the relation between the contributions and investments of organizations. There are many possible approaches.\n\n1. Study the rate of change of contributions against the rate of change of investments over 1, 2, or 3 years\n2. Study contributions and investments in a specific year\n3. Study contributions and investments in a specific area (e.g., Madison, WI)\n4. Study contributions and investments across different kinds of orgs\n5. $\\dots \\dots$\n\n## Steps\nFor this project, I chose the second approach above, using the 2015 population. These are the steps I plan to take.\n\n1. Fairly create a sample of orgs\n    - With reusable parameters (size, etc.)\n2. Get the entries of interest for the orgs sampled above\n    - With reusable parameters (keys of interest)\n    - Write the result to a file, so it is easier to analyze\n3. Filter dirty data\n    - Remove unhelpful data from the sample file\n4. 
Analyze the data\n    - Basic summary statistics\n    - Visualization \n    - Explore the relation between two keys\n\n## Sampling\nTo be fair, I use random sampling. Since the population size varies from year to year, I use the reservoir sampling algorithm, which draws a uniform sample in a single pass without knowing the population size in advance.\n\n\n```python\nimport csv\nimport random\nimport xml.etree.ElementTree as ET\nimport requests\n\ndef sampling(size, f_name):\n    \"\"\"\n    Randomly draw `size` samples, using the f_name csv file as the population. \n    The reservoir sampling algorithm is used.\n\n    Args:\n        size(int) : sample size\n        f_name(string) : name of the population file\n\n    Returns:\n        str list : samples, each element is the unique identifier of the filing\n    \"\"\"\n    samples = []\n    counter = 0\n    with open(f_name, 'r') as fp:\n        # Skip the header line\n        first_line = fp.readline()\n        for line in fp:\n            counter += 1\n            # Fill in the samples \n            if len(samples) \u003c size:\n                samples.append(line.strip('\\n')[-18:])\n            # Dynamic probability of replacing samples with the new sample\n            else:\n                indicator = int(random.random() * counter)\n                # With size/counter probability\n                if indicator \u003c size:\n                    samples[indicator] = line.strip('\\n')[-18:]\n    return samples\n```\n\n## Fetching data\nOnce we have the organization samples, we can access the database to get the values we are interested in. To make the function reusable, I pass the tags in as parameters instead of hard-coding them.\n\nThe flow is \n1. Use the unique identifier to locate the 990 form\n2. Parse the XML file\n3. Fetch the values of the tags of interest\n4. 
Organize the values and write them to a local csv file\n\n\n```python\n\ndef get_data(samples, output, *interests):\n    \"\"\"\n    Access each sample, then organize and write the entries of interest.\n\n    Args:\n        samples(str list) : list of unique identifiers of the samples\n        output(str) : file name of the output\n        *interests : tag names (strings) of the entries of interest. If a\n            tag is not in the 990 form, its value is replaced with an empty string\n\n    \"\"\"\n    \n    with open(output, 'w') as out_csv:\n        # The headers are the identifier and interested tags\n        writer = csv.DictWriter(out_csv, fieldnames = ['id'] + list(interests), \n                                delimiter = '\\t')\n        writer.writeheader()\n\n        url_p = 'https://s3.amazonaws.com/irs-form-990/'\n        url_e = '_public.xml'\n        \n        # Use a counter to track progress\n        counter = 0\n        total = float(len(samples))\n\n        for sample in samples:\n            counter += 1\n            print(\"Finished \" + str(counter / total * 100) + \"%\")\n            # Parse the xml file\n            xml_response = requests.get(url_p + sample + url_e)\n            root = ET.fromstring(xml_response.content)\n\n            # Get the data for interested tags\n            data = {'id' : sample}\n\n            for tag in interests:\n                # For missing data, we use empty string as replacement\n                try:\n                    data[tag] = next(root.iter('{http://www.irs.gov/efile}' + \n                                               tag)).text\n                except StopIteration:\n                    data[tag] = ''\n\n            writer.writerow(data)\n```\n\nNow we can write a script to do the actual work. \n\nI chose a sample size of 30000, and the keys of interest are all keys related to contribution and investment. 
\n\nThis script ran for hours, so I used `ssh` and `screen` to execute it on the computers in the CS lab.\n\n\n```python\nimport sample\n\ncurr_sample = sample.sampling(30000, 'index_2015.csv')\nsample.get_data(curr_sample, 'sample_2015.csv', \n                'PYContributionsGrantsAmt', \n                'CYContributionsGrantsAmt', \n                'TotalContributionsAmt', \n                'ContributionsGiftsGrantsEtcAmt',\n                'PYInvestmentIncomeAmt', \n                'CYInvestmentIncomeAmt')\n```\n\n    Finished 100.0%\n\n\nThe result (first 15 lines) is shown below.\n\n\n```python\nwith open('sample_2015.csv', 'r') as fp:\n    length = 0\n    for line in fp:\n        # Illustrate one part of the csv file\n        if length \u003c 15:\n            print(line)\n        length += 1\n```\n\n    id\tPYContributionsGrantsAmt\tCYContributionsGrantsAmt\tTotalContributionsAmt\tContributionsGiftsGrantsEtcAmt\tPYInvestmentIncomeAmt\tCYInvestmentIncomeAmt\n    \n    201542249349300954\t344438\t68709\t68709\t\t385\t435\n    \n    201510659349200201\t\t\t\t4910\t\t\n    \n    201521699349200427\t\t\t\t310\t\t\n    \n    201530429349200853\t\t\t\t4903\t\t\n    \n    201531349349305353\t38335\t49927\t49927\t\t0\t0\n    \n    201542399349300724\t603535\t348126\t348126\t\t0\t9842\n    \n    201530629349100408\t\t\t\t\t\t\n    \n    201502529349100310\t\t\t\t\t\t\n    \n    201502239349301460\t137017\t210906\t210906\t\t0\t0\n    \n    201511359349101536\t\t\t\t\t\t\n    \n    201502309349200960\t\t\t\t\t\t\n    \n    201542029349300824\t\t989148\t989148\t\t\t131\n    \n    201502439349300735\t372522\t441452\t441452\t\t98\t-3174\n    \n    201531209349101023\t\t\t240000\t\t\t\n    \n\n\n\n```python\nprint(length)\n```\n\n    28259\n\n\n## Format the data\nThis ends the process of building a sample. Now I want to focus on two entries, `PYContributionsGrantsAmt` and `PYInvestmentIncomeAmt`. 
I want to see whether there is any relation between them.\n\nAlthough we have 28259 neat and clean entries with the keys we may be interested in, many of them do not have `PYContributionsGrantsAmt` and `PYInvestmentIncomeAmt`. To make reading the csv easier, we need to remove the invalid data from the sample file.\n\n\n```python\nwith open('sample_2015.csv', 'r') as fp:\n    with open('interest.csv', 'w') as out_fp:\n        # Header line\n        out_fp.write(fp.readline())\n        for line in fp:\n            entries = line.split('\\t')\n            \n            # We only want the data with valid PYContributionsGrantsAmt \n            # and PYInvestmentIncomeAmt\n            if entries[1] and entries[1] != \"RESTRICTED\" and \\\n            entries[5] and entries[5] != \"RESTRICTED\":\n                out_fp.write(line)\n```\n\n\n```python\n# Illustrate the final interest.csv (first 15 lines)\nwith open('interest.csv', 'r') as fp:\n    length = 0\n    for line in fp:\n        if length \u003c 15:\n            print(line)\n        length += 1\n```\n\n    id\tPYContributionsGrantsAmt\tCYContributionsGrantsAmt\tTotalContributionsAmt\tContributionsGiftsGrantsEtcAmt\tPYInvestmentIncomeAmt\tCYInvestmentIncomeAmt\n    \n    201542249349300954\t344438\t68709\t68709\t\t385\t435\n    \n    201531349349305353\t38335\t49927\t49927\t\t0\t0\n    \n    201542399349300724\t603535\t348126\t348126\t\t0\t9842\n    \n    201502239349301460\t137017\t210906\t210906\t\t0\t0\n    \n    201502439349300735\t372522\t441452\t441452\t\t98\t-3174\n    \n    201500729349300330\t3886855\t4286945\t4286945\t\t104638\t121881\n    \n    201522369349300207\t432619\t594609\t594609\t\t0\t0\n    \n    201501349349305660\t418960\t44427\t44427\t\t391823\t307090\n    \n    201522369349300307\t277082\t722789\t722789\t\t43051\t59512\n    \n    201541359349309199\t6017511\t6118975\t6118975\t\t5981\t2582\n    \n    201521599349300852\t294874\t254497\t254497\t\t181\t103\n    \n    
201531359349306593\t0\t0\t\t\t270\t318\n    \n    201532299349302418\t447104\t431291\t431291\t\t10\t37\n    \n    201532299349302443\t1534385\t2240547\t2240547\t\t63\t0\n    \n\n\n\n```python\nprint(length)\n```\n\n    11748\n\n\n\n```python\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\n\n# Constants for the key\nCONTRI = \"PYContributionsGrantsAmt\"\nINVEST = \"PYInvestmentIncomeAmt\"\n\n# Build the data frame for analysis; the csv reader may make some errors while parsing\n# (read_csv replaces DataFrame.from_csv, which was removed in pandas 1.0)\ndf = pd.read_csv('interest.csv', sep = '\\t', index_col = 0)\n\n# Visualizing the DataFrame (first 16 rows)\n# The output is a little odd in jupyter\nprint(df[:16])\n```\n\n                        PYContributionsGrantsAmt  CYContributionsGrantsAmt  \\\n    id                                                                       \n    201542249349300954                    344438                     68709   \n    201531349349305353                     38335                     49927   \n    201542399349300724                    603535                    348126   \n    201502239349301460                    137017                    210906   \n    201502439349300735                    372522                    441452   \n    201500729349300330                   3886855                   4286945   \n    201522369349300207                    432619                    594609   \n    201501349349305660                    418960                     44427   \n    201522369349300307                    277082                    722789   \n    201541359349309199                   6017511                   6118975   \n    201521599349300852                    294874                    254497   \n    201531359349306593                         0                         0   \n    201532299349302418                    447104                    431291   \n    201532299349302443                   1534385                   2240547   \n    201521319349300912                     39290        
             54034   \n    201532299349302518                         0                     41404   \n    \n                       TotalContributionsAmt  ContributionsGiftsGrantsEtcAmt  \\\n    id                                                                         \n    201542249349300954                 68709                             NaN   \n    201531349349305353                 49927                             NaN   \n    201542399349300724                348126                             NaN   \n    201502239349301460                210906                             NaN   \n    201502439349300735                441452                             NaN   \n    201500729349300330               4286945                             NaN   \n    201522369349300207                594609                             NaN   \n    201501349349305660                 44427                             NaN   \n    201522369349300307                722789                             NaN   \n    201541359349309199               6118975                             NaN   \n    201521599349300852                254497                             NaN   \n    201531359349306593                   NaN                             NaN   \n    201532299349302418                431291                             NaN   \n    201532299349302443               2240547                             NaN   \n    201521319349300912                 54034                             NaN   \n    201532299349302518                 41404                             NaN   \n    \n                        PYInvestmentIncomeAmt  CYInvestmentIncomeAmt  \n    id                                                                \n    201542249349300954                    385                    435  \n    201531349349305353                      0                      0  \n    201542399349300724                      0                   9842  \n    201502239349301460                      0                      
0  \n    201502439349300735                     98                  -3174  \n    201500729349300330                 104638                 121881  \n    201522369349300207                      0                      0  \n    201501349349305660                 391823                 307090  \n    201522369349300307                  43051                  59512  \n    201541359349309199                   5981                   2582  \n    201521599349300852                    181                    103  \n    201531359349306593                    270                    318  \n    201532299349302418                     10                     37  \n    201532299349302443                     63                      0  \n    201521319349300912                   8923                  47214  \n    201532299349302518                  50005                  38501  \n\n\n## Analyzing the data\nAfter getting the 11747 entries from the valid data, we can start our analysis. \n\n1. Basic summary statistics\n2. Visualization \n3. Explore the relation between two keys\n\n\n```python\nprint(df[CONTRI].mean())\nprint(df[CONTRI].std())\nprint(df[CONTRI].median())\n```\n\n    2142237.82617\n    18194432.1808\n    173226.0\n\n\n\n```python\nprint(df[INVEST].mean())\nprint(df[INVEST].std())\nprint(df[INVEST].median())\n```\n\n    359683.324849\n    3618067.79855\n\n\nThe standard deviation is so large that the mean is not a representative measure. The two medians give a better overall picture of the two variables. 
Next we can draw box plots to further illustrate these measures.\n\n\n```python\n%matplotlib inline\ndf.boxplot(CONTRI)\n```\n\n\n\n\n    \u003cmatplotlib.axes._subplots.AxesSubplot at 0x10643a240\u003e\n\n\n\n\n![png](output_19_1.png)\n\n\n\n```python\ndf.boxplot(INVEST)\n```\n\n\n\n\n    \u003cmatplotlib.axes._subplots.AxesSubplot at 0x106acfa90\u003e\n\n\n\n\n![png](output_20_1.png)\n\n\nThe boxplots show that both variables are highly skewed, and there are some outliers.\n\n\n```python\nsca = df.plot(x=INVEST, y=CONTRI, style='o')\nsca.set_ylim(0,10000000)\n```\n\n\n\n\n    (0, 10000000)\n\n\n\n\n![png](output_22_1.png)\n\n\nMany organizations have a small investment (close to $0$) but a significant contribution. We need to further understand the data before doing other analysis and operations.\n\nFrom the right part of the plot, we cannot see a clear linear relation between investment and contribution. \n\n## Conclusion\nIt is very fun to do a real-world data analysis. I have learned A LOT from this mini project, including xml parsing and pandas data analysis.\n\nThe result is not quite what I expected. If I had more time, or were to redo this project, I would definitely:\n\n1. Do more research on the 990 Form, and come up with more helpful key entries to use.\n2. Try clustering the population. Using random sampling is not very fair in this project. I would try to divide the organizations into different groups, and perform the analysis within and across those groups.\n3. Check the result of the pandas csv reading; it seems there are some wrongly formatted entries.\n4. Explore more about the relations between contribution and investment.\n5. Besides contribution and investment, I am also interested in some other cool studies. For example, how LGBTQ organizations have developed in the United States since 2010.\n6. 
Add documentation for the Python functions","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxiaohk%2Firs990","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxiaohk%2Firs990","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxiaohk%2Firs990/lists"}