{"id":21293388,"url":"https://github.com/ejw-data/etl-election-reporting","last_synced_at":"2026-05-07T03:35:37.884Z","repository":{"id":48628626,"uuid":"516969434","full_name":"ejw-data/etl-election-reporting","owner":"ejw-data","description":"Used pandas to resolve discrepencies in real-time election poll results and moved data to postgres. ","archived":false,"fork":false,"pushed_at":"2022-07-27T17:03:33.000Z","size":2092,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-15T16:44:33.961Z","etag":null,"topics":["common-table-expression","pandas","postgresql","python","regular-expressions","sql"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ejw-data.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-07-23T05:43:39.000Z","updated_at":"2022-07-28T04:52:54.000Z","dependencies_parsed_at":"2022-09-02T12:20:31.569Z","dependency_job_id":null,"html_url":"https://github.com/ejw-data/etl-election-reporting","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ejw-data/etl-election-reporting","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ejw-data%2Fetl-election-reporting","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ejw-data%2Fetl-election-reporting/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ejw-data%2Fetl-election-reporting/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ejw-data%2Fetl-electio
n-reporting/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ejw-data","download_url":"https://codeload.github.com/ejw-data/etl-election-reporting/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ejw-data%2Fetl-election-reporting/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263421921,"owners_count":23464048,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["common-table-expression","pandas","postgresql","python","regular-expressions","sql"],"created_at":"2024-11-21T13:54:33.893Z","updated_at":"2026-05-07T03:35:37.841Z","avatar_url":"https://github.com/ejw-data.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# etl-election-reporting  \n\nAuthor:  Erin James Wills, ejw.data@gmail.com\n\n![election polling project banner](./images/election-polling-etl.png)\n\u003ccite\u003ePhoto by \u003ca href=\"https://unsplash.com/@eagleboobs?utm_source=unsplash\u0026utm_medium=referral\u0026utm_content=creditCopyText\"\u003eElliott Stallion\u003c/a\u003e on \u003ca href=\"https://unsplash.com/s/photos/election?utm_source=unsplash\u0026utm_medium=referral\u0026utm_content=creditCopyText\"\u003eUnsplash\u003c/a\u003e\u003c/cite\u003e\n\n\u003cbr\u003e\n\n## Overview  \n\u003chr\u003e\n\n\u003eUsed pandas to resolve discrepencies in real-time election poll results and moved data to postgres.   
\n\nI have always found election poll results to be interesting since there are always unexpected results despite the number of polls performed prior to the election.  When looking at data that was published by the Associated Press, I noticed that the data released as results came in during election night was much messier than expected and often had obviously incorrect values (or values I initially presumed to be incorrect).   \n\nThe data came from Associated Press releases, which drew their information from the state election boards, and was distributed through the New York Times API.  There is already a nice GitHub repo that has the 2020 presidential election results.  Some of the reasons for the odd data are that:  \n*  the information was not typed in correctly to the state updates that come from websites and Twitter.  \n*  the new update combines data from a previous update but people are unaware.\n*  the new update corrects the numbers from the previous update and adds or subtracts votes.  Yes, there are times when the votes coming in to the API could be negative.  \n*  there are probably more reasons but that is a separate analysis \n\u003e Remember that none of this information is official on election night and usually several days go by before an official number is published.  \n\nThe goals are:\n1.  create a Jupyter notebook to identify records that do not make sense and see if merging these results with other rows will resolve the inconsistency.  Essentially, I am reducing the number of poll reports (batches) so the errors do not have an effect.  \n2.  generate a dataset that is consistent \n3.  store the data in a PostgreSQL database and develop queries that could be used for generating a report. \n4.  develop graphics and predictions based on the data that are time dependent (as if the results were being streamed and constantly updating).  \n\n**Note:**  I mostly want realistic data for parts 3 and 4.  
The first goal is to resolve most of the issues to have a believable dataset.  The last goal (#4) is not a high priority right now.  Goals 1 through 3 are to demonstrate the ETL process.\n\n\u003cbr\u003e\n\n## Technologies  \n*  Python\n*  PostgreSQL\n\n\u003cbr\u003e  \n\n## Data Source  \n\nThe dataset was obtained from:  \n*  [https://alex.github.io/nyt-2020-election-scraper/all-state-changes.html](https://alex.github.io/nyt-2020-election-scraper/all-state-changes.html)  \n\n\u003cbr\u003e\n\n**Note**:  The site data may be overwritten during each election.  The pickle file contains the data from the 2020 election and was captured in 2022 after no new results were being published.  \n\n\u003cbr\u003e\n\n## Analysis  \n\nOverall the first goal (resolve data errors) was accomplished.  The inconsistencies in the data were removed and the method used to merge records created consistent data without distorting the actual results; merging cost some granularity where records were combined, but most of the data keeps its original granularity.  \n\nTo evaluate the quality of the data for the second goal (show consistency across the dataset) and third goal (add content to database), the database query results were compared to the official results.  The data from the database agreed closely with the official results.  In the table below, the national totals differ by less than 0.1% and the Alabama figures match exactly.  \n\nThe fourth goal (create time-dependent graphics) has not been started but the data preparation for it is complete.  
\n\n| Query         |  Query Result         |   Official Result  |\n| ------------- | --------------------- | ------------------ |\n| Total Votes Nationally | 155,505,907  | 155,485,078        |\n| Total Votes Biden      | 81,243,830   | 81,268,924         |\n| Total Votes Trump      | 74,262,077   | 74,216,154         |\n| Total Votes Alabama Biden |  849,624  | 849,624            |\n| Total Votes Alabama Trump | 1,441,170 | 1,441,170          |\n| Final Vote Margin Alabama | 591,546   | 591,546            |\n\n\u003cbr\u003e\n\n## Results (examples)\n\nBelow are some details about the data process; full details can be found in the commented Jupyter notebook and SQL files.  \n\u003cbr\u003e\n\n### Notebook \n\nThe data comes in a very complicated structure that needs to be broken down into separate columns.  Below is an example of what one state's data looks like initially:\n \n![Original Data](./images/scraped_data.png)  \n\u003ccite\u003eFig 1. Original Data\u003c/cite\u003e  \n\n\u003cbr\u003e\n\nAfter organizing the data and using regular expressions to extract the values from the text, the data forms two data frames.\n![State Summary](./images/extracted_state_summary.png)  \n\u003ccite\u003eFig 2. State Summary Information\u003c/cite\u003e  \n\n\u003cbr\u003e\n\n![State Batch Records](./images/extracted_state_batch_data.png)\n\u003ccite\u003eFig 3. State Batch Records\u003c/cite\u003e \n\n\u003cbr\u003e\n\nSome of the data manipulations include:\n*  Removing zeros from the 'Change' column\n*  Removing values set as 'Unknown' from the 'Change' column\n*  Calculating the 'Change' column for values that say 'Unknown' or zero when there is adequate information about the record\n*  Removing records where the margin between batches is smaller than the batch size\n\nIn the end, the following table was generated.  \n![Developed Dataframe](./images/accum_votes_added.png)\n\u003ccite\u003eFig 4. 
Final Dataframe\u003c/cite\u003e  \n\n\u003cbr\u003e\n\nThe last part of the notebook sets up a final dataframe that is in a good format for a relational database.  The format is just a simple table with 4 columns: time of batch records, Trump votes, Biden votes, and state.  In the future I would love to add the district and maybe assign some metadata to that district so I can have more specific data extraction.  Right now I do not have the district or type information, which is a bit disappointing since macro-scale data is not as interesting when doing an analysis.  All other summaries can be derived from this simple data structure.  \n\n### SQL  \n\nThe data from the notebook was formatted to show the votes for Biden and Trump as separate records with the philosophy that all future tables could be made from this simple data structure.  Below is an image of what the raw data in the table looks like.  `District` and `Type` (district metadata) are columns I would *love* to add to the data but do not have a source. \n\n![Vote_counts table](./images/vote_counts_table.png)\n\u003ccite\u003eFig 5. Table (vote_counts) with Imported Data\u003c/cite\u003e  \n\n\u003cbr\u003e\n\nWith SQL, the following are examples of queries written to extract data from the `vote_counts` table.  \n\nOne table was generated from a query because this form of the data is needed so often.  This table ('margin_info') has the following columns:  'batch_id', 'datetime', 'state', 'biden_votes', 'trump_votes', 'batch_margin'.  The main difference in this table compared to the original table is that the data has been pivoted: instead of having a row of Biden data and a row of Trump data for the same time period, those values have been put into their own columns and the margin between them has also been calculated.  \n\n![Margin info table](./images/margin_info_table.png)  \n\u003ccite\u003eFig 6. 
Derived `margin_info` table\u003c/cite\u003e  \n\n\u003cbr\u003e  \n\n**Note**:  This table creation may seem trivial but I created this structure since I believe it is a very common process.  It makes sense to collect incoming records as rows in a table so that each new record can be logged as it arrives.  Periodically, the individual records then go through a processing step that summarizes them in a second table, using sorting, aggregation, or both to produce a more readable and usable data structure.   \n\nHere are some of the key queries:  \n  *  Accumulated values and percents for a specific state\n  *  Vote totals for the election as a whole, for Biden, for Trump\n  *  Votes for each candidate for a specific state\n  *  Votes for a candidate grouped by state\n  *  Final vote margins for a specific state\n  *  Total votes per state\n  *  Percent of total votes cast per state\n  *  Filter records by time period\n  *  Modify records via a calculation (not normally used but for practice)  \n\n![Accumulated Votes](./images/accumulated_votes_margin_info_table.png)\n\u003ccite\u003eFig 7. Accumulated Values By State (Alabama)\u003c/cite\u003e  \n\n\u003cbr\u003e\n\n![Total Vote Percent](./images/state_percent_of_total_votes.png)  \n\u003ccite\u003eFig 8. Percent of Total Votes Cast\u003c/cite\u003e   \n\n\u003cbr\u003e\n\n## Python Setup and Installation  \n1. Environment needs the following:  \n    *  Python 3.6+  \n    *  numpy\n    *  pandas\n1. Activate your environment\n1. Clone the repo to your local machine\n1. Start Jupyter Notebook within the environment from the repo\n1. Run `election_data_extract.ipynb`  \n1. The above notebook writes the cleaned dataframe to `db_file.csv` inside the `data` folder.\n\n## PostgreSQL Setup and Installation\n1. Open pgAdmin and create a database named `uselections`\n1. The .sql files can be found in the `sql` folder\n1. 
Open a query tool and run the `create_vote_counts_table.sql` query to create the main table named `vote_counts`.\n1. Import the `db_file.csv` file into the `vote_counts` table.\n\n## PostgreSQL Generate Other Tables\n1. In another query tool, run `create_margin_info_table.sql` to generate the `margin_info` table, which manipulates and summarizes the `vote_counts` table.\n1. Running `create_margins_views.sql` is optional.  This query creates a view that is equivalent to the `margin_info` table.\n\n## PostgreSQL Queries\n1. Examples of queries can be found in `vote_count_queries.sql`, which has queries about the `vote_counts` table, and in `margin_info_queries.sql`, which has queries about the `margin_info` table.  \n\n**Note**:  The `margin_info` queries can be modified to use the `margins` view instead.  \n \n\n\u003cbr\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fejw-data%2Fetl-election-reporting","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fejw-data%2Fetl-election-reporting","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fejw-data%2Fetl-election-reporting/lists"}