{"id":19246478,"url":"https://github.com/edoaltamura/election-predictions","last_synced_at":"2026-05-14T08:41:01.261Z","repository":{"id":192774974,"uuid":"683698908","full_name":"edoaltamura/election-predictions","owner":"edoaltamura","description":"A framework for analysing polling data and predicting election outcomes.","archived":false,"fork":false,"pushed_at":"2023-09-07T15:10:52.000Z","size":1395,"stargazers_count":2,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-01-05T04:43:12.399Z","etag":null,"topics":["data-engineering","data-science","data-visualization","feature-engineering","machine-learning","political-science","the-economist"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/edoaltamura.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-27T12:26:59.000Z","updated_at":"2024-06-14T23:32:24.000Z","dependencies_parsed_at":"2024-11-09T17:35:19.980Z","dependency_job_id":"84eaaddf-0240-4bb6-bc4d-8f826ded5b12","html_url":"https://github.com/edoaltamura/election-predictions","commit_stats":null,"previous_names":["edoaltamura/election-predictions"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edoaltamura%2Felection-predictions","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edoaltamura%2Felection-predictions/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/edoaltamura%2Felection-predictions/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edoaltamura%2Felection-predictions/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/edoaltamura","download_url":"https://codeload.github.com/edoaltamura/election-predictions/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240339489,"owners_count":19785956,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-engineering","data-science","data-visualization","feature-engineering","machine-learning","political-science","the-economist"],"created_at":"2024-11-09T17:32:18.852Z","updated_at":"2026-05-14T08:40:56.223Z","avatar_url":"https://github.com/edoaltamura.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Elections trends \u0026 forecast\n[![Python version](https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10%20%7C%203.11-blue.svg)](https://pypi.org/project/swiftzoom/)\n[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/edoaltamura/swiftzoom/blob/main/LICENSE.md)\n[![OpenSSF Best Practices](https://www.bestpractices.dev/projects/7801/badge)](https://www.bestpractices.dev/projects/7801)\n\nA Python framework for analysing current trends and predicting election outcomes from polling data.\n\n![Test image](reports/test.png)\n\n## Get the code\nYou can download the code by cloning:\n```commandline\ngit clone 
https://github.com/edoaltamura/election-predictions.git\n```\n\n## Run the pipeline\nA demo version of the pipeline can be run via the `main.py` file. On Unix-based systems (and subsystems), you can type\n```shell\npython3 main.py\n```\nOn Windows-based systems, you can type\n```commandline\npython main.py\n```\n#### Where are my outputs?\nThe project structure, based on `cookiecutter` and `kedro`, requires the final data products to be saved in `data/03_final/`, and the figures and insights to be saved in `reports/`.\n\nThe configuration of the project structure is managed, at a low level, by the `src/configuration.py` submodule.\n\n## Features and FAQs\n\n#### Some days have no polls, some have multiple\nFor each pollster, days without polls are interpolated if bracketed by days with polls. If different pollsters have provided answers on the same day, these data are considered separately when splitting by pollster. If the same pollster gave two answers on the same day (eg 'University of Bellville-sur-Mer'), then only the most recent data are considered.\n\n#### Some pollsters do not include line items for all candidates\nThe missing candidates are assigned a zero weighting in the weighted average calculation. If one pollster gave two answers on the same day, one of which contains the missing information, then the entries are joined (`SQL UNION OUTER`). A related question, which would be instrumental for probing pollster bias, is \"**Why** did a pollster not include line items for all candidates?\" \n\n#### Some pollsters will conduct multiple polls with 'hypotheticals' (eg what if this candidate dropped out?)\nThis case could manifest itself as the pollster excluding a candidate from the survey, and consequently not providing data for that candidate. To detect and account for this case, more information about the pollster's decisions is required. 
\n\n#### The order of candidates on the page may change\nThe schema assumes a data warehouse, where the order is consistent. Given the current information, it is not possible to reliably determine whether the order of candidates was swapped. An estimate could be made by detecting outliers from the trend; however, this technique cannot distinguish a swapped order from a sudden change in political opinion. \n\nFor a datalake-like schema, additional features can guarantee that the data warehouse layer is consistent. This feature may be introduced with bindings to [schema](https://github.com/keleshev/schema).\n\n#### Formatting may be inconsistent\nThis behaviour is detected and accounted for by the `DataEngineering.clean_data()` method. Lines which are badly formatted (eg missing `%`) are regularised to match the rest of the document. Informative messages are also printed to `sys.stdout`.\n\n#### Opinions can shift suddenly\nA rolling average (as in `data/03_final/trends.csv`) cannot capture variations that happen faster than the smoothing time-scale. For this reason, it is useful to consider the raw weighted average of the polling results without any smoothing, as reported in `data/03_final/polling_averages.csv`. This distribution is noisier, but is better at capturing quick variations in public opinion.\n\n#### A candidate might drop out or join the race late\nDrop-outs of up to two weeks are interpolated, provided that the candidate rejoins and their electoral campaign is still considered by pollsters.\nIf a candidate joins late or drops out early, we assign zero weight to the dates where pollsters provided no data about the candidate. If a candidate drops out with no intention of rejoining the race, then additional information is required (eg public announcement). In this case, the time-series can simply be truncated and the polling fractions re-normalised with the remaining candidates.  
\n\n#### There may be big gaps in the polling record (for example, around Christmas or for this country’s two-week public holiday in June)\nWithout additional information about opinion shifts during this time, we can assume that opinions remain stationary, or vary smoothly. In the current implementation, the polling values during gaps of up to two weeks are interpolated. In the future, we will integrate a public-holiday list that enables multi-week interpolation only for public holidays, and not, for instance, when a candidate drops out and joins again (unless it coincides with a public holiday). \n\n#### There may be significant data entry errors\nThe `DataEngineering` class can readily be extended with additional features that detect and resolve errors discovered during the data exploration phases. Some errors, such as logging, should be addressed in the future data lake schema, while programmatic cleaning should be implemented in the `src/data_engineering.py` submodule.\n\n#### Notes might be attached to specific polls or numbers\nThis behaviour is detected and accounted for by the `DataEngineering.clean_data()` method. Lines which contain special indicators (eg `*`) are flagged with a `bool` value. The boolean information is then stored in a new column for further processing.\n\n## Set up\n### By creating a virtual environment\nAssuming you have installed `venv` and `virtualenv` and have an up-to-date version of `pip` (see [instructions](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/)), you can follow this two-step guide to set up this Python library and get it up and running on your system. 
The election-prediction code was developed to be compatible across all platforms; however, we will assume the Windows-native `python` run command in what follows, rather than the Unix-native `python3`.\n- Create a virtual environment for this project, and activate it.\n```commandline\npython -m venv env\n.\\env\\Scripts\\activate\n```\n- Install the required packages.\n```commandline\npython -m pip install -r requirements.txt\n```\nNow the code should be ready to run within the virtual environment. To exit the virtual environment, you can run the command:\n```commandline\ndeactivate\n```\n\n### Via PyCharm\nUnder the menu __Git | VCS__, click on __Clone...__ and enter the GitHub link to this repository. PyCharm can automatically detect the `requirements.txt` file and prompt the installation of the packages within it. We recommend choosing to create a project-specific virtual environment.\n\n**Development:** If you plan to introduce new packages or libraries, we recommend using the __Tools__ menu \u003e __Sync Python Requirements__ feature to update the `requirements.txt` file programmatically.\n\n### Via `setup.py` (beta)\nThis feature is currently under testing, and we recommend setting up the repository via the methods above. \n\nAfter cloning the repository to your local machine, enter the `election-predictions` directory (`cd election-predictions` on both Windows and Unix systems) and run\n```commandline\npip install . 
\n```\n`pip` will use `setup.py` to install this module, without needing to call `setup.py` explicitly.\n\n\n**Note:** You should exclude your virtual environment directory from your version control system using `.gitignore` or similar.\n\n## Graphic design\n\nThe layout attempts to match the style of the plots in _The Economist_'s [Graphic detail](https://www.economist.com/graphic-detail) using `matplotlib` style sheets and dynamic aspect ratio scaling. Find out more in the Matplotlib stylesheet at `src/mplstyles/economist_xyplot.mplstyle` and the `src/visualisation.py` submodule.\n\nExample usage:\n```python\nimport pandas as pd\nimport matplotlib.pyplot as plt\nfrom src import PlotTimeSeries  # Import the PlotTimeSeries class\n\n# Create a sample DataFrame (replace this with your own data)\ndata = {\n    'Date': pd.date_range(start='2023-01-01', periods=365),\n    'Bulstrode': [0.1 * i for i in range(365)],\n    'Lydgate': [0.15 * i for i in range(365)],\n    'Vincy': [0.08 * i for i in range(365)],\n}\n\ndf = pd.DataFrame(data)\n\n# Create an instance of PlotTimeSeries\nplt_ts = PlotTimeSeries()\n\n# Get the figure and axes objects\nfig, ax = plt_ts.get_panels()\n\n# Plot the data for each candidate\nfor candidate in ['Bulstrode', 'Lydgate', 'Vincy']:\n    ax.plot(df['Date'], df[candidate], label=candidate)\n\n# Set titles and labels\nplt_ts.set_title('Candidate Polling Trends', subtitle='Fraction of candidate polling, %/100')\nplt_ts.set_source('Dataland political archive 2023', pad=0.15)\n\n# Add a legend\nax.legend(fontsize=10, loc='upper left', frameon=True, framealpha=0.85)\n\n# Save the plot\nplt_ts.savefig('my_awesome_economist_figure.png', dpi=300)\n\n# 
Show the plot (if you want to display it)\nplt.show()\n```\n\n## Expected output from `main.py`\n```commandline\n$ python main.py\n\n⏱ | Calling load_from_url()\n⏱ | Done: load_from_url() took 0.3231 sec\nFound 22 badly formatted rows in column 'Bulstrode':\nFound 22 badly formatted rows in column 'Lydgate':\nFound 90 badly formatted rows in column 'Chettam':\nFound 51 badly formatted rows in column 'Vincy':\nFound 38 badly formatted rows in column 'Casaubon':\nFound 22 badly formatted rows in column 'Others':\nSome pollsters have given multiple responses:\n\t['University of Bellville-sur-Mer']\nConsidering only the most recent information:\n\t[Timestamp('2024-03-22 00:00:00'), Timestamp('2024-02-23 00:00:00'), Timestamp('2024-01-26 00:00:00')]\nDropping rows: {first.index}\n\nA glimpse of the clean data:\n         Date          Pollster  ...  Others  Excludes overseas candidates\n0 2023-10-12  Bardi University  ...   0.171                         False\n1 2023-10-18  Bardi University  ...   0.078                         False\n2 2023-10-24  Bardi University  ...   0.074                         False\n3 2023-10-30  Bardi University  ...   0.083                         False\n4 2023-11-05  Bardi University  ...   0.155                         False\n5 2023-11-11  Bardi University  ...   0.112                         False\n6 2023-11-17  Bardi University  ...   0.099                         False\n7 2023-11-23  Bardi University  ...   0.091                         False\n8 2023-11-29  Bardi University  ...   0.141                         False\n9 2023-12-05  Bardi University  ...   0.115                         False\n\n[10 rows x 10 columns]\nSplitting 'Verity Insights': 100%|██████████| 11/11 [00:00\u003c00:00, 89.56it/s]\nCandidate average for 'Bulstrode':   0%|          | 0/6 [00:00\u003c?, ?it/s]\u003clocal-hot-directory\u003e\\election-predictions\\src\\data_science.py:284: RuntimeWarning: \nPollster 'Calvo Group' only gave 1 reports. 
This behaviour is accounted for in the weighted averages, but you should investigate this pollster's data further.\n  warn(\nCandidate average for 'Lydgate':  17%|█▋        | 1/6 [00:00\u003c00:03,  1.36it/s]  \u003clocal-hot-directory\u003e\\election-predictions\\src\\data_science.py:284: RuntimeWarning: \nPollster 'Calvo Group' only gave 1 reports. This behaviour is accounted for in the weighted averages, but you should investigate this pollster's data further.\n  warn(\nCandidate average for 'Vincy':  17%|█▋        | 1/6 [00:00\u003c00:03,  1.36it/s]  \u003clocal-hot-directory\u003e\\election-predictions\\src\\data_science.py:284: RuntimeWarning: \nPollster 'Calvo Group' only gave 1 reports. This behaviour is accounted for in the weighted averages, but you should investigate this pollster's data further.\n  warn(\nCandidate average for 'Casaubon':  50%|█████     | 3/6 [00:00\u003c00:00,  3.97it/s]\u003clocal-hot-directory\u003e\\election-predictions\\src\\data_science.py:284: RuntimeWarning: \nPollster 'Calvo Group' only gave 1 reports. This behaviour is accounted for in the weighted averages, but you should investigate this pollster's data further.\n  warn(\nCandidate average for 'Chettam':  50%|█████     | 3/6 [00:00\u003c00:00,  3.97it/s] \u003clocal-hot-directory\u003e\\election-predictions\\src\\data_science.py:284: RuntimeWarning: \nPollster 'Calvo Group' only gave 1 reports. This behaviour is accounted for in the weighted averages, but you should investigate this pollster's data further.\n  warn(\nCandidate average for 'Others':  83%|████████▎ | 5/6 [00:01\u003c00:00,  5.86it/s] \u003clocal-hot-directory\u003e\\election-predictions\\src\\data_science.py:284: RuntimeWarning: \nPollster 'Calvo Group' only gave 1 reports. 
This behaviour is accounted for in the weighted averages, but you should investigate this pollster's data further.\n  warn(\nCandidate average for 'Others': 100%|██████████| 6/6 [00:01\u003c00:00,  5.14it/s]\nWriting dataset 'polling_averages.csv' to: \u003e \u003clocal-hot-directory\u003e\\election-predictions\\data\\03_final\\polling_averages.csv\nWriting dataset 'trends.csv' to: \u003e \u003clocal-hot-directory\u003e\\election-predictions\\data\\03_final\\trends.csv\nFigure test.png saved in reports directory.\n```\n\n## Report a bug or request a new feature\nTo report bugs or request new features, open an issue in the GitHub Issues.\n\n## Contribute to this repository\nWe would love to receive your contributions! Fork this repository on GitHub, develop the features you would like to see implemented, and then submit a pull request. \n\n## Cite this software\nYou can print an up-to-date `bibtex` citation handle via:\n```python\nfrom src import __cite__\n\nprint( __cite__ )\n```\nThis code dynamically retrieves the version of the code being used and the date of the latest update, as given by the latest commit in the Git history.\n\nA template of the citation handle is illustrated below:\n```text\n@software{altamura_elections\n          author = {{Altamura}, Edoardo},\n          title = {\"An statistical machine learning framework for election predictions\"}\n          url = {https://github.com/edoaltamura/election-predictions}\n          version = {__version__}\n          date = {__date_last_update__}\n}\n```\n\n## Licence\n```text\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the 
following conditions:\n\nThe above copyright notice and this permission notice shall be included in\nall copies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN\nTHE SOFTWARE.\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fedoaltamura%2Felection-predictions","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fedoaltamura%2Felection-predictions","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fedoaltamura%2Felection-predictions/lists"}