https://github.com/sunlightpolicy/popular-data-sets


Host: GitHub
URL: https://github.com/sunlightpolicy/popular-data-sets
Owner: sunlightpolicy
License: gpl-3.0
Created: 2017-08-18T18:12:16.000Z (almost 8 years ago)
Default Branch: master
Last Pushed: 2017-09-12T16:45:53.000Z (over 7 years ago)
Last Synced: 2025-03-25T13:45:12.311Z (about 2 months ago)
Language: Jupyter Notebook
Size: 24.9 MB
Stars: 11
Watchers: 7
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# popular-data-sets

An analysis of the most popular open datasets on U.S. local and state data portals.

See our [blog post](https://sunlightfoundation.com/2017/09/11/whos-at-the-popular-table-our-analysis-found-which-open-data-the-public-likes/) for background info and a summary of results for this project.

Highlights of this repository:
- Our detailed methodology can be found in our Jupyter Notebook (["Socrata API Open Data Portal Analysis - Final Version.ipynb"](https://github.com/sunlightpolicy/popular-data-sets/blob/master/Socrata%20API%20Open%20Data%20Portal%20Analysis%20-%20Final%20Version.ipynb)), which contains both the code itself and explanations of the process and decisions that were made. If you want, you can use this to run your own analysis.
- The analysis generated a set of 52 dataset topics, each of which represents a cluster of related datasets. Using a popularity measure that is explained in the Jupyter Notebook, we ranked these dataset topics by popularity. The ranked list is in the table ["final_topic_ranks.csv"](https://github.com/sunlightpolicy/popular-data-sets/blob/master/final_topic_ranks.csv).
- Note that the "Topic Content" is the set of words that the clustering algorithm used to define the cluster of related datasets for a topic. The words were of decreasing importance going from left to right in the list. For example, in the second-highest ranked topic (ID number 3), "transportation" was the most important word for the cluster while "bike" was the least important.
- If you're wondering which individual datasets were classified into which topics, go to the ["topic_datasets" folder](https://github.com/sunlightpolicy/popular-data-sets/tree/master/topic_datasets), which has a list of all the datasets that were part of the cluster for each topic. Note that the individual tables are large and best viewed in a spreadsheet program like Excel.
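
If you'd rather read the ranked topics in code than in a spreadsheet, the sketch below shows one way to do it. It is only an illustration under assumptions: the column names `Rank` and `Topic ID`, the space-separated word format, and the `placeholder_*` words are all made up for the example and may not match the real "final_topic_ranks.csv". Only the "Topic Content" label and the "transportation"/"bike" ordering for topic 3 come from the notes above.

```python
import csv
import io

# Illustrative stand-in for final_topic_ranks.csv. The "Rank" and "Topic ID"
# headers, the word separator, and the placeholder_* words are assumptions,
# not the real file contents.
SAMPLE_CSV = """Rank,Topic ID,Topic Content
1,0,placeholder_a placeholder_b
2,3,transportation placeholder_c bike
"""


def parse_topic_words(topic_content):
    """Split a Topic Content string into words, most important first."""
    # Treat commas and whitespace alike, since the real separator is unknown.
    return topic_content.replace(",", " ").split()


def load_topics(csv_file):
    """Read rows from a file-like object and attach a parsed word list to each."""
    rows = []
    for row in csv.DictReader(csv_file):
        row["words"] = parse_topic_words(row["Topic Content"])
        rows.append(row)
    return rows


topics = load_topics(io.StringIO(SAMPLE_CSV))
second = topics[1]  # second-highest ranked topic in the sample
print(second["Topic ID"], second["words"][0], second["words"][-1])
```

To try this on the real table, pass an opened copy of the file to `load_topics` instead of the sample string, and adjust the column names to match the actual header row.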

Thanks for checking out this repo, and let us know if you have any questions by opening an issue or emailing [email protected].
