{"id":20714714,"url":"https://github.com/justingosses/repo_data_experiment","last_synced_at":"2026-04-21T11:04:46.862Z","repository":{"id":218253965,"uuid":"745941758","full_name":"JustinGOSSES/repo_data_experiment","owner":"JustinGOSSES","description":"An experiment for grabbing repository data ","archived":false,"fork":false,"pushed_at":"2024-06-18T22:51:05.000Z","size":10636,"stargazers_count":2,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-17T21:43:36.845Z","etag":null,"topics":["houstondataviz","metadata","repo-metadata"],"latest_commit_sha":null,"homepage":"https://justingosses.github.io/repo_data_experiment/framework/dist/","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc-by-sa-4.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JustinGOSSES.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-20T16:06:00.000Z","updated_at":"2024-03-16T14:47:09.000Z","dependencies_parsed_at":"2024-01-20T19:21:29.254Z","dependency_job_id":"c42f5c73-6022-42cc-8a4c-9564882ce002","html_url":"https://github.com/JustinGOSSES/repo_data_experiment","commit_stats":null,"previous_names":["justingosses/repo_data_experiment"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JustinGOSSES%2Frepo_data_experiment","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JustinGOSSES%2Frepo_data_experiment/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JustinGOSSES%2Frepo_data_experimen
t/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JustinGOSSES%2Frepo_data_experiment/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JustinGOSSES","download_url":"https://codeload.github.com/JustinGOSSES/repo_data_experiment/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242988242,"owners_count":20217537,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["houstondataviz","metadata","repo-metadata"],"created_at":"2024-11-17T02:33:46.355Z","updated_at":"2025-12-16T10:05:37.813Z","avatar_url":"https://github.com/JustinGOSSES.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# repo_data_experiment\nThis repository was an experiment for grabbing metadata about open source code \nrepositories across entire GitHub organizations with the idea that the data could be \nused as the dataset of a Houston Data Visualization meetup Saturday data jam, which \nit was on Saturday February 17th, 2024. \n\n## Structure of this repository\nAt a high level this repository is broken into 3 parts: \n1. The `data` directory holds the collected data. \n2. The `src` directory holds the python code used to harvest the data via the Ecosyste.ms API.\n3. The `framework` directory holds a quick experiment using the new Observable Framework library to create a static page that visualizes the data briefly. 
\n\nThere's also an index.html at the top level for quickly inspecting the data CSVs.\n\n## Data\n\n*The datasets can potentially be used in a Houston Data Jam.*\n\n### Example datasets\n#### data/nasa_repos.json\n\nThis data file was created by grabbing all the repositories in the NASA organization on GitHub.com \nthat Ecosyste.ms has data on. That is approximately 270 repositories, so not every repository. The ones with \nlow engagement are probably the skipped ones. \n\n#### data/nasa_repos_flat.csv \n\nThis is the same data as in `data/nasa_repos.json` but flattened into a CSV using the `flattenJSON()` function in `src/main.py`.\n\nThe CSV can be viewed as a formatted table on github.com at this direct link: https://github.com/JustinGOSSES/repo_data_experiment/blob/main/data/nasa_repos_flat.csv\n\n#### combined_org_data/all_orgs_merged_20240120.csv\n\nThe `all_orgs_merged_20240120.csv` file has 1111 repositories from the GitHub organizations \nnasa, CMSgov, airbnb, houstondataviz, home-assistant, NationalSecurityAgency.\n\nThese organizations were selected because they represent organizations with different histories or patterns of how they use GitHub \nfor open source. \n\n[The NASA GitHub organization](https://github.com/nasa) has a comparatively long history on GitHub for a \ngovernment organization. They also have a suspected stronger-than-usual pattern of \"publishing\" code that then \nquickly sees no further development, due to the culture of \n\"publishing\" papers, reports, etc. that exists in the organization. \n\n[NationalSecurityAgency](https://github.com/NationalSecurityAgency) is the GitHub organization of \nthe US government's National Security Agency or NSA. It has a narrower scope in the \ntypes of, and reasons for, open source and a less suspected tendency to drop repositories without continued development. 
\n\n[Home-assistant is an extremely popular open source home automation](https://github.com/home-assistant) collection of products and tools with an expected extremely diverse and large contributor community. \nThe GitHub organization ~ the product ~ the people organization. \n\n[AirBnB is a tech company](https://github.com/airbnb) founded as a digital-first company. They also have an engineering blog and a record of contributing \nopen source that is, in some cases, used by others. \n\n[houstondataviz is the GitHub organization](https://github.com/houstondatavis) used by the Houston DataViz Meetup. Most of the use is associated with \nbrief one-time-only data jams as opposed to being repositories of products, tools, packages, websites, etc. \n\n[CMSgov is the GitHub Organization](https://github.com/CMSgov) of the Centers for Medicare \u0026 Medicaid Services. \nIt is suspected not to have as long a history on GitHub as NASA, with more of a focus on \nactual products and services run by the organization and with GitHub being used in part as a way to make it possible for others \nto use, build upon, and contribute to the code bases. \n\nSee the combined CSV as an HTML table here: https://justingosses.github.io/repo_data_experiment/\n\n## Why repository metadata?\n\nThis work is motivated by the idea that a lot of understanding of open source \npresence and activity is prevented by the need to manually read so many repositories. \n\n### Context\n\nThere are times when it is useful to be able to generate high level descriptions of the types of \nrepositories in an organization. This can be useful to compare the types of open source an \norganization releases. 
It can also be useful for the organizations as it helps to identify \nrepositories that are highly used, build packages, are primarily samples, or any of a variety of \nother \"repository types\" that otherwise require a person to manually read the repository to figure out\nwhat is there, a task that isn't possible with hundreds or thousands of repositories. \n\n### Purpose of this experiment\n\nThe purpose of this repository is to test out the functionality and performance of \nusing the Ecosyste.ms API for gathering repository metrics on all the repositories in an \norganization. \n\nIn past efforts to do this, I have used the GitHub API to gather data on an entire organization\nas seen in https://github.com/JustinGOSSES/awesome-list-visual-explorer-template/\nbut for large organizations with hundreds of repositories it would take dozens of minutes to \ngather all the data. \n\n## Early results so far...\n\n#### Ecosyste.ms API does not have data on all repositories in an organization\n\nEcosyste.ms [has 270 repositories](https://repos.ecosyste.ms/hosts/GitHub/owners/nasa) while the \n[number of repositories in the GitHub organization is 504.](https://github.com/orgs/nasa/repositories)\n\nIt seems likely that the repositories captured in Ecosyste.ms are limited to those that are\nmore active or used, in terms of being the source for a package, stars, forks, etc., which makes sense as Ecosyste.ms\nmight be trying to ignore the repositories without engagement that make up the bulk of the repositories on GitHub. \n\n#### Speed of getting data for Ecosyste.ms is far better than GitHub API past experience\n\nGathering basic repository metrics for the 270 repositories that Ecosyste.ms has took a couple seconds. 
\nPrevious runs of the scripts at https://github.com/JustinGOSSES/awesome-list-visual-explorer-template/\ntook dozens of minutes.\n\n### Do the Ecosyste.ms repository API results have the right data to construct repository cohorts?\n\nRepository cohorts are a concept that forms the basis of a talk that has been submitted to the \nOpen Source Summit North America Conference. \n\nIt refers to the idea that it can be advantageous to have pre-calculated cohorts of repositories identified \nbased on threshold boundaries across key data dimensions. \n\n#### Repository cohort categories \n\nThese are potential thresholds you might use to create categorical data from continuous data. \n\n##### Age\n- Age: [YES, CAN CREATE WITH ECOSYSTE.MS]\n  - baby: 0-30 days\n  - toddler: 31-90 days\n  - teen: 91-365 days\n  - adult: 366-1095 days\n  - senior: \u003e1095 days\n\n##### Activity\n\n- Last update in days: [YES, CAN CREATE WITH ECOSYSTE.MS]\n  - past 7 days\n  - past 8-30 days\n  - past 31-90 days\n  - past 91-365 days\n  - past 366-730 days\n  - past 731+ days\n\n- Number of commits in past 90 days: [NOT WITH FIRST SET OF API RESULTS?????? MAYBE REPO API ENDPOINT]\n  - 0\n  - 1-10\n  - 11-200\n  - 201+\n\n##### Community\n\n- Size of contributor community [NOT WITH FIRST SET OF API RESULTS?????? MAYBE REPO API ENDPOINT]\n  - 1\n  - 2-4\n  - 5-10\n  - 11-75\n  - 76+\n\n- External vs. 
internal contributors (probably not possible in this context)\n\n- Eghbal-type community types: [YES, CAN CREATE WITH ECOSYSTE.MS]\n  - toys (small size and low ratio of contributors to watches/stars)\n  - clubs (small size and high ratio of contributors to watches/stars)\n  - federation (large size and high ratio of contributors to watches/stars)\n  - stadium (large size and low ratio of contributors to watches/stars)\n\n##### Content\n\n- GitHub Actions [YES, CAN CREATE WITH ECOSYSTE.MS]\n  - True\n  - False\n\n- Samples [YES, with more work]\n  - True (based on seeing words like 'sample', 'demo', 'example' in org name or repo name)\n  - False\n\n\n### Potential questions for Houston Data Viz Meetup Data Jam\n1. How would you quickly summarize each of these organizations' open source presence?\n2. What repositories are most impactful for each organization? What metrics could you pick for 'impact'?\n3. What are dimensions you might use to group similar organizations across GitHub? For example, how would you find all the organizations that are apparently trying to do similar things with their open source presence as the National Security Agency?\n4. What organization is most similar to CMSgov and why?\n5. Make a visualization that summarizes for management the organization's open source presence in order to give them a quick overview of the ways the GitHub organization is used and its benefits for individuals and the organization.\n\nOr whatever you want to answer or try.\n\n\n## Installation of Python virtual environment.\n\nClone repository: \n\n1. Run in terminal `git clone https://github.com/JustinGOSSES/repo_data_experiment.git`\n2. `cd repo_data_experiment`\n\nOnly basic python packages are used (pandas, requests, etc.) so your existing base environment might be fine.\nHowever, best practice is to use virtual environments.\n\n### Using conda\n\n1. Create a new conda environment:\n    ```shell\n    conda create --name myenv\n    ```\n\n2. 
Activate the environment:\n    ```shell\n    conda activate myenv\n    ```\n\n3. Install the required packages from the requirements.txt file:\n    ```shell\n    conda install --file requirements.txt\n    ```\n\n### Using virtualenv\n\n1. Create a new virtual environment:\n    ```shell\n    python -m venv myenv\n    ```\n\n2. Activate the environment:\n    - On Windows:\n      ```shell\n      myenv\\Scripts\\activate\n      ```\n    - On macOS and Linux:\n      ```shell\n      source myenv/bin/activate\n      ```\n\n3. Install the required packages from the requirements.txt file:\n    ```shell\n    pip install -r requirements.txt\n    ```\n\n## Usage (getting more data and processing data)\n\n### Getting data from another GitHub organization on the subset of repositories that the Ecosyste.ms API has data on\n\nIn a terminal, call the function like this, replacing the string after --orgName, in this case `houstondatavis`.\n\n`python src/main.py --orgName houstondatavis --function call_api`\n\n### Flattening the JSON that is returned in the last step into a flat CSV to make it easier to work with the data\nIn a terminal, call the function like this, replacing the strings after --inputFilePath and after --outputFilePath.\n\n`python src/main.py --inputFilePath data/houstondatavis_repos.json --outputFilePath data/houstondatavis_repos_flat.csv  --function flattenJSON`\n\n### Creating a combined CSV file of all the org specific CSV files.\n\n`python src/main.py --folderPathToLookForCSVsToMerge data --outputFilePath data/combined_org_data/all_orgs_merged_20240120.csv  --function mergeMultipleOrgCSV`\n\nSee the `src/main.py` file for how this all works.\n\n### Quickly checking out the data visually \nThere is a top-level `index.html` page which, when served and viewed in a browser or as a GitHub Pages page,\nmakes it easy to see all the columns and the amount of empty cells. 
\n\nI have the node.js program `http-server` [installed globally](https://www.npmjs.com/package/http-server) \nso I start up a local server like `http-server` and then \nnavigate to `http://127.0.0.1:8080/` in a browser. A python option that does the same thing \nis [http.server](https://docs.python.org/3/library/http.server.html)\nThe GitHub pages URL is [https://justingosses.github.io/repo_data_experiment/](https://justingosses.github.io/repo_data_experiment/)\n\n## Explore this dataset on Observable.hq with SQL\n\nhttps://observablehq.com/@justingosses/analyzing-repositories-by-their-metadata-with-sql","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjustingosses%2Frepo_data_experiment","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjustingosses%2Frepo_data_experiment","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjustingosses%2Frepo_data_experiment/lists"}