{"id":26719007,"url":"https://github.com/psenger/github-scraper","last_synced_at":"2025-10-11T08:17:45.703Z","repository":{"id":147705954,"uuid":"344347160","full_name":"psenger/github-scraper","owner":"psenger","description":"github-scraper used to scan repos owned by an org, clone them locally, look for a Dockerfile, and extract the FROM into a nice CSV for management","archived":false,"fork":false,"pushed_at":"2025-02-18T19:36:02.000Z","size":30,"stargazers_count":0,"open_issues_count":4,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-27T17:51:57.743Z","etag":null,"topics":["dockerfile","extract","github","github-api","github-scraper"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/psenger.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-03-04T04:16:19.000Z","updated_at":"2021-03-31T07:52:19.000Z","dependencies_parsed_at":"2023-05-27T04:45:26.263Z","dependency_job_id":null,"html_url":"https://github.com/psenger/github-scraper","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/psenger/github-scraper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/psenger%2Fgithub-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/psenger%2Fgithub-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/psenger%2Fgithub-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/psenger%2Fgithub-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/psenger","download_url":"https://codeload.github.com/psenger/github-scraper/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/psenger%2Fgithub-scraper/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279006759,"owners_count":26084148,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-11T02:00:06.511Z","response_time":55,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dockerfile","extract","github","github-api","github-scraper"],"created_at":"2025-03-27T17:51:05.137Z","updated_at":"2025-10-11T08:17:45.694Z","avatar_url":"https://github.com/psenger.png","language":"JavaScript","readme":"# github-scraper\n\n- [github-scraper](#github-scraper)\n    * [Purpose](#purpose)\n    * [Running](#running)\n    * [Additional Docs](#additional-docs)\n    * [Variables](#variables)\n    * [Todo](#todo)\n  \n## Purpose\n\ngithub-scraper is used to scan repos owned by an org, clone them locally, look for a Dockerfile,\nextract the `FROM (build)` value into a nice CSV for management to use in its reports, or to find\na container that is running at the wrong version without asking the Dev Ops guys to do it.\n\n| Script                | Purpose                                             
                                                                                                                                                          |\n|---------------------- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| `scraper.js`          | Pulls all the repo data belonging to the org ( as defined by type ) and stores the data in a file `./data/\u003cGITHUB-OUTFILE\u003e`. This file drives everything else.                                                |\n| `build-masterlist.js` | This just reads `./data/\u003cGITHUB-OUTFILE\u003e` and builds a CSV file `./data/\u003cGITHUB-CSVFILE\u003e`                                                                                                                     |\n| `build-inventory.js`  | Removes the directory `./out/` which will be the clone directory, once cloned, scans all files for a `Dockerfile`, reads them, and extracts `^FROM\\s+(.*)\\s*$` to a report called `./data/\u003cGITHUB-INVENTORY\u003e` |\n\n## Running\n\n**Required**\n\n* `A good internet connection`\n* `Node 15`\n\n**Steps**\n\n1. from the command prompt run `npm install`\n2. create a *[`.env`](https://github.com/motdotla/dotenv#readme)* file with the environment variables listed in [Variables](#variables)\n3. from the command prompt run `npm run scraper` (this writes the data file that the other scripts read)\n4. from the command prompt run `npm run build-masterlist`\n5. from the command prompt run `npm run build-inventory`\n6. 
send your report to your boss, and then drink some coffee or reach out to me, Philip A Senger \u003cphilip.a.senger@cngrgroup.com\u003e, for a job.\n\n## Additional Docs\n\nRefer to [OctoKit](https://octokit.github.io/rest.js/v18) for the GitHub API.\n\nRefer to [dotenv](https://github.com/motdotla/dotenv#readme) for a better understanding of `.env` files.\n\nRefer to [GitHub Guides](https://guides.github.com/) for GitHub.\n\nRefer to [Docker Docs](https://docs.docker.com/) for Docker.\n\n## Variables\n\nThis project uses `.env`\n\n| Variable          \t| Required \t| Default             \t| Purpose                                                                                                                                                                                                                                                                                              \t|\n|-------------------\t|----------\t|---------------------\t|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\t|\n| GITHUB-PAL-TOKEN  \t| true     \t|                     \t| Personal access token ([create](https://github.com/settings/tokens/new))                                                                                                                                                                                                                             \t|\n| GITHUB-TIMEZONE   \t| true     \t|                     \t| The time zone ([list](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones))                                                                                                                                                                                                                 \t|\n| GITHUB-ORG        \t| true  
   \t|                     \t| The org to scan in the repos                                                                                                                                                                                                                                                                         \t|\n| GITHUB-TYPE       \t| true     \t|                     \t| Specifies the types of repositories you want returned. Can be one of all, public, private, forks, sources, member, internal. Default: all. If your organization is associated with an enterprise account using GitHub Enterprise Cloud or GitHub Enterprise Server 2.20+, type can also be internal. \t|\n| GITHUB-CSVFILE    \t| false    \t| ./data/data.csv     \t| Builds a CSV master list file ( when build-masterlist is executed )                                                                                                                                                                                                                                  \t|\n| GITHUB-OUTFILE    \t| false    \t| ./data/data.json    \t| Output from the scraper command, a full listing from github.                                                                                                                                                                                                                                         \t|\n| GITHUB-INVENTORY  \t| false    \t| ./data/inventory.csv \t| the results of scanning files in github ( in this repo it is the Dockerfile FROM command )                                                                                                                                                                                                           \t|\n| GITHUB-SKIP-NAMES \t| false    \t| ''                  \t| any repos you want to skip while building the inventory.                                                                                                                     
                                                                                                                        \t|\n\n## Todo\n\n* The environment variables and expected chaining of data files is problematic.\n* Might be nice to scan for repos owned by owners and/or orgs.\n* I think extracting the shell commands would be good, so you can make the code more reusable.\n* The naming convention is not so good.\n* Linting and tests would be good.\n* Update `build-masterlist` to use the csv module and extract fields to environment variables.\n* Change `GITHUB-ORG` so it is defaulted to `all`.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpsenger%2Fgithub-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpsenger%2Fgithub-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpsenger%2Fgithub-scraper/lists"}