{"id":18941521,"url":"https://github.com/mchmarny/github-bigquery-exporter","last_synced_at":"2025-09-05T09:32:25.238Z","repository":{"id":77051448,"uuid":"155479523","full_name":"mchmarny/github-bigquery-exporter","owner":"mchmarny","description":"GitHub BigQuery Export Utility","archived":false,"fork":false,"pushed_at":"2019-05-29T00:01:34.000Z","size":14,"stargazers_count":4,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-11-08T12:38:35.437Z","etag":null,"topics":["bigdata","data","gcp","github-pages","sql"],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mchmarny.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-10-31T01:24:18.000Z","updated_at":"2024-01-16T17:05:23.000Z","dependencies_parsed_at":null,"dependency_job_id":"6447b680-0927-4b3d-b60b-28fc953bb20a","html_url":"https://github.com/mchmarny/github-bigquery-exporter","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mchmarny%2Fgithub-bigquery-exporter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mchmarny%2Fgithub-bigquery-exporter/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mchmarny%2Fgithub-bigquery-exporter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mchmarny%2Fgithub-bigquery-exporter/manifests","owner_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mchmarny","download_url":"https://codeload.github.com/mchmarny/github-bigquery-exporter/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":232034691,"owners_count":18463356,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigdata","data","gcp","github-pages","sql"],"created_at":"2024-11-08T12:28:23.182Z","updated_at":"2024-12-31T22:43:07.784Z","avatar_url":"https://github.com/mchmarny.png","language":"Shell","readme":"# github-bigquery-exporter\n\nGitHub BigQuery export utility, for those times when more granular PR and Issue queries are required. This is also a good way to query data for periods longer than the GitHub maximum of `30` days.\n\n## Requirements\n\n### token\n\nYou can run the export script without a GitHub API token, but you will be subject to much stricter rate limits. To avoid this (important for larger organizations), get a personal API token by following [these instructions](https://blog.github.com/2013-05-16-personal-api-tokens/) and define it in a `GITHUB_ACCESS_TOKEN` environment variable:\n\n```shell\nexport GITHUB_ACCESS_TOKEN=\"your long token string goes in here\"\n```\n\n\u003e Remember, you have to be an org admin for this to work\n\n### json2csv\n\nThe GitHub API returns data in JSON format. The simplest way to import the desired data elements is to convert them to CSV using `json2csv`, a Node.js utility:\n\n```shell\nnpm install -g json2csv\n```\n\n## Configuration\n\nTo configure the export script, define the `organization` and provide a list of `repositories` in that organization.\n\n```shell\ndeclare -r org=\"my-org-name\"\ndeclare -a repos=(\"my-repo-1\"\n                  \"my-repo-2\"\n                  \"my-repo-3\")\n```\n\nOptionally, to configure the import script you can edit the dataset name and the `issues` and `pulls` table names. This step is only required if you have naming conflicts in your BigQuery project.\n\n```shell\ndeclare -r ds=\"github\"\ndeclare -r issues_table=\"issues\"\ndeclare -r pulls_table=\"pulls\"\n```\n\n## Export\n\nTo execute the GitHub export script, run this command:\n\n```shell\n./export\n```\n\nThe expected output should look something like this:\n\n```shell\nDownloading issues for org/repo-1...\nDownloading prs for org/repo-1...\nDownloading issues for org/repo-2...\nDownloading prs for org/repo-2...\n```\n\n## Import\n\nTo execute the BigQuery import script, run this command:\n\n```shell\n./import\n```\n\nThe expected output should look something like this:\n\n```shell\nDataset 'project:github' successfully created.\nTable 'project:github.issues' successfully created.\nTable 'project:github.pulls' successfully created.\nWaiting on bqjob_..._1 ... (0s) Current status: DONE\nWaiting on bqjob_..._1 ... (0s) Current status: DONE\n```\n\n### Query\n\nWhen the above scripts complete successfully, you will be able to query the imported data using SQL in the BigQuery console. For example, to find the repositories with the most issues over the last 90 days:\n\n```sql\nselect\n  i.repo,\n  count(*) num_of_issues\nfrom gh.issues i\nwhere date_diff(CURRENT_DATE(), date(i.ts), day) \u003c 90\ngroup by\n  i.repo\norder by 2 desc\n```\n\n### TODO\n\n* Add org user export/import\n* Sort out the second run, where tables have to be appended\n* Bash, really? Can I haz me a service?\n\n\n## Scratch\n\n### Users who have activity (pr/issue) but are NOT in the user table\n\n```sql\nwith active_users as (\n  select username\n  from gh.issues\n  group by username\n\n  union all\n\n  select p.username\n  from gh.pulls p\n  group by username\n)\nselect *\nfrom active_users\nwhere username not in (SELECT username from gh.users)\n```\n\n\u003e Export the results as CSV and use them as input to `user-export`, which will download the GitHub data for each one of those users. Then, when done, run `user-import` to bring those users into BigQuery.\n\n### Activity breakdown by company\n\n```sql\nselect all_prs.company, all_prs.prs apr, coalesce(m3_prs.prs,0) rpr from (\n\n  select\n    COALESCE(u.company, 'Unknown') company,\n    COUNT(*) prs\n  from gh.pulls i\n  join gh.users u on i.username = u.username\n  group by company\n\n) all_prs\n\nleft join (\n\n  select\n    COALESCE(u.company, 'Unknown') company,\n    COUNT(*) prs\n  from gh.pulls i\n  join gh.users u on i.username = u.username\n  where i.ts \u003e \"2018-10-30 23:59:59\"\n  group by company\n\n) m3_prs on all_prs.company = m3_prs.company\n\norder by 2 desc\n```\n\n\n```sql\nselect u.company, count(*)\nfrom gh.pulls i join gh.users u on i.username = u.username\nwhere u.company is not null\ngroup by company order by 2 desc\n```\n\n\n```sql\nselect\n  pr_month,\n  sum(google_prs) as total_google_prs,\n  sum(non_google_prs) as total_non_google_prs\nfrom (\nselect\n  case when u.company = 'Google' then 1 else 0 end as google_prs,\n  case when u.company = 'Google' then 0 else 1 end as non_google_prs,\n  TIMESTAMP_TRUNC(i.`on`, MONTH) as pr_month\nfrom gh.pulls i\njoin gh.users u on i.username = u.username\nwhere u.company is not null\n)\ngroup by pr_month\norder by 1\n```\n\n### PRs\n\n```sql\nselect\n  pr_month,\n  sum(google_prs) as total_google_prs,\n  sum(non_google_prs) as total_non_google_prs\nfrom (\nselect\n  case when u.company = 'Google' then 1 else 0 end as google_prs,\n  case when u.company = 'Google' then 0 else 1 end as non_google_prs,\n  TIMESTAMP_TRUNC(p.ts, MONTH) as pr_month\nfrom gh.pulls p\njoin gh.users u on p.username = u.username\nwhere u.company \u003c\u003e ''\n)\ngroup by pr_month\norder by 1\n```\n\n### Issues\n\n```sql\nselect\n  pr_month,\n  sum(google_prs) as total_google_prs,\n  sum(non_google_prs) as total_non_google_prs\nfrom (\nselect\n  case when u.company = 'Google' then 1 else 0 end as google_prs,\n  case when u.company = 'Google' then 0 else 1 end as non_google_prs,\n  TIMESTAMP_TRUNC(i.ts, MONTH) as pr_month\nfrom gh.issues i\njoin gh.users u on i.username = u.username\nwhere u.company \u003c\u003e ''\n)\ngroup by pr_month\norder by 1\n```\n\n\n```sql\nselect pr_month, repo, count(*) as prs\nfrom (\nselect\n  i.repo,\n  TIMESTAMP_TRUNC(i.ts, MONTH) as pr_month\nfrom gh.pulls i\njoin gh.users u on i.username = u.username\nwhere u.company is not null\n)\ngroup by pr_month, repo\norder by 1, 3 desc\n```\n\n\n```sql\nselect\n  pr_month,\n  repo,\n  count(*) action\nfrom (\n\n  select\n    repo,\n    SUBSTR(CAST(TIMESTAMP_TRUNC(ts, MONTH) as STRING),0,7) as pr_month\n  from gh.issues\n\n  union all\n\n  select\n    repo,\n    SUBSTR(CAST(TIMESTAMP_TRUNC(ts, MONTH) as STRING),0,7) as pr_month\n  from gh.pulls\n\n)\nwhere repo = 'build' --'build-pipeline'\ngroup by repo, pr_month\norder by 1, 2\n```\n\n```sql\n select repo, count(*) from (\n select\n    repo\n  from gh.issues\n\n  union all\n\n  select\n    repo\n  from gh.pulls\n)\ngroup by repo\norder by 2 desc\n```","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmchmarny%2Fgithub-bigquery-exporter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmchmarny%2Fgithub-bigquery-exporter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmchmarny%2Fgithub-bigquery-exporter/lists"}