{"id":17359961,"url":"https://github.com/tobked/fetch-apache-ga-stats","last_synced_at":"2025-02-26T11:31:48.714Z","repository":{"id":43099343,"uuid":"311953892","full_name":"TobKed/fetch-apache-ga-stats","owner":"TobKed","description":"Repository to make \"snapshots\" of GitHub Action queue for later analysis","archived":true,"fork":false,"pushed_at":"2023-12-17T00:37:47.000Z","size":107,"stargazers_count":6,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-21T09:43:17.728Z","etag":null,"topics":["bigquery","gcp","github","github-actions"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TobKed.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2020-11-11T11:43:50.000Z","updated_at":"2024-04-05T18:38:39.000Z","dependencies_parsed_at":"2023-11-12T02:25:36.077Z","dependency_job_id":"54d9060c-0c0a-47b0-bced-598664ac2577","html_url":"https://github.com/TobKed/fetch-apache-ga-stats","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TobKed%2Ffetch-apache-ga-stats","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TobKed%2Ffetch-apache-ga-stats/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TobKed%2Ffetch-apache-ga-stats/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TobKed%2Ffetch-apache-ga-stats/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TobKed","downl
oad_url":"https://codeload.github.com/TobKed/fetch-apache-ga-stats/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240843503,"owners_count":19866777,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigquery","gcp","github","github-actions"],"created_at":"2024-10-15T19:13:39.340Z","updated_at":"2025-02-26T11:31:48.370Z","avatar_url":"https://github.com/TobKed.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!--\n    Licensed to the Apache Software Foundation (ASF) under one\n    or more contributor license agreements.  See the NOTICE file\n    distributed with this work for additional information\n    regarding copyright ownership.  The ASF licenses this file\n    to you under the Apache License, Version 2.0 (the\n    \"License\"); you may not use this file except in compliance\n    with the License.  You may obtain a copy of the License at\n\n      http://www.apache.org/licenses/LICENSE-2.0\n\n    Unless required by applicable law or agreed to in writing,\n    software distributed under the License is distributed on an\n    \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n    KIND, either express or implied.  
See the License for the\n    specific language governing permissions and limitations\n    under the License.\n--\u003e\n\n# Fetch Apache GitHub Actions Statistics\n\n![Test the build](https://github.com/TobKed/fetch-apache-ga-stats/workflows/Test%20the%20build/badge.svg)\n[![Fetch GitHub Action queue](https://github.com/TobKed/fetch-apache-ga-stats/actions/workflows/fetch-github-actions-queue.yml/badge.svg)](https://github.com/TobKed/fetch-apache-ga-stats/actions/workflows/fetch-github-actions-queue.yml)\n[![Fetch Apache Repositories with GA](https://github.com/TobKed/fetch-apache-ga-stats/actions/workflows/fetch-apache-repos-with-ga.yml/badge.svg)](https://github.com/TobKed/fetch-apache-ga-stats/actions/workflows/fetch-apache-repos-with-ga.yml)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n\n\u003c!-- START doctoc generated TOC please keep comment here to allow auto update --\u003e\n\u003c!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE --\u003e\n**Table of Contents**\n\n- [Context and motivation](#context-and-motivation)\n- [Statistics](#statistics)\n    - [Json files](#json-files)\n    - [CSV file](#csv-file)\n    - [Processing existing json files to csv and pushing it to BigQuery](#processing-existing-json-files-to-csv-and-pushing-it-to-bigquery)\n- [Determining ASF repositories which uses GitHub Actions (matrix.json)](#determining-asf-repositories-which-uses-github-actions-matrixjson)\n- [GitHub Actions Secrets:](#github-actions-secrets)\n- [Google Cloud Platform infrastructure](#google-cloud-platform-infrastructure)\n\n\u003c!-- END doctoc generated TOC please keep comment here to allow auto update --\u003e\n\n## Context and motivation\n\nFor [The Apache Software Foundation [ASF]](https://github.com/apache/) the limit for concurrent jobs in GitHub Actions [GA] equals 180\n([usage 
limits](https://docs.github.com/en/free-pro-team@latest/actions/reference/usage-limits-billing-and-administration#usage-limits)).\nGitHub does not provide statistics related to GA, so this repo was created to collect basic data that makes such statistics possible.\n\n## Statistics\n\nStatistics data is fetched in the scheduled action [Fetch GitHub Action queue](.github/workflows/fetch-github-actions-queue.yml).\nThis action makes a series of \"snapshots\" of GA workflow runs for every ASF repository that uses GA\n(the list is stored in [matrix.json](./matrix.json), described [here](#determining-asf-repositories-which-uses-github-actions-matrixjson)).\n\nThe statistics consist of:\n\n * json files - workflow runs for every repo in separate files (described [here](#json-files))\n * csv file - simple statistics in a single file (described [here](#csv-file))\n\nThese files are uploaded as a workflow artifact.\n\n#### Json files\n\nThe json files contain a list of repository workflow runs in the `queued` and `in_progress` states.\nFile names contain the timestamp at which fetching this list started.\nThe json schema is described in the GitHub API documentation [here](https://docs.github.com/en/free-pro-team@latest/rest/reference/actions#list-workflow-runs-for-a-repository).\n\n#### CSV file\n\nA single `bq.csv` file is created; it contains simple statistics for all fetched repositories.\nThis file is used in the [Fetch GitHub Action queue](.github/workflows/fetch-github-actions-queue.yml) action\nto efficiently upload data to the BigQuery table.\n\nCSV file headers: `repository_owner`, `repository_name`, `queued`, `in_progress`, `timestamp`.\n\nExample content:\n\n```csv\nrepository_owner,repository_name,queued,in_progress,timestamp\napache,airflow,1,3,2020-11-19 17:53:24.139806+00:00\napache,beam,0,1,2020-11-19 17:53:39.171882+00:00\n```\n\n#### Processing existing json files to csv and pushing it to BigQuery\n\nThe helper script `scripts/parse_existing_json_files.py` can be used to process existing json 
files into a single csv.\n\nExample use:\n```shell script\ngsutil -m cp -r gs://example-bucket-name/apache gcs\n\npython parse_existing_json_files.py \\\n    --input-dir gcs \\\n    --output bq_csv.csv\n\nbq load --autodetect \\\n    --source_format=CSV \\\n    dataset.table bq_csv.csv\n```\n\n## Determining ASF repositories which uses GitHub Actions (matrix.json)\n\nThere is no single endpoint to obtain a list of ASF repositories which uses GA and since ASF consists of 2000+\nrepositories it is not a trivial task to obtain it.\n\nThis list of repositories which uses GitHub Actions is stored in [matrix.json](./matrix.json)\nand can be updated in three ways:\n * manually editing `matrix.json` and committing changes\n * by using [fetch_apache_projects_with_ga.py](scripts/fetch_apache_projects_with_ga.py) python script and committing changes\n * automatically by [Fetch Apache Repositories with GA](.github/workflows/fetch-apache-repos-with-ga.yml) action (changes committed automatically when occur).\n\nRunning python script and action causes many requests on behalf of used GitHub Access tokens which may cause exceeding quota limits.\n\n## GitHub Actions Secrets:\n\n| Secret                  | Required | Description                                                                                                                                                                                                                                                                                                                                                                                                                           
|\n|-------------------------|----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| `PERSONAL_ACCESS_TOKEN` | True     | [Personal GitHub access token](https://docs.github.com/en/free-pro-team@latest/github/authenticating-to-github/creating-a-personal-access-token)(no need for additional permissions, don't have to select any checkboxes) used to authorize requests. It has bigger quota than [`GITHUB_TOKEN secret`](https://docs.github.com/en/free-pro-team@latest/actions/reference/authentication-in-a-workflow#about-the-github_token-secret). |\n| `GCP_PROJECT_ID`        | -        | Google Cloud Project ID.                                                                                                                                                                                                                                                                                                                                                                                                              |\n| `BQ_TABLE`              | -        | BigQuery table reference to which simple statistics will be pushed (e.g. `dataset.table`).                                                                                                                                                                                                                                                                                                                                            
|\n| `GCP_SA_KEY`            | -        | Google Cloud Service Account key (Service Account with permissions to Google Cloud Storage and BigQuery).                                                                                                                                                                                                                                                                                                                             |\n| `GCP_SA_EMAIL`          | -        | Google Cloud Service Account email (Service Account with permissions to Google Cloud Storage and BigQuery).                                                                                                                                                                                                                                                                                                                           |\n\n## Google Cloud Platform infrastructure\n\nAll infrastructure components necessary to store statistics in BigQuery were wrapped in [./terraform](./terraform) folder.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftobked%2Ffetch-apache-ga-stats","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftobked%2Ffetch-apache-ga-stats","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftobked%2Ffetch-apache-ga-stats/lists"}