{"id":28535210,"url":"https://github.com/gerritcodereview/apps_analytics-etl","last_synced_at":"2025-07-03T04:32:24.231Z","repository":{"id":84111584,"uuid":"119593932","full_name":"GerritCodeReview/apps_analytics-etl","owner":"GerritCodeReview","description":"Spark ETL to extra analytics data from Gerrit Projects using the Analytics plugin - (mirror of https://gerrit.googlesource.com/apps_analytics-etl)","archived":false,"fork":false,"pushed_at":"2025-03-03T04:25:39.000Z","size":1268,"stargazers_count":5,"open_issues_count":0,"forks_count":7,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-06-09T17:15:56.597Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GerritCodeReview.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":"auditlog/scripts/gerrit-analytics-etl-auditlog.sh","citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-01-30T21:01:51.000Z","updated_at":"2025-01-22T12:53:39.000Z","dependencies_parsed_at":"2025-01-22T13:56:47.546Z","dependency_job_id":null,"html_url":"https://github.com/GerritCodeReview/apps_analytics-etl","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/GerritCodeReview/apps_analytics-etl","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GerritCodeReview%2Fapps_analytics-etl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GerritCodeReview%2Fapps_analytics-etl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/GerritCodeReview%2Fapps_analytics-etl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GerritCodeReview%2Fapps_analytics-etl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GerritCodeReview","download_url":"https://codeload.github.com/GerritCodeReview/apps_analytics-etl/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GerritCodeReview%2Fapps_analytics-etl/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263258911,"owners_count":23438734,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-09T17:14:23.214Z","updated_at":"2025-07-03T04:32:24.219Z","avatar_url":"https://github.com/GerritCodeReview.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Intro\n\nThis repository provides a set of spark ETL jobs able to extract, transform and persist data from\ngerrit projects with the purpose of performing analytics tasks. \n\nEach job focuses on a specific dataset and it knows how to extract it, filter it, aggregate it,\ntransform it and then persist it.\n\nThe persistent storage of choice is *elasticsearch*, which plays very well with the *kibana* dashboard for\nvisualizing the analytics.\n\nAll jobs are configured as separate sbt projects and have in common just a thin layer of core\ndependencies, such as spark, elasticsearch client, test utils, etc.\n\nEach job can be built and published independently, both as a fat jar artifact or a docker image.  
\n\n# Spark ETL jobs\n\nBelow is an exhaustive list of all the Spark jobs provided by this repo, along with their documentation.\n\n## Git Commits\n\nExtracts and aggregates git commit data from Gerrit Projects.\n\nRequires a [Gerrit 2.13.x](https://www.gerritcodereview.com/releases/README.md) or later\nwith the [analytics](https://gerrit.googlesource.com/plugins/analytics/)\nplugin installed and [Apache Spark 2.11](https://spark.apache.org/downloads.html) or later.\n\nThe job can be launched with the following parameters:\n\n```bash\nbin/spark-submit \\\n    --class com.gerritforge.analytics.gitcommits.job.Main \\\n    --conf spark.es.nodes=es.mycompany.com \\\n    $JARS/analytics-etl-gitcommits.jar \\\n    --since 2000-06-01 \\\n    --aggregate email_hour \\\n    --url http://gerrit.mycompany.com \\\n    -e gerrit \\\n    --username gerrit-api-username \\\n    --password gerrit-api-password\n```\n\nYou can also run this job in docker:\n\n```bash\ndocker run -ti --rm \\\n    -e ES_HOST=\"es.mycompany.com\" \\\n    -e GERRIT_URL=\"http://gerrit.mycompany.com\" \\\n    -e ANALYTICS_ARGS=\"--since 2000-06-01 --aggregate email_hour -e gerrit\" \\\n    gerritforge/gerrit-analytics-etl-gitcommits:latest\n```\n\n### Parameters\n- since, until, aggregate are the same as those defined in the Gerrit Analytics plugin,\n    see: https://gerrit.googlesource.com/plugins/analytics/+/master/README.md\n- -u --url Gerrit server URL with the analytics plugin installed\n- -m --manifest Repo manifest XML path. Absolute path of the Repo manifest XML to import projects\nfrom. Each project will be imported from the revision specified in the `revision` attribute.\n- -n --manifest-branch (*optional*) Manifest branch. Manifest file git branch.\n- -l --manifest-label (*optional*) Manifest label. A `manifest_label` is an aggregation of projects imported from the same manifest.\nAdd it to allow filtering by `manifest_label`.\n- -p --prefix (*optional*) Projects prefix. 
Limit the results to those projects that start with the specified prefix.\n- -e --elasticIndex Elasticsearch index name. If not provided, no ES export will be performed. _Note: Elasticsearch 6.x\nrequires the index format `name/type`, while Elasticsearch 7.x and later require just `name`_\n- -r --extract-branches Extract and process branches information (Optional) - Default: false\n- -o --out folder location for storing the output as JSON files.\n    If not provided, data is saved to \u003c/tmp\u003e/analytics-\u003cNNNN\u003e where \u003c/tmp\u003e is\n    the system temporary directory\n- -k --ignore-ssl-cert allows the job to proceed even for server connections otherwise considered insecure.\n- -a --email-aliases (*optional*) \"emails to author alias\" input data path.\n\n  CSVs with 3 columns are expected in input.\n\n  Here is an example of the required file structure:\n  ```csv\n  author,email,organization\n  John Smith,john@email.com,John's Company\n  John Smith,john@anotheremail.com,John's Company\n  David Smith,david.smith@email.com,Independent\n  David Smith,david@myemail.com,Independent\n  ```\n\n  You can use the following command to quickly extract the list of authors and emails to create part of an input CSV file:\n  ```bash\n  echo -e \"author,email\\n$(git log --pretty=\"%an,%ae%n%cn,%ce\"|sort |uniq )\" \u003e /tmp/my_aliases.csv\n  ```\n  Once you have it, you just have to add the organization column.\n\n  *NOTE:*\n  * **organization** will be extracted from the committer email if not specified\n  * **author** will be defaulted to the committer name if not specified\n\n### Build\n\n#### JAR\nTo build the jar file, simply use\n\n`sbt analyticsETLGitCommits/assembly`\n\n#### Docker\n\nTo build the *gerritforge/gerrit-analytics-etl-gitcommits* docker image just run:\n\n`sbt analyticsETLGitCommits/docker`.\n\nIf you want to distribute it use:\n\n`sbt analyticsETLGitCommits/dockerBuildAndPush`.\n\nThe build and distribution override the `latest` image tag too.\nRemember to 
create an annotated tag for a release. The tag is used to define the docker image tag too.\n\n## Audit Logs\n\nExtract, aggregate and persist auditLog entries produced by Gerrit via the [audit-sl4j](https://gerrit.googlesource.com/plugins/audit-sl4j/) plugin.\nAuditLog entries are an immutable trace of what happened on Gerrit and this ETL can leverage that to answer questions such as:\n\n- How is incoming Git traffic distributed?\n- Git/SSH vs. Git/HTTP traffic\n- Git receive-pack vs. upload-pack\n- Top 10 users of receive-pack\n\nand many other questions related to the usage of Gerrit.\n\nThe job can be launched, for example, with the following parameters:\n\n```bash\nspark-submit \\\n    --class com.gerritforge.analytics.auditlog.job.Main \\\n    --conf spark.es.nodes=es.mycompany.com \\\n    --conf spark.es.port=9200 \\\n    --conf spark.es.index.auto.create=true \\\n    $JARS/analytics-etl-auditlog.jar \\\n        --gerritUrl https://gerrit.mycompany.com \\\n        --elasticSearchIndex gerrit \\\n        --eventsPath /path/to/auditlogs \\\n        --ignoreSSLCert false \\\n        --since 2000-06-01 \\\n        --until 2020-12-01\n```\n\nYou can also run this job in docker:\n\n```bash\ndocker run \\\n    --volume \u003csource\u003e/audit_log:/app/events/audit_log -ti --rm \\\n    -e ES_HOST=\"\u003celasticsearch_url\u003e\" \\\n    -e GERRIT_URL=\"http://\u003cgerrit_url\u003e:\u003cgerrit_port\u003e\" \\\n    -e ANALYTICS_ARGS=\"--elasticSearchIndex gerrit --eventsPath /app/events/audit_log --ignoreSSLCert false --since 2000-06-01 --until 2020-12-01 -a hour\" \\\n    gerritforge/gerrit-analytics-etl-auditlog:latest\n```\n\n### Parameters\n\n* -u, --gerritUrl              - Gerrit server URL (Required)\n* --username                   - Gerrit API Username (Optional)\n* --password                   - Gerrit API Password (Optional)\n* -i, --elasticSearchIndex     - Elasticsearch index to persist data into (Required)\n* -p, --eventsPath             - path to a 
directory (or a file) containing auditLog events. _.gz_ files are also supported. (Required)\n* -a, --eventsTimeAggregation  - Events of the same type produced by the same user will be aggregated with this time granularity: 'second', 'minute', 'hour', 'week', 'month', 'quarter'. (Optional) - Default: 'hour'\n* -k, --ignoreSSLCert          - Ignore SSL certificate validation (Optional) - Default: false\n* -s, --since                  - process only auditLogs that occurred on or after this date (Optional)\n* -u, --until                  - process only auditLogs that occurred on or before this date (Optional)\n* -a, --additionalUserInfoPath - path to a CSV file containing additional user information (Optional). Currently it is only possible to add user `type` (i.e.: _bot_, _human_).\nIf the type is not specified the user will be considered _human_.\n\n  Here is an example of an additional user information CSV file:\n  ```csv\n    id,type\n    123,\"bot\"\n    456,\"bot\"\n    789,\"human\"\n  ```\n\n### Build\n\n#### JAR\nTo build the jar file, simply use\n\n`sbt analyticsETLAuditLog/assembly`\n\n#### Docker\n\nTo build the *gerritforge/gerrit-analytics-etl-auditlog* docker image just run:\n\n`sbt analyticsETLAuditLog/docker`.\n\nIf you want to distribute it use:\n\n`sbt analyticsETLAuditLog/dockerBuildAndPush`.\n\nThe build and distribution override the `latest` image tag too.\n\n\n# Development environment\n\nA docker compose file is provided to spin up an instance of Elasticsearch with Kibana locally.\nJust run `docker-compose up`.\n\n## Caveats\n\n* If you want to run the git ETL job from within docker against containerized elasticsearch and/or gerrit instances, you need\n  to make them reachable by the ETL container. 
You can do this by running the ETL container within the same network used by your elasticsearch/gerrit container (use the `--network` argument).\n\n* If elasticsearch or gerrit run on your host machine, then you need to make _that_ reachable by the ETL container.\n  You can do this by providing routing to the docker host machine (e.g. `--add-host=\"gerrit:\u003cyour_host_ip_address\u003e\"` `--add-host=\"elasticsearch:\u003cyour_host_ip_address\u003e\"`)\n\n  For example:\n\n  * Run gitcommits ETL:\n  ```bash\n  HOST_IP=`ifconfig en0 | grep \"inet \" | awk '{print $2}'` \\\n      docker run -ti --rm \\\n          --add-host=\"gerrit:$HOST_IP\" \\\n          --network analytics-etl_ek \\\n          -e ES_HOST=\"elasticsearch\" \\\n          -e GERRIT_URL=\"http://$HOST_IP:8080\" \\\n          -e ANALYTICS_ARGS=\"--since 2000-06-01 --aggregate email_hour -e gerrit\" \\\n          gerritforge/gerrit-analytics-etl-gitcommits:latest\n  ```\n\n  * Run auditlog ETL:\n    ```bash\n    HOST_IP=`ifconfig en0 | grep \"inet \" | awk '{print $2}'` \\\n        docker run -ti --rm --volume \u003csource\u003e/audit_log:/app/events/audit_log \\\n        --add-host=\"gerrit:$HOST_IP\" \\\n        --network analytics-wizard_ek \\\n        -e ES_HOST=\"elasticsearch\" \\\n        -e GERRIT_URL=\"http://$HOST_IP:8181\" \\\n        -e ANALYTICS_ARGS=\"--elasticSearchIndex gerrit --eventsPath /app/events/audit_log --ignoreSSLCert true --since 2000-06-01 --until 2020-12-01 -a hour\" \\\n        gerritforge/gerrit-analytics-etl-auditlog:latest\n    ```\n\n* If Elasticsearch dies with `exit code 137` you might have to give Docker more memory ([check this issue for more details](https://github.com/moby/moby/issues/22211))\n\n* Should Elasticsearch need authentication (e.g. if X-Pack is enabled), credentials can be passed through the *spark.es.net.http.auth.pass* and *spark.es.net.http.auth.user* parameters.\n\n* If the dockerized spark job cannot connect to elasticsearch (also running on docker), you 
might need to tell elasticsearch to publish\nthe host to the cluster using the \\_site\\_ address.\n\n```\nelasticsearch:\n    ...\n    environment:\n       ...\n      - http.host=0.0.0.0\n      - network.host=_site_\n      - http.publish_host=_site_\n      ...\n```\n\nSee [the Elasticsearch network settings documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-network.html#network-interface-values) for more information.\n\n## Build all\n\nTo perform actions across all jobs simply run the relevant *sbt* task without specifying the job name. For example:\n\n* Test all jobs: `sbt test`\n* Build jar for all jobs: `sbt assembly`\n* Build docker image for all jobs: `sbt docker`","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgerritcodereview%2Fapps_analytics-etl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgerritcodereview%2Fapps_analytics-etl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgerritcodereview%2Fapps_analytics-etl/lists"}