{"id":15048432,"url":"https://github.com/wolfeidau/arrow-gh-processor","last_synced_at":"2025-04-10T01:22:06.534Z","repository":{"id":177806490,"uuid":"660928364","full_name":"wolfeidau/arrow-gh-processor","owner":"wolfeidau","description":"This project illustrates how to build a data processor using a Go, Apache Arrow.","archived":false,"fork":false,"pushed_at":"2023-07-01T09:55:34.000Z","size":18,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-02-02T17:08:25.410Z","etag":null,"topics":["arrow","github","golang","json","parquet"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wolfeidau.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-07-01T08:45:08.000Z","updated_at":"2023-12-27T16:30:28.000Z","dependencies_parsed_at":null,"dependency_job_id":"ace479bb-6ee0-4618-bebe-7cf86cdaae60","html_url":"https://github.com/wolfeidau/arrow-gh-processor","commit_stats":null,"previous_names":["wolfeidau/arrow-gh-processor"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wolfeidau%2Farrow-gh-processor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wolfeidau%2Farrow-gh-processor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wolfeidau%2Farrow-gh-processor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wolfeidau%2Farrow-g
h-processor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wolfeidau","download_url":"https://codeload.github.com/wolfeidau/arrow-gh-processor/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239108715,"owners_count":19583047,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arrow","github","golang","json","parquet"],"created_at":"2024-09-24T21:11:56.736Z","updated_at":"2025-02-16T08:34:18.889Z","avatar_url":"https://github.com/wolfeidau.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# arrow-gh-processor\n\nThis project illustrates how to build a data processor using Go and Apache Arrow. 
This code reads the JSON Lines data provided in compressed archives by GH Archive, extracts events by type, and stores them in a Parquet file.\n\n# Overview\n\nCurrently this project extracts events of type `PullRequestEvent` and writes them out to Parquet. The process looks something like this.\n\n```mermaid\nflowchart LR\n    A[Gunzip Stream] --\u003e B[Split Into Lines] --\u003e C[Extract Data using JSON Template] --\u003e D[Write to Parquet]\n```\n\n# Usage\n\n```\nUsage: arrow-gh-processor \u003csource\u003e \u003cdestination\u003e\n\nArguments:\n  \u003csource\u003e         Source github archive file containing JSON and compressed with Gzip\n  \u003cdestination\u003e    Destination parquet output file\n\nFlags:\n  -h, --help                             Show context-sensitive help.\n      --version\n      --event-type=\"PullRequestEvent\"\n```\n\n# Example\n\nDownload an archive from the GH Archive website.\n\n```\ncurl -L -O https://data.gharchive.org/2023-06-26-14.json.gz\n```\n\nConvert it to Parquet.\n\n```\narrow-gh-processor 2023-06-26-14.json.gz 2023-06-26-14.snappy.parquet\n```\n\nThe output Parquet file has the following schema.\n\n```\nrepeated group field_id=-1 arrow_schema {\n  optional byte_array field_id=-1 id (String);\n  optional byte_array field_id=-1 type (String);\n  optional byte_array field_id=-1 actor (String);\n  optional byte_array field_id=-1 actor_url (String);\n  optional byte_array field_id=-1 repo (String);\n  optional byte_array field_id=-1 repo_url (String);\n  optional byte_array field_id=-1 pull_action (String);\n  optional int64 field_id=-1 pull_number (Int(bitWidth=64, isSigned=true));\n  optional byte_array field_id=-1 pull_state (String);\n  optional byte_array field_id=-1 pull_title (String);\n  optional byte_array field_id=-1 author_association (String);\n  optional int64 field_id=-1 created_at (Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, is_from_converted_type=false, 
force_set_converted_type=true));\n  optional byte_array field_id=-1 pull_request (String);\n}\n```\n\nQuery the data using [DuckDB](https://duckdb.org/).\n\n```sql\nSELECT actor, count(id) \nFROM read_parquet('2023-06-26-14.snappy.parquet') \nGROUP BY actor ORDER BY count(id) desc;\n```\n\nThe output will look something like this.\n\n```\nduckdb\nv0.8.1 6536a77232\nEnter \".help\" for usage hints.\nConnected to a transient in-memory database.\nUse \".open FILENAME\" to reopen on a persistent database.\nD SELECT actor, count(id) FROM read_parquet('2023-06-26-14.snappy.parquet') GROUP BY actor ORDER BY count(id) desc;\n┌───────────────────────────┬───────────┐\n│           actor           │ count(id) │\n│          varchar          │   int64   │\n├───────────────────────────┼───────────┤\n│ dependabot[bot]           │      2392 │\n│ pull[bot]                 │       538 │\n│ renovate[bot]             │       511 │\n│ github-actions[bot]       │       278 │\n│ direwolf-github           │       101 │\n│ trunk-dev[bot]            │        81 │\n```\n\n# License\n\nThis project is released under the Apache 2.0 license and is copyright [Mark Wolfe](https://www.wolfe.id.au).","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwolfeidau%2Farrow-gh-processor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwolfeidau%2Farrow-gh-processor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwolfeidau%2Farrow-gh-processor/lists"}