https://github.com/wolfeidau/arrow-gh-processor
This project illustrates how to build a data processor using Go and Apache Arrow.
- Host: GitHub
- URL: https://github.com/wolfeidau/arrow-gh-processor
- Owner: wolfeidau
- License: apache-2.0
- Created: 2023-07-01T08:45:08.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-07-01T09:55:34.000Z (over 2 years ago)
- Last Synced: 2025-02-02T17:08:25.410Z (12 months ago)
- Topics: arrow, github, golang, json, parquet
- Language: Go
- Homepage:
- Size: 17.6 KB
- Stars: 4
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# arrow-gh-processor
This project illustrates how to build a data processor using Go and Apache Arrow. The code reads the JSON-lines data that GitHub provides as gzip-compressed archives, extracts events by type, and stores them in a parquet file.
# Overview
Currently this project extracts events of type `PullRequestEvent` and writes them out to parquet. The process looks like this:
```mermaid
flowchart LR
A[Gunzip Stream] --> B[Split Into Lines] --> C[Extract Data using JSON Template] --> D[Write to Parquet]
```
# Usage
```
Usage: arrow-gh-processor
Arguments:
Source github archive file containing JSON and compressed with Gzip
Destination parquet output file
Flags:
-h, --help Show context-sensitive help.
--version
--event-type="PullRequestEvent"
```
# Example
Download an archive from the GH Archive website.
```
curl -L -O https://data.gharchive.org/2023-06-26-14.json.gz
```
Convert it to parquet.
```
arrow-gh-processor 2023-06-26-14.json.gz 2023-06-26-14.snappy.parquet
```
The schema of the output parquet file is as follows:
```
repeated group field_id=-1 arrow_schema {
optional byte_array field_id=-1 id (String);
optional byte_array field_id=-1 type (String);
optional byte_array field_id=-1 actor (String);
optional byte_array field_id=-1 actor_url (String);
optional byte_array field_id=-1 repo (String);
optional byte_array field_id=-1 repo_url (String);
optional byte_array field_id=-1 pull_action (String);
optional int64 field_id=-1 pull_number (Int(bitWidth=64, isSigned=true));
optional byte_array field_id=-1 pull_state (String);
optional byte_array field_id=-1 pull_title (String);
optional byte_array field_id=-1 author_association (String);
optional int64 field_id=-1 created_at (Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=true));
optional byte_array field_id=-1 pull_request (String);
}
```
Query the data using [duckdb](https://duckdb.org/).
```sql
SELECT actor, count(id)
FROM read_parquet('2023-06-26-14.snappy.parquet')
GROUP BY actor ORDER BY count(id) desc;
```
The output will look something like this:
```
duckdb
v0.8.1 6536a77232
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D SELECT actor, count(id) FROM read_parquet('2023-06-26-14.snappy.parquet') GROUP BY actor ORDER BY count(id) desc;
┌───────────────────────────┬───────────┐
│ actor │ count(id) │
│ varchar │ int64 │
├───────────────────────────┼───────────┤
│ dependabot[bot] │ 2392 │
│ pull[bot] │ 538 │
│ renovate[bot] │ 511 │
│ github-actions[bot] │ 278 │
│ direwolf-github │ 101 │
│ trunk-dev[bot] │ 81 │
```
# License
This project is released under the Apache 2.0 license and is copyright [Mark Wolfe](https://www.wolfe.id.au).