https://github.com/boweihan/schema-on-read
Application and deployment source for a schema-on-read data analytics solution
- Host: GitHub
- URL: https://github.com/boweihan/schema-on-read
- Owner: boweihan
- Created: 2018-09-24T04:14:45.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2018-10-11T02:18:13.000Z (over 6 years ago)
- Last Synced: 2025-01-31T09:45:40.217Z (4 months ago)
- Language: Go
- Size: 3.91 KB
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# SCHEMA-ON-READ
This project is an ongoing experiment in setting up infrastructure for schema-on-read analytics.
### Pipeline
Ideal properties for a pipeline include:
* Low Event Latency
* Scalability
* Interactive Querying
* Versioning
* Monitoring
* Testing
* Fault Tolerance

### Types of Data
* Raw Data (JSON blob)
* Processed Data (Schema-applied)
* Cooked Data (processed data aggregated/summarized)

### Eras
* Flat File Era
* Database Era
* Data Lake Era
* Serverless Era

### Event Sourcing
* Stream Processing System (Kafka, Amazon Kinesis, Google Pub/Sub)
* Message Encoding (JSON/Protocol Buffers)
* Handle Delivery Failure, Queueing, Batching, Prioritization
* Handle Auditing (lightweight sequential counting is probably enough)
* Strip PII for GDPR

### Implementation
1. Accept events via API
2. Send to Pub/Sub
3. Ingest Pub/Sub messages using Dataflow
4. Group Dataflow events into fixed batches, convert to strings, and output as Avro files to Cloud Storage
5. In parallel, pull Dataflow events, perform ETL, and send to BigQuery
6. Autoscaling

### Notes
Data gathering:
* Tracking specifications as a best practice (conditions/properties/definitions)
* Server-side vs. client-side tracking (trusted source/ad blocking/versioning/testing/data availability)
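The first stages of the Implementation section can be sketched in Go. This is a hedged, in-process stand-in: the `event` shape is hypothetical, a buffered channel stands in for Pub/Sub, and a simple slice split stands in for Dataflow's fixed batching (step 4); a real deployment would publish to a Pub/Sub topic and write Avro to Cloud Storage.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// event is a hypothetical payload shape; the real API may differ.
type event struct {
	Name string          `json:"name"`
	Body json.RawMessage `json:"body"` // kept raw: schema is applied on read
}

// queue stands in for Pub/Sub (step 2).
var queue = make(chan event, 1024)

// ingest implements step 1: accept events via an HTTP API.
func ingest(w http.ResponseWriter, r *http.Request) {
	var e event
	if err := json.NewDecoder(r.Body).Decode(&e); err != nil {
		http.Error(w, "bad event", http.StatusBadRequest)
		return
	}
	queue <- e
	w.WriteHeader(http.StatusAccepted)
}

// batch mirrors step 4: group events into fixed-size batches before
// they are written out (as Avro files, in the real pipeline).
func batch(events []event, size int) [][]event {
	var out [][]event
	for len(events) > 0 {
		n := size
		if n > len(events) {
			n = len(events)
		}
		out = append(out, events[:n])
		events = events[n:]
	}
	return out
}

func main() {
	// Exercise the handler with httptest instead of a live server.
	srv := httptest.NewServer(http.HandlerFunc(ingest))
	defer srv.Close()

	for i := 0; i < 5; i++ {
		body := fmt.Sprintf(`{"name":"e%d","body":{}}`, i)
		resp, err := http.Post(srv.URL+"/events", "application/json", strings.NewReader(body))
		if err != nil {
			panic(err)
		}
		resp.Body.Close()
	}

	// Drain the stand-in queue and group into fixed batches of 2.
	var events []event
	for len(queue) > 0 {
		events = append(events, <-queue)
	}
	for _, b := range batch(events, 2) {
		fmt.Println(len(b)) // prints: 2, 2, 1
	}
}
```

Step 5's parallel ETL path would consume the same queue independently, which is why events are queued rather than written straight to storage.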