https://github.com/googlecloudplatform/dataflow-pubsub-dedup
- Host: GitHub
- URL: https://github.com/googlecloudplatform/dataflow-pubsub-dedup
- Owner: GoogleCloudPlatform
- License: apache-2.0
- Created: 2019-08-13T14:10:39.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2024-09-25T14:36:33.000Z (3 months ago)
- Last Synced: 2024-12-18T08:40:18.804Z (4 days ago)
- Topics: apache-beam, cloud-dataflow, cloud-pubsub, google-cloud-platform
- Language: Java
- Size: 67.4 KB
- Stars: 14
- Watchers: 16
- Forks: 10
- Open Issues: 12
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
## Deduplication with Cloud PubSub and Cloud Dataflow on Google Cloud Platform
This is the source code that accompanies the solution: Deduplication of messages with Cloud PubSub and Cloud Dataflow. This sample code demonstrates three approaches for deduplication:
- PubSubIO: `com.google.examples.dfdedup.DedupWithPubSubIO`
- Distinct transform: `com.google.examples.dfdedup.DedupWithDistinct`
- Custom state-based deduplication: `com.google.examples.dfdedup.DedupWithStateAndGC`
## End-to-end pipeline
You can run the following end-to-end pipeline to explore deduplication behavior across all three approaches:
![End to end flow](images/endtoendflow.PNG)
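
Before setting up resources, it may help to see the idea behind the third approach (custom state-based deduplication with garbage collection) in isolation. The following is a minimal, Beam-free sketch and not the code in `DedupWithStateAndGC`: the class name `StatefulDedupSketch`, the `retentionMillis` parameter, and the use of a caller-supplied timestamp are all assumptions made for illustration. In an actual Beam pipeline, the map below would typically be replaced by per-key state, and the cleanup by a timer.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of state-based deduplication with garbage collection.
 * It remembers each message ID for a fixed retention period and drops repeats
 * that arrive inside that period. This is NOT the repository's implementation.
 */
class StatefulDedupSketch {
    // Message ID -> timestamp (ms) at which the ID was first seen.
    private final Map<String, Long> firstSeen = new HashMap<>();
    private final long retentionMillis;

    StatefulDedupSketch(long retentionMillis) {
        if (retentionMillis <= 0) {
            throw new IllegalArgumentException("retentionMillis must be positive");
        }
        this.retentionMillis = retentionMillis;
    }

    /** Returns true if the message should be emitted, false if it is a duplicate. */
    boolean accept(String messageId, long nowMillis) {
        collectGarbage(nowMillis);
        // putIfAbsent returns null only when the ID was not already tracked.
        return firstSeen.putIfAbsent(messageId, nowMillis) == null;
    }

    /** Drops state for IDs first seen at least retentionMillis ago. */
    void collectGarbage(long nowMillis) {
        firstSeen.values().removeIf(seenAt -> nowMillis - seenAt >= retentionMillis);
    }

    /** Number of IDs currently held in state. */
    int trackedIds() {
        return firstSeen.size();
    }

    public static void main(String[] args) {
        StatefulDedupSketch dedup = new StatefulDedupSketch(1_000L);
        String[] ids = {"a", "b", "a", "c", "b"};
        long now = 0L;
        for (String id : ids) {
            String decision = dedup.accept(id, now) ? "emit" : "drop";
            System.out.println(id + " -> " + decision);
            now += 100L; // pretend 100 ms pass between messages
        }
    }
}
```

In the demo run, the repeated `a` and `b` arrive inside the one-second retention period and are dropped. Scanning the whole map on every call keeps the sketch short; a real pipeline would more likely expire state per key, so that memory stays bounded without a full scan.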
### Setting up resources
***NOTE:***
If you're new to GCP, please see the quickstarts for [Cloud PubSub](https://cloud.google.com/pubsub/docs/quickstarts), [BigQuery](https://cloud.google.com/bigquery/docs/quickstarts) and [Cloud Dataflow](https://cloud.google.com/dataflow/docs/quickstarts).
#### BigQuery
Use the schema files under `bqschemas/` to create the BigQuery tables.
#### Cloud PubSub
### Running the Python-based data generator