{"id":23234982,"url":"https://github.com/googlecloudplatform/dataflow-pubsub-dedup","last_synced_at":"2025-08-19T21:32:15.369Z","repository":{"id":42119216,"uuid":"202158542","full_name":"GoogleCloudPlatform/dataflow-pubsub-dedup","owner":"GoogleCloudPlatform","description":null,"archived":false,"fork":false,"pushed_at":"2024-09-25T14:36:33.000Z","size":69,"stargazers_count":14,"open_issues_count":12,"forks_count":10,"subscribers_count":16,"default_branch":"master","last_synced_at":"2024-12-18T08:40:18.804Z","etag":null,"topics":["apache-beam","cloud-dataflow","cloud-pubsub","google-cloud-platform"],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GoogleCloudPlatform.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-08-13T14:10:39.000Z","updated_at":"2024-06-16T02:05:15.000Z","dependencies_parsed_at":"2023-01-24T14:15:17.664Z","dependency_job_id":null,"html_url":"https://github.com/GoogleCloudPlatform/dataflow-pubsub-dedup","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fdataflow-pubsub-dedup","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fdataflow-pubsub-dedup/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fdataflow-pubsub-dedup/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fdataflow-pubsub-dedup/manifests","owner_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/owners/GoogleCloudPlatform","download_url":"https://codeload.github.com/GoogleCloudPlatform/dataflow-pubsub-dedup/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230369208,"owners_count":18215339,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-beam","cloud-dataflow","cloud-pubsub","google-cloud-platform"],"created_at":"2024-12-19T03:17:18.216Z","updated_at":"2024-12-19T03:17:18.895Z","avatar_url":"https://github.com/GoogleCloudPlatform.png","language":"Java","readme":"## Deduplication with Cloud PubSub and Cloud Dataflow on Google Cloud Platform\n\nThis is the source code that accompanies the solution: Deduplication of messages with Cloud PubSub and Cloud Dataflow. 
This sample code demonstrates three approaches for deduplication:\n\n- PubSubIO: `com.google.examples.dfdedup.DedupWithPubSubIO`\n- Distinct transform: `com.google.examples.dfdedup.DedupWithDistinct`\n- Custom state-based deduplication: `com.google.examples.dfdedup.DedupWithStateAndGC`\n\n## End-to-end pipeline\n\nYou can run the following end-to-end pipeline to explore deduplication behavior across all three approaches:\n\n![End to end flow](images/endtoendflow.PNG)\n\n### Setting up resources\n\n***NOTE:***\nIf you're new to GCP, please see the quickstarts for [Cloud PubSub](https://cloud.google.com/pubsub/docs/quickstarts), [BigQuery](https://cloud.google.com/bigquery/docs/quickstarts) and [Cloud Dataflow](https://cloud.google.com/dataflow/docs/quickstarts)\n\n#### BigQuery\nUse the schema files under `bqschemas/` to create\n\n#### Cloud PubSub\n\n\n\n### Running the Python-based data generator\nBlah blah","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgooglecloudplatform%2Fdataflow-pubsub-dedup","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgooglecloudplatform%2Fdataflow-pubsub-dedup","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgooglecloudplatform%2Fdataflow-pubsub-dedup/lists"}