{"id":20736798,"url":"https://github.com/pirate-emperor/k2bq","last_synced_at":"2026-01-28T14:33:10.008Z","repository":{"id":263129890,"uuid":"889436490","full_name":"Pirate-Emperor/K2BQ","owner":"Pirate-Emperor","description":"K2BQ is a dataflow pipeline that streams data from Kafka to BigQuery. It uses Google Cloud’s managed Kafka, Dataflow for processing, and BigQuery for real-time analytics, offering scalable, automated data integration for fast insights.","archived":false,"fork":false,"pushed_at":"2024-11-16T11:37:38.000Z","size":22,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-30T05:41:12.156Z","etag":null,"topics":["bigquery","cloud-computing","cloud-infrastructure","data-integration","data-streaming","dataflow","google-cloud","infrastructure-as-code","kafka","python","realtime-analytics","terraform"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Pirate-Emperor.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-16T11:34:38.000Z","updated_at":"2024-11-16T11:54:08.000Z","dependencies_parsed_at":"2024-11-16T12:27:49.477Z","dependency_job_id":"6a812d17-a1c4-4772-8677-a579e6b25d1b","html_url":"https://github.com/Pirate-Emperor/K2BQ","commit_stats":null,"previous_names":["pirate-emperor/k2bq"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Pirate-Emperor%2FK2BQ","tags_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/repositories/Pirate-Emperor%2FK2BQ/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Pirate-Emperor%2FK2BQ/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Pirate-Emperor%2FK2BQ/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Pirate-Emperor","download_url":"https://codeload.github.com/Pirate-Emperor/K2BQ/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250543743,"owners_count":21447959,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigquery","cloud-computing","cloud-infrastructure","data-integration","data-streaming","dataflow","google-cloud","infrastructure-as-code","kafka","python","realtime-analytics","terraform"],"created_at":"2024-11-17T06:11:50.212Z","updated_at":"2026-01-28T14:33:09.976Z","avatar_url":"https://github.com/Pirate-Emperor.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# K2BQ - Dataflow Kafka to BigQuery\n\n## Overview\n\n**K2BQ** is a dataflow pipeline designed to stream data from Kafka to Google BigQuery. The project provides a scalable and efficient solution for transferring large volumes of data from Kafka topics into BigQuery for real-time analytics. 
This project leverages Google Cloud’s managed Kafka service and Dataflow to enable seamless data integration, processing, and storage in BigQuery.\n\n## Features\n\n- **Kafka Integration**: Collect data from Kafka topics for real-time processing.\n- **BigQuery Integration**: Directly stream the processed data into BigQuery for fast analytics.\n- **Scalable Architecture**: Leverages Google Cloud's managed services for scaling based on demand.\n- **Automated Data Pipeline**: Automates the data transfer process from Kafka to BigQuery with minimal intervention.\n\n## Components\n\nThe pipeline consists of the following components:\n1. **Google Cloud Managed Kafka Cluster**: Used to store and stream data through Kafka topics.\n2. **Google Dataflow**: Used to process data from Kafka and load it into BigQuery.\n3. **BigQuery**: The final destination where the processed data is stored for analysis.\n\n## Installation\n\n### Prerequisites\n\n- Google Cloud Platform (GCP) account.\n- Terraform installed on your local machine.\n- Google Cloud SDK configured with the necessary permissions.\n- Access to Google BigQuery and Kafka services.\n\n### Setting Up the Project\n\n1. **Clone the repository:**\n\n   ```bash\n   git clone https://github.com/Pirate-Emperor/K2BQ.git\n   cd K2BQ\n   ```\n\n2. **Configure Terraform:**\n\n   Ensure that you have your `main.tf`, `variable.tf`, and any other required Terraform files set up as mentioned in the following sections.\n\n3. **Install required Terraform providers:**\n\n   In the root directory, run the following command to initialize Terraform providers:\n\n   ```bash\n   terraform init\n   ```\n\n4. **Apply the Terraform configuration:**\n\n   Run the following command to create the resources in your Google Cloud project:\n\n   ```bash\n   terraform apply\n   ```\n\n   This command will create the Kafka cluster, Kafka topics, and Dataflow pipeline. 
Review the planned changes and enter `yes` when prompted to confirm.\n\n## Terraform Configuration Files\n\n### `main.tf`\n\nThis Terraform configuration sets up the Google Cloud resources required for the Kafka-to-BigQuery data pipeline.\n\n```hcl\nprovider \"google-beta\" {\n  project = data.google_project.project.project_id\n  region  = \"us-central1\"\n}\n\ndata \"google_project\" \"project\" {}\n\nmodule \"kafka_cluster\" {\n  source = \"./module/df_resource\"  # Update with the actual path to your module\n  cluster_id         = \"dataops-kafka\"\n  region             = \"us-central1\"\n  vcpu_count         = 4\n  memory_bytes       = 4294967296\n  subnet             = \"projects/valid-verbena-437709-h5/regions/us-central1/subnetworks/default\"\n  topic_id           = \"dataops-kafka-topic\"\n  partition_count    = 3\n  replication_factor = 3\n  cleanup_policy     = \"compact\"\n}\n\nresource \"google_managed_kafka_cluster\" \"cluster\" {\n  cluster_id = var.cluster_id\n  location   = var.region\n\n  capacity_config {\n    vcpu_count    = var.vcpu_count\n    memory_bytes  = var.memory_bytes\n  }\n\n  gcp_config {\n    access_config {\n      network_configs {\n        subnet = var.subnet\n      }\n    }\n  }\n}\n\nresource \"google_managed_kafka_topic\" \"example\" {\n  topic_id          = var.topic_id\n  cluster           = google_managed_kafka_cluster.cluster.cluster_id\n  location          = var.region\n  partition_count   = var.partition_count\n  replication_factor = var.replication_factor\n  configs = {\n    \"cleanup.policy\" = var.cleanup_policy\n  }\n}\n```\n\n### `variable.tf`\n\nThis file defines the necessary variables for the Kafka cluster configuration.\n\n```hcl\nvariable \"cluster_id\" {\n  description = \"The ID of the Kafka cluster.\"\n  type        = string\n  default     = \"my-cluster\"\n}\n\nvariable \"region\" {\n  description = \"The region to deploy the Kafka cluster.\"\n  type        = string\n  default     = \"us-central1\"\n}\n\nvariable 
\"vcpu_count\" {\n  description = \"Number of vCPUs for Kafka cluster capacity.\"\n  type        = number\n  default     = 3\n}\n\nvariable \"memory_bytes\" {\n  description = \"Memory in bytes for Kafka cluster capacity.\"\n  type        = number\n  default     = 3221225472\n}\n\nvariable \"subnet\" {\n  description = \"Subnetwork to attach the Kafka cluster to.\"\n  type        = string\n  default     = \"projects/valid-verbena-437709-h5/regions/us-central1/subnetworks/default\"\n}\n\nvariable \"topic_id\" {\n  description = \"The ID of the Kafka topic.\"\n  type        = string\n  default     = \"example-topic\"\n}\n\nvariable \"partition_count\" {\n  description = \"Number of partitions for the Kafka topic.\"\n  type        = number\n  default     = 2\n}\n\nvariable \"replication_factor\" {\n  description = \"Replication factor for the Kafka topic.\"\n  type        = number\n  default     = 3\n}\n\nvariable \"cleanup_policy\" {\n  description = \"Cleanup policy for the Kafka topic.\"\n  type        = string\n  default     = \"compact\"\n}\n```\n\n## Dataflow Pipeline\n\nThe Dataflow pipeline reads data from the Kafka topic and writes it to BigQuery. This setup allows for continuous streaming of data for analysis and reporting.\n\n### Key steps:\n1. **Reading Data**: Data is consumed from the Kafka topic.\n2. **Processing Data**: Data is processed and transformed as per the requirements (e.g., filtering, aggregation).\n3. **Writing to BigQuery**: After processing, the data is written to BigQuery tables for querying and analysis.\n\n## Conclusion\n\nThe **K2BQ** project simplifies the integration between Kafka and BigQuery, offering a robust solution for real-time data processing. With Terraform managing the infrastructure setup and Dataflow handling the pipeline, this solution provides scalability, performance, and ease of maintenance for your data streaming needs.\n\n## License\n\nThis project is licensed under the Pirate-Emperor License. 
See the [LICENSE](LICENSE) file for details.\n\n## Author\n\n**Pirate-Emperor**\n\n[![Twitter](https://skillicons.dev/icons?i=twitter)](https://twitter.com/PirateKingRahul)\n[![Discord](https://skillicons.dev/icons?i=discord)](https://discord.com/users/1200728704981143634)\n[![LinkedIn](https://skillicons.dev/icons?i=linkedin)](https://www.linkedin.com/in/piratekingrahul)\n\n[![Reddit](https://img.shields.io/badge/Reddit-FF5700?style=for-the-badge\u0026logo=reddit\u0026logoColor=white)](https://www.reddit.com/u/PirateKingRahul)\n[![Medium](https://img.shields.io/badge/Medium-42404E?style=for-the-badge\u0026logo=medium\u0026logoColor=white)](https://medium.com/@piratekingrahul)\n\n- GitHub: [Pirate-Emperor](https://github.com/Pirate-Emperor)\n- Reddit: [PirateKingRahul](https://www.reddit.com/u/PirateKingRahul/)\n- Twitter: [PirateKingRahul](https://twitter.com/PirateKingRahul)\n- Discord: [PirateKingRahul](https://discord.com/users/1200728704981143634)\n- LinkedIn: [PirateKingRahul](https://www.linkedin.com/in/piratekingrahul)\n- Skype: [Join Skype](https://join.skype.com/invite/yfjOJG3wv9Ki)\n- Medium: [PirateKingRahul](https://medium.com/@piratekingrahul)\n\n---","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpirate-emperor%2Fk2bq","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpirate-emperor%2Fk2bq","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpirate-emperor%2Fk2bq/lists"}