{"id":13783346,"url":"https://github.com/gojekfarm/beast","last_synced_at":"2025-10-04T19:31:30.690Z","repository":{"id":48483233,"uuid":"170804978","full_name":"gojekfarm/beast","owner":"gojekfarm","description":"[Deprecated] Load data from Kafka to any data warehouse. BQ sink is being supported in Firehose now. https://github.com/odpf/firehose","archived":true,"fork":false,"pushed_at":"2022-02-11T17:46:32.000Z","size":643,"stargazers_count":147,"open_issues_count":13,"forks_count":23,"subscribers_count":25,"default_branch":"master","last_synced_at":"2024-05-18T22:20:56.110Z","etag":null,"topics":["beast","bigquery","dataops","kafka","warehouse"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gojekfarm.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-02-15T05:06:46.000Z","updated_at":"2023-07-12T08:12:05.000Z","dependencies_parsed_at":"2022-09-16T09:12:42.236Z","dependency_job_id":null,"html_url":"https://github.com/gojekfarm/beast","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gojekfarm%2Fbeast","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gojekfarm%2Fbeast/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gojekfarm%2Fbeast/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gojekfarm%2Fbeast/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gojekfarm","download_url":"https://codeload.github.com/gojekfarm/beast
/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":234892696,"owners_count":18902907,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["beast","bigquery","dataops","kafka","warehouse"],"created_at":"2024-08-03T19:00:19.361Z","updated_at":"2025-10-04T19:31:25.356Z","avatar_url":"https://github.com/gojekfarm.png","language":"Java","funding_links":[],"categories":["Development"],"sub_categories":["Connectors"],"readme":"\n# Beast is deprecated. Big Query sink is supported in [Firehose](https://github.com/odpf/firehose) now. \n\nBeast is not supported. We recommend using Firehose for sinking Kafka data to BigQuery. \n\n## Architecture\n\n* **Consumer**:\n    Consumes messages from kafka in batches, and pushes these batches to Read \u0026 Commit queues. These queues are blocking queues, i.e, no more messages will be consumed if the queue is full. (This is configurable based on poll timeout)\n* **BigQuery Worker**:\n    Polls messages from the read queue, and pushes them to BigQuery. If the push operation was successful, BQ worker sends an acknowledgement to the Committer.\n* **Committer**:\n    Committer receives the acknowledgements of successful push to BigQuery from BQ Workers. All these acknowledgements are stored in a set within the committer. Committer polls the commit queue for message batches. 
If that batch is present in the set, i.e., the batch has been successfully pushed to BQ, then it commits the max offset for that batch back to Kafka and pops the batch from the commit queue \u0026 set.\n\n\u003cbr\u003e\u003cdiv style=\"text-align:center;width: 90%; margin:auto;\"\u003e\u003cimg src=\"docs/images/architecture.png\" alt=\"\"\u003e\u003c/div\u003e\u003cbr\u003e\n\n* **Dead Letters**:\n    Beast provides a pluggable GCS (Google Cloud Storage) component to store invalid, out-of-range messages that are rejected by BigQuery. Primarily, messages that are partitioned on a timestamp field and contain an out-of-range timestamp (year-old data or data 6 months in the future) on the partition key are considered invalid. Without a handler for these messages, Beast stops processing; stopping on out-of-range data is the default behaviour. The GCS component can be turned on by supplying environment variables as below.\n    ```\n    ENABLE_GCS_ERROR_SINK=true\n    GCS_BUCKET=\u003cgoogle cloud store bucket name\u003e\n    GCS_PATH_PREFIX=\u003cprefix path under the bucket\u003e\n    GCS_WRITER_PROJECT_NAME=\u003cgoogle project having bucket\u003e\n    ```\n    The handler partitions the invalid messages on GCS based on the message arrival date in the format `\u003cdt=yyyy-MM-dd\u003e`. 
The location of invalid messages on GCS would ideally be `\u003cGCS_WRITER_PROJECT_NAME\u003e/\u003cGCS_BUCKET\u003e/\u003cGCS_PATH_PREFIX\u003e/\u003cdt=yyyy-MM-dd\u003e/\u003ctopicName\u003e/\u003crandom-uuid\u003e` where\n    - `\u003ctopicName\u003e` - is the topic that has the invalid messages\n    - `\u003crandom-uuid\u003e` - is the name of the file\n\n## Building \u0026 Running\n\n### Prerequisites\n* A Kafka cluster with messages pushed in proto format, which Beast can consume\n* A BigQuery project with streaming permission\n* A table created for the message proto\n* A configuration with the column mapping for the above table, configured in the env file\n* An env file updated with the BigQuery, Kafka, and application parameters\n\n## Run locally:\n```\ngit clone https://github.com/odpf/beast\nexport $(cat ./env/sample.properties | xargs -L1) \u0026\u0026 gradle clean runConsumer\n```\n\n## Run with Docker\nThe image is available on Docker Hub under [odpf](https://hub.docker.com/r/odpf/beast).\n\n```\nexport TAG=release-0.1.1\ndocker run --env-file beast.env -v ./local_dir/project-secret.json:/var/bq-secret.json -it odpf/beast:$TAG\n```\n* `-v` mounts the local secret file `project-secret.json` at the given location inside the container; `GOOGLE_CREDENTIALS` should point to the same `/var/bq-secret.json`, which is used for BQ authentication.\n* `TAG`: you can update the tag if you want the latest image; the tag mentioned above is well tested.\n\n## Running on Kubernetes\n\nCreate a Beast deployment for each Kafka topic that needs to be pushed to BigQuery.\n* A deployment can have multiple instances of Beast\n* A Beast container consists of the following threads:\n  - A Kafka consumer\n  - Multiple BQ workers\n  - A committer\n* The deployment also includes a telegraf container which pushes stats metrics\n\nFollow the [instructions](https://github.com/gojektech/charts/tree/master/incubator/beast) in the [charts](https://github.com/gojektech/charts) repository for Helm deployment.\n\n## BQ 
Setup:\nGiven a [TestMessage](./src/test/proto/TestMessage.proto) proto file, you can create a BigQuery table with the [schema](./docs/test_messages.schema.json)\n```\n# create new table from schema\nbq mk --table \u003cproject_name\u003e:dataset_name.test_messages ./docs/test_messages.schema.json\n\n# query total records\nbq query --nouse_legacy_sql 'SELECT count(*) FROM `\u003cproject_name\u003e.dataset_name.test_messages` LIMIT 10'\n\n# update bq schema from local schema json file\nbq update --format=prettyjson \u003cproject_name\u003e:dataset_name.test_messages  booking.schema\n\n# dump the schema of table to file\nbq show --schema --format=prettyjson \u003cproject_name\u003e:dataset_name.test_messages \u003e test_messages.schema.json\n```\n\n## Produce messages to Kafka\nYou can generate messages from TestMessage.proto with [sample-kafka-producer](https://github.com/gojekfarm/sample-kafka-producer), which pushes N messages.\n\n## Running Stencil Server\n* Run the shell script `./run_descriptor_server.sh` to build the descriptor in the `build` directory and start a Python server on `:8000`\n* stencil url can be configured to `curl http://localhost:8000/messages.desc`\n\n\n# Contribution\n\n* You could raise issues or clarify the questions\n* You could raise a PR for any feature/issues\n* You could help us with documentation\n\nTo run and test locally:\n```\ngit clone https://github.com/odpf/beast\nexport $(cat ./env/sample.properties | xargs -L1) \u0026\u0026 ./gradlew test\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgojekfarm%2Fbeast","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgojekfarm%2Fbeast","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgojekfarm%2Fbeast/lists"}