{"id":19537081,"url":"https://github.com/badal-io/kylin-gcp","last_synced_at":"2025-04-26T14:37:16.211Z","repository":{"id":115366445,"uuid":"169698841","full_name":"badal-io/kylin-gcp","owner":"badal-io","description":"A POC showing how Apache Kylin can be integrated with Google Cloud Platform","archived":true,"fork":false,"pushed_at":"2019-02-12T23:36:46.000Z","size":66,"stargazers_count":5,"open_issues_count":0,"forks_count":3,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-03-14T12:15:18.696Z","etag":null,"topics":["apache","apache-kylin","bigtable","dataflow","google","google-cloud-platform","hbase"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/badal-io.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-02-08T07:25:35.000Z","updated_at":"2023-12-22T00:45:11.000Z","dependencies_parsed_at":null,"dependency_job_id":"8dd12d67-b4cd-4d77-8667-b7178a42e3d0","html_url":"https://github.com/badal-io/kylin-gcp","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/badal-io%2Fkylin-gcp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/badal-io%2Fkylin-gcp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/badal-io%2Fkylin-gcp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/badal-io%2Fkylin-gcp/manifests","owner_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/owners/badal-io","download_url":"https://codeload.github.com/badal-io/kylin-gcp/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251002035,"owners_count":21521089,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache","apache-kylin","bigtable","dataflow","google","google-cloud-platform","hbase"],"created_at":"2024-11-11T02:26:08.586Z","updated_at":"2025-04-26T14:37:16.178Z","avatar_url":"https://github.com/badal-io.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Kylin on GCP\nThis project demonstrates the use of [Apache Kylin](http://kylin.apache.org/) on\n[GCP](https://cloud.google.com/gcp/) backed by [Dataproc](https://cloud.google.com/dataproc/).\n\n## Prerequisites\nTo get started you will need a GCP project; to create one, see [here](https://cloud.google.com/resource-manager/docs/creating-managing-projects).\nThis demo uses [Terraform](https://learn.hashicorp.com/terraform/getting-started/install.html)\nalong with the [GCP provider](https://www.terraform.io/docs/providers/google/index.html).\nTerraform can be installed on your machine directly or run via [Docker](https://hub.docker.com/r/hashicorp/terraform/tags). The demo can also\nbe run from [Cloud Shell](https://cloud.google.com/shell/).\n\n## Demo\n\n### Create Hadoop/Kylin cluster\nThe Terraform components assume that a GCP project exists from which to deploy\nthe Kylin cluster. 
They also assume the existence of a GCS bucket in which to store\nTerraform state, configured through what Terraform calls a [backend](https://www.terraform.io/docs/backends/index.html).\nIf you do not wish to use the included [GCS backend configuration](https://www.terraform.io/docs/backends/types/gcs.html),\nsimply comment out the following block in the `gcp.tf` file (note: backend configuration\ncannot be populated by TF-variables as of v0.11.11):\n```\n# comment this out to store Terraform state locally\n# otherwise, enter the name of the ops-bucket in this configuration\nterraform {\n  backend \"gcs\" {\n    bucket  = \"\u003cOPS_BUCKET_NAME\u003e\"\n    prefix  = \"terraform/state/kylin\"\n  }\n}\n```\nThe `kylin.tf` file contains the deployment configuration for the Kylin cluster.\nTo deploy it, first provide deployment info via environment variables (note: you\nmay elect to use a different method of providing TF-variables, see [here](https://www.terraform.io/docs/configuration/variables.html)):\n```\n# ID of existing project to deploy to\nexport GOOGLE_PROJECT=\"\u003cYOUR_PROJECT\u003e\"\n# name of resource bucket (created by deployment)\nexport TF_VAR_resource_bucket=\"${GOOGLE_PROJECT}\"\n# master node for shell access to dataproc cluster\nexport MASTER='kylin-m-2'\n```\nNow that the deployment has its parameters, verify the deployment plan using:\n```\nterraform init # only required once\nterraform plan\n```\nand deploy using:\n```\nterraform apply\n# use \"-auto-approve\" to skip prompt\n```\nIf successful, the deployment will create a resource bucket and upload the\nnecessary `init-scripts`, then create the Kylin cluster.\n\nOnce the cluster is up and running, start Kylin on one of the nodes; note that a\nmaster node is used in this example, but any node can be selected:\n```sh\ngcloud compute ssh ${MASTER} --command=\"source /etc/profile \u0026\u0026 kylin.sh start\"\n```\nThis will start the Kylin service. 
With Kylin installed and running you can now\ntunnel to the master node to bring up the Kylin UI:\n```sh\ngcloud compute ssh ${MASTER} -- -L 7070:${MASTER}:7070\n# then open browser to http://localhost:7070/kylin\n# default creds ADMIN/KYLIN\n```\n\n### Writing data to the cluster\nKylin gets its data from structured sources such as [Hive](https://hive.apache.org/)\nand other JDBC/ODBC compliant sources. Hive in turn relies on the Hadoop\nfilesystem [HDFS](https://hadoop.apache.org/) to store the underlying data.\nIn standard (on-prem) Hadoop deployments, HDFS would be deployed across the\ndisks attached to the workers of the Hadoop cluster. In GCP Dataproc however,\nthe HDFS filesystem can be (and by default is) backed by the GCS object storage\nservice using the [Cloud Storage Connector](https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage).\nThis decouples storage from compute resources: storage is cheaper and more\nscalable than persistent disks, and cluster nodes (or at least the workers) can\nbe ephemeral because they are used only for computation and not for persistent storage.\n\nBuilding on this, Hive can be used by creating table/schema definitions\nand then pointing them to \"directories\" in HDFS (GCS). Once a table is created,\ndata can be written to GCS from any source using the standard GCS APIs without\nthe need to use any Hive/Hadoop interfaces. 
To demonstrate this, create a table\nwith a schema pointing to a path in GCS:\n```sh\n# example creation of hive table\n# (assumes BUCKET is set to the name of the GCS bucket that will hold the data)\nhive -e \"CREATE EXTERNAL TABLE test\n(\nunique_key STRING,\ncomplaint_type STRING,\ncomplaint_description STRING,\nowning_department STRING,\nsource STRING,\nstatus STRING,\nstatus_change_date TIMESTAMP,\ncreated_date TIMESTAMP,\nlast_update_date TIMESTAMP,\nclose_date TIMESTAMP,\nincident_address STRING,\nstreet_number STRING,\nstreet_name STRING,\ncity STRING,\nincident_zip INTEGER,\ncounty STRING,\nstate_plane_x_coordinate STRING,\nstate_plane_y_coordinate FLOAT,\nlatitude FLOAT,\nlongitude FLOAT,\nlocation STRING,\ncouncil_district_code INTEGER,\nmap_page STRING,\nmap_tile STRING\n)\nrow format delimited\nfields terminated by ','\nLOCATION 'gs://${BUCKET}/data/311/csv/' ;\"\n\n```\nThe above can be run on one of the Dataproc nodes (`gcloud compute ssh ${MASTER}`)\nor using `gcloud dataproc jobs submit ...`. Note that we are using CSV format\nfor this demo; however, there are many options for serialization with\nHDFS/Hive. For more info see [SerDe](https://cwiki.apache.org/confluence/display/Hive/SerDe).\n\nTo populate the table location path with data, run the following Dataflow\njob, which extracts public 311-service-request data for the city of\nAustin, TX from BigQuery:\n```sh\ncd dataflow\n./gradlew runLocal # run Apache Beam job locally\n# OR\n./gradlew runFlow # submit job to run on GCP Dataflow\n```\nNote that this Beam job simply extracts BigQuery data via a query,\nconverts it to CSV form, and writes it to GCS. 
Writing data to Hive can\nalso be achieved in Beam using the [HCatalog IO](https://beam.apache.org/documentation/io/built-in/hcatalog/); however, this is\nnot recommended if the primary query engine is Kylin and not Hive.\n\nAs data is added to the path, new queries will pick up this new data:\n```sh\nhive -e \"select count(unique_key) from test;\"\n```\n\nWith tables created and populated, return to the Kylin web interface; if the\nsession has terminated, restart the tunnel:\n```sh\ngcloud compute ssh ${MASTER} -- -L 7070:${MASTER}:7070\n# open browser to http://localhost:7070/kylin\n# default creds ADMIN/KYLIN\n```\n\nFrom here, the table can be loaded in Kylin:\n - `Model` tab -\u003e `Data Sources` sub-tab\n - `Load Table` (blue button)\n - enter table name \"test\"\n - Under `Tables` the Hive database `DEFAULT` should be displayed with the `test` table just loaded\n\nOnce tables are loaded you can continue to create models/cubes within the Kylin\ninterface; for more, see the [Kylin Docs](http://kylin.apache.org/docs/tutorial/create_cube.html).\n\n## Cleanup\nTo delete the Kylin cluster and associated resources:\n```\nterraform destroy # (yes at prompt)\n```\nOR delete the project entirely to ensure no other resources are incurring costs.\n\n## Future Options:\n- [ ] Secure cluster and access\n- [ ] configure Kylin init-action for HA-mode (with load-balancing)\n- [ ] Autoscaling setup\n- [ ] BigTable (HBase) cube storage substitution\n- [ ] persistent disk HDFS substitution\n- [ ] Spark substitution\n- [ ] Streaming cube (Kafka) sample\n- [ ] JDBC source sample\n- [ ] Hadoop resource optimization (disk, cpu, preemptible workers, etc)\n- [ ] cube creation sample (real-world data)\n- [ ] [hive-metastore Cloud-SQL 
substitution](https://cloud.google.com/solutions/using-apache-hive-on-cloud-dataproc)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbadal-io%2Fkylin-gcp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbadal-io%2Fkylin-gcp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbadal-io%2Fkylin-gcp/lists"}