{"id":20629684,"url":"https://github.com/agile-lab-dev/witboost-cdp-impala-specific-provisioner","last_synced_at":"2025-10-24T12:14:25.366Z","repository":{"id":222460116,"uuid":"753569949","full_name":"agile-lab-dev/witboost-cdp-impala-specific-provisioner","owner":"agile-lab-dev","description":null,"archived":false,"fork":false,"pushed_at":"2024-11-07T09:32:37.000Z","size":23780,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-01-17T07:05:53.051Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/agile-lab-dev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-06T11:38:58.000Z","updated_at":"2024-11-07T09:32:40.000Z","dependencies_parsed_at":"2024-06-12T20:02:06.297Z","dependency_job_id":"f33838b8-1012-4c03-ac61-8f162ff0846b","html_url":"https://github.com/agile-lab-dev/witboost-cdp-impala-specific-provisioner","commit_stats":null,"previous_names":["agile-lab-dev/witboost-cdp-impala-specific-provisioner"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agile-lab-dev%2Fwitboost-cdp-impala-specific-provisioner","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agile-lab-dev%2Fwitboost-cdp-impala-specific-provisioner/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agile-lab-dev%2Fwitboost-cdp-impala-specific-provisioner/rel
eases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agile-lab-dev%2Fwitboost-cdp-impala-specific-provisioner/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/agile-lab-dev","download_url":"https://codeload.github.com/agile-lab-dev/witboost-cdp-impala-specific-provisioner/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242583021,"owners_count":20153368,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-16T14:05:40.427Z","updated_at":"2025-10-24T12:14:25.299Z","avatar_url":"https://github.com/agile-lab-dev.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cbr/\u003e\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://www.witboost.com/\"\u003e\n        \u003cimg src=\"docs/img/witboost_logo.svg\" alt=\"witboost\" width=600 \u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\u003cbr/\u003e\n\nDesigned by [Agile Lab](https://www.agilelab.it/), Witboost is a versatile platform that addresses a wide range of sophisticated data engineering challenges. It enables businesses to discover, enhance, and productize their data, fostering the creation of automated data platforms that adhere to the highest standards of data governance. Want to know more about Witboost? 
Check it out [here](https://www.witboost.com/) or [contact us!](https://witboost.com/contact-us)\n\nThis repository is part of our [Starter Kit](https://github.com/agile-lab-dev/witboost-starter-kit) meant to showcase Witboost's integration capabilities and provide a \"batteries-included\" product.\n\n# CDP Impala Specific Provisioner\n\n[![pipeline status](https://gitlab.com/AgileFactory/Witboost.Mesh/Provisioning/cdp-refresh/witboost.Mesh.Provisioning.OutputPort.CDP.Impala/badges/master/pipeline.svg)](https://gitlab.com/AgileFactory/Witboost.Mesh/Provisioning/cdp-refresh/witboost.Mesh.Provisioning.OutputPort.CDP.Impala/-/commits/master)  \n[![coverage report](https://gitlab.com/AgileFactory/Witboost.Mesh/Provisioning/cdp-refresh/witboost.Mesh.Provisioning.OutputPort.CDP.Impala/badges/master/coverage.svg?min_medium=60)](https://gitlab.com/AgileFactory/Witboost.Mesh/Provisioning/cdp-refresh/witboost.Mesh.Provisioning.OutputPort.CDP.Impala/-/commits/master)\n\n- [Overview](#overview)\n- [Building](#building)\n- [Running](#running)\n- [Configuring](#configuring)\n- [Deploying](#deploying)\n- [How it works](#how-it-works)\n- [HLD](docs/HLD.md)\n- [API specification](docs/API.md)\n\n## Overview\n\nThis project implements a Specific Provisioner deploying Output Ports and Storage Areas* (as External Tables or Views) on Apache Impala hosted on a Cloudera Data Platform environment. It supports both CDP Public Cloud with Cloudera Data Warehouse (CDW) using Impala and Amazon Web Services (AWS) S3 storage, and CDP Private Cloud using Impala and HDFS. 
After deploying this microservice and configuring Witboost to use it, the platform can create Output Ports and Storage Areas* on existing CSV or Parquet tables leveraging an existing Impala instance.\n\n\u003e As of now, this provisioner can only deploy View Output Ports and Storage Areas on CDP Private Cloud environments.\n\nSpecifically, this provisioner can create:\n- CDP Public:\n  - Output Ports as External tables, letting you define a schema, an HDFS/S3 location, the format of the data files, and extra TBLPROPERTIES.\n- CDP Private:\n  - All components deployable on CDP Public mentioned above\n  - Storage Areas as External tables, letting you define a schema, an HDFS/S3 location, the format of the data files, and extra TBLPROPERTIES.\n  - Storage Areas as views, defined by a custom SQL statement provided by the user\n  - Output Ports as simple 1:1 views from a source table, defining a schema as the set of columns to be queried.\n\n### What's a Specific Provisioner?\n\nA Specific Provisioner is a microservice in charge of deploying components that use a specific technology. When the deployment of a Data Product is triggered, the platform generates its descriptor and orchestrates the deployment of every component contained in the Data Product. For every such component, the platform knows which Specific Provisioner is responsible for its deployment and sends it a provisioning request with the descriptor, so that the Specific Provisioner can perform whatever operation is required to fulfill the request and report the outcome back to the platform.\n\nYou can learn more about how the Specific Provisioners fit in the broader picture [here](https://docs.witboost.agilelab.it/docs/p2_arch/p1_intro/#deploy-flow).\n\n### Software stack\n\nThis microservice is written in Scala 2.13, using HTTP4s and Guardrail for the HTTP layer. 
The project is built with SBT and supports packaging as JAR, fat-JAR and Docker image, ideal for Kubernetes deployments (which is the preferred option).\n\nThis is a multi-module sbt project:\n\n* **api**: Contains the API layer of the service. The latter can be invoked synchronously in 3 different ways:\n    1. POST /provision: provisions the Impala output port/storage area specified in the request payload. It synchronously calls the `service` module to perform the provisioning logic.\n    2. POST /validate: validates the request payload and returns a validation result. It should be invoked before provisioning a resource to check that the request is correct.\n    3. POST /updateacl: updates user access to the provisioned resources, only for output ports.\n* **core**: Contains model case classes and logic shared among the modules\n* **service**: Contains the Provisioner Service logic. It is called from the API layer after some checks on the request and returns the deployed resource. This is the module that provisions the output port/storage area\n\nIn this project we are using the following sbt plugins:\n1. **scalaformat**: To keep the Scala style aligned across all collaborators\n2. **wartRemover**: To keep the code as functional as possible\n3. **scoverage**: To create a test coverage report\n4. **k8tyGitlabPlugin**: To publish the packages to the GitLab Package Registry\n\n### Artifacts\n\nWe produce three different artifacts on the CI/CD for this repository:\n1. The scoverage report, which you can download from the CI/CD to check the test coverage\n2. A Docker image published in the GitLab Container Registry\n3. 
A set of jars, one for each module, published in the Maven GitLab Package Registry\n\n\n## Building\n\n**Requirements:**\n\n- Java \u003e=11\n- sbt\n\nThis project also depends on the Witboost library [scala-mesh-commons](https://github.com/agile-lab-dev/witboost-scala-mesh-commons), published as open source on Maven Central.\n\n**Generating sources:** this project uses OpenAPI as its standard API specification and the [sbt-guardrail](https://github.com/guardrail-dev/sbt-guardrail) plugin to generate server code from the [specification](./api/src/main/openapi/interface-specification.yml).\n\nThe code generation is done automatically in the compile phase:\n\n```bash\nsbt compile\n```\n\n### Test\n\n**Tests:** are handled by the standard task as well:\n\n```bash\nsbt test\n```\n\n### CI/CD\n\nCommitting and pushing triggers the CI/CD pipeline; the test and build phases run on every push. The CI/CD uses the job token to push the dependency libraries.\nDev deployments run only for the master branch.\nProd deployments run only for the release branch.\nYou can double-check the artifacts that will be deployed by downloading from the CI/CD the artifacts.zip that was cached during the test/build stages.\n\n### How to collaborate\n\nWe recommend using IntelliJ IDEA Community Edition for developing this project.\nYou are free to use your favorite IDE. Please remember to add the IDE-specific files to .gitignore.\n\nIf you fork this repository, please modify the [project settings](./project/Settings.scala) with the appropriate GitLab project ID to avoid pushing artifacts to the wrong repository.\n\n\n### Scala style\n\nLeverage the scalaformat library to reformat the code while editing. 
This applies the Scala format specification defined in `.scalafmt.conf` and avoids spurious changes in merge requests.\n\nWe added additional compilation rules using the wartRemover library, so if any errors are raised at compile time, please fix them.\n\n## Running\n\nTo run the server, you need to set up the necessary environment variables to access CDP and the AWS environment. This Specific Provisioner uses the following SDKs:\n\n- **CDP SDK**: please refer to the [official documentation](https://docs.cloudera.com/cdp-public-cloud/cloud/sdk/topics/mc-overview-of-the-cdp-sdk-for-java.html) to set up the access credentials (only required for CDP Public Cloud).\n- **AWS SDK**: please refer to the [official documentation](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/setup-basics.html) to set up the access credentials (only required for CDP Public Cloud).\n\nFor example, for local execution you need to set the following environment variables:\n\n```\n# AWS configuration is only required for CDP Public Cloud\nexport AWS_REGION=\u003caws_region\u003e\nexport AWS_ACCESS_KEY_ID=\u003caws_access_key_id\u003e\nexport AWS_SECRET_ACCESS_KEY=\u003caws_secret_access_key\u003e\nexport AWS_SESSION_TOKEN=\u003caws_session_token\u003e\n\nexport CDP_DEPLOY_ROLE_USER=\u003ccdp_user\u003e\nexport CDP_DEPLOY_ROLE_PASSWORD=\u003ccdp_password\u003e\nexport CDP_ACCESS_KEY_ID=\u003ccdp_user_access_key_id\u003e # Only required for CDP Public Cloud\nexport CDP_PRIVATE_KEY=\u003ccdp_user_private_key\u003e # Only required for CDP Public Cloud\n```\n\nThis provisioner uses two sets of credentials to perform operations on Apache Ranger and Apache Impala. 
The default configuration sets them both equal to the environment variables `CDP_DEPLOY_ROLE_USER` and `CDP_DEPLOY_ROLE_PASSWORD`, so that only one user is initially necessary, but the Ranger credentials can be overridden via configuration if they need to be different (see [Configuring](#configuring)).\n\nThe CDP users must be of type `Machine User` and must meet some requirements that depend on the type of CDP Cloud.\n\n### CDP Public Cloud\n\nOn **CDP Public**, the user needs at least the following roles:\n- Impala:\n  - DWAdmin\n  - DWUser\n- Ranger:\n  - EnvironmentAdmin\n  - EnvironmentUser\n  \nAs an alternative to the `EnvironmentAdmin` role, the Machine User for Ranger must have the necessary permissions to manage Ranger, specifically creating/updating/retrieving/deleting Security Zones, Roles, Resource-based Policies and the resources related to them. If the same user is used for both services, it must have the four roles and/or permissions.\n\n### CDP Private Cloud\n\nOn **CDP Private**, the deploy user needs admin privileges on Ranger, as well as the following permissions (e.g. through Ranger policies):\n- `read`, `write`, `execute` permissions on the HDFS directory to be used\n- `all` permissions on the Impala databases and tables to be used\n  \nHowever, if Impala authenticates via Kerberos, as it does in most cases, the only set of credentials needed is the one used to access Ranger; for Impala, a valid keytab with a principal with service name `impala` is required, along with the necessary Kerberos configuration files (see [Configuring](#configuring)).\n\nAfter this, execute:\n\n```bash\nsbt compile run\n```\n\nBy default, the server binds to port 8093 on localhost. After it's up and running you can make provisioning requests to this address.\n\n## Configuring\n\nMost application configurations are handled with the Typesafe Config library. You can find the default settings in the `reference.conf` of each module. 
Customize them and use the `config.file` system property or the other options provided by Typesafe Config according to your needs. The provided Docker image expects the config file mounted at path `/config/application.conf`.\n\nFor CDP Private Cloud in particular, a set of required configuration fields must be modified, such as the Ranger and HDFS base URLs.\n\nFurthermore, you can specify via configuration a set of values called \"publicInfo\" and \"privateInfo\" to return valuable information to the user about the resources deployed by the provisioner at provision time.\n\nFor more information on the configuration and to understand how to set up the provisioner for a specific type of CDP Cloud, see [Configuring the Impala Specific Provisioner](docs/Configuration.md).\n\n### Helm chart configuration\n\n#### CDP Public vs. CDP Private\n\nThe chart provides a couple of configurations to set up the provisioner to work on either CDP Public Cloud or CDP Private Cloud. `private.enabled` sets the necessary environment variables that the provisioner needs in order to work (see [Running](#running)). Setting it to `true` removes the Access Key and Private Key used by the Cloudera SDK to contact the public cloud.\n\nThe second configuration, `kerberos.enabled`, sets the system properties the provisioner needs to authenticate on a Kerberos system to services like Impala. For this, the provisioner expects a `jaas.conf` file and a `krb5.conf` file. For more information about these files see [Configuring the Impala Specific Provisioner](./docs/Configuration.md#jdbc-configuration). You can provide override values for these files using the `kerberos.krb5Override` and `kerberos.jaasOverride` fields.\n\n#### Custom Root CA\n\nThe chart provides the option `customCA.enabled` to add a custom Root Certification Authority to the JVM truststore. If this option is enabled, the chart will load the custom CA from a secret with key `cdp-private-impala-custom-ca`. 
The CA is expected to be in a format compatible with `keytool` utility (PEM works fine).\n\n## Deploying\n\nThis microservice is meant to be deployed to a Kubernetes cluster.\n\n## How it works\n\n1. Parse the request body\n2. Retrieve impala coordinator host and ranger host from either the CDP environment (CDP Public), or the provisioner configuration (CDP Private).\n3. Create the impala resource (table or view)\n4. Upsert the ranger security zone for the specific data product version\n5. Upsert ranger roles for owners of the component; and for Output Ports a role for users as well.\n6. Upsert access policies for said roles, granting read/write access to the owner role, and read-only to the user role\n7. Return the deployed resource\n\n## Descriptor Input\n\nThe Impala Specific Provisioner receives a yaml-descriptor containing a data contract schema and a specific field with the information of the table or view to be deployed. It allows defining\n\n- Data contract schema. OpenMetadata Column schema defining the schema of the table or view to be created\n- Database name: Database to be created to handle the component tables\n- Table name: Table name to be created, or when provisioning a view, the name of the table exposed by the view\n- View name: Sent when provisioning a view to define its name\n- Format: Format of the data files an external table exposes. 
Only required for table creation\n- Location: Location in S3 (CDP Public) or HDFS (CDP Private) where the data files are located\n- Partitions: List of columns used to partition the data\n- Table parameters: Extra table parameters to define TBLPROPERTIES, text file delimiter and header, etc.\n- Custom DML Statement: Storage Areas as views can be created by using a query provided by the user.\n\nFor the specification of schema of this object, check out [Descriptor Input](docs/DescriptorInput.md)\n\n## License\n\nThis project is available under the [Apache License, Version 2.0](https://opensource.org/licenses/Apache-2.0); see [LICENSE](LICENSE) for full details.\n\n## About Witboost\n\n[Witboost](https://witboost.com/) is a cutting-edge Data Experience platform, that streamlines complex data projects across various platforms, enabling seamless data production and consumption. This unified approach empowers you to fully utilize your data without platform-specific hurdles, fostering smoother collaboration across teams.\n\nIt seamlessly blends business-relevant information, data governance processes, and IT delivery, ensuring technically sound data projects aligned with strategic objectives. Witboost facilitates data-driven decision-making while maintaining data security, ethics, and regulatory compliance.\n\nMoreover, Witboost maximizes data potential through automation, freeing resources for strategic initiatives. 
Apply your data for growth, innovation and competitive advantage.\n\n[Contact us](https://witboost.com/contact-us) or follow us on:\n\n- [LinkedIn](https://www.linkedin.com/showcase/witboost/)\n- [YouTube](https://www.youtube.com/@witboost-platform)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fagile-lab-dev%2Fwitboost-cdp-impala-specific-provisioner","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fagile-lab-dev%2Fwitboost-cdp-impala-specific-provisioner","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fagile-lab-dev%2Fwitboost-cdp-impala-specific-provisioner/lists"}