{"id":26252807,"url":"https://github.com/ricardolsmendes/gcp-datacatalog-python","last_synced_at":"2025-04-24T06:06:10.762Z","repository":{"id":45018149,"uuid":"192745241","full_name":"ricardolsmendes/gcp-datacatalog-python","owner":"ricardolsmendes","description":"Python samples to help Data Citizens who work with Google Cloud Data Catalog","archived":false,"fork":false,"pushed_at":"2022-03-27T04:22:26.000Z","size":108,"stargazers_count":10,"open_issues_count":0,"forks_count":7,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-24T06:06:04.221Z","etag":null,"topics":["bigdata","csv-import","data-governance","datacatalog","gcp","gcp-datacatalog","python"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ricardolsmendes.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":".github/CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":".github/CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-06-19T14:09:56.000Z","updated_at":"2023-12-23T01:00:40.000Z","dependencies_parsed_at":"2022-09-24T21:31:36.547Z","dependency_job_id":null,"html_url":"https://github.com/ricardolsmendes/gcp-datacatalog-python","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ricardolsmendes%2Fgcp-datacatalog-python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ricardolsmendes%2Fgcp-datacatalog-python/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ricardolsmendes%2Fgcp-datacatalog-python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/ricardolsmendes%2Fgcp-datacatalog-python/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ricardolsmendes","download_url":"https://codeload.github.com/ricardolsmendes/gcp-datacatalog-python/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250573354,"owners_count":21452350,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigdata","csv-import","data-governance","datacatalog","gcp","gcp-datacatalog","python"],"created_at":"2025-03-13T17:28:18.433Z","updated_at":"2025-04-24T06:06:10.745Z","avatar_url":"https://github.com/ricardolsmendes.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# gcp-datacatalog-python\n\nSelf-contained ready-to-use Python scripts to help Data Citizens who work with\n[Google Cloud Data Catalog][1].\n\n[![license](https://img.shields.io/github/license/ricardolsmendes/gcp-datacatalog-python.svg)](https://github.com/ricardolsmendes/gcp-datacatalog-python/blob/master/LICENSE)\n[![issues](https://img.shields.io/github/issues/ricardolsmendes/gcp-datacatalog-python.svg)](https://github.com/ricardolsmendes/gcp-datacatalog-python/issues)\n[![CircleCI][2]][3]\n\n\u003c!--\n  DO NOT UPDATE THE TABLE OF CONTENTS MANUALLY\n  run `npx markdown-toc -i README.md`.\n\n  Please stick to 100-character line wraps as much as you can.\n--\u003e\n\n## Table of Contents\n\n\u003c!-- toc --\u003e\n\n- [1. Get to know the concepts behind this code](#1-get-to-know-the-concepts-behind-this-code)\n- [2. 
Environment setup](#2-environment-setup)\n  * [2.1. Get the code](#21-get-the-code)\n  * [2.2. Auth credentials](#22-auth-credentials)\n  * [2.3. Virtualenv](#23-virtualenv)\n  * [2.4. Docker](#24-docker)\n  * [2.5. Integration tests](#25-integration-tests)\n- [3. Quickstart](#3-quickstart)\n  * [3.1. Integration tests](#31-integration-tests)\n  * [3.2. Run quickstart.py](#32-run-quickstartpy)\n- [4. Load Tag Templates from CSV files](#4-load-tag-templates-from-csv-files)\n  * [4.1. Provide CSV files representing the Template to be created](#41-provide-csv-files-representing-the-template-to-be-created)\n  * [4.2. Integration tests](#42-integration-tests)\n  * [4.3. Run load_template_csv.py](#43-run-load_template_csvpy)\n- [5. Load Tag Templates from Google Sheets](#5-load-tag-templates-from-google-sheets)\n  * [5.1. Enable the Google Sheets API in your GCP Project](#51-enable-the-google-sheets-api-in-your-gcp-project)\n  * [5.2. Provide Google Spreadsheets representing the Template to be created](#52-provide-google-spreadsheets-representing-the-template-to-be-created)\n  * [5.3. Integration tests](#53-integration-tests)\n  * [5.4. Run load_template_google_sheets.py](#54-run-load_template_google_sheetspy)\n- [6. How to contribute](#6-how-to-contribute)\n  * [6.1. Report issues](#61-report-issues)\n  * [6.2. Contribute code](#62-contribute-code)\n\n\u003c!-- tocstop --\u003e\n\n---\n\n## 1. Get to know the concepts behind this code\n\n- [Data Catalog hands-on guide: a mental model][4] @ Google Cloud Community / Medium\n\n- [Data Catalog hands-on guide: search, get \u0026 lookup with Python][5] @ Google Cloud Community /\n  Medium\n\n- [Data Catalog hands-on guide: templates \u0026 tags with Python][6] @ Google Cloud Community / Medium\n\n## 2. Environment setup\n\n### 2.1. Get the code\n\n```sh\ngit clone https://github.com/ricardolsmendes/gcp-datacatalog-python.git\ncd gcp-datacatalog-python\n```\n\n### 2.2. Auth credentials\n\n**2.2.1. 
Create a service account and grant it the roles below**\n\n- BigQuery Admin\n- Data Catalog Admin\n\n**2.2.2. Download a JSON key and save it as**\n\n- `./credentials/datacatalog-samples.json`\n\n### 2.3. Virtualenv\n\nUsing _virtualenv_ is optional, but strongly recommended unless you use [Docker](#24-docker).\n\n**2.3.1. Install Python 3.6+**\n\n**2.3.2. Create and activate an isolated Python environment**\n\n```sh\npip install --upgrade virtualenv\npython3 -m virtualenv --python python3 env\nsource ./env/bin/activate\n```\n\n**2.3.3. Install the dependencies**\n\n```sh\npip install --upgrade -r requirements.txt\n```\n\n**2.3.4. Set environment variables**\n\n```sh\nexport GOOGLE_APPLICATION_CREDENTIALS=./credentials/datacatalog-samples.json\n```\n\n### 2.4. Docker\n\nDocker may be used to run all the scripts. In this case, please disregard the\n[Set up Virtualenv](#23-virtualenv) install instructions.\n\n### 2.5. Integration tests\n\nIntegration tests help to make sure Google Cloud APIs and Service Account IAM Roles have been\nproperly set up before running a script. They actually communicate with the APIs and create\ntemporary resources that are deleted just after being used.\n\n## 3. Quickstart\n\n### 3.1. Integration tests\n\n- pytest\n\n```sh\nexport GOOGLE_CLOUD_TEST_ORGANIZATION_ID=\u003cYOUR-ORGANIZATION-ID\u003e\nexport GOOGLE_CLOUD_TEST_PROJECT_ID=\u003cYOUR-PROJECT-ID\u003e\n\npytest ./tests/integration/quickstart_test.py\n```\n\n- docker\n\n```sh\ndocker build --rm --tag gcp-datacatalog-python .\ndocker run --rm --tty \\\n  --env GOOGLE_CLOUD_TEST_ORGANIZATION_ID=\u003cYOUR-ORGANIZATION-ID\u003e \\\n  --env GOOGLE_CLOUD_TEST_PROJECT_ID=\u003cYOUR-PROJECT-ID\u003e \\\n  --volume \u003cCREDENTIALS-FILE-FOLDER\u003e:/credentials \\\n  gcp-datacatalog-python pytest ./tests/integration/quickstart_test.py\n```\n\n### 3.2. 
Run quickstart.py\n\n- python\n\n```sh\npython quickstart.py --organization-id \u003cYOUR-ORGANIZATION-ID\u003e --project-id \u003cYOUR-PROJECT-ID\u003e\n```\n\n- docker\n\n```sh\ndocker build --rm --tag gcp-datacatalog-python .\ndocker run --rm --tty \\\n  --volume \u003cCREDENTIALS-FILE-FOLDER\u003e:/credentials \\\n  gcp-datacatalog-python \\\n  python quickstart.py --organization-id \u003cYOUR-ORGANIZATION-ID\u003e --project-id \u003cYOUR-PROJECT-ID\u003e\n```\n\n## 4. Load Tag Templates from CSV files\n\n### 4.1. Provide CSV files representing the Template to be created\n\n1. A **master file** named with the Template ID — e.g., `template-abc.csv` if your Template ID is\n   _template_abc_. This file may contain as many lines as needed to represent the template. The first\n   line is always discarded as it's supposed to contain headers. Each field line must have 3 values:\n   the first is the Field ID; second is its Display Name; third is the Type. Currently, types `BOOL`,\n   `DOUBLE`, `ENUM`, `STRING`, `TIMESTAMP`, and `MULTI` are supported. _`MULTI` is not a Data Catalog\n   native type, but a flag that instructs the script to create a specific template to represent\n   the field's predefined values (more on this below...)_.\n1. If the template has **ENUM fields**, the script looks for a \"display names file\" for each of\n   them. The files shall be named with the fields' names — e.g., `enum-field-xyz.csv` if an ENUM Field\n   ID is _enum_field_xyz_. Each file must have just one value per line, representing a display name.\n1. If the template has **multivalued fields**, the script looks for a \"values file\" for each of\n   them. The files shall be named with the fields' names — e.g., `multivalued-field-xyz.csv` if a\n   multivalued Field ID is _multivalued_field_xyz_. Each file must have just one value per line,\n   representing a short description for the value. The script will generate the Field's ID and\n   Display Name based on it.\n1. 
All Field IDs generated by the script are formatted to snake case (e.g., foo_bar_baz). The script\n   does the formatting for you, so just provide the IDs as strings.\n\n_TIP: keep all template-related files in the same folder ([sample-input/load-template-csv][7] for\nreference)._\n\n### 4.2. Integration tests\n\n- pytest\n\n```sh\nexport GOOGLE_CLOUD_TEST_PROJECT_ID=\u003cYOUR-PROJECT-ID\u003e\n\npytest ./tests/integration/load_template_csv_test.py\n```\n\n- docker\n\n```sh\ndocker build --rm --tag gcp-datacatalog-python .\ndocker run --rm --tty \\\n  --env GOOGLE_CLOUD_TEST_PROJECT_ID=\u003cYOUR-PROJECT-ID\u003e \\\n  --volume \u003cCREDENTIALS-FILE-FOLDER\u003e:/credentials \\\n  gcp-datacatalog-python pytest ./tests/integration/load_template_csv_test.py\n```\n\n### 4.3. Run load_template_csv.py\n\n- python\n\n```sh\npython load_template_csv.py \\\n  --template-id \u003cTEMPLATE-ID\u003e --display-name \u003cDISPLAY-NAME\u003e \\\n  --project-id \u003cYOUR-PROJECT-ID\u003e --files-folder \u003cFILES-FOLDER\u003e \\\n  [--delete-existing]\n```\n\n- docker\n\n```sh\ndocker build --rm --tag gcp-datacatalog-python .\ndocker run --rm --tty \\\n  --volume \u003cCREDENTIALS-FILE-FOLDER\u003e:/credentials \\\n  gcp-datacatalog-python python load_template_csv.py \\\n  --template-id \u003cTEMPLATE-ID\u003e --display-name \u003cDISPLAY-NAME\u003e \\\n  --project-id \u003cYOUR-PROJECT-ID\u003e --files-folder \u003cFILES-FOLDER\u003e \\\n  [--delete-existing]\n```\n\n## 5. Load Tag Templates from Google Sheets\n\n### 5.1. Enable the Google Sheets API in your GCP Project\n\nhttps://console.developers.google.com/apis/library/sheets.googleapis.com\n\n### 5.2. Provide Google Spreadsheets representing the Template to be created\n\n1. A **master sheet** named with the Template ID — e.g., `template-abc` if your Template ID is\n   _template_abc_. This sheet may contain as many lines as needed to represent the template. 
The first\n   line is always discarded as it's supposed to contain headers. Each field line must have 3 values:\n   column A is the Field ID; column B is its Display Name; column C is the Type. Currently, types\n   `BOOL`, `DOUBLE`, `ENUM`, `STRING`, `TIMESTAMP`, and `MULTI` are supported. _`MULTI` is not a Data\n   Catalog native type, but a flag that instructs the script to create a specific template to\n   represent the field's predefined values (more on this below...)_.\n1. If the template has **ENUM fields**, the script looks for a \"display names sheet\" for each of\n   them. The sheets shall be named with the fields' names — e.g., `enum-field-xyz` if an ENUM Field ID\n   is _enum_field_xyz_. Each sheet must have just one value per line (column A), representing a\n   display name.\n1. If the template has **multivalued fields**, the script looks for a \"values sheet\" for each of\n   them. The sheets shall be named with the fields' names — e.g., `multivalued-field-xyz` if a\n   multivalued Field ID is _multivalued_field_xyz_. Each sheet must have just one value per line\n   (column A), representing a short description for the value. The script will generate the Field's\n   ID and Display Name based on it.\n1. All Field IDs generated by the script are formatted to snake case (e.g., foo_bar_baz). The script\n   does the formatting for you, so just provide the IDs as strings.\n\n_TIP: keep all template-related sheets in the same document ([Data Catalog Sample Tag Template][8]\nfor reference)._\n\n### 5.3. 
Integration tests\n\n- pytest\n\n```sh\nexport GOOGLE_CLOUD_TEST_PROJECT_ID=\u003cYOUR-PROJECT-ID\u003e\n\npytest ./tests/integration/load_template_google_sheets_test.py\n```\n\n- docker\n\n```sh\ndocker build --rm --tag gcp-datacatalog-python .\ndocker run --rm --tty \\\n  --env GOOGLE_CLOUD_TEST_PROJECT_ID=\u003cYOUR-PROJECT-ID\u003e \\\n  --volume \u003cCREDENTIALS-FILE-FOLDER\u003e:/credentials \\\n  gcp-datacatalog-python pytest ./tests/integration/load_template_google_sheets_test.py\n```\n\n### 5.4. Run load_template_google_sheets.py\n\n- python\n\n```sh\npython load_template_google_sheets.py \\\n  --template-id \u003cTEMPLATE-ID\u003e --display-name \u003cDISPLAY-NAME\u003e \\\n  --project-id \u003cYOUR-PROJECT-ID\u003e --spreadsheet-id \u003cSPREADSHEET-ID\u003e \\\n  [--delete-existing]\n```\n\n- docker\n\n```sh\ndocker build --rm --tag gcp-datacatalog-python .\ndocker run --rm --tty \\\n  --volume \u003cCREDENTIALS-FILE-FOLDER\u003e:/credentials \\\n  gcp-datacatalog-python python load_template_google_sheets.py \\\n  --template-id \u003cTEMPLATE-ID\u003e --display-name \u003cDISPLAY-NAME\u003e \\\n  --project-id \u003cYOUR-PROJECT-ID\u003e --spreadsheet-id \u003cSPREADSHEET-ID\u003e \\\n  [--delete-existing]\n```\n\n## 6. How to contribute\n\nPlease make sure to take a moment and read the [Code of\nConduct](https://github.com/ricardolsmendes/gcp-datacatalog-python/blob/master/.github/CODE_OF_CONDUCT.md).\n\n### 6.1. Report issues\n\nPlease report bugs and suggest features via the [GitHub\nIssues](https://github.com/ricardolsmendes/gcp-datacatalog-python/issues).\n\nBefore opening an issue, search the tracker for possible duplicates. If you find a duplicate, please\nadd a comment saying that you encountered the problem as well.\n\n### 6.2. 
Contribute code\n\nPlease make sure to read the [Contributing\nGuide](https://github.com/ricardolsmendes/gcp-datacatalog-python/blob/master/.github/CONTRIBUTING.md)\nbefore making a pull request.\n\n[1]: https://cloud.google.com/data-catalog\n[2]: https://circleci.com/gh/ricardolsmendes/gcp-datacatalog-python.svg?style=svg\n[3]: https://circleci.com/gh/ricardolsmendes/gcp-datacatalog-python\n[4]: https://medium.com/google-cloud/data-catalog-hands-on-guide-a-mental-model-dae7f6dd49e\n[5]: https://medium.com/google-cloud/data-catalog-hands-on-guide-search-get-lookup-with-python-82d99bfb4056\n[6]: https://medium.com/google-cloud/data-catalog-hands-on-guide-templates-tags-with-python-c45eb93372ef\n[7]: https://github.com/ricardolsmendes/gcp-datacatalog-python/tree/master/sample-input/load-template-csv\n[8]: https://docs.google.com/spreadsheets/d/1DoILfOD_Fb1r5otEz2CUH8SKGkyV5juLakGODTTfOjY\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fricardolsmendes%2Fgcp-datacatalog-python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fricardolsmendes%2Fgcp-datacatalog-python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fricardolsmendes%2Fgcp-datacatalog-python/lists"}