{"id":19725981,"url":"https://github.com/mesmacosta/datacatalog-fileset-processor","last_synced_at":"2025-04-30T00:32:03.526Z","repository":{"id":46083157,"uuid":"259654021","full_name":"mesmacosta/datacatalog-fileset-processor","owner":"mesmacosta","description":"A package to manage Google Cloud Data Catalog Fileset scripts.","archived":false,"fork":false,"pushed_at":"2022-12-26T21:01:33.000Z","size":56,"stargazers_count":3,"open_issues_count":2,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-11-05T00:49:59.627Z","etag":null,"topics":["bigdata","bulk","cloud","csv","datacatalog","docker","filesets","metadata-management"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mesmacosta.png","metadata":{"files":{"readme":"README.md","changelog":"HISTORY.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-04-28T14:04:07.000Z","updated_at":"2023-03-08T00:56:50.000Z","dependencies_parsed_at":"2023-01-31T01:45:49.524Z","dependency_job_id":null,"html_url":"https://github.com/mesmacosta/datacatalog-fileset-processor","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mesmacosta%2Fdatacatalog-fileset-processor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mesmacosta%2Fdatacatalog-fileset-processor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mesmacosta%2Fdatacatalog-fileset-processor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mesmacosta%2Fdatacatalog-fileset
-processor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mesmacosta","download_url":"https://codeload.github.com/mesmacosta/datacatalog-fileset-processor/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224192121,"owners_count":17271186,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigdata","bulk","cloud","csv","datacatalog","docker","filesets","metadata-management"],"created_at":"2024-11-11T23:33:59.799Z","updated_at":"2024-11-11T23:33:59.868Z","avatar_url":"https://github.com/mesmacosta.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Datacatalog Fileset Processor \n\n[![CircleCI][1]][2] [![PyPi][5]][6] [![License][7]][7] [![Issues][8]][9]\n\nA package to manage Google Cloud Data Catalog Fileset scripts.\n\n**Disclaimer: This is not an officially supported Google product.**\n\n\u003c!--\n  ⚠️ DO NOT UPDATE THE TABLE OF CONTENTS MANUALLY ️️⚠️\n  run `npx markdown-toc -i README.md`.\n\n  Please stick to 80-character line wraps as much as you can.\n--\u003e\n\n## Table of Contents\n\n\u003c!-- toc --\u003e\n\n- [Executing in Cloud Shell](#executing-in-cloud-shell)\n- [1. Environment setup](#1-environment-setup)\n  * [1.1. Python + virtualenv](#11-python--virtualenv)\n    + [1.1.1. Install Python 3.6+](#111-install-python-36)\n    + [1.1.2. Get the source code](#112-get-the-source-code)\n    + [1.1.3. Create and activate an isolated Python environment](#113-create-and-activate-an-isolated-python-environment)\n    + [1.1.4. 
Install the package](#114-install-the-package)\n  * [1.2. Docker](#12-docker)\n  * [1.3. Auth credentials](#13-auth-credentials)\n    + [1.3.1. Create a service account and grant it below roles](#131-create-a-service-account-and-grant-it-below-roles)\n    + [1.3.2. Download a JSON key and save it as](#132-download-a-json-key-and-save-it-as)\n    + [1.3.3. Set the environment variables](#133-set-the-environment-variables)\n- [2. Create Filesets from CSV file](#2-create-filesets-from-csv-file)\n  * [2.1. Create a CSV file representing the Entry Groups and Entries to be created](#21-create-a-csv-file-representing-the-entry-groups-and-entries-to-be-created)\n  * [2.2. Run the datacatalog-fileset-processor script - Create the Filesets Entry Groups and Entries](#22-run-the-datacatalog-fileset-processor-script---create-the-filesets-entry-groups-and-entries)\n  * [2.3. Run the datacatalog-fileset-processor script - Delete the Filesets Entry Groups and Entries](#23-run-the-datacatalog-fileset-processor-script---delete-the-filesets-entry-groups-and-entries)\n\n\u003c!-- tocstop --\u003e\n\n-----\n\n## Executing in Cloud Shell\n````bash\n# Set your SERVICE ACCOUNT, for instructions go to 1.3. Auth credentials\n# This name is just a suggestion, feel free to name it following your naming conventions\nexport GOOGLE_APPLICATION_CREDENTIALS=~/datacatalog-fileset-processor-sa.json\n\n# Install datacatalog-fileset-processor\npip3 install datacatalog-fileset-processor --user\n\n# Add to your PATH\nexport PATH=~/.local/bin:$PATH\n\n# Look for available commands\ndatacatalog-fileset-processor --help\n````\n\n## 1. Environment setup\n\n### 1.1. Python + virtualenv\n\nUsing [virtualenv][3] is optional, but strongly recommended unless you use [Docker](#12-docker).\n\n#### 1.1.1. Install Python 3.6+\n\n#### 1.1.2. 
Get the source code\n```bash\ngit clone https://github.com/mesmacosta/datacatalog-fileset-processor\ncd ./datacatalog-fileset-processor\n```\n\n_All paths starting with `./` in the next steps are relative to the `datacatalog-fileset-processor`\nfolder._\n\n#### 1.1.3. Create and activate an isolated Python environment\n\n```bash\npip install --upgrade virtualenv\npython3 -m virtualenv --python python3 env\nsource ./env/bin/activate\n```\n\n#### 1.1.4. Install the package\n\n```bash\npip install --upgrade .\n```\n\n### 1.2. Docker\n\nDocker may be used as an alternative to run the script. In this case, please disregard the\n[Virtualenv](#11-python--virtualenv) setup instructions.\n\n### 1.3. Auth credentials\n\n#### 1.3.1. Create a service account and grant it below roles\n\n- Data Catalog Admin\n\n#### 1.3.2. Download a JSON key and save it as\nThis name is just a suggestion; feel free to name it following your naming conventions.\n- `./credentials/datacatalog-fileset-processor-sa.json`\n\n#### 1.3.3. Set the environment variables\n\n_This step may be skipped if you're using [Docker](#12-docker)._\n\n```bash\nexport GOOGLE_APPLICATION_CREDENTIALS=~/credentials/datacatalog-fileset-processor-sa.json\n```\n\n## 2. Create Filesets from CSV file\n\n### 2.1. Create a CSV file representing the Entry Groups and Entries to be created\n\nEach Fileset spans as many lines as required to represent all of its fields. The columns are\ndescribed as follows:\n\n| Column                        | Description               | Mandatory |\n| ---                           | ---                       | ---       |\n| **entry_group_name**          | Entry Group Name.         | Y         |\n| **entry_group_display_name**  | Entry Group Display Name. | N         |\n| **entry_group_description**   | Entry Group Description.  | N         |\n| **entry_id**                  | Entry ID.                 | Y         |\n| **entry_display_name**        | Entry Display Name.       
| Y         |\n| **entry_description**         | Entry Description.        | N         |\n| **entry_file_patterns**       | Entry File Patterns.      | Y         |\n| **schema_column_name**        | Schema column name.       | N         |\n| **schema_column_type**        | Schema column type.       | N         |\n| **schema_column_description** | Schema column description.| N         |\n| **schema_column_mode**        | Schema column mode.       | N         |\n\nPlease note that `schema_column_type` is an open string field and accepts any value. If you want\nto use your fileset with Dataflow SQL, follow the data types in the [official docs][10].\n\n### 2.2. Run the datacatalog-fileset-processor script - Create the Filesets Entry Groups and Entries\n\n- Python + virtualenv\n\n```bash\ndatacatalog-fileset-processor filesets create --csv-file CSV_FILE_PATH\n```\n\n### 2.3. Run the datacatalog-fileset-processor script - Delete the Filesets Entry Groups and Entries\n\n- Python + virtualenv\n\n```bash\ndatacatalog-fileset-processor filesets delete --csv-file CSV_FILE_PATH\n```\n\n*TIPS*\n- See [sample-input/create-filesets][4] for reference.\n\n- If you want to create filesets without a schema, see\n[sample-input/create-filesets/fileset-entry-opt-1-all-metadata-no-schema.csv][4] for reference.\n\n[1]: https://circleci.com/gh/mesmacosta/datacatalog-fileset-processor.svg?style=svg\n[2]: https://circleci.com/gh/mesmacosta/datacatalog-fileset-processor\n[3]: https://virtualenv.pypa.io/en/latest/\n[4]: https://github.com/mesmacosta/datacatalog-fileset-processor/tree/master/sample-input/create-filesets\n[5]: https://img.shields.io/pypi/v/datacatalog-fileset-processor.svg?force_cache=true\n[6]: https://pypi.org/project/datacatalog-fileset-processor/\n[7]: https://img.shields.io/github/license/mesmacosta/datacatalog-fileset-processor.svg\n[8]: https://img.shields.io/github/issues/mesmacosta/datacatalog-fileset-processor.svg\n[9]: 
https://github.com/mesmacosta/datacatalog-fileset-processor/issues\n[10]: https://cloud.google.com/dataflow/docs/reference/sql/data-types\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmesmacosta%2Fdatacatalog-fileset-processor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmesmacosta%2Fdatacatalog-fileset-processor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmesmacosta%2Fdatacatalog-fileset-processor/lists"}