{"id":13569212,"url":"https://github.com/GoogleCloudPlatform/auto-data-tokenize","last_synced_at":"2025-04-04T05:31:45.393Z","repository":{"id":37089426,"uuid":"326861102","full_name":"GoogleCloudPlatform/auto-data-tokenize","owner":"GoogleCloudPlatform","description":"Identify and tokenize sensitive data automatically using Cloud DLP and Dataflow","archived":false,"fork":false,"pushed_at":"2025-03-24T18:07:30.000Z","size":1536,"stargazers_count":43,"open_issues_count":9,"forks_count":20,"subscribers_count":16,"default_branch":"main","last_synced_at":"2025-03-30T15:42:27.362Z","etag":null,"topics":["cloud-migration","data-governance","data-loss-prevention","dataflow","deidentification","tokenization"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GoogleCloudPlatform.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"code-of-conduct.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS","dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-01-05T02:13:42.000Z","updated_at":"2025-03-22T15:31:41.000Z","dependencies_parsed_at":"2023-01-23T01:15:36.878Z","dependency_job_id":"1c847fa4-ed82-4015-a434-11e36ee49b45","html_url":"https://github.com/GoogleCloudPlatform/auto-data-tokenize","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fauto-data-tokenize","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fauto-data-tokenize/tags","releases_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/repositories/GoogleCloudPlatform%2Fauto-data-tokenize/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fauto-data-tokenize/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GoogleCloudPlatform","download_url":"https://codeload.github.com/GoogleCloudPlatform/auto-data-tokenize/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247128702,"owners_count":20888232,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cloud-migration","data-governance","data-loss-prevention","dataflow","deidentification","tokenization"],"created_at":"2024-08-01T14:00:37.089Z","updated_at":"2025-04-04T05:31:44.462Z","avatar_url":"https://github.com/GoogleCloudPlatform.png","language":"Java","readme":"# Automatic data tokenizing pipelines\n\n[![codecov](https://codecov.io/gh/GoogleCloudPlatform/auto-data-tokenize/branch/main/graph/badge.svg?token=2BUEEXNC1H)](https://codecov.io/gh/GoogleCloudPlatform/auto-data-tokenize)\n\nThis document discusses how to identify and tokenize data with an automated data transformation pipeline to detect sensitive data like personally identifiable information (PII), using Cloud Data Loss Prevention [(Cloud DLP)](https://cloud.google.com/dlp) and [Cloud KMS](https://cloud.google.com/kms). 
De-identification techniques like encryption let you preserve the utility of your data for joining or analytics while reducing the risk of handling the data by obfuscating the raw sensitive identifiers.\n\nTo minimize the risk of handling large volumes of sensitive data, you can use an automated data transformation pipeline to create de-identified datasets that can be used for migrating from on-premises to the cloud, or to keep a de-identified replica for analytics. Cloud DLP can inspect the data for sensitive information when the dataset has not been characterized, by using [more than 100 built-in classifiers](https://cloud.google.com/dlp/docs/infotypes-reference).\n\nOne of the daunting challenges during data migration to the cloud is managing sensitive data. The sensitive data can be in structured forms like analytics tables, or in unstructured forms like chat history or transcription records. You can use Cloud DLP to identify sensitive data from both kinds of sources, and then tokenize the sensitive parts.\n\nTokenizing structured data can be optimized for cost and speed by using representative samples for each of the columns to categorize the kind of information, followed by bulk encryption of the sensitive columns. This approach reduces the cost of using Cloud DLP by limiting classification to a small representative sample instead of all the records. The throughput and cost of tokenization can be further optimized by using envelope encryption for the columns classified as sensitive.\n\nThis document demonstrates a reference implementation of tokenizing structured data through two tasks: _sample and identify_, followed by _bulk tokenization_ using encryption.\n\nThis document is intended for a technical audience whose responsibilities include data security, data processing, or data analytics. 
This guide assumes that you're familiar with data processing and data privacy, without the need to be an expert.\n\nWatch the video to learn how the tool works:\n\n[![Understand code Youtube video](https://img.youtube.com/vi/S6fYkWvUBDo/default.jpg)](https://www.youtube.com/watch?v=S6fYkWvUBDo)\n\n## Architecture\n\nThe solution comprises two pipelines (one for each of the tasks):\n  1. Sample + Identify\n  1. Tokenize\n\n### Sample \u0026 Identify Pipeline\n\n![Identify pipeline](sampling_dlp_identify_catalog_architecture.svg)\n\nThe __sample \u0026 identify pipeline__ extracts a few sample records from the source files. The *identify* part of the pipeline then decomposes each sample record into columns to categorize them into one of the [in-built infotypes](https://cloud.google.com/dlp/docs/infotypes-reference) or [custom infotypes](https://cloud.google.com/dlp/docs/creating-custom-infotypes) using Cloud DLP. The sample \u0026 identify pipeline outputs the following files to Cloud Storage:\n  * The Avro schema of the file\n  * The detected info-types for each of the input columns\n\n### Tokenization Pipeline\n\n![Tokenization pipeline](AutoDLP_Encryption_Catalog_Architecture.svg)\n\nThe __tokenize pipeline__ then encrypts the user-specified source columns using the schema information from the _sample \u0026 identify pipeline_ and the user-provided, KMS-wrapped data encryption key. The tokenizing pipeline performs the following transforms on each record in the source file:\n  1. Unwrap the data encryption key using Cloud KMS.\n  1. Un-nest each record into a flat record.\n  1. Tokenize the required values using [deterministic AEAD](https://github.com/google/tink/blob/master/docs/PRIMITIVES.md#deterministic-authenticated-encryption-with-associated-data) encryption.\n  1. Re-nest the flat record into an Avro record.\n  1. 
Write the Avro file with the encrypted fields.\n\n### Concepts\n* [Envelope Encryption](https://cloud.google.com/kms/docs/envelope-encryption) is a form of multi-layer encryption that uses multiple layers of keys to encrypt data: the actual data encryption key is itself encrypted with another key to keep it secure.\n* [Cloud KMS](https://cloud.google.com/kms) provides easy management of encryption keys at scale.\n* [Tink](https://github.com/google/tink) is an open-source library that provides easy and secure APIs for handling encryption/decryption. It reduces common crypto pitfalls with user-centered design, careful implementation and code reviews, and extensive testing. At Google, Tink is one of the standard crypto libraries, and has been deployed in hundreds of products and systems. Tink natively integrates with Cloud KMS for use with the envelope encryption technique.\n* [Deterministic AEAD encryption](https://github.com/google/tink/blob/master/docs/PRIMITIVES.md#deterministic-authenticated-encryption-with-associated-data) is used to serve the following purposes:\n  1. It permits use of the cipher-text as join keys. The deterministic property of the cipher ensures that the cipher-text for the same plain-text is always the same. Using this property, one can safely use the encrypted data for statistical analysis like cardinality analysis, frequency analysis, etc.\n  1. It stores signed plain-text within the cipher to assert authenticity.\n  1. It is reversible: the use of a 2-way encryption algorithm permits reversing the algorithm to obtain the original plain-text. Hashing does not permit such operations.\n\n* [Cloud Data Loss Prevention](https://cloud.google.com/dlp) is a Google Cloud service providing data classification, de-identification and re-identification features, allowing you to easily manage sensitive data in your enterprise.\n\n* __Record Flattening__ is the process of converting nested/repeated records into a flat table. 
Each leaf-node of the record gets a unique identifier. This flattening process enables sending data to DLP for identification purposes, as the DLP API supports a simple [data-table](https://cloud.google.com/dlp/docs/examples-deid-tables).\n\n   Consider a contact record for **Jane Doe**; it has a nested and repeated field `contacts`.\n   ```json\n   {\n      \"name\": \"Jane Doe\",\n      \"contacts\": [\n      {\n         \"type\": \"WORK\",\n         \"number\": 2124567890\n      },\n      {\n         \"type\": \"HOME\",\n         \"number\": 5304321234\n      }\n      ]\n   }\n   ```\n\n   Flattening this record yields a [FlatRecord](https://github.com/GoogleCloudPlatform/auto-data-tokenize/blob/master/proto-messages/src/main/resources/proto/google/cloud/autodlp/auto_tokenize_messages.proto#L103) with the following data. Notice the `values` map, which demonstrates that each leaf node of the contact record is mapped using a [JsonPath](https://goessner.net/articles/JsonPath/) notation.\n   The `keySchema` shows a mapping from each leaf value's key to a schema key, to demonstrate that leaf-nodes of the same type share the same key-schema; for example, `$.contacts[0].contact.number` is logically the same as `$.contacts[1].contact.number`, as both of them have the same key-schema `$.contacts.contact.number`.\n\n   ```json\n   {\n      \"values\": {\n       \"$.name\": \"Jane Doe\",\n       \"$.contacts[0].contact.type\": \"WORK\",\n       \"$.contacts[0].contact.number\": 2124567890,\n       \"$.contacts[1].contact.type\": \"HOME\",\n       \"$.contacts[1].contact.number\": 5304321234\n      },\n\n      \"keySchema\": {\n       \"$.name\": \"$.name\",\n       \"$.contacts[0].contact.type\": \"$.contacts.contact.type\",\n       \"$.contacts[0].contact.number\": \"$.contacts.contact.number\",\n       \"$.contacts[1].contact.type\": \"$.contacts.contact.type\",\n       \"$.contacts[1].contact.number\": \"$.contacts.contact.number\"\n      }\n   }\n   ```\n\n## 
Prerequisites\n\nThis tutorial assumes some familiarity with shell scripts and basic knowledge of Google Cloud.\n\n## Objectives\n\n1. Understand record sampling and identifying sensitive columns using DLP.\n1. Use symmetric encryption to tokenize data using a KMS-wrapped data encryption key.\n\n## Costs\n\nThis tutorial uses billable components of Google Cloud, including the following:\n\n* [Dataflow](https://cloud.google.com/dataflow/pricing)\n* [Cloud Storage](https://cloud.google.com/storage/pricing)\n* [Cloud Data Loss Prevention](https://cloud.google.com/dlp/pricing)\n* [Cloud KMS](https://cloud.google.com/kms/pricing)\n\nUse the [pricing calculator](https://cloud.google.com/products/calculator) to generate a cost estimate based on your\nprojected usage.\n\n## Before you begin\n\nFor this tutorial, you need a Google Cloud [project](https://cloud.google.com/resource-manager/docs/cloud-platform-resource-hierarchy#projects). To make\ncleanup easiest at the end of the tutorial, we recommend that you create a new project for this tutorial.\n\n1. [Create a Google Cloud project](https://console.cloud.google.com/projectselector2/home/dashboard).\n1. Make sure that [billing is enabled](https://support.google.com/cloud/answer/6293499#enable-billing) for your Google\n   Cloud project.\n1. [Open Cloud Shell](https://console.cloud.google.com/?cloudshell=true).\n\n   At the bottom of the Cloud Console, a [Cloud Shell](https://cloud.google.com/shell/docs/features) session opens and\n   displays a command-line prompt. Cloud Shell is a shell environment with the Cloud SDK already installed, including\n   the [gcloud](https://cloud.google.com/sdk/gcloud/) command-line tool, and with values already set for your current\n   project. It can take a few seconds for the session to initialize.\n\n1. 
Enable APIs for the Cloud DLP, Cloud KMS, Compute Engine, Cloud Storage, Dataflow and BigQuery services with the following command:\n   ```shell script\n   gcloud services enable \\\n   dlp.googleapis.com \\\n   cloudkms.googleapis.com \\\n   compute.googleapis.com \\\n   storage.googleapis.com \\\n   dataflow.googleapis.com \\\n   bigquery.googleapis.com\n   ```\n\n## Setting up your environment\n\n1. In Cloud Shell, clone the source repository and go to the directory for this tutorial:\n   ```shell script\n   git clone https://github.com/GoogleCloudPlatform/auto-data-tokenize.git\n   cd auto-data-tokenize/\n   ```\n\n1. Use a text editor of your choice to modify the `set_variables.sh` file to set the required environment variables.\n   ```shell script\n   # The Google Cloud project to use for this tutorial\n   export PROJECT_ID=\"\u003cyour-project-id\u003e\"\n\n   # The Compute Engine region to use for running Dataflow jobs and creating a\n   # temporary storage bucket\n   export REGION_ID=\"\u003ccompute-engine-region\u003e\"\n\n   # The GCS bucket to use as a temporary bucket for Dataflow\n   export TEMP_GCS_BUCKET=\"\u003cname-of-the-bucket\u003e\"\n\n   # Name of the service account to use (not the email address)\n   export DLP_RUNNER_SERVICE_ACCOUNT_NAME=\"\u003cservice-account-name-for-runner\u003e\"\n\n   # Name of the Cloud KMS key ring\n   export KMS_KEYRING_ID=\"\u003ckey-ring-name\u003e\"\n\n   # Name of the symmetric key encryption KMS key\n   export KMS_KEY_ID=\"\u003ckey-id\u003e\"\n\n   # The JSON file containing the Tink wrapped data-key to use for encryption\n   export WRAPPED_KEY_FILE=\"\u003cpath-to-the-data-encryption-key-file\u003e\"\n   ```\n\n1. 
Run the script to set the environment variables:\n   ```shell script\n   source set_variables.sh\n   ```\n\n## Creating resources\n\nThe tutorial uses the following resources:\n * A _service account_ to run the Dataflow pipelines, enabling fine-grained access control\n * A symmetric Cloud KMS managed _key encryption key_, which is used to wrap the actual data encryption key\n * A _Cloud Storage bucket_ for temporary data storage and test data\n\n### Create service accounts\n\nWe recommend that you run pipelines with fine-grained access control to improve access partitioning. If\nyour project does not have a user-created service account, create one using the following instructions.\n\nYou can use your browser by going to [**Service\naccounts**](https://console.cloud.google.com/projectselector/iam-admin/serviceaccounts?supportedpurview=project)\nin the Cloud Console.\n\n1. Create a service account to use as the user-managed controller service account for Dataflow:\n   ```shell script\n   gcloud iam service-accounts create ${DLP_RUNNER_SERVICE_ACCOUNT_NAME} \\\n   --project=\"${PROJECT_ID}\" \\\n   --description=\"Service Account for Tokenizing pipelines.\" \\\n   --display-name=\"Tokenizing pipelines\"\n   ```\n1. Create a custom role with the required permissions for accessing DLP, Dataflow and KMS:\n   ```shell script\n   export TOKENIZING_ROLE_NAME=\"tokenizing_runner2\"\n\n   gcloud iam roles create ${TOKENIZING_ROLE_NAME} \\\n   --project=${PROJECT_ID} \\\n   --file=tokenizing_runner_permissions.yaml\n   ```\n\n1. Apply the custom role to the service account:\n   ```shell script\n   gcloud projects add-iam-policy-binding ${PROJECT_ID} \\\n   --member=\"serviceAccount:${DLP_RUNNER_SERVICE_ACCOUNT_EMAIL}\" \\\n   --role=projects/${PROJECT_ID}/roles/${TOKENIZING_ROLE_NAME}\n   ```\n1. 
Assign the `dataflow.worker` role to allow the service account to run as a Dataflow worker:\n   ```shell script\n   gcloud projects add-iam-policy-binding ${PROJECT_ID} \\\n   --member=\"serviceAccount:${DLP_RUNNER_SERVICE_ACCOUNT_EMAIL}\" \\\n   --role=roles/dataflow.worker\n   ```\n\n### Create the key encryption key\n\nThe data is encrypted using a Data Encryption Key (DEK). You use the [envelope encryption](https://cloud.google.com/kms/docs/envelope-encryption) technique to encrypt the DEK with a key in [Cloud KMS](https://cloud.google.com/kms); this ensures that the DEK can be stored safely without compromising it.\n\n1. Create a KMS key ring:\n   ```shell script\n   gcloud kms keyrings create --project ${PROJECT_ID} --location ${REGION_ID} ${KMS_KEYRING_ID}\n   ```\n1. Create a symmetric KMS key to use for encrypting your data encryption key:\n   ```shell script\n   gcloud kms keys create --project ${PROJECT_ID} --keyring=${KMS_KEYRING_ID} --location=${REGION_ID} --purpose=\"encryption\" ${KMS_KEY_ID}\n   ```\n1. Download and unpack the latest version of [Tinkey](https://github.com/google/tink/blob/master/docs/TINKEY.md). Tinkey\n   is an open source utility to create wrapped encryption keys.\n   ```shell script\n   mkdir tinkey/\n   tar zxf tinkey-\u003cversion\u003e.tar.gz -C tinkey/\n   export TINKEY=\"${PWD}/tinkey/tinkey\"\n   alias tinkey=\"${TINKEY}\"\n   ```\n\n1. 
Create a new wrapped data encryption key:\n   ```shell script\n   tinkey create-keyset \\\n   --master-key-uri=\"${MAIN_KMS_KEY_URI}\" \\\n   --key-template=AES256_SIV \\\n   --out=\"${WRAPPED_KEY_FILE}\" \\\n   --out-format=json\n   ```\n\n### Create Cloud Storage bucket\n\nCreate a Cloud Storage bucket for storing test data and the Dataflow staging location.\n\n```shell script\ngsutil mb -p ${PROJECT_ID} -l ${REGION_ID} \"gs://${TEMP_GCS_BUCKET}\"\n```\n\n### Copy test data to Cloud Storage\n\nYou can use your own file datasets or copy the included demo dataset (`userdata.avro` or `userdata.parquet`).\n\n```shell script\ngsutil cp userdata.avro gs://${TEMP_GCS_BUCKET}\n```\n\n## Compile modules\n\n### Prerequisites\n\nThe solution uses [Testcontainers](https://www.testcontainers.org/) for database-dependent unit tests.\nIf you wish to skip running the tests, add the `-x test` flag to the `gradle` command.\n\nInstall Docker:\n\n   1. Follow the [distro specific steps](https://docs.docker.com/engine/install/)\n   1. For security purposes, [set up Docker in rootless mode](https://docs.docker.com/engine/security/rootless/#prerequisites)\n   1. Install Docker Compose using these [steps](https://docs.docker.com/compose/install/)\n\nCheck the [supported](https://www.testcontainers.org/supported_docker_environment/) Docker versions for Testcontainers before installing.\n\n### Compile\n\nYou need to compile all the modules to build executables for deploying the _sample \u0026 identify_ and _bulk tokenize_ pipelines.\n\n```shell script\n./gradlew clean buildNeeded shadowJar\n```\n\n## Run Sample and Identify pipeline\n\nRun the sample \u0026 identify pipeline to identify sensitive columns in the data you need to tokenize.\n\nThe pipeline extracts `sampleSize` records, flattens them, and identifies sensitive columns\nusing [Data Loss Prevention (DLP)](https://cloud.google.com/dlp). 
Cloud DLP provides functionality\nto [identify](https://cloud.google.com/dlp/docs/inspecting-text) sensitive information-types. The DLP identify methods\nsupport only flat tables, hence the pipeline flattens the Avro/Parquet records, as they can contain nested and/or repeated fields.\n\n### Launch sample \u0026 identify pipeline\n\n```shell script\nsample_and_identify_pipeline --project=\"${PROJECT_ID}\" \\\n--region=\"${REGION_ID}\" \\\n--runner=\"DataflowRunner\" \\\n--serviceAccount=${DLP_RUNNER_SERVICE_ACCOUNT_EMAIL} \\\n--gcpTempLocation=\"gs://${TEMP_GCS_BUCKET}/temp\" \\\n--stagingLocation=\"gs://${TEMP_GCS_BUCKET}/staging\" \\\n--tempLocation=\"gs://${TEMP_GCS_BUCKET}/bqtemp\" \\\n--workerMachineType=\"n1-standard-1\" \\\n--sampleSize=500 \\\n--sourceType=\"AVRO\" \\\n--inputPattern=\"gs://${TEMP_GCS_BUCKET}/userdata.avro\" \\\n--reportLocation=\"gs://${TEMP_GCS_BUCKET}/dlp_report/\"\n```\n\n\u003e **Note:** Use `sampleSize=0` to process all records.\n\nThe pipeline supports multiple **Source Types**; use the following table to choose the right combination of `sourceType` and `inputPattern` arguments.\n\n| Data source | sourceType | inputPattern |\n|---|---|---|\n| **Avro** file in Cloud Storage | `AVRO` | `gs://\u003clocation of the file(s)\u003e` |\n| **Parquet** file in Cloud Storage | `PARQUET` | `gs://\u003clocation of the file(s)\u003e` |\n| CSV files in Cloud Storage | `CSV_FILE` | `gs://\u003clocation of the file(s)\u003e` |\n| BigQuery table | `BIGQUERY_TABLE` | `\u003cproject-id\u003e:\u003cdataset\u003e.\u003ctable\u003e` |\n| Query results in BigQuery | `BIGQUERY_QUERY` | BigQuery SQL statement in the StandardSQL dialect. |\n| Relational databases (using JDBC) | `JDBC_TABLE` | `[TABLE_NAME]`; use the parameters `jdbcConnectionUrl` and `jdbcDriverClass` to specify the JDBC connection details. ([Details](sample_identify_and_tag.md)) |\n| Query results from relational databases (using JDBC) | `JDBC_QUERY` | `SELECT` query; use the parameters `jdbcConnectionUrl` and `jdbcDriverClass` to specify the JDBC connection details. |\n\nThe pipeline detects all the [standard infotypes](https://cloud.google.com/dlp/docs/infotypes-reference) supported by DLP.\nUse `--observableInfoTypes` to provide additional custom info-types that you need.\n\n\nPipeline options for the Sample, Identify and Tag pipeline (`DlpSamplerPipelineOptions`):\n\n| Sampling Pipeline Options | Description |\n|---|---|\n| `sourceType` | The data source to analyse/inspect. One of: \[AVRO, PARQUET, BIGQUERY_TABLE, BIGQUERY_QUERY, JDBC_TABLE\] |\n| `inputPattern` | The location of the datasource: for AVRO or PARQUET, the GCS file pattern to use as input.\u003cbr\u003eFor BIGQUERY_TABLE, the fully qualified table name in {projectId}:{datasetId}.{tableId} format; for JDBC_TABLE, the name of the table. |\n| `sampleSize` | (Optional) The sample size to send to DLP. (Default: 1000) |\n| `reportLocation` | (Optional) The GCS location to write the aggregated inspection results and the datasource's AVRO schema. At least one of `reportLocation` or `reportBigQueryTable` must be specified. |\n| `reportBigQueryTable` | (Optional) The BigQuery table ({projectId}:{datasetId}.{tableId}) to write the aggregated inspection results to; the table must exist. At least one of `reportLocation` or `reportBigQueryTable` must be specified. |\n| `observableInfoTypes` | (Optional) A list of DLP InfoTypes to inspect the data with. Leaving it empty uses all DLP-supported InfoTypes. |\n| `jdbcConnectionUrl` | The connection URL used for connecting to a SQL datasource using JDBC. (Required when `sourceType=JDBC_TABLE`) |\n| `jdbcDriverClass` | The JDBC driver to use for reading from the SQL datasource. (Required when `sourceType=JDBC_TABLE`) |\n| `jdbcFilterClause` | When using a JDBC source, it is highly recommended to use a sampling filter to select random records instead of fetching all the records from the relational database. The provided string is set as the WHERE clause of the query. (Optional when `sourceType=JDBC_TABLE`) |\n| `csvHeaders` | (Optional) Provide column names when using the `CSV_FILE` sourceType. |\n| `csvFirstRowHeaders` | (Optional) Omit the first row of each CSV file and use it as the header. (Default: `false`) |\n| `csvCharset` | (Optional) Specify the charset used for the CSV files. (Default: `UTF-8`) |\n| `csvColumnDelimiter` | (Optional) Specify the character(s) used for delimiting columns in a row. (Default: `,`) |\n| `csvFormatType` | (Optional) Specify the CSV format based on Apache Commons CSV. Choose one of the [CSVFormat#Predefined](https://github.com/apache/commons-csv/blob/f6cdeac129665cf6f131b00678c9b4e824d758e5/src/main/java/org/apache/commons/csv/CSVFormat.java#L679) types. (Default: `Default`) |\n| `dataCatalogEntryGroupId` | The Entry Group Id (/projects/{projectId}/locations/{locationId}/entryGroups/{entryGroupId}) to create a new Entry for the inspected datasource. Provide this to enable the pipeline to create a new entry with the schema in Data Catalog. (Not used for `sourceType=BIGQUERY_TABLE`) |\n| `dataCatalogInspectionTagTemplateId` | The Data Catalog template ID to use for creating the sensitivity tags. |\n| `dataCatalogForcedUpdate` | Force updates to Data Catalog tags/entries based on execution of this pipeline. (Default: `false`) |\n| `dlpRegion` | The DLP [processing location](https://cloud.google.com/dlp/docs/locations#regions) to use. (Default: `global`) |\n\n\n### Sample \u0026 Identify pipeline DAG\n\nThe Dataflow execution DAG looks like the following:\n\n![Sample and Identify Pipeline DAG](sampling_pipeline_with_catalog_jdbc.png)\n\n### Retrieve report\n\nThe sample \u0026 identify pipeline outputs the Avro schema (or the converted schema for Parquet) of the files and one file for each of the columns detected to contain sensitive information. 
Retrieve the report to your local machine to have a look.\n\n```shell script\nmkdir -p dlp_report/ \u0026\u0026 rm dlp_report/*.json\ngsutil -m cp \"gs://${TEMP_GCS_BUCKET}/dlp_report/*.json\" dlp_report/\n```\n\nList all the column names that have been identified.\n\n```shell script\ncat dlp_report/col-*.json | jq .columnName\n```\nThe output will match the following list.\n\n```text\n\"$.kylosample.birthdate\"\n\"$.kylosample.cc\"\n\"$.kylosample.email\"\n\"$.kylosample.first_name\"\n\"$.kylosample.ip_address\"\n\"$.kylosample.last_name\"\n```\n\nYou can view the details of an identified column by issuing the `cat` command for its file.\n\n```shell script\ncat dlp_report/col-kylosample-cc-00000-of-00001.json\n```\n\nThe following is a snippet of the `cc` column.\n```json\n{\n  \"columnName\": \"$.kylosample.cc\",\n  \"infoTypes\": [\n    {\n      \"infoType\": \"CREDIT_CARD_NUMBER\",\n      \"count\": \"394\"\n    }\n  ]\n}\n```\n\n\u003e The `\"count\"` value will vary based on the randomly selected samples during execution.\n\n## Launch bulk tokenize pipeline\n\nThe sample \u0026 identify pipeline used a few samples from the original dataset to identify sensitive information using\nDLP. 
The bulk tokenize pipeline processes the entire dataset and encrypts the desired columns using the provided Data Encryption Key (DEK).

```shell script
tokenize_pipeline --project="${PROJECT_ID}" \
--region="${REGION_ID}" \
--runner="DataflowRunner" \
--tempLocation="gs://${TEMP_GCS_BUCKET}/bqtemp" \
--serviceAccount=${DLP_RUNNER_SERVICE_ACCOUNT_EMAIL} \
--workerMachineType="n1-standard-1" \
--schema="$(<dlp_report/schema.json)" \
--tinkEncryptionKeySetJson="$(<${WRAPPED_KEY_FILE})" \
--mainKmsKeyUri="${MAIN_KMS_KEY_URI}" \
--sourceType="AVRO" \
--inputPattern="gs://${TEMP_GCS_BUCKET}/userdata.avro" \
--outputDirectory="gs://${TEMP_GCS_BUCKET}/encrypted/" \
--tokenizeColumns="$.kylosample.cc" \
--tokenizeColumns="$.kylosample.email"
```

#### Parameter reference

The encryption pipeline supports two encryption modes:

<table>
  <thead>
    <tr>
      <th>Encryption Mode</th>
      <th>Parameter</th>
      <th>Parameter information</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="3">Fixed encryption</td>
      <td><code>--tinkEncryptionKeySetJson</code></td>
      <td>The wrapped encryption key details</td>
    </tr>
    <tr>
      <td><code>--mainKmsKeyUri</code></td>
      <td>The KMS key used to wrap the Tink encryption key, in the following format:<br><code>gcp-kms://projects/${PROJECT_ID}/locations/${REGION_ID}/keyRings/${KMS_KEYRING_ID}/cryptoKeys/${KMS_KEY_ID}</code></td>
    </tr>
    <tr>
      <td><code>--tokenizeColumns</code></td>
      <td>One or more logical column names in JSONPath format</td>
    </tr>
    <tr>
      <td>DLP De-identify</td>
      <td><code>--dlpEncryptConfigJson</code></td>
      <td><a href="src/main/proto/google/cloud/autodlp/auto_tokenize_messages.proto">DlpEncryptConfig</a>
JSON to provide a <a href="https://cloud.google.com/dlp/docs/reference/rest/v2/projects.deidentifyTemplates#DeidentifyTemplate.PrimitiveTransformation">PrimitiveTransformation</a>
for each <code>tokenizedColumn</code>.
<br>
Use the <a href="email_cc_dlp_encrypt_config.json">sample</a> configuration JSON for reference, and the <a href="https://cloud.google.com/dlp/docs/transformations-reference">transformation reference</a> to understand each of the transformations.
</td>
    </tr>
  </tbody>
</table>
<br>

The pipeline supports the following destinations for storing the tokenized output.

| Destination | Description | Pipeline parameter |
| --- | --- | --- |
| File in Cloud Storage | Stores the output as an AVRO file | `--outputDirectory=gs://<location of the directory>/` |
| BigQuery table | Uses `WRITE_TRUNCATE` mode to write results to a BigQuery table | `--outputBigQueryTable=<project-id>:<dataset>.<table>` |

You can use one or both of them simultaneously.

The pipeline executes asynchronously on Dataflow.
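Before launching with a BigQuery destination, it can be handy to sanity-check the `--outputBigQueryTable` value against the `<project-id>:<dataset>.<table>` shape from the destinations table above. A minimal sketch in pure shell; the table spec and the (simplified) project-id pattern below are illustrative assumptions, not pipeline validation logic:

```shell script
# Rough format check for a <project-id>:<dataset>.<table> spec.
# The value below is a hypothetical example.
table_spec="example-project:tokenized_data.TokenizedUserdata"
if printf '%s\n' "${table_spec}" | grep -Eq '^[a-z][a-z0-9-]*:[A-Za-z0-9_]+\.[A-Za-z0-9_]+$'; then
  echo "looks valid: ${table_spec}"
else
  echo "unexpected format: ${table_spec}"
fi
# prints: looks valid: example-project:tokenized_data.TokenizedUserdata
```

This only catches obvious typos (missing `:` or `.`); the pipeline itself reports malformed table specs when it runs.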
You can check the progress by following the JobLink printed in the following format:

```
INFO: JobLink: https://console.cloud.google.com/dataflow/jobs/<your-dataflow-jobid>?project=<your-project-id>
```

Additional pipeline options for the bulk tokenize pipeline (`EncryptionPipelineOptions`):

| Encryption key parameter | Description |
| --- | --- |
| `dlpEncryptConfigJson` | A valid JSON object as per the `DlpEncryptConfig` proto, defining the DLP transformation configuration |
| `tinkEncryptionKeySetJson` | When using Tink-based encryption, the wrapped keyset generated by `tinkey` |
| `keyMaterial` | The encryption key to use for the custom encryption module, as a string |
| `keyMaterialType` | One of `TINK_GCP_KEYSET_JSON`, `RAW_BASE64_KEY`, `RAW_UTF8_KEY`, `GCP_KMS_WRAPPED_KEY`. Default: `TINK_GCP_KEYSET_JSON` |
| `mainKmsKeyUri` | The Google Cloud KMS key to use for decrypting the Tink keyset or a `GCP_KMS_WRAPPED_KEY` |
| `valueTokenizerFactoryFullClassName` | The value tokenization class to use for non-DLP-based tokenization. Default: `com.google.cloud.solutions.autotokenize.encryptors.DaeadEncryptingValueTokenizer$DaeadEncryptingValueTokenizerFactory` |

**NOTE:** Provide only one of `dlpEncryptConfigJson`, `tinkEncryptionKeySetJson`, or `keyMaterial`.

The tokenize pipeline's DAG will look like the following:

![Encrypting Pipeline DAG](encryption_pipeline_dag.png)

### Verify encrypted result

Load the bulk tokenize pipeline's output file(s) into BigQuery to verify that all the columns specified using the `--tokenizeColumns` flag have been encrypted.

1. Create a BigQuery dataset for tokenized data.

   Replace <i><bigquery-region></i> with a region of your choice. Ensure that the BigQuery dataset's region or multi-region matches the location of the Cloud Storage bucket. See [considerations for batch loading data](https://cloud.google.com/bigquery/docs/batch-loading-data) for more information.

   ```shell script
   bq --location=<bigquery-region> \
   --project_id="${PROJECT_ID}" \
   mk --dataset tokenized_data
   ```

1. Load the tokenized data into a BigQuery table.

   ```shell script
   bq load \
   --source_format=AVRO \
   --project_id="${PROJECT_ID}" \
   "tokenized_data.TokenizedUserdata" \
   "gs://${TEMP_GCS_BUCKET}/encrypted/*"
   ```

1. Check some records to confirm that the `email` and `cc` fields have been encrypted.

   ```shell script
   bq query \
   --project_id="${PROJECT_ID}" \
   "SELECT first_name, encrypted_email, encrypted_cc FROM tokenized_data.TokenizedUserdata LIMIT 10"
   ```

## Cleaning up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, you can delete the project:

1. In the Cloud Console, go to the [**Manage resources** page](https://console.cloud.google.com/iam-admin/projects).
1. In the project list, select the project that you want to delete and then click **Delete** ![delete](bin_icon.png).
1. In the dialog, type the project ID and then click **Shut down** to delete the project.

## What's next

* Learn more about [Cloud DLP](https://cloud.google.com/dlp)
* Learn more about [Cloud KMS](https://cloud.google.com/kms)
* Learn about [Inspecting storage and databases for sensitive data](https://cloud.google.com/dlp/docs/inspecting-storage)
* Handling [De-identification and re-identification of PII in large-scale datasets using DLP](https://cloud.google.com/solutions/de-identification-re-identification-pii-using-cloud-dlp)

## Disclaimer

**License**: Apache 2.0

This is not an official Google product.