{"id":15221965,"url":"https://github.com/googlecloudplatform/dlp-pdf-redaction","last_synced_at":"2025-09-09T05:43:11.872Z","repository":{"id":38329275,"uuid":"397686712","full_name":"GoogleCloudPlatform/dlp-pdf-redaction","owner":"GoogleCloudPlatform","description":"This solution provides an automated, serverless way to redact sensitive data from PDF files using Google Cloud Services like Data Loss Prevention (DLP), Cloud Workflows, and Cloud Run.","archived":false,"fork":false,"pushed_at":"2025-03-22T03:27:22.000Z","size":291,"stargazers_count":53,"open_issues_count":6,"forks_count":27,"subscribers_count":76,"default_branch":"main","last_synced_at":"2025-03-30T15:42:40.563Z","etag":null,"topics":["bigquery","cloud","cloudfunctions","cloudrun","cloudstorage","cloudworkflows","datalossprevention","dlp","documents","gcp","mask","ocr","pdf","redaction","serverless","terraform","tesseract","workflows"],"latest_commit_sha":null,"homepage":"","language":"HCL","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GoogleCloudPlatform.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-08-18T17:40:32.000Z","updated_at":"2025-01-10T00:55:18.000Z","dependencies_parsed_at":"2024-04-17T03:40:58.229Z","dependency_job_id":"1a4f63c8-b714-47c5-9487-5fd5cd0b831b","html_url":"https://github.com/GoogleCloudPlatform/dlp-pdf-redaction","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fdlp-pdf-
redaction","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fdlp-pdf-redaction/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fdlp-pdf-redaction/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fdlp-pdf-redaction/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GoogleCloudPlatform","download_url":"https://codeload.github.com/GoogleCloudPlatform/dlp-pdf-redaction/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248027408,"owners_count":21035594,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigquery","cloud","cloudfunctions","cloudrun","cloudstorage","cloudworkflows","datalossprevention","dlp","documents","gcp","mask","ocr","pdf","redaction","serverless","terraform","tesseract","workflows"],"created_at":"2024-09-28T15:09:28.220Z","updated_at":"2025-04-09T11:09:42.997Z","avatar_url":"https://github.com/GoogleCloudPlatform.png","language":"HCL","readme":"# Solution Guide\nThis solution provides an automated, serverless way to redact sensitive data from PDF files using Google Cloud Services like [Data Loss Prevention (DLP)](https://cloud.google.com/dlp), [Cloud Workflows](https://cloud.google.com/workflows), and [Cloud Run](https://cloud.google.com/run).\n\n\n## Solution Architecture Diagram\nThe image below describes the solution architecture of the pdf redaction process.\n\n![Architecture Diagram](./architecture-diagram.png)\n\n## 
Workflow Steps\nThe workflow consists of the following steps:\n1. The user uploads a PDF file to a GCS bucket\n1. A Workflow is triggered by [Eventarc](https://cloud.google.com/eventarc/docs). This workflow orchestrates the PDF file redaction, which consists of the following steps:\n    - Split the PDF into single pages, convert pages into images, and store them in a working bucket\n    - Redact each image using the DLP Image Redact API\n    - Reassemble the PDF file from the list of redacted images and store it in GCS (output bucket)\n    - Write redacted quotes (findings) to BigQuery\n\n# Deploy PDF Redaction app\nThe `terraform` folder contains the code needed to deploy the PDF Redaction application.\n\n## What resources are created?\nMain resources:\n- Workflow\n- Cloud Run services for each component, each with its own service account and permissions\n  1. `pdf-spliter` - Splits the PDF into single-page image files\n  1. `dlp-runner` - Runs each page file through DLP to redact sensitive information\n  1. `pdf-merger` - Reassembles the pages into a single PDF\n  1. `findings-writer` - Writes findings into BigQuery\n- Cloud Storage buckets\n  - *Input Bucket* - bucket where the original file is stored\n  - *Working Bucket* - bucket where all temporary files are stored throughout the different workflow stages\n  - *Output Bucket* - bucket where the redacted file is stored\n- DLP template where InfoTypes and rules are specified. You can modify the `dlp.tf` file to specify your own INFO_TYPES and Rule Sets (refer to the [Terraform documentation for DLP templates](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/data_loss_prevention_inspect_template))\n- BigQuery dataset and table where findings will be written\n\n## How to deploy?\nThe following steps should be executed in Cloud Shell in the Google Cloud Console.\n\n### 1. 
Create a project and enable billing\nFollow the steps in [this guide](https://cloud.google.com/resource-manager/docs/creating-managing-projects).\n\n### 2. Get the code\nClone this GitHub repository and go to the root of the repository.\n\n```\ngit clone https://github.com/GoogleCloudPlatform/dlp-pdf-redaction\ncd dlp-pdf-redaction\n```\n\n### 3. Build images for Cloud Run\nYou will first need to build the Docker images for each microservice.\n\n```\nPROJECT_ID=[YOUR_PROJECT_ID]\nPROJECT_NUMBER=$(gcloud projects list --filter=\"PROJECT_ID=$PROJECT_ID\" --format=\"value(PROJECT_NUMBER)\")\nREGION=us-central1\nDOCKER_REPO_NAME=pdf-redaction-docker-repo\nCLOUD_BUILD_SERVICE_ACCOUNT=cloudbuild-sa\n\n# Enable required APIs\ngcloud services enable cloudbuild.googleapis.com artifactregistry.googleapis.com --project $PROJECT_ID\n\n# Create a Docker image repo to store the app's Docker images\ngcloud artifacts repositories create $DOCKER_REPO_NAME --repository-format=docker --description=\"PDF Redaction Docker Image repository\" --project $PROJECT_ID --location=$REGION\n\n# Create Service Account for CloudBuild and grant required roles\ngcloud iam service-accounts create $CLOUD_BUILD_SERVICE_ACCOUNT \\\n  --description=\"Service Account for CloudBuild created by PDF Redaction solution\" \\\n  --display-name=\"CloudBuild SA (PDF Redaction)\"\ngcloud projects add-iam-policy-binding $PROJECT_ID \\\n  --member=\"serviceAccount:$CLOUD_BUILD_SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com\" \\\n  --role=\"roles/cloudbuild.serviceAgent\"\ngcloud projects add-iam-policy-binding $PROJECT_ID \\\n  --member=\"serviceAccount:$CLOUD_BUILD_SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com\" \\\n  --role=\"roles/storage.objectUser\"\n\n# Build docker images of the app and store them in artifact registry repo\ngcloud builds submit \\\n  --config ./build-app-images.yaml \\\n  --substitutions _REGION=$REGION,_DOCKER_REPO_NAME=$DOCKER_REPO_NAME \\\n  
--service-account=projects/$PROJECT_ID/serviceAccounts/$CLOUD_BUILD_SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com \\\n  --default-buckets-behavior=regional-user-owned-bucket \\\n  --project $PROJECT_ID\n```\nNote: If you receive a pop-up for permissions, you can authorize gcloud to request your credentials and make a GCP API call.\n\n\nThe above command will build 4 Docker images and push them into the Artifact Registry repository created earlier. Run the following command and confirm that the images are present in the repository.\n\n```\ngcloud artifacts docker images list $REGION-docker.pkg.dev/$PROJECT_ID/$DOCKER_REPO_NAME\n```\n\n### 4. Deploy the infrastructure using Terraform\n\nThis Terraform deployment requires the following variables.\n\n- project_id            = \"YOUR_PROJECT_ID\"\n- region                = \"YOUR_REGION\"\n- docker_repo_name      = \"DOCKER_REPO_NAME\"\n- wf_region             = \"YOUR_WORKFLOW_REGION\"\n\nFrom the root folder of this repo, run the following commands:\n```\nexport TF_VAR_project_id=$PROJECT_ID\nexport TF_VAR_region=$REGION\nexport TF_VAR_wf_region=$REGION\nexport TF_VAR_docker_repo_name=$DOCKER_REPO_NAME\n\nterraform -chdir=terraform init\nterraform -chdir=terraform apply -auto-approve\n```\n\n**Notes:**\n  * If you get an error related to `eventarc` or `workflows` provisioning, wait a couple of minutes and rerun the `terraform -chdir=terraform apply -auto-approve` command. Explanation: Terraform enables some services like `eventarc` and `workflows` that might take a couple of minutes to finish provisioning resources and configuring permissions. Simply re-running the apply command should fix the issue.\n  * Region and Workflow region both default to `us-central1`. If you wish to deploy the resources in a different region, specify the `region` and the `wf_region` variables (i.e., using `TF_VAR_region` and `TF_VAR_wf_region`). 
Cloud Workflows is only available in specific regions; for more information, check the [documentation](https://cloud.google.com/workflows/docs/locations).\n  * If you come across an issue, please check the [Issues section](https://github.com/GoogleCloudPlatform/dlp-pdf-redaction/issues). If your issue is not listed there, please report it as a new issue.\n\n\n\n### 5. Take note of Terraform Outputs\n\nOnce Terraform finishes provisioning all resources, you will see its outputs. Please take note of the `input_bucket` and `output_bucket` buckets. Files uploaded to the `input_bucket` bucket will be automatically processed, and the redacted files will be written to the `output_bucket` bucket.\nIf you missed the outputs from the first run, you can list them by running:\n\n```\nterraform -chdir=terraform output\n```\n\n### 6. Test\n\nUse the command below to upload the test file into the `input_bucket`. After a few seconds, you should see a redacted PDF file in the `output_bucket`.\n```\ngsutil cp ./test_file.pdf [INPUT_BUCKET_FROM_OUTPUT e.g. gs://pdf-input-bucket-xxxx]\n```\n\nIf you are curious about what happens behind the scenes, try the following:\n- Check out the redacted file in the `output_bucket`.\n\n  ```\n  gsutil ls [OUTPUT_BUCKET_FROM_OUTPUT e.g. gs://pdf-output-bucket-xxxx]\n  ```\n\n- Download the redacted PDF file, open it with your preferred PDF reader, and search for text in the PDF file.\n- Look into [Cloud Workflows](https://console.cloud.google.com/workflows) in the GCP web console. 
You will see that a workflow execution was triggered when you uploaded the file to GCS.\n- Explore the `pdf_redaction_xxxx` dataset in BigQuery and check out the metadata that was inserted into the `findings` table.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgooglecloudplatform%2Fdlp-pdf-redaction","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgooglecloudplatform%2Fdlp-pdf-redaction","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgooglecloudplatform%2Fdlp-pdf-redaction/lists"}