Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/aditya-shrivastavv/sensihide-pdf
SensiHidePDF is an end-to-end solution for redacting sensitive information from PDF files (specially resumes) in bulk. It makes use of google data loss prevention API
https://github.com/aditya-shrivastavv/sensihide-pdf
cloud-run cloud-workflows data-loss-prevention dlp eventarc google-cloud-platform privacy resume terraform
Last synced: 11 days ago
JSON representation
SensiHidePDF is an end-to-end solution for redacting sensitive information from PDF files (specially resumes) in bulk. It makes use of google data loss prevention API
- Host: GitHub
- URL: https://github.com/aditya-shrivastavv/sensihide-pdf
- Owner: aditya-shrivastavv
- Created: 2024-08-20T12:27:52.000Z (3 months ago)
- Default Branch: master
- Last Pushed: 2024-10-29T02:01:09.000Z (22 days ago)
- Last Synced: 2024-10-29T02:21:48.712Z (22 days ago)
- Topics: cloud-run, cloud-workflows, data-loss-prevention, dlp, eventarc, google-cloud-platform, privacy, resume, terraform
- Language: HCL
- Homepage:
- Size: 123 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# SensiHidePDF 🕵️♂️
An end-to-end solution to hide sensitive information in PDF files, primarily resumes.
![architecture diagram google cloud](./public/arch-1.png)
## Architecture 🏗️
The application is built natively on **Google Cloud Platform**, Leveraging various services like **Cloud Run**, **Cloud Workflows**, **Cloud Storage: Bucket**, **EventArc**, **Data Loss Prevention API** and **BigQuery**.
The entire application can be provisioned using **Terraform**, making it easy to deploy and manage.
### Here is how it works: 🤔
- Whenever a PDF file is uploaded on the **Cloud Storage Bucket** (input_bucket), an EventArc event is triggered.
- That runs a **Cloud Workflows** which does the following steps in sequence:
- First cloud run service downloads that PDF file and extracts text from it.
- Second service gets that text data and it sends it to **Data Loss Prevention API** to detect sensitive information. (For now, it is hardcoded to detect EMAIL_ADDRESS and PHONE_NUMBER)
- Third service is given the response from DLP API. It then downloads the PDF file and redacts the sensitive information from it. The redacted PDF is then uploaded to another **Cloud Storage Bucket** (output_bucket).
- Finally, the last service stores the sensitive information in **BigQuery** for further analysis.
- That's it! 🎉### Services 🛠️
| Service Name | Source Code | Infrastructure |
| --- | --- | --- |
| PDF To Text | [Code](./src/pdf-to-text/) | [Terraform](./terraform/redact-pdf/pdf-to-text.tf) |
| DLP Runner | [Code](./src/dlp-runner/) | [Terraform](./terraform/redact-pdf/dlp-runner.tf) |
| Redactor | [Code](./src/redactor/) | [Terraform](./terraform/redact-pdf/redact-pdf.tf) |
| Findings Writer | [Code](./src/findings-to-bigquery/) | [Terraform](./terraform/redact-pdf/findings-writer.tf) |Leave a ⭐ if you like this project!
Secret message
There are other solutions out there solving the same problem, namely from GoogleCloudPlatform itself. But there is a huge difference between my implementation and there's. There's implementation converts PDF into images and then gets the images redacted from the DLP API, but the drawback of this approach is that the redacted PDF generated after merging the images is not readable by screen readers, or even searchable making it less accessible at large scale.
I took a different approach, I didn't run DLP API on images. Instead I ran it directly on the text, upon receiving the findings, I did redactions by myself using a python library. This way the PDF remains searchable and ATS friendly.
If you wish you can keep this a secret 🤫