https://github.com/benderscript/dlpdatascraper


Host: GitHub
URL: https://github.com/benderscript/dlpdatascraper
Owner: BenderScript
License: apache-2.0
Created: 2024-01-13T20:14:46.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2024-01-17T18:14:39.000Z (about 2 years ago)
Last Synced: 2025-05-03T11:02:46.633Z (10 months ago)
Language: Python
Size: 21.5 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# DLP Data Scraper and Generator

DLP Data Scraper and Generator is a Python tool designed to scrape and generate data for data loss prevention (DLP) purposes. It has two modules:

* It fetches DLP test sample data from specified URLs, saves the data in text format, and then converts these text files into PDFs.
* It uses an OpenAI Assistant to generate mock DLP data.

The output is suitable for benchmarking DLP systems, or for testing generative AI large language models (GenAI LLMs) for prompt injection.
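
As a rough illustration of what benchmarking against the generated text files could look like, the sketch below measures how many sample files a detector flags. It is not part of this project: `detection_rate` and `SSN_PATTERN` are hypothetical names, and the single regular expression only stands in for a real DLP system, which would expose its own scanning interface.

```python
import re
from pathlib import Path

# Hypothetical stand-in for a DLP detector: one regex for US Social Security
# numbers in the common 3-2-4 digit layout. Not part of DLP Data Scraper.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def detection_rate(sample_dir, pattern=SSN_PATTERN):
    """Return the fraction of .txt files in sample_dir that the pattern flags."""
    files = sorted(Path(sample_dir).glob("*.txt"))
    if not files:
        # No samples means nothing was detected; avoid dividing by zero.
        return 0.0
    flagged = sum(
        1 for f in files if pattern.search(f.read_text(encoding="utf-8"))
    )
    return flagged / len(files)
```

Calling `detection_rate("text_data")` after a scraping run would then give a single number to compare across detectors, assuming the text files land in that directory as in the usage example.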

## Features

- Web scraping from specified URLs.
- Data extraction and saving in text format.
- Conversion of text data to PDF format, ideal for benchmarking DLP systems or GenAI LLMs.
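
The "saving in text format" step can be pictured with a minimal sketch like the one below. This is an assumption about the general shape of such a step, not the project's actual code: `save_samples`, its arguments, and the one-file-per-category layout are all hypothetical.

```python
from pathlib import Path


def save_samples(samples, text_dir):
    """Write each named group of samples to <text_dir>/<name>.txt, one per line.

    `samples` maps a category name to a list of sample strings.
    Hypothetical helper for illustration; not the project's API.
    """
    out_dir = Path(text_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, values in samples.items():
        # Keep file names filesystem-safe: replace unusual characters with "_".
        safe_name = "".join(c if c.isalnum() or c in "-_" else "_" for c in name)
        path = out_dir / f"{safe_name}.txt"
        path.write_text("\n".join(values) + "\n", encoding="utf-8")
        written.append(path)
    return written
```

For example, `save_samples({"Credit Card": ["4111 1111 1111 1111"]}, "text_data")` would create `text_data/Credit_Card.txt` containing that one placeholder value.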

## Installation

To install DLP Data Scraper, clone the repository and install the required packages:

```bash
git clone https://github.com/BenderScript/DLPDataScraper.git
cd DLPDataScraper/dlp_data_scraper
pip3 install -r requirements.txt
```

## Usage

To use the DLP Data Scraper:

```python
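from dlp_data_scraper import Umbrella

# Output directories for the generated PDF and text files
pdf_data = "pdf_data"
text_data = "text_data"
# Page that lists the DLP test sample data to scrape
url = ('https://support.umbrella.com/hc/en-us/articles/4402023980692-Data-Loss-Prevention-DLP-Test-Sample-Data-for'
       '-Built-In-Data-Identifiers')

# Initialize the scraper
scraper = Umbrella(url=url, text_data=text_data, pdf_data=pdf_data)
html_content = scraper.initialize_browser()
# Scrape the sample data from the URL
scraped_data = scraper.scrape_data()
# Save the scraped data as text files
scraper.save_data_to_files()
# Convert the text files to PDFs
scraper.convert_txt_to_pdf()

print("Scraping and conversion to PDF completed.")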
```

This example demonstrates initializing the scraper, scraping data from the specified URL, saving the data to text files, and then converting those text files to PDFs in specified directories.
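
After a run, it can be useful to confirm that every text file has a PDF counterpart. The sketch below is one way to check this. It assumes each PDF shares its file stem with the text file it was converted from, which this README does not state, and `missing_pdfs` is a hypothetical helper rather than part of the package.

```python
from pathlib import Path


def missing_pdfs(text_dir, pdf_dir):
    """Return stems of .txt files in text_dir with no same-named .pdf in pdf_dir.

    Assumes foo.txt is converted to foo.pdf; adjust if the naming differs.
    """
    pdf_stems = {p.stem for p in Path(pdf_dir).glob("*.pdf")}
    return sorted(
        p.stem for p in Path(text_dir).glob("*.txt") if p.stem not in pdf_stems
    )
```

With the directory names from the example above, `missing_pdfs("text_data", "pdf_data")` returns an empty list when every text file was converted.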

## Contributing

Contributions to DLP Data Scraper are welcome. Please feel free to submit pull requests or open issues to improve the project.

---
