https://github.com/ebowman/encrypted-pdf-finder

Traverses a filesystem looking for PDFs that require a password to open

Host: GitHub
URL: https://github.com/ebowman/encrypted-pdf-finder
Owner: ebowman
License: mit
Created: 2024-05-19T10:33:38.000Z (7 months ago)
Default Branch: main
Last Pushed: 2024-05-20T20:49:54.000Z (7 months ago)
Last Synced: 2024-05-21T11:00:00.529Z (7 months ago)
Language: Scala
Size: 78.1 KB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Encrypted PDF Finder

## Overview

This project, **Encrypted PDF Finder**, provides a concurrent pipeline for processing PDF files within a directory structure. The main functionalities include traversing directories, filtering for PDF files, and identifying password-protected PDFs. The pipeline leverages concurrency to maximize throughput and minimize processing time.

IMO, there are two interesting parts of this program:

1. The `parallelFindPDFs` method in the `PdfFileWorkflow` trait. This method uses a concurrent queue to traverse directories and enqueue PDF files for processing. It is a good example of how to traverse a file system concurrently, and it can be much faster than a single-threaded walk with Java's `Files.walkFileTree`.

2. The `ConcurrentQueuePipelining` trait is a somewhat novel way to pipeline processing steps so that each step can be executed in parallel to improve the throughput of the process.
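To make the first point concrete, here is a minimal sketch of concurrent directory traversal. It is not the project's `parallelFindPDFs`; the name `findPdfsConcurrently`, the fixed thread pool, and the `None` end-of-stream marker are assumptions chosen for illustration. Each directory is listed as its own task, and a counter of unfinished directories tells the last task when to close the stream.

```scala
import java.io.File
import java.nio.file.Files
import java.util.concurrent.{Executors, LinkedBlockingQueue}
import java.util.concurrent.atomic.AtomicInteger

/** Hypothetical helper (not part of the project): lists PDF files under `root`
  * using `workers` threads. Hits arrive as Some(file); one None ends the stream. */
def findPdfsConcurrently(root: File, workers: Int): LinkedBlockingQueue[Option[File]] =
  val out = new LinkedBlockingQueue[Option[File]]()
  val pool = Executors.newFixedThreadPool(workers)
  // Directories scheduled but not yet fully listed; starts at 1 for the root.
  val pending = new AtomicInteger(1)

  def visit(dir: File): Unit =
    try
      // listFiles returns null for unreadable directories, so fall back to empty.
      val entries = Option(dir.listFiles()).getOrElse(Array.empty[File])
      for entry <- entries do
        if Files.isSymbolicLink(entry.toPath) then () // skip links to avoid cycles
        else if entry.isDirectory then
          pending.incrementAndGet()
          pool.execute(() => visit(entry))
        else if entry.isFile && entry.getName.toLowerCase.endsWith(".pdf") then
          out.put(Some(entry))
    finally
      // The last directory to finish closes the stream and releases the pool.
      if pending.decrementAndGet() == 0 then
        out.put(None)
        pool.shutdown()

  pool.execute(() => visit(root))
  out
```

A caller would drain the returned queue until it sees `None`. Incrementing the counter before scheduling a subdirectory ensures it cannot reach zero while work is still outstanding.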

## Project Structure

The project consists of three main components:

1. **PasswordPdfPipelineApp** - The main application initializes and runs the pipeline.

2. **ConcurrentQueuePipelining** - A trait providing core functionality for creating concurrent pipelines.

3. **PdfFileWorkflow** - A trait that defines the workflow for processing PDF files, including methods for traversing directories and identifying password-protected PDFs.

## Files

### 1. `PasswordPdfPipelineApp.scala`

This file contains the main application which sets up and runs the pipeline.

#### Key Components

- **Main Object**: Initializes the application and sets the logging level.

- **Pipeline Setup**: Defines the root directory, available processors for sizing the thread pools, and sets up the pipeline stages.

- **Execution**: Runs the pipeline and prints the processing time.

#### Usage

To run the application, simply execute the `PasswordPdfPipelineApp` object. It will process PDF files from the specified root directory and print out the password-protected PDFs.

### 2. `ConcurrentQueuePipelining.scala`

This trait provides functionality to pipeline operations on data items concurrently using a queue-based approach.

#### Key Components

- **PipelineQueue**: An implicit class that extends `LinkedBlockingQueue` with methods for chaining operations using the `>>` operator.

- **PipelineItem**: An implicit class that allows individual items to be processed through the pipeline stages.

- **Concurrency Execution**: Method to execute a processing function concurrently across multiple threads.

#### Usage

To use this trait, include it in your class or object and define your pipeline stages using the `>>` operator. Example usage is provided within the trait documentation.
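For readers who want a feel for the idea before opening the source, the following is a rough sketch of a queue-based stage operator. It is an assumption-laden illustration, not the trait's actual code: it uses a Scala 3 `extension` rather than an implicit class, and the alias `Q`, the stage signature, and the `None` sentinel convention are all hypothetical.

```scala
import java.util.concurrent.{Executors, LinkedBlockingQueue}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical alias: a queue of items where None marks the end of the stream.
type Q[A] = LinkedBlockingQueue[Option[A]]

extension [A](in: Q[A])
  /** Hypothetical stage operator: runs `stage` on `workers` threads, reading
    * from `in` and returning the queue that the stage writes its results to. */
  def >>[B](stage: (A, Q[B]) => Unit, workers: Int): Q[B] =
    val out = new LinkedBlockingQueue[Option[B]]()
    val pool = Executors.newFixedThreadPool(workers)
    val active = new AtomicInteger(workers)
    for _ <- 1 to workers do
      pool.execute { () =>
        try
          var item = in.take()
          while item.isDefined do
            stage(item.get, out)
            item = in.take()
          in.put(None) // hand the sentinel back so sibling workers also stop
        finally
          // The last worker to exit signals the end of the stream downstream.
          if active.decrementAndGet() == 0 then
            out.put(None)
            pool.shutdown()
      }
    out
```

With this sketch, `source >> (stageA, 4) >> (stageB, 2)` would chain two stages, each with its own worker threads. Output order is not preserved when a stage has more than one worker.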

### 3. `PdfFileWorkflow.scala`

This trait provides methods for processing PDF files within a directory structure, focusing on traversing directories, filtering for PDF files, and identifying password-protected PDFs.

#### Key Components

- **enqueuePasswordProtectedPdfs**: Checks if a PDF file is password protected and enqueues it if true.

- **parallelFindPDFs**: Recursively searches for PDF files starting from a given directory and enqueues them.

#### Usage

Include this trait in your class or object and use the provided methods as stages in your pipeline.
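As a rough illustration of the password check, the sketch below asks PDFBox 3.x to open a file and treats an `InvalidPasswordException` as "requires a password". The helper name `requiresPassword` is hypothetical, and the project's `enqueuePasswordProtectedPdfs` may detect protection differently.

```scala
import java.io.File
import org.apache.pdfbox.Loader
import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException
import scala.util.Using

/** Hypothetical helper: true if `pdf` cannot be opened without a password.
  * Other I/O errors (for example, a corrupt file) are left to propagate. */
def requiresPassword(pdf: File): Boolean =
  try
    // Loader.loadPDF tries the empty password; success means no password is needed.
    Using.resource(Loader.loadPDF(pdf))(_ => false)
  catch
    case _: InvalidPasswordException => true
```

A PDF protected only by an owner password still opens with the empty user password, so this sketch reports it as not requiring a password.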

## Example Usage

Below is an example usage of the pipeline setup within the main application.

```scala
import java.io.File
import java.util.concurrent.LinkedBlockingQueue
import ie.boboco.cqp.ConcurrentQueuePipelining
import ie.boboco.cqp.pdf.PdfFileWorkflow

object PasswordPdfPipelineApp extends App with ConcurrentQueuePipelining with PdfFileWorkflow {
  // Directory tree to scan
  val rootDir = new File("/Users/ebowman/src")
  // Used to size the thread pool of the password-check stage
  val coreCount = Runtime.getRuntime.availableProcessors()

  // Stage 1 finds PDFs under rootDir; stage 2 keeps the password-protected ones
  val encPdfQueue = rootDir >> parallelFindPDFs >> (enqueuePasswordProtectedPdfs, coreCount)

  // Drain the result queue until the first empty Option, which ends the stream
  Iterator.continually(encPdfQueue.take())
    .takeWhile(_.isDefined)
    .flatten
    .foreach(println)
}
```

This setup will start from the root directory, find all PDF files, check if they are password-protected, and print the password-protected ones.

## Dependencies

- **Java 8 or higher**

- **Scala 3.4.1**

- **Apache PDFBox 3.0.2** for PDF processing
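
For reference, the versions above would typically be declared in `build.sbt` along these lines. This is a sketch based only on the versions listed here; the repository's actual build file may differ and will also contain test and coverage settings.

```scala
// Sketch of the relevant build.sbt settings (assumed, not copied from the repository)
scalaVersion := "3.4.1"

libraryDependencies += "org.apache.pdfbox" % "pdfbox" % "3.0.2"
```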

## Building and Running

1. Ensure you have Java and Scala installed.

2. Clone the repository.

3. Compile the project using `sbt compile`.

4. Run the main application using `sbt "run [path]"` (the quotes make sbt pass the path to `run` as an argument).

5. To generate a coverage report: `sbt clean coverage test coverageReport` followed by `open $(find . -name index.html)`.

Here `[path]` must be a path to a directory to traverse. The application fails if you omit this argument (this keeps the automated test coverage at 100%). If you just want to see it work, try `$ sbt "run ."`. This prints the full path of every password-protected PDF file found within the directory tree under the given path (here, the current directory).

## Contributing

Contributions are welcome! Please fork the repository and create a pull request with your changes.

## License

This project is licensed under the MIT License.