https://github.com/elkronos/pdfscribe


Host: GitHub
URL: https://github.com/elkronos/pdfscribe
Owner: elkronos
License: mit
Created: 2025-03-01T04:48:09.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-03-01T04:49:06.000Z (over 1 year ago)
Last Synced: 2025-03-01T05:25:31.553Z (over 1 year ago)
Language: R
Size: 0 Bytes
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# PDFScribe

PDFScribe is an R-based package designed to process PDF documents. It extracts text (and images, if needed), samples content from the PDFs, and automatically builds structured requests for AI analysis. The package supports processing PDFs stored locally or on Amazon S3, leverages parallel processing to improve performance, and incorporates robust error handling and logging.

## Features

- **PDF Extraction:** Reads and validates PDF files using extraction tools.

- **Content Sampling:** Samples pages using reservoir sampling and extracts key “anchor” text.

- **AI Prompt Generation:** Automatically constructs structured prompts for AI analysis.

- **Local & S3 Integration:** Processes PDFs from local directories and S3.

- **Parallel Processing:** Utilizes multiple cores for concurrent PDF processing.

- **Robust Logging & Error Handling:** Provides detailed logs and retry mechanisms for API calls and file operations.

- **Comprehensive Testing:** Includes a suite of user acceptance tests (UAT) written with the `testthat` framework.
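
The README names reservoir sampling for page selection but does not show how PDFScribe implements it. As a rough illustration only, the sketch below applies the classic single-pass reservoir algorithm (Algorithm R) to page indices in base R. The function name `reservoir_sample_pages` and its parameters are hypothetical and are not taken from the package.

```r
# Hypothetical helper, NOT part of PDFScribe's documented API.
# Picks k page indices uniformly at random from 1..n_pages in a single pass.
reservoir_sample_pages <- function(n_pages, k, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)

  # With k or fewer pages, every page is kept.
  if (n_pages <= k) return(seq_len(n_pages))

  # Fill the reservoir with the first k pages.
  reservoir <- seq_len(k)

  # Page i replaces a random reservoir slot with probability k / i.
  for (i in (k + 1):n_pages) {
    j <- sample.int(i, 1)
    if (j <= k) reservoir[j] <- i
  }

  # Return the selected pages in document order.
  sort(reservoir)
}
```

For example, `reservoir_sample_pages(n_pages = 120, k = 5, seed = 42)` would return five distinct page numbers between 1 and 120. The single-pass form is most useful when the page count is not known up front; when it is known, `sample.int(n_pages, k)` gives an equivalent uniform sample.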

## Installation

1. **Clone the Repository:**

   ```bash
   git clone https://github.com/elkronos/pdfscribe.git
   cd pdfscribe
   ```
