https://github.com/elkronos/pdfscribe
- Host: GitHub
- URL: https://github.com/elkronos/pdfscribe
- Owner: elkronos
- License: MIT
- Created: 2025-03-01T04:48:09.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-01T04:49:06.000Z (over 1 year ago)
- Last Synced: 2025-03-01T05:25:31.553Z (over 1 year ago)
- Language: R
- Size: 0 Bytes
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# PDFScribe
PDFScribe is an R package for processing PDF documents. It extracts text (and images, if needed), samples content from the PDFs, and automatically builds structured requests for AI analysis. It handles PDFs stored locally or on Amazon S3, processes them in parallel to improve throughput, and logs its work while retrying failed API calls and file operations.
## Features
- **PDF Extraction:** Reads and validates PDF files using extraction tools.
- **Content Sampling:** Samples pages using reservoir sampling and extracts key “anchor” text.
- **AI Prompt Generation:** Automatically constructs structured prompts for AI analysis.
- **Local & S3 Integration:** Processes PDFs from local directories and S3.
- **Parallel Processing:** Utilizes multiple cores for concurrent PDF processing.
- **Robust Logging & Error Handling:** Provides detailed logs and retry mechanisms for API calls and file operations.
- **Comprehensive Testing:** Includes a user acceptance test (UAT) suite built on the `testthat` framework.
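To make the content sampling feature more concrete, here is a minimal sketch of reservoir sampling (Algorithm R) over page numbers in base R. The function name `reservoir_sample_pages` and its arguments are hypothetical and chosen for illustration; PDFScribe's actual sampling code may be organized differently.

```r
# Hypothetical helper, not part of PDFScribe's documented API.
# Picks k page numbers uniformly at random from 1..n_pages in a single pass.
reservoir_sample_pages <- function(n_pages, k) {
  stopifnot(n_pages >= 0, k >= 0)

  # Fewer pages than requested: keep every page.
  if (n_pages <= k) {
    return(seq_len(n_pages))
  }

  # Fill the reservoir with the first k pages.
  reservoir <- seq_len(k)

  # Page i replaces a random reservoir slot with probability k / i.
  for (i in (k + 1):n_pages) {
    j <- sample.int(i, 1)
    if (j <= k) {
      reservoir[j] <- i
    }
  }

  # Return pages in reading order.
  sort(reservoir)
}

set.seed(42)
sampled <- reservoir_sample_pages(n_pages = 120, k = 5)
print(sampled)
```

The single pass matters when the page count is not known up front, for example when pages are streamed from a remote object. When the count is known in advance, `sample.int(n_pages, k)` gives an equivalent uniform sample with less code.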
## Installation
1. **Clone the Repository:**
```bash
git clone https://github.com/elkronos/pdfscribe.git
cd pdfscribe