{"id":25110202,"url":"https://github.com/codad5/pdfz","last_synced_at":"2026-03-17T02:49:40.514Z","repository":{"id":275360084,"uuid":"918707641","full_name":"codad5/pdfz","owner":"codad5","description":"Your Rust PDF Document Text Extractor","archived":false,"fork":false,"pushed_at":"2025-02-13T17:39:10.000Z","size":119,"stargazers_count":11,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-02T09:31:09.937Z","etag":null,"topics":["pdf","pdf-extractor","pdfextraction","rabbitmq","rust"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/codad5.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-18T16:44:45.000Z","updated_at":"2025-02-13T22:40:49.000Z","dependencies_parsed_at":"2025-02-01T23:36:16.605Z","dependency_job_id":null,"html_url":"https://github.com/codad5/pdfz","commit_stats":null,"previous_names":["codad5/pdfz"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/codad5/pdfz","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codad5%2Fpdfz","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codad5%2Fpdfz/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codad5%2Fpdfz/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codad5%2Fpdfz/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/codad5","download_url":"https://codeload.github.com/codad5/pdfz/tar.gz
/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codad5%2Fpdfz/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263895890,"owners_count":23526749,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pdf","pdf-extractor","pdfextraction","rabbitmq","rust"],"created_at":"2025-02-08T00:36:22.645Z","updated_at":"2026-03-17T02:49:40.472Z","avatar_url":"https://github.com/codad5.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PDFz\n\nDeveloped by [codad5](https://github.com/codad5)\n\nPDFz streamlines the extraction and processing of text from PDF files so that you can manage and analyze large volumes of documents effortlessly. By leveraging a microservices architecture, PDFz achieves high performance through:\n\n- **Extractor Service (Rust):** Processes PDF files and extracts text using configurable extraction engines. 
While Tesseract OCR is supported, PDFz is designed to work with multiple extraction methods.\n- **API Service (Express \u0026 TypeScript):** Provides endpoints for file uploads, processing, progress tracking, and interacting with advanced extraction and model-based processing.\n- **Redis:** Caches and tracks file and model processing progress.\n- **RabbitMQ:** Manages message queuing between services.\n- **Model-Based Processing:** Integrate with engines like Ollama for advanced text processing using locally hosted large language models (LLMs).\n\n---\n\n## Features\n\n- **File Upload:** Send PDF files to the API.\n- **Multi-Engine File Processing:** Choose your extraction engine—whether Tesseract OCR, Ollama, or others—to process PDFs asynchronously.\n- **OCR \u0026 Model-Based Extraction:**  \n  - Use Tesseract OCR for traditional optical character recognition.\n  - Leverage model-based extraction (e.g., using Ollama) for advanced processing such as summarization, question-answering, or generating insights.\n- **Progress Tracking:** Monitor file processing progress in real time.\n- **Processed Content Retrieval:** Get back JSON with extracted content.\n- **Model Management:**  \n  - Pull and download a specified model if it isn’t available locally.\n  - Track model download progress.\n  - List available models for advanced extraction needs.\n\n---\n\n## Architecture\n\n- **API Service (Express \u0026 TypeScript):**  \n  Provides endpoints for:\n  - Web Interface files (`/web`)\n  - Uploading files (`/upload`)\n  - Initiating file processing (`/process/:id`)\n  - Checking file processing progress (`/progress/:id`)\n  - Retrieving processed content (`/content/:id`)\n  - Managing models (pulling via `/model/pull`, tracking progress with `/model/progress/:name`, and listing models with `/models`)\n\n- **Extractor Service (Rust):**  \n  Processes queued PDF files using the chosen extraction engine. 
It supports both traditional OCR (e.g., Tesseract) and model-based extraction (e.g., via Ollama) and interacts with Redis and RabbitMQ for job tracking.\n\n- **Redis:**  \n  Maintains state and progress information for file and model processing.\n\n- **RabbitMQ:**  \n  Facilitates job dispatching between the API and Extractor services.\n\n- **Ollama \u0026 Other Engines:**  \n  Provides advanced processing capabilities by serving locally hosted language models. The system is extensible to support additional extraction or processing engines in the future.\n\n---\n\n## API Endpoints\n\n### Welcome\n\n```http\nGET /\n```\n\nReturns a welcome message:\n```\nPDFz server is life 🔥🔥\n```\n\n---\n\n### Web Interface\n\n```http\nGET /web\n```\n- Shows the web interface \n\n---\n\n### Upload a File\n\n```http\nPOST /upload\n```\n\n**Request:** Multipart form-data containing a `pdf` file.\n\n**Response Example:**\n\n```json\n{\n  \"success\": true,\n  \"message\": \"File uploaded successfully\",\n  \"data\": {\n    \"id\": \"file-id\",\n    \"filename\": \"file.pdf\",\n    \"path\": \"/shared_storage/upload/pdf/file.pdf\",\n    \"size\": 12345\n  }\n}\n```\n\n---\n\n### Process a File\n\n```http\nPOST /process/:id\n```\n\n**Request:** JSON body with processing options:\n- `startPage` (default: 1)\n- `pageCount` (default: 0)\n- `priority` (default: 1)\n- `engine` — extraction engine (e.g., `\"tesseract\"` or `\"ollama\"`)\n- `model` — required if the selected engine is model-based (e.g., `\"ollama\"`)\n\nExamples:\n\nUsing Tesseract:\n```json\n{\n  \"startPage\": 1,\n  \"pageCount\": 10,\n  \"priority\": 1,\n  \"engine\": \"tesseract\"\n}\n```\n\nUsing Ollama:\n```json\n{\n  \"startPage\": 1,\n  \"pageCount\": 10,\n  \"priority\": 1,\n  \"engine\": \"ollama\",\n  \"model\": \"llama3.2-vision\"  // \":latest\" will be appended if no tag is provided\n}\n```\n\n**Response Example:**\n\n```json\n{\n  \"success\": true,\n  \"message\": \"File processing started\",\n  \"data\": {\n  
  \"id\": \"file-id\",\n    \"file\": \"file.pdf\",\n    \"options\": {\n      \"startPage\": 1,\n      \"pageCount\": 10,\n      \"priority\": 1\n    },\n    \"status\": \"queued\",\n    \"progress\": 0,\n    \"queuedAt\": \"2023-10-01T12:00:00Z\"\n  }\n}\n```\n\n---\n\n### Track File Processing Progress\n\n```http\nGET /progress/:id\n```\n\n**Response Example:**\n\n```json\n{\n  \"success\": true,\n  \"message\": \"Progress retrieved successfully\",\n  \"data\": {\n    \"id\": \"file-id\",\n    \"progress\": 50,\n    \"status\": \"processing\"\n  }\n}\n```\n\n---\n\n### Retrieve Processed Content\n\n```http\nGET /content/:id\n```\n\n**Response Example:**\n\n```json\n{\n  \"success\": true,\n  \"message\": \"Processed content retrieved successfully\",\n  \"data\": {\n    \"id\": \"file-id\",\n    \"content\": [\n      {\n        \"page_num\": 1,\n        \"text\": \"Text from page 1.\"\n      },\n      {\n        \"page_num\": 2,\n        \"text\": \"Text from page 2.\"\n      }\n    ],\n    \"status\": \"completed\"\n  }\n}\n```\n\n---\n\n### Pull a Model (for Model-Based Extraction)\n\n```http\nPOST /model/pull\n```\n\n**Request:** JSON body with the model name:\n\n```json\n{\n  \"model\": \"model-name\"\n}\n```\n\n**Response Examples:**\n\n- **If the model already exists:**\n\n  ```json\n  {\n    \"success\": true,\n    \"message\": \"Model already exists locally\",\n    \"model\": \"model-name\",\n    \"status\": \"exists\"\n  }\n  ```\n\n- **If the model is queued for download:**\n\n  ```json\n  {\n    \"success\": true,\n    \"message\": \"Model download queued successfully\",\n    \"model\": \"model-name\",\n    \"status\": \"queued\",\n    \"progress\": 0\n  }\n  ```\n\n---\n\n### Track Model Download Progress\n\n```http\nGET /model/progress/:name\n```\n\n**Response Example:**\n\n```json\n{\n  \"success\": true,\n  \"message\": \"Model progress retrieved successfully\",\n  \"data\": {\n    \"name\": \"model-name\",\n    \"progress\": 75,\n    \"status\": 
\"downloading\"\n  }\n}\n```\n\n---\n\n### List Available Models\n\n```http\nGET /models\n```\n\n**Response Example:**\n\n```json\n{\n  \"success\": true,\n  \"message\": \"Models retrieved successfully\",\n  \"data\": {\n    \"models\": [\n      {\n        \"name\": \"model1:latest\",\n        \"size\": \"1.2GB\",\n        \"modified_at\": \"2023-10-01T12:00:00Z\"\n      },\n      {\n        \"name\": \"model2:latest\",\n        \"size\": \"900MB\",\n        \"modified_at\": \"2023-09-28T08:30:00Z\"\n      }\n    ]\n  }\n}\n```\n\n---\n\n## Setup\n\n### Prerequisites\n\n#### For Docker Deployment:\n- Docker \u0026 Docker Compose\n\n#### For Local Development:\n\n**API Service (Node.js \u0026 Express):**\n- Node.js \u0026 npm\n- Redis\n- RabbitMQ\n\n**Extractor Service (Rust):**\n- Rust \u0026 Cargo\n- Redis\n- RabbitMQ\n- At least one extraction engine (e.g., Tesseract OCR or an alternative)\n\n**Ollama Service (for model-based extraction):**\n- Docker container (or a local installation of Ollama)\n\n---\n\n### Installation\n\n1. **Clone the Repository:**\n\n   ```sh\n   git clone https://github.com/codad5/pdfz.git\n   cd pdfz\n   ```\n\n2. **Create an `.env` File:**\n\n   ```sh\n   cp .env.example .env\n   ```\n\n3. **Update Environment Variables:**  \n   Modify the `.env` file to set your ports, RabbitMQ and Redis credentials, and extraction/model settings.\n\n4. 
**Build and Start the Services:**\n\n   ```sh\n   docker-compose up --build\n   ```\n\n---\n\n## Services \u0026 Environment Variables\n\n### Extractor Service (Rust)\n\n- `RUST_LOG=debug`  \n- `REDIS_URL` — Redis connection URL  \n- `RABBITMQ_URL` — RabbitMQ connection URL (e.g., `amqp://user:pass@rabbitmq:5672`)  \n- `EXTRACTOR_PORT` — Port for the Extractor Service  \n- `SHARED_STORAGE_PATH` — Mount point for file storage  \n- `TRAINING_DATA_PATH` — Path to training data for extraction engines  \n- `OLLAMA_BASE_URL` — Base URL for Ollama (e.g., `http://ollama:11434`)  \n- `OLLAMA_BASE_PORT` — Ollama port (e.g., `11434`)  \n- `OLLAMA_BASE_HOST` — Host for Ollama\n\n### API Service (Node.js)\n\n- `NODE_ENV=development`  \n- `REDIS_URL` — Redis connection URL  \n- `RABBITMQ_URL` — RabbitMQ connection URL  \n- `API_PORT` — Port for the API service  \n- `SHARED_STORAGE_PATH` — Mount point for file storage  \n- `RABBITMQ_EXTRACTOR_QUEUE` — Queue name for file extraction requests  \n- `OLLAMA_BASE_URL` — Base URL for Ollama  \n- `OLLAMA_BASE_PORT` — Ollama port  \n- `OLLAMA_BASE_HOST` — Host for Ollama\n\n---\n\n## Docker Compose Setup\n\nCheck the `docker-compose.yml` file to see the defined services.\n\n---\n\n## Repository\n\nFor more details, visit the [GitHub repository](https://github.com/codad5/pdfz).\n\n---\n\n## Contributing\n\n1. Fork the repository and create a new branch.  \n2. Make changes and test locally.  \n3. Submit a pull request.\n\n---\n\n## License\n\nMIT License\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodad5%2Fpdfz","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcodad5%2Fpdfz","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodad5%2Fpdfz/lists"}