https://github.com/umermansoor/autoscan
High Quality Documents OCR
- Host: GitHub
- URL: https://github.com/umermansoor/autoscan
- Owner: umermansoor
- License: mit
- Created: 2024-12-08T18:29:08.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-29T16:18:32.000Z (12 months ago)
- Last Synced: 2025-03-29T17:25:57.204Z (12 months ago)
- Language: Python
- Size: 11.3 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# AutoScan
AutoScan is a tool that converts PDF files to Markdown using AI. It extracts text, handles images, and even decodes handwriting to produce clean and structured Markdown documents.
## Features
- **Accurate Conversion**: Faithfully translates complex PDF layouts into Markdown without losing formatting.
- **Image Transcription**: Provides descriptive summaries for images instead of embedding them.
- **Handwriting OCR**: Converts handwritten text into Markdown.
- **Multi-language Support**: Handles multiple languages.
- **Metadata Extraction**: Extracts titles, authors, and other metadata.


## Installation
To install the required Python dependencies with Poetry, run:
```sh
poetry install
```
### Install `poppler`
#### On Mac
```sh
brew install poppler
```
#### On Linux (Debian/Ubuntu)
```sh
sudo apt-get install poppler-utils
```
### Running Tests
To run the test suite:
```sh
poetry run pytest tests/
```
## Set `OPENAI_API_KEY`
### On Mac/Linux
```sh
export OPENAI_API_KEY=your_api_key
```
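A missing key may only show up as an error once the first model request is made, so a small pre-flight check can be useful. The following is a minimal sketch; `require_api_key` is a hypothetical helper and not part of AutoScan.

```python
import os


def require_api_key(env=None):
    """Return the OpenAI API key from the environment, or raise a clear error."""
    # Allow a mapping to be injected for testing; default to the real environment.
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running AutoScan."
        )
    return key
```

Calling `require_api_key()` at the start of a script fails fast with a readable message instead of partway through a conversion.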
## Usage
To run the example script:
```sh
python example_usage.py
```
### Example
Here is an example of how to use AutoScan programmatically:
```python
import asyncio
from autoscan.autoscan import autoscan

async def main():
    pdf_path = "path/to/your/pdf_file.pdf"
    output = await autoscan(pdf_path)
    print(output.markdown)

asyncio.run(main())
```
## How It Works
1. Convert PDF to Images: Each page of the PDF is converted into an image.
2. Process Images with LLM: The images are processed by a language model to generate Markdown.
3. Aggregate Markdown: The generated Markdown for each page is aggregated into a single file.
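The three steps above can be sketched roughly as follows. This is an illustration of the flow, not AutoScan's actual implementation: `run_pipeline`, `fake_render`, and `fake_transcribe` are hypothetical names, the rendering and model calls are replaced by stand-ins so the sketch runs without poppler or an API key, and processing pages concurrently is an assumption.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_pipeline(
    pdf_path: str,
    render_pages: Callable[[str], List[str]],
    transcribe_page: Callable[[str], Awaitable[str]],
) -> str:
    # Step 1: convert the PDF into one image per page.
    page_images = render_pages(pdf_path)
    # Step 2: send each page image to the model; gather preserves page order.
    pages = await asyncio.gather(*(transcribe_page(img) for img in page_images))
    # Step 3: aggregate the per-page Markdown into a single document.
    return "\n\n".join(page.strip() for page in pages)


# Stand-ins (assumptions) for the real rendering and LLM steps.
def fake_render(pdf_path: str) -> List[str]:
    return [f"{pdf_path}#page{n}" for n in (1, 2, 3)]


async def fake_transcribe(image: str) -> str:
    return f"## {image}\n"


markdown = asyncio.run(run_pipeline("doc.pdf", fake_render, fake_transcribe))
print(markdown)
```

Passing the rendering and transcription steps in as callables keeps the aggregation logic testable on its own.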
## Configuration
You can configure the model and other parameters through the arguments of the `autoscan` function:
```python
async def autoscan(
    pdf_path: str,
    model_name: str = "gpt-4o",
    transcribe_images: Optional[bool] = True,
    output_dir: Optional[str] = None,
    temp_dir: Optional[str] = None,
    cleanup_temp: bool = True
) -> AutoScanOutput:
    ...
```
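As a hedged illustration, a call that overrides some of these defaults might look like the sketch below. `convert_with_options` is a hypothetical wrapper, the `"output"` directory name is made up, and the effects described in the comments are inferred from the parameter names rather than confirmed behavior.

```python
import asyncio


async def convert_with_options(pdf_path: str):
    # Imported lazily so this sketch can be loaded without AutoScan installed.
    from autoscan.autoscan import autoscan

    return await autoscan(
        pdf_path,
        model_name="gpt-4o",      # the default from the signature
        transcribe_images=False,  # presumably skips image descriptions
        output_dir="output",      # hypothetical directory name
        cleanup_temp=False,       # presumably keeps intermediate files
    )


# Usage (requires AutoScan, poppler, and OPENAI_API_KEY):
# result = asyncio.run(convert_with_options("path/to/your/pdf_file.pdf"))
```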
## Output
The output of the `autoscan` function includes:
- `completion_time`: Time taken to complete the conversion.
- `markdown_file`: Path to the generated Markdown file.
- `markdown`: The aggregated Markdown content.
- `input_tokens`: Number of input tokens used.
- `output_tokens`: Number of output tokens generated.
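The fields listed above lend themselves to a short run summary. The sketch below uses a hypothetical stand-in class, `OutputLike`, so it runs without AutoScan; the field types, and the assumption that `completion_time` is in seconds, are guesses rather than documented behavior.

```python
from dataclasses import dataclass


@dataclass
class OutputLike:
    """Hypothetical stand-in with the documented fields; types are assumed."""
    completion_time: float
    markdown_file: str
    markdown: str
    input_tokens: int
    output_tokens: int


def summarize(output) -> str:
    # Combine both token counts into one usage figure.
    total = output.input_tokens + output.output_tokens
    return (
        f"{output.markdown_file}: {len(output.markdown)} chars, "
        f"{total} tokens, {output.completion_time:.1f}s"
    )


sample = OutputLike(2.5, "out/doc.md", "# Title\n", 1200, 300)
print(summarize(sample))
```

Because `summarize` only reads attributes, it should also accept the object returned by `autoscan`, provided the fields have the assumed types.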
## Examples
You can find example Markdown files generated by AutoScan in the `examples` directory:
- [examples/helloworld2.md](./examples/helloworld2.md)
- [examples/helloworld.md](./examples/helloworld.md)