https://github.com/umermansoor/autoscan

High Quality Documents OCR

Last synced: 3 months ago

Host: GitHub
URL: https://github.com/umermansoor/autoscan
Owner: umermansoor
License: mit
Created: 2024-12-08T18:29:08.000Z (7 months ago)
Default Branch: main
Last Pushed: 2025-03-29T16:18:32.000Z (3 months ago)
Last Synced: 2025-03-29T17:25:57.204Z (3 months ago)
Language: Python
Size: 11.3 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# AutoScan

AutoScan is a tool that converts PDF files to Markdown using AI. It extracts text, handles images, and even decodes handwriting to produce clean and structured Markdown documents.

## Features

- **Accurate Conversion**: Faithfully translates complex PDF layouts into Markdown without losing formatting.

- **Image Transcription**: Provides descriptive summaries for images instead of embedding them.

- **Handwriting OCR**: Converts handwritten text into Markdown.

- **Multi-language Support**: Handles multiple languages.

- **Metadata Extraction**: Extracts titles, authors, and other metadata.

![Example 1](https://private-user-images.githubusercontent.com/862952/395720191-296f93c4-8f04-4771-887c-08c45fdd1d95.png)

![Example 2](https://private-user-images.githubusercontent.com/862952/395720236-44d3ea28-2ca8-4d86-ab79-29683e5529c1.png)

## Installation

To install the required dependencies, run:

```sh
poetry install
```

### Install `poppler`

#### On Mac

```sh
brew install poppler
```

#### On Linux (Debian/Ubuntu)

```sh
sudo apt-get install poppler-utils
```

### To Test

To run tests:

```sh
poetry run pytest tests/
```

## Set `OPENAI_API_KEY`

### On Mac/Linux

```sh
export OPENAI_API_KEY=your_api_key
```
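
If you call AutoScan from your own script, it can help to fail early when the key is missing. The snippet below is a minimal sketch of such a check; `require_api_key` is a hypothetical helper for illustration and is not part of AutoScan.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return OPENAI_API_KEY, or raise a clear error if it is missing or blank.

    Hypothetical helper, not part of AutoScan. `env` defaults to os.environ
    and can be replaced with a plain dict in tests.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running AutoScan."
        )
    return key
```

Calling `require_api_key()` at the top of a script turns a missing key into an immediate, readable error.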

## Usage

To run the example script:

```sh
python example_usage.py
```

### Example

Here is an example of how to use AutoScan programmatically:

```python
import asyncio

from autoscan.autoscan import autoscan


async def main():
    pdf_path = "path/to/your/pdf_file.pdf"
    output = await autoscan(pdf_path)
    print(output.markdown)


asyncio.run(main())
```

## How It Works

1. Convert PDF to Images: Each page of the PDF is converted into an image.

2. Process Images with LLM: The images are processed by a language model to generate Markdown.

3. Aggregate Markdown: The generated Markdown for each page is aggregated into a single file.
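
The three steps above can be sketched as a small pipeline. This is an illustration of the flow only, not AutoScan's actual implementation: `render_pages` and `transcribe_page` are hypothetical callables standing in for the PDF-to-image step and the LLM call, and joining pages with a blank line is an assumed aggregation rule.

```python
from typing import Callable, Sequence


def pdf_to_markdown(
    pdf_path: str,
    render_pages: Callable[[str], Sequence[bytes]],
    transcribe_page: Callable[[bytes], str],
) -> str:
    """Illustrative pipeline; both callables are hypothetical placeholders."""
    # Step 1: convert the PDF into one image per page.
    page_images = render_pages(pdf_path)

    # Step 2: ask the model for a Markdown fragment for each page image.
    fragments = [transcribe_page(image).strip() for image in page_images]

    # Step 3: aggregate the non-empty fragments, in page order,
    # separated by blank lines (an assumed convention).
    return "\n\n".join(f for f in fragments if f) + "\n"
```

Passing the two steps in as callables keeps the sketch runnable without poppler or an API key; in a test they can be replaced with simple fakes.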

## Configuration

You can configure the model and other parameters in the `autoscan` function:

```python
async def autoscan(
    pdf_path: str,
    model_name: str = "gpt-4o",
    transcribe_images: Optional[bool] = True,
    output_dir: Optional[str] = None,
    temp_dir: Optional[str] = None,
    cleanup_temp: bool = True
) -> AutoScanOutput:
    ...
```
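
As a sketch of how these parameters might be overridden, the helper below calls any coroutine that matches the signature above with non-default options. `convert_without_image_summaries` is a hypothetical wrapper, and the reading of `transcribe_images=False` as "skip image summaries" is an assumption based on the parameter name.

```python
from typing import Any, Awaitable, Callable


async def convert_without_image_summaries(
    scan: Callable[..., Awaitable[Any]],
    pdf_path: str,
    output_dir: str,
) -> Any:
    """Hypothetical wrapper: call an autoscan-compatible coroutine with overrides."""
    return await scan(
        pdf_path,
        model_name="gpt-4o",      # the default model from the signature above
        transcribe_images=False,  # assumed to disable image summaries
        output_dir=output_dir,    # where the Markdown file should be written
        cleanup_temp=True,        # keep the default temp-file cleanup
    )
```

With the package installed, usage would look like `asyncio.run(convert_without_image_summaries(autoscan, "path/to/your/pdf_file.pdf", "output"))`, where `autoscan` is imported from `autoscan.autoscan`.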

## Output

The output of the `autoscan` function includes:

- `completion_time`: Time taken to complete the conversion.

- `markdown_file`: Path to the generated Markdown file.

- `markdown`: The aggregated Markdown content.

- `input_tokens`: Number of input tokens used.

- `output_tokens`: Number of output tokens generated.
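
The fields above can be combined into a short run summary, for example to log token usage. The sketch below uses a stand-in dataclass, `ExampleOutput`, that mirrors the listed field names so it runs without the package; the field types are assumptions, and `summarize` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class ExampleOutput:
    """Stand-in mirroring the documented fields; types are assumed."""
    completion_time: float
    markdown_file: str
    markdown: str
    input_tokens: int
    output_tokens: int


def summarize(output) -> str:
    """Build a one-line summary from any object exposing the documented fields."""
    total_tokens = output.input_tokens + output.output_tokens
    return (
        f"{output.markdown_file}: {len(output.markdown)} characters, "
        f"{total_tokens} tokens ({output.input_tokens} in, "
        f"{output.output_tokens} out), "
        f"completion_time={output.completion_time}"
    )
```

Because `summarize` only reads attributes, it should also accept the object returned by `autoscan`, provided the attribute names match the list above.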

## Examples

You can find example Markdown files generated by AutoScan in the `examples` directory:

- [examples/helloworld2.md](./examples/helloworld2.md)

- [examples/helloworld.md](./examples/helloworld.md)
