https://github.com/umermansoor/autoscan

High Quality Documents OCR

# AutoScan

AutoScan is a tool that converts PDF files to Markdown using AI. It extracts text, handles images, and even decodes handwriting to produce clean and structured Markdown documents.

## Features

- **Accurate Conversion**: Faithfully translates complex PDF layouts into Markdown without losing formatting.
- **Image Transcription**: Provides descriptive summaries for images instead of embedding them.
- **Handwriting OCR**: Converts handwritten text into Markdown.
- **Multi-language Support**: Handles multiple languages.
- **Metadata Extraction**: Extracts titles, authors, and other metadata.

![Example 1](https://private-user-images.githubusercontent.com/862952/395720191-296f93c4-8f04-4771-887c-08c45fdd1d95.png)

![Example 2](https://private-user-images.githubusercontent.com/862952/395720236-44d3ea28-2ca8-4d86-ab79-29683e5529c1.png)

## Installation

To install the required dependencies, run:

```sh
poetry install
```

### Install `poppler`

#### On Mac

```sh
brew install poppler
```

#### On Linux (Debian/Ubuntu)

```sh
sudo apt-get install poppler-utils
```

### To Test

To run tests:

```sh
poetry run pytest tests/
```

## Set `OPENAI_API_KEY`

### On Mac/Linux

```sh
export OPENAI_API_KEY=your_api_key
```
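
If the key is missing, the failure may only surface once the first model request is made. A minimal sketch of an early check follows; `require_api_key` is a hypothetical helper written for this illustration and is not part of AutoScan.

```python
import os


def require_api_key(env=os.environ) -> str:
    """Return the OpenAI API key, or raise a clear error if it is missing.

    Hypothetical helper for illustration; AutoScan does not ship this function.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running AutoScan."
        )
    return key
```

Calling `require_api_key()` at the top of a script, before any call to `autoscan`, turns a missing key into an immediate and readable error.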

## Usage

To run the example script:

```sh
python example_usage.py
```

### Example

Here is an example of how to use AutoScan programmatically:

```python
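import asyncio

from autoscan.autoscan import autoscan


async def main():
    # Path to the PDF you want to convert
    pdf_path = "path/to/your/pdf_file.pdf"
    # autoscan is a coroutine, so it must be awaited
    output = await autoscan(pdf_path)
    # The aggregated Markdown for the whole document
    print(output.markdown)


asyncio.run(main())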
```

## How It Works

1. **Convert PDF to images**: Each page of the PDF is rendered as an image.
2. **Process images with an LLM**: A language model (`gpt-4o` by default) transcribes each page image into Markdown.
3. **Aggregate Markdown**: The Markdown generated for each page is combined into a single file.
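
Steps 2 and 3 can be sketched as follows. This is an illustration under stated assumptions, not AutoScan's actual implementation: `pages_to_markdown` and `fake_transcribe` are hypothetical names, the language model call is replaced by a stand-in, and processing pages concurrently is an assumption about how such a pipeline could be arranged.

```python
import asyncio
from typing import Awaitable, Callable, List


async def pages_to_markdown(
    page_images: List[str],
    transcribe_page: Callable[[str], Awaitable[str]],
) -> str:
    """Transcribe every page image and join the results in page order."""
    # asyncio.gather returns results in the order of its inputs,
    # so the pages stay in sequence even if they finish out of order.
    page_markdown = await asyncio.gather(
        *(transcribe_page(image) for image in page_images)
    )
    # Step 3: aggregate the per-page Markdown into one document.
    return "\n\n".join(md.strip() for md in page_markdown)


async def fake_transcribe(image_path: str) -> str:
    """Stand-in for the language model call (hypothetical)."""
    return f"# Transcript of {image_path}"


result = asyncio.run(
    pages_to_markdown(["page_1.png", "page_2.png"], fake_transcribe)
)
print(result)
```

Passing the transcription function in as an argument keeps the aggregation logic testable without a network call or an API key.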

## Configuration

Configure the model and other options through the keyword arguments of the `autoscan` function:

```python
async def autoscan(
    pdf_path: str,
    model_name: str = "gpt-4o",
    transcribe_images: Optional[bool] = True,
    output_dir: Optional[str] = None,
    temp_dir: Optional[str] = None,
    cleanup_temp: bool = True
) -> AutoScanOutput:
    ...
```
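
As a sketch of overriding the defaults, the snippet below collects keyword arguments from the signature above in one dictionary and forwards them to a conversion coroutine. `OPTIONS` and `convert_with_options` are hypothetical names introduced for this example, the values are illustrative, and the comment on `transcribe_images` reflects an assumption about what that flag does.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

# Keyword names come from the autoscan signature; the values are illustrative.
OPTIONS: Dict[str, Any] = {
    "model_name": "gpt-4o",
    "transcribe_images": False,  # assumed to turn off image descriptions
    "output_dir": "output",
    "cleanup_temp": True,
}


async def convert_with_options(
    convert: Callable[..., Awaitable[Any]],
    pdf_path: str,
    options: Dict[str, Any] = OPTIONS,
) -> Any:
    """Forward the shared options to the given conversion coroutine."""
    return await convert(pdf_path, **options)
```

With AutoScan installed, this would be called as `asyncio.run(convert_with_options(autoscan, "path/to/your/pdf_file.pdf"))` after `from autoscan.autoscan import autoscan`.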

## Output

The `autoscan` function returns an `AutoScanOutput` object with the following fields:

- `completion_time`: Time taken to complete the conversion.
- `markdown_file`: Path to the generated Markdown file.
- `markdown`: The aggregated Markdown content.
- `input_tokens`: Number of input tokens used.
- `output_tokens`: Number of output tokens generated.
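
A short sketch of reading these fields follows. `summarize` is a hypothetical helper, and a `SimpleNamespace` with made-up values stands in for a real result so the snippet runs without calling the API. The unit of `completion_time` is not stated here, so the value is printed as-is.

```python
from types import SimpleNamespace


def summarize(output) -> str:
    """Build a one-line report from the fields listed above."""
    total_tokens = output.input_tokens + output.output_tokens
    return (
        f"{output.markdown_file}: {len(output.markdown)} characters, "
        f"{total_tokens} tokens "
        f"({output.input_tokens} in, {output.output_tokens} out), "
        f"completed in {output.completion_time}"
    )


# Stand-in for a real AutoScanOutput; all values are invented for illustration.
sample = SimpleNamespace(
    completion_time=12.5,
    markdown_file="output/report.md",
    markdown="# Report\n",
    input_tokens=1200,
    output_tokens=300,
)
print(summarize(sample))
```

Tracking `input_tokens` and `output_tokens` per document is a simple way to estimate API cost across a batch of conversions.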

## Examples

You can find example Markdown files generated by AutoScan in the `examples` directory:

- [examples/helloworld2.md](./examples/helloworld2.md)
- [examples/helloworld.md](./examples/helloworld.md)