https://github.com/umermansoor/autoscan
High Quality Documents OCR
- Host: GitHub
- URL: https://github.com/umermansoor/autoscan
- Owner: umermansoor
- Created: 2024-12-08T18:29:08.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2024-12-12T07:22:36.000Z (about 1 month ago)
- Last Synced: 2024-12-12T08:24:51.733Z (about 1 month ago)
- Language: Python
- Size: 119 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# AutoScan
AutoScan is a tool that converts PDF files to Markdown using AI. It extracts text, handles images, and even decodes handwriting to produce clean and structured Markdown documents.
## Features
- **Accurate Conversion**: Faithfully translates complex PDF layouts into Markdown without losing formatting.
- **Image Transcription**: Provides descriptive summaries for images instead of embedding them.
- **Handwriting OCR**: Converts handwritten text into Markdown.
- **Multi-language Support**: Handles multiple languages.
- **Metadata Extraction**: Extracts titles, authors, and other metadata.

![Example 1](https://private-user-images.githubusercontent.com/862952/395720191-296f93c4-8f04-4771-887c-08c45fdd1d95.png)
![Example 2](https://private-user-images.githubusercontent.com/862952/395720236-44d3ea28-2ca8-4d86-ab79-29683e5529c1.png)
## Installation
To install the required dependencies, run:
```sh
poetry install
```
### Install `poppler`
#### On Mac
```sh
brew install poppler
```
#### On Linux
```sh
sudo apt-get install poppler-utils
```
### To Test
To run tests:
```sh
poetry run pytest tests/
```
## Set `OPENAI_API_KEY`
### On Mac/Linux
```sh
export OPENAI_API_KEY=your_api_key
```
## Usage
To run the example script:
```sh
python example_usage.py
```
### Example
Here is an example of how to use AutoScan programmatically:
```python
import asyncio

from autoscan.autoscan import autoscan


async def main():
    pdf_path = "path/to/your/pdf_file.pdf"
    # Convert the PDF and print the aggregated Markdown
    output = await autoscan(pdf_path)
    print(output.markdown)


asyncio.run(main())
```
## How It Works
1. Convert PDF to Images: Each page of the PDF is converted into an image.
2. Process Images with LLM: The images are processed by a language model to generate Markdown.
3. Aggregate Markdown: The generated Markdown for each page is aggregated into a single file.
## Configuration
You can configure the model and other parameters in the `autoscan` function:
```python
async def autoscan(
    pdf_path: str,
    model_name: str = "gpt-4o",
    transcribe_images: Optional[bool] = True,
    output_dir: Optional[str] = None,
    temp_dir: Optional[str] = None,
    cleanup_temp: bool = True,
) -> AutoScanOutput:
    ...
```
## Output
The output of the `autoscan` function includes:
- `completion_time`: Time taken to complete the conversion.
- `markdown_file`: Path to the generated Markdown file.
- `markdown`: The aggregated Markdown content.
- `input_tokens`: Number of input tokens used.
- `output_tokens`: Number of output tokens generated.
## Examples
You can find example Markdown files generated by AutoScan in the `examples` directory:
- [examples/helloworld2.md](./examples/helloworld2.md)
- [examples/helloworld.md](./examples/helloworld.md)
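
To make the three steps under "How It Works" and the fields under "Output" more concrete, here is a minimal sketch of how such a pipeline could be wired together. It is not AutoScan's actual implementation. `run_pipeline`, `SketchOutput`, `PageResult`, and the two injected callables are hypothetical names for illustration. The field types, the default file name `output.md`, and the concurrent processing of pages are also assumptions. The stubs stand in for the real PDF rendering and model calls, so the sketch runs without `poppler` or an API key, and it does not write the Markdown file to disk.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Awaitable, Callable, List


@dataclass
class SketchOutput:
    """Stand-in result object; field names follow the Output section."""
    completion_time: float
    markdown_file: str
    markdown: str
    input_tokens: int
    output_tokens: int


@dataclass
class PageResult:
    """Hypothetical per-page result: Markdown plus token counts."""
    markdown: str
    input_tokens: int
    output_tokens: int


async def run_pipeline(
    pdf_path: str,
    pdf_to_images: Callable[[str], List[str]],
    image_to_markdown: Callable[[str], Awaitable[PageResult]],
    markdown_file: str = "output.md",
) -> SketchOutput:
    start = time.perf_counter()
    # Step 1: convert each PDF page into an image (paths, in page order).
    images = pdf_to_images(pdf_path)
    # Step 2: send every page image to the model; gather preserves page order.
    pages = await asyncio.gather(*(image_to_markdown(img) for img in images))
    # Step 3: aggregate the per-page Markdown into one document.
    markdown = "\n\n".join(page.markdown for page in pages)
    return SketchOutput(
        completion_time=time.perf_counter() - start,
        markdown_file=markdown_file,  # recorded only; this sketch writes no file
        markdown=markdown,
        input_tokens=sum(page.input_tokens for page in pages),
        output_tokens=sum(page.output_tokens for page in pages),
    )


# Stubs that replace the real PDF rendering and LLM call in this sketch.
def fake_pdf_to_images(pdf_path: str) -> List[str]:
    return ["page-1.png", "page-2.png"]


async def fake_image_to_markdown(image_path: str) -> PageResult:
    return PageResult(markdown=f"# {image_path}", input_tokens=10, output_tokens=5)


if __name__ == "__main__":
    result = asyncio.run(
        run_pipeline("doc.pdf", fake_pdf_to_images, fake_image_to_markdown)
    )
    print(result.markdown)
    print(result.input_tokens, result.output_tokens)
```

Passing the two steps in as callables keeps the aggregation logic testable on its own. In a real setup they would be replaced by a PDF-to-image converter and a call to the configured model.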