https://github.com/aidalinfo/extract-kit
Powerful PDF data extraction library powered by AI vision models. Transform PDFs into structured, validated data using TypeScript, Zod, and AI providers like Scaleway and Ollama.
ai-sdk document-processing pdf pdf-extraction vision-llm
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/aidalinfo/extract-kit
- Owner: aidalinfo
- License: mit
- Created: 2025-08-10T16:54:11.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2025-08-10T17:22:09.000Z (about 2 months ago)
- Last Synced: 2025-08-10T18:29:43.745Z (about 2 months ago)
- Topics: ai-sdk, document-processing, pdf, pdf-extraction, vision-llm
- Language: TypeScript
- Homepage:
- Size: 302 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Extract Kit 🚀
Welcome to **Extract Kit** - a powerful, modern solution for document data extraction and conversion, powered by AI vision models! Transform your PDFs and Office documents into structured, validated data with ease.
> 🌟 **Inspired by [Sparrow](https://github.com/katanaml/sparrow)** - This project builds upon the innovative concepts from Katana ML's Sparrow framework, reimagined with modern TypeScript, Bun runtime, and enhanced AI vision capabilities.
## 🎯 What is Extract Kit?
Extract Kit is a monorepo containing a suite of production-ready TypeScript/Bun-based tools that leverage cutting-edge AI vision models and advanced parsing techniques to intelligently extract and convert data from various document formats.
## 🛠️ Available Packages
This repository contains the following open-source packages:
### 1. PDF Processor
A service that leverages cutting-edge AI vision models (Scaleway, Mistral, Ollama) to intelligently extract structured data from PDF documents. It's perfect for processing invoices, receipts, tables, or any custom document type.
- **NPM Package**: [`@aidalinfo/pdf-processor`](https://www.npmjs.com/package/@aidalinfo/pdf-processor)
- **Documentation**: [Read the PDF Processor README](./packages/pdf-processor/README.md)
### 2. Office to Markdown
A modern TypeScript library for converting Office documents (DOCX) to Markdown format. It's optimized for the Bun ecosystem and includes advanced support for mathematical equations (OMML) and tables.
- **NPM Package**: [`@aidalinfo/office-to-markdown`](https://www.npmjs.com/package/@aidalinfo/office-to-markdown)
- **Documentation**: [Read the Office to Markdown README](./packages/office-to-markdown/README.md)
## ✨ Key Features
- **🤖 AI-Powered Extraction**: Harness the power of Scaleway Pixtral, Mistral AI, Ollama LLaVA, and custom AI providers for PDF processing.
- **🔄 Advanced DOCX Conversion**: Convert `.docx` files to clean Markdown, preserving structure, styles, and even complex mathematical equations.
- **📄 Smart PDF Processing**: Automatically converts PDF pages to optimized images with Sharp to improve AI recognition.
- **⚡ Lightning Fast**: Built on Bun runtime with parallel worker processing for maximum performance.
- **🔒 Type-Safe**: Full TypeScript implementation with Zod schema validation for reliable data extraction.
- **📦 Dual Mode**: Use as a standalone REST API microservice or as a library in your existing Node.js/Bun projects.
## 🚀 Quick Start
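To give a feel for the type-safe extraction listed under Key Features, here is a minimal, dependency-free sketch of checking a model's JSON response before using it. It is not the `@aidalinfo/pdf-processor` API: the `Invoice` shape, `ParseResult`, and `parseInvoice` are hypothetical names invented for illustration, and hand-written checks stand in for the Zod schema validation the library describes.

```typescript
// Hypothetical shape for an extracted invoice (an assumption, not from the library).
interface Invoice {
  invoiceNumber: string;
  total: number;
  currency: string;
}

// Result type: either validated data or a list of human-readable problems.
type ParseResult<T> =
  | { success: true; data: T }
  | { success: false; errors: string[] };

// Validate untrusted JSON text (for example, a vision model's reply) field by field.
function parseInvoice(raw: string): ParseResult<Invoice> {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return { success: false, errors: ["response is not valid JSON"] };
  }
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return { success: false, errors: ["expected a JSON object"] };
  }

  const obj = value as Record<string, unknown>;
  const errors: string[] = [];

  const invoiceNumber = obj.invoiceNumber;
  if (typeof invoiceNumber !== "string" || invoiceNumber.length === 0) {
    errors.push("invoiceNumber: expected a non-empty string");
  }

  const total = obj.total;
  if (typeof total !== "number" || !Number.isFinite(total) || total < 0) {
    errors.push("total: expected a non-negative number");
  }

  const currency = obj.currency;
  if (typeof currency !== "string" || !/^[A-Z]{3}$/.test(currency)) {
    errors.push("currency: expected a three-letter uppercase code");
  }

  if (errors.length > 0) {
    return { success: false, errors };
  }
  // Every field passed its check above, so the casts are safe here.
  return {
    success: true,
    data: {
      invoiceNumber: invoiceNumber as string,
      total: total as number,
      currency: currency as string,
    },
  };
}
```

In a real project a Zod schema would replace the hand-written checks; the package README describes the actual API.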
For detailed instructions, please refer to the README file of the specific package you want to use:
- **For PDF Extraction**: [`packages/pdf-processor/README.md`](./packages/pdf-processor/README.md)
- **For DOCX Conversion**: [`packages/office-to-markdown/README.md`](https://github.com/aidalinfo/extract-kit/blob/main/packages/office-to-markdown/README.md)
## 🤝 Contributing
We welcome contributions! Contribution guidelines are coming soon. In the meantime, you can:
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## 📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🙏 Acknowledgments
- Built with [Bun](https://bun.sh/) - The fast JavaScript runtime.
- PDF Processor is powered by [Scaleway AI](https://www.scaleway.com/en/ai/), [Mistral AI](https://mistral.ai/), and [Ollama](https://ollama.ai/).
- Office to Markdown is inspired by Microsoft's [MarkItDown](https://github.com/microsoft/markitdown).

---

**Happy Extracting!** 🎉