https://github.com/elchemista/doc_dig
DocDig is an Elixir wrapper around the Rust-based extractous library, exposing high-performance document and web-page text extraction via Rustler NIFs.
- Host: GitHub
- URL: https://github.com/elchemista/doc_dig
- Owner: elchemista
- License: apache-2.0
- Created: 2025-05-11T14:58:05.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2025-05-12T11:31:01.000Z (about 1 year ago)
- Last Synced: 2025-07-05T00:37:28.127Z (12 months ago)
- Topics: elixir, parsing, pdf, rag, rust
- Language: Elixir
- Homepage:
- Size: 119 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# DocDig
**DocDig** is an Elixir wrapper around the Rust-based [`extractous`](https://github.com/yobix-ai/extractous) library, exposing high-performance document and web-page text extraction via Rustler NIFs.
## Features
* Extract text from **local files**: PDF, DOCX, HTML, Markdown, etc.
* Fetch and extract from **URLs**.
* Extract from in-memory **binaries**.
* Perform **OCR** on image-only PDFs or images via Tesseract (customizable language).
* Optional **bang (`!`)** variants that raise on errors for concise workflows.
* Precompiled NIFs with [`rustler_precompiled`](https://github.com/philss/rustler_precompiled) support for zero‑toolchain installs.
## Installation
Add to your `mix.exs`:
```elixir
def deps do
  [
    {:doc_dig, github: "elchemista/doc_dig", branch: "master"}
  ]
end
```
Then fetch and compile:
```bash
mix deps.get
mix compile
```
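If no precompiled NIF matches your platform, `rustler_precompiled` can be told to build from source instead. The fragment below is a minimal sketch, not something this README documents: it assumes the OTP app name is `:doc_dig` and that the NIF module uses `RustlerPrecompiled` with its default options. Building from source also requires a Rust toolchain and the `:rustler` package in your own dependencies.

```elixir
# config/config.exs
import Config

# Assumption: DocDig's OTP app is named :doc_dig.
# Forces rustler_precompiled to compile the NIF locally instead of downloading it.
config :rustler_precompiled, :force_build, doc_dig: true
```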
## Usage Examples
```elixir
# Extract from a local Markdown file:
{:ok, {text, metadata}} = DocDig.extract_file("README.md")
IO.puts(text)
IO.inspect(metadata)

# Raise on failure:
{text, _meta} = DocDig.extract_file!("README.md")

# Extract from a URL:
{:ok, {html_text, _}} = DocDig.extract_url("https://example.com")

# Extract from an in-memory binary (e.g. downloaded via HTTPoison).
# HTTPoison.get/1 returns a %HTTPoison.Response{} struct; the raw bytes are in :body.
{:ok, %HTTPoison.Response{body: file_bytes}} = HTTPoison.get("https://example.com/sample.docx")
{:ok, {doc_text, _}} = DocDig.extract_bytes(file_bytes)

# Force OCR on a scanned PDF (German language):
{:ok, {ocr_text, _}} = DocDig.extract_file_ocr("invoice_scanned.pdf", "deu")
```
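The non-bang calls above can be combined into a small fallback: try plain extraction first and switch to OCR only when no usable text comes back. The following is a rough sketch, not part of DocDig. `ExtractFallback` is a hypothetical helper, and the `{:error, reason}` shape for failures is an assumption, since the README only shows the `{:ok, {text, metadata}}` success tuple. The extraction functions are passed in as arguments so the module does not depend on DocDig directly.

```elixir
defmodule ExtractFallback do
  @moduledoc """
  Hypothetical helper (not part of DocDig): run a plain extractor and
  fall back to an OCR extractor when the extracted text is blank.
  """

  # `extract` and `ocr` are 1-arity functions that take a path and return
  # {:ok, {text, metadata}} or {:error, reason}. The error shape is assumed.
  def run(path, extract, ocr) do
    case extract.(path) do
      {:ok, {text, meta}} when is_binary(text) ->
        if String.trim(text) == "" do
          # Nothing readable was extracted (e.g. an image-only PDF): try OCR.
          ocr.(path)
        else
          {:ok, {text, meta}}
        end

      {:error, _reason} = error ->
        # Pass failures through unchanged; the caller decides what to do.
        error
    end
  end
end

# Wiring it to DocDig with the functions shown above:
#
#   ExtractFallback.run(
#     "invoice_scanned.pdf",
#     &DocDig.extract_file/1,
#     &DocDig.extract_file_ocr(&1, "deu")
#   )
```

Passing the extractors as functions keeps the OCR language choice at the call site and makes the fallback easy to test with stubs.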
## Contributing
1. Fork the repo
2. Create a feature branch: `git checkout -b feature/my-addition`
3. Run tests: `mix test`
4. Submit a pull request
## Credits
* **extractous** by Yobix AI and contributors
* **Rustler** by the Rustler team
* **Tesseract OCR** for OCR support
* **Elixir** and **Erlang/OTP** community
## License
Released under the Apache-2.0 license. See [LICENSE](LICENSE).