https://github.com/harubi/bolivar

High-performance PDF table extraction library. Bindings for Python and JVM.
jvm pdf pdf-parsing python rust table-extraction text-extraction

Last synced: 4 months ago
Host: GitHub
URL: https://github.com/harubi/bolivar
Owner: harubi
License: mit
Created: 2025-12-30T19:56:25.000Z (6 months ago)
Default Branch: master
Last Pushed: 2026-02-12T12:23:22.000Z (4 months ago)
Last Synced: 2026-02-12T21:22:31.150Z (4 months ago)
Topics: jvm, pdf, pdf-parsing, python, rust, table-extraction, text-extraction
Language: Rust
Homepage:
Size: 6.59 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

# bolivar

Fast PDF text and table extraction. Written in Rust, drop-in compatible with pdfminer and pdfplumber.

## Install

```sh
pip install bolivar
```

```kotlin
implementation("sa.ingenious:bolivar:1.2.0")
```

```toml
[dependencies]
bolivar-core = "1.2"
```

## Extract text

Pull all text from a PDF. The pdfplumber-compatible interface opens the file and iterates over its pages; the pdfminer-compatible interface returns the full text in one call. The Kotlin and Rust APIs follow the same pattern.

```python
import pdfplumber

with pdfplumber.open("doc.pdf") as pdf:
    for page in pdf.pages:
        print(page.extract_text())
```

```python
from pdfminer.high_level import extract_text

text = extract_text("doc.pdf")
```

```kotlin
import sa.ingenious.DocumentOptions
import sa.ingenious.bolivar

val doc = bolivar.open("doc.pdf", DocumentOptions {
    maxPages = 1
    layout {
        lineMargin = 0.5
        wordMargin = 0.1
    }
})
val text = doc.extractText()
```

```rust
use bolivar_core::high_level::extract_text;

fn main() -> bolivar_core::Result<()> {
    let data = std::fs::read("doc.pdf")?;
    let text = extract_text(&data, None)?;
    println!("{text}");
    Ok(())
}
```

## Extract tables

Detect and extract tabular data from each page. Bolivar returns structured tables with row and column counts, bounding boxes, and cell text so you can inspect or export them without manual parsing.

```python
import pdfplumber

with pdfplumber.open("doc.pdf") as pdf:
    for page in pdf.pages:
        for table in page.extract_tables():
            print(table)
```

```kotlin
import sa.ingenious.DocumentOptions
import sa.ingenious.bolivar

val doc = bolivar.open("doc.pdf", DocumentOptions {
    pages(1, 2)
})
val tables = doc.extractTables()
for (table in tables) {
    println("${table.rowCount}x${table.columnCount}")
}
```

```rust
use bolivar_core::high_level::{extract_tables_with_document, ExtractOptions};
use bolivar_core::pdfdocument::PDFDocument;
use bolivar_core::table::TableSettings;

fn main() -> bolivar_core::Result<()> {
    let data = std::fs::read("doc.pdf")?;
    let doc = PDFDocument::new(&data, "")?;
    let tables = extract_tables_with_document(
        &doc,
        ExtractOptions::default(),
        &TableSettings::default(),
    )?;
    Ok(())
}
```
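
As a minimal sketch of the export step, assume each table arrives as a list of rows whose cells are strings or `None`, which is the shape pdfplumber's `extract_tables()` returns. The standard `csv` module can then serialise it. The helper name `table_to_csv` and the sample data are hypothetical and not part of bolivar.

```python
import csv
import io


def table_to_csv(table):
    """Serialise one table (a list of rows of str-or-None cells) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # Empty cells may come back as None; write them as empty strings.
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Hypothetical sample standing in for one entry of page.extract_tables().
sample = [["name", "qty"], ["widget", "3"], ["gadget, large", None]]
csv_text = table_to_csv(sample)
print(csv_text)
```

Cells that contain commas are quoted by the `csv` writer, so the output stays loadable by spreadsheet tools.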

## Iterate pages

Walk through pages one at a time to read metadata like page number, dimensions, and a text preview. This is useful when you need to locate content across a large document before extracting specific pages.

```python
import pdfplumber

with pdfplumber.open("doc.pdf") as pdf:
    for page in pdf.pages:
        print(page.page_number, page.width, page.height)
```

```python
from pdfminer.high_level import extract_pages

for page in extract_pages("doc.pdf"):
    print(page.pageid, page.width, page.height)
```

```kotlin
import sa.ingenious.DocumentOptions
import sa.ingenious.bolivar

val doc = bolivar.open("doc.pdf", DocumentOptions {
    maxPages = 3
})
val pages = doc.extractPageSummaries()
for (page in pages) {
    println("${page.pageNumber}: ${page.text.take(80)}")
}
```

```rust
use bolivar_core::high_level::extract_pages;

fn main() -> bolivar_core::Result<()> {
    let data = std::fs::read("doc.pdf")?;
    for page in extract_pages(&data, None)? {
        let page = page?;
        println!("{}", page.pageid);
    }
    Ok(())
}
```
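
To illustrate locating content before extracting specific pages, here is a small sketch that filters `(page_number, text)` pairs by keyword. The function name `pages_containing` and the sample pairs are hypothetical; in practice the pairs could be collected from `page.page_number` and `page.extract_text()` in the pdfplumber loop.

```python
def pages_containing(pages, needle):
    """Return the numbers of pages whose text contains `needle`, case-insensitively."""
    needle = needle.lower()
    # Treat a missing text (None) as an empty page rather than failing.
    return [number for number, text in pages if needle in (text or "").lower()]


# Hypothetical (page_number, text) pairs standing in for real page summaries.
summaries = [
    (1, "Introduction"),
    (2, "Table 1: Revenue by region"),
    (3, None),
    (4, "Revenue notes"),
]
hits = pages_containing(summaries, "revenue")
print(hits)
```

The resulting page numbers can then drive a second, more expensive pass such as table extraction on only those pages.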

## Async (Python)

Run extraction off the main thread in Python while keeping the same `pdfplumber` API.

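```python
import asyncio

import pdfplumber


async def main():
    # `async with` is only valid inside a coroutine.
    async with pdfplumber.open("doc.pdf") as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                print(table)


asyncio.run(main())
```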
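
The README does not say how bolivar schedules this work. Purely as a general illustration of the idea, a blocking call can be moved off the event loop with the standard library's `asyncio.to_thread` (Python 3.9+). `slow_extract` below is a hypothetical stand-in for any blocking extraction call, not a bolivar function.

```python
import asyncio
import time


def slow_extract(path):
    """Hypothetical stand-in for a blocking extraction call."""
    time.sleep(0.05)  # simulate blocking work
    return f"text of {path}"


async def extract_many(paths):
    # Each call runs in a worker thread, so the event loop stays responsive.
    jobs = [asyncio.to_thread(slow_extract, p) for p in paths]
    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*jobs)


results = asyncio.run(extract_many(["a.pdf", "b.pdf"]))
print(results)
```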

## License

MIT
