https://github.com/harubi/bolivar
High-performance PDF table extraction library. Bindings for Python and JVM.
jvm pdf pdf-parsing python rust table-extraction text-extraction
- Host: GitHub
- URL: https://github.com/harubi/bolivar
- Owner: harubi
- License: mit
- Created: 2025-12-30T19:56:25.000Z (6 months ago)
- Default Branch: master
- Last Pushed: 2026-02-12T12:23:22.000Z (4 months ago)
- Last Synced: 2026-02-12T21:22:31.150Z (4 months ago)
- Topics: jvm, pdf, pdf-parsing, python, rust, table-extraction, text-extraction
- Language: Rust
- Homepage:
- Size: 6.59 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# bolivar
Fast PDF text and table extraction. Written in Rust, drop-in compatible with pdfminer and pdfplumber.
## Install
```sh
pip install bolivar
```
```kotlin
implementation("sa.ingenious:bolivar:1.2.0")
```
```toml
[dependencies]
bolivar-core = "1.2"
```
## Extract text
Pull all text from a PDF in one call. The pdfplumber interface opens the file and iterates pages; the pdfminer interface returns the full text directly. Kotlin and Rust follow the same pattern with their respective APIs.
```python
import pdfplumber
with pdfplumber.open("doc.pdf") as pdf:
    for page in pdf.pages:
        print(page.extract_text())
```
```python
from pdfminer.high_level import extract_text
text = extract_text("doc.pdf")
```
```kotlin
import sa.ingenious.DocumentOptions
import sa.ingenious.bolivar
val doc = bolivar.open("doc.pdf", DocumentOptions {
    maxPages = 1
    layout {
        lineMargin = 0.5
        wordMargin = 0.1
    }
})
val text = doc.extractText()
```
```rust
use bolivar_core::high_level::extract_text;
fn main() -> bolivar_core::Result<()> {
    let data = std::fs::read("doc.pdf")?;
    let text = extract_text(&data, None)?;
    println!("{text}");
    Ok(())
}
```
## Extract tables
Detect and extract tabular data from each page. Bolivar returns structured tables with row and column counts, bounding boxes, and cell text so you can inspect or export them without manual parsing.
```python
import pdfplumber
with pdfplumber.open("doc.pdf") as pdf:
    for page in pdf.pages:
        for table in page.extract_tables():
            print(table)
```
```kotlin
import sa.ingenious.DocumentOptions
import sa.ingenious.bolivar
val doc = bolivar.open("doc.pdf", DocumentOptions {
    pages(1, 2)
})
val tables = doc.extractTables()
for (table in tables) {
    println("${table.rowCount}x${table.columnCount}")
}
```
```rust
use bolivar_core::high_level::{extract_tables_with_document, ExtractOptions};
use bolivar_core::pdfdocument::PDFDocument;
use bolivar_core::table::TableSettings;
fn main() -> bolivar_core::Result<()> {
    let data = std::fs::read("doc.pdf")?;
    let doc = PDFDocument::new(&data, "")?;
    let tables = extract_tables_with_document(
        &doc,
        ExtractOptions::default(),
        &TableSettings::default(),
    )?;
    Ok(())
}
```
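As a minimal sketch of the export step in Python, the helper below writes one table to CSV text with the standard library. It assumes each table has the pdfplumber shape (a list of rows, each a list of cell strings or `None`); `table_to_csv` and `sample_table` are hypothetical names for illustration and are not part of bolivar.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table (a list of rows of str-or-None cells) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # Assumption: empty cells arrive as None, as in pdfplumber-style output.
        writer.writerow("" if cell is None else cell for cell in row)
    return buffer.getvalue()


# Stand-in for one element of page.extract_tables(); not real extracted data.
sample_table = [["item", "qty"], ["bolt", "4"], ["washer", None]]
csv_text = table_to_csv(sample_table)
print(csv_text)
```

In practice the same helper could be called on each table yielded by `page.extract_tables()` and the result written to a file.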
## Iterate pages
Walk through pages one at a time to read metadata like page number, dimensions, and a text preview. This is useful when you need to locate content across a large document before extracting specific pages.
```python
import pdfplumber
with pdfplumber.open("doc.pdf") as pdf:
    for page in pdf.pages:
        print(page.page_number, page.width, page.height)
```
```python
from pdfminer.high_level import extract_pages
for page in extract_pages("doc.pdf"):
    print(page.pageid, page.width, page.height)
```
```kotlin
import sa.ingenious.DocumentOptions
import sa.ingenious.bolivar
val doc = bolivar.open("doc.pdf", DocumentOptions {
    maxPages = 3
})
val pages = doc.extractPageSummaries()
for (page in pages) {
    println("${page.pageNumber}: ${page.text.take(80)}")
}
```
```rust
use bolivar_core::high_level::extract_pages;
fn main() -> bolivar_core::Result<()> {
    let data = std::fs::read("doc.pdf")?;
    for page in extract_pages(&data, None)? {
        let page = page?;
        println!("{}", page.pageid);
    }
    Ok(())
}
```
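To illustrate locating content before extracting specific pages, here is a small Python sketch that scans per-page text for a keyword and returns the matching page numbers. `find_pages` and `sample_pages` are hypothetical; the function only assumes an iterable of `(page_number, text)` pairs, which could be built from `page.page_number` and `page.extract_text()`.

```python
def find_pages(pages, keyword):
    """Return page numbers whose text contains keyword (case-insensitive).

    pages is any iterable of (page_number, text) pairs.
    """
    needle = keyword.casefold()
    hits = []
    for number, text in pages:
        # Pages without a text layer may yield None or an empty string.
        if text and needle in text.casefold():
            hits.append(number)
    return hits


# Stand-in data; with the Python bindings the pairs could come from
# ((p.page_number, p.extract_text()) for p in pdf.pages).
sample_pages = [
    (1, "Cover page"),
    (2, "Quarterly Revenue table"),
    (3, None),
    (4, "revenue notes"),
]
matches = find_pages(sample_pages, "revenue")
print(matches)
```

The returned page numbers can then drive a second pass that extracts tables only from the matching pages.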
## Async (Python)
Run extraction off the main thread in Python while keeping the same `pdfplumber` API.
```python
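import asyncio
import pdfplumber

async def main():  # `async with` is only valid inside a coroutine
    async with pdfplumber.open("doc.pdf") as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                print(table)

asyncio.run(main())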
```
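For context, the general standard-library pattern for keeping blocking work off the event loop looks roughly like the sketch below. This is not a description of how bolivar implements its async support; `extract_blocking` is a hypothetical stand-in for any blocking extraction call, and `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import time


def extract_blocking(path):
    """Hypothetical stand-in for a blocking extraction call; returns fake text."""
    time.sleep(0.05)  # simulate CPU or I/O work
    return f"text from {path}"


async def extract_many(paths):
    # asyncio.to_thread runs each blocking call in a worker thread so the
    # event loop stays free; gather returns results in input order.
    tasks = [asyncio.to_thread(extract_blocking, p) for p in paths]
    return await asyncio.gather(*tasks)


results = asyncio.run(extract_many(["a.pdf", "b.pdf"]))
print(results)
```

Because `gather` preserves input order, each result lines up with the path that produced it.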
## License
MIT