https://github.com/ixalodecte/filestruct

A Python package to structure files using visual and style information
layout-analysis parser pdf

Host: GitHub
URL: https://github.com/ixalodecte/filestruct
Owner: ixalodecte
License: gpl-3.0
Created: 2024-01-21T18:27:11.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-03-09T21:26:04.000Z (over 2 years ago)
Last Synced: 2025-12-05T16:14:10.275Z (7 months ago)
Topics: layout-analysis, parser, pdf
Language: Python
Homepage:
Size: 45.9 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# FileStruct

**FileStruct** is a high-level Python library that aims to extract the overall structure of documents, particularly PDFs, based on visual information such as size, color and font.

## How does it work?

As human readers, we detect titles, subtitles, and paragraphs from the visual appearance of a document. A large block of red text most likely represents a title (or subtitle). Using these heuristics, we can structure a document: _this paragraph belongs to this section_. This package applies the same method to structure a document automatically yet realistically. The method is described below:

1.  **Text and style extraction:** We rely on lower-level libraries (like PyMuPDF) to extract the text and style information and to order the blocks of text.

2.  **Tree creation:** A tree is created in which each block of text is a node. A child of a node in the tree is a subsection of a section in the document.

3.  **Data export:** The data can be exported in JSON format.
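
The tree-creation and export steps can be illustrated with a minimal sketch. This is not FileStruct's actual implementation: the `Node` class, the `build_tree` function, and the rule "a block is a child of the nearest preceding block with a strictly larger font size" are assumptions made for illustration, and the input is a hand-written list of `(text, font_size)` pairs standing in for what a lower-level extractor such as PyMuPDF would provide.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical node type, not part of the FileStruct API
    text: str
    size: float  # font size, used here as the only style heuristic
    children: list = field(default_factory=list)

    def to_dict(self):
        return {"text": self.text, "children": [c.to_dict() for c in self.children]}


def build_tree(blocks):
    """Nest (text, font_size) blocks in reading order.

    Assumed rule: each block becomes a child of the nearest preceding
    block whose font size is strictly larger.
    """
    root = Node("ROOT", float("inf"))  # infinite size, so it is never popped
    stack = [root]
    for text, size in blocks:
        # Discard candidates that are not strictly larger than this block
        while stack[-1].size <= size:
            stack.pop()
        node = Node(text, size)
        stack[-1].children.append(node)
        stack.append(node)
    return root


# Hand-written sample input in reading order
blocks = [
    ("Introduction", 18.0),
    ("Some body text.", 11.0),
    ("Background", 14.0),
    ("More body text.", 11.0),
    ("Methods", 18.0),
]
tree = build_tree(blocks)
print(json.dumps(tree.to_dict(), indent=2))
```

With this sample, "Introduction" and "Methods" end up as top-level sections, and "Background" becomes a subsection of "Introduction". A real heuristic would likely also weigh color and font, as described above.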

For now, FileStruct can only read formats that are supported by PyMuPDF. This includes PDF, EPUB, XPS, MOBI, FB2, CBZ and SVG. I plan to add more file formats in the future.

## Installation

Install **FileStruct** using **pip**:

```sh
pip install filestruct
```

## Getting Started

Below is a basic usage example for a PDF document:

```python
from filestruct.document import PDFDocument

# Open the document with the imported PDFDocument class
doc = PDFDocument("PATH_TO_YOUR_FILE.pdf")

data = doc.to_json()   # Export the tree into JSON format
print(data)

print(doc)
```
