https://github.com/taylor-eos/manual-classifier

Manual classifier GUI for PDF text extraction

Host: GitHub
URL: https://github.com/taylor-eos/manual-classifier
Owner: Taylor-eOS
Created: 2024-10-05T11:04:24.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-02-16T19:21:04.000Z (over 1 year ago)
Last Synced: 2025-02-16T20:25:49.338Z (over 1 year ago)
Language: Python
Homepage:
Size: 48.8 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

`manual-classifier` is a simple Python GUI tool for extracting and classifying text blocks from PDF files. The tool displays each text block and lets the user label it "Header," "Body," "Footer," or "Quote" using buttons or keyboard shortcuts (1-3, H, E). If you make a mistake, you can undo the last classification (or edit the output file afterwards). Each block is written to an output file inside its corresponding tag for further automated processing, for instance with [txt-to-epub](https://github.com/Taylor-eOS/txt-to-epub).
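The classify-and-undo flow described above can be illustrated without the GUI. The following is only a minimal sketch, not the tool's actual code: the `ClassificationLog` class, its method names, and the `<label>text</label>` tag format are all assumptions made for illustration, and the real output format may differ.

```python
from dataclasses import dataclass, field

# The four categories named in the README, lowercased for use as tag names.
LABELS = ("header", "body", "footer", "quote")


@dataclass
class ClassificationLog:
    """Hypothetical helper: collects (label, text) pairs with single-step undo."""
    entries: list = field(default_factory=list)

    def classify(self, text, label):
        # Reject anything that is not one of the four known categories.
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        self.entries.append((label, text))

    def undo(self):
        """Remove and return the most recent classification, or None if empty."""
        return self.entries.pop() if self.entries else None

    def render(self):
        # Assumed tag format: <label>text</label>, one block per line.
        return "\n".join(f"<{label}>{text}</{label}>" for label, text in self.entries)


log = ClassificationLog()
log.classify("Chapter 1", "header")
log.classify("12", "body")    # mistake: this block is a page number
log.undo()                    # drop the wrong classification
log.classify("12", "footer")
print(log.render())
```

Keeping the classifications in a list until they are written out makes undo a simple `pop()`, which is one plausible reason a tool like this would only offer undo for the most recent block.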

### How to use:
- Create a project folder, set up a venv, and install the requirements. (A detailed guide to the whole installation process can be found [here](https://github.com/Taylor-eOS/whisper).)
- Run `python manually_classify.py`.
- Enter the input file's basename when prompted.
- Use the GUI to classify each text block.
- The classifications are saved to `output.txt`.

This tool was originally meant to create training data for [a machine learning classifier](https://github.com/Taylor-eOS/bert-classifier). However, that classifier proved too unreliable, so manual classification is preferred.
