https://github.com/taylor-eos/manual-classifier
Manual classifier GUI for PDF text extraction
- Host: GitHub
- URL: https://github.com/taylor-eos/manual-classifier
- Owner: Taylor-eOS
- Created: 2024-10-05T11:04:24.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-02-16T19:21:04.000Z (over 1 year ago)
- Last Synced: 2025-02-16T20:25:49.338Z (over 1 year ago)
- Language: Python
- Homepage:
- Size: 48.8 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
`manual-classifier` is a simple Python GUI tool for extracting and classifying text blocks from PDF files. The tool displays each block of text and lets the user label it as "Header," "Body," "Footer," or "Quote" using buttons or keyboard shortcuts (1-3, H, E). If a mistake is made, the last classification can be undone (or the output file can be edited). The blocks are written to an output file, wrapped in their corresponding tags, for further automated processing, for instance with [txt-to-epub](https://github.com/Taylor-eOS/txt-to-epub).
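The classify, undo, and write-out flow described above can be sketched roughly as follows. This is an illustration only, not the code in `manually_classify.py`: the class name `ClassificationSession`, the lowercase label names, and the `<label>...</label>` tag format are all assumptions made for the example.

```python
from typing import List, Tuple

# Assumed label names; the actual tool's labels and tag spelling may differ.
LABELS = ("header", "body", "footer", "quote")


class ClassificationSession:
    """Collects (label, text) pairs and supports undoing the last one."""

    def __init__(self) -> None:
        self.classified: List[Tuple[str, str]] = []

    def classify(self, text: str, label: str) -> None:
        """Record one text block under the given label."""
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        self.classified.append((label, text))

    def undo(self) -> bool:
        """Remove the most recent classification; return False if there is none."""
        if not self.classified:
            return False
        self.classified.pop()
        return True

    def render(self) -> str:
        """Wrap each block in an assumed <label>...</label> tag, one block per line."""
        return "".join(
            f"<{label}>{text}</{label}>\n" for label, text in self.classified
        )


session = ClassificationSession()
session.classify("Chapter 1", "header")
session.classify("42", "body")    # mistake: this block is a page number
session.undo()                    # drop the wrong classification
session.classify("42", "footer")
print(session.render(), end="")
```

Keeping the classifications in a list and only rendering them at the end makes undo a simple `pop()`; a tool that appends to the output file immediately would instead have to truncate the file on undo.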
### How to use:
- Create a project folder, set up a venv environment, and install the requirements. (A detailed guide to the whole installation process can be found [here](https://github.com/Taylor-eOS/whisper).)
- Run `python manually_classify.py`.
- Enter the input file's basename when prompted.
- Use the GUI to classify each text block.
- The classifications are saved to `output.txt`.
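For further automated processing, the saved file could be read back along these lines. This is a minimal sketch that assumes each block is wrapped as `<header>...</header>`, `<body>...</body>`, `<footer>...</footer>`, or `<quote>...</quote>`; the real tag names in `output.txt` may differ, and `parse_tagged` is a hypothetical helper, not part of the tool.

```python
import re
from typing import List, Tuple

# Assumed tag names; adjust to whatever the output file actually contains.
TAG_PATTERN = re.compile(r"<(header|body|footer|quote)>(.*?)</\1>", re.DOTALL)


def parse_tagged(text: str) -> List[Tuple[str, str]]:
    """Return (tag, content) pairs in the order they appear in the text."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_PATTERN.finditer(text)]


sample = (
    "<header>Chapter 1</header>\n"
    "<body>It was a dark night.</body>\n"
    "<footer>42</footer>\n"
)
blocks = parse_tagged(sample)

# Keep only the body text, e.g. to drop running headers and page numbers.
body_only = [content for tag, content in blocks if tag == "body"]
```

To run this against a real file, replace `sample` with `pathlib.Path("output.txt").read_text(encoding="utf-8")`.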
This was originally meant to create training data for [a machine learning classifier](https://github.com/Taylor-eOS/bert-classifier). However, that proved too unreliable, so manual classification is preferred.