https://github.com/rikeshamin/pdf-keyword-finder
A lightweight command-line script designed for personal use during my PhD literature searches, enabling the extraction of key terms from multiple PDF files and exporting the results to a CSV file.
keyword keyword-extraction keywords-extraction literature literature-review literature-search pdf pdf-document phd phd-thesis python text
- Host: GitHub
- URL: https://github.com/rikeshamin/pdf-keyword-finder
- Owner: rikeshamin
- License: mit
- Created: 2025-01-23T20:26:03.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-01-23T21:12:15.000Z (3 months ago)
- Last Synced: 2025-02-03T23:52:13.784Z (3 months ago)
- Topics: keyword, keyword-extraction, keywords-extraction, literature, literature-review, literature-search, pdf, pdf-document, phd, phd-thesis, python, text
- Language: Python
- Homepage:
- Size: 9.77 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# PDF Keyword Finder
## Overview
I developed this tool during my PhD to streamline bulk keyword searches across the literature. It finds user-specified keywords in every PDF file in a single directory and generates a CSV file listing each file name alongside the keywords it matched, which makes the results easy to analyze and organize.
## Features
- Extracts text from PDF files using the `pdfplumber` library.
- Preprocesses text to remove line breaks, hyphens, and extra spaces.
- Searches for user-specified keywords within the extracted text.
- Saves the results (matching filenames and keywords) to a CSV file.
---
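The extraction, preprocessing, and search steps listed under Features might look roughly like the sketch below. It is not the script's actual code: the function names are hypothetical, the hyphen handling assumes that "hyphens" means words split across line breaks, and the matching is assumed to be a case-insensitive substring test. Only `pdfplumber.open`, `pdf.pages`, and `page.extract_text()` are real `pdfplumber` calls.

```python
import re
from typing import List


def extract_text(pdf_path: str) -> str:
    """Return the text of every page in a PDF, joined by newlines."""
    # Third-party dependency; imported here so the helpers below run without it.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None for pages without extractable text.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def preprocess(text: str) -> str:
    """Flatten extracted text into a single, cleanly spaced line."""
    # Assumption: rejoin words hyphenated across a line break ("key-\nword").
    text = re.sub(r"-\s*\n\s*", "", text)
    # Turn remaining line breaks into spaces, then collapse runs of whitespace.
    text = text.replace("\n", " ")
    return re.sub(r"\s+", " ", text).strip()


def find_keywords(text: str, keywords: List[str]) -> List[str]:
    """Return the keywords found in text (case-insensitive substring match)."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]
```

With these helpers, `find_keywords(preprocess(extract_text(path)), keywords)` would give the matches for one file. Substring matching also matches inside longer words ("Data" matches "database"), so a word-boundary regular expression may be preferable for short keywords.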
## Installation
### Prerequisites
Ensure you have the following installed on your system:
- Python 3.7 or higher
- pip (Python package installer)
### Required Libraries
Install the necessary Python libraries using pip:
```bash
pip install pdfplumber pandas
```
---
## Usage
### Command-Line Arguments
The script requires the following command-line arguments:
1. **pdf_directory**: Path to the directory containing the PDF files.
2. **keywords**: Comma-separated list of keywords to search for.
3. **csv_path**: Path to save the output CSV file.
### Example
To search for keywords "Python" and "Data" in PDF files located in `/path/to/pdfs` and save the results to `/path/to/output.csv`, run:
```bash
python pdf_keyword_finder.py "/path/to/pdfs" "Python,Data" "/path/to/output.csv"
```
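For readers curious how three positional arguments like these can be handled, here is a minimal sketch using the standard-library `argparse` module. The argument names come from the list above; `build_parser` and `split_keywords` are hypothetical helpers, and the real script may parse its arguments differently.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three positional arguments described above."""
    parser = argparse.ArgumentParser(
        description="Search PDF files for keywords and save matches to a CSV file."
    )
    parser.add_argument("pdf_directory", help="Directory containing the PDF files")
    parser.add_argument("keywords", help="Comma-separated keywords, e.g. 'Python,Data'")
    parser.add_argument("csv_path", help="Path to save the output CSV file")
    return parser


def split_keywords(raw: str) -> List[str]:
    """Split a comma-separated string into a list of clean keywords."""
    # Strip surrounding whitespace and drop empty entries (e.g. a trailing comma).
    return [kw.strip() for kw in raw.split(",") if kw.strip()]
```

Calling `build_parser().parse_args()` reads the values from the command line, and `split_keywords(args.keywords)` turns `"Python,Data"` into `["Python", "Data"]`.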
---
## Additional Ideas (when time permits)
- Integrate the OpenAI API to enhance usability and provide more accurate keyword matching by leveraging advanced natural language processing capabilities.
- Improve the tool's compatibility with scanned PDFs by incorporating Optical Character Recognition (OCR) functionality to handle files without extractable text.
---
## License
This project is licensed under the MIT License. See the LICENSE file for details.

---
## Contribution
Feel free to star or fork the repository, open issues, or submit pull requests to improve this tool. Suggestions and feedback are VERY welcome!