Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/nanxstats/pdf-word-extraction
Extract meaningful words from a collection of PDF documents and count their frequencies
ftfy natural-language-processing pypdf research-paper spacy wordcloud
- Host: GitHub
- URL: https://github.com/nanxstats/pdf-word-extraction
- Owner: nanxstats
- Created: 2023-06-22T01:58:38.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-06-14T01:15:48.000Z (7 months ago)
- Last Synced: 2024-06-15T02:24:41.332Z (7 months ago)
- Topics: ftfy, natural-language-processing, pypdf, research-paper, spacy, wordcloud
- Language: Python
- Homepage: https://nanx.me/blog/post/research-word-cloud/
- Size: 3.91 KB
- Stars: 2
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# PDF Word Extraction
This tool is designed to extract meaningful words from a collection of PDF
documents. The extracted words are processed and their frequencies are counted.
This frequency data can be used for various text analysis and visualization
tasks, such as generating word clouds or identifying common themes in the
document collection.

The tool leverages the modern text data toolchain in Python:

- pypdf: for reading PDFs.
- ftfy: for text cleaning.
- spaCy: for natural language processing such as
  tokenization, lemmatization, and stop-word removal.

The tool also provides customizable features such as the ability to specify
words for removal or replacement.

## Workflow
Clone the repository:
```bash
git clone https://github.com/nanxstats/pdf-word-extraction.git
```

Create a [virtual environment](https://docs.python.org/3/library/venv.html)
inside the cloned repository, activate it, and install the required Python
packages into the virtual environment:

```bash
cd pdf-word-extraction
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Put the PDF files under `pdf/`, then run:

```bash
python3 pdf_word_extraction.py
```

If you use VS Code, open the project and select the recommended "venv"
Python interpreter. Edit the lists of words to remove and replace in
`pdf_word_extraction.py`, save the file, and run it again in the terminal.
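
For orientation, the pipeline described in this README (read PDFs with pypdf, clean with ftfy, process with spaCy, then remove, replace, and count words) might look roughly like the sketch below. It is not the actual contents of `pdf_word_extraction.py`: the function names, the example word lists, and the `en_core_web_sm` model name are all assumptions made for illustration.

```python
from collections import Counter
from pathlib import Path

# Hypothetical example lists; the real lists live in pdf_word_extraction.py.
WORDS_TO_REMOVE = {"et", "al", "figure"}
WORDS_TO_REPLACE = {"modelling": "modeling"}


def read_pdf_text(path):
    """Concatenate the extracted text of every page in one PDF."""
    from pypdf import PdfReader  # imported lazily so count_words works without it

    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def lemmatize(text, nlp):
    """Clean text with ftfy, then keep lowercase lemmas of alphabetic, non-stop-word tokens."""
    import ftfy

    doc = nlp(ftfy.fix_text(text))
    return [tok.lemma_.lower() for tok in doc if tok.is_alpha and not tok.is_stop]


def count_words(words, remove=WORDS_TO_REMOVE, replace=WORDS_TO_REPLACE):
    """Apply replacements first, then drop unwanted words, and count what remains."""
    replaced = (replace.get(w, w) for w in words)
    return Counter(w for w in replaced if w not in remove)


def main(pdf_dir="pdf", model="en_core_web_sm"):
    """Print the 20 most frequent words across all PDFs in pdf_dir."""
    import spacy

    nlp = spacy.load(model)  # assumes this spaCy model is already installed
    counts = Counter()
    for path in sorted(Path(pdf_dir).glob("*.pdf")):
        counts += count_words(lemmatize(read_pdf_text(path), nlp))
    for word, n in counts.most_common(20):
        print(f"{word}\t{n}")
```

Calling `main()` from inside the activated virtual environment would process everything under `pdf/`. In this sketch replacements are applied before removals, so a word can be normalized first and then filtered under its normalized form; the real script may order these steps differently.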