Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/sudeatesoglu/nlp-document-processor
An NLP tool for processing documents in different formats, with similarity score detection, highlighting of a given pattern and of similar words between PDFs, and NER extraction.
nlp spacy text-processing
- Host: GitHub
- URL: https://github.com/sudeatesoglu/nlp-document-processor
- Owner: sudeatesoglu
- License: mit
- Created: 2024-07-07T16:38:04.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-07-31T10:03:30.000Z (3 months ago)
- Last Synced: 2024-10-10T11:42:55.602Z (26 days ago)
- Topics: nlp, spacy, text-processing
- Language: Python
- Homepage:
- Size: 130 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Document Processing Tool
Document Processing Tool is a simple NLP project that compares two documents in different formats (PDF, TXT, or DOCX). It supports similarity score calculation, categorical similarity detection, pattern search, highlighting of similar words between PDF documents, named entity recognition (NER), and PCA visualization.
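As a rough illustration of what a document similarity score can look like, the sketch below computes cosine similarity over plain word counts using only the standard library. This is not the project's implementation: the function names `tokenize` and `cosine_similarity` are hypothetical, and a spaCy-based score built on word vectors would generally give different values.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Return the cosine similarity of the two texts' word-count vectors."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Dot product over the words that appear in both texts.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty document has no direction to compare.
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity("The cat sat on the mat.", "The cat lay on the rug."))
```

A score near 1 means the two texts use the same words in similar proportions, while 0 means they share no words at all.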
## Features
- **Similarity Score Calculation**: Calculates the similarity score between two documents.
- **Categorical Similarity**: Computes a document's similarity score against a given keyword.
- **Highlighting Pattern**: Searches a PDF for a given pattern, highlights the matching tokens, and shows the token position of each match.
- **Highlighting Similar Words**: Highlights similar words between two PDF documents.
- **Named Entity Recognition**: Extracts and displays named entities from documents.
- **PCA Visualization**: Visualizes document word vectors using PCA.
- **Gradio Integration**: Uses Gradio for an interactive user interface that lets the user upload documents and view results for the selected feature. Gradio is an open-source Python package for building interactive web applications.
## Installation
1. Clone the repository:
```bash
git clone https://github.com/sudeatesoglu/nlp-document-processor.git
cd nlp-document-processor
```
2. Create a virtual environment and activate it:
```bash
python -m venv venv
source venv/Scripts/activate # On Windows (Git Bash)
source venv/bin/activate # On macOS/Linux
```
3. Install the required dependencies:
```bash
pip install -r requirements.txt
```
## Usage
1. Launch the application:
```bash
python app.py
```
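To give a feel for the token position information that the Highlighting Pattern feature reports, here is a minimal standard-library sketch. It assumes tokens are runs of word characters and that the pattern is a regular expression applied to each token; `TokenMatch` and `find_pattern_tokens` are hypothetical names, and the project's own matching (and its PDF highlighting) may work differently.

```python
import re
from typing import NamedTuple


class TokenMatch(NamedTuple):
    index: int  # position of the token in the token sequence
    start: int  # character offset where the token starts
    end: int    # character offset just past the token
    text: str   # the matched token itself


def find_pattern_tokens(text: str, pattern: str) -> list[TokenMatch]:
    """Return every token in `text` that the regex `pattern` matches."""
    regex = re.compile(pattern, re.IGNORECASE)
    matches = []
    # Treat each run of word characters as one token (an assumption).
    for index, token in enumerate(re.finditer(r"\w+", text)):
        if regex.search(token.group()):
            matches.append(
                TokenMatch(index, token.start(), token.end(), token.group())
            )
    return matches


if __name__ == "__main__":
    for match in find_pattern_tokens("Ada met Adam in Paris", r"^ada"):
        print(match)
```

The character offsets are what a highlighter would need to mark the matched span in the extracted text.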