https://github.com/parthapray/nlp_pipeline_openai
This repo contains nlp pipeline and openai API integration
gradio matplotlib networkx nltk openai rake-nltk scikit-learn seaborn spacy textblob textstat wordcloud
- Host: GitHub
- URL: https://github.com/parthapray/nlp_pipeline_openai
- Owner: ParthaPRay
- License: mit
- Created: 2024-12-26T06:17:06.000Z (1 day ago)
- Default Branch: main
- Last Pushed: 2024-12-26T07:01:22.000Z (1 day ago)
- Last Synced: 2024-12-26T07:25:16.271Z (about 24 hours ago)
- Topics: gradio, matplotlib, networkx, nltk, openai, rake-nltk, scikit-learn, seaborn, spacy, textblob, textstat, wordcloud
- Language: Jupyter Notebook
- Homepage:
- Size: 10.4 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# NLP Pipeline with Graceful Clustering
This **Natural Language Processing (NLP) pipeline**, combined with **gpt-4o-mini**, analyzes, clusters, and visualizes text data. It pairs machine learning techniques such as TF-IDF, KMeans clustering, and LDA topic modeling with a **Gradio interface**, so users can explore structured outputs and visualizations interactively.
---
## Key Features
1. **Text Preprocessing**:
   - Tokenization, stopword removal, and POS tagging.
   - Named Entity Recognition (NER) for identifying entities.
2. **Feature Extraction**:
   - **TF-IDF Analysis**: Highlights significant terms.
   - **Keyword Extraction**: Uses RAKE for extracting relevant phrases.
3. **Analysis**:
   - **Sentiment Analysis**: Evaluates text polarity and subjectivity.
   - **Readability Metrics**: Calculates text complexity using multiple readability indices.
   - **Dependency Parsing**: Identifies linguistic dependencies.
4. **Clustering**:
   - Groups documents based on similarity using KMeans clustering.
5. **Topic Modeling**:
   - Identifies dominant themes in documents using Latent Dirichlet Allocation (LDA).
6. **Visualization**:
   - **Word Cloud**: Displays frequent terms.
   - **TF-IDF Bar Chart**: Highlights keyword scores.
   - **Co-occurrence Network**: Visualizes relationships between terms.
   - **Polarity Heatmap**: Displays sentence-level sentiment variations.
7. **Interactive Interface**:
   - Powered by **Gradio**, offering an easy-to-use web-based interface for exploring results.
---
## Requirements
### Dependencies
The required Python packages are listed in `requirements.txt`:
```plaintext
spacy
wordcloud
networkx
nltk
textblob
scikit-learn
seaborn
matplotlib
rake-nltk
textstat
gradio
openai
```
### Installation
1. Clone the repository:
```bash
git clone https://github.com/ParthaPRay/nlp_pipeline_openai.git
cd nlp_pipeline_openai
```
2. Install the dependencies:
```bash
pip install -r requirements.txt
```
3. Download required NLTK and SpaCy resources:
```bash
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
python -m spacy download en_core_web_sm
```
---
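Two of the packages in `requirements.txt` are imported under a different name than the one passed to `pip`: `scikit-learn` is imported as `sklearn`, and `rake-nltk` as `rake_nltk`. A minimal sketch for checking that every dependency is importable, using only the standard library; the helper name `missing_packages` is hypothetical and not part of the repository:

```python
import importlib.util

# pip package name -> import name (only scikit-learn and rake-nltk differ)
REQUIRED = {
    "spacy": "spacy",
    "wordcloud": "wordcloud",
    "networkx": "networkx",
    "nltk": "nltk",
    "textblob": "textblob",
    "scikit-learn": "sklearn",
    "seaborn": "seaborn",
    "matplotlib": "matplotlib",
    "rake-nltk": "rake_nltk",
    "textstat": "textstat",
    "gradio": "gradio",
    "openai": "openai",
}

def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]

if __name__ == "__main__":
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

This check only looks for the top-level modules; it does not verify that the NLTK data or the SpaCy model from step 3 were downloaded.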
## Workflow
### Workflow Overview
1. **Input Text and Documents**:
   - Users provide a text string for analysis and optional documents for comparison.
2. **Preprocessing**:
   - Tokenize text, remove stopwords, and tag parts of speech.
   - Extract named entities and clean tokens for further analysis.
3. **Feature Extraction**:
   - Compute **TF-IDF** scores for identifying important terms.
   - Extract keywords using **RAKE**.
4. **Analysis**:
   - Perform **sentiment analysis** to evaluate polarity and subjectivity.
   - Assess **readability metrics** using indices like Flesch Reading Ease.
   - Parse linguistic dependencies to understand relationships in text.
5. **Clustering and Topic Modeling**:
   - Group similar documents using **KMeans clustering**.
   - Identify key topics with **LDA (Latent Dirichlet Allocation)**.
6. **Visualization**:
   - Generate visual outputs like:
     - **Word Cloud**
     - **TF-IDF Chart**
     - **Co-occurrence Network**
     - **Polarity Heatmap**
7. **Interactive Results**:
   - Use **Gradio** for an intuitive, web-based exploration of results.
---
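The preprocessing step can be illustrated without any third-party packages. The sketch below is a simplified stand-in, not the repository's code: it swaps NLTK's tokenizer and stopword list for a regular expression and a tiny hand-written stopword set, then counts word frequencies. The names `STOPWORDS` and `clean_tokens` are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword set; the real pipeline would use NLTK's English list.
STOPWORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in", "for"}

def clean_tokens(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

tokens = clean_tokens("The pipeline cleans the text and counts the words in the text.")
frequencies = Counter(tokens)
print(frequencies.most_common(1))
```

The cleaned token list is the natural input for the later steps (word frequencies, word cloud, co-occurrence counts), which is why stopwords are removed once, up front.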
## Workflow Diagram
```mermaid
graph TD
A[Input Text/Documents] --> B[Preprocessing]
B --> C[Feature Extraction]
C --> D[Sentiment Analysis]
C --> E[Topic Modeling]
C --> F[Clustering]
D --> G[Visualization]
E --> G
F --> G
G --> H[Interactive UI with Gradio]
```
---
## Code Structure
### Key Functions
#### Text Preprocessing
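```python
import spacy
nlp = spacy.load("en_core_web_sm")  # model downloaded during installation

def dependency_parsing(text):
    # Print each token with its dependency label and syntactic head
    doc = nlp(text)
    for token in doc:
        print(f"{token.text} -> {token.dep_} -> {token.head.text}")
```
#### Feature Extraction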
```python
from sklearn.feature_extraction.text import TfidfVectorizer

def compute_tfidf(documents, top_n=5):
    """Return the top_n (term, score) pairs for the first document."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf_matrix = vectorizer.fit_transform(documents)
    feature_names = vectorizer.get_feature_names_out()
    # Only row 0 is scored, so pass the text of interest as documents[0]
    dense = tfidf_matrix.todense()
    scores = dense[0].tolist()[0]
    tfidf_scores = [(feature_names[i], scores[i]) for i in range(len(scores))]
    # Highest-scoring terms first
    sorted_scores = sorted(tfidf_scores, key=lambda x: x[1], reverse=True)
    return sorted_scores[:top_n]
```
#### Clustering
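```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_documents(documents, n_clusters=3):
    """Assign each document to a KMeans cluster based on its TF-IDF vector."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf_matrix = vectorizer.fit_transform(documents)
    # random_state fixes the centroid initialization for reproducible labels
    km = KMeans(n_clusters=n_clusters, random_state=42)
    km.fit(tfidf_matrix)
    return km.labels_
```
#### Visualization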
- **TF-IDF Chart**:
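```python
import matplotlib.pyplot as plt

def visualize_tfidf_figure(tfidf_scores):
    """Draw a horizontal bar chart of (term, score) pairs."""
    fig, ax = plt.subplots()
    # Unzip the pairs; fall back to empty sequences when there are no scores
    words, scores = zip(*tfidf_scores) if tfidf_scores else ([], [])
    ax.barh(words, scores)
    ax.set_xlabel("TF-IDF Score")
    ax.set_title("Top TF-IDF Keywords")
    plt.tight_layout()
    return fig
```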
- **Word Cloud**:
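```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def generate_wordcloud_figure(text):
    """Render a word cloud of the text on a Matplotlib figure."""
    wordcloud = WordCloud(width=800, height=400, background_color="white").generate(text)
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.imshow(wordcloud, interpolation="bilinear")
    ax.axis("off")  # hide the axes around the image
    ax.set_title("Word Cloud")
    plt.tight_layout()
    return fig
```
---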
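The key functions listed for visualization cover the TF-IDF chart and the word cloud but not the co-occurrence network. As a rough sketch of the idea, assuming that two terms co-occur when they appear in the same sentence (the repository may define co-occurrence differently), the pair counts below could serve as weighted edges for a `networkx` graph. The function name and the naive sentence splitting are illustrative only.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence_counts(text):
    """Count how often each unordered pair of distinct words shares a sentence."""
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text):
        # Sort the unique words so each pair has one canonical ordering
        words = sorted(set(re.findall(r"[a-z]+", sentence.lower())))
        counts.update(combinations(words, 2))
    return counts

edges = cooccurrence_counts("AI drives automation. AI drives research.")
print(edges[("ai", "drives")])
```

Each `(word_a, word_b)` key with its count can then be added to a graph with `networkx.Graph.add_edge(word_a, word_b, weight=count)`.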
## Usage
### Running the Application
1. Start the application:
```bash
python app.py
```
2. Open the Gradio interface at `http://127.0.0.1:7861`.
### Example Input
- **Text**: `"Artificial intelligence revolutionizes industries."`
- **Documents**:
```
AI is transforming healthcare.
Robotics drives automation.
Machine learning enables new opportunities.
```
### Example Output
- **Named Entities**: `["Artificial intelligence", "industries"]`
- **Sentiment Analysis**: `Positive (Polarity: 0.85)`
- **Clusters**: `[0, 1, 2]`
- **TF-IDF Keywords**: `["artificial", "intelligence", "revolutionizes"]`
- **Readability Scores**:
```json
{
"flesch_reading_ease": 70.2,
"gunning_fog_index": 8.3,
"smog_index": 7.2
}
```
---
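For orientation, the Flesch Reading Ease score in the readability output follows the standard formula `206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)`, where higher values mean easier text. The sketch below applies it with a deliberately naive syllable counter (one syllable per run of vowels), so its output will not match `textstat`, which counts syllables more carefully. The function names are hypothetical.

```python
import re

def count_syllables(word):
    """Naive estimate: one syllable per run of consecutive vowels (at least 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Apply the Flesch Reading Ease formula with naive sentence and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0  # nothing to score
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )

print(round(flesch_reading_ease("The cat sat. The dog ran."), 2))
```

Short sentences made of one-syllable words push the score above 100, which is why very simple text can exceed the usual 0 to 100 range.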
## Gradio Panels
### Inputs
- **Text**: Multiline input for primary text analysis.
- **Documents**: Optional multiline input for document clustering and comparison.
### Outputs
- **JSON Results**:
- Named entities, clean tokens, word frequencies, sentiment analysis, etc.
- **Visualization Panels**:
- Word Cloud, Polarity Heatmap, Co-occurrence Network, and TF-IDF Chart.
---
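Because the **Documents** input is a single multiline string, it has to be split into individual documents before clustering or topic modeling. A minimal sketch, assuming one document per line as in the Example Input; the helper `parse_documents` is hypothetical and the app may split its input differently:

```python
def parse_documents(raw):
    """Split a multiline string into non-empty, stripped documents."""
    return [line.strip() for line in raw.splitlines() if line.strip()]

raw_input = """AI is transforming healthcare.

Robotics drives automation.
"""
documents = parse_documents(raw_input)
print(len(documents))
```

The length of the resulting list also bounds `n_clusters`: scikit-learn's `KMeans` raises a `ValueError` when there are fewer documents than clusters.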
## Customization
- **Adjust Number of Topics**:
```python
topic_modeling(documents, n_topics=5)
```
- **Modify Clusters**:
```python
cluster_documents(documents, n_clusters=4)
```
---
## Troubleshooting
| **Issue** | **Solution** |
|----------------------------|------------------------------------------------------------------|
| Missing NLTK Data | Run `nltk.download('punkt')` and `nltk.download('stopwords')`. |
| SpaCy Model Missing | Run `python -m spacy download en_core_web_sm`. |
| Backend Errors             | Uncomment `matplotlib.use('Agg')` to switch Matplotlib to the non-interactive Agg backend. |
---
## Contribution
We welcome contributions! Fork the repository, make changes, and submit pull requests to enhance features or fix bugs.
---
## License
This project is licensed under the MIT License. See the `LICENSE` file for details.
---
## Screenshots
### Gradio Interface
![image](https://github.com/user-attachments/assets/ada65ce7-ad2f-49fc-93ce-5ecb90d392e1)
### Visualizations
- **Named Entities**
![image](https://github.com/user-attachments/assets/b235a814-3841-4c43-99b9-b01d9cf7b993)
- **Clean Tokens**
![image](https://github.com/user-attachments/assets/98593af4-48e7-47b3-9d39-1da9981789ea)
- **Word Frequencies**
![image](https://github.com/user-attachments/assets/b5d08e53-173b-4122-98d3-893c0f1a13ac)
- **Sentiment Analysis**
![image](https://github.com/user-attachments/assets/d2539eb3-5cef-41a5-b180-60fc4b59d097)
- **Top TF-IDF Keywords**
![image](https://github.com/user-attachments/assets/08f36633-750c-4a7c-9676-278afe86f6ca)
- **Topics**
![image](https://github.com/user-attachments/assets/6eb5f354-91c6-4f0a-b8e7-7fdf848d6314)
- **Summary**
![image](https://github.com/user-attachments/assets/6a837d46-9f46-4f25-9eab-fef462f4faff)
- **RAKE Keywords**
![image](https://github.com/user-attachments/assets/601c8c4e-296e-4ae5-91dd-27184c946fb4)
- **Document Clusters**
![image](https://github.com/user-attachments/assets/7f6fb996-e974-4e5b-a539-f356bfb1a65c)
- **POS Tagging Counts**
![image](https://github.com/user-attachments/assets/05e495a1-48f1-4cf2-a2aa-3e5b2d7462a8)
- **Readability Scores**
![image](https://github.com/user-attachments/assets/e13f5f61-b175-4529-85f1-81196b6f798d)
- **Word Cloud**
![image](https://github.com/user-attachments/assets/b09e0a8e-906b-4e9f-b365-a47241e43432)
- **Polarity Heatmap**
![image](https://github.com/user-attachments/assets/cda1c9de-c9d4-4136-a401-10301f18376d)
- **TF-IDF Keywords**
![image](https://github.com/user-attachments/assets/d2e3942a-640f-427d-b429-a5e23ca6ecb2)
- **Co-occurrence Network**
![image](https://github.com/user-attachments/assets/8d49634f-f733-400a-8fac-4349214d1816)
## References
- [SpaCy Documentation](https://spacy.io/)
- [NLTK Documentation](https://www.nltk.org/)
- [TextBlob Documentation](https://textblob.readthedocs.io/)
- [Gradio Documentation](https://gradio.app/)

Happy Analyzing! 🚀