https://github.com/drkenreid/generalized-analysis-of-text-data

A comprehensive toolkit for analyzing text data using various AI and NLP techniques, including topic modeling, sentiment analysis, and text classification, demonstrated on the 20 Newsgroups dataset.

artificial-intelligence dependency-parser document-similarity exploratory-data-analysis natural-language-processing network-visualization newsgroups nlp sentiment-analysis text-classification text-clustering text-summarization topic-modeling word-embeddings


# 📊 Generalized Analysis of Text Data

## 🔍 Overview
This repo provides a comprehensive toolkit for analyzing text data using various AI and Natural Language Processing (NLP) techniques. It's designed to be a reference guide and inspiration for text analysis projects, offering insights into themes, sentiment, named entities, and more.

## ✨ Features
- **πŸ“₯ Data Collection**: Uses the 20 Newsgroups dataset for demonstration.
- **πŸ“ Initial Textual Analysis**: Performs basic text statistics and word frequency analysis.
- **πŸ”¬ Exploratory Data Analysis**: Visualizes key aspects of the text data.
- **πŸ—‚οΈ Topic Modeling**: Uncovers hidden thematic structures in the text corpus.
- **🧩 Text Clustering**: Groups similar documents using K-means clustering.
- **πŸ”€ Word Embeddings**: Captures semantic relationships between words using Word2Vec.
- **πŸ”— Document Similarity**: Identifies related documents using cosine similarity.
- **🏷️ Named Entity Recognition**: Extracts and classifies named entities in the text.
- **πŸ•ΈοΈ Topic Network Visualization**: Visualizes relationships between topics and words.
- **😊 Sentiment Analysis**: Analyzes the emotional tone of the text.
- **πŸ“š Text Classification**: Automatically categorizes texts using machine learning.
- **πŸ“ Text Summarization**: Generates concise summaries of longer texts.
- **πŸ”  POS Tagging**: Assigns parts of speech to words in the text.
- **🌳 Dependency Parsing**: Analyzes the grammatical structure of sentences.
- **🧐 Topic Coherence**: Evaluates the quality of extracted topics.
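
As an illustration of the Document Similarity feature, the sketch below computes cosine similarity between raw word-count vectors using only the standard library. It is not taken from the notebook, which may vectorize text differently (for example with scikit-learn); the `tokenize` and `cosine_similarity` helpers and the sample sentences are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Dot product over the words the two texts share.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text has no direction to compare.
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, not drawn from 20 Newsgroups.
docs = [
    "graphics card renders the image quickly",
    "a fast graphics card renders every image",
    "hockey team won the game last night",
]
# Compare the first document with the other two.
for other in docs[1:]:
    print(round(cosine_similarity(docs[0], other), 3), "-", other)
```

The two graphics-related sentences score higher than the unrelated pair because they share more words. Raw counts let common words such as "the" contribute to the score, which is one reason weighted representations like TF-IDF are often preferred in practice.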

## 🛠️ Requirements
- Python 3.6+
- Required libraries:
  - pandas
  - numpy
  - matplotlib
  - seaborn
  - nltk
  - spacy
  - textblob
  - scikit-learn
  - gensim
  - networkx
  - transformers
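
Before running the notebook, it can help to confirm that each listed library is importable. The following is a minimal sketch, not part of the repository: `missing_packages` is a hypothetical helper, and the only non-obvious mapping it relies on is that scikit-learn is imported under the name `sklearn`.

```python
from importlib.util import find_spec

# Import names for the libraries listed above.
# scikit-learn is installed as "scikit-learn" but imported as "sklearn".
REQUIRED_IMPORTS = [
    "pandas", "numpy", "matplotlib", "seaborn", "nltk", "spacy",
    "textblob", "sklearn", "gensim", "networkx", "transformers",
]


def missing_packages(import_names):
    """Return the names from import_names that cannot be found on this system."""
    return [name for name in import_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED_IMPORTS)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

The check only confirms that a package can be located. It does not verify versions or any extra data a library may need at runtime.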

## 🚀 Installation
1. Clone this repository:
```
git clone https://github.com/DrKenReid/Generalized-Analysis-of-Text-Data.git
```
2. Install the required packages from the repository root:
```
cd Generalized-Analysis-of-Text-Data
pip install -r requirements.txt
```

## 👨‍💻 Usage
1. Open the notebook in Google Colab or your preferred Jupyter environment.
2. Run all cells in the notebook:
   - In Colab: Runtime -> Run all
   - In Jupyter: Cell -> Run All

## 📑 Sections
1. Setup: Imports necessary libraries and initializes key components.
2. Data Collection: Fetches the 20 Newsgroups dataset.
3. Dataset Building: Structures the data into a pandas DataFrame.
4. Initial Textual Analysis: Performs basic text statistics.
5. Exploratory Data Analysis: Visualizes key aspects of the data.
6. AI-Enhanced Insights: Applies various NLP techniques for deeper analysis.
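
To give a sense of what the Initial Textual Analysis step involves, here is a minimal sketch of basic text statistics and word frequencies using only the standard library. The `text_statistics` function, its output keys, and the sample sentences are assumptions made for illustration; the notebook itself works on a pandas DataFrame and may compute different measures.

```python
import re
from collections import Counter


def text_statistics(documents):
    """Compute simple corpus-level statistics for a list of strings."""
    # Lowercase each document and keep alphabetic tokens only.
    token_lists = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    all_tokens = [tok for tokens in token_lists for tok in tokens]
    n_docs = len(documents)
    return {
        "documents": n_docs,
        "tokens": len(all_tokens),
        "vocabulary": len(set(all_tokens)),
        "avg_tokens_per_doc": len(all_tokens) / n_docs if n_docs else 0.0,
        "top_words": Counter(all_tokens).most_common(3),
    }


# Hypothetical sample documents, not drawn from 20 Newsgroups.
sample = [
    "Space probes send data back to Earth.",
    "The probe sent new data about Mars.",
]
for key, value in text_statistics(sample).items():
    print(f"{key}: {value}")
```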

## 📤 Output
The notebook generates various visualizations and outputs, including:
- Word frequency distributions
- Topic models
- Cluster visualizations
- Sentiment analysis results
- Named entity recognition results
- Text summaries

## 🔧 Customization
You can modify the notebook to use your own dataset by replacing the data collection step with your data loading process.
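
One way to swap in your own data is sketched below, assuming it lives in a CSV file. The `load_texts` helper, the `text` and `category` column names, and the file name `my_data.csv` are all hypothetical; check the notebook for the column names its later cells actually expect.

```python
import csv
import io


def load_texts(csv_file, text_column="text", label_column="category"):
    """Read rows from an open CSV file into a list of {"text", "category"} dicts.

    The column names are assumptions; adjust them to match your file.
    """
    records = []
    for row in csv.DictReader(csv_file):
        text = (row.get(text_column) or "").strip()
        if not text:
            continue  # Skip rows without usable text.
        records.append({"text": text, "category": row.get(label_column) or "unknown"})
    return records


# Stand-in for open("my_data.csv", newline="", encoding="utf-8").
example = io.StringIO(
    "text,category\n"
    "Great phone battery,reviews\n"
    ",empty\n"
    "Slow delivery,complaints\n"
)
records = load_texts(example)
print(len(records), "documents loaded")
```

The resulting list of dictionaries can be passed to `pandas.DataFrame(records)` to stand in for the DataFrame built from the 20 Newsgroups data.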

## 🀝 Contributing
Contributions, issues, and feature requests are welcome. Feel free to check [issues page](https://github.com/DrKenReid/Generalized-Analysis-of-Text-Data/issues) if you want to contribute.

## 📄 License
This project is licensed under the MIT License.

## 🙏 Acknowledgements
- This project uses the 20 Newsgroups dataset for demonstration purposes.
- Special thanks to the developers of the various Python libraries used in this project.

## ⚖️ Disclaimer
This notebook is for educational and research purposes only. Ensure you have the right to use and analyze any data you input into this notebook.