https://github.com/drkenreid/generalized-analysis-of-text-data
A comprehensive toolkit for analyzing text data using various AI and NLP techniques, including topic modeling, sentiment analysis, and text classification, demonstrated on the 20 Newsgroups dataset.
artificial-intelligence dependency-parser document-similarity exploratory-data-analysis natural-language-processing network-visualization newsgroups nlp sentiment-analysis text-classification text-clustering text-summarization topic-modeling word-embeddings
- Host: GitHub
- URL: https://github.com/drkenreid/generalized-analysis-of-text-data
- Owner: DrKenReid
- License: mit
- Created: 2024-08-20T22:05:57.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2024-08-21T20:25:05.000Z (10 months ago)
- Last Synced: 2024-12-31T15:54:10.124Z (6 months ago)
- Topics: artificial-intelligence, dependency-parser, document-similarity, exploratory-data-analysis, natural-language-processing, network-visualization, newsgroups, nlp, sentiment-analysis, text-classification, text-clustering, text-summarization, topic-modeling, word-embeddings
- Language: Jupyter Notebook
- Homepage:
- Size: 1.45 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Generalized Analysis of Text Data
## Overview
This repo provides a comprehensive toolkit for analyzing text data using various AI and Natural Language Processing (NLP) techniques. It's designed to be a reference guide and inspiration for text analysis projects, offering insights into themes, sentiment, named entities, and more.

## Features
- **Data Collection**: Uses the 20 Newsgroups dataset for demonstration.
- **Initial Textual Analysis**: Performs basic text statistics and word frequency analysis.
- **Exploratory Data Analysis**: Visualizes key aspects of the text data.
- **Topic Modeling**: Uncovers hidden thematic structures in the text corpus.
- **Text Clustering**: Groups similar documents using K-means clustering.
- **Word Embeddings**: Captures semantic relationships between words using Word2Vec.
- **Document Similarity**: Identifies related documents using cosine similarity.
- **Named Entity Recognition**: Extracts and classifies named entities in the text.
- **Topic Network Visualization**: Visualizes relationships between topics and words.
- **Sentiment Analysis**: Analyzes the emotional tone of the text.
- **Text Classification**: Automatically categorizes texts using machine learning.
- **Text Summarization**: Generates concise summaries of longer texts.
- **POS Tagging**: Assigns parts of speech to words in the text.
- **Dependency Parsing**: Analyzes the grammatical structure of sentences.
- **Topic Coherence**: Evaluates the quality of extracted topics.

## Requirements
- Python 3.6+
- Required libraries:
  - pandas
  - numpy
  - matplotlib
  - seaborn
  - nltk
  - spacy
  - textblob
  - scikit-learn
  - gensim
  - networkx
  - transformers

## Installation
1. Clone this repository:
```
git clone https://github.com/DrKenReid/Generalized-Analysis-of-Text-Data.git
```
2. Install required packages:
```
pip install -r requirements.txt
```

## Usage
1. Open the notebook in Google Colab or your preferred Jupyter environment.
2. Run all cells in the notebook:
   - In Colab: Runtime -> Run all
   - In Jupyter: Cell -> Run All

## Sections
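
As a rough sketch of what the Data Collection and Dataset Building sections listed here might do, the following fetches 20 Newsgroups with scikit-learn's `fetch_20newsgroups` and wraps the result in a pandas DataFrame. The helper names (`build_dataframe`, `load_newsgroups`), the column names, and the choice to strip headers, footers, and quotes are assumptions for illustration and are not taken from the notebook.

```python
import pandas as pd


def build_dataframe(texts, targets, target_names):
    """Structure raw posts and integer labels into a DataFrame.

    The column names ("text", "target", "category") are illustrative
    assumptions, not necessarily those used in the notebook.
    """
    df = pd.DataFrame({"text": list(texts), "target": list(targets)})
    # Map each integer label to its human-readable newsgroup name.
    df["category"] = df["target"].map(lambda i: target_names[i])
    return df


def load_newsgroups(subset="train"):
    """Fetch 20 Newsgroups via scikit-learn (downloads on first call)."""
    from sklearn.datasets import fetch_20newsgroups

    # Removing headers, footers, and quotes is a common preprocessing
    # choice; whether the notebook does this is an assumption.
    raw = fetch_20newsgroups(subset=subset, remove=("headers", "footers", "quotes"))
    return build_dataframe(raw.data, raw.target, raw.target_names)
```

Calling `load_newsgroups()` returns one row per post; `df["category"].value_counts()` then gives a quick view of how the posts are spread across newsgroups.
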
1. Setup: Imports necessary libraries and initializes key components.
2. Data Collection: Fetches the 20 Newsgroups dataset.
3. Dataset Building: Structures the data into a pandas DataFrame.
4. Initial Textual Analysis: Performs basic text statistics.
5. Exploratory Data Analysis: Visualizes key aspects of the data.
6. AI-Enhanced Insights: Applies various NLP techniques for deeper analysis.

## Output
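
Word frequency analysis, part of the Initial Textual Analysis section and one of the outputs described here, can be approximated with the standard library alone. This is a minimal sketch: the `word_frequencies` helper, the regex tokenizer, and the tiny stop-word set are assumptions, and the notebook may well rely on NLTK instead.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set (an assumption); NLTK's English
# stop-word list would be a fuller choice.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def word_frequencies(texts, top_n=10):
    """Return the top_n most frequent non-stop-word tokens across texts."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep runs of letters and apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(top_n)


sample = ["The shuttle launch is delayed.", "A launch window opens in May."]
print(word_frequencies(sample, top_n=3))
```

The returned list of `(word, count)` pairs can be passed straight to a bar plot to draw the distribution.
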
The notebook generates various visualizations and outputs, including:
- Word frequency distributions
- Topic models
- Cluster visualizations
- Sentiment analysis results
- Named entity recognition results
- Text summaries

## Customization
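
One way to swap in your own data is to load it into a DataFrame with a single text column. The sketch below reads a CSV with pandas; the function name `load_custom_dataset`, the `text` output column, and the sample CSV contents are all hypothetical, so adjust them to whatever the rest of the notebook expects.

```python
import io

import pandas as pd


def load_custom_dataset(source, text_column="text"):
    """Load a CSV into a DataFrame with one cleaned "text" column."""
    df = pd.read_csv(source)
    if text_column not in df.columns:
        raise ValueError(f"column {text_column!r} not found in {list(df.columns)}")
    # Drop missing and whitespace-only rows so later steps see real text only.
    texts = df[text_column].dropna().astype(str).str.strip()
    texts = texts[texts != ""]
    return pd.DataFrame({"text": texts}).reset_index(drop=True)


# In-memory CSV standing in for a real file path (hypothetical data).
demo_csv = io.StringIO("id,body\n1,First document\n2,\n3,  Third document  \n")
df = load_custom_dataset(demo_csv, text_column="body")
print(df)
```

Passing a file path instead of the `StringIO` object works the same way, since `pd.read_csv` accepts both.
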
You can modify the notebook to use your own dataset by replacing the data collection step with your data loading process.

## Contributing
Contributions, issues, and feature requests are welcome. Feel free to check the [issues page](https://github.com/DrKenReid/Generalized-Analysis-of-Text-Data/issues) if you want to contribute.

## License
This project is licensed under the MIT License.

## Acknowledgements
- This project uses the 20 Newsgroups dataset for demonstration purposes.
- Special thanks to the developers of the various Python libraries used in this project.

## Disclaimer
This notebook is for educational and research purposes only. Ensure you have the right to use and analyze any data you input into this notebook.