https://github.com/oleksiym/tweet-engagement-analysis

A Python data science project that analyzes simulated tweets to understand user engagement (likes). It covers data generation, cleaning, EDA, statistical testing (ANOVA), correlation, and time series analysis using pandas, NumPy, Matplotlib, and Seaborn.

- [Tweet Engagement Analysis](#tweet-engagement-analysis)
- [Project Overview](#project-overview)
- [Project Goals](#project-goals)
- [Prerequisites](#prerequisites)
- [Project Structure](#project-structure)
- [Data Generation](#data-generation)
- [Data Cleaning and Preprocessing](#data-cleaning-and-preprocessing)
- [Exploratory Data Analysis (EDA) and Visualization](#exploratory-data-analysis-eda-and-visualization)
- [Analysis Techniques](#analysis-techniques)
- [Results and Conclusions](#results-and-conclusions)
- [Recommendations](#recommendations)
- [Limitations](#limitations)
- [How to Run](#how-to-run)
- [Contributing](#contributing)
- [License](#license)

# Tweet Engagement Analysis

## Project Overview

This repository contains a Python data science project that simulates and analyzes social media data (specifically, tweets) to gain insight into user engagement. The project focuses on the factors that influence the number of "likes" a tweet receives. It is an exploratory data analysis (EDA) project that applies data cleaning, visualization, and statistical techniques.

## Project Goals

* Simulate a realistic dataset of tweets with relevant features (ID, category, likes, date, user ID).
* Clean and preprocess the data, handling missing values and outliers.
* Perform exploratory data analysis (EDA) to visualize data distributions and relationships.
* Conduct statistical analysis (ANOVA and correlation) to test hypotheses.
* Explore time series patterns in engagement.
* Draw conclusions and suggest recommendations for further analysis.

## Prerequisites

* Python 3.7+
* Jupyter Notebook (or a similar environment)
* Required Libraries:
  * pandas
  * NumPy
  * Matplotlib
  * Seaborn
  * SciPy

You can install the required libraries using pip:

```bash
pip install pandas numpy matplotlib seaborn scipy
```

## Project Structure

* **`simulated_tweets.csv`:** The simulated dataset (generated by the notebook).
* **`SocialMediaDataAnalysis.ipynb`:** The Jupyter Notebook containing all the code, analysis, and visualizations.
* **`SocialMediaAnalysisReport.pdf`:** The project report, detailing the analysis, findings, and recommendations.
* **`images/`:** A directory to store the generated plots for easy inclusion in reports or presentations. *Note: You may need to create this directory manually.*
  * **(PNG files):** Plots generated for the presentation.

## Data Generation

The project begins by generating a synthetic dataset of 1000 tweets. The data includes the following features:

* **`tweet_id`:** A unique identifier for each tweet.
* **`category`:** The category the tweet belongs to (News, Sports, Entertainment, Tech, Food, Travel, Fashion).
* **`likes`:** The number of likes the tweet received (simulated using an exponential distribution with intentional outliers).
* **`date`:** The date the tweet was posted.
* **`user_id`:** A unique identifier for the user who posted the tweet.

Missing values (5%) are intentionally introduced into the `category` and `likes` columns to demonstrate data cleaning techniques.
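
The generation step described above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `generate_tweets`, the exponential scale of 50, the number and size of the outliers, the one-year date window, and the pool of 200 users are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

CATEGORIES = ["News", "Sports", "Entertainment", "Tech", "Food", "Travel", "Fashion"]


def generate_tweets(n=1000, missing_frac=0.05, seed=42):
    """Return a synthetic tweet DataFrame (all parameter values are illustrative)."""
    rng = np.random.default_rng(seed)

    # Exponentially distributed likes; the scale of 50 is an assumed value.
    likes = rng.exponential(scale=50, size=n).round()

    # Inflate a handful of tweets so they act as intentional outliers (assumed: 10 tweets, x20).
    outlier_idx = rng.choice(n, size=10, replace=False)
    likes[outlier_idx] *= 20

    df = pd.DataFrame({
        "tweet_id": np.arange(1, n + 1),
        "category": rng.choice(CATEGORIES, size=n),
        "likes": likes,
        # Random posting dates within an assumed one-year window.
        "date": pd.Timestamp("2023-01-01")
        + pd.to_timedelta(rng.integers(0, 365, size=n), unit="D"),
        # Assumed pool of 200 users.
        "user_id": rng.integers(1, 201, size=n),
    })

    # Blank out 5% of each of the two columns, chosen independently.
    n_missing = int(n * missing_frac)
    for col in ("category", "likes"):
        idx = rng.choice(n, size=n_missing, replace=False)
        df.loc[idx, col] = np.nan
    return df


tweets = generate_tweets()
```

Fixing the seed makes the dataset reproducible, so plots and test results stay stable between runs.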

## Data Cleaning and Preprocessing

The following data cleaning steps are performed:

1. **Missing Values:**
   * Rows with missing values in the `category` column are dropped, as it's a crucial categorical feature.
   * Missing values in the `likes` column are imputed using the median (more robust to outliers than the mean).
2. **Data Type Conversion:** The `date` column is converted to the `datetime` data type.
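
As a rough sketch, the cleaning steps above map onto a few pandas calls. The five-row sample and the helper name `clean_tweets` are hypothetical, and computing the median after dropping rows (rather than before) is an assumption about the order of operations.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical sample standing in for simulated_tweets.csv.
raw = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4, 5],
    "category": ["News", None, "Tech", "Food", "Tech"],
    "likes": [10.0, 25.0, np.nan, 40.0, 900.0],
    "date": ["2023-01-01", "2023-01-02", "2023-01-02", "2023-01-03", "2023-01-04"],
    "user_id": [7, 8, 7, 9, 8],
})


def clean_tweets(df):
    """Apply the cleaning steps to a copy of df and return it."""
    # Step 1a: drop rows that have no category.
    cleaned = df.dropna(subset=["category"]).copy()
    # Step 1b: fill missing likes with the median of the remaining rows.
    cleaned["likes"] = cleaned["likes"].fillna(cleaned["likes"].median())
    # Step 2: parse the date strings into datetime values.
    cleaned["date"] = pd.to_datetime(cleaned["date"])
    return cleaned


clean = clean_tweets(raw)
```

In this sample the remaining known like counts are 10, 40, and 900, so the missing value is filled with the median, 40. The mean would be about 317, pulled up by the single outlier, which is why the median is the safer choice here.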

## Exploratory Data Analysis (EDA) and Visualization

The project uses various visualization techniques to explore the data:

* **Distribution of Likes (Histogram):** Shows the overall distribution of likes, highlighting the skewness.
* **Likes by Category (Box Plots):** Compares the distribution of likes across different categories.
* **Category Counts (Count Plot):** Visualizes the number of tweets in each category.
* **Rolling Average of Likes Over Time (Line Plot):** Displays the 30-day rolling average of likes to identify trends.
* **Daily Average Likes (Line Plot):** Shows the average likes per day to highlight daily fluctuations.
* **Likes by User (Bar Chart):** Shows top users by total likes received.
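
The histogram and the two time-based line plots could be produced along these lines. This is a sketch on hypothetical stand-in data, and it assumes the 30-day rolling average is taken over the daily averages; the notebook may instead roll over individual tweets.

```python
import numpy as np
import pandas as pd
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line inside Jupyter
import matplotlib.pyplot as plt

# Hypothetical stand-in data: 600 tweets spread over 120 days.
rng = np.random.default_rng(0)
days = pd.date_range("2023-01-01", periods=120, freq="D")
tweets = pd.DataFrame({
    "date": days[rng.integers(0, len(days), size=600)],
    "likes": rng.exponential(scale=50, size=600).round(),
})

# Average likes per calendar day (days without tweets become NaN).
daily_avg = tweets.set_index("date")["likes"].resample("D").mean()

# 30-day time-based window; NaN days are skipped inside each window.
rolling_avg = daily_avg.rolling("30D").mean()

fig, (ax_hist, ax_time) = plt.subplots(1, 2, figsize=(12, 4))

# Left panel: distribution of likes, which shows the right skew.
ax_hist.hist(tweets["likes"], bins=30)
ax_hist.set_title("Distribution of Likes")
ax_hist.set_xlabel("Likes")

# Right panel: noisy daily average with the smoother rolling average on top.
daily_avg.plot(ax=ax_time, alpha=0.4, label="Daily average")
rolling_avg.plot(ax=ax_time, label="30-day rolling average")
ax_time.set_title("Likes Over Time")
ax_time.legend()

fig.tight_layout()
# fig.savefig("images/likes_overview.png")  # needs the images/ directory to exist
plt.close(fig)
```

A time-based window (`"30D"`) is used instead of a fixed count of 30 rows so that days without any tweets do not shift the window boundaries.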

## Analysis Techniques

* **ANOVA (Analysis of Variance):** A statistical test to determine if there's a significant difference in the *mean* number of likes between different categories.
* **Correlation Matrix (Heatmap):** Visualizes the linear correlations between the `likes` column and the one-hot encoded category variables.
* **Time Series Analysis:** Computes the daily average of likes and a 30-day rolling average to reveal trends in engagement over time.
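
A minimal sketch of the ANOVA and correlation steps, using `scipy.stats.f_oneway` and `pandas.get_dummies`. The three-category sample data is hypothetical, and the notebook's own code may differ in detail.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in data: likes drawn from the same distribution for every category.
rng = np.random.default_rng(1)
categories = ["News", "Sports", "Tech"]
tweets = pd.DataFrame({
    "category": rng.choice(categories, size=300),
    "likes": rng.exponential(scale=50, size=300).round(),
})

# One-way ANOVA: one array of likes per category.
groups = [group["likes"].to_numpy() for _, group in tweets.groupby("category")]
f_stat, p_value = stats.f_oneway(*groups)

# One-hot encode the category and compute Pearson correlations with likes.
dummies = pd.get_dummies(tweets["category"], dtype=float)
corr = pd.concat([tweets["likes"], dummies], axis=1).corr()
likes_corr = corr["likes"].drop("likes")

print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
print(likes_corr)
```

A p-value above the chosen significance level (commonly 0.05) means the test found no evidence that mean likes differ between categories. The `corr` DataFrame can be passed directly to `seaborn.heatmap(corr, annot=True)` to draw the heatmap.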

## Results and Conclusions

The analysis revealed the following key findings:

* The distribution of likes is highly skewed (most tweets have few likes, a few have many).
* While there's variation in the distribution of likes across categories, the ANOVA test showed *no statistically significant difference* in the mean likes between categories.
* The correlation matrix confirmed the weak (or non-existent) linear relationship between `likes` and `category`.
* The time series plots showed fluctuations in engagement over time, suggesting the influence of external factors.

## Recommendations

* **Investigate Outliers:** Examine the tweets with extremely high like counts to understand their characteristics.
* **Explore Temporal Patterns:** Analyze daily/weekly/monthly cycles and correlate engagement with external events.
* **Content Analysis:** Use Natural Language Processing (NLP) to analyze the text of the tweets.
* **User Segmentation:** Identify different user groups based on their behavior.
* **Predictive Modeling:** Build a model to predict tweet likes, incorporating additional features (content, user, time).
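
As a starting point for the outlier investigation, high-like tweets could be flagged with the common 1.5 × IQR rule. This is an assumed technique, not one the project prescribes; the helper `flag_high_outliers` and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical like counts with one extreme value.
likes = pd.Series([3, 8, 12, 15, 20, 22, 30, 41, 55, 950], name="likes")


def flag_high_outliers(values, k=1.5):
    """Return a boolean mask for values above Q3 + k * IQR."""
    q1, q3 = values.quantile([0.25, 0.75])
    upper_fence = q3 + k * (q3 - q1)
    return values > upper_fence


outliers = likes[flag_high_outliers(likes)]
```

Only the upper fence is checked because like counts cannot go below zero and the interest here is in unusually popular tweets.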

## Limitations

* The analysis is based on *simulated* data. Real-world social media data is often more complex and requires more extensive cleaning and preprocessing.
* The dataset only includes a limited set of features. Many other factors can influence engagement.
* The correlation analysis only considers linear relationships.
* Correlation does not imply causation.

## How to Run

1. **Clone the repository:**

```bash
git clone https://github.com/OleksiyM/tweet-engagement-analysis.git
```
2. **Navigate to the project directory:**

```bash
cd tweet-engagement-analysis
```
3. **Open and run the Jupyter Notebook:**

```bash
jupyter notebook SocialMediaDataAnalysis.ipynb
```

Make sure you have the necessary libraries installed (see Prerequisites). The notebook will generate the `simulated_tweets.csv` file automatically.

## Contributing

Contributions are welcome! If you have ideas for improvements, bug fixes, or new features, please feel free to open an issue or submit a pull request.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.