Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dogan-the-analyst/social_media_data_analysis
Data analysis with Python.
- Host: GitHub
- URL: https://github.com/dogan-the-analyst/social_media_data_analysis
- Owner: dogan-the-analyst
- Created: 2024-12-25T14:02:02.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2025-01-01T17:03:26.000Z (about 1 month ago)
- Last Synced: 2025-01-01T18:18:23.474Z (about 1 month ago)
- Topics: data-analysis, jupyter-notebook, matplotlib, numpy, pandas, python
- Language: Jupyter Notebook
- Homepage:
- Size: 114 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Social Media Data Analysis
## Introduction
Social media has become a ubiquitous part of modern life, with platforms such as Instagram, Twitter, and Facebook serving as essential communication channels. Social media data sets are vast and complex, making analysis a challenging task for businesses and researchers alike. In this project, I explore a simulated social media data set (e.g., tweets) to understand trends in likes across different categories.
## Project Scope
The objective of this project is to analyze tweets (or other social media data) and gain insights into user engagement. I will explore the data set using visualization techniques to understand the distribution of likes across different categories. Finally, I will analyze the data to draw conclusions about the most popular categories and the overall engagement on the platform.
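Since the data set is simulated with `random` and `numpy`, each run produces different numbers. Seeding both generators first (an optional step, not part of the original notebook; the seed value 42 is an arbitrary choice) makes every run reproducible:

```python
import random
import numpy as np

# Seed both generators so the simulated data set is identical across runs
random.seed(42)
np.random.seed(42)
```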
## Importing Required Libraries
```python
import pandas as pd
from pandasql import sqldf
import numpy as np
import matplotlib.pyplot as plt
import random
```

## Creating Categories List, Data Size, and Random Data
```python
categories = ['Food', 'Travel', 'Fashion',
              'Fitness', 'Music', 'Culture',
              'Family', 'Health', 'Sports']
n = 1000
data_dict = {'Date': pd.date_range('2023-01-01', periods=n),
             'Category': [random.choice(categories) for _ in range(n)],
             'Likes': np.random.randint(0, 10000, size=n)}
```

## Converting the Dictionary to a DataFrame
```python
df = pd.DataFrame(data_dict)
df.tail()
```

## Exploratory Data Analysis (EDA)
### Checking Data Types
```python
df.dtypes
df.info()
```

### Dropping Duplicate Rows
```python
df.duplicated().sum()
```
- Great! Fortunately, there are no duplicate rows.

```python
df = df.drop_duplicates()
```

### Dropping Missing & Null Values
```python
df.isnull().sum()
```
- Again, no missing or null values.

```python
df = df.dropna()
df.count()
```

## Converting Fields
Even though the `Date` and `Likes` columns are already in the correct formats, I applied these conversions as a safeguard.
```python
df['Date'] = pd.to_datetime(df['Date'])
df['Likes'] = df['Likes'].astype('int64')
```

## Visualizing and Analyzing the Data
### Likes by Category
```python
likes_by_category = df.groupby('Category')['Likes'].sum()
likes_by_category.plot(kind="bar", figsize=(10, 5))
plt.title("Number of Likes by Category")
plt.ylabel("Number of Likes")
plt.xlabel("Category")
plt.xticks(rotation=45)
plt.show()
```

### Pie Chart of Likes Distribution
```python
likes_by_category = df.groupby('Category')['Likes'].sum()
plt.pie(likes_by_category, labels=likes_by_category.index)
plt.title("Distribution of Likes by Category")
plt.show()
```

### Mean of Likes
```python
mean_of_likes = df['Likes'].mean()
print("Mean of the 'Likes' column: {}".format(mean_of_likes))
```

### Number of Tweets by Category
```python
number_of_tweets_by_category = df.groupby('Category')['Category'].count()
number_of_tweets_by_category.plot(kind="bar", figsize=(10, 5), color='green')
plt.title("Number of Tweets by Category")
plt.ylabel("Number of Tweets")
plt.xlabel("Category")
plt.xticks(rotation=45)
plt.show()
```

### Pie Chart of Tweets Distribution
```python
number_of_tweets_by_category = df.groupby('Category')['Category'].count()
plt.pie(number_of_tweets_by_category, labels=number_of_tweets_by_category.index)
plt.title("Distribution of Tweets by Category")
plt.show()
```

### Most Popular Category
```python
query = """
SELECT
Category,
COUNT(*) AS Frequency
FROM
df
GROUP BY
Category
ORDER BY
Frequency DESC
"""

result = sqldf(query)
print(result)
```

- The result shows that `Sports` comes first in this run (the data are randomly generated, so the ranking can vary between runs).
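The same frequency table can also be produced without SQL, using plain pandas: `value_counts()` counts rows per category and already sorts by frequency in descending order. A minimal sketch with a hypothetical toy frame standing in for `df`:

```python
import pandas as pd

# Toy stand-in for the project's df (hypothetical values)
df = pd.DataFrame({'Category': ['Sports', 'Food', 'Sports',
                                'Music', 'Sports', 'Food']})

# Rows per category, sorted by frequency (descending)
frequency = df['Category'].value_counts()
print(frequency)
```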