Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/boddepallyvenkatesh06/clean-and-analyze-social-media-usage-data-with-python
Utilized Python for data cleaning and analysis of social media usage data, employing libraries like Pandas and Matplotlib. Improved insights for informed decision-making and business strategies.
pyhon python-lambda python-script pythonframework pythonsqlite
Last synced: 29 days ago
- Host: GitHub
- URL: https://github.com/boddepallyvenkatesh06/clean-and-analyze-social-media-usage-data-with-python
- Owner: BoddepallyVenkatesh06
- License: agpl-3.0
- Created: 2023-09-15T15:05:27.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-09-15T15:15:47.000Z (over 1 year ago)
- Last Synced: 2023-09-16T08:22:18.581Z (over 1 year ago)
- Topics: pyhon, python-lambda, python-script, pythonframework, pythonsqlite
- Language: JavaScript
- Homepage: https://secnhack.in/social-analyzer-analyzing-finding-a-person-profiles/
- Size: 667 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Clean-and-analyze-social-media-usage-data-with-Python
---
# Table of Contents
- [Problem Statement](https://github.com/globalsmile/Twitter-Sentiment-Analysis#Problem-Statement)
- [Data Sourcing](https://github.com/globalsmile/Twitter-Sentiment-Analysis#Data-Sourcing)
- [Data Preparation](https://github.com/globalsmile/Twitter-Sentiment-Analysis#Data-Preparation)
- [Data Modeling](https://github.com/globalsmile/Twitter-Sentiment-Analysis#Data-Modeling)
- [Data Visualization](https://github.com/globalsmile/Twitter-Sentiment-Analysis#Data-Visualization)
- [Data Analysis](https://github.com/globalsmile/Twitter-Sentiment-Analysis#Data-Analysis)
- [Insights](https://github.com/globalsmile/Twitter-Sentiment-Analysis#Insights)
- [Shareable link](https://github.com/globalsmile/Twitter-Sentiment-Analysis#Shareable-Link)

---
# Problem Statement
The purpose of this analysis is to gain insight into the engagement that the #NG30DaysOfLearning hashtag received on Twitter.
The study examines several categories: the number of tweets, the number of users, the most active users, the most mentioned tools, etc.
---
# Data Sourcing
- The dataset used for this analysis was scraped from Twitter in a Jupyter notebook
- A preview of the dataset and the Python code are shown below:

```python
!pip install snscrape
```

Requirement already satisfied: snscrape in c:\users\user\anaconda3\lib\site-packages (0.4.3.20220106)
Requirement already satisfied: lxml in c:\users\user\anaconda3\lib\site-packages (from snscrape) (4.8.0)
Requirement already satisfied: beautifulsoup4 in c:\users\user\anaconda3\lib\site-packages (from snscrape) (4.11.1)
Requirement already satisfied: requests[socks] in c:\users\user\anaconda3\lib\site-packages (from snscrape) (2.27.1)
Requirement already satisfied: filelock in c:\users\user\anaconda3\lib\site-packages (from snscrape) (3.6.0)
Requirement already satisfied: soupsieve>1.2 in c:\users\user\anaconda3\lib\site-packages (from beautifulsoup4->snscrape) (2.3.1)
Requirement already satisfied: idna<4,>=2.5 in c:\users\user\anaconda3\lib\site-packages (from requests[socks]->snscrape) (3.3)
Requirement already satisfied: charset-normalizer~=2.0.0 in c:\users\user\anaconda3\lib\site-packages (from requests[socks]->snscrape) (2.0.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\users\user\anaconda3\lib\site-packages (from requests[socks]->snscrape) (1.26.9)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\user\anaconda3\lib\site-packages (from requests[socks]->snscrape) (2021.10.8)
Requirement already satisfied: PySocks!=1.5.7,>=1.5.6 in c:\users\user\anaconda3\lib\site-packages (from requests[socks]->snscrape) (1.7.1)
```python
import pandas as pd
import snscrape.modules.twitter as sntwitter
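```

```python
# Search query: either hashtag, restricted to the campaign's date window.
query = "(#30DaysOfLearning OR #NG30DaysOfLearning) until:2022-06-26 since:2022-05-05"
tweets = []
limit = 30000  # upper bound on the number of tweets to collect

# The original notebook passes the full query to TwitterHashtagScraper;
# sntwitter.TwitterSearchScraper is the class intended for search queries like this one.
for tweet in sntwitter.TwitterHashtagScraper(query).get_items():
    if len(tweets) == limit:
        break
    # Keep only the fields needed for the analysis.
    tweets.append([tweet.date, tweet.url, tweet.user.username, tweet.sourceLabel, tweet.user.location, tweet.content, tweet.likeCount, tweet.retweetCount, tweet.quoteCount, tweet.replyCount])

df = pd.DataFrame(tweets, columns=['Date', 'TweetURL', 'User', 'Source', 'Location', 'Tweet', 'Likes_Count', 'Retweet_Count', 'Quote_Count', 'Reply_Count'])

# Save the scraped tweets for the Power BI steps that follow.
df.to_csv('30DLTweets.csv')
```

```python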
df.head()
```

| | Date | TweetURL | User | Source | Location | Tweet | Likes_Count | Retweet_Count | Quote_Count | Reply_Count |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2022-06-25 22:51:18+00:00 | https://twitter.com/poetrineer/status/15408300... | poetrineer | Twitter for Android | Oyo, Nigeria | So as one of my commitment to document my lear... | 0 | 0 | 0 | 1 |
| 1 | 2022-06-25 22:44:10+00:00 | https://twitter.com/poetrineer/status/15408282... | poetrineer | Twitter for Android | Oyo, Nigeria | Finally, here is my updated COVID-19 Data Anal... | 1 | 0 | 0 | 0 |
| 2 | 2022-06-25 19:25:58+00:00 | https://twitter.com/MichealOjuri/status/154077... | MichealOjuri | Twitter Web App | Oyo, Nigeria | #30NGDaysOfLearning\n#30daysoflearning \n#micr... | 0 | 0 | 0 | 0 |
| 3 | 2022-06-25 16:44:36+00:00 | https://twitter.com/oye__aashu/status/15407377... | oye__aashu | Twitter for Android | Nainital, India | Day 4/ #30daysoflearning learned all about arr... | 3 | 0 | 0 | 1 |
| 4 | 2022-06-25 12:49:02+00:00 | https://twitter.com/hsb_data/status/1540678455... | hsb_data | Twitter Web App | New Jersey | Learning about sub queries on @DataCamp (SQL) ... | 1 | 0 | 0 | 0 |

```python
df.describe()
```

| | Likes_Count | Retweet_Count | Quote_Count | Reply_Count |
| --- | --- | --- | --- | --- |
| count | 683.000000 | 683.000000 | 683.000000 | 683.000000 |
| mean | 15.780381 | 3.812592 | 0.185944 | 1.166911 |
| std | 41.164555 | 12.665561 | 0.737938 | 2.626554 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 2.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 8.000000 | 2.000000 | 0.000000 | 1.000000 |
| max | 549.000000 | 248.000000 | 9.000000 | 29.000000 |

The dataset is also available at [30DLTweets](https://github.com/globalsmile/Twitter-Sentiment-Analysis/blob/main/30DLTweets.csv)
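
As a minimal sketch of reading the exported file back into pandas, assuming it was written by `df.to_csv('30DLTweets.csv')` as in the notebook (so the first column is the unnamed index). The inline sample stands in for the real file and reuses two rows from the preview; the helper name `load_tweets` is hypothetical.

```python
import io

import pandas as pd

# Two rows in the shape produced by df.to_csv(); values come from the preview.
SAMPLE_CSV = """,Date,TweetURL,User,Source,Location,Tweet,Likes_Count,Retweet_Count,Quote_Count,Reply_Count
0,2022-06-25 22:51:18+00:00,https://twitter.com/poetrineer/status/15408300...,poetrineer,Twitter for Android,"Oyo, Nigeria",So as one of my commitment to document my lear...,0,0,0,1
1,2022-06-25 22:44:10+00:00,https://twitter.com/poetrineer/status/15408282...,poetrineer,Twitter for Android,"Oyo, Nigeria","Finally, here is my updated COVID-19 Data Anal...",1,0,0,0
"""


def load_tweets(source):
    """Read the exported CSV; `source` is a path or a file-like object."""
    # index_col=0 restores the unnamed index column that to_csv() wrote.
    frame = pd.read_csv(source, index_col=0)
    # Parse the timestamp strings into timezone-aware datetimes.
    frame["Date"] = pd.to_datetime(frame["Date"], utc=True)
    return frame


tweets_df = load_tweets(io.StringIO(SAMPLE_CSV))
print(tweets_df.dtypes)
```

Passing `'30DLTweets.csv'` instead of the `StringIO` object would read the real export in the same way.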
---
# Data Preparation
Data transformation was done in Power Query, and the dataset was then loaded into Microsoft Power BI Desktop for modeling.
The Twitter Sentiment dataset consists of a single table named:
- `30DLTweets`, which has `10 columns and 680 rows` of observations

The table below lists the columns of the `30DLTweets` table and their descriptions:
| Column Name | Description |
| ----------- | ----------- |
| Date | Represents the date and time of tweet |
| TweetURL | Describes the tweet URL |
| User | Describes the username of the user |
| Source | Describes the device type of the user |
| Location | Describes the location of the user |
| Tweet | Describes the content of the tweet |
| Likes_Count | Represents the count of likes of the tweet |
| Retweet_Count | Represents the count of retweets of the tweet |
| Quote_Count | Represents the count of quote tweets on the tweet |
| Reply_Count | Represents the count of replies on the tweet |

Data cleaning for the dataset was done in Power Query as follows:
- The `30DLTweets` table was split into one dimension table and two fact tables, named respectively:
  1. `UserProfile`
  2. `TweetStats`
  3. `TweetProfile`
- A calculated column `UserID` was created in each of the tables using the M formula `UserID = [User] & "_" & [Source]`
- Unnecessary columns were removed from each of the tables
- Each column in the tables was validated to have the correct data type

To ensure the accuracy of the dates in the `Date` column of the `TweetStats` and `TweetProfile` tables, a date table was created for reference using the M formula:
`{Number.From(List.Min(TweetProfile[Date]))..Number.From(List.Max(TweetProfile[Date]))}`
Here is a breakdown of what the formula does:
`List.Min` and `List.Max` pick the earliest and latest dates in the data, `Number.From` converts each to its serial day number, and `{a..b}` lists every whole number between them, giving one entry per day. For this dataset the range runs from May 9, 2022 to June 25, 2022.
A `Day Name` column was inserted into the date table and renamed to `DayOfTheWeek`.
The date table was named `Calender`.
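
For readers without Power Query, the cleaning steps can be approximated in pandas. This is only a sketch under stated assumptions: the sample rows are illustrative, and the choice of which columns go into each table is a guess, since the README does not list them.

```python
import pandas as pd

# Hypothetical miniature of the 30DLTweets table (values are illustrative).
tweets = pd.DataFrame({
    "Date": pd.to_datetime(["2022-05-09 10:00", "2022-05-11 08:30", "2022-05-11 21:15"]),
    "User": ["poetrineer", "hsb_data", "poetrineer"],
    "Source": ["Twitter for Android", "Twitter Web App", "Twitter for Android"],
    "Location": ["Oyo, Nigeria", "New Jersey", "Oyo, Nigeria"],
    "Tweet": ["Day 1", "Day 1 of SQL", "Day 3"],
    "Likes_Count": [0, 1, 3],
})

# Same key as the M formula: [User] & "_" & [Source].
tweets["UserID"] = tweets["User"] + "_" + tweets["Source"]
# Keep the calendar day only, so fact rows can match the date table.
tweets["Date"] = tweets["Date"].dt.normalize()

# One dimension table (one row per UserID) and two fact tables.
# The column assignment per table is an assumption.
user_profile = tweets[["UserID", "User", "Source", "Location"]].drop_duplicates(subset="UserID")
tweet_stats = tweets[["UserID", "Date", "Likes_Count"]]
tweet_profile = tweets[["UserID", "Date", "Tweet"]]

# Date table: every day from the earliest to the latest tweet date.
calendar = pd.DataFrame(
    {"Date": pd.date_range(tweets["Date"].min(), tweets["Date"].max(), freq="D")}
)
calendar["DayOfTheWeek"] = calendar["Date"].dt.day_name()

print(calendar)
```

The date table includes May 10 even though no sample tweet falls on that day, which is the point of building it from the minimum and maximum rather than from the distinct tweet dates.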
---
# Data Modeling
After the dataset was cleaned and transformed, it was ready to be modeled.
- The `Calender` table was marked as the official date table in the dataset.
- A `many-to-one (*:1) relationship` was created between the `TweetStats` and the `Calender` tables using the `Date` column in each of the tables
- A `many-to-one (*:1) relationship` was created between the `TweetProfile` and the `Calender` tables using the `Date` column in each of the tables
- A `one-to-many (1:*) relationship` was created between the `UserProfile` and the `TweetStats` tables using the `UserID` column in each of the tables
- A `one-to-many (1:*) relationship` was created between the `UserProfile` and the `TweetProfile` tables using the `UserID` column in each of the tables
- The relationships formed in the data model make up a `Star Schema`, which is shown below:

---
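
The relationships in the data model require that each fact row matches at most one row in `Calender` and one row in `UserProfile`. A rough way to check the same constraint outside Power BI is a pandas merge with `validate`. The tables below are hypothetical stand-ins, not the real model.

```python
import pandas as pd

# Hypothetical stand-ins for three of the model tables.
calendar = pd.DataFrame({"Date": pd.date_range("2022-05-09", "2022-05-11", freq="D")})
user_profile = pd.DataFrame({"UserID": ["a_Web", "b_Android"], "User": ["a", "b"]})
tweet_stats = pd.DataFrame({
    "UserID": ["a_Web", "a_Web", "b_Android"],
    "Date": pd.to_datetime(["2022-05-09", "2022-05-11", "2022-05-11"]),
    "Likes_Count": [2, 0, 5],
})

# validate="many_to_one" raises pandas.errors.MergeError when the key on the
# right-hand (dimension) side is not unique.
model = (
    tweet_stats
    .merge(calendar, on="Date", how="left", validate="many_to_one", indicator="in_calendar")
    .merge(user_profile, on="UserID", how="left", validate="many_to_one")
)

# Every fact row should find a matching calendar date.
unmatched = int((model["in_calendar"] != "both").sum())
print(model)
print("fact rows without a calendar date:", unmatched)
```

A duplicate `UserID` in `user_profile` would make the second merge fail, which mirrors Power BI refusing a relationship whose "one" side contains repeated keys.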
# Data Visualization
Data visualization for the dataset was done in three parts using Microsoft Power BI Desktop:
- The `Content Analysis`: Shows the tools by mention, word cloud, top active users, etc.
- The `Summary`: Shows the total number of tweets, total number of users, tweet by day of the week, etc.
- The `Dashboard`: Shows visualizations from `Content Analysis` and `Summary` to answer the [Problem Statement](https://github.com/globalsmile/Twitter-Sentiment-Analysis#Problem-Statement).

Figure 1 shows visualizations from the `Content Analysis` page

| Figure 1 |
| ----------- |
| ![image](https://user-images.githubusercontent.com/106287208/187567316-46bc6332-7507-4f11-b3c7-a18a52ed8e14.png) |

Figure 2 shows visualizations from the `Summary` page

| Figure 2 |
| ----------- |
| ![image](https://user-images.githubusercontent.com/106287208/187567503-2db262f6-c7f2-449d-9b11-01ff7e084022.png) |

Figure 3 shows visualizations from the `Dashboard` page

| Figure 3 |
| ----------- |
| ![image](https://user-images.githubusercontent.com/106287208/187677655-9bbd0398-f0fe-4af0-870f-f9e090633f13.png) |

---
# Data Analysis
Measures used in visualization are:
- Total No. of Tweets = `COUNT(TweetProfile[Tweet])`
- Total No. of Users = `COUNT(UserProfile[user])`
- Total No. of Sources = `COUNT(UserProfile[Source])`
- Average Engagement Rate = `DIVIDE(Total No. of Users, Total No. of Tweets)`

As shown in [Data Visualization](https://github.com/globalsmile/Twitter-Sentiment-Analysis#Data-Visualization), it can be deduced that:
- There were `680` tweets for the #NG30DaysOfLearning hashtag
- There were about `234` users
- The average engagement rate is `34%`

---
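
The measures can be mirrored in pandas to sanity-check the reported figures. This is a sketch on hypothetical sample tables; `engagement_measures` is an illustrative helper name, not something from the report.

```python
import pandas as pd

# Hypothetical sample; the real tables live in the Power BI model.
tweet_profile = pd.DataFrame({
    "UserID": ["u1", "u1", "u2", "u3", "u3", "u3"],
    "Tweet": ["t1", "t2", "t3", "t4", "t5", "t6"],
})
user_profile = pd.DataFrame({"UserID": ["u1", "u2", "u3"], "User": ["ann", "bo", "cy"]})


def engagement_measures(tweet_profile, user_profile):
    """Compute pandas counterparts of the tweet, user, and rate measures."""
    # Series.count() skips missing values, much like DAX COUNT skips blanks.
    total_tweets = int(tweet_profile["Tweet"].count())
    total_users = int(user_profile["User"].count())
    # DAX DIVIDE returns blank on division by zero; None plays that role here.
    rate = total_users / total_tweets if total_tweets else None
    return {"total_tweets": total_tweets, "total_users": total_users, "rate": rate}


measures = engagement_measures(tweet_profile, user_profile)
print(measures)

# The figures reported in this section: 234 users over 680 tweets.
reported_rate = f"{234 / 680:.0%}"
print(reported_rate)
```

With the reported totals, 234 / 680 is about 0.344, which matches the `34%` shown on the dashboard.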
# Insights
As shown by [Data Visualization](https://github.com/globalsmile/Twitter-Sentiment-Analysis#Data-Visualization), it can be deduced that:
- The most active user of the #NG30DaysOfLearning hashtag is [theoyinbooke](https://techcommunity.microsoft.com/t5/user/viewprofilepage/user-id/1379718) with 110 tweets
- The most mentioned tool is GitHub

---
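
Both insights come from the Power BI visuals, but a similar ranking can be sketched in pandas. The sample tweets and the `TOOLS` list are assumptions for illustration; the report does not state which tool names were counted or how.

```python
import pandas as pd

# Hypothetical sample tweets; the tool list below is also an assumption.
tweets = pd.DataFrame({
    "User": ["ann", "ann", "bo", "ann", "cy"],
    "Tweet": [
        "Day 1: pushed my first repo to GitHub",
        "Day 2: cleaning data in Excel",
        "Learning Power BI and github actions",
        "Day 3: more SQL practice",
        "Started a GitHub portfolio",
    ],
})
TOOLS = ["GitHub", "Excel", "Power BI", "SQL", "Python"]

# Most active users: number of tweets per user, largest first.
tweets_per_user = tweets["User"].value_counts()

# Tool mentions: number of tweets containing each tool name, ignoring case.
lowered = tweets["Tweet"].str.lower()
tool_mentions = pd.Series(
    {tool: int(lowered.str.contains(tool.lower(), regex=False).sum()) for tool in TOOLS}
).sort_values(ascending=False)

print(tweets_per_user.head(3))
print(tool_mentions)
```

Plain substring matching is deliberately simple: it would also count "SQL" inside "MySQL", so a real analysis might prefer word-boundary matching.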
# Shareable Link
You can interact with the report here:
[View Report](https://app.powerbi.com/view?r=eyJrIjoiZjMzMjk1ZDAtYzBjYy00OTZjLTk1YzQtMzI1MjE0NWFkOGYxIiwidCI6IjQ5ODY4YWYzLWNjNWYtNDIxNC04YjdmLTQwZjM3NDY0OWEwOSJ9)