https://github.com/omarsolieman/socialgiveawaydataanalysis
This project involved cleaning, analyzing, and processing data from an Instagram giveaway to ensure a fair and data-driven winner selection process. The primary goal was to automate the process of identifying valid entries, weighting them based on engagement (likes and multiple entries), and performing a post-giveaway analysis
- Host: GitHub
- URL: https://github.com/omarsolieman/socialgiveawaydataanalysis
- Owner: omarsolieman
- License: mit
- Created: 2025-08-30T08:30:55.000Z
- Default Branch: main
- Last Pushed: 2025-08-30T09:34:24.000Z
- Last Synced: 2025-08-30T10:20:38.058Z
- Topics: data-analysis, data-science, data-visualization, instagram, scraping, threejs
- Language: Python
- Homepage:
- Size: 163 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# SocialGiveAwayDataAnalysis
# 📊 Project: Instagram Giveaway Analysis & Winner Selection Automation
## 🚀 Project Overview
This project involved cleaning, analyzing, and processing data from an Instagram giveaway to ensure a fair and data-driven winner selection process. The primary goal was to **automate the process of identifying valid entries, weighting them based on engagement (likes and multiple entries), and performing a post-giveaway analysis** to understand participant engagement.
---
## 🤯 The Challenge: Dirty Data
The initial dataset, exported from an Instagram scraping tool (`instagram.csv`), presented several challenges:
- **Cryptic Column Headers:** Columns were labeled with non-descriptive names (e.g., `x1i10hfl href`, `_ap3a`), making it difficult to understand the data.
- **Ambiguous "Likes" Data:** The number of likes on a comment was embedded in a text string like `"1 like"` or `"15 likes"` within an `action_type` column.
- **Defining a "Valid Entry":** The entry rule (tagging at least 3 people) needed to be verified programmatically. Entries with only tags and no comment text were at risk of being incorrectly discarded.
- **Duplicate Entries:** The raw data contained numerous duplicate rows. A simple duplicate removal could unfairly penalize users who made multiple, legitimate entries.
- **Potential for Spam/Bot Activity:** Some users had an unusually high number of entries (e.g., over 150), requiring investigation to ensure they were not automated spam.
---
## 💡 My Solution: A Multi-Stage Python Scripting Process
I developed a series of **Python scripts** to create a repeatable and transparent workflow.
### 1. **Advanced Data Cleaning** (`advanced_cleaner.py`)
- **Relabeled Columns:** Renamed cryptic column names to descriptive ones like `username`, `comment_text`, and `mentioned_user_1_username`.
- **Intelligent Duplicate Removal:** Removed only rows that were identical in every column, preserving all legitimate multiple entries.
- **Handled Empty Comments:** Added logic to ensure comments containing only tags were considered valid.
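The cleaning steps above might look roughly like the following sketch. The raw header names `x1i10hfl href` and `_ap3a` come from the scrape, but which descriptive name each one maps to is an assumption, `profile_url` is a hypothetical column name, and the sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical mapping: the raw names appear in the scrape, but the
# descriptive name each one corresponds to is assumed for illustration.
COLUMN_MAP = {
    "x1i10hfl href": "profile_url",
    "_ap3a": "username",
}

def clean(raw: pd.DataFrame) -> pd.DataFrame:
    """Rename columns, drop exact duplicate rows, keep tag-only comments."""
    df = raw.rename(columns=COLUMN_MAP)
    # Only rows identical in every column are dropped, so repeated but
    # distinct entries from the same user survive.
    df = df.drop_duplicates().reset_index(drop=True)
    # A tag-only comment has no text; store an empty string instead of a
    # missing value so later filters do not discard the row.
    if "comment_text" in df.columns:
        df["comment_text"] = df["comment_text"].fillna("")
    return df

# Made-up sample: rows 0 and 1 are exact duplicates, row 2 has no text.
raw = pd.DataFrame({
    "x1i10hfl href": ["/a/", "/a/", "/a/", "/b/"],
    "_ap3a": ["a", "a", "a", "b"],
    "comment_text": ["hi", "hi", None, "yo"],
})
cleaned = clean(raw)
```

Here only the second row is removed; the tag-only row stays in the data with an empty `comment_text`.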
### 2. **Fair Winner Selection** (`pick_winner.py`)
- **Identified Valid Entries:** Filtered the dataset to find all comments where at least **three unique users** were mentioned.
- **Weighted Chance System:** Calculated a *winning score* for each participant as the sum of `1 + likes` over all their valid entries, so every valid entry counts once and each like adds one more point.
- **Multi-Winner Selection:** Configured the script to select **10 unique winners**.
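One way to express the selection logic above is sketched below. It assumes the cleaned data has a numeric `likes` column and mention columns that follow the `mentioned_user_N_username` pattern; only `mentioned_user_1_username` is named in this README, so the other column names, the function names, and the sample rows are hypothetical. Drawing without replacement is also an assumption about how the winners were kept unique.

```python
import pandas as pd

def valid_entries(df, min_mentions=3):
    """Keep comments that mention at least `min_mentions` distinct users."""
    mention_cols = [c for c in df.columns
                    if c.startswith("mentioned_user_") and c.endswith("_username")]
    # nunique(axis=1) counts distinct non-missing mentions in each row.
    unique_mentions = df[mention_cols].nunique(axis=1)
    return df[unique_mentions >= min_mentions]

def winning_scores(valid):
    """Score per user: sum of (1 + likes) over that user's valid entries."""
    return (1 + valid["likes"]).groupby(valid["username"]).sum()

def pick_winners(scores, n=10, seed=None):
    """Weighted draw without replacement, so every winner is unique."""
    n = min(n, len(scores))
    return scores.sample(n=n, weights=scores, random_state=seed).index.tolist()

# Made-up sample: "ben" repeats a tag and "cy" tags only two people.
entries = pd.DataFrame({
    "username": ["ana", "ana", "ben", "cy", "dee"],
    "mentioned_user_1_username": ["x", "x", "x", "x", "x"],
    "mentioned_user_2_username": ["y", "y", "x", "y", "y"],
    "mentioned_user_3_username": ["z", "z", "y", None, "z"],
    "likes": [2, 0, 5, 1, 1],
})
scores = winning_scores(valid_entries(entries))
winners = pick_winners(scores, n=10, seed=42)
```

With this sample only `ana` and `dee` have valid entries, so both are drawn; passing a fixed `seed` makes a draw reproducible for auditing.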
### 3. **Post-Giveaway Analysis** (`analyze_giveaway.py`)
- **Generated Detailed Statistics:** Provided a breakdown of engagement metrics after winners were chosen.
- **Created Summary Tables:** Exported a clean summary of the 10 winners to a `.csv` file.
- **Spam/Bot Investigation:** Added a feature to flag users with exceptionally high entries and generate a report with comment samples for manual review.
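The spam check above could be approximated as follows. This is a minimal sketch, not the repository's code: the function name, the report layout, and the demo data are assumptions, and the default threshold of 150 only mirrors the "over 150" figure mentioned earlier.

```python
import pandas as pd

def flag_high_volume(df, threshold=150, samples=3):
    """Return {username: {"entries": n, "sample_comments": [...]}} for heavy posters."""
    counts = df["username"].value_counts()
    report = {}
    for user, n in counts[counts > threshold].items():
        comments = df.loc[df["username"] == user, "comment_text"]
        report[user] = {
            "entries": int(n),
            # A few comments are kept so a human can judge whether the
            # activity looks automated.
            "sample_comments": comments.head(samples).tolist(),
        }
    return report

# Made-up demo with a low threshold so the small sample triggers a flag.
demo = pd.DataFrame({
    "username": ["a", "a", "a", "a", "b"],
    "comment_text": ["@x @y @z", "@x @y @w", "@q @r @s", "@t @u @v", "@x @y @z"],
})
report = flag_high_volume(demo, threshold=3, samples=2)
```

The function only flags accounts for manual review; it does not decide on its own that an account is spam.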
---
## 💪 Key Challenges & How I Overcame Them
- **Challenge:** Incorrectly removing valid entries.
  **Solution:** Iterated on the duplicate-removal logic until it dropped only fully identical rows (scraping errors) and kept legitimate repeat entries.
- **Challenge:** Extracting numerical "likes" from text.
  **Solution:** Used Pandas `.str.extract()` with a regex to parse the like counts out of the `action_type` strings.
- **Challenge:** Verifying a user with 160+ entries.
  **Solution:** Built a high-volume analysis tool that generates a report with timestamps and comment samples for manual review.
- **Challenge:** `UnicodeEncodeError` on some systems.
  **Solution:** Removed emoji characters from print statements so the scripts run on consoles that cannot encode them.
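As an illustration of the like-count parsing, the sketch below applies `.str.extract()` to strings shaped like the `"1 like"` and `"15 likes"` examples. The exact regex and the choice to treat non-matching rows as zero likes are assumptions, and the `"Reply"` value is a made-up non-matching row.

```python
import pandas as pd

# Sample values in the style of the scraped `action_type` column.
action_type = pd.Series(["1 like", "15 likes", "Reply", None])

# Capture the digits in front of "like" or "likes"; rows without a match
# come back as missing values.
extracted = action_type.str.extract(r"(\d+)\s+likes?", expand=False)

# Convert to numbers and count unmatched rows as zero likes.
likes = pd.to_numeric(extracted, errors="coerce").fillna(0).astype(int)
```

Going through `pd.to_numeric` before filling the gaps keeps the conversion independent of the string dtype pandas infers for the column.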
---
## ✨ Results & Visualizations
The scripts successfully:
- Cleaned the dataset.
- Selected **10 winners** based on a fair and weighted system.
- Generated a comprehensive analysis report.
### **Winner Engagement Comparison**
A chart was generated to visualize the final *winning score* for each of the 10 winners, highlighting engagement differences.
### **Winner Summary Table (Usernames Censored)**
| Username | Total Valid Entries | Total Likes on Entries | Final Winning Score |
|-------------|----------------------|-------------------------|----------------------|
| Winner 1 | 160 | 43 | 203 |
| Winner 2 | 53 | 23 | 76 |
| Winner 3 | 9 | 9 | 18 |
| Winner 4 | 26 | 24 | 50 |
| Winner 5 | 1 | 1 | 2 |
| Winner 6 | 24 | 24 | 48 |
| Winner 7 | 1 | 0 | 1 |
| Winner 8 | 272 | 18 | 290 |
| Winner 9 | 14 | 0 | 14 |
| Winner 10 | 2 | 0 | 2 |
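The comparison chart could be regenerated from this table with a short Matplotlib script. The sketch below copies the entry and like counts from the table and derives each score as entries plus likes, which is what summing `1 + likes` over every valid entry amounts to. The output filename and the styling are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script also runs without a display
import matplotlib.pyplot as plt

# (total valid entries, total likes) per winner, copied from the table above.
winners = {
    "Winner 1": (160, 43),
    "Winner 2": (53, 23),
    "Winner 3": (9, 9),
    "Winner 4": (26, 24),
    "Winner 5": (1, 1),
    "Winner 6": (24, 24),
    "Winner 7": (1, 0),
    "Winner 8": (272, 18),
    "Winner 9": (14, 0),
    "Winner 10": (2, 0),
}

# Each valid entry contributes 1 + likes, so the score is entries + total likes.
scores = {name: entries + likes for name, (entries, likes) in winners.items()}

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(list(scores), list(scores.values()))
ax.set_xlabel("Winner")
ax.set_ylabel("Final winning score")
ax.set_title("Winner engagement comparison")
ax.tick_params(axis="x", rotation=45)
fig.tight_layout()
fig.savefig("winner_scores.png")  # hypothetical output filename
```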
---
## 💻 Technologies Used
- **Python**
- **Pandas:** For data manipulation, cleaning, and analysis.
- **Matplotlib & Seaborn:** For data visualization and generating graphs.