https://github.com/codeslash21/wrangle-twitter-archive

Wrangle the WeRateDogs Twitter archive. WeRateDogs has 8M followers and rates dogs with funny comments and a unique rating system. A dog-breed classifier is also used to predict the breed of the dogs in the tweets.
data-analysis data-wrangling neural-networkt twitter-api twitter-archive


Host: GitHub
URL: https://github.com/codeslash21/wrangle-twitter-archive
Owner: codeslash21
License: mit
Created: 2024-03-06T14:44:54.000Z (10 months ago)
Default Branch: master
Last Pushed: 2024-03-06T14:49:57.000Z (10 months ago)
Last Synced: 2024-11-05T21:46:48.402Z (about 2 months ago)
Topics: data-analysis, data-wrangling, neural-networkt, twitter-api, twitter-archive
Language: Jupyter Notebook
Homepage:
Size: 3.2 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# Wrangle Twitter Archive

## Table of contents:

- Introduction
- Dataset
- What software do I need?
- Project Steps

## Introduction:

WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. The ratings almost always have a denominator of 10. But the numerators? Most of them are greater than 10. Why? WeRateDogs believes every dog is beautiful, so almost all dogs deserve a 10 and sometimes more than that. WeRateDogs has over 8 million followers and has received international media coverage. Our goal is to wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations.

## Dataset:

The dataset consists of three parts.
- **Enhanced twitter archive:**
The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain, though, is each tweet's text, which I used to extract the rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." Of the 5000+ tweets, I have filtered for tweets with ratings only (there are 2356). This data is stored in the `twitter_archive_enhanced.csv` file.

- **Additional Data via the Twitter API:**
Back to the basic-ness of Twitter archives: retweet count and favorite count are two of the notable column omissions. Fortunately, this additional data can be gathered by anyone from Twitter's API. Well, "anyone" who has access to data for the 3000 most recent tweets, at least. Because we have the WeRateDogs Twitter archive, and specifically the tweet IDs within it, we can gather this data for all 5000+ tweets. We query Twitter's API to gather this data and store the results in the `tweet_json.txt` file.

- **Image Predictions File:**
One more cool thing: I ran every image in the WeRateDogs Twitter archive through a neural network that can classify breeds of dogs. The result is a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images). We store this prediction data in the `image_predictions.tsv` file.
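
As a rough illustration of how the rating extraction and the API data described above could be handled, here is a minimal sketch using only the Python standard library. It is not the notebook's actual code. The regular expression, the function names, the sample tweet, and the assumption that `tweet_json.txt` holds one JSON object per line with `id`, `retweet_count`, and `favorite_count` keys are all illustrative assumptions.

```python
import json
import re

# Assumed pattern: a numerator (optionally with decimals), a slash, a denominator.
RATING_RE = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def extract_rating(text):
    """Return (numerator, denominator) for the first rating in the text, or None."""
    match = RATING_RE.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


def parse_tweet_json_lines(lines):
    """Collect tweet ID, retweet count, and favorite count from JSON lines.

    Assumes one JSON object per line; blank lines are skipped.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return rows


# Made-up sample data, for illustration only.
sample_text = "This is Bella. She is a very good pupper. 13/10"
sample_lines = ['{"id": 1, "retweet_count": 5, "favorite_count": 20}']

print(extract_rating(sample_text))
print(parse_tweet_json_lines(sample_lines))
```

In the real file, the same parser could be fed the open file object, for example `parse_tweet_json_lines(open("tweet_json.txt", encoding="utf-8"))`, provided the file follows the assumed one-object-per-line layout.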

## What Software Do I Need?
This project can be done in a Jupyter Notebook using Python 3.x. The following Python packages are used to wrangle the dataset and query the Twitter API. All of them need to be installed except `json`, which is part of the Python standard library.

> - pandas
> - NumPy
> - requests
> - tweepy
> - json
> - sqlalchemy

## Project Steps:
The data wrangling process consists of three steps:

- **Gather Data:** Gather dataset for wrangling.
- **Assess Data:** Note the issues regarding quality and tidiness of the dataset.
- **Clean Data:** Fix the issues documented during the assessment step to make the dataset ready for analysis.
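
The assess and clean steps above can be sketched roughly as follows. This is a minimal illustration on plain Python dictionaries rather than the pandas code a notebook would likely use. The column names (`tweet_id`, `rating_denominator`, `name`), the specific checks, and the sample rows are assumptions made for the example.

```python
import copy


def assess(rows):
    """Return a list of human-readable quality issues found in the rows."""
    issues = []
    seen = set()
    for i, row in enumerate(rows):
        if row["tweet_id"] in seen:
            issues.append(f"row {i}: duplicate tweet_id {row['tweet_id']}")
        seen.add(row["tweet_id"])
        if row["rating_denominator"] != 10:
            issues.append(f"row {i}: denominator is {row['rating_denominator']}")
        if not row["name"]:
            issues.append(f"row {i}: missing dog name")
    return issues


def clean(rows):
    """Return a cleaned copy: drop duplicate IDs, mark empty names as None."""
    cleaned = []
    seen = set()
    # Work on a deep copy so the gathered data stays untouched.
    for row in copy.deepcopy(rows):
        if row["tweet_id"] in seen:
            continue
        seen.add(row["tweet_id"])
        row["name"] = row["name"] or None
        cleaned.append(row)
    return cleaned


# Made-up rows, for illustration only.
rows = [
    {"tweet_id": 1, "rating_denominator": 10, "name": "Bella"},
    {"tweet_id": 2, "rating_denominator": 50, "name": ""},
    {"tweet_id": 1, "rating_denominator": 10, "name": "Bella"},
]

issues = assess(rows)
cleaned = clean(rows)
for issue in issues:
    print(issue)
print(cleaned)
```

Cleaning a copy keeps the original data available, so the assessment can be rerun on the cleaned rows to confirm that each documented issue is resolved.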

It is recommended that the clean data be stored after wrangling for future analysis. Here we store the clean data in a flat file, `twitter_archive_master.csv`, and a SQLite database, `twitter.db`.
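
A minimal sketch of this storage step is shown below. It uses the standard-library `csv` and `sqlite3` modules instead of pandas and sqlalchemy, so it is a stand-in rather than the notebook's actual code. The table name `tweets`, the three columns, and the sample rows are hypothetical, and the files are written to a temporary directory so that running the example does not overwrite real data.

```python
import csv
import os
import sqlite3
import tempfile

FIELDNAMES = ["tweet_id", "rating_numerator", "rating_denominator"]  # assumed columns


def save_clean_data(rows, csv_path, db_path):
    """Write the rows to a CSV file and to a 'tweets' table in a SQLite database."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)

    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS tweets ("
            "tweet_id INTEGER PRIMARY KEY, "
            "rating_numerator REAL, "
            "rating_denominator INTEGER)"
        )
        # Named placeholders are filled from each row dictionary.
        conn.executemany(
            "INSERT OR REPLACE INTO tweets "
            "VALUES (:tweet_id, :rating_numerator, :rating_denominator)",
            rows,
        )
        conn.commit()
    finally:
        conn.close()


# Made-up clean rows, for illustration only.
clean_rows = [
    {"tweet_id": 1, "rating_numerator": 13.0, "rating_denominator": 10},
    {"tweet_id": 2, "rating_numerator": 12.0, "rating_denominator": 10},
]

out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "twitter_archive_master.csv")
db_path = os.path.join(out_dir, "twitter.db")
save_clean_data(clean_rows, csv_path, db_path)
print("saved to", out_dir)
```

With pandas and sqlalchemy installed, `DataFrame.to_csv` and `DataFrame.to_sql` would cover the same two outputs in fewer lines.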