https://github.com/lisa-ho/breadit
Repository for scraping and analysing data from the Reddit/Sourdough community to explore lockdown baking trends.
data-analysis data-viz nltk python reddit-api sentiment-analysis web-scraping
- Host: GitHub
- URL: https://github.com/lisa-ho/breadit
- Owner: Lisa-Ho
- Created: 2020-12-11T21:14:43.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2021-02-21T17:35:30.000Z (over 5 years ago)
- Last Synced: 2025-04-05T08:30:32.655Z (about 1 year ago)
- Topics: data-analysis, data-viz, nltk, python, reddit-api, sentiment-analysis, web-scraping
- Language: Jupyter Notebook
- Homepage:
- Size: 2.48 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# About this project
This is a personal project scraping and analysing data from the [Reddit/Sourdough](https://www.reddit.com/r/Sourdough/) community in 2020.
As a sourdough baker myself, I wanted to explore lockdown baking trends in more detail, see when engagement peaked and what bakers were talking about.
Thanks to [pushshift.io](https://pushshift.io/api-parameters/) I was able to retrieve data from Reddit relatively easily.
The write-up of my analysis can be found on [my blog](https://inside-numbers.com/blog).
## Notebooks
This project is organised into two Jupyter notebooks:
1. Webscraping (data collection)
2. Data cleaning and analysis
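For readers who want a feel for the data collection step, here is a minimal sketch of paging through submissions for one year. It is not the code from the notebooks. The endpoint URL, the parameter names (`subreddit`, `after`, `before`, `size`, `sort`) and the `data` key in the response are assumptions based on the Pushshift API as it was publicly documented around 2020, and public access may no longer be available. The helper names (`to_epoch`, `build_params`, `iter_submissions`, `fetch_with_requests`) are hypothetical.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, Iterator, List

# Assumption: Pushshift submission search endpoint as documented around 2020.
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"


def to_epoch(year: int, month: int, day: int) -> int:
    """Return the Unix timestamp for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def build_params(subreddit: str, after: int, before: int, size: int = 100) -> Dict:
    """Build the query parameters for one page of submissions, oldest first."""
    return {
        "subreddit": subreddit,
        "after": after,    # only submissions created after this timestamp
        "before": before,  # only submissions created before this timestamp
        "size": size,      # maximum number of results per request
        "sort": "asc",
    }


def iter_submissions(
    fetch_page: Callable[[Dict], List[Dict]],
    subreddit: str,
    after: int,
    before: int,
) -> Iterator[Dict]:
    """Yield submissions page by page until a page comes back empty.

    fetch_page is any callable that takes the params dict and returns a
    list of submission dicts, each with a 'created_utc' field.
    """
    while True:
        page = fetch_page(build_params(subreddit, after, before))
        if not page:
            return
        yield from page
        # Move the window past the newest submission seen so far.
        # Submissions sharing that exact timestamp could be skipped.
        after = max(item["created_utc"] for item in page)


def fetch_with_requests(params: Dict) -> List[Dict]:
    """One possible fetch_page, assuming the third-party requests library."""
    import requests  # imported here so the rest of the sketch has no dependencies

    response = requests.get(PUSHSHIFT_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["data"]  # assumption: results sit under 'data'
```

With a working endpoint, `list(iter_submissions(fetch_with_requests, "Sourdough", to_epoch(2020, 1, 1), to_epoch(2021, 1, 1)))` would collect the 2020 submissions. Passing the fetch function in as an argument keeps the paging logic easy to test without network access.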
## Requirements
This project runs on Python 3 and depends on the Python libraries listed in ```requirements.txt```.
## Notes on methodology
### Users
Users are those who posted a submission in r/Sourdough in 2020. Some accounts have since been deleted; these are counted together as a single [deleted] user.
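The effect of this counting rule can be illustrated with a small sketch. It assumes each submission is a dict with an `author` field (the field name is an assumption, as are the helper names and the sample data) and that deleted accounts appear with the literal author `[deleted]`.

```python
from collections import Counter
from typing import Dict, List


def count_users(submissions: List[Dict]) -> int:
    """Count distinct authors; all '[deleted]' entries collapse into one user."""
    return len({s["author"] for s in submissions})


def submissions_per_user(submissions: List[Dict]) -> Counter:
    """Count how many submissions each author (including '[deleted]') made."""
    return Counter(s["author"] for s in submissions)


# Made-up sample data for illustration only.
sample = [
    {"author": "alice"},
    {"author": "bob"},
    {"author": "[deleted]"},
    {"author": "alice"},
    {"author": "[deleted]"},
]

print(count_users(sample))                        # 3
print(submissions_per_user(sample)["[deleted]"])  # 2
```

Because the two `[deleted]` submissions may come from different people, the user count is best read as a lower bound, and the `[deleted]` bucket should not be treated as one very active baker.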
### Score / upvoting data
Unfortunately, the submission scores (upvotes) retrieved through pushshift appeared to be incorrect, so they could not be used for analysis.
### Dates and times
When converting Unix timestamps to datetimes, I did not account for the different time zones of users at the time of their submission. Hence, the analysis of submissions by day of the week and hour of the day might be slightly distorted.
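To show how large this distortion can be, the sketch below converts one Unix timestamp under two fixed UTC offsets. The function name `day_and_hour` and the choice of UTC as the baseline are assumptions for illustration; the notebooks may convert timestamps differently.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def day_and_hour(created_utc: int, utc_offset_hours: int = 0) -> Tuple[str, int]:
    """Return (weekday name, hour of day) for a Unix timestamp at a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.fromtimestamp(created_utc, tz=tz)
    return WEEKDAYS[local.weekday()], local.hour


ts = 1585699200  # 2020-04-01 00:00:00 UTC

print(day_and_hour(ts))      # ('Wednesday', 0)
print(day_and_hour(ts, -5))  # ('Tuesday', 19)
```

The same submission counts as a Wednesday midnight post in UTC but as a Tuesday evening post for a baker five hours behind UTC, so both the weekday and the hourly distributions shift depending on where users are located.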