https://github.com/glowfi/reddit-scraper
A lightweight, beginner-friendly tool for collecting Reddit posts and comments with ease. Fast setup, clean code, and perfect for data analysis, research, or automation experiments.
- Host: GitHub
- URL: https://github.com/glowfi/reddit-scraper
- Owner: glowfi
- License: gpl-3.0
- Created: 2024-02-11T22:07:01.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-11-30T15:40:56.000Z (7 months ago)
- Last Synced: 2025-12-02T22:38:44.905Z (7 months ago)
- Topics: asyncio, asyncpraw, python3, reddit, reddit-api, reddit-scraper, requests, social-media
- Language: Python
- Homepage:
- Size: 2.68 MB
- Stars: 19
- Watchers: 1
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# Reddit Scraper
A modular Reddit scraping tool that collects data about **subreddits**, **posts**, and **users**, and exports everything as structured **JSON** files for easy processing or database import.
---
## 🚀 Features
- Scrapes:
  - ✔️ Subreddits
  - ✔️ Posts
  - ✔️ Users (partial support)
- Outputs clean, structured **JSON** data
- Includes tools to:
  - Split large JSON files into smaller chunks
  - Import JSON data into MongoDB
- Fully automated workflow via `run.py`
---
## 📦 Dependencies
- **Python 3.9+**
- Packages listed in `requirements.txt`
---
## 📂 Output Data Structure
> The sample JSON files are large (about 16-25 MB each). Download them instead of opening them in a browser, which can take a long time to load them.
### **Subreddit document example**

Sample JSON:
[https://files.catbox.moe/r7a7um.json](https://files.catbox.moe/r7a7um.json)
### **Post document example**

Sample JSON:
[https://files.catbox.moe/5cf2xw.json](https://files.catbox.moe/5cf2xw.json)
### **User document example**

Sample JSON:
[https://files.catbox.moe/yp506n.json](https://files.catbox.moe/yp506n.json)
---
## 🛠️ Script Overview
| Script | Description |
| --------------------------------- | --------------------------------------------------- |
| `subreddits.py` | Scrapes subreddit metadata |
| `posts.py` | Scrapes posts from each subreddit |
| `users.py` | Scrapes user information |
| `utils/split.py` | Splits large JSON files into import-friendly chunks |
| `utils/import_data_to_mongodb.sh` | Imports JSON chunks into MongoDB |
| `run.py` | Runs all scrapers sequentially |
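
The chunking step handled by `utils/split.py` can be pictured with a minimal sketch. It assumes the scraper output is a single top-level JSON array. The function name `split_json_array`, the default chunk size, and the `<stem>_partN.json` naming are hypothetical and not taken from the repository.

```python
import json
from pathlib import Path


def split_json_array(src, out_dir, chunk_size=1000):
    """Split a JSON file holding a top-level array into smaller files.

    Assumption: the input is one JSON array of documents. The output
    naming scheme is illustrative only.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    with src.open(encoding="utf-8") as fh:
        documents = json.load(fh)
    if not isinstance(documents, list):
        raise ValueError(f"{src} does not contain a top-level JSON array")

    written = []
    for start in range(0, len(documents), chunk_size):
        chunk = documents[start:start + chunk_size]
        target = out_dir / f"{src.stem}_part{start // chunk_size + 1}.json"
        with target.open("w", encoding="utf-8") as fh:
            json.dump(chunk, fh, ensure_ascii=False)
        written.append(target)
    return written
```

Calling `split_json_array("posts.json", "chunks", chunk_size=500)` would write `chunks/posts_part1.json`, `chunks/posts_part2.json`, and so on. Note that this sketch loads the whole file into memory, which is acceptable for files in the tens of megabytes.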
---
## ⚙️ Installation & Setup
### 1. Clone the repository & create a virtual environment
```sh
git clone https://github.com/glowfi/reddit-scraper
cd reddit-scraper
# venv ships with Python 3, so no separate virtualenv install is needed
python -m venv env
source env/bin/activate        # Linux / macOS
# or: env\Scripts\activate     # Windows PowerShell
pip install -r requirements.txt
```
---
### 2. Configure environment variables
Edit the file **`env-sample`**, then rename it to **`.env`**:
```env
username=
password=
client_id=
client_secret=
TOTAL_SUBREDDITS_PER_TOPICS=6
SUBREDDIT_SORT_FILTER="hot"
POSTS_PER_SUBREDDIT=10
POSTS_SORT_FILTER="new"
```
> Create your Reddit app at the page below to obtain the `client_id` and `client_secret` values:
> [https://www.reddit.com/prefs/apps](https://www.reddit.com/prefs/apps)
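
To illustrate how these settings might be read, here is a minimal standard-library sketch that parses the `KEY=VALUE` lines shown above. How the scrapers actually load `.env` is not stated in this README, and the helper names `load_env_file` and `scraper_limits` are hypothetical.

```python
def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines into a dict.

    Blank lines and lines starting with '#' are skipped, and surrounding
    quotes are removed from values. This is an illustrative parser, not
    the loader used by the repository.
    """
    settings = {}
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def scraper_limits(settings):
    """Convert the numeric settings to int so bad values fail early."""
    return {
        "subreddits_per_topic": int(settings["TOTAL_SUBREDDITS_PER_TOPICS"]),
        "posts_per_subreddit": int(settings["POSTS_PER_SUBREDDIT"]),
    }
```

With the sample values above, `scraper_limits(load_env_file())` would yield 6 subreddits per topic and 10 posts per subreddit. Empty credentials are returned as empty strings, so a caller can check for them before contacting Reddit.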
---
### 3. Run the scraper
```sh
./run.py
```
This will:
1. Scrape subreddits
2. Scrape posts
3. Scrape users
4. Save all data in this directory
5. (Optional) Split files for MongoDB import
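
The sequential flow in the steps above can be sketched as follows. This is not the contents of `run.py`. It assumes each scraper is a standalone script run in the listed order, and `run_sequentially` is a hypothetical helper.

```python
import subprocess
import sys

# Order taken from the steps above; how run.py invokes them is an assumption.
SCRAPERS = ["subreddits.py", "posts.py", "users.py"]


def run_sequentially(scripts, python=sys.executable):
    """Run each script in order and stop at the first failure."""
    for script in scripts:
        print(f"Running {script} ...")
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later scrapers never run against missing input.
        subprocess.run([python, script], check=True)
```

Calling `run_sequentially(SCRAPERS)` from the repository root would run the three scrapers one after another. Stopping on the first error matters if, as the order suggests, posts depend on the scraped subreddits and users depend on the scraped posts.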
---
## 🗄️ Importing Data Into MongoDB
After scraping, use the helper script:
```sh
./utils/import_data_to_mongodb.sh
```
Make sure your MongoDB service is running beforehand.
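
For readers who prefer to drive the import from Python, the sketch below builds one `mongoimport` call per chunk. It assumes each chunk is a JSON array and that MongoDB listens on the default local address. The database and collection names passed in are placeholders, and the actual commands in `import_data_to_mongodb.sh` may differ.

```python
import subprocess
from pathlib import Path


def build_mongoimport_cmd(chunk, db, collection):
    """Return the mongoimport argument list for one JSON-array chunk."""
    return [
        "mongoimport",
        "--db", db,
        "--collection", collection,
        "--file", str(chunk),
        "--jsonArray",  # each chunk is assumed to be one JSON array
    ]


def import_chunks(chunk_dir, db, collection, run=subprocess.run):
    """Import every *.json chunk in chunk_dir, in sorted filename order.

    `run` is injectable so the commands can be inspected without
    a MongoDB server.
    """
    for chunk in sorted(Path(chunk_dir).glob("*.json")):
        run(build_mongoimport_cmd(chunk, db, collection), check=True)
```

For example, `import_chunks("chunks", "reddit", "posts")` would import every chunk into a hypothetical `reddit.posts` collection. It requires the `mongoimport` tool on your `PATH`.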
---
## 💡 Notes
- API limits apply; use reasonable configuration values
- Scraping speed depends on your network & Reddit API rate limiting
- JSON outputs are ready for further processing (ML, analytics, etc.)
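
As a rough aid for picking "reasonable" values, the sketch below estimates how many posts a run would request. It assumes the limits multiply as their names suggest (subreddits per topic, then posts per subreddit). The number of topics is not given in this README, so `topic_count` is a hypothetical input.

```python
def estimate_post_count(topic_count, subreddits_per_topic, posts_per_subreddit):
    """Upper bound on posts requested, assuming the limits simply multiply."""
    return topic_count * subreddits_per_topic * posts_per_subreddit


# Example with the sample .env values and an assumed 3 topics:
print(estimate_post_count(3, 6, 10))  # 3 * 6 * 10 = 180
```

Doubling either limit doubles the estimate, so it is worth computing before a long run.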
---
## 🤝 Contributing
Pull requests, issue reports, and improvements are welcome!
---