https://github.com/glowfi/reddit-scraper

A lightweight, beginner-friendly tool for collecting Reddit posts and comments with ease. Fast setup, clean code, and perfect for data analysis, research, or automation experiments.

asyncio asyncpraw python3 reddit reddit-api reddit-scraper requests social-media


# Reddit Scraper


A modular Reddit scraping tool that collects data about **subreddits**, **posts**, and **users**, and exports everything as structured **JSON** files for easy processing or database import.

---

## 🚀 Features

- Scrapes:
  - ✔️ Subreddits
  - ✔️ Posts
  - ✔️ Users (partial support)

- Outputs clean, structured **JSON** data
- Includes tools to:
  - Split large JSON files into smaller chunks
  - Import JSON data into MongoDB

- Fully automated workflow via `run.py`

---

## 📦 Dependencies

- **Python 3.9+**
- Packages listed in `requirements.txt`

---

## 📂 Output Data Structure

> The sample JSON files are large (about 16-25 MB each). Download them instead of opening them in the browser, which takes a long time to load them.

### **Subreddit document example**

![Subreddit example](./images/subreddits.png)

Sample JSON:
[https://files.catbox.moe/r7a7um.json](https://files.catbox.moe/r7a7um.json)

### **Post document example**

![Post example](./images/posts.png)

Sample JSON:
[https://files.catbox.moe/5cf2xw.json](https://files.catbox.moe/5cf2xw.json)

### **User document example**

![User example](./images/users.png)

Sample JSON:
[https://files.catbox.moe/yp506n.json](https://files.catbox.moe/yp506n.json)

---

## 🛠️ Script Overview

| Script | Description |
| --------------------------------- | --------------------------------------------------- |
| `subreddits.py` | Scrapes subreddit metadata |
| `posts.py` | Scrapes posts from each subreddit |
| `users.py` | Scrapes user information |
| `utils/split.py` | Splits large JSON files into import-friendly chunks |
| `utils/import_data_to_mongodb.sh` | Imports JSON chunks into MongoDB |
| `run.py` | Runs all scrapers sequentially |
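
To illustrate what the splitting step does, here is a minimal sketch of chunking a large JSON file. It is not the code in `utils/split.py`: the function name `split_json_array`, the `_partNNN` file naming, and the assumption that each output file holds one top-level JSON array are all hypothetical.

```python
import json
from pathlib import Path


def split_json_array(source: Path, out_dir: Path, chunk_size: int = 1000) -> list[Path]:
    """Split a JSON file holding one top-level array into numbered chunk files.

    Hypothetical helper: the real utils/split.py may use a different
    layout, naming scheme, or chunking rule.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    records = json.loads(source.read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError(f"{source} does not contain a top-level JSON array")

    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Walk the array in steps of chunk_size and write each slice to its own file
    for index, start in enumerate(range(0, len(records), chunk_size), start=1):
        chunk_path = out_dir / f"{source.stem}_part{index:03d}.json"
        chunk = records[start:start + chunk_size]
        chunk_path.write_text(json.dumps(chunk), encoding="utf-8")
        written.append(chunk_path)
    return written
```

Calling `split_json_array(Path("posts.json"), Path("chunks"), chunk_size=500)` would write files such as `chunks/posts_part001.json`, assuming a `posts.json` file exists in the working directory.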

---

## ⚙️ Installation & Setup

### 1. Clone the repository & create a virtual environment

```sh
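# Optional: `python -m venv` below uses the standard-library venv module,
# so virtualenv is only needed if you prefer it
pip install virtualenv
git clone https://github.com/glowfi/reddit-scraper
cd reddit-scraper

# Create and activate an isolated environment
python -m venv env
source env/bin/activate # Linux / macOS
# or: env\Scripts\activate # Windows PowerShell

# Install the scraper's dependencies
pip install -r requirements.txt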
```

---

### 2. Configure environment variables

Edit the file **`env-sample`**, then rename it to **`.env`**:

```env
username=
password=
client_id=
client_secret=

TOTAL_SUBREDDITS_PER_TOPICS=6
SUBREDDIT_SORT_FILTER="hot"
POSTS_PER_SUBREDDIT=10
POSTS_SORT_FILTER="new"
```

> Create your Reddit app first at
> [https://www.reddit.com/prefs/apps](https://www.reddit.com/prefs/apps); it provides the `client_id` and `client_secret` values.
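
Before starting a long scrape, it can help to check that the `.env` file is complete. The sketch below parses `KEY=value` lines with the standard library only and reports which credentials are still empty. The names `parse_env_text` and `missing_credentials` are hypothetical, and the project's own scripts may load the file differently.

```python
from pathlib import Path

# Credential keys taken from the env-sample shown above
REQUIRED_KEYS = ("username", "password", "client_id", "client_secret")


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, e.g. "hot" -> hot
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def missing_credentials(env: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or left empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


def check_env_file(path: Path = Path(".env")) -> list[str]:
    """Read the file at `path` and list the credentials that still need a value."""
    return missing_credentials(parse_env_text(path.read_text(encoding="utf-8")))
```

An empty list from `check_env_file()` means all four credentials have a value. It does not confirm that Reddit accepts them.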

---

### 3. Run the scraper

```sh
./run.py
# or, if the script is not marked executable: python run.py
```

This will:

1. Scrape subreddits
2. Scrape posts
3. Scrape users
4. Save all data in this directory
5. (Optional) Split files for MongoDB import
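
The scraping steps above can be sketched as a small sequential runner. This is an illustration, not the contents of `run.py`: the script names come from the Script Overview table, while `run_pipeline`, the injectable `runner` parameter, and the stop-on-first-failure behaviour are assumptions.

```python
import subprocess
import sys

# Order follows the numbered steps above; posts are scraped per subreddit,
# so this sketch assumes each step depends on the previous one succeeding.
SCRAPER_SCRIPTS = ("subreddits.py", "posts.py", "users.py")


def run_pipeline(scripts=SCRAPER_SCRIPTS, runner=subprocess.run):
    """Run each scraper script in order, stopping at the first failure.

    `runner` is injectable so the sequence can be exercised without
    actually launching the scrapers.
    """
    completed = []
    for script in scripts:
        # check=True makes subprocess.run raise CalledProcessError
        # when the script exits with a non-zero status
        runner([sys.executable, script], check=True)
        completed.append(script)
    return completed
```

Using `sys.executable` keeps the child processes on the same interpreter, and therefore the same virtual environment, as the caller.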

---

## 🗄️ Importing Data Into MongoDB

After scraping, use the helper script:

```sh
./utils/import_data_to_mongodb.sh
```

Make sure your MongoDB service is running beforehand.
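
For readers who want to see roughly what such an import involves, the sketch below builds one `mongoimport` command per chunk file. It is not the contents of `import_data_to_mongodb.sh`. The function name `build_mongoimport_commands` and the database and collection names are hypothetical, and `--jsonArray` assumes each chunk holds a single JSON array.

```python
from pathlib import Path


def build_mongoimport_commands(chunk_dir: Path, database: str, collection: str) -> list[list[str]]:
    """Build one mongoimport argument list per JSON chunk in chunk_dir.

    The database and collection names are supplied by the caller; the
    project's helper script may use different names and options.
    """
    commands = []
    # Sort so chunks are imported in a predictable order
    for chunk in sorted(chunk_dir.glob("*.json")):
        commands.append([
            "mongoimport",
            "--db", database,
            "--collection", collection,
            "--file", str(chunk),
            "--jsonArray",  # treat the file as one JSON array of documents
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(command, check=True)` once `mongoimport` is installed and the MongoDB service is reachable.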

---

## 💡 Notes

- API limits apply; use reasonable configuration values
- Scraping speed depends on your network & Reddit API rate limiting
- JSON outputs are ready for further processing (ML, analytics, etc.)

---

## 🤝 Contributing

Pull requests, issue reports, and improvements are welcome!

---