Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/datavorous/yars

Yet Another Reddit Scrapper (without API keys) | Scrap search results, posts and images from subreddits filtered by hot, new etc and bulk download any user's data.
https://github.com/datavorous/yars

api data-mining hacktoberfest hoarding json python reddit reddit-api reddit-crawler reddit-downloader reddit-scraper requests scraper webscraping

Last synced: about 1 month ago
JSON representation

Yet Another Reddit Scrapper (without API keys) | Scrap search results, posts and images from subreddits filtered by hot, new etc and bulk download any user's data.

Awesome Lists containing this project

README

        



# YARS (Yet Another Reddit Scraper)

[![GitHub stars](https://img.shields.io/github/stars/datavorous/yars.svg?style=social&label=Stars&style=plastic)](https://github.com/datavorous/yars/stargazers)

YARS is a Python package designed to simplify the process of scraping Reddit for posts, comments, user data, and other media. The package also includes utility functions. It is built using **Python** and relies on the **requests** module for fetching data from Reddit’s public API. The scraper uses simple `.json` requests, avoiding the need for official Reddit API keys, making it lightweight and easy to use.

## Features

- **Reddit Search**: Search Reddit for posts using a keyword query.
- **Post Scraping**: Scrape post details, including title, body, and comments.
- **User Data Scraping**: Fetch recent activity (posts and comments) of a Reddit user.
- **Subreddit Posts Fetching**: Retrieve posts from specific subreddits with flexible options for category and time filters.
- **Image Downloading**: Download images from posts.
- **Results Display**: Utilize `Pygments` for colorful display of JSON-formatted results.

> [!WARNING]
> Use with rotating proxies, or Reddit might gift you with an IP ban.
> I could extract max 2552 posts at once from 'r/all' using this.
> [Here](https://files.catbox.moe/zdra2i.json) is a **7.1 MB JSON** file containing the top 100 posts from 'r/nosleep', which included post titles, body text, all comments and their replies, post scores, time of upload etc.

## Dependencies

- `requests`
- `Pygments`

## Installation

1. Clone the repository:

```
git clone https://github.com/datavorous/YARS.git
```
Navigate inside the ```src``` folder.

2. Install ```uv``` (if not already installed):

```
pip install uv
```

3. Run the application:
```
uv run example.py
```
It'll setup the virtual env, install the necessary packages and run the ```example.py``` program.

## Usage

We will use the following Python script to demonstrate the functionality of the scraper. The script includes:

- Searching Reddit
- Scraping post details
- Fetching user data
- Retrieving subreddit posts
- Downloading images from posts

#### Code Overview

```python
from yars import YARS
from utils import display_results, download_image

miner = YARS()
```

#### Step 1: Searching Reddit

The `search_reddit` method allows you to search Reddit using a query string. Here, we search for posts containing "OpenAI" and limit the results to 3 posts. The `display_results` function is used to present the results in a formatted way.

```python
search_results = miner.search_reddit("OpenAI", limit=3)
display_results(search_results, "SEARCH")
```

#### Step 2: Scraping Post Details

Next, we scrape details of a specific Reddit post by passing its permalink. If the post details are successfully retrieved, they are displayed using `display_results`. Otherwise, an error message is printed.

```python
permalink = "https://www.reddit.com/r/getdisciplined/comments/1frb5ib/what_single_health_test_or_practice_has/".split('reddit.com')[1]
post_details = miner.scrape_post_details(permalink)
if post_details:
display_results(post_details, "POST DATA")
else:
print("Failed to scrape post details.")
```

#### Step 3: Fetching User Data

We can also retrieve a Reddit user’s recent activity (posts and comments) using the `scrape_user_data` method. Here, we fetch data for the user `iamsecb` and limit the results to 2 items.

```python
user_data = miner.scrape_user_data("iamsecb", limit=2)
display_results(user_data, "USER DATA")
```

#### Step 4: Fetching Subreddit Posts

The `fetch_subreddit_posts` method retrieves posts from a specified subreddit. In this example, we fetch 11 top posts from the "generative" subreddit from the past week.

```python
subreddit_posts = miner.fetch_subreddit_posts("generative", limit=11, category="top", time_filter="week")
display_results(subreddit_posts, "EarthPorn SUBREDDIT New Posts")
```

#### Step 5: Downloading Images

For the posts retrieved from the subreddit, we try to download their associated images. The `download_image` function is used for this. If the post doesn't have an `image_url`, the thumbnail URL is used as a fallback.

```python
for z in range(3):
try:
image_url = subreddit_posts[z]["image_url"]
except:
image_url = subreddit_posts[z]["thumbnail_url"]
download_image(image_url)
```

### Complete Code Example

```python
from yars import YARS
from utils import display_results, download_image

miner = YARS()

# Search for posts related to "OpenAI"
search_results = miner.search_reddit("OpenAI", limit=3)
display_results(search_results, "SEARCH")

# Scrape post details using its permalink
permalink = "https://www.reddit.com/r/getdisciplined/comments/1frb5ib/what_single_health_test_or_practice_has/".split('reddit.com')[1]
post_details = miner.scrape_post_details(permalink)
if post_details:
display_results(post_details, "POST DATA")
else:
print("Failed to scrape post details.")

# Fetch recent activity of user "iamsecb"
user_data = miner.scrape_user_data("iamsecb", limit=2)
display_results(user_data, "USER DATA")

# Fetch top posts from the subreddit "generative" from the past week
subreddit_posts = miner.fetch_subreddit_posts("generative", limit=11, category="top", time_filter="week")
display_results(subreddit_posts, "EarthPorn SUBREDDIT New Posts")

# Download images from the fetched posts
for z in range(3):
try:
image_url = subreddit_posts[z]["image_url"]
except:
image_url = subreddit_posts[z]["thumbnail_url"]
download_image(image_url)
```

You can now use these techniques to explore and scrape data from Reddit programmatically.

## Contributing

Contributions are welcome! For feature requests, bug reports, or questions, please open an issue. If you would like to contribute code, please open a pull request with your changes.

### Our Notable Contributors