https://github.com/datavorous/yars

Yet Another Reddit Scraper (without API keys) | Scrape search results, posts, and images from subreddits filtered by hot, new, etc., and bulk download any user's data.

api data-mining hacktoberfest hoarding json python reddit reddit-api reddit-crawler reddit-downloader reddit-scraper requests scraper webscraping

Last synced: 3 months ago

Host: GitHub
URL: https://github.com/datavorous/yars
Owner: datavorous
License: mit
Created: 2024-09-10T13:18:18.000Z (10 months ago)
Default Branch: main
Last Pushed: 2025-04-07T20:33:40.000Z (3 months ago)
Last Synced: 2025-04-07T21:38:25.044Z (3 months ago)
Topics: api, data-mining, hacktoberfest, hoarding, json, python, reddit, reddit-api, reddit-crawler, reddit-downloader, reddit-scraper, requests, scraper, webscraping
Language: Python
Homepage:
Size: 1.29 MB
Stars: 45
Watchers: 1
Forks: 10
Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE

# YARS (Yet Another Reddit Scraper)

[![GitHub stars](https://img.shields.io/github/stars/datavorous/yars.svg?style=social&label=Stars&style=plastic)](https://github.com/datavorous/yars/stargazers)




YARS is a Python package that simplifies scraping Reddit for posts, comments, user data, and media, and ships with utility functions for displaying results and downloading images. It is built with **Python** and relies on the **requests** module to fetch data from Reddit's public `.json` endpoints. Because it does not use the official Reddit API, no API keys are needed, which keeps it lightweight and easy to use.
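
To illustrate what a "simple `.json` request" looks like, here is a minimal sketch that builds the URL of a subreddit listing's JSON view. It is not taken from the YARS source: `listing_url` and `BASE_URL` are hypothetical names, and the sketch only assumes that Reddit serves a JSON view when `.json` is appended to a listing path.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.reddit.com"


def listing_url(subreddit, category="hot", limit=10, time_filter=None):
    """Build the URL of a subreddit listing's JSON view (hypothetical helper)."""
    params = {"limit": limit}
    if time_filter is not None:
        # Reddit takes the time window as the `t` query parameter.
        params["t"] = time_filter
    return f"{BASE_URL}/r/{subreddit}/{category}.json?{urlencode(params)}"


print(listing_url("generative", category="top", limit=11, time_filter="week"))
```

The resulting URL can be fetched with `requests.get` like any other page, which is why no API key is involved.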

## Features

- **Reddit Search**: Search Reddit for posts using a keyword query.

- **Post Scraping**: Scrape post details, including title, body, and comments.

- **User Data Scraping**: Fetch recent activity (posts and comments) of a Reddit user.

- **Subreddit Posts Fetching**: Retrieve posts from specific subreddits with flexible options for category and time filters.

- **Image Downloading**: Download images from posts.

- **Results Display**: Display JSON-formatted results with `Pygments` syntax highlighting.

> [!WARNING]
> Use rotating proxies, or Reddit might gift you with an IP ban.  
> The most posts I could extract at once from 'r/all' was 2552.  
> [Here](https://files.catbox.moe/zdra2i.json) is a **7.1 MB JSON** file containing the top 100 posts from 'r/nosleep', including post titles, body text, all comments and their replies, post scores, upload times, etc.
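
As a rough idea of what proxy rotation can look like, the sketch below cycles through a list of proxies and yields mappings in the shape that `requests` accepts for its `proxies` argument. The proxy addresses are placeholders and `proxy_rotation` is a hypothetical helper; whether and how YARS itself accepts a proxy configuration is not covered here.

```python
from itertools import cycle

# Placeholder addresses; replace them with proxies you actually control.
PROXY_URLS = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
]


def proxy_rotation(proxy_urls):
    """Yield `proxies` mappings for `requests`, round-robin over `proxy_urls`."""
    for url in cycle(proxy_urls):
        yield {"http": url, "https": url}


rotation = proxy_rotation(PROXY_URLS)
first = next(rotation)
second = next(rotation)
third = next(rotation)  # wraps around to the first proxy again

# Each mapping can be passed along as requests.get(url, proxies=mapping).
print(first, second, third, sep="\n")
```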

## Dependencies

- `requests`

- `Pygments`

## Installation

1. Clone the repository:

   ```
   git clone https://github.com/datavorous/YARS.git
   ```

   Navigate into the `src` folder.

2. Install `uv` (if not already installed):

   ```
   pip install uv
   ```

3. Run the application:

   ```
   uv run example/example.py
   ```

   This sets up the virtual environment, installs the necessary packages, and runs the `example.py` program.

## Usage

We will use the following Python script to demonstrate the functionality of the scraper. The script includes:

- Searching Reddit

- Scraping post details

- Fetching user data

- Retrieving subreddit posts

- Downloading images from posts

#### Code Overview

```python
from yars import YARS
from utils import display_results, download_image

miner = YARS()
```

#### Step 1: Searching Reddit

The `search_reddit` method allows you to search Reddit using a query string. Here, we search for posts containing "OpenAI" and limit the results to 3 posts. The `display_results` function is used to present the results in a formatted way.

```python
search_results = miner.search_reddit("OpenAI", limit=3)
display_results(search_results, "SEARCH")
```
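
For readers curious how JSON results can be colorized with `Pygments`, here is a minimal sketch. It is not the implementation of `display_results`; `render_json` is a hypothetical helper, and it falls back to plain indented JSON when Pygments is not installed.

```python
import json

try:
    from pygments import highlight
    from pygments.formatters import TerminalFormatter
    from pygments.lexers import JsonLexer
except ImportError:  # Pygments not installed: fall back to plain text
    highlight = None


def render_json(data):
    """Return `data` as indented JSON, colorized when Pygments is available."""
    text = json.dumps(data, indent=2, ensure_ascii=False)
    if highlight is None:
        return text
    return highlight(text, JsonLexer(), TerminalFormatter())


# Stand-in data; real search results would take its place.
sample = [{"title": "Example post", "score": 42}]
print(render_json(sample))
```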

#### Step 2: Scraping Post Details

Next, we scrape details of a specific Reddit post by passing its permalink. If the post details are successfully retrieved, they are displayed using `display_results`. Otherwise, an error message is printed.

```python
permalink = "https://www.reddit.com/r/getdisciplined/comments/1frb5ib/what_single_health_test_or_practice_has/".split('reddit.com')[1]
post_details = miner.scrape_post_details(permalink)

if post_details:
    display_results(post_details, "POST DATA")
else:
    print("Failed to scrape post details.")
```
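
Splitting on `'reddit.com'` works for the URL above. As an alternative sketch, the standard library's `urlparse` extracts the same path and also drops any query string. `permalink_from_url` is a hypothetical helper, not part of YARS.

```python
from urllib.parse import urlparse


def permalink_from_url(url):
    """Return the path portion of a Reddit post URL (hypothetical helper)."""
    return urlparse(url).path


url = "https://www.reddit.com/r/getdisciplined/comments/1frb5ib/what_single_health_test_or_practice_has/"
permalink = permalink_from_url(url)
print(permalink)
```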

#### Step 3: Fetching User Data

We can also retrieve a Reddit user’s recent activity (posts and comments) using the `scrape_user_data` method. Here, we fetch data for the user `iamsecb` and limit the results to 2 items.

```python
user_data = miner.scrape_user_data("iamsecb", limit=2)
display_results(user_data, "USER DATA")
```
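
If you want to keep scraped data rather than only display it, one option is to write it to a JSON file. The sketch below assumes the results are JSON-serializable (lists and dictionaries of plain values); `save_results` is a hypothetical helper and the sample data is a stand-in for a real result such as `user_data`.

```python
import json
from pathlib import Path


def save_results(results, path):
    """Write scraped results to `path` as UTF-8 JSON and return the path."""
    path = Path(path)
    path.write_text(
        json.dumps(results, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return path


# Stand-in data; in practice this would be e.g. the `user_data` value above.
sample = [{"title": "Example post", "score": 42}]
saved = save_results(sample, "user_data.json")

# Read the file back to confirm the round trip.
loaded = json.loads(saved.read_text(encoding="utf-8"))
print(loaded)
```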

#### Step 4: Fetching Subreddit Posts

The `fetch_subreddit_posts` method retrieves posts from a specified subreddit. In this example, we fetch 11 top posts from the "generative" subreddit from the past week.

```python
subreddit_posts = miner.fetch_subreddit_posts("generative", limit=11, category="top", time_filter="week")
display_results(subreddit_posts, "generative SUBREDDIT Top Posts")
```

#### Step 5: Downloading Images

For the posts retrieved from the subreddit, we try to download their associated images. The `download_image` function is used for this. If the post doesn't have an `image_url`, the thumbnail URL is used as a fallback.

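```python
# Download images for up to the first three posts.
for post in subreddit_posts[:3]:
    try:
        image_url = post["image_url"]
    except KeyError:
        # No full-size image on this post; fall back to its thumbnail.
        image_url = post["thumbnail_url"]
    download_image(image_url)
```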

### Complete Code Example

```python
from yars import YARS
from utils import display_results, download_image

miner = YARS()

# Search for posts related to "OpenAI"
search_results = miner.search_reddit("OpenAI", limit=3)
display_results(search_results, "SEARCH")

# Scrape post details using its permalink
permalink = "https://www.reddit.com/r/getdisciplined/comments/1frb5ib/what_single_health_test_or_practice_has/".split('reddit.com')[1]
post_details = miner.scrape_post_details(permalink)

if post_details:
    display_results(post_details, "POST DATA")
else:
    print("Failed to scrape post details.")

# Fetch recent activity of user "iamsecb"
user_data = miner.scrape_user_data("iamsecb", limit=2)
display_results(user_data, "USER DATA")

# Fetch top posts from the subreddit "generative" from the past week
subreddit_posts = miner.fetch_subreddit_posts("generative", limit=11, category="top", time_filter="week")
display_results(subreddit_posts, "generative SUBREDDIT Top Posts")

# Download images for up to the first three fetched posts
for post in subreddit_posts[:3]:
    try:
        image_url = post["image_url"]
    except KeyError:
        # No full-size image on this post; fall back to its thumbnail.
        image_url = post["thumbnail_url"]
    download_image(image_url)
```

You can now use these techniques to explore and scrape data from Reddit programmatically.

## Contributing

Contributions are welcome! For feature requests, bug reports, or questions, please open an issue. If you would like to contribute code, please open a pull request with your changes.

