{"id":15301824,"url":"https://github.com/datavorous/yars","last_synced_at":"2025-04-14T19:36:15.481Z","repository":{"id":256455113,"uuid":"855189815","full_name":"datavorous/yars","owner":"datavorous","description":"Yet Another Reddit Scrapper (without API keys) | Scrap search results, posts and images from subreddits filtered by hot, new etc and bulk download any user's data.","archived":false,"fork":false,"pushed_at":"2025-04-07T20:33:40.000Z","size":1354,"stargazers_count":45,"open_issues_count":6,"forks_count":10,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-07T21:38:25.044Z","etag":null,"topics":["api","data-mining","hacktoberfest","hoarding","json","python","reddit","reddit-api","reddit-crawler","reddit-downloader","reddit-scraper","requests","scraper","webscraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/datavorous.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-10T13:18:18.000Z","updated_at":"2025-03-29T05:05:12.000Z","dependencies_parsed_at":"2024-09-10T20:55:16.983Z","dependency_job_id":"68482a4b-cb3c-45d4-88d8-1f55a45eefa5","html_url":"https://github.com/datavorous/yars","commit_stats":{"total_commits":52,"total_committers":8,"mean_commits":6.5,"dds":"0.28846153846153844","last_synced_commit":"8088303b7fa08733261bfde88e85b8383d8f7978"},"previous_names":["datavorous/redditsuite","datavorous/yars","datavorous/redditminer"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/repositories/datavorous%2Fyars","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datavorous%2Fyars/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datavorous%2Fyars/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datavorous%2Fyars/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/datavorous","download_url":"https://codeload.github.com/datavorous/yars/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248946835,"owners_count":21187582,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["api","data-mining","hacktoberfest","hoarding","json","python","reddit","reddit-api","reddit-crawler","reddit-downloader","reddit-scraper","requests","scraper","webscraping"],"created_at":"2024-10-01T03:02:49.768Z","updated_at":"2025-04-14T19:36:15.456Z","avatar_url":"https://github.com/datavorous.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\r\n  \r\n\u003cimg src=\"logo.svg\" width=\"10%\"\u003e\r\n\r\n# YARS (Yet Another Reddit Scraper)\r\n\r\n[![GitHub stars](https://img.shields.io/github/stars/datavorous/yars.svg?style=social\u0026label=Stars\u0026style=plastic)](https://github.com/datavorous/yars/stargazers)\u003cbr\u003e\r\n\r\n\u003c/div\u003e\r\n\r\nYARS is a Python package designed to simplify the process of scraping Reddit for posts, comments, user data, and other media. 
The package also includes utility functions. It is built using **Python** and relies on the **requests** module for fetching data from Reddit’s public API. The scraper uses simple `.json` requests, which avoids the need for official Reddit API keys and keeps it lightweight and easy to use.\r\n\r\n## Features\r\n\r\n- **Reddit Search**: Search Reddit for posts using a keyword query.\r\n- **Post Scraping**: Scrape post details, including title, body, and comments.\r\n- **User Data Scraping**: Fetch recent activity (posts and comments) of a Reddit user.\r\n- **Subreddit Posts Fetching**: Retrieve posts from specific subreddits with flexible options for category and time filters.\r\n- **Image Downloading**: Download images from posts.\r\n- **Results Display**: Use `Pygments` for colorful display of JSON-formatted results.\r\n\r\n\u003e [!WARNING]\r\n\u003e Use with rotating proxies, or Reddit might gift you with an IP ban.  \r\n\u003e I could extract a maximum of 2552 posts at once from 'r/all' using this.  \r\n\u003e [Here](https://files.catbox.moe/zdra2i.json) is a **7.1 MB JSON** file containing the top 100 posts from 'r/nosleep', including post titles, body text, all comments and their replies, post scores, upload times, etc.\r\n\r\n## Dependencies\r\n\r\n- `requests`\r\n- `Pygments`\r\n\r\n## Installation\r\n\r\n1. Clone the repository:\r\n\r\n   ```\r\n   git clone https://github.com/datavorous/YARS.git\r\n   ```\r\n   Then navigate into the ```src``` folder.\r\n\r\n2. Install ```uv``` (if not already installed):\r\n\r\n   ```\r\n   pip install uv\r\n   ```\r\n\r\n3. Run the application:\r\n   ```\r\n   uv run example/example.py\r\n   ```\r\n   This sets up the virtual environment, installs the necessary packages, and runs the ```example.py``` program.\r\n\r\n## Usage\r\n\r\nWe will use the following Python script to demonstrate the functionality of the scraper. 
The script includes:\r\n\r\n- Searching Reddit\r\n- Scraping post details\r\n- Fetching user data\r\n- Retrieving subreddit posts\r\n- Downloading images from posts\r\n\r\n#### Code Overview\r\n\r\n```python\r\nfrom yars import YARS\r\nfrom utils import display_results, download_image\r\n\r\nminer = YARS()\r\n```\r\n\r\n#### Step 1: Searching Reddit\r\n\r\nThe `search_reddit` method allows you to search Reddit using a query string. Here, we search for posts containing \"OpenAI\" and limit the results to 3 posts. The `display_results` function is used to present the results in a formatted way.\r\n\r\n```python\r\nsearch_results = miner.search_reddit(\"OpenAI\", limit=3)\r\ndisplay_results(search_results, \"SEARCH\")\r\n```\r\n\r\n#### Step 2: Scraping Post Details\r\n\r\nNext, we scrape details of a specific Reddit post by passing its permalink. If the post details are successfully retrieved, they are displayed using `display_results`. Otherwise, an error message is printed.\r\n\r\n```python\r\npermalink = \"https://www.reddit.com/r/getdisciplined/comments/1frb5ib/what_single_health_test_or_practice_has/\".split('reddit.com')[1]\r\npost_details = miner.scrape_post_details(permalink)\r\nif post_details:\r\n    display_results(post_details, \"POST DATA\")\r\nelse:\r\n    print(\"Failed to scrape post details.\")\r\n```\r\n\r\n#### Step 3: Fetching User Data\r\n\r\nWe can also retrieve a Reddit user’s recent activity (posts and comments) using the `scrape_user_data` method. Here, we fetch data for the user `iamsecb` and limit the results to 2 items.\r\n\r\n```python\r\nuser_data = miner.scrape_user_data(\"iamsecb\", limit=2)\r\ndisplay_results(user_data, \"USER DATA\")\r\n```\r\n\r\n#### Step 4: Fetching Subreddit Posts\r\n\r\nThe `fetch_subreddit_posts` method retrieves posts from a specified subreddit. 
In this example, we fetch 11 top posts from the \"generative\" subreddit from the past week.\r\n\r\n```python\r\nsubreddit_posts = miner.fetch_subreddit_posts(\"generative\", limit=11, category=\"top\", time_filter=\"week\")\r\ndisplay_results(subreddit_posts, \"generative SUBREDDIT Top Posts\")\r\n```\r\n\r\n#### Step 5: Downloading Images\r\n\r\nFor the first three posts retrieved from the subreddit, we try to download their associated images. The `download_image` function is used for this. If a post doesn't have an `image_url`, the thumbnail URL is used as a fallback.\r\n\r\n```python\r\n# Only the first three posts are processed\r\nfor post in subreddit_posts[:3]:\r\n    try:\r\n        image_url = post[\"image_url\"]\r\n    except KeyError:\r\n        # No full-size image: fall back to the thumbnail\r\n        image_url = post[\"thumbnail_url\"]\r\n    download_image(image_url)\r\n```\r\n\r\n### Complete Code Example\r\n\r\n```python\r\nfrom yars import YARS\r\nfrom utils import display_results, download_image\r\n\r\nminer = YARS()\r\n\r\n# Search for posts related to \"OpenAI\"\r\nsearch_results = miner.search_reddit(\"OpenAI\", limit=3)\r\ndisplay_results(search_results, \"SEARCH\")\r\n\r\n# Scrape post details using its permalink\r\npermalink = \"https://www.reddit.com/r/getdisciplined/comments/1frb5ib/what_single_health_test_or_practice_has/\".split('reddit.com')[1]\r\npost_details = miner.scrape_post_details(permalink)\r\nif post_details:\r\n    display_results(post_details, \"POST DATA\")\r\nelse:\r\n    print(\"Failed to scrape post details.\")\r\n\r\n# Fetch recent activity of user \"iamsecb\"\r\nuser_data = miner.scrape_user_data(\"iamsecb\", limit=2)\r\ndisplay_results(user_data, \"USER DATA\")\r\n\r\n# Fetch top posts from the subreddit \"generative\" from the past week\r\nsubreddit_posts = miner.fetch_subreddit_posts(\"generative\", limit=11, category=\"top\", time_filter=\"week\")\r\ndisplay_results(subreddit_posts, \"generative SUBREDDIT Top Posts\")\r\n\r\n# Download images from the first three fetched posts\r\nfor post in subreddit_posts[:3]:\r\n    try:\r\n        image_url = 
post[\"image_url\"]\r\n    except KeyError:\r\n        # No full-size image: fall back to the thumbnail\r\n        image_url = post[\"thumbnail_url\"]\r\n    download_image(image_url)\r\n```\r\n\r\nYou can now use these techniques to explore and scrape data from Reddit programmatically.\r\n\r\n## Contributing\r\n\r\nContributions are welcome! For feature requests, bug reports, or questions, please open an issue. If you would like to contribute code, please open a pull request with your changes.\r\n\r\n### Our Notable Contributors\r\n\r\n\u003ca href=\"https://github.com/datavorous/yars/graphs/contributors\"\u003e\r\n  \u003cimg src=\"https://contrib.rocks/image?repo=datavorous/yars\" /\u003e\r\n\u003c/a\u003e\r\n\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatavorous%2Fyars","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatavorous%2Fyars","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatavorous%2Fyars/lists"}