Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ecklf/reddit-clawler
A command-line tool written in Rust that crawls Reddit posts from a user or subreddit
- Host: GitHub
- URL: https://github.com/ecklf/reddit-clawler
- Owner: ecklf
- License: gpl-3.0
- Created: 2023-12-07T23:30:57.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-09-09T16:59:15.000Z (5 months ago)
- Last Synced: 2024-10-04T18:41:43.716Z (4 months ago)
- Topics: cli, crawler, downloader, downloader-for-reddit, reddit
- Language: Rust
- Homepage:
- Size: 138 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Reddit Clawler 🐾
A command-line tool written in Rust that crawls Reddit posts from a user, subreddit, or search term.
## Usage
Install the following dependencies:
- [yt-dlp](https://github.com/yt-dlp/yt-dlp)
## Commands
You can see all available commands by running:
```sh
./reddit_clawler --help
```
By default, the tool will download posts to the `output/{subcommand}/{value}` folder.
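As an illustration of the default layout, the following Rust sketch builds the `{base}/{subcommand}/{value}` path. The `output_dir` helper is hypothetical and not taken from the tool's source; it only mirrors the folder scheme described above.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper (not the tool's actual code): joins the base output
/// folder, the subcommand name, and the crawled value into one path,
/// following the `output/{subcommand}/{value}` scheme from the README.
fn output_dir(base: &Path, subcommand: &str, value: &str) -> PathBuf {
    base.join(subcommand).join(value)
}
```

Under these assumptions, `output_dir(Path::new("./downloads"), "user", "spez")` yields `./downloads/user/spez`, the folder that the `user spez ... -o ./downloads` invocation writes to.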
### User
Crawls posts from `/u/spez` in the `new` category into `./downloads/user/spez`, spawning `50` tasks:
```sh
./reddit_clawler user spez --category new --tasks 50 -o ./downloads
```
### Subreddit
Crawls posts from `/r/redpandas` from the `top` category, filtered by `hour`:
```sh
./reddit_clawler subreddit redpandas --category top --timeframe hour
```
### Search
Crawls posts for the search term `olympics` from the `top` category, filtered by `hour`:
```sh
./reddit_clawler search olympics --category top --timeframe hour
```
## Features
### Providers (these are the most common I found):
- [x] Reddit Media
- [x] Imgur Media
- [x] YouTube Videos
- [x] Redgifs Videos
### Caching
After the downloads have finished, a `cache.json` file will be created in the folder of the downloaded resource.
This file keeps track of the posts you have already downloaded, so that subsequent runs skip them.
### Rate limiting
Querying posts is paginated (100 items per request) and can lead to rate limiting.
To avoid this, you can provide a `--limit` flag to limit the number of requests for fetching a resource.
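To make the effect of the limit concrete, here is a rough Rust sketch of the arithmetic. It assumes that the value passed to `--limit` is a count of requests and that every request returns a full page of 100 items; `max_posts` and `requests_needed` are hypothetical helpers, not part of the tool.

```rust
/// Page size stated in the README: 100 items per request.
const ITEMS_PER_REQUEST: usize = 100;

/// Upper bound on the number of posts fetched for a given request limit
/// (assumption: the limit counts requests, not posts).
fn max_posts(limit: usize) -> usize {
    limit * ITEMS_PER_REQUEST
}

/// Number of requests needed to page through `total_posts` posts,
/// capped by the limit when one is given.
fn requests_needed(total_posts: usize, limit: Option<usize>) -> usize {
    // Round up: a partially filled last page still costs one request.
    let pages = (total_posts + ITEMS_PER_REQUEST - 1) / ITEMS_PER_REQUEST;
    match limit {
        Some(l) => pages.min(l),
        None => pages,
    }
}
```

Under these assumptions, a limit of 2 caps a crawl at 200 posts, however many posts the resource holds.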
Limiting requests can be useful for subsequent crawls of the same resource.
### File format
By default, the tool prefers `mp4` over `gif` if an `mp4` version is available.
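One way such a preference could be expressed is sketched below in Rust. The `Format` and `Variant` types and the `pick_variant` function are assumptions made for illustration; the tool's actual selection logic may differ.

```rust
/// Media formats considered in this sketch (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Format {
    Mp4,
    Gif,
}

/// One downloadable variant of a post's media (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Variant {
    format: Format,
    url: String,
}

/// Returns the first `mp4` variant if there is one, otherwise the first
/// `gif` variant, otherwise `None`.
fn pick_variant(variants: &[Variant]) -> Option<&Variant> {
    variants
        .iter()
        .find(|v| v.format == Format::Mp4)
        .or_else(|| variants.iter().find(|v| v.format == Format::Gif))
}
```

The order of the input does not matter here: an `mp4` entry wins even when a `gif` entry is listed first.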
## Planned
- [ ] Providing custom filename scheme
- [ ] Configuration for conversion to other/small formats (`avif`/`webp`/`webm`)
- [ ] Remove duplicated
## Development
You can use the `--skip` flag to skip the download process:
```sh
cargo run -- user spez --skip
```
You can use the `--mock` flag to provide a mock file for the responses of the Reddit client:
```sh
cargo run -- user spez --mock ./tests/mocks/reddit/submitted_response/reddit_video.json
```
## License
Reddit Clawler is licensed under the GNU General Public License v3.0. See the LICENSE file for details.