https://github.com/ecklf/reddit-clawler

A command-line tool written in Rust that crawls Reddit posts from a user or subreddit

cli crawler downloader downloader-for-reddit reddit

Last synced: about 1 month ago

Host: GitHub
URL: https://github.com/ecklf/reddit-clawler
Owner: ecklf
License: gpl-3.0
Created: 2023-12-07T23:30:57.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2024-09-09T16:59:15.000Z (5 months ago)
Last Synced: 2024-10-04T18:41:43.716Z (4 months ago)
Topics: cli, crawler, downloader, downloader-for-reddit, reddit
Language: Rust
Homepage:
Size: 138 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Reddit Clawler 🐾

A command-line tool written in Rust that crawls Reddit posts from a user, subreddit, or search term.

## Usage

Install the following dependencies:

- [yt-dlp](https://github.com/yt-dlp/yt-dlp)

## Commands

You can see all available commands by running:

```sh
./reddit_clawler --help
```

By default, the tool downloads posts to the `output/{subcommand}/{value}` folder.
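
As an illustration of that layout, here is a minimal sketch of how the folder could be derived from a base directory, the subcommand, and its value. The helper name `output_dir` is hypothetical and not taken from the project's source; the real path logic may differ.

```rust
use std::path::{Path, PathBuf};

/// Builds `{base}/{subcommand}/{value}`, for example `output/user/spez`.
/// Hypothetical helper for illustration only; not the tool's actual code.
fn output_dir(base: &Path, subcommand: &str, value: &str) -> PathBuf {
    base.join(subcommand).join(value)
}
```

Calling `output_dir(Path::new("output"), "user", "spez")` yields `output/user/spez`. Passing `./downloads` as the base mirrors what the `-o` flag does in the examples that follow.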

### User

Crawls posts from `/u/spez` in the `new` category, spawning `50` tasks, and downloads them to `./downloads/user/spez`:

```sh
./reddit_clawler user spez --category new --tasks 50 -o ./downloads
```

### Subreddit

Crawls posts from `/r/redpandas` from the `top` category, filtered by `hour`:

```sh
./reddit_clawler subreddit redpandas --category top --timeframe hour
```

### Search

Crawls posts for search term `olympics` from the `top` category, filtered by `hour`:

```sh
./reddit_clawler search olympics --category top --timeframe hour
```

## Features

### Providers (these are the most common I found):

- [x] Reddit Media
- [x] Imgur Media
- [x] YouTube Videos
- [x] Redgifs Videos

### Caching

After the downloads have finished, a `cache.json` file will be created in the folder of the downloaded resource.
This file keeps track of the posts you have already downloaded and skips downloading them on subsequent runs.
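
The skip logic described above can be sketched roughly as follows. This is an in-memory stand-in only: the actual layout of `cache.json` is not documented here, and the `DownloadCache` type, its methods, and the idea of tracking posts by a string ID are all assumptions made for illustration.

```rust
use std::collections::HashSet;

/// In-memory stand-in for the contents of `cache.json`.
/// Assumption: each post is tracked by a unique string ID.
#[derive(Default)]
struct DownloadCache {
    seen: HashSet<String>,
}

impl DownloadCache {
    /// Returns only the IDs that are not in the cache yet, preserving order.
    fn pending<'a>(&self, post_ids: &[&'a str]) -> Vec<&'a str> {
        post_ids
            .iter()
            .copied()
            .filter(|id| !self.seen.contains(*id))
            .collect()
    }

    /// Records a post as downloaded so later runs can skip it.
    fn mark_downloaded(&mut self, post_id: &str) {
        self.seen.insert(post_id.to_string());
    }
}
```

On a later run, loading the cache first and downloading only what `pending` returns would leave already-saved posts untouched.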

### Rate limiting

Posts are fetched in pages of 100 items per request, and crawling many pages can trigger rate limiting.
To avoid this, pass the `--limit` flag to cap the number of requests made when fetching a resource.
This can be useful on subsequent crawls.
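
To show how a request cap interacts with pagination, here is a minimal sketch of a cursor-following loop that stops either when the listing ends or when the cap is reached. The `Page` type, the `crawl` function, and the `fetch_page` callback (standing in for the HTTP call) are hypothetical names; the tool's real client may be structured differently.

```rust
/// One page of results plus the cursor for the next page (`None` when exhausted).
struct Page {
    items: Vec<String>,
    next: Option<String>,
}

/// Follows pagination cursors until the listing ends or `limit` requests were made.
/// `fetch_page` is a hypothetical stand-in for the real network request.
fn crawl<F>(limit: Option<usize>, mut fetch_page: F) -> Vec<String>
where
    F: FnMut(Option<&str>) -> Page,
{
    let mut items = Vec::new();
    let mut cursor: Option<String> = None;
    let mut requests = 0;

    // With no limit, keep going until the listing reports no next page.
    while limit.map_or(true, |max| requests < max) {
        let page = fetch_page(cursor.as_deref());
        requests += 1;
        items.extend(page.items);
        match page.next {
            Some(next) => cursor = Some(next),
            None => break,
        }
    }
    items
}
```

With 100 items per request, a cap of 2 under this scheme would return at most 200 posts.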

### File format

By default, the tool prefers `mp4` over `gif` when an `mp4` version is available.
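
One way to express that preference is sketched below. It assumes the format can be inferred from the file extension of each candidate URL, which is a simplification (URLs with query strings would need extra handling). The function names are hypothetical and not taken from the project.

```rust
/// Case-insensitive check of a URL's trailing extension.
/// Assumption: the URL ends with the file extension (no query string).
fn has_ext(url: &str, ext: &str) -> bool {
    url.to_ascii_lowercase().ends_with(ext)
}

/// Picks an `.mp4` candidate if one exists, otherwise falls back to a `.gif`.
/// Returns `None` when neither format is present.
fn preferred_media<'a>(candidates: &[&'a str]) -> Option<&'a str> {
    candidates
        .iter()
        .copied()
        .find(|url| has_ext(url, ".mp4"))
        .or_else(|| candidates.iter().copied().find(|url| has_ext(url, ".gif")))
}
```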

## Planned

- [ ] Providing custom filename scheme
- [ ] Configuration for conversion to other/smaller formats (`avif`/`webp`/`webm`)
- [ ] Remove duplicates

## Development

You can use the `--skip` flag to skip the download process:

```sh
cargo run -- user spez --skip
```

You can use the `--mock` flag to provide a mock file for the responses of the Reddit client:

```sh
cargo run -- user spez --mock ./tests/mocks/reddit/submitted_response/reddit_video.json
```

## License

Reddit Clawler is licensed under the GNU General Public License v3.0. See the LICENSE file for details.