# Reddit Feed Fetcher
Consume Reddit Feed and Download Imgur Albums

## Motive
I wanted a script/program that could listen in on my saved posts, check whether each one was an
Imgur album, and download and store it for future use/backup.

Reddit provides personal RSS feeds; that is, I have an RSS/Atom feed for my saved posts. Using that,
I can periodically query this endpoint to fetch new posts I have saved and do some
rudimentary checking before downloading the album (e.g. check whether it is an Imgur post, whether it is
from a particular sub, etc.). From there, I can parse out information to pass into the Imgur API to fetch
the raw image links.
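
As a rough illustration of that loop, here is a minimal polling sketch. It assumes the `rss-parser`
npm package and that the Imgur link appears in each entry's HTML content; neither is necessarily
what this project actually does.
```js
// Minimal polling sketch (illustrative; not this project's actual code).
const Parser = require("rss-parser");
require("dotenv").config();

const parser = new Parser();
const ALBUM_PATTERN = /imgur\.com\/(?:a|gallery)\/(\w+)/;

async function pollOnce() {
  const feed = await parser.parseURL(process.env.REDDIT_SAVED_RSS_FEED);
  for (const item of feed.items) {
    // Assumption: Reddit embeds the outbound link in the entry's HTML content.
    const match = ALBUM_PATTERN.exec(item.content || "");
    if (match) {
      console.log(`Imgur album ${match[1]} found in "${item.title}"`);
      // ...hand the album id to the Imgur API fetcher here.
    }
  }
}

pollOnce().catch(console.error);
```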

## Notes
This project is still pretty buggy and inefficient. I should probably use a connection pool and
have some way to throttle my connections when processing starts. I should also include some way to
check whether I have already downloaded something, to avoid wasting bandwidth. All in all, I hacked this
together quickly one night.

### TODO:
- [x] Connection Pooling
- [x] Fetch past recent items on rss feed
- [x] Add ability to skip processed items on rate limit
- [x] Avoid fetching existing images based on path
- [ ] Add rate limiting mitigations/throttling
- [x] Add option to only process recent items
- [ ] Fix occasional hiccups with `undefined` path args and timeout (`x` number of retries?)

## Structure
```
REDDIT_SAVED_RSS_FEED="link to reddit rss feed"
IMGUR_CLIENT_SECRET="imgur client secret"

ENABLE_LOG_SUMMARY="true or false value to enable more robust logging"
DESTINATION="where to drop off image and pdf"
CONNECTIONS=integer value of sockets to use for requests
START_AFTER="reddit id to start after"
SINGLE_BATCH="if defined, only one batch will execute"
```
These variables need to be defined in a `.env` file.
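
For illustration, a filled-out `.env` might look like the following; every value here is a
hypothetical placeholder (see the Reddit RSS links under "How to use" for where the feed URL and
token actually come from).
```
REDDIT_SAVED_RSS_FEED="https://www.reddit.com/saved.rss?feed=<your-feed-token>&user=<your-username>"
IMGUR_CLIENT_SECRET="<your-imgur-client-secret>"

ENABLE_LOG_SUMMARY="true"
DESTINATION="./downloads"
CONNECTIONS=4
START_AFTER="<reddit-id-from-logs>"
SINGLE_BATCH="true"
```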

All images are stored using this path convention:
```
DESTINATION/SUBREDDIT_SOURCE/POST_TITLE  # base path
/page_0[1-9].png                         # single-digit pages are 0-prefixed
/page_[10+].png                          # pages 10 and up
/POST_TITLE.pdf                          # all pages stitched together
```
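
A small sketch of that convention (the helper name and zero-padding call are illustrative, not
taken from the project):
```js
// Builds the page path described above (illustrative helper, not project code).
const path = require("path");

function pagePath(destination, subreddit, postTitle, pageNumber) {
  const page = String(pageNumber).padStart(2, "0"); // zero-prefix single digits
  return path.join(destination, subreddit, postTitle, `page_${page}.png`);
}

// pagePath("./downloads", "pics", "Some Post", 3)
// => "downloads/pics/Some Post/page_03.png"
```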

Use `START_AFTER` if you are rate limited to skip processing of earlier feed elements; a
Reddit `id` is posted in the logs every batch. Use `CONNECTIONS` to set a limit on the
number of connections used to fetch data.
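
One common way to enforce such a cap in Node is an HTTP agent with `maxSockets`; whether this
project does exactly that is an assumption.
```js
// Sketch: cap concurrent sockets at CONNECTIONS (assumed mechanism, not
// necessarily the one this project uses).
const https = require("https");

const agent = new https.Agent({
  keepAlive: true,
  maxSockets: parseInt(process.env.CONNECTIONS, 10) || 4,
});

// Requests sharing this agent queue up once CONNECTIONS sockets are busy.
https.get("https://api.imgur.com/3/album/<album-id>", { agent }, (res) => {
  res.resume(); // consume the response here
});
```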

After every `rss` batch, there is a `3` second delay before the next batch starts. This
is just to allow I/O processes to keep up and to act as a rudimentary guard against hitting the
Imgur servers too frequently.
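
In sketch form, the batch loop with that fixed pause could look like this (`processBatch` is a
hypothetical stand-in for the fetch-and-download step):
```js
// Illustrative batch loop with the 3-second inter-batch delay.
const BATCH_DELAY_MS = 3000;
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function run() {
  while (true) {
    await processBatch(); // hypothetical: fetch one rss batch and download albums
    if (process.env.SINGLE_BATCH) break; // honor SINGLE_BATCH from .env
    await sleep(BATCH_DELAY_MS); // let I/O catch up; avoid hammering Imgur
  }
}
```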

The PDF stitch is always regenerated. I currently do not have a way to detect whether a previously
missing image has since been fetched (i.e. missing because of rate limiting/partial batch processing).
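
Regenerating the stitch from whatever pages exist on disk could look like the sketch below; the
README does not say which PDF library the project uses, so `pdf-lib` here is an assumption.
```js
// Sketch: rebuild POST_TITLE.pdf from the page_*.png files (assumes pdf-lib).
const { PDFDocument } = require("pdf-lib");
const fs = require("fs/promises");

async function stitchPdf(pngPaths, outPath) {
  const doc = await PDFDocument.create();
  for (const pngPath of pngPaths) {
    const png = await doc.embedPng(await fs.readFile(pngPath));
    const page = doc.addPage([png.width, png.height]);
    page.drawImage(png, { x: 0, y: 0, width: png.width, height: png.height });
  }
  await fs.writeFile(outPath, await doc.save()); // overwrite: always regenerated
}
```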

## How to use
1. `git clone https://github.com/lamdaV/RedditFeedFetcher.git`
2. `yarn install` or `npm install`
3. create a `.env` file and fill out relevant information (see above)
- [Reddit RSS Wiki](https://www.reddit.com/wiki/rss)
- [Reddit RSS Personalized Feed](https://redditblog.com/2010/02/02/feed-me/)
- [Imgur Client Secret](https://apidocs.imgur.com/)
4. `yarn start` or `npm run start`
- On Linux or macOS, run `yarn start | tee path/to/output.log` to get both stdout logging
and file logging.