https://github.com/capjamesg/web-feed-recovery

Try to identify new versions of feeds that now return a 404.

Host: GitHub
URL: https://github.com/capjamesg/web-feed-recovery
Owner: capjamesg
License: mit
Created: 2024-12-21T20:06:49.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2024-12-23T13:40:46.000Z (about 1 year ago)
Last Synced: 2025-03-10T18:00:44.917Z (about 1 year ago)
Topics: atom, feed-reader-testing, feed-reading, rss
Language: Python
Homepage:
Size: 27.3 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Web Feed Recovery

This repository contains a script that tries to find a working replacement for a web feed that now returns a 404.

The script takes a list of feed URLs that are known to return a 404 and attempts to find new feeds for them.

On a test of 160 broken feeds from the real world, this project recovered 67%.

## Installation

First, clone this project:

```
git clone https://github.com/capjamesg/web-feed-recovery
cd web-feed-recovery
```

Then, create a file called `feeds.txt` and add feeds that are known to be broken. Add one feed URL per line.

Then, run:

```
python3 app.py
```

Results will be saved to a file called `results.json` with the structure:

```json
[
    {
        "original_feed": "https://blog.autumnrain.cc",
        "found_feeds": {
            "https://blog.autumnrain.cc/rss/": "application/rss+xml"
        }
    }
]
```

Each key in `found_feeds` is a discovered feed URL, and each value is the MIME type found for that feed.

A MIME type is only recorded if the feed was found through HTTP header discovery. Otherwise, the MIME type will be `null`.
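As a rough illustration of consuming this output, the sketch below splits parsed entries into feeds with at least one candidate and feeds with none. The function names `load_results` and `summarize_results` are hypothetical, and the sketch assumes that an unrecovered feed appears with an empty or missing `found_feeds` object, which the example above does not confirm.

```python
import json


def load_results(path="results.json"):
    """Load the results file written by the script."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_results(entries):
    """Split entries into recovered and unrecovered feeds.

    Assumption: an entry with no candidates has an empty or missing
    "found_feeds" object.
    """
    recovered = {}
    unrecovered = []
    for entry in entries:
        found = entry.get("found_feeds") or {}
        if found:
            # Keys are candidate feed URLs; values are MIME types or None.
            recovered[entry["original_feed"]] = found
        else:
            unrecovered.append(entry["original_feed"])
    return recovered, unrecovered


# Typical use once results.json exists:
#   recovered, unrecovered = summarize_results(load_results())
```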

## Algorithm

1. Go to the homepage of the site associated with the feed.
2. Check the HTTP headers and HTML `<link>` tags for signals of a feed (using the [indieweb-utils feed discovery implementation](https://indieweb-utils.readthedocs.io/en/latest/discovery.html#indieweb_utils.discover_web_page_feeds)).
3. Check for link anchors whose text is indicative of a feed (e.g. "RSS", "RSS Feed"). Save those as potential new feeds.
4. Check for link anchors containing blog-related terms, like "Blog" and "Writing". Go to those pages, perform HTTP header and HTML `<link>` tag analysis, and save any feeds.
5. Present all discovered feeds.
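The HTML side of steps 2 to 4 can be sketched with the standard library alone. This is not the repository's code: it does not fetch pages or inspect HTTP headers, the class and function names are hypothetical, the term lists contain only the examples named above, and exact matching on anchor text is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed term lists: only the examples named in the steps above.
FEED_ANCHORS = {"rss", "rss feed"}
BLOG_ANCHORS = {"blog", "writing"}
FEED_MIME_TYPES = {"application/rss+xml", "application/atom+xml"}


class CandidateParser(HTMLParser):
    """Collect <link> feeds and anchors that suggest a feed or a blog page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.link_feeds = {}    # feed URL -> MIME type, from <link> tags
        self.anchor_feeds = []  # hrefs whose anchor text looks like a feed
        self.blog_pages = []    # hrefs whose anchor text looks like a blog
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link":
            rel = (attrs.get("rel") or "").lower().split()
            mime = (attrs.get("type") or "").lower()
            if "alternate" in rel and mime in FEED_MIME_TYPES and attrs.get("href"):
                self.link_feeds[urljoin(self.base_url, attrs["href"])] = mime
        elif tag == "a" and attrs.get("href"):
            # Start collecting the text of this anchor.
            self._href = urljoin(self.base_url, attrs["href"])
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Normalise whitespace and case before comparing anchor text.
            label = " ".join("".join(self._text).split()).lower()
            if label in FEED_ANCHORS:
                self.anchor_feeds.append(self._href)
            elif label in BLOG_ANCHORS:
                self.blog_pages.append(self._href)
            self._href = None


def find_candidates(html, base_url):
    """Return (link_feeds, anchor_feeds, blog_pages) for one HTML page."""
    parser = CandidateParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.link_feeds, parser.anchor_feeds, parser.blog_pages
```

In a fuller version, each URL in `blog_pages` would be fetched and passed through the same analysis, as step 4 describes.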

### Limitations

The algorithm does not work for multi-user sites hosted on a single domain. A feed found on such a domain cannot, in general, be confidently attributed to a single writer using the algorithm above. Further additions would be needed to support such behaviour.
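A small sketch makes the limitation concrete. It assumes that step 1 derives the homepage from the scheme and host of the broken feed URL, which this README does not state; `homepage_for` and the `example.com` URLs are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def homepage_for(feed_url):
    """Assumed rule: the homepage is the scheme and host of the feed URL."""
    parts = urlsplit(feed_url)
    return urlunsplit((parts.scheme, parts.netloc, "/", "", ""))


# Two writers on one domain map to the same homepage, so feeds found
# there cannot be tied back to either writer.
alice = homepage_for("https://example.com/users/alice/feed.xml")
bob = homepage_for("https://example.com/users/bob/feed.xml")
```

Under that assumption, both `alice` and `bob` hold the same homepage URL, so the discovery steps see identical input for both broken feeds.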

## UX

The feeds returned are "potential" feeds. Any feed that the user did not add to a feed reader themselves (or that a feed reader did not infer from a URL provided by the user) cannot be known to be the right replacement without confirmation from the user. Thus, any project that uses this script should include a stage where the user is asked to confirm that the new feed matches their expectations before the broken feed is replaced with the newly found one.
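One possible shape for that confirmation stage is sketched below. `confirm_replacement` is a hypothetical helper, not part of this repository. It takes the `found_feeds` mapping for one entry and an `ask` callable, which defaults to `input`, so the prompt can be swapped for a GUI dialog or a test stub.

```python
def confirm_replacement(original_feed, found_feeds, ask=input):
    """Return the candidate feed URL the user approves, or None."""
    for url, mime_type in found_feeds.items():
        # The MIME type may be None when the feed was not found via headers.
        label = mime_type or "unknown type"
        answer = ask(f"Replace {original_feed} with {url} ({label})? [y/N] ")
        if answer.strip().lower() in {"y", "yes"}:
            return url
    # No candidate was approved, so the broken feed is left untouched.
    return None
```

Defaulting to "no" keeps the broken feed in place unless the user explicitly approves a candidate.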

## License

This project is licensed under an [MIT license](LICENSE).
