https://github.com/lwindolf/rss-feed-index
Crawler bot + data + website providing an index of known RSS feeds in the web
- Host: GitHub
- URL: https://github.com/lwindolf/rss-feed-index
- Owner: lwindolf
- Created: 2025-09-07T20:59:51.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-09-14T21:43:26.000Z (9 months ago)
- Last Synced: 2025-09-14T23:34:48.480Z (9 months ago)
- Topics: feed, index, rss, search-engine, web
- Language: JavaScript
- Homepage: https://lwindolf.github.io/rss-feed-index/
- Size: 27.3 MB
- Stars: 5
- Watchers: 0
- Forks: 1
- Open Issues: 2
Metadata Files:
- Readme: README.md
README
# RSS Feed Index
This repo hosts:
1. a crawler for news feeds (RSS, Atom, ...)
2. the current crawl result `index.json` for the [Majestic Million websites](https://majestic.com/reports/majestic-million), a list licensed under [CC BY Attribution 3.0 Unported](https://creativecommons.org/licenses/by/3.0/deed.en)
3. a GitHub Pages [site](https://lwindolf.github.io/rss-feed-index/) to test the results
## Feed Catalog Format
The catalog JSON stored as `index.json` has the following format:

    {
      "example.com" : [{
        "n" : "Example.com feed",
        "u" : "https://example.com/feed.xml",
        "t" : 134,
        "f" : "rss",
        "ns" : [ "syn", "wfw", "dc" ],
        "d" : 1757110273
      }]
    }

The fields have the following meaning:
| Field | Description |
|-------|--------------------------------------------------------|
| (object key) | Domain |
| n | Feed title |
| i | Feed description |
| u | URL to feed |
| t | Average score of characters in item description |
| f | Feed type: "rss", "atom" or "json" |
| ns | Namespaces / Features discovered |
| d | Last update timestamp of the feed |
All text fields are UTF-8 plain text and might need escaping before display.
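
As a minimal sketch of how a consumer might read this format, the helper below looks up the feeds for one domain and expands the short keys. The function name `feedsForDomain` and the long field names are hypothetical and not part of the project, and treating `d` as a Unix timestamp in seconds is an assumption based on the sample value.

```javascript
// Hypothetical mapping from the short catalog keys to descriptive names.
// The long names are illustrative only and not defined by the project.
const FIELD_NAMES = new Map([
  ['n', 'title'],
  ['i', 'description'],
  ['u', 'url'],
  ['t', 'textScore'],
  ['f', 'type'],
  ['ns', 'namespaces'],
  ['d', 'updated'],
]);

// Return the feeds listed for a domain, with short keys expanded.
// Unknown domains yield an empty array; unknown keys are kept unchanged.
function feedsForDomain(catalog, domain) {
  const entries = Object.hasOwn(catalog, domain) ? catalog[domain] : [];
  return entries.map((entry) => {
    const feed = {};
    for (const [key, value] of Object.entries(entry)) {
      feed[FIELD_NAMES.get(key) ?? key] = value;
    }
    return feed;
  });
}

// Sample data copied from the format example above.
const sampleCatalog = {
  'example.com': [{
    n: 'Example.com feed',
    u: 'https://example.com/feed.xml',
    t: 134,
    f: 'rss',
    ns: ['syn', 'wfw', 'dc'],
    d: 1757110273,
  }],
};

const [sampleFeed] = feedsForDomain(sampleCatalog, 'example.com');
console.log(sampleFeed.title, sampleFeed.url);
// Assumption: "d" is a Unix timestamp in seconds.
console.log(new Date(sampleFeed.updated * 1000).toISOString());
```

With a real catalog, the object passed in would come from parsing `index.json`, for example with `JSON.parse(fs.readFileSync('index.json', 'utf8'))` in Node.js.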
## Crawler Usage

    wget https://downloads.majestic.com/majestic_million.csv
    npm i
    npm run crawl

For parallel execution there is a `parallel.sh` script.
## Crawler Ethics
- robots.txt is respected
- feed discovery only on the domain root, no traversal
- minimal traffic
- 1 update/check request per feed per month max
- almost no retries
- no parallel crawling on a domain
- filtering of domains using Cloudflare's family filter (1.1.1.3 resolver) to avoid malware and adult content
Effectively, most sites without a feed should be hit by only 2 requests.
Sites that have feeds should see 2 + the number of feeds (specified by ``) requests.
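
The request budget and the once-a-month limit described above can be sketched as follows. This is not the crawler's actual code: the function names are hypothetical, and approximating a month as 30 days is an assumption.

```javascript
// Assumption: a "month" is approximated as 30 days.
const SECONDS_PER_MONTH = 30 * 24 * 60 * 60;

// Expected number of requests a site sees in one crawl:
// 2 base requests plus one request per discovered feed.
function expectedRequests(feedCount) {
  return 2 + feedCount;
}

// True if at least one (approximate) month has passed since the last
// check of a feed. Both arguments are Unix timestamps in seconds.
function isCheckDue(lastCheckSeconds, nowSeconds) {
  return nowSeconds - lastCheckSeconds >= SECONDS_PER_MONTH;
}

console.log(expectedRequests(0)); // site without a feed
console.log(expectedRequests(3)); // site with three feeds
console.log(isCheckDue(1757110273, Math.floor(Date.now() / 1000)));
```

A site without feeds corresponds to `expectedRequests(0)`, which gives the 2 requests mentioned above.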
The crawler user agent is:

    Mozilla/5.0 (compatible; rss-feed-index-bot/0.9; +https://github.com/lwindolf/rss-feed-index)

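
Site operators who want to recognize the crawler in their logs could match on the product token from this string. The sketch below is one possible approach; the helper name `isRssFeedIndexBot` is hypothetical. It matches on the token only, because the version number (`0.9`) may change.

```javascript
// Product token taken from the user agent string above.
const BOT_TOKEN = 'rss-feed-index-bot';

// True if the given User-Agent header value contains the bot token.
// The comparison is case-insensitive and ignores the version number.
function isRssFeedIndexBot(userAgent) {
  return typeof userAgent === 'string'
    && userAgent.toLowerCase().includes(BOT_TOKEN);
}

const botUserAgent =
  'Mozilla/5.0 (compatible; rss-feed-index-bot/0.9; +https://github.com/lwindolf/rss-feed-index)';
console.log(isRssFeedIndexBot(botUserAgent));
console.log(isRssFeedIndexBot('Mozilla/5.0 (X11; Linux x86_64)'));
```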
## Website Build
To prepare for deployment, run:

    npm i
    npm run build-www

Test locally with `npx serve www`.