https://github.com/jareware/xpath2rss
A simple web scraper for querying HTML documents with XPath and turning the results into an RSS feed.
- Host: GitHub
- URL: https://github.com/jareware/xpath2rss
- Owner: jareware
- Created: 2011-07-11T16:49:52.000Z (almost 15 years ago)
- Default Branch: master
- Last Pushed: 2015-12-01T22:37:20.000Z (over 10 years ago)
- Last Synced: 2025-03-17T18:09:35.867Z (over 1 year ago)
- Language: PHP
- Homepage:
- Size: 98.6 KB
- Stars: 17
- Watchers: 4
- Forks: 4
- Open Issues: 2
Metadata Files:
- Readme: README.md
README
XPath2RSS
=========
A simple web scraper for querying HTML documents with XPath and turning the results into an RSS feed.
It's in PHP because it's a good glue for anything web-related, and it uses XPaths because they're awesome to work with.
What's it for
-------------
It's for keeping up with updates to those annoying sites that don't provide an RSS feed themselves. There are some example cases below.
Installing
----------
On a Debian-like system, get the dependencies with:
    $ apt-get install php5-cli php5-curl
Then get yourself a copy of `xpath2rss.php` (might be handy to drop it in your `PATH` somewhere, like under `/usr/bin`). Feel free to rename it to `xpath2rss` while you're at it if you don't like the extension (the interpreter is specified in the file).
To see that it's a-OK, try running:
    $ xpath2rss
You should see a usage message. PHP 5.3+ is recommended, but the script should run with anything 5.1+.
Usage
-----
The command expects a path to a configuration file as its only argument. The configuration file is a traditional ini file that specifies what to fetch, which XPath expressions to use, and so on. You can test a configuration file by running:
    $ xpath2rss --test myconfig.ini
You'll see some useful info.
The script is likely most useful when run periodically from a cron-like facility.
Configuration
-------------
A configuration file must contain the following properties:
* `feed` - Name of the feed. This will appear as the `<title>` of the RSS feed.
* `url` - URL from which to load the HTML that will be scraped.
* `file` - Path to an XML file that will host the RSS feed (likely under your webroot somewhere so an RSS reader can access it).
* `title` - Template for the contents of the `<title>` for a single item in the RSS feed. If this template contains any `%variables%`, they are replaced with the corresponding XPath matches from `[vars]`.
* `description` - Same as above, but for the `<description>` tag.
* `context` - An (optional) XPath expression to select a context node for any following expressions under `[vars]` below. Use this to avoid repetition of the same search prefix in multiple variables. See Examples.
* `[vars]` - Any number of XPath expressions that will be used to scrape content from the page at `url`. If the name of the var is `foo`, then it will be usable in the `title` and `description` fields as `%foo%`. The only mandatory var is `guid`.
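The properties above can be sanity-checked with a few lines of PHP. This is only a sketch of how such a file could be read with PHP's built-in `parse_ini_string()`; the `loadFeedConfig()` helper and the sample values are made up for illustration, and the script itself may well read its configuration differently.

```php
<?php
// Hypothetical helper, not part of xpath2rss: parse and sanity-check a config.
function loadFeedConfig($iniText)
{
    // The second argument keeps [vars] as a nested array under the "vars" key.
    $config = parse_ini_string($iniText, true);
    if ($config === false) {
        throw new InvalidArgumentException('Configuration is not valid ini syntax');
    }

    // Properties the list above describes as required ("context" is optional).
    foreach (array('feed', 'url', 'file', 'title', 'description') as $key) {
        if (!isset($config[$key]) || $config[$key] === '') {
            throw new InvalidArgumentException("Missing required property: $key");
        }
    }

    // guid is the only mandatory entry under [vars].
    if (!isset($config['vars']['guid']) || $config['vars']['guid'] === '') {
        throw new InvalidArgumentException('[vars] must define guid');
    }

    return $config;
}

// Made-up sample configuration, used only to exercise the helper.
$sample = <<<'INI'
feed = "Example feed"
url = "http://example.com/"
file = "/tmp/example.xml"
title = "%guid%"
description = "%text%"
[vars]
guid = "//h1"
text = "//p"
INI;

$config = loadFeedConfig($sample);
echo $config['feed'], ' has ', count($config['vars']), " vars\n";
```

Top-level properties end up as plain keys, while everything under `[vars]` lands in `$config['vars']`, which mirrors the split between feed settings and XPath expressions described above.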
Notes
-----
Each RSS item has a GUID. Once an item has been added to the feed, an item with the same GUID won't be added again.
The GUID, along with any other optional variables, is specified under the `[vars]` heading of the configuration file. The content of each variable is determined by its XPath expression. Any `%var%`s found in the `title` and `description` templates of an RSS item are expanded to their values.
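To make the two rules above concrete, here is a minimal sketch of `%var%` expansion and GUID de-duplication. The helper names `expandTemplate()` and `buildItemIfNew()` are hypothetical and not taken from the script, and the sample values are invented. In this sketch, placeholders with no matching variable are simply left as they are; the real script may treat them differently.

```php
<?php
// Hypothetical helper: replace each %name% in $template with its value.
function expandTemplate($template, array $vars)
{
    $pairs = array();
    foreach ($vars as $name => $value) {
        $pairs['%' . $name . '%'] = $value;
    }
    // Older PHP versions return false from strtr() for an empty array.
    return $pairs ? strtr($template, $pairs) : $template;
}

// Hypothetical helper: build an item unless its GUID is already in the feed.
function buildItemIfNew(array $config, array $vars, array $knownGuids)
{
    if (in_array($vars['guid'], $knownGuids, true)) {
        return null; // already published, so nothing to add
    }
    return array(
        'guid' => $vars['guid'],
        'title' => expandTemplate($config['title'], $vars),
        'description' => expandTemplate($config['description'], $vars),
    );
}

// Invented sample data.
$config = array('title' => '%guid%', 'description' => 'Alt text: %text%');
$vars = array('guid' => 'Comic #1', 'text' => 'Hover text');

$first = buildItemIfNew($config, $vars, array());            // new item
$second = buildItemIfNew($config, $vars, array('Comic #1')); // duplicate GUID
```

Here `$first` is a populated item and `$second` is `null`, because the GUID is already known on the second call.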
Examples
--------
### A webcomic ###
To get a feed from one popular webcomic (yes, they already have one), set up an `xkcd.ini` along these lines:
    feed = "xkcd"
    url = "http://xkcd.com/"
    file = "/path/to/webroot/xkcd.xml"
    title = "%guid%"
    description = "
    %text%
    "
    [vars]
    guid = "//div[@id='middleContent']//img/@alt"
    image = "//div[@id='middleContent']//img/@src"
    text = "//div[@id='middleContent']//img/@title"
And run:
    $ xpath2rss --test xkcd.ini
You should see the name of the latest comic as the `guid` and the other vars populated as well. Including `%text%` in the description has the added benefit of making the image's title text readable on devices without a cursor (say, a phone).
### Episodic YouTube-content ###
Some good content on YouTube doesn't have its own channel (from which you could get a feed directly). To scrape a feed from the search page, you could do something like:
    feed = "When Cheese Fails"
    url = "http://www.youtube.com/results?search_type=videos&search_query=when+cheese+fails&search_sort=video_date_uploaded"
    file = "/path/to/webroot/whencheesefails.xml"
    title = "%guid%"
    description = "View on YouTube"
    context = "//div[@id='search-results']//a[ contains(@title, 'Season') and contains(@title, 'Episode') ]"
    [vars]
    guid = "@title"
    link = "@href"
This works because the search results are ordered newest first, and the XPath expressions will always use the first match if multiple are found. Also, since the search query is a bit long-winded, we use the optional `context` option to first select the matching context node. After that, any `[vars]` we declare will use that node as their context. Note that the same could have been done with the webcomic example.
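The interplay between `context`, relative `[vars]` expressions and the first-match rule can be tried out with PHP's DOM extension. The sketch below assumes `DOMDocument` and `DOMXPath` are available; the `firstMatch()` helper and the HTML snippet are made up for illustration and are not the script's actual code or YouTube's actual markup.

```php
<?php
// Hypothetical helper: text of the first node matching $expression, or null.
function firstMatch(DOMXPath $xpath, $expression, $context = null)
{
    $nodes = $context === null
        ? $xpath->query($expression)
        : $xpath->query($expression, $context);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->nodeValue);
}

// Made-up markup standing in for a search results page, newest entry first.
$html = <<<'HTML'
<html><body><div id="search-results">
  <a href="/watch?v=2" title="Season 1 Episode 2">newer</a>
  <a href="/watch?v=1" title="Season 1 Episode 1">older</a>
</div></body></html>
HTML;

$doc = new DOMDocument();
@$doc->loadHTML($html); // @ silences warnings about sloppy real-world HTML
$xpath = new DOMXPath($doc);

// Select the context node once...
$contextExpr = "//div[@id='search-results']//a[ contains(@title, 'Season') and contains(@title, 'Episode') ]";
$contextNode = $xpath->query($contextExpr)->item(0);

// ...then evaluate the short [vars] expressions relative to it.
$guid = firstMatch($xpath, '@title', $contextNode);
$link = firstMatch($xpath, '@href', $contextNode);
echo $guid, ' -> ', $link, "\n";
```

Because both links match the context expression and only the first is used, `$guid` and `$link` come from the newer entry, which is why ordering the results newest first matters.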
See also
--------
1. http://www.w3.org/TR/xpath/ - XPath syntax