https://github.com/jareware/xpath2rss

A simple web scraper for querying HTML documents with XPath and turning the results into an RSS feed.

Host: GitHub
URL: https://github.com/jareware/xpath2rss
Owner: jareware
Created: 2011-07-11T16:49:52.000Z (almost 15 years ago)
Default Branch: master
Last Pushed: 2015-12-01T22:37:20.000Z (over 10 years ago)
Last Synced: 2025-03-17T18:09:35.867Z (over 1 year ago)
Language: PHP
Homepage:
Size: 98.6 KB
Stars: 17
Watchers: 4
Forks: 4
Open Issues: 2
Metadata Files:
- Readme: README.md

XPath2RSS
=========

A simple web scraper for querying HTML documents with XPath and turning the results into an RSS feed.

It's in PHP because it's a good glue for anything web-related, and it uses XPaths because they're awesome to work with.

What's it for
-------------

It's for keeping up with updates to those annoying sites that don't provide an RSS feed themselves. There are some example cases below.

Installing
----------

On a Debian-like system, get the dependencies with:

    $ apt-get install php5-cli php5-curl

Then get yourself a copy of `xpath2rss.php` (might be handy to drop it in your `PATH` somewhere, like under `/usr/bin`). Feel free to rename it to `xpath2rss` while you're at it if you don't like the extension (the interpreter is specified in the file).

To see that it's a-OK, try running:

    $ xpath2rss

You should see a usage message. PHP 5.3+ is recommended, but the script should run with anything 5.1+.

Usage
-----

The command expects a path to a configuration file as its only argument. The configuration file is a traditional ini file that specifies what to fetch, which XPath expressions to use, and so on. You can test out a configuration file by running:

    $ xpath2rss --test myconfig.ini

You'll see some useful info.

The script is likely most useful when run periodically from a cron-like facility.

Configuration
-------------

A configuration file must contain the following properties (except where marked optional):

* `feed` - Name of the feed. This will appear as the `<title>` of the RSS feed.
* `url` - URL from which to load the HTML that will be scraped.
* `file` - Path to an XML file that will host the RSS feed (likely under your webroot somewhere so an RSS reader can access it).
* `title` - Template for the contents of the `<title>` for a single item in the RSS feed. If this template contains any `%variables%`, they are replaced with the corresponding XPath matches from `[vars]`.
* `description` - Same as above, but for the `<description>` tag.
* `context` - An (optional) XPath expression to select a context node for any following expressions under `[vars]` below. Use this to avoid repetition of the same search prefix in multiple variables. See Examples.
* `[vars]` - Any number of XPath expressions that will be used to scrape content from the page at `url`. If the name of the var is `foo`, then it will be usable in the `title` and `description` fields as `%foo%`. The only mandatory var is `guid`.
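
As a rough illustration of how the properties above fit together, here is a minimal sketch that reads such a configuration and expands `%variables%` in a template. It is not taken from `xpath2rss.php`: the helper name `expandTemplate()`, the sample values, and the choice to leave unknown placeholders untouched are all assumptions made for the example.

```php
<?php
// Hypothetical sketch, not the actual implementation in xpath2rss.php.

// A config in the format described above, inlined to keep the example self-contained.
$ini = <<<'INI'
feed = "Example feed"
url = "http://example.com/"
file = "/tmp/example.xml"
title = "%guid%"
description = "%text% (by %author%)"

[vars]
guid = "//h1"
text = "//p[@class='summary']"
INI;

// Passing true as the second argument turns [vars] into a nested array.
$config = parse_ini_string($ini, true);

// expandTemplate() is a made-up helper name. It swaps each %name% for the
// matching value and leaves placeholders without a value untouched.
function expandTemplate($template, array $values)
{
    return preg_replace_callback('/%(\w+)%/', function ($m) use ($values) {
        return array_key_exists($m[1], $values) ? $values[$m[1]] : $m[0];
    }, $template);
}

// Stand-ins for the first XPath match of each entry under [vars].
$scraped = array('guid' => 'Post 42', 'text' => 'A short summary');

echo expandTemplate($config['title'], $scraped), "\n";       // Post 42
echo expandTemplate($config['description'], $scraped), "\n"; // A short summary (by %author%)
```

How the real script treats a placeholder with no matching var is not documented here, so check with `--test` rather than relying on this sketch.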

Notes
-----

Each RSS item has a GUID. Once an item has been added to the feed, an item with the same GUID won't be added again.
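
The GUID rule can be pictured with a small sketch like the one below. It only illustrates the idea of skipping already-seen GUIDs; `addItemOnce()` is a hypothetical helper, and keeping the newest item first is an assumption rather than documented behaviour of the script.

```php
<?php
// Hypothetical sketch; the real script may track GUIDs differently.

// Returns the item list with $candidate added, unless an item with the
// same GUID is already present.
function addItemOnce(array $items, array $candidate)
{
    foreach ($items as $item) {
        if ($item['guid'] === $candidate['guid']) {
            return $items; // already in the feed, nothing to add
        }
    }
    array_unshift($items, $candidate); // assumption: newest item goes first
    return $items;
}

$feed = array();
$feed = addItemOnce($feed, array('guid' => 'Comic A', 'title' => 'Comic A'));
$feed = addItemOnce($feed, array('guid' => 'Comic A', 'title' => 'Comic A')); // same page scraped again
$feed = addItemOnce($feed, array('guid' => 'Comic B', 'title' => 'Comic B'));

echo count($feed), "\n";      // 2
echo $feed[0]['guid'], "\n";  // Comic B
```

This is why the `guid` var should point at something that changes whenever the page has new content, such as a post title or a permalink.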

The GUID, along with any other optional variables, is specified under the `[vars]` heading of the configuration file. The content of each variable is determined by its XPath. Any `%var%`s found in the `title` and `description` templates of an RSS item are expanded to their values.

Examples
--------

### A webcomic ###

To get a feed from one popular webcomic (yes, they already have one), set up an `xkcd.ini` along these lines:

    feed = "xkcd"
    url = "http://xkcd.com/"
    file = "/path/to/webroot/xkcd.xml"
    title = "%guid%"
    description = "%text%"

    [vars]

    guid = "//div[@id='middleContent']//img/@alt"
    image = "//div[@id='middleContent']//img/@src"
    text = "//div[@id='middleContent']//img/@title"

And run:

    $ xpath2rss --test xkcd.ini

You should see the name of the latest comic as the `guid` and the other vars populated as well. Including `%text%` in the description has the added benefit of making the image title text readable on devices without a cursor (say, a phone).

### Episodic YouTube-content ###

Some good stuff on YouTube doesn't have its own channel (from which you could get a feed directly). To scrape a feed from the search page, you could do something like:

    feed = "When Cheese Fails"
    url = "http://www.youtube.com/results?search_type=videos&search_query=when+cheese+fails&search_sort=video_date_uploaded"
    file = "/path/to/webroot/whencheesefails.xml"
    title = "%guid%"
    description = "View on YouTube"
    context = "//div[@id='search-results']//a[ contains(@title, 'Season') and contains(@title, 'Episode') ]"

    [vars]

    guid = "@title"
    link = "@href"

This works because the search results are ordered newest first, and the XPath expressions will always use the first match if multiple are found. Also, since the search query is a bit long-winded, we use the optional `context` option to first select the matching context node. After that, any `[vars]` we declare will use that node as their context. Note that the same could have been done with the webcomic example.
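
To make the context-node and first-match behaviour described above more concrete, here is a sketch using PHP's `DOMDocument` and `DOMXPath` (this needs the DOM extension). The HTML snippet is invented for the example, and the code only mirrors the idea; it is not how `xpath2rss.php` is necessarily written.

```php
<?php
// Hypothetical sketch of context-relative XPath evaluation; not the script's own code.

// Made-up markup standing in for a search results page, newest result first.
$html = <<<'HTML'
<html><body>
<div id="search-results">
  <a title="Trailer" href="/watch?v=0">Trailer</a>
  <a title="Season 1 Episode 2" href="/watch?v=2">Newest</a>
  <a title="Season 1 Episode 1" href="/watch?v=1">Older</a>
</div>
</body></html>
HTML;

libxml_use_internal_errors(true); // real-world HTML is rarely clean; keep warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Select the context node once; item(0) is the first match in document order.
$contextExpr = "//div[@id='search-results']//a[ contains(@title, 'Season') and contains(@title, 'Episode') ]";
$context = $xpath->query($contextExpr)->item(0);

// Each var is then evaluated relative to that node, again taking the first match.
$vars = array('guid' => '@title', 'link' => '@href');
$values = array();
foreach ($vars as $name => $expr) {
    $match = $xpath->query($expr, $context)->item(0);
    $values[$name] = $match === null ? '' : $match->nodeValue;
}

echo $values['guid'], "\n"; // Season 1 Episode 2
echo $values['link'], "\n"; // /watch?v=2
```

The "Trailer" link is skipped because its title fails the `contains()` checks, and the older episode is skipped because only the first remaining match is used.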
