{"id":17991384,"url":"https://github.com/jareware/xpath2rss","last_synced_at":"2025-03-25T23:32:07.203Z","repository":{"id":73474847,"uuid":"2031389","full_name":"jareware/xpath2rss","owner":"jareware","description":"A simple web scraper for querying HTML documents with XPath and turning the results into an RSS feed.","archived":false,"fork":false,"pushed_at":"2015-12-01T22:37:20.000Z","size":101,"stargazers_count":17,"open_issues_count":2,"forks_count":4,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-03-17T18:09:35.867Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jareware.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2011-07-11T16:49:52.000Z","updated_at":"2025-01-22T19:52:02.000Z","dependencies_parsed_at":null,"dependency_job_id":"13c3fd8b-d98f-4a77-b20a-aa9370465480","html_url":"https://github.com/jareware/xpath2rss","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jareware%2Fxpath2rss","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jareware%2Fxpath2rss/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jareware%2Fxpath2rss/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jareware%2Fxpath2rss/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jareware","download_url":"https://codeload.github.com/jareware/xpath2rss/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://
github.com","kind":"github","repositories_count":245562077,"owners_count":20635861,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-29T19:21:35.359Z","updated_at":"2025-03-25T23:32:02.180Z","avatar_url":"https://github.com/jareware.png","language":"PHP","funding_links":[],"categories":[],"sub_categories":[],"readme":"XPath2RSS\n=========\n\nA simple web scraper for querying HTML documents with XPath and turning the results into an RSS feed.\n\nIt's in PHP because it's a good glue for anything web-related, and it uses XPaths because they're awesome to work with.\n\nWhat's it for\n-------------\n\nIt's for keeping up with the updates to those annoying sites that don't provide an RSS feed themselves.  There are some example cases below.\n\nInstalling\n----------\n\nOn a Debian-like system, get the dependencies with:\n\n    $ apt-get install php5-cli php5-curl\n\nThen get yourself a copy of `xpath2rss.php` (might be handy to drop it in your `PATH` somewhere, like under `/usr/bin`).  Feel free to rename it to `xpath2rss` while you're at it if you don't like the extension (the interpreter is specified in the file).\n\nTo see that it's a-OK, try running:\n\n    $ xpath2rss\n\nYou should see a usage message.  PHP 5.3+ is recommended, but the script should run with anything 5.1+.\n\nUsage\n-----\n\nThe command expects a path to a configuration file as its only argument.  The configuration file is a traditional ini-file that specifies what to fetch, the XPath expressions to use etc.  
You can test out a configuration file by running:\n\n    $ xpath2rss --test myconfig.ini\n\nYou'll see some useful info.\n\nThe script is likely most useful when run periodically from a cron-like facility.\n\nConfiguration\n-------------\n\nA configuration file contains the following properties (all required unless marked optional):\n\n* `feed` - Name of the feed.  This will appear as the `\u003ctitle\u003e` of the RSS feed.\n* `url` - URL from which to load the HTML that will be scraped.\n* `file` - Path to an XML file that will host the RSS feed (likely under your webroot somewhere so an RSS reader can access it).\n* `title` - Template for the contents of the `\u003ctitle\u003e` for a single item in the RSS feed.  If this template contains any `%variables%`, they are replaced with the corresponding XPath matches from `[vars]`.\n* `description` - Same as above, but for the `\u003cdescription\u003e` tag.\n* `context` - An (optional) XPath expression to select a context node for any following expressions under `[vars]` below.  Use this to avoid repetition of the same search prefix in multiple variables.  See Examples.\n* `[vars]` - Any number of XPath expressions that will be used to scrape content from the page at `url`.  If the name of the var is `foo`, then it will be usable in the `title` and `description` fields as `%foo%`.  The only mandatory var is `guid`.\n\nNotes\n-----\n\nEach RSS item has a GUID.  Once an item has been added to the feed, an item with the same GUID won't be added again.\n\nThe GUID, along with other optional variables, is specified under the `[vars]` heading of the configuration file.  The content of each variable is determined by its XPath.  
Any `%var%`s found in the `title` and `description` templates of an RSS item are expanded to their values.\n\nExamples\n--------\n\n### A webcomic ###\n\nTo get a feed from one popular webcomic (yes, they already have one), set up an `xkcd.ini` along these lines:\n\n    feed = \"xkcd\"\n    url = \"http://xkcd.com/\"\n    file = \"/path/to/webroot/xkcd.xml\"\n    title = \"%guid%\"\n    description = \"\u003cimg src='%image%' /\u003e \u003cp\u003e%text%\u003c/p\u003e\"\n    \n    [vars]\n    \n    guid = \"//div[@id='middleContent']//img/@alt\"\n    image = \"//div[@id='middleContent']//img/@src\"\n    text = \"//div[@id='middleContent']//img/@title\"\n\nAnd run:\n\n    $ xpath2rss --test xkcd.ini\n\nYou should see the name of the latest comic as the `guid` and the other vars populated as well.  The `\u003cp\u003e%text%\u003c/p\u003e` has the added benefit of making the image title text readable on devices without a cursor (say, a phone).\n\n### Episodic YouTube-content ###\n\nSome good stuff on YouTube doesn't have its own channel (from which you could get a feed directly).  To scrape a feed from the search page, you could do something like:\n\n    feed = \"When Cheese Fails\"\n    url = \"http://www.youtube.com/results?search_type=videos\u0026search_query=when+cheese+fails\u0026search_sort=video_date_uploaded\"\n    file = \"/path/to/webroot/whencheesefails.xml\"\n    title = \"%guid%\"\n    description = \"\u003ca href='http://www.youtube.com%link%'\u003eView on YouTube\u003c/a\u003e\"\n    context = \"//div[@id='search-results']//a[ contains(@title, 'Season') and contains(@title, 'Episode') ]\"\n    \n    [vars]\n    \n    guid = \"@title\"\n    link = \"@href\"\n\nThis works because the search results are ordered newest first, and the XPath expressions will always use the first match if multiple are found.  Also, since the search query is a bit long-winded, we use the optional `context` option to first select the matching context node.  
After that, any `[vars]` we declare will use that node as their context.  Note that the same could have been done with the webcomic example.\n\nSee also\n--------\n\n 1. http://www.w3.org/TR/xpath/ - XPath syntax\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjareware%2Fxpath2rss","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjareware%2Fxpath2rss","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjareware%2Fxpath2rss/lists"}