https://github.com/colcarroll/feed_seeker
Find rss, atom, xml, and rdf feeds on webpages
- Host: GitHub
- URL: https://github.com/colcarroll/feed_seeker
- Owner: ColCarroll
- License: mit
- Created: 2018-01-05T04:12:04.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2018-01-08T21:37:58.000Z (over 8 years ago)
- Last Synced: 2025-01-30T08:29:46.438Z (over 1 year ago)
- Language: Python
- Size: 20.5 KB
- Stars: 0
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.rst
- License: LICENSE
README
===========
Feed Seeker
===========
*It slant rhymes with "heat seeker"*
|Build Status| |Coverage|
A library for finding Atom, RSS, RDF, and XML feeds on web pages. Produced at the mediacloud project. It is an incremental improvement over feedfinder2, which was itself based on feedfinder, written by Mark Pilgrim and maintained by Aaron Swartz until his untimely death.
Quickstart
==========
By default, the library uses :code:`requests` to fetch a page's HTML and inspects it to find the most
likely feed URL:

.. code-block:: python

    >>> from feed_seeker import find_feed_url
    >>> find_feed_url('https://github.com/ColCarroll/feed_seeker')
    'https://github.com/ColCarroll/feed_seeker/commits/master.atom'

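
For illustration, one common way a page advertises its feeds is through :code:`<link rel="alternate">` tags in its HTML head. The sketch below collects those links using only the standard library. It is not the implementation used by :code:`feed_seeker`: the names :code:`FeedLinkParser`, :code:`autodiscover_feed_links`, and :code:`FEED_TYPES` are hypothetical, and the sample page is made up.

.. code-block:: python

    from html.parser import HTMLParser
    from urllib.parse import urljoin

    # MIME types commonly advertised for feeds (an assumption for this sketch).
    FEED_TYPES = {
        "application/rss+xml",
        "application/atom+xml",
        "application/rdf+xml",
        "application/xml",
        "text/xml",
    }


    class FeedLinkParser(HTMLParser):
        """Hypothetical helper: collect hrefs of <link rel="alternate"> feed tags."""

        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.feed_urls = []

        def handle_starttag(self, tag, attrs):
            if tag != "link":
                return
            attributes = dict(attrs)
            rel = (attributes.get("rel") or "").lower().split()
            link_type = (attributes.get("type") or "").lower()
            href = attributes.get("href")
            if "alternate" in rel and link_type in FEED_TYPES and href:
                # Resolve relative hrefs against the page URL.
                self.feed_urls.append(urljoin(self.base_url, href))


    def autodiscover_feed_links(html, base_url):
        """Return absolute URLs of the feed links advertised in the HTML."""
        parser = FeedLinkParser(base_url)
        parser.feed(html)
        return parser.feed_urls


    # A made-up page with one feed link and one unrelated stylesheet link.
    page = """
    <html><head>
    <link rel="alternate" type="application/atom+xml" href="/atom.xml">
    <link rel="stylesheet" type="text/css" href="/style.css">
    </head><body></body></html>
    """
    print(autodiscover_feed_links(page, "https://example.com/blog/"))

Run as written, this prints a list containing only :code:`https://example.com/atom.xml`; the stylesheet link is skipped because its type is not in :code:`FEED_TYPES`.
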
To do a more thorough search, use :code:`generate_feed_urls`, which yields candidate URLs in order, most likely first.

.. code-block:: python

    >>> from feed_seeker import generate_feed_urls
    >>> for url in generate_feed_urls('https://xkcd.com'):
    ...     print(url)
    ...
    https://xkcd.com/atom.xml
    https://xkcd.com/rss.xml

For the most thorough search, add a :code:`spider` argument to do depth-first spidering of URLs on the same hostname. Note that the call below takes nearly four minutes, compared to 0.5 seconds for :code:`find_feed_url`.

.. code-block:: python

    >>> for url in generate_feed_urls('https://github.com/ColCarroll/feed_seeker', spider=1):
    ...     print(url)
    ...
    https://github.com/ColCarroll/feed_seeker/commits/master.atom
    https://github.com/ColCarroll/feed_seeker/commits/a8f7b86eac2cedd9209ac5d2ddcceb293d2404c9.atom
    https://github.com/ColCarroll/feed_seeker/commits/3b5245b46a10fb3647a1f08b8e584b471683fbbd.atom
    https://github.com/ColCarroll/feed_seeker/commits/659311b8853c4c4a67e3b4bc67a78461d825a064.atom
    https://github.com/ColCarroll/feed_seeker/commits/3e93490cb91f7652325c2fe41ef29a5be4558d6a.atom
    https://github.com/index.atom
    https://github.com/articles.atom
    https://github.com/dfm/feedfinder2/commits/master.atom
    https://github.com/ColCarroll.atom
    https://github.com/blog.atom
    https://github.com/blog/all.atom
    https://github.com/blog/broadcasts.atom

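
To make the idea of depth-first spidering on a single hostname concrete, here is a minimal sketch that walks a made-up, in-memory link graph instead of the network. It assumes the spider value acts as a depth limit, which the text above does not confirm; :code:`spider_same_host`, :code:`LINKS`, and the example URLs are hypothetical and not part of :code:`feed_seeker`.

.. code-block:: python

    from urllib.parse import urlparse

    # Hypothetical in-memory "web": page URL -> links found on that page.
    LINKS = {
        "https://example.com/": ["https://example.com/blog", "https://other.org/"],
        "https://example.com/blog": [
            "https://example.com/blog/feed.xml",
            "https://example.com/",
        ],
        "https://example.com/blog/feed.xml": [],
        "https://other.org/": ["https://other.org/rss"],
    }


    def spider_same_host(url, max_depth, get_links=LINKS.get, seen=None):
        """Yield url, then same-host pages within max_depth links, depth first."""
        seen = set() if seen is None else seen
        if url in seen:
            return
        seen.add(url)
        yield url
        if max_depth == 0:
            return
        host = urlparse(url).hostname
        for link in get_links(url) or []:
            # Follow only links that stay on the starting page's hostname.
            if urlparse(link).hostname == host:
                yield from spider_same_host(link, max_depth - 1, get_links, seen)


    print(list(spider_same_host("https://example.com/", max_depth=1)))
    print(list(spider_same_host("https://example.com/", max_depth=2)))

With a depth of 1 the walk visits the start page and :code:`/blog`; with a depth of 2 it also reaches :code:`/blog/feed.xml`. The link to :code:`other.org` is never followed, and the :code:`seen` set keeps the back-link to the start page from causing a loop.
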
Installation
------------
The library is not yet available on PyPI, so for now it can only be installed from GitHub:

.. code-block:: bash

    pip install git+https://github.com/ColCarroll/feed_seeker

Differences from :code:`feedfinder2`
====================================
The biggest difference is that all functions are implemented as generators and are evaluated lazily. Each candidate feed link is actually fetched and inspected to determine whether it is a feed, which can be quite time-consuming. We expose one function to find the most likely feed link, and another to lazily generate links in rough order from most prominent to least.
There are also a few more heuristics based on our experience at mediacloud.
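
The benefit of lazy evaluation can be sketched without touching the network. In the example below, :code:`is_feed` is a hypothetical stand-in for the expensive step of fetching and inspecting a URL (here it only looks at the file extension), and :code:`lazy_feed_urls` is an illustrative generator, not a function from this library.

.. code-block:: python

    checked = []  # records which candidates were actually inspected


    def is_feed(url):
        """Hypothetical stand-in for fetching a URL and inspecting the response."""
        checked.append(url)
        return url.endswith((".xml", ".atom", ".rss"))


    def lazy_feed_urls(candidates):
        """Yield candidates that pass the check, inspecting each only on demand."""
        for url in candidates:
            if is_feed(url):
                yield url


    candidates = [
        "https://example.com/about",
        "https://example.com/atom.xml",
        "https://example.com/rss.xml",
        "https://example.com/contact",
    ]

    # Asking for a single result stops the work at the first confirmed feed.
    first = next(lazy_feed_urls(candidates), None)
    print(first)
    print(checked)

Only the first two candidates end up in :code:`checked`: the generator stops as soon as the caller has the one result it asked for, so the remaining URLs are never inspected.
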

.. |Build Status| image:: https://travis-ci.org/ColCarroll/feed_seeker.png?branch=master
   :target: https://travis-ci.org/ColCarroll/feed_seeker
.. |Coverage| image:: https://coveralls.io/repos/github/ColCarroll/feed_seeker/badge.svg?branch=master
   :target: https://coveralls.io/github/ColCarroll/feed_seeker?branch=master