https://github.com/igrishaev/remus

Attentive RSS/Atom feed parser for Clojure
atom clojure feed http rss

# Remus

[rome-site]: https://rometools.github.io/rome/

[clj-http]: https://github.com/dakrone/clj-http

An attentive RSS and Atom feed parser for Clojure. It's built on top of the
well-known and powerful [ROME Tools][rome-site] Java library. Remus deals with
weird encodings and non-standard XML tags, and it fetches as much
information from a feed as possible.

![](art/romulus-remus.jpg)

# Table of Contents

- [Benefits](#benefits)
- [Installation](#installation)
- [Usage](#usage)
  * [Parsing a URL](#parsing-a-url)
  * [Parsing a file](#parsing-a-file)
  * [Parsing an input stream](#parsing-an-input-stream)
- [HTTP tweaks](#http-tweaks)
  * [Errors and exceptions](#errors-and-exceptions)
  * [Saving headers](#saving-headers)
- [Non-standard tags](#non-standard-tags)
- [Encoding issues](#encoding-issues)
- [License](#license)

## Benefits

- Gets all the known fields from a feed and turns them into plain Clojure data
  structures;
- relies on an up-to-date ROME release;
- uses the power of the [clj-http][clj-http] client instead of the deprecated
  ROME Fetcher;
- preserves all the non-standard XML tags for further processing (see the
  example below).

## Installation

Leiningen/Boot:

```clojure
[remus "0.2.4"]
```

Clojure CLI/deps.edn:

```clojure
remus/remus {:mvn/version "0.2.4"}
```

## Usage

The library provides a one-word top namespace `remus` so it's easier to
remember.

```clojure
(ns your.project
  (:require [remus :refer [parse-url parse-file]]))
```

or:

```clojure
(require '[remus :refer [parse-url parse-file]])
```

### Parsing a URL

Let's parse [Planet Clojure](http://planet.clojure.in/):

```clojure
(def result (parse-url "http://planet.clojure.in/atom.xml"))
```

The variable `result` holds a map with two keys, `:response` and `:feed`: the
HTTP response and the parsed feed. Below is a truncated version of the feed:

```clojure
(def feed (:feed result))

(println feed)

;;;;
;; just a small subset
;;;;

{:description nil,
 :feed-type "atom_1.0"
 :entries
 [{:description nil
   :updated-date #inst "2018-08-13T10:00:00.000-00:00"
   :extra {:tag :extra, :attrs nil, :content ()}
   :title
   "PurelyFunctional.tv Newsletter 287: DataScript, GraphQL, CRDTs"
   :author "Eric Normand"
   :link
   "https://purelyfunctional.tv/issues/purelyfunctional-tv-newsletter-287-datascript-graphql-crdts/"
   :uri "https://purelyfunctional.tv/?p=28660"
   :contents
   ({:type "html"
     :mode nil
     :value
     ;; HTML markup omitted, text truncated
     "... Issue 287 August 13, 2018 Archives Subscribe ... Hi Clojurationists, ... I've just been digging this lovely tweet from Alex Miller. ... Rock on! Eric Normand ..."})}]}
```

As for the HTTP response, it's the same response map that [clj-http][clj-http]
request functions such as `clj-http.client/get` return. You might need that data
to save some of the HTTP headers for further requests (see below).

### Parsing a file

```clojure
(def feed (parse-file "/path/to/some/atom.xml"))
```

Unlike `parse-url`, this function returns just the parsed feed, without an HTTP
response.

### Parsing an input stream

Just in case you're getting a feed from a stream, here is a function for that:

```clojure
(def feed (parse-stream (clojure.java.io/input-stream some-source)))
```

Like `parse-file`, it returns a parsed feed as a data structure.
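
Whichever function produced it, the parsed feed is plain Clojure data, so the
usual sequence functions apply. Here is a minimal sketch that summarizes
entries. The `entry-summaries` helper and the `sample-feed` map are hypothetical
names made up for this illustration (they are not part of Remus); the keys
`:entries`, `:title`, `:author` and `:link` are the ones shown in the truncated
output above.

```clojure
;; Hypothetical sample shaped like the truncated feed shown above.
(def sample-feed
  {:feed-type "atom_1.0"
   :entries
   [{:title  "PurelyFunctional.tv Newsletter 287: DataScript, GraphQL, CRDTs"
     :author "Eric Normand"
     :link   "https://purelyfunctional.tv/issues/purelyfunctional-tv-newsletter-287-datascript-graphql-crdts/"}]})

(defn entry-summaries
  "Hypothetical helper (not part of Remus): returns a vector of maps
  holding only the :title, :author and :link keys of each feed entry."
  [feed]
  (mapv #(select-keys % [:title :author :link]) (:entries feed)))

(entry-summaries sample-feed)
```

With a real feed, you would pass `(:feed result)` from `parse-url`, or the value
returned by `parse-file` or `parse-stream`, instead of `sample-feed`.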

## HTTP tweaks

Since Remus relies on the [clj-http][clj-http] library for HTTP communication,
you are welcome to use all its features, for example to control redirects,
security validation, authentication, etc. When calling `parse-url`, pass an
optional map with HTTP parameters:

```clojure
;; Do not check an untrusted SSL certificate.
(parse-url "http://planet.clojure.in/atom.xml"
           {:insecure? true})

;; Parse a user/pass protected HTTP resource.
(parse-url "http://planet.clojure.in/atom.xml"
           {:basic-auth ["username" "password"]})

;; Pretend to be a browser. Some sites restrict access by the "User-Agent" header.
(parse-url "http://planet.clojure.in/atom.xml"
           {:headers {"User-Agent" "Mozilla/5.0 (Macintosh; Intel Mac...."}})
```

Remus overrides **just one option** which is `:as`. No matter what you put into
it, the value becomes `:stream`. We need a streamed HTTP response because ROME
relies on an input stream.

### Errors and exceptions

It's up to you how to deal with non-200 HTTP responses. Even if you pass
`{:throw-exceptions false}`, the feed will only be parsed when the status code
is 200.

```clojure
(let [result (parse-url "http://example.com/non-existing-url"
                        {:throw-exceptions false})
      {:keys [response feed]} result]
  (when-not feed
    (process-non-200 response)))
```

Or just skip the `:throw-exceptions` flag and wrap everything into the standard
`try/catch` form:

```clojure
(try
  (parse-url "http://non-existing-url")
  (catch clojure.lang.ExceptionInfo e
    (let [response (ex-data e)
          {:keys [status headers]} response]
      (println status headers)
      ;; do anything you want
      )))
```

[clj-http-ex]:https://github.com/dakrone/clj-http#exceptions

[slingshot]: https://github.com/scgilardi/slingshot

Alternatively, you may use the [Slingshot][slingshot] approach to catch
exceptions thrown by the HTTP client, as the [official manual][clj-http-ex]
describes.
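
If you go the `{:throw-exceptions false}` route, it may help to reduce the
result to a single keyword before deciding what to do next. The sketch below is
only an illustration: `classify-result` is a hypothetical helper (not part of
Remus), and the sample maps stand in for real `parse-url` results. It assumes
nothing beyond the `:response` and `:feed` keys described above and the
`:status` key of a clj-http response.

```clojure
(defn classify-result
  "Hypothetical helper (not part of Remus): turns the map returned by
  `parse-url` into a keyword describing what happened."
  [{:keys [response feed]}]
  (let [status (:status response)]
    (cond
      feed                :parsed       ; the feed was parsed
      (nil? status)       :no-response  ; nothing to inspect
      (<= 400 status 599) :http-error   ; client or server error
      :else               :not-parsed))) ; any other non-200 status

;; Sample maps standing in for real `parse-url` results.
(classify-result {:response {:status 200} :feed {:entries []}})
;; => :parsed

(classify-result {:response {:status 404} :feed nil})
;; => :http-error
```

A `case` over the returned keyword then keeps the branching for error handling
in one place.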

### Saving headers

[cond-get]: https://fishbowl.pastiche.org/2002/10/21/http_conditional_get_for_rss_hackers

When parsing a URL, it's a good idea to pass the `If-None-Match` and
`If-Modified-Since` headers with the values of the `ETag` and `Last-Modified`
headers from the previous response. This trick is known as [conditional
GET][cond-get]. It might prevent the server from sending data you've already
received:

```clojure
;; returns the whole feed
(def result (parse-url "http://planet.lisp.org/rss20.xml"))

;; split the result
(def feed (:feed result))
(def response (:response result))

;; ensure we got the data
(:length response)
48082

;; save the headers
(def etag (-> response :headers :etag))
;; "5b71766f-2f597"

(def last-modified (-> response :headers :last-modified))
;; Mon, 19 Oct 2020 12:15:27 GMT

;;;;
;; Now, try to fetch data passing conditional headers:
;;;;

(def result-new
  (parse-url "http://planet.lisp.org/rss20.xml"
             {:headers {"If-None-Match" etag
                        "If-Modified-Since" last-modified}}))

(-> result-new :response :status)
304

(-> result-new :response :length)
0

(-> result-new :feed)
nil
```

Since the server returned a non-200 but non-error status code (304 Not Modified
in our case), Remus doesn't parse the response at all, so the `:feed` field in
the `result-new` variable is `nil`.
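
The header bookkeeping can be wrapped into a small function. The following is a
sketch under assumptions, not a Remus feature: `conditional-headers` is a
hypothetical helper, and the sample response is a plain map that mimics the
`:etag` and `:last-modified` lookups used in the example above.

```clojure
(defn conditional-headers
  "Hypothetical helper (not part of Remus): builds request headers for a
  conditional GET from a previous response map. Headers missing from the
  previous response are left out."
  [response]
  (let [headers       (:headers response)
        etag          (:etag headers)
        last-modified (:last-modified headers)]
    (cond-> {}
      etag          (assoc "If-None-Match" etag)
      last-modified (assoc "If-Modified-Since" last-modified))))

;; A plain map standing in for a previous response.
(def previous-response
  {:headers {:etag          "5b71766f-2f597"
             :last-modified "Mon, 19 Oct 2020 12:15:27 GMT"}})

(conditional-headers previous-response)
;; => {"If-None-Match" "5b71766f-2f597",
;;     "If-Modified-Since" "Mon, 19 Oct 2020 12:15:27 GMT"}

;; Intended use (requires Remus and network access):
;; (parse-url url {:headers (conditional-headers previous-response)})
```

When the previous response carries neither header, the helper returns an empty
map, so the next request is a regular, unconditional GET.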

## Non-standard tags

[youtube-rss]: https://www.youtube.com/feeds/videos.xml?channel_id=UCaLlzGqiPE2QRj6sSOawJRg

Sometimes, a feed ships additional data with non-standard tags. A good example
might be a typical [YouTube feed][youtube-rss]. Let's examine one of its
entries:

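```xml
<entry>
  <id>yt:video:TbthtdBw93w</id>
  <yt:videoId>TbthtdBw93w</yt:videoId>
  <yt:channelId>UCaLlzGqiPE2QRj6sSOawJRg</yt:channelId>
  <title>Datomic Ions in Seven Minutes</title>
  <link rel="alternate" href="..."/>
  <author>
    <name>ClojureTV</name>
    <uri>https://www.youtube.com/channel/UCaLlzGqiPE2QRj6sSOawJRg</uri>
  </author>
  <published>2018-07-03T21:16:16+00:00</published>
  <updated>2018-08-09T16:29:51+00:00</updated>
  <media:group>
    <media:title>Datomic Ions in Seven Minutes</media:title>
    <media:content url="..." type="..." width="..." height="..."/>
    <media:thumbnail url="..." width="..." height="..."/>
    <media:description>Stuart Halloway introduces Ions for Datomic Cloud on AWS.</media:description>
    <media:community>
      <media:starRating count="..." average="..." min="..." max="..."/>
      <media:statistics views="..."/>
    </media:community>
  </media:group>
</entry>
```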

In addition to the standard fields, the feed carries information about the video
ID, channel ID and statistics: views count, the number of times the video was
starred and its average rating. You would probably want to use that data.

Similarly, if you parse a geo-related feed, you'll get lat/lon coordinates,
location names, tracks, etc.

Other RSS parsers either drop this data or require you to write a custom
extension. `Remus` provides all the non-standard tags as a parsed XML
structure. It puts that data into an `:extra` field for each entry and on the
top level of a feed. This is how you can reach it:

```clojure
(def result (parse-url "https://www.youtube.com/feeds/videos.xml?channel_id=UCaLlzGqiPE2QRj6sSOawJRg"))

(def feed (:feed result))

;;;;
;; Get entry-specific custom data
;;;;

;; Extra data from the first entry:
(-> feed :entries first :extra)

{:tag :rome/extra
 :attrs nil
 :content
 ({:tag :yt/videoId :attrs nil :content ["faoXSarGgEI"]}
  {:tag :yt/channelId :attrs nil :content ["UCaLlzGqiPE2QRj6sSOawJRg"]}
  {:tag :media/group
   :attrs nil
   :content
   ({:tag :media/title :attrs nil :content ["Datomic Cloud - Datoms"]}
    {:tag :media/content
     :attrs
     {:url "https://www.youtube.com/v/faoXSarGgEI?version=3"
      :type "application/x-shockwave-flash"
      :width "640"
      :height "390"}
     :content nil}
    {:tag :media/thumbnail
     :attrs
     {:url "https://i3.ytimg.com/vi/faoXSarGgEI/hqdefault.jpg"
      :width "480"
      :height "360"}
     :content nil}
    {:tag :media/description
     :attrs nil
     :content
     ["Check out the live animated tutorial: https://docs.datomic.com/cloud/livetutorial/datoms.html\n\nYour Datomic database consists of datoms. What are Datoms?"]}
    {:tag :media/community
     :attrs nil
     :content
     ({:tag :media/starRating
       :attrs {:count "72" :average "5.00" :min "1" :max "5"}
       :content nil}
      {:tag :media/statistics :attrs {:views "2014"} :content nil})})})}

;;;;
;; Get feed-specific extra:
;;;;

(-> feed :extra)

{:tag :rome/extra
 :attrs nil
 :content
 ({:tag :yt/channelId :attrs nil :content ["UCaLlzGqiPE2QRj6sSOawJRg"]})}
```

The `:extra` fields follow the standard XML-friendly structure so they can be
processed with any XML-related techniques such as walking, zippers, etc.
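
As one possible way of walking that structure, here is a minimal sketch built on
`tree-seq` from `clojure.core`. The helpers `find-by-tag` and `views-count` are
hypothetical (they are not part of Remus), and `sample-extra` is a trimmed copy
of the output above, written with `list` so that it can be evaluated as code.

```clojure
;; Trimmed copy of the entry-level :extra value shown above.
(def sample-extra
  {:tag :rome/extra
   :attrs nil
   :content
   (list
    {:tag :yt/videoId :attrs nil :content ["faoXSarGgEI"]}
    {:tag :media/group
     :attrs nil
     :content
     (list
      {:tag :media/community
       :attrs nil
       :content
       (list
        {:tag :media/starRating
         :attrs {:count "72" :average "5.00" :min "1" :max "5"}
         :content nil}
        {:tag :media/statistics :attrs {:views "2014"} :content nil})})})})

(defn find-by-tag
  "Hypothetical helper (not part of Remus): returns a lazy sequence of all
  nodes with the given tag found anywhere inside an :extra tree."
  [extra tag]
  (->> (tree-seq map? :content extra)
       ;; text children are strings, so keep only element maps
       (filter #(and (map? %) (= tag (:tag %))))))

(defn views-count
  "Returns the :views attribute of the first media:statistics node as a
  string, or nil when there is no such node."
  [extra]
  (-> (find-by-tag extra :media/statistics) first :attrs :views))

(views-count sample-extra)
;; => "2014"

(-> (find-by-tag sample-extra :yt/videoId) first :content first)
;; => "faoXSarGgEI"
```

The attribute values are strings, so convert them (for example with
`parse-long` or `Long/parseLong`) if you need numbers.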

## Encoding issues

All the `parse-` functions mentioned above take additional
ROME-related options. Use them to solve XML-decoding issues when dealing with
weird or non-set HTTP headers. ROME's got a solid algorithm to guess encoding,
but sometimes it might need your help.

At the moment, Remus supports the `:lenient`, `:encoding` and `:content-type`
options, which have the following meanings:

- `:lenient`: a boolean flag which makes ROME more tolerant of some mistakes
  in XML markup;

- `:encoding`: a string which represents the encoding of the feed. When parsing
  a URL, it comes from the `Content-Encoding` HTTP header. Possible values are
  listed here: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html

- `:content-type`: a string with the MIME type of the feed,
  e.g. `application/rss+xml`. When parsing a URL, it comes from the
  `Content-Type` header.

Dealing with a Windows encoding and unset `Content-Type` or `Content-Encoding`
headers:

```clojure
(parse-url "https://some/rss.xml" nil {:lenient true :encoding "cp1251"})
```

The same options work for parsing a file or a stream:

```clojure
(parse-file "/path/to/another/atom.xml" {:lenient true :encoding "cp1251"})

(parse-stream in-source {:lenient true :encoding "cp1251"})
```

## License

Copyright © 2020 Ivan Grishaev

Distributed under the Eclipse Public License either version 1.0 or (at your
option) any later version.