https://github.com/igrishaev/remus

Attentive RSS/Atom feed parser for Clojure
Host: GitHub
URL: https://github.com/igrishaev/remus
Owner: igrishaev
License: epl-1.0
Created: 2018-08-11T17:18:47.000Z (almost 7 years ago)
Default Branch: master
Last Pushed: 2023-02-17T10:19:43.000Z (about 2 years ago)
Last Synced: 2025-04-17T05:10:50.027Z (about 1 month ago)
Topics: atom, clojure, feed, http, rss
Language: Clojure
Homepage:
Size: 398 KB
Stars: 62
Watchers: 4
Forks: 2
Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
README

# Remus

[rome-site]: https://rometools.github.io/rome/

[clj-http]: https://github.com/dakrone/clj-http

An attentive RSS and Atom feed parser for Clojure. It's built on top of the
well-known and powerful [ROME Tools][rome-site] Java library. Remus deals with
weird encoding and non-standard XML tags. The library fetches as much
information from a feed as possible.

![](art/romulus-remus.jpg)

# Table of Contents

- [Benefits](#benefits)

- [Installation](#installation)

- [Usage](#usage)

  * [Parsing a URL](#parsing-a-url)

  * [Parsing a file](#parsing-a-file)

  * [Parsing an input stream](#parsing-an-input-stream)

- [HTTP tweaks](#http-tweaks)

  * [Errors and exceptions](#errors-and-exceptions)

  * [Saving headers](#saving-headers)

- [Non-standard tags](#non-standard-tags)

- [Encoding issues](#encoding-issues)

- [License](#license)

## Benefits

- Gets all the known fields from a feed and turns them into plain Clojure data
  structures;
- relies on an up-to-date ROME release;
- uses the power of the [clj-http][clj-http] client instead of the deprecated
  ROME Fetcher;
- preserves all the non-standard XML tags for further processing (see the
  example below).

## Installation

Leiningen/Boot:

```clojure

[remus "0.2.4"]

```

Clojure CLI/deps.edn:

```clojure

remus/remus {:mvn/version "0.2.4"}

```

## Usage

The library provides a one-word top namespace `remus` so it's easier to

remember.

```clojure

(ns your.project

  (:require [remus :refer [parse-url parse-file]]))

```

or:

```clojure

(require '[remus :refer [parse-url parse-file]])

```

### Parsing a URL

Let's parse [Planet Clojure](http://planet.clojure.in/):

```clojure

(def result (parse-url "http://planet.clojure.in/atom.xml"))

```

The variable `result` is a map with two keys, `:response` and `:feed`: the HTTP
response and the parsed feed. Below is a truncated version of the feed:

```clojure

(def feed (:feed result))

(println feed)

;;;;

;; just a small subset

;;;;

{:description nil,

 :feed-type "atom_1.0"

 :entries

 [{:description nil

   :updated-date #inst "2018-08-13T10:00:00.000-00:00"

   :extra {:tag :extra, :attrs nil, :content ()}

   :title

   "PurelyFunctional.tv Newsletter 287: DataScript, GraphQL, CRDTs"

   :author "Eric Normand"

   :link

   "https://purelyfunctional.tv/issues/purelyfunctional-tv-newsletter-287-datascript-graphql-crdts/"

   :uri "https://purelyfunctional.tv/?p=28660"

   :contents

   ({:type "html"

     :mode nil

     :value
     "
\nIssue 287 August 13, 2018 Archives Subscribe
\nHi Clojurationists,
\nI've just been digging this lovely tweet from Alex Miller.
\nRock on!
Eric Normand <..."})}]}
```

As for the HTTP response, it's the same data structure that the
`clj-http.client/get` function returns. You might need that data to save
some of the HTTP headers for further requests (see below).

### Parsing a file

```clojure

(def feed (parse-file "/path/to/some/atom.xml"))

```

This function just returns a parsed feed.

### Parsing an input stream

Just in case you're getting a feed from a stream, here is a function for that:

```clojure

(def feed (parse-stream (clojure.java.io/input-stream some-source)))

```

Like `parse-file`, it returns a parsed feed as a data structure.
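
Since a parsed feed is plain Clojure data, the usual sequence functions apply to
it. Here is a minimal sketch that pulls a few fields out of every entry. The
`entry-summaries` helper and the `sample-feed` map are made up for illustration;
only the `:entries`, `:title`, `:link` and `:author` keys come from the output
shown above.

```clojure
(defn entry-summaries
  "Returns a vector of maps with the :title, :link and :author of each entry."
  [feed]
  (mapv #(select-keys % [:title :link :author])
        (:entries feed)))

;; A hand-written stand-in for a parsed feed, not real parser output.
(def sample-feed
  {:feed-type "atom_1.0"
   :entries [{:title "First post"
              :link "https://example.com/1"
              :author "Ann"
              :extra nil}
             {:title "Second post"
              :link "https://example.com/2"
              :author "Bob"
              :extra nil}]})

(entry-summaries sample-feed)
;; => [{:title "First post" :link "https://example.com/1" :author "Ann"}
;;     {:title "Second post" :link "https://example.com/2" :author "Bob"}]
```

With a real feed, you would call `(entry-summaries (:feed result))` or pass the
value returned by `parse-file` or `parse-stream`.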

## HTTP tweaks

Since `Remus` relies on the [clj-http][clj-http] library for HTTP communication,
you are welcome to use all its features, for example to control redirects,
security validation, authentication, etc. When calling `parse-url`, pass an
optional map with HTTP parameters:

```clojure

;; Do not check an untrusted SSL certificate.

(parse-url "http://planet.clojure.in/atom.xml"

           {:insecure? true})

;; Parse a user/pass protected HTTP resource.

(parse-url "http://planet.clojure.in/atom.xml"

           {:basic-auth ["username" "password"]})

;; Pretend to be a browser. Some sites restrict access by the "User-Agent" header.

(parse-url "http://planet.clojure.in/atom.xml"

           {:headers {"User-Agent" "Mozilla/5.0 (Macintosh; Intel Mac...."}})

```

Remus overrides **just one option** which is `:as`. No matter what you put into

it, the value becomes `:stream`. We need a streamed HTTP response because ROME

relies on an input stream.
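
If you parse many feeds, it may be handy to keep shared HTTP options in one
place. Below is a small sketch of that idea, not a part of Remus. The
`default-http-opts` map and the `http-opts` function are hypothetical names, the
values are arbitrary, and the `:socket-timeout` and `:connection-timeout` keys
are assumed to match the clj-http version you use.

```clojure
;; Hypothetical defaults; adjust the values to your needs.
(def default-http-opts
  {:socket-timeout 5000
   :connection-timeout 5000
   :headers {"User-Agent" "my-feed-reader/0.1"}})

(defn http-opts
  "Merges the caller's options over the defaults. Nested maps such as
  :headers are merged key by key; any other value is simply replaced."
  [opts]
  (merge-with (fn [default custom]
                (if (and (map? default) (map? custom))
                  (merge default custom)
                  custom))
              default-http-opts
              opts))

(http-opts {:basic-auth ["username" "password"]
            :headers {"Accept" "application/atom+xml"}})
;; => {:socket-timeout 5000
;;     :connection-timeout 5000
;;     :headers {"User-Agent" "my-feed-reader/0.1"
;;               "Accept" "application/atom+xml"}
;;     :basic-auth ["username" "password"]}
```

The resulting map goes into `parse-url` as the second argument, e.g.
`(parse-url url (http-opts {:insecure? true}))`.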

### Errors and exceptions

It's up to you how to deal with non-200 HTTP responses. Even if you pass
`{:throw-exceptions false}`, the feed will only be parsed when the status code
is 200.

```clojure

(let [result (parse-url "http://example.com/non-existing-url"
                        {:throw-exceptions false})
      {:keys [response feed]} result]
  (when-not feed
    (process-non-200 response)))

```

Or just skip the `:throw-exceptions` flag and wrap everything into the standard

`try/catch` form:

```clojure

(try

  (parse-url "http://non-existing-url")

  (catch clojure.lang.ExceptionInfo e

    (let [response (ex-data e)

          {:keys [status headers]} response]

      (println status headers)

      ;; do anything you want

      )))

```

[clj-http-ex]:https://github.com/dakrone/clj-http#exceptions

[slingshot]: https://github.com/scgilardi/slingshot

Alternatively, you may use the [Slingshot][slingshot] approach to catch
HTTP-thrown exceptions as the [official manual][clj-http-ex] describes.
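
When exceptions are turned off, you still need to decide what a result without a
feed means. The sketch below sorts a result map into a few rough categories by
its status code. The `classify-result` function and its keywords are
hypothetical; only the `:response`, `:feed` and `:status` keys come from the
examples above.

```clojure
(defn classify-result
  "Tags a result map (with the :response and :feed keys) by its outcome."
  [{:keys [response feed]}]
  (let [status (:status response)]
    (cond
      feed                :parsed
      (nil? status)       :unknown
      (<= 400 status 499) :client-error
      (<= 500 status 599) :server-error
      :else               :not-parsed)))

;; Hand-written result maps standing in for real ones:
(classify-result {:response {:status 200} :feed {:entries []}})
;; => :parsed
(classify-result {:response {:status 404} :feed nil})
;; => :client-error
(classify-result {:response {:status 503} :feed nil})
;; => :server-error
```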

### Saving headers

[cond-get]: https://fishbowl.pastiche.org/2002/10/21/http_conditional_get_for_rss_hackers

When parsing a URL, it's a good idea to pass the `If-None-Match` and
`If-Modified-Since` headers with the values of the `ETag` and `Last-Modified`
headers from the previous response. This trick is known as [conditional
GET][cond-get]. It might prevent the server from sending data you've already
received:

```clojure

;; returns the whole feed

(def result (parse-url "http://planet.lisp.org/rss20.xml"))

;; split the result

(def feed (:feed result))

(def response (:response result))

;; ensure we got the data

(:length response)

48082

;; save the headers

(def etag (-> response :headers :etag))

;; "5b71766f-2f597"

(def last-modified (-> response :headers :last-modified))

;; Mon, 19 Oct 2020 12:15:27 GMT

;;;;

;; Now, try to fetch data passing conditional headers:

;;;;

(def result-new

  (parse-url "http://planet.lisp.org/rss20.xml"

             {:headers {"If-None-Match" etag

                        "If-Modified-Since" last-modified}}))

(-> result-new :response :status)

304

(-> result-new :response :length)

0

(-> result-new :feed)

nil

```

Since the server returned a non-200 yet non-error status code (304 in our
case), we don't parse the response at all. So the `:feed` field in the
`result-new` variable will be `nil`.
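
The header juggling above can be wrapped into a tiny function. This is only a
sketch: `conditional-headers` is a hypothetical helper, and it assumes the
previous response exposes its headers under the `:etag` and `:last-modified`
keys, as in the session above.

```clojure
(defn conditional-headers
  "Builds request headers for a conditional GET from a previous response.
  A header is left out when the previous response did not carry its source."
  [response]
  (let [etag          (-> response :headers :etag)
        last-modified (-> response :headers :last-modified)]
    (cond-> {}
      etag          (assoc "If-None-Match" etag)
      last-modified (assoc "If-Modified-Since" last-modified))))

;; A hand-written stand-in for a previous response:
(conditional-headers
 {:status 200
  :headers {:etag "\"5b71766f-2f597\""
            :last-modified "Mon, 19 Oct 2020 12:15:27 GMT"}})
;; => {"If-None-Match" "\"5b71766f-2f597\""
;;     "If-Modified-Since" "Mon, 19 Oct 2020 12:15:27 GMT"}
```

Then the next request becomes
`(parse-url url {:headers (conditional-headers response)})`.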

## Non-standard tags

[youtube-rss]: https://www.youtube.com/feeds/videos.xml?channel_id=UCaLlzGqiPE2QRj6sSOawJRg

Sometimes, a feed ships additional data with non-standard tags. A good example

might be a typical [YouTube feed][youtube-rss]. Let's examine one of its

entries:

```xml

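<entry>
  <id>yt:video:TbthtdBw93w</id>
  <yt:videoId>TbthtdBw93w</yt:videoId>
  <yt:channelId>UCaLlzGqiPE2QRj6sSOawJRg</yt:channelId>
  <title>Datomic Ions in Seven Minutes</title>
  <link href="..."/>
  <author>
    <name>ClojureTV</name>
    <uri>
      https://www.youtube.com/channel/UCaLlzGqiPE2QRj6sSOawJRg
    </uri>
  </author>
  <published>2018-07-03T21:16:16+00:00</published>
  <updated>2018-08-09T16:29:51+00:00</updated>
  <media:group>
    <media:title>Datomic Ions in Seven Minutes</media:title>
    <media:content url="..." type="..." width="..." height="..."/>
    <media:thumbnail url="..." width="..." height="..."/>
    <media:description>
      Stuart Halloway introduces Ions for Datomic Cloud on AWS.
    </media:description>
    <media:community>
      <media:starRating count="..." average="..." min="..." max="..."/>
      <media:statistics views="..."/>
    </media:community>
  </media:group>
</entry>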

```

In addition to the standard fields, the feed carries information about the video
ID, channel ID and statistics: the view count, the number of times the video was
starred and its average rating. You would probably want to use that data.
Alternatively, if you parse a geo-related feed, you'll get lat/lon coordinates,
location names, tracks, etc.

Other RSS parsers either drop this data or require you to write a custom

extension. `Remus` provides all the non-standard tags as a parsed XML

structure. It puts that data into an `:extra` field for each entry and on the

top level of a feed. This is how you can reach it:

```clojure

(def result (parse-url "https://www.youtube.com/feeds/videos.xml?channel_id=UCaLlzGqiPE2QRj6sSOawJRg"))

(def feed (:feed result))

;;;;

;; Get entry-specific custom data

;;;;

;; Extra data from the first entry:

(-> feed :entries first :extra)

{:tag :rome/extra

 :attrs nil

 :content

 ({:tag :yt/videoId :attrs nil :content ["faoXSarGgEI"]}

  {:tag :yt/channelId :attrs nil :content ["UCaLlzGqiPE2QRj6sSOawJRg"]}

  {:tag :media/group

   :attrs nil

   :content

   ({:tag :media/title :attrs nil :content ["Datomic Cloud - Datoms"]}

    {:tag :media/content

     :attrs

     {:url "https://www.youtube.com/v/faoXSarGgEI?version=3"

      :type "application/x-shockwave-flash"

      :width "640"

      :height "390"}

     :content nil}

    {:tag :media/thumbnail

     :attrs

     {:url "https://i3.ytimg.com/vi/faoXSarGgEI/hqdefault.jpg"

      :width "480"

      :height "360"}

     :content nil}

    {:tag :media/description

     :attrs nil

     :content

     ["Check out the live animated tutorial: https://docs.datomic.com/cloud/livetutorial/datoms.html\n\nYour Datomic database consists of datoms. What are Datoms?"]}

    {:tag :media/community

     :attrs nil

     :content

     ({:tag :media/starRating

       :attrs {:count "72" :average "5.00" :min "1" :max "5"}

       :content nil}

      {:tag :media/statistics :attrs {:views "2014"} :content nil})})})}

;;;;

;; Get feed-specific extra:

;;;;

(-> feed :extra)

{:tag :rome/extra

 :attrs nil

 :content

 ({:tag :yt/channelId :attrs nil :content ["UCaLlzGqiPE2QRj6sSOawJRg"]})}

```

The `:extra` fields follow the standard XML-friendly structure so they can be
processed with any XML-related techniques like walking, zippers, etc.
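
For instance, the built-in `tree-seq` function is enough to search such a
structure for a tag. The `extra-nodes` and `find-tag` helpers below are a
hypothetical sketch, and `sample-extra` is a shortened, hand-copied piece of the
output above.

```clojure
(defn extra-nodes
  "Returns a lazy sequence of all the XML nodes (maps) nested in `extra`,
  including the root node itself. Text content (strings) is skipped."
  [extra]
  (filter map? (tree-seq map? :content extra)))

(defn find-tag
  "Returns the first node whose :tag equals `tag`, or nil when none matches."
  [extra tag]
  (first (filter #(= tag (:tag %)) (extra-nodes extra))))

(def sample-extra
  {:tag :rome/extra
   :attrs nil
   :content
   (list {:tag :yt/videoId :attrs nil :content ["faoXSarGgEI"]}
         {:tag :media/group
          :attrs nil
          :content
          (list {:tag :media/community
                 :attrs nil
                 :content
                 (list {:tag :media/starRating
                        :attrs {:count "72" :average "5.00" :min "1" :max "5"}
                        :content nil}
                       {:tag :media/statistics
                        :attrs {:views "2014"}
                        :content nil})})})})

(-> (find-tag sample-extra :media/statistics) :attrs :views)
;; => "2014"

(-> (find-tag sample-extra :yt/videoId) :content first)
;; => "faoXSarGgEI"
```

Note that the attribute values are strings, so parse them into numbers yourself
when needed.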

## Encoding issues

All the `parse-` functions mentioned above take additional

ROME-related options. Use them to solve XML-decoding issues when dealing with

weird or non-set HTTP headers. ROME's got a solid algorithm to guess encoding,

but sometimes it might need your help.

At the moment, Remus supports the `:lenient`, `:encoding` and `:content-type`
options, which have the following meanings:

- `:lenient`: a boolean flag which makes ROME more tolerant of some mistakes
  in XML markup;
- `:encoding`: a string which represents the encoding of the feed. When parsing
  a URL, it comes from the `Content-Encoding` HTTP header. Possible values are
  listed here: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
- `:content-type`: a string meaning the MIME type of the feed,
  e.g. `application/rss+xml`. When parsing a URL, it comes from the
  `Content-Type` header.

For example, to deal with a Windows encoding and unset `Content-Type` or
`Content-Encoding` headers:

```clojure

(parse-url "https://some/rss.xml" nil {:lenient true :encoding "cp1251"})

```

The same options work for parsing a file or a stream:

```clojure

(parse-file "/path/to/another/atom.xml" {:lenient true :encoding "cp1251"})

(parse-stream in-source {:lenient true :encoding "cp1251"})

```
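
If the encoding name comes from configuration or user input, you might want to
check that the JVM knows it before parsing. Here is a minimal sketch based on
the standard `java.nio.charset.Charset` class. The `supported-encoding?` helper
is hypothetical and is not a part of Remus.

```clojure
(import 'java.nio.charset.Charset)

(defn supported-encoding?
  "Returns true when the JVM recognises `enc` as a charset name or alias."
  [enc]
  (try
    (Charset/isSupported enc)
    (catch IllegalArgumentException _
      ;; Thrown for nil and for syntactically illegal names.
      false)))

(supported-encoding? "UTF-8")
;; => true

(supported-encoding? "no-such-charset-xyz")
;; => false
```

The set of supported names depends on the Java runtime, so run the check on the
same JVM that parses the feeds.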

## License

Copyright © 2020 Ivan Grishaev

Distributed under the Eclipse Public License either version 1.0 or (at your

option) any later version.