An open API service indexing awesome lists of open source software.

https://github.com/juxt/reap

A Clojure library for decoding and encoding strings used by web protocols.
https://github.com/juxt/reap

Last synced: 5 months ago
JSON representation

A Clojure library for decoding and encoding strings used by web protocols.

Awesome Lists containing this project

README

          

= Reap

Regular Expressions for Accurate Parsing

A Clojure library for decoding and encoding strings used by web protocols.

[WARNING]
--
STATUS: Ready to use, the API is stable.
--

== Quick Start

Suppose you want to decode an Accept header from an HTTP request. For example, Firefox sends one like this:

`Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8`

whereas a Chrome browser on Windows 7 might send:

`Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9`

You can use *reap* to parse that header's value into data you can more easily work with.

Here's how:

[source,clojure]
----
(require
'[juxt.reap.decoders.rfc7231 :refer [accept]]
'[juxt.reap.regex :as re])

(let [decoder (accept {})]
(decoder
(re/input
"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9")))
----

This will return the following sequence of items:

[source,clojure]
----
#:juxt.reap.rfc7231
{:media-range "text/html",
:type "text",
:subtype "html",
:parameters {}}

#:juxt.reap.rfc7231
{:media-range "application/xhtml+xml",
:type "application",
:subtype "xhtml+xml",
:parameters {}}

#:juxt.reap.rfc7231
{:media-range "application/xml",
:type "application",
:subtype "xml",
:parameters {},
:qvalue 0.9}

#:juxt.reap.rfc7231
{:media-range "image/webp",
:type "image",
:subtype "webp",
:parameters {}}

#:juxt.reap.rfc7231
{:media-range "image/apng",
:type "image",
:subtype "apng",
:parameters {}}

#:juxt.reap.rfc7231
{:media-range "*/*",
:type "*",
:subtype "*",
:parameters {},
:qvalue 0.8}

#:juxt.reap.rfc7231
{:media-range "application/signed-exchange",
:type "application",
:subtype "signed-exchange",
:parameters {"v" "b3"},
:qvalue 0.9}
----

*reap* contains parsers for most things you'd want to parse when writing web
applications, so you can focus on writing your app without worrying about
writing parsers. It's fast too, so you don't have to worry about a performance
impact.

== Introduction

The Internet is a system of interoperable computer software written to
a set of exacting specifications
(https://tools.ietf.org/rfc/index[RFCs]) published by the
https://www.ietf.org/[Internet Engineering Task Force].

Many Internet protocols, notably HTTP, are textual in nature.

Software components of the Internet must be able to efficiently encode and
decode strings of text accurately in order to process correctly.

=== Problem Statement

There are not many tools of sufficient quality which can help with the decoding
and encoding of text strings, especially those defined in RFCs.

Therefore, programmers are often left to write their own 'quick and dirty'
code. This leads to software that does not properly implement (and is not fully
conformant with) the rules defined in the RFCs.

Programmers often have to strike a balance between conforming to the
rules layed down by the RFCs and competing priorities such as meeting
performance requirements and project deadlines.

Unfortunately, code that violates any aspect of a specification can
lead to an unhealthy Internet. Time is wasted debugging
interoperability problems, buggy implementations cause problems for
users and lead to, in some cases, security vulnerabilities.

=== Example: the HTTP Accept header

In RFC 7231 (which defines part of HTTP), the `Accept`
header is specified by the following rule:

[source]
----
Accept = [ ( "," / ( media-range [ accept-params ] ) ) *( OWS "," [
OWS ( media-range [ accept-params ] ) ] ) ]
----

As well as indicating the ways that various punctuation and other characters can
be combined, the rule makes reference to other rules, such as `media-range`:

[source]
----
media-range = ( "*/*" / ( type "/*" ) / ( type "/" subtype ) ) *( OWS
";" OWS parameter )
----

A `type` here is a `token`, defined in another RFC (RFC 7230), which
states a `token` is a sequence of at least one `tchar`:

[source]
----
token = 1*tchar
tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." /
"^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA
----

Let's leave aside DIGIT and ALPHA and return to the `parameter` rule,
which itself is non-trivial:

[source]
----
parameter = token "=" ( token / quoted-string )
----

The rule tells us that values can be tokens, but can _alternatively_
be separated by quotation marks:

[source]
----
quoted-string = DQUOTE *( qdtext / quoted-pair ) DQUOTE
----

What is contained within these quotation marks is subject to further
exacting rules about which characters and character ranges are valid
and how characters can be escaped by using ``quoted-pair``s:

[source]
----
qdtext = HTAB / SP / "!" / %x23-5B ; '#'-'['
/ %x5D-7E ; ']'-'~'
/ obs-text
obs-text = %x80-FF
quoted-pair = "\" ( HTAB / SP / VCHAR / obs-text )
----

A `media-range`, itself containing parameters (where values are required) can be
optionally followed by a special parameter indicating the term's `weight`,
optionally followed by further parameters (where values are optional), called
accept extensions.

These are the rules for just one HTTP request header, and it's by far
from the most complex!

So it's no surprise that programmers who resort to writing custom
parsing code might skip a few details.

=== Ingredients

*reap* is built from some old ideas.

==== Lisp (1958)

Clojure is used as the implementation language to facilitate faster research and prototyping.
If this project proves useful/stable it might be a good idea to port to Java and provide a Clojure wrapper.

==== Regular Expressions (1950s)

*reap* uses https://en.wikipedia.org/wiki/Regular_expressions[regular expressions] to parse terminals.

==== Allen's Interval Algebra (1983)

https://en.wikipedia.org/wiki/Allen's_interval_algebra[Allen's interval algebra] allows character intervals to be manipulated and combined, to form optimal ranges which optimise the performance of the regular expression.

==== Parser Combinators (1989)

https://en.wikipedia.org/wiki/Parser_combinator[Parser combinators] are used to combine parsers built from regular expressions.

See

==== Parsing Expression Grammars (2004)

*reap* uses a technique known as https://en.wikipedia.org/wiki/Parsing_expression_grammar[Parsing expression grammar (PEG)].
There are other dedicated PEG parsing libraries, including https://github.com/ericnormand/squarepeg[squarepeg] and https://github.com/aroemers/crustimoney[crustimoney].
Reap is focussed on the practical problem of parsing real-world strings found in web protocols, rather than providing a general PEG parsing library.
We don't currently support packrat caching (memoization), although that may be added in the future.

The alternative to PEG is a Context Free Grammar. There are a number of excellent tools for generating CFG parsers, from venerable ones such as flex/bison to more modern ones including https://www.antlr.org/[Antlr].

In the Clojure eco-system, we have https://github.com/aphyr/clj-antlr[clj-antlr] and https://github.com/Engelberg/instaparse[Instaparse].

Note, however, this useful comparison between PEG and CFG parsers.

[quote, Bryan Ford, https://bford.info/pub/lang/peg.pdf]
____
Chomsky’s generative system of grammars, from which the ubiqui-
tous context-free grammars (CFGs) and regular expressions (REs)
arise, was originally designed as a formal tool for modelling and
analyzing natural (human) languages. Due to their elegance and
expressive power, computer scientists adopted generative grammars
for describing machine-oriented languages as well. The ability of
a CFG to express ambiguous syntax is an important and powerful
tool for natural languages. Unfortunately, this power gets in the
way when we use CFGs for machine-oriented languages that are
intended to be precise and unambiguous. Ambiguity in CFGs is
difficult to avoid even when we want to, and it makes general CFG
parsing an inherently super-linear-time problem.
____

== User Guide

Functions marked with the metadata tag `:juxt.reap/codec` take an 'options' argument and return a map of entries.

`:juxt.reap/decode`:: A single-arity parser function, taking a
`java.util.regex.Matcher` as the only argument and returning a Clojure map or
sequence.

`:juxt.reap/encode`:: A single-arity function, taking a Clojure map or sequence
and returning a string.

=== Options

The 'options' argument is a map containing the following optional entries:

`:juxt.reap/decode-preserve-case`:: Set to true to prevent the parser from transforming tokens that are treated as case-insensitive to lower-case. This lossy transformation simplifies case-insensitive comparisons. Defaults to nil (false).

`:juxt.reap/encode-case-transform`:: Set to `:lower` to transform generated tokens to lower-case, where applicable (where the token is semantically case-insensitive). Set to `:canonical` to transform tokens and header values to their canonical case. Defaults to nil.

== References

https://tools.ietf.org/html/rfc7230[Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing]

https://tools.ietf.org/html/rfc7231[Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content]

https://tools.ietf.org/html/rfc7232[Hypertext Transfer Protocol (HTTP/1.1): Conditional Requests]

https://tools.ietf.org/html/rfc7233[Hypertext Transfer Protocol (HTTP/1.1): Range Requests]

https://tools.ietf.org/html/rfc7234[Hypertext Transfer Protocol (HTTP/1.1): Caching]

https://tools.ietf.org/html/rfc7235[Hypertext Transfer Protocol (HTTP/1.1): Authentication]

https://github.com/Engelberg/instaparse[Instaparse]

https://github.com/Engelberg/instaparse/blob/master/docs/ABNF.md[Instaparse: ABNF Input Format]

https://cse.unl.edu/~choueiry/Documents/Allen-CACM1983.pdf[Maintaining Knowledge about Temporal Intervals, James F. Allen]

https://bford.info/pub/lang/peg.pdf[Parsing Expression Grammars: A Recognition-Based Syntactic Foundation]

https://github.com/ericnormand/squarepeg[squarepeg]: Eric Normand's PEG parsing library (Clojure)

https://github.com/aroemers/crustimoney[crustimoney]: Arnout Roemers' Clojure library for "PEG parsing, supporting various grammars, packrat caching and cuts."

https://www.infoq.com/presentations/Parser-Combinators/[Parser Combinators: How to Parse (nearly) Anything]: Nate Young's Strange Look talk.

== License

The MIT License (MIT)

Copyright © 2020-2024 JUXT LTD.

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.