An open API service indexing awesome lists of open source software.

https://github.com/serras/t-regex

Matchers and grammars using tree regular expressions
https://github.com/serras/t-regex

Last synced: about 1 year ago
JSON representation

Matchers and grammars using tree regular expressions

Awesome Lists containing this project

README

          

`t-regex`: matchers and grammars using tree regular expressions
===============================================================

`t-regex` defines a series of combinators to express tree regular
expressions over any Haskell data type. In addition, with the use
of some combinators (and a bit of Template Haskell), it defines
nice syntax for using this tree regular expressions for matching
and computing attributes over a term.

## Defining your data type

In order to use `t-regex`, you need to define you data type in a
way amenable to inspection. In particular, it means that instead
of a closed data type, you need to define a *pattern functor* in
which recursion is described using the final argument, and which
should instantiate the `Generic1` type class (this can be done
automatically if you are using GHC).

For example, the following block of code defines a `Tree'` data
type in the required fashion:
```haskell
{-# LANGUAGE DeriveGeneric #-}
import GHC.Generics

data Tree' f = Leaf'
| Branch' { elt :: Int, left :: f, right :: f }
deriving (Generic1, Show)
```
Notice the `f` argument in the place where `Tree'` would usually be
found. In addition, we have declared the constructors using `'` at
the end, but we will get rid of them soon.

Now, if you want to create terms, you need to *close* the type, which
essentially makes the `f` argument refer back to `Tree'`. You do so
by using `Fix`:
```haskell
type Tree = Fix Tree'
```
However, this induces the need to add explicit `Fix` constructors at
each level. To alleviate this problem, let's define *pattern synonyms*,
available from GHC 7.8 on:
```haskell
pattern Leaf = Fix Leaf'
pattern Branch n l r = Fix (Branch' n l r)
```
From an outsider point of view, now your data type is a normal one,
with `Leaf` and `Branch` as its constructors:
```haskell
aTree = Branch 2 (Branch 2 Leaf Leaf) Leaf
```

## Tree regular expressions

Tree regular expressions are parametrized by a pattern functor: in this
way they are flexible enough to be used with different data types,
while keeping our loved Haskell type safety.

The available combinators to build regular expressions follow the syntax
of [Tree Automata Techniques and Applications](http://tata.gforge.inria.fr/),
Chapter 2.

#### Emptiness

The expressions `empty_` and `none` do not match any value. They can be
used to signal an error branch on an expressions.

#### Whole language

You can match any possible term using `any_`. It is commonly use in
combination with `capture` to extract information from a term.

#### Choice

A regular expression of the form `r1 <||> r2` tries to match `r1`, and if
it not possible, it tries to do so with `r2`. Note than when capturing,
the first regular expression is given priority.

#### Injection

Of course, at some point you would like to check whether some term has
a specific shape. In our case, this means that it has been constructed in
some specific way. In order to do so, you need to *inject* the
constructor as a tree regular expression. When doing so, you can use
the same syntax as usual, but where at the recursive positions you write
regular expressions again.

Let's make it clearer with an example. In our initial definition we had
a constructor `Branch'` with type:
```haskell
Branch' :: Int -> f -> f -> Tree' f
```
Once you inject it using `inj`, the resulting constructor becomes:
```haskell
inj . Branch' :: Int -> Regex' c f -> Regex c' f -> Tree' (Regex' c f)
```
Notice that fields with no `f` do not change their type. Now, here is how
you would represent a tree whose top node is a 2:
```haskell
topTwo = inj (Branch' 2 any_ any_)
```

In some cases, you don't want to check the value of a certain position
which is not recursive. In that case, you cannot use `any_`, since we
are not talking about the type being built upon. For that case, you
may use the special value `__`.

For example, here is how you would represent the shape of a tree which
has at least one branch point:
```haskell
someBranching = inj (Branch' __ any_ any_)
```

#### Iteration and concatenation

Iteration in tree regular expressions is not as easy as in word languages.
The reason is that iteration may occur several times, and in different
positions. For that reason, we need to introduce the notion of *hole*: a
hole is a placeholder where iteration takes place.

In `t-regex` hole names are represented as lambda-bound variables. Then,
you can use any of the functions `square`, `var` or `#` to indicate a
position where the placeholder should be put. Iteration is then indicated
by a call to `iter` or its post-fix version `^*`.

The following two are ways to indicate a `Tree` where all internal nodes
include the number `2`:
```haskell
{-# LANGUAGE PostfixOperators #-}

regex1 = Regex $
iter $ \k ->
inj (Branch' 2 (square k) (square k))
<||> inj Leaf'

regex2 = Regex $ ( (\k -> inj (Branch' 2 (k#) (k#)) <||> inj Leaf')^* )
```
Notice that the use of `PostfixOperators` enables a much terse language.

Iteration is an instance of a more general construction called *concatenation*,
where a hole in an expression is filled by another given expression. The
general shape of those are:
```haskell
(\k -> ... (k#) ...) <.> r -- k is replaced by r in the first expression
```

## Matching and capturing

You can check whether a term `t` is matched by a regular expression `r`
by simply using:
```haskell
r `matches` t
```
However, the full power of tree regular expression come at the moment you
start *capturing* subterms of your tree. A capture group is introduced
in a expression by either `capture` or `<<-`, and all of them appearing
in a expression should be from the same type. For example, we can refine
the previous `regex2` to capture leaves:
```haskell
regex2 = Regex $ ( (\k -> inj (Branch' 2 (k#) (k#)) <||> "leaves" <<- inj Leaf')^* )
```

To check whether a term matches a expression and capture the subterms,
you need to call `match`. The result is of type `Maybe (Map c (m (Fix f)))`.
Let's see what it means:

* The outermost `Maybe` indicates whether the match is successful
or not. A value of `Nothing` indicates that the given term does
not match the regular expression,
* Capture groups are returned as a `Map`, whose keys are those
given at `capture` or `<<-`. For that reason, you need capture
identifiers to be instances of `Ord`,
* Finally, you have a choice about how to save the found subterms,
given by the `Alternative m`. Most of the time, you will make
`m = []`, which means that all matches of the group will be
returned as a list. Other option is using `m = Maybe`, where
only the first match is returned.

#### Tree regular expression patterns

Capturing is quite simple, but comes with a problem: it becomes easy to
mistake the name of a capture group, so your code becomes quite brittle.
For that reason, `t-regex` includes matchers which you can use at the
same positions where pattern matching is usually done, and which take
care of linking capture groups to variable names, making it impossible
to mistake them.

To use them, you need to import `Data.Regex.TH`. Then, a quasi-quoter
named `rx` is available to you. Here is an example:
```haskell
{-# LANGUAGE QuasiQuotes #-}

example :: Tree -> [Tree]
example [rx| (\k -> inj (Branch' 2 (k#) (k#)) <||> leaves <<- inj Leaf')^* |]
= leaves
example _ = []
```
The name of the capture group, `leaves`, is now available in the body
of the `example` function. There is no need to look inside maps, this
is all taken care for you.

Note that when using the `rx` quasi-quoter, you have no choice about
the `Alternative m` to use when matching. Instead, you always get as
value of each capture group a list of terms.

For those who don't like using quasi-quotation, `t-regex` provides a
less magical version called `with`. In this case, you need to introduce
the variables in a explicit manner, and then pattern match on a tuple
wrapped inside a `Maybe`. The previous example would be written as:
```haskell
{-# LANGUAGE ViewPatterns #-}

example :: Tree -> [Tree]
example with (\leaves -> Regex $ iter $ \k -> inj (Branch' 2 (k#) (k#))
<||> leaves <<- inj Leaf' )
-> Just leaves
= leaves
example _ = []
```
Notice that the pattern is always similar `with (\v1 v2 ... ->
regular expression) -> Just (v1,v2,...)`.

## Random generation

You can use `t-regex` to generate random values of a type which satisfy a
certain tree regular expression. Of course, you might always generate
random values and then check that they match the given expression, but
this is usually very costly and maybe even statistically impossible.
Instead, you should use `arbitraryFromRegex`.

```haskell
instance Arbitrary Tree where
arbitrary = frequency
[ (1, return Leaf)
, (5, Branch <$> arbitrary <*> arbitrary <*> arbitrary) ]

> arbitraryFromRegex regex2
```

Note that in the previous example we also gave an instance declaration
for `Arbitrary`. This class comes from the `QuickCheck` package, and
is needed to generate unconstrained random values for the case in
which `any_` is found.

Sometimes you may not be able or want to write such an instance. In that
case, you can use `arbitraryFromRegexAndGen`, which takes an additional
argument from which `any_` values are generated.

## Attribute grammars

[Attribute grammars](https://www.haskell.org/haskellwiki/The_Monad.Reader/Issue4/Why_Attribute_Grammars_Matter)
are a powerful way to perform computations over a term. The main
idea is that each node in your term (when seen as a tree) is
traversed, and two sets of information are recorded at each point:

* *Inherited attributes* go from parent to children. When
describing a grammar, each node needs to specify the value
of inherited attributes of all their children,
* *Synthesized attributes* flow in the other direction.
At each node, you need to described how to get the value
of each synthesized attribute based on your inherited
attributes and the synthesized attributes of children.

The most performant attribute grammar compilers, such as
[UUAGC](http://foswiki.cs.uu.nl/foswiki/HUT/AttributeGrammarSystem)
only allow deciding which rule to apply on a node depending on
their topmost constructor. With `t-regex` you can look as
deep as you want to take this decision (but of course, performance
will suffer if you do this very often).

Here is an example of a grammar which computes a graphical
representation of a `Tree` plus its number of leaves:
```haskell
grammar = [
rule $ \l r ->
inj (Branch' 2 (l <<- any_) (r <<- any_)) ->> do
(lText,lN) <- use (at l . syn)
(rText,rN) <- use (at r . syn)
this.syn._1 .= "(" ++ lText ++ ")-SPECIAL-(" ++ rText ++ ")"
this.syn._2 .= lN + rN
, rule $ \l r ->
inj (Branch' __ (l <<- any_) (r <<- any_)) ->>> \(Branch e _ _) -> do
check $ e >= 0
(lText,lN) <- use (at l . syn)
(rText,rN) <- use (at r . syn)
this.syn._1 .= "(" ++ lText ++ ")-" ++ show e ++ "-(" ++ rText ++ ")"
this.syn._2 .= lN + rN
, rule $ inj Leaf' ->> do
this.syn._1 .= "leaf"
this.syn._2 .= Sum 1
]
```
Let's dissect it part by part.

First of all, a grammar is made of a series of *rules*. Each rule
follows the same schema:
```haskell
rule $ \v1 ... vn ->
regular expression ->> do
actions
```
`rule` is the constant part which prefixes every rule. Then, you
have the set of variables which will be used to capture information
from the term, in a similar way to previous section. After that you
have the tree regular expression the term needs to match. Finally,
and separated by `->>`, you find the actions to perform when this
rule is selected.

Two small extensions are shown in the second rule. By default,
`->>` does not give you access to the matched term. However, if you
need to access some of its information (for example, because you
used `__`, as in this case), you can use the alternative version
`->>>` which gives this as an argument. The second extension is the
use of `check` to pinpoint a logical condition which is not captured
by the regular expression itself.

The syntax for the actions relies heavily in lenses and operators
from the `lens` package. In particular, you have four lenses:

* `this.inh` gives access to the inherited attributes of the node,
* `at n . syn` gives access to the synthesized attributes of children,
* `this.syn` is where you set the synthesized attributes of your node,
* `at n . inh` is where you set the inherited attributes of children,

Furthermore, you can combine those with lenses over your inherited
and synthesized attribute data type to have more lightweight syntax.
In our case, the synthesized attributes are a tuple `(String, Sum Int)`,
so we access them with `_1` and `_2`, as shown in the example.

The general rule is that you read values using `use`, and set values
via `.=`. If some value is not set, it defaults to the empty element
of your synthesized type, or to the value of parent node in the case
of inherited attributes.