https://github.com/serras/t-regex

Matchers and grammars using tree regular expressions
https://github.com/serras/t-regex
Last synced: about 1 year ago
JSON representation
Matchers and grammars using tree regular expressions
Host: GitHub
URL: https://github.com/serras/t-regex
Owner: serras
Created: 2014-10-22T10:12:29.000Z (over 11 years ago)
Default Branch: master
Last Pushed: 2015-10-23T11:39:33.000Z (over 10 years ago)
Last Synced: 2025-03-27T18:52:24.577Z (about 1 year ago)
Language: Haskell
Homepage:
Size: 457 KB
Stars: 6
Watchers: 4
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project

README

          `t-regex`: matchers and grammars using tree regular expressions

===============================================================

`t-regex` defines a series of combinators to express tree regular

expressions over any Haskell data type. In addition, with the use

of some combinators (and a bit of Template Haskell), it defines

nice syntax for using this tree regular expressions for matching

and computing attributes over a term.

## Defining your data type

In order to use `t-regex`, you need to define you data type in a

way amenable to inspection. In particular, it means that instead

of a closed data type, you need to define a *pattern functor* in

which recursion is described using the final argument, and which

should instantiate the `Generic1` type class (this can be done

automatically if you are using GHC).

For example, the following block of code defines a `Tree'` data

type in the required fashion:

```haskell

{-# LANGUAGE DeriveGeneric #-}

import GHC.Generics

data Tree' f = Leaf'

             | Branch' { elt :: Int, left :: f, right :: f }

             deriving (Generic1, Show)

```

Notice the `f` argument in the place where `Tree'` would usually be

found. In addition, we have declared the constructors using `'` at

the end, but we will get rid of them soon.

Now, if you want to create terms, you need to *close* the type, which

essentially makes the `f` argument refer back to `Tree'`. You do so

by using `Fix`:

```haskell

type Tree = Fix Tree'

```

However, this induces the need to add explicit `Fix` constructors at

each level. To alleviate this problem, let's define *pattern synonyms*,

available from GHC 7.8 on:

```haskell

pattern Leaf = Fix Leaf'

pattern Branch n l r = Fix (Branch' n l r)

```

From an outsider point of view, now your data type is a normal one,

with `Leaf` and `Branch` as its constructors:

```haskell

aTree = Branch 2 (Branch 2 Leaf Leaf) Leaf

```

## Tree regular expressions

Tree regular expressions are parametrized by a pattern functor: in this

way they are flexible enough to be used with different data types,

while keeping our loved Haskell type safety.

The available combinators to build regular expressions follow the syntax

of [Tree Automata Techniques and Applications](http://tata.gforge.inria.fr/),

Chapter 2.

#### Emptiness

The expressions `empty_` and `none` do not match any value. They can be

used to signal an error branch on an expressions.

#### Whole language

You can match any possible term using `any_`. It is commonly use in

combination with `capture` to extract information from a term.

#### Choice

A regular expression of the form `r1 <||> r2` tries to match `r1`, and if

it not possible, it tries to do so with `r2`. Note than when capturing,

the first regular expression is given priority.

#### Injection

Of course, at some point you would like to check whether some term has

a specific shape. In our case, this means that it has been constructed in

some specific way. In order to do so, you need to *inject* the

constructor as a tree regular expression. When doing so, you can use

the same syntax as usual, but where at the recursive positions you write

regular expressions again.

Let's make it clearer with an example. In our initial definition we had

a constructor `Branch'` with type:

```haskell

Branch' :: Int -> f -> f -> Tree' f

```

Once you inject it using `inj`, the resulting constructor becomes:

```haskell

inj . Branch' :: Int -> Regex' c f -> Regex c' f -> Tree' (Regex' c f)

```

Notice that fields with no `f` do not change their type. Now, here is how

you would represent a tree whose top node is a 2:

```haskell

topTwo = inj (Branch' 2 any_ any_)

```

In some cases, you don't want to check the value of a certain position

which is not recursive. In that case, you cannot use `any_`, since we

are not talking about the type being built upon. For that case, you

may use the special value `__`.

For example, here is how you would represent the shape of a tree which

has at least one branch point:

```haskell

someBranching = inj (Branch' __ any_ any_)

```

#### Iteration and concatenation

Iteration in tree regular expressions is not as easy as in word languages.

The reason is that iteration may occur several times, and in different

positions. For that reason, we need to introduce the notion of *hole*: a

hole is a placeholder where iteration takes place.

In `t-regex` hole names are represented as lambda-bound variables. Then,

you can use any of the functions `square`, `var` or `#` to indicate a

position where the placeholder should be put. Iteration is then indicated

by a call to `iter` or its post-fix version `^*`.

The following two are ways to indicate a `Tree` where all internal nodes

include the number `2`:

```haskell

{-# LANGUAGE PostfixOperators #-}

regex1 = Regex $

           iter $ \k ->

                  inj (Branch' 2 (square k) (square k))

             <||> inj Leaf'

regex2 = Regex $ ( (\k -> inj (Branch' 2 (k#) (k#)) <||> inj Leaf')^* )

```

Notice that the use of `PostfixOperators` enables a much terse language.

Iteration is an instance of a more general construction called *concatenation*,

where a hole in an expression is filled by another given expression. The

general shape of those are:

```haskell

(\k -> ... (k#) ...) <.> r  -- k is replaced by r in the first expression

```

## Matching and capturing

You can check whether a term `t` is matched by a regular expression `r`

by simply using:

```haskell

r `matches` t

```

However, the full power of tree regular expression come at the moment you

start *capturing* subterms of your tree. A capture group is introduced

in a expression by either `capture` or `<<-`, and all of them appearing

in a expression should be from the same type. For example, we can refine

the previous `regex2` to capture leaves:

```haskell

regex2 = Regex $ ( (\k -> inj (Branch' 2 (k#) (k#)) <||> "leaves" <<- inj Leaf')^* )

```

To check whether a term matches a expression and capture the subterms,

you need to call `match`. The result is of type `Maybe (Map c (m (Fix f)))`.

Let's see what it means:

  * The outermost `Maybe` indicates whether the match is successful

    or not. A value of `Nothing` indicates that the given term does

	not match the regular expression,

  * Capture groups are returned as a `Map`, whose keys are those

    given at `capture` or `<<-`. For that reason, you need capture

	identifiers to be instances of `Ord`,

  * Finally, you have a choice about how to save the found subterms,

    given by the `Alternative m`. Most of the time, you will make

	`m = []`, which means that all matches of the group will be

	returned as a list. Other option is using `m = Maybe`, where

	only the first match is returned.

#### Tree regular expression patterns

Capturing is quite simple, but comes with a problem: it becomes easy to

mistake the name of a capture group, so your code becomes quite brittle.

For that reason, `t-regex` includes matchers which you can use at the

same positions where pattern matching is usually done, and which take

care of linking capture groups to variable names, making it impossible

to mistake them.

To use them, you need to import `Data.Regex.TH`. Then, a quasi-quoter

named `rx` is available to you. Here is an example:

```haskell

{-# LANGUAGE QuasiQuotes #-}

example :: Tree -> [Tree]

example [rx| (\k -> inj (Branch' 2 (k#) (k#)) <||> leaves <<- inj Leaf')^* |]

           = leaves

example _  = []

```

The name of the capture group, `leaves`, is now available in the body

of the `example` function. There is no need to look inside maps, this

is all taken care for you.

Note that when using the `rx` quasi-quoter, you have no choice about

the `Alternative m` to use when matching. Instead, you always get as

value of each capture group a list of terms.

For those who don't like using quasi-quotation, `t-regex` provides a

less magical version called `with`. In this case, you need to introduce

the variables in a explicit manner, and then pattern match on a tuple

wrapped inside a `Maybe`. The previous example would be written as:

```haskell

{-# LANGUAGE ViewPatterns #-}

example :: Tree -> [Tree]

example with (\leaves -> Regex $ iter $ \k -> inj (Branch' 2 (k#) (k#))

                                              <||> leaves <<- inj Leaf' )

             -> Just leaves

           = leaves

example _  = []

```

Notice that the pattern is always similar `with (\v1 v2 ... ->

regular expression) -> Just (v1,v2,...)`.

## Random generation

You can use `t-regex` to generate random values of a type which satisfy a

certain tree regular expression. Of course, you might always generate

random values and then check that they match the given expression, but

this is usually very costly and maybe even statistically impossible.

Instead, you should use `arbitraryFromRegex`.

```haskell

instance Arbitrary Tree where

  arbitrary = frequency

    [ (1, return Leaf)

    , (5, Branch <$> arbitrary <*> arbitrary <*> arbitrary) ]

> arbitraryFromRegex regex2

```

Note that in the previous example we also gave an instance declaration

for `Arbitrary`. This class comes from the `QuickCheck` package, and

is needed to generate unconstrained random values for the case in

which `any_` is found.

Sometimes you may not be able or want to write such an instance. In that

case, you can use `arbitraryFromRegexAndGen`, which takes an additional

argument from which `any_` values are generated.

## Attribute grammars

[Attribute grammars](https://www.haskell.org/haskellwiki/The_Monad.Reader/Issue4/Why_Attribute_Grammars_Matter)

are a powerful way to perform computations over a term. The main

idea is that each node in your term (when seen as a tree) is

traversed, and two sets of information are recorded at each point:

  * *Inherited attributes* go from parent to children. When

    describing a grammar, each node needs to specify the value

	of inherited attributes of all their children,

  * *Synthesized attributes* flow in the other direction.

    At each node, you need to described how to get the value

	of each synthesized attribute based on your inherited

	attributes and the synthesized attributes of children.

The most performant attribute grammar compilers, such as

[UUAGC](http://foswiki.cs.uu.nl/foswiki/HUT/AttributeGrammarSystem)

only allow deciding which rule to apply on a node depending on

their topmost constructor. With `t-regex` you can look as

deep as you want to take this decision (but of course, performance

will suffer if you do this very often).

Here is an example of a grammar which computes a graphical

representation of a `Tree` plus its number of leaves:

```haskell

grammar = [

    rule $ \l r ->

     inj (Branch' 2 (l <<- any_) (r <<- any_)) ->> do

       (lText,lN) <- use (at l . syn)

       (rText,rN) <- use (at r . syn)

       this.syn._1 .= "(" ++ lText ++ ")-SPECIAL-(" ++ rText ++ ")"

       this.syn._2 .= lN + rN

  , rule $ \l r ->

     inj (Branch' __ (l <<- any_) (r <<- any_)) ->>> \(Branch e _ _) -> do

	   check $ e >= 0

       (lText,lN) <- use (at l . syn)

       (rText,rN) <- use (at r . syn)

       this.syn._1 .= "(" ++ lText ++ ")-" ++ show e ++ "-(" ++ rText ++ ")"

       this.syn._2 .= lN + rN

  , rule $ inj Leaf' ->> do

       this.syn._1 .= "leaf"

       this.syn._2 .= Sum 1

  ]

```

Let's dissect it part by part.

First of all, a grammar is made of a series of *rules*. Each rule

follows the same schema:

```haskell

rule $ \v1 ... vn ->

  regular expression ->> do

    actions

```

`rule` is the constant part which prefixes every rule. Then, you

have the set of variables which will be used to capture information

from the term, in a similar way to previous section. After that you

have the tree regular expression the term needs to match. Finally,

and separated by `->>`, you find the actions to perform when this

rule is selected.

Two small extensions are shown in the second rule. By default,

`->>` does not give you access to the matched term. However, if you

need to access some of its information (for example, because you

used `__`, as in this case), you can use the alternative version

`->>>` which gives this as an argument. The second extension is the

use of `check` to pinpoint a logical condition which is not captured

by the regular expression itself.

The syntax for the actions relies heavily in lenses and operators

from the `lens` package. In particular, you have four lenses:

  * `this.inh` gives access to the inherited attributes of the node,

  * `at n . syn` gives access to the synthesized attributes of children,

  * `this.syn` is where you set the synthesized attributes of your node,

  * `at n . inh` is where you set the inherited attributes of children,

Furthermore, you can combine those with lenses over your inherited

and synthesized attribute data type to have more lightweight syntax.

In our case, the synthesized attributes are a tuple `(String, Sum Int)`,

so we access them with `_1` and `_2`, as shown in the example.

The general rule is that you read values using `use`, and set values

via `.=`. If some value is not set, it defaults to the empty element

of your synthesized type, or to the value of parent node in the case

of inherited attributes.
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/serras/t-regex

Awesome Lists containing this project

README