https://github.com/torao/terp

Parser Combinator Framework for Rust

Host: GitHub
URL: https://github.com/torao/terp
Owner: torao
License: mit
Created: 2022-06-12T06:36:57.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2022-11-03T08:59:04.000Z (about 2 years ago)
Last Synced: 2023-03-24T04:58:12.438Z (almost 2 years ago)
Language: Rust
Size: 2.46 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Terp

[![github actions](https://github.com/torao/terp/actions/workflows/build.yml/badge.svg)](https://github.com/torao/terp/actions)

[![Coverage Status](https://coveralls.io/repos/github/torao/terp/badge.svg?branch=main)](https://coveralls.io/github/torao/terp?branch=main)

**Terp** is a stream-oriented syntactical parser for Rust, capable of sequentially processing fragmented input symbol sequences. It interprets input according to an application-defined syntax and produces a sequence marked up with *begin* and *end* pairs of non-terminal symbols.

## Overview

Terp is designed for **streaming** or **pipelined** processing: each syntactic element is processed as soon as it can be parsed, without waiting for the entire fragmented input to be read. This is also useful for read-eval-print loop (REPL) programs, such as the interactive interpreters available on some programming language platforms, which read program fragments line by line and evaluate each expression once it is complete, while an incomplete expression waits for the remaining input.

It is also suitable for **infinite input streams**, or for data too long to be read into memory in practice (however, the syntax for processing such input must be defined so that the parser reaches a deterministic state within a practical number of look-aheads).
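
The streaming behaviour described above can be illustrated with a minimal, std-only sketch. `StatementStream` is a hypothetical stand-in, not part of terp's API: it treats `;` as the end of a statement, emits each statement as soon as it is complete, and buffers the unfinished tail until more input arrives.

```rust
/// Hypothetical stand-in for a streaming parser; this is NOT terp's API.
/// Statements end with ';'. Finished statements are emitted immediately,
/// and the unfinished tail is buffered until more input arrives.
pub struct StatementStream {
    pending: String,
}

impl StatementStream {
    pub fn new() -> Self {
        StatementStream { pending: String::new() }
    }

    /// Feed one fragment and return every statement that it completes.
    pub fn push_str(&mut self, fragment: &str) -> Vec<String> {
        let mut finished = Vec::new();
        for c in fragment.chars() {
            if c == ';' {
                // `take` leaves `pending` empty, ready for the next statement.
                let statement = std::mem::take(&mut self.pending);
                finished.push(statement.trim().to_string());
            } else {
                self.pending.push(c);
            }
        }
        finished
    }

    /// Signal the end of input; leftover non-blank text is an incomplete statement.
    pub fn finish(self) -> Result<(), String> {
        let rest = self.pending.trim();
        if rest.is_empty() {
            Ok(())
        } else {
            Err(format!("incomplete statement: {}", rest))
        }
    }
}
```

Feeding `"let a = 1; let"`, `" b = "` and `"2;"` in turn yields the first statement after the first fragment and the second only after the third, regardless of where the fragments were split. Memory use is bounded by the longest unfinished statement rather than by the total input.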

Another key feature of terp is that, instead of matching alternatives using traditional $k$-lookahead prediction or backtracking, matching is done by **parallel evaluation** of parsing paths. This makes it better suited to parsing in modern multi-core computer environments.
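
To make the idea of evaluating paths side by side concrete, here is a small sketch under simplifying assumptions: every alternative is a fixed character sequence, and the types `Path` and `Alternatives` are hypothetical rather than terp's internals. All live paths advance together on each input symbol and mismatching paths are dropped, so no lookahead or backtracking is needed. The loop here is sequential; the per-path step is the part that could be distributed across cores.

```rust
/// One candidate parsing path: an alternative plus how far it has matched.
/// Illustrative only; terp's internal representation is not shown in this README.
#[derive(Debug, Clone, PartialEq)]
pub struct Path {
    pub name: &'static str,
    pattern: Vec<char>,
    pos: usize,
}

/// The set of alternatives that are still consistent with the input so far.
pub struct Alternatives {
    alive: Vec<Path>,
}

impl Alternatives {
    /// Start one path per `(name, pattern)` alternative.
    pub fn new(alternatives: &[(&'static str, &str)]) -> Self {
        let alive = alternatives
            .iter()
            .map(|&(name, pattern)| Path {
                name,
                pattern: pattern.chars().collect(),
                pos: 0,
            })
            .collect();
        Alternatives { alive }
    }

    /// Advance every live path by one symbol and drop the ones that mismatch.
    pub fn push(&mut self, symbol: char) {
        self.alive.retain_mut(|path| {
            let matches = path.pattern.get(path.pos) == Some(&symbol);
            if matches {
                path.pos += 1;
            }
            matches
        });
    }

    /// Names of the paths that have not been ruled out yet.
    pub fn alive_names(&self) -> Vec<&'static str> {
        self.alive.iter().map(|p| p.name).collect()
    }

    /// Names of the live paths that have consumed their whole pattern.
    pub fn matched(&self) -> Vec<&'static str> {
        self.alive
            .iter()
            .filter(|p| p.pos == p.pattern.len())
            .map(|p| p.name)
            .collect()
    }
}
```

With the alternatives `true`, `try` and `false`, pushing `t` and `r` keeps `true` and `try` alive, `u` eliminates `try`, and `e` completes `true`.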

In traditional terms, terp would be a variant of the recursive-descent LL(k) parser, which can interpret context-free grammars (CFG). For more information on using terp, see the [Reference Guide](doc/README.md).

## Features

### Easy-to-describe Schema

Instead of using complex function combinations, the schema can be described in a BNF- or PEG-like manner that is easier to read. The following example expresses the JSON string syntax from [RFC 8259](https://www.rfc-editor.org/rfc/rfc8259.html) in terp, where `A & B` means that `B` appears after `A`, `A | B` means that `A` or `B` appears, and `A * (X..=Y)` means `X` to `Y` repetitions of `A`.

```rust
// `&` is concatenation, `|` is alternation, and `*` is repetition.
let schema = Schema::new("JSON String")
  .define("String",    id("Quote") & (id("Char") * (0..)) & id("Quote"))
  .define("Quote",     ch('\"'))
  // Rust gives `&` higher precedence than `|`, so this reads as Unescaped | (Escape & (...)).
  .define("Char",      id("Unescaped") | id("Escape") & (one_of_chars("\"\\/bfnrt") | (ch('u') & (id("Hex") * 4))))
  .define("Escape",    ch('\\'))
  .define("Unescaped", range('\x20'..='\x21') | range('\x23'..='\x5B') | range('\x5D'..='\u{10FFFF}'))
  .define("Hex",       range('0'..='9') | range('a'..='f') | range('A'..='F'));
```

The schema is referenced immutably while the parser is parsing.
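
A notation like `A & B`, `A | B` and `A * (X..=Y)` can be built in Rust by overloading the standard operator traits. The sketch below shows one way to do it; the `Syntax` enum and this `ch` helper are assumptions made for illustration and may differ from terp's actual types.

```rust
use std::ops::{BitAnd, BitOr, Mul, RangeInclusive};

/// A tiny grammar tree. Hypothetical; terp's real syntax type may differ.
#[derive(Debug, Clone, PartialEq)]
pub enum Syntax {
    Char(char),
    Seq(Box<Syntax>, Box<Syntax>),
    Or(Box<Syntax>, Box<Syntax>),
    Repeat(Box<Syntax>, usize, usize),
}

/// Match a single character.
pub fn ch(c: char) -> Syntax {
    Syntax::Char(c)
}

/// `A & B`: `B` appears after `A`.
impl BitAnd for Syntax {
    type Output = Syntax;
    fn bitand(self, rhs: Syntax) -> Syntax {
        Syntax::Seq(Box::new(self), Box::new(rhs))
    }
}

/// `A | B`: either `A` or `B` appears.
impl BitOr for Syntax {
    type Output = Syntax;
    fn bitor(self, rhs: Syntax) -> Syntax {
        Syntax::Or(Box::new(self), Box::new(rhs))
    }
}

/// `A * (X..=Y)`: between `X` and `Y` repetitions of `A`.
impl Mul<RangeInclusive<usize>> for Syntax {
    type Output = Syntax;
    fn mul(self, reps: RangeInclusive<usize>) -> Syntax {
        Syntax::Repeat(Box::new(self), *reps.start(), *reps.end())
    }
}
```

Because these are ordinary Rust operators, they keep Rust's precedence: `*` binds tighter than `&`, which binds tighter than `|`. An expression such as `ch('a') | ch('b') & ch('c')` therefore builds `Or(a, Seq(b, c))`, and parentheses are needed wherever a different grouping is intended.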

### State-Machine Designed Parser

The parser updates its state for each incoming fragment of the data sequence and sequentially outputs the marked-up sequence as events once their meaning is determined (this is similar to a SAX parser for XML). The terp parser behaves like a pipeline, which is useful for streaming processes that read and parse fragmented data from sockets or other inputs.

![Parser Input](doc/input-process-output.png)

Input data sequences are parsed the same way no matter where they are split into fragments. The resulting output data sequence is passed to event callbacks.

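```rust
// Collect every event that the parser emits.
let mut events = Vec::new();
let mut parser = Context::new(&schema, "String", |e: Event| events.push(e)).unwrap();
// Feed the input in arbitrary fragments; the split points do not affect the result.
parser.push_str("\"t").unwrap();
parser.push_str("e").unwrap();
parser.push_str("rp\"").unwrap();
// Signal the end of input so that any remaining events are emitted.
parser.finish().unwrap();
println!("{:?}", events);
```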

The events passed to the callback form the input sequence marked up with BEGIN-END pairs of identifiers. This constitutes a tree structure organized by meaning, similar to the structure of XML.

```
EventKind::Begin("String")
EventKind::Begin("Quote")
EventKind::Fragments("\"")
EventKind::End("Quote")
```
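
Since BEGIN-END pairs nest like XML elements, a flat event sequence can be folded back into a tree with a stack. The following is a minimal sketch; the `EventKind` enum here is a simplified model of the listing above (with owned `String` payloads) and `Node` and `build_tree` are hypothetical helpers, none of which are terp's real definitions.

```rust
/// Simplified model of the events listed above; not terp's real type.
#[derive(Debug, Clone, PartialEq)]
pub enum EventKind {
    Begin(String),
    Fragments(String),
    End(String),
}

/// A tree node: a named element with children, or a piece of raw input.
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Element { name: String, children: Vec<Node> },
    Text(String),
}

/// Fold a flat Begin/Fragments/End sequence into a list of top-level nodes.
pub fn build_tree(events: &[EventKind]) -> Result<Vec<Node>, String> {
    // The first frame collects top-level nodes; each Begin opens a new frame.
    let mut stack: Vec<(Option<String>, Vec<Node>)> = vec![(None, Vec::new())];
    for event in events {
        match event {
            EventKind::Begin(name) => stack.push((Some(name.clone()), Vec::new())),
            EventKind::Fragments(text) => {
                let frame = stack.last_mut().expect("root frame is always present");
                frame.1.push(Node::Text(text.clone()));
            }
            EventKind::End(name) => {
                let (open, children) = stack.pop().expect("root frame is always present");
                // Popping the root frame (open == None) also lands here as an error.
                if open.as_deref() != Some(name.as_str()) {
                    return Err(format!("unexpected End({})", name));
                }
                let parent = stack.last_mut().expect("parent frame exists after a matched End");
                parent.1.push(Node::Element { name: name.clone(), children });
            }
        }
    }
    if stack.len() != 1 {
        return Err("unclosed Begin event".to_string());
    }
    Ok(stack.pop().expect("root frame is always present").1)
}
```

Building the whole tree gives up the streaming property, so this only makes sense for inputs that fit in memory; for unbounded streams the events would be handled one at a time as they arrive.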

* The supported data sequences are abstracted, allowing parsers to be built for strings, byte arrays, or any other data sequence.

* Multiple routes are matched in parallel using the [`rayon`](https://github.com/rayon-rs/rayon) framework.

* Terp is not as fast as dedicated parser implementations optimized for a particular schema. It is suitable for parsing domain-specific data for which a dedicated parser doesn't exist, or for use as a comparison to check that a dedicated parser is working properly.