Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/msakai/haskell-sequitur

Haskell implementation of grammar-based compression algorithm SEQUITUR
https://github.com/msakai/haskell-sequitur

compression context-free-grammar

Last synced: 2 months ago
JSON representation

Haskell implementation of grammar-based compression algorithm SEQUITUR

Awesome Lists containing this project

README

        

# Haskell implementation of _SEQUITUR_ algorithm

Hackage:
[![Hackage](https://img.shields.io/hackage/v/sequitur.svg)](https://hackage.haskell.org/package/sequitur)

Dev:
[![build](https://github.com/msakai/haskell-sequitur/actions/workflows/build.yaml/badge.svg)](https://github.com/msakai/haskell-sequitur/actions/workflows/build.yaml)
[![Coverage Status](https://coveralls.io/repos/github/msakai/haskell-sequitur/badge.svg?branch=main)](https://coveralls.io/github/msakai/haskell-sequitur?branch=main)

## About _SEQUITUR_

_SEQUITUR_ is a linear-time, online algorithm for producing a context-free
grammar from an input sequence. The resulting grammar is a compact representation
of the original sequence and can be used for data compression.

Example:

- Input string: `abcabcabcabcabc`

- Resulting grammar
- `S` → `AAB`
- `A` → `BB`
- `B` → `abc`

_SEQUITUR_ consumes input symbols one-by-one and appends each symbol at the end of the
grammar's start production (`S` in the above example), then substitutes repeating
patterns in the given sequence with new rules. _SEQUITUR_ maintains two invariants:

* **Digram Uniqueness**: _SEQUITUR_ ensures that no digram
(a.k.a. bigram) occurs more than once in the grammar. If a digram
(e.g. `ab`) occurs twice, SEQUITUR introduces a fresh non-terminal
symbol (e.g. `M`) and a rule (e.g. `M` → `ab`) and replaces
original occurrences with the newly introduced non-terminals. One
exception is the cases where two occurrences overlap.

* **Rule Utility**: If a non-terminal symbol occurs only once,
_SEQUITUR_ removes the associated rule and substitutes the occurrence
with the right-hand side of the rule.

## Usage

```console
ghci> import Language.Grammar.Sequitur
ghci> encode "baaabacaa"
Grammar {unGrammar = fromList [(0,[NonTerminal 1,NonTerminal 2,NonTerminal 1,Terminal 'c',NonTerminal 2]),(1,[Terminal 'b',Terminal 'a']),(2,[Terminal 'a',Terminal 'a'])]}
```

The output represents the following grammar:

```
0 → 1 2 1 c 2
1 → b a
2 → a a
```

## References

- [Sequitur algorithm - Wikipedia](https://en.m.wikipedia.org/wiki/Sequitur_algorithm)
- [sequitur.info](http://www.sequitur.info/)
- Nevill-Manning, C.G. and Witten, I.H. (1997) "[Identifying
Hierarchical Structure in Sequences: A linear-time
algorithm](https://doi.org/10.1613/jair.374)," Journal of
Artificial Intelligence Research, 7, 67-82.
- [nikitadanilov/sequuntur](https://github.com/nikitadanilov/sequuntur)