# Universal Lexer

[![Travis](https://travis-ci.org/rangoo94/universal-lexer.svg)](https://travis-ci.org/rangoo94/universal-lexer)
[![Code Climate](https://codeclimate.com/github/rangoo94/universal-lexer/badges/gpa.svg)](https://codeclimate.com/github/rangoo94/universal-lexer)
[![Coverage Status](https://coveralls.io/repos/github/rangoo94/universal-lexer/badge.svg?branch=master)](https://coveralls.io/github/rangoo94/universal-lexer?branch=master)
[![NPM Downloads](https://img.shields.io/npm/dm/universal-lexer.svg)](https://www.npmjs.com/package/universal-lexer)

A lexer that can parse any text input into tokens, according to the regular expressions you provide.

> In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters
> (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning).
> A program that performs lexical analysis may be termed a lexer, tokenizer, or scanner, though scanner is also a term for the first stage of a lexer.
> A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.

## Features

- Allows named regular expressions, so you don't have to do much work with them yourself
- Allows post-processing of tokens, to extract any extra information you need

## How to install

The package is available as `universal-lexer` on NPM, so you can add it to your project with
`npm install universal-lexer` or `yarn add universal-lexer`.

## What are the requirements?

The code itself is written in ES6 and should work in a Node.js 6+ environment.
If you would like to use it in a browser or an older environment, a transpiled and bundled (UMD) version is also included.
You can require `universal-lexer/browser`, or use the `UniversalLexer` global in the browser:

```js
// Load library
const UniversalLexer = require('universal-lexer/browser')

// Create lexer
const lexer = UniversalLexer.compile(definitions)

// ...
```

## How it works

There are two sets of functions: one builds lexer code, and the other compiles a ready-to-use function:

```js
// Load library
const UniversalLexer = require('universal-lexer')

// Build code for this lexer
const code1 = UniversalLexer.build([ { type: 'Colon', value: ':' } ])
const code2 = UniversalLexer.buildFromFile('json.yaml')

// Compile dynamically a function which can be used
const func1 = UniversalLexer.compile([ { type: 'Colon', value: ':' } ])
const func2 = UniversalLexer.compileFromFile('json.yaml')
```

There are two ways of passing rules to this lexer: from a file or from an array of definitions.

### Pass as array of definitions

Simply pass the definitions to the lexer:

```js
// Load library
const UniversalLexer = require('universal-lexer')

// Create token definition
const Colon = {
  type: 'Colon',
  value: ':'
}

// Build array of definitions
const definitions = [ Colon ]

// Create lexer
const lexer = UniversalLexer.compile(definitions)
```
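
Once compiled, the lexer is just a function that you call with an input string (see "Processing data" below). A minimal sketch; the exact token shape follows the format shown in "Possible results":

```js
// Tokenize a sample input with the lexer compiled above
const result = lexer('::')

// Roughly:
// [ { type: 'Colon', data: { value: ':' }, start: 0, end: 1 },
//   { type: 'Colon', data: { value: ':' }, start: 1, end: 2 } ]
console.log(result.tokens)
```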

A definition is a more complex object:

```js
// Required fields: 'type' and either `regex` or `value`
{
  // Token name
  type: 'String',

  // String value which should be matched at the beginning of the input
  // (alternatively e.g. value: '(')
  value: 'abc',

  // Regular expression to validate
  // whether the current token should be parsed as this token.
  // Useful e.g. when you require a separator after a sentence,
  // but you don't want to include it.
  valid: '"',

  // Regular expression flags for the 'valid' field
  validFlags: 'i',

  // Regular expression to find the current token.
  // You can use named groups as well: (?expression)
  // Then it will attach this information to the token.
  regex: '"(?([^"]|\\.)+)"',

  // Regular expression flags for the 'regex' field
  regexFlags: 'i'
}
```
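
For example, reusing the `valid` and `regex` values from the annotated object above, a complete definition for a double-quoted string could look like this:

```js
// Double-quoted string token (illustrative example based on the fields above)
const QuotedString = {
  type: 'String',

  // Quick check that the token starts with a quote
  valid: '"',

  // Capture the contents between the quotes into the token data
  regex: '"(?([^"]|\\.)+)"'
}
```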

### Pass YAML file

```js
// Load library
const UniversalLexer = require('universal-lexer')

const lexer = UniversalLexer.compileFromFile('scss.yaml')
```

For now, the YAML file should contain only a `Tokens` property with definitions.
Later it may gain more advanced features, like macros (for simpler syntax).

**Example:**

```yaml
Tokens:
  # Whitespaces

  - type: NewLine
    value: "\n"

  - type: Space
    regex: '[ \t]+'

  # Math

  - type: Operator
    regex: '[-+*/]'

  # Color
  # It has a 'valid' field, to be sure that it's not e.g. "blacker":
  # it will check that there is no word character after the match

  - type: Color
    regex: '(?black|white)'
    valid: '(black|white)[^\w]'
```
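
Assuming the definitions above are saved as `math.yaml` (a hypothetical file name), compiling and using them could look like this:

```js
// Load library
const UniversalLexer = require('universal-lexer')

// Compile a tokenizer from the YAML definitions above
const tokenize = UniversalLexer.compileFromFile('math.yaml')

// Should produce Color, Space, Operator and NewLine tokens
const tokens = tokenize('black + white\n').tokens
```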

## Processing data

Processing input data, after you have created a lexer, is pretty straightforward:

```js
// Load library
const UniversalLexer = require('universal-lexer')

// Create lexer
const tokenize = UniversalLexer.compileFromFile('scss.yaml')

// Tokenize input and get the tokens
const tokens = tokenize('some { background: code }').tokens
```

## Post-processing tokens

If you would like to do more advanced processing of the parsed tokens, you can pass a processor function along with the input:

```js
// Load library
const UniversalLexer = require('universal-lexer')

// Create lexer
const tokenize = UniversalLexer.compileFromFile('scss.yaml')

// This is the 'Literal' definition:
const Literal = {
  type: 'Literal',
  regex: '(?([^\t \n;"\',{}()\[\]#=:~&\\]|(\\.))+)'
}

// Create a processor which will replace all '\X' with 'X' in the value
function process (token) {
  if (token.type === 'Literal') {
    token.data.value = token.data.value.replace(/\\(.)/g, '$1')
  }

  return token
}

// Alternatively, you can return a new token
function process2 (token) {
  if (token.type !== 'Literal') {
    return token
  }

  return {
    type: 'Literal',
    data: {
      value: token.data.value.replace(/\\(.)/g, '$1')
    },
    start: token.start,
    end: token.end
  }
}

// Get all tokens...
const tokens = tokenize('some { background: code }', process).tokens
```

## Beautified code

If you would like to get beautified lexer code,
you can use the second argument of the `compile` functions:

```js
UniversalLexer.compile(definitions, true)
UniversalLexer.compileFromFile('scss.yaml', true)
```

## Possible results

On success you will receive a simple object with an array of tokens:

```js
{
  tokens: [
    { type: 'Whitespace', data: { value: ' ' }, start: 0, end: 5 },
    { type: 'Word', data: { value: 'some' }, start: 5, end: 9 }
  ]
}
```

When something is wrong, you will get information about the error:

```js
{
  error: 'Unrecognized token',
  index: 1,
  line: 1,
  column: 2
}
```
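
So a caller can check for `error` before touching the tokens. A minimal sketch, reusing the `tokenize` function from the earlier examples:

```js
const result = tokenize('some { background: code }')

if (result.error) {
  // e.g. "Unrecognized token (line 1, column 2)"
  throw new Error(`${result.error} (line ${result.line}, column ${result.column})`)
}

for (const token of result.tokens) {
  console.log(token.type, token.data)
}
```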

## Examples

For now, you can see an example of JSON semantics in the `examples/json.yaml` file.

## CLI

After installing it globally (or inside NPM scripts), the `universal-lexer` command is available:

```
Usage: universal-lexer [options] output.js

Options:
  --version       Show version number                 [boolean]
  -s, --source    Semantics file                      [required]
  -b, --beautify  Should beautify code?               [boolean] [default: true]
  -h, --help      Show help                           [boolean]

Examples:
  universal-lexer -s json.yaml lexer.js    build lexer from semantics file
```
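
The CLI is roughly equivalent to building the lexer code yourself and writing it to a file. A sketch, assuming `buildFromFile` returns the generated lexer source as a string:

```js
const fs = require('fs')
const UniversalLexer = require('universal-lexer')

// Assumption: buildFromFile returns the generated lexer code as a string
const code = UniversalLexer.buildFromFile('json.yaml')

// Equivalent of: universal-lexer -s json.yaml lexer.js
fs.writeFileSync('lexer.js', code)
```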

## Changelog

### Version 2

- **2.0.6** - bugfix for single characters
- **2.0.5** - fix mistake in README file (post-processing code)
- **2.0.4** - remove unneeded `benchmark` dependency
- **2.0.3** - add unit and E2E tests, fix small bugs
- **2.0.2** - added CLI command
- **2.0.1** - fix typo in README file
- **2.0.0** - optimized it (up to 10x faster) through expression analysis and other improvements

### Version 1

- **1.0.8** - changed syntax error positions to always start from 1
- **1.0.7** - optimize definitions with "value", make syntax errors developer-friendly
- **1.0.6** - optimized lexer performance (20% faster on average)
- **1.0.5** - fix browser version to be put into NPM package properly
- **1.0.4** - bugfix for debugging
- **1.0.3** - add proper sanitization for debug HTML
- **1.0.2** - small fixes for README file
- **1.0.1** - added Rollup.js support to build version for browser