https://github.com/jeremiah-shaulov/jstok

# jstok - JavaScript and TypeScript source code tokenizer

[Documentation Index](generated-doc/README.md)

Lets you iterate over tokens (code units) in JavaScript or TypeScript code.

## Example

```ts
// To download and run this example:
// curl 'https://raw.githubusercontent.com/jeremiah-shaulov/jstok/v2.0.1/README.md' | perl -ne '$y=$1 if /^```(.)?/; print $_ if $y&&$m; $m=$y&&($m||m~~)' > /tmp/example-p9mn.ts
// deno run /tmp/example-p9mn.ts

import {jstok, TokenType} from 'https://deno.land/x/jstok@v2.0.1/mod.ts';
import {assertEquals} from 'jsr:@std/assert/equals';

const source =
`\t// Comment
\tconsole.log(\`Current time: \${new Date}\`);
`;

assertEquals
( [...jstok(source)].map(v => Object.assign<Record<string, unknown>, unknown>({}, v)),
[ {nLine: 1, nColumn: 1, level: 0, type: TokenType.WHITESPACE, text: "\t"},
{nLine: 1, nColumn: 5, level: 0, type: TokenType.COMMENT, text: "// Comment"},
{nLine: 1, nColumn: 15, level: 0, type: TokenType.WHITESPACE, text: "\n\t"},
{nLine: 2, nColumn: 5, level: 0, type: TokenType.IDENT, text: "console"},
{nLine: 2, nColumn: 12, level: 0, type: TokenType.OTHER, text: "."},
{nLine: 2, nColumn: 13, level: 0, type: TokenType.IDENT, text: "log"},
{nLine: 2, nColumn: 16, level: 0, type: TokenType.OTHER, text: "("},
{nLine: 2, nColumn: 17, level: 1, type: TokenType.STRING_TEMPLATE_BEGIN, text: "`Current time: ${"},
{nLine: 2, nColumn: 34, level: 2, type: TokenType.IDENT, text: "new"},
{nLine: 2, nColumn: 37, level: 2, type: TokenType.WHITESPACE, text: " "},
{nLine: 2, nColumn: 38, level: 2, type: TokenType.IDENT, text: "Date"},
{nLine: 2, nColumn: 42, level: 1, type: TokenType.STRING_TEMPLATE_END, text: "}`"},
{nLine: 2, nColumn: 44, level: 0, type: TokenType.OTHER, text: ")"},
{nLine: 2, nColumn: 45, level: 0, type: TokenType.OTHER, text: ";"},
{nLine: 2, nColumn: 46, level: 0, type: TokenType.MORE_REQUEST, text: "\n"},
{nLine: 2, nColumn: 46, level: 0, type: TokenType.WHITESPACE, text: "\n"},
]
);

for (const token of jstok(source))
{ if (token.type != TokenType.MORE_REQUEST)
{ console.log(token);
}
}
```

## jstok() - Tokenize string

> `function` [jstok](generated-doc/function.jstok/README.md)(source: `string`, tabWidth: `number`=4, nLine: `number`=1, nColumn: `number`=1): Generator\<[Token](generated-doc/class.Token/README.md), `void`, `string`>

This function returns an iterator over the JavaScript or TypeScript tokens found in source code provided as a string.

It will start counting lines and chars from the provided `nLine` and `nColumn` values. When counting chars, it will respect the desired `tabWidth`.

Before returning the last token in the source, it generates [TokenType.MORE\_REQUEST](generated-doc/enum.TokenType/README.md#more_request--12).
You can ignore it, or you can respond by calling `it.next(more)` on the iterator, where `more` is a string containing the continuation of the code.
That string is concatenated with the contents of the [TokenType.MORE\_REQUEST](generated-doc/enum.TokenType/README.md#more_request--12) token, and tokenization continues.

```ts
// To download and run this example:
// curl 'https://raw.githubusercontent.com/jeremiah-shaulov/jstok/v2.0.1/README.md' | perl -ne '$y=$1 if /^```(.)?/; print $_ if $y&&$m; $m=$y&&($m||m~~)' > /tmp/example-65ya.ts
// deno run /tmp/example-65ya.ts

import {jstok, TokenType} from 'https://deno.land/x/jstok@v2.0.1/mod.ts';

let source =
`\t// Comment
\tconsole.log(\`Current time: \${new Date}\`);
`;

function getNextPart()
{ const part = source.slice(0, 10);
source = source.slice(10);
return part;
}

const it = jstok(getNextPart());
let token;
L:while ((token = it.next().value))
{ while (token.type == TokenType.MORE_REQUEST)
{ token = it.next(getNextPart()).value;
if (!token)
{ break L;
}
}

console.log(token);
}
```
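
The `it.next(more)` call relies on the standard two-way generator protocol: the value passed to `next()` becomes the result of the `yield` expression that produced the previous item. The sketch below shows that mechanism on a toy word splitter. It is not jstok's implementation; `toyWords`, `Piece` and the `'more_request'` kind are hypothetical names used only for illustration.

```ts
// Hypothetical item type for this toy example (not a jstok type).
type Piece = {kind: 'word' | 'more_request'; text: string};

// Splits space-separated words. Before emitting the last word it asks
// the caller for more input, because that word may be incomplete.
function* toyWords(source: string): Generator<Piece, void, string | undefined> {
  let rest = source;
  while (true) {
    rest = rest.replace(/^ +/, '');
    const end = rest.indexOf(' ');
    if (end !== -1) {
      // A space follows, so this word is certainly complete.
      yield {kind: 'word', text: rest.slice(0, end)};
      rest = rest.slice(end);
      continue;
    }
    // The argument of the caller's next() call arrives here.
    const more = yield {kind: 'more_request', text: rest};
    if (more) {
      rest += more;
      continue;
    }
    if (rest) {
      yield {kind: 'word', text: rest};
    }
    return;
  }
}

// Feed the input in three chunks that cut words in half.
const chunks = ['hel', 'lo wor', 'ld'];
const it = toyWords(chunks.shift()!);
const out: string[] = [];
let res = it.next();
while (!res.done) {
  if (res.value.kind === 'more_request') {
    res = it.next(chunks.shift()); // undefined once the chunks run out
  } else {
    out.push(res.value.text);
    res = it.next();
  }
}
console.log(out); // [ "hello", "world" ]
```

The last word is held back because the splitter cannot know whether more characters follow. That appears to be the same reason `jstok()` emits `MORE_REQUEST` before the last token.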

This library cannot be used to check source code syntax.
However, it returns [TokenType.ERROR](generated-doc/enum.TokenType/README.md#error--13) in 2 cases:

1. if an invalid character occurred
2. if an unbalanced bracket occurred

## Token

> `class` Token

> {

>     🔧 [constructor](generated-doc/class.Token/README.md#-constructortext-string-type-tokentype-nline-number1-ncolumn-number1-level-number0)(text: `string`, type: [TokenType](generated-doc/enum.TokenType/README.md), nLine: `number`=1, nColumn: `number`=1, level: `number`=0)

>     📄 [text](generated-doc/class.Token/README.md#-text-string): `string`

>     📄 [type](generated-doc/class.Token/README.md#-type-tokentype): [TokenType](generated-doc/enum.TokenType/README.md)

>     📄 [nLine](generated-doc/class.Token/README.md#-nline-number): `number`

>     📄 [nColumn](generated-doc/class.Token/README.md#-ncolumn-number): `number`

>     📄 [level](generated-doc/class.Token/README.md#-level-number): `number`

>     ⚙ [toString](generated-doc/class.Token/README.md#-tostring-string)(): `string`

>     ⚙ [debug](generated-doc/class.Token/README.md#-debug-string)(): `string`

>     ⚙ [getValue](generated-doc/class.Token/README.md#-getvalue-string)(): `string`

>     ⚙ [getNumberValue](generated-doc/class.Token/README.md#-getnumbervalue-number--bigint)(): `number` | `bigint`

>     ⚙ [getRegExpValue](generated-doc/class.Token/README.md#-getregexpvalue-regexp)(): RegExp

> }

- `text` - original JavaScript token text.
- `type` - Token type.
- `nLine` - Line number where this token starts.
- `nColumn` - Column number on the line where this token starts.
- `level` - Nesting level. Entering `(`, `[` and `{` increments the level counter. Also the level is incremented when entering `${` parameters in string templates.
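
As an illustration of `level`, the sketch below picks out everything nested inside the outermost brackets of a statement. To stay self-contained it uses plain objects copied from the expected output in the Example section rather than calling `jstok()`; `TokenLike` and `nestedText` are hypothetical names, not part of the library.

```ts
// Minimal stand-in for the documented Token fields (hypothetical type).
interface TokenLike {
  text: string;
  level: number;
}

// Tokens of the console.log(...) statement, with the levels listed
// in the Example section.
const tokens: TokenLike[] = [
  {text: 'console', level: 0},
  {text: '.', level: 0},
  {text: 'log', level: 0},
  {text: '(', level: 0},
  {text: '`Current time: ${', level: 1},
  {text: 'new', level: 2},
  {text: ' ', level: 2},
  {text: 'Date', level: 2},
  {text: '}`', level: 1},
  {text: ')', level: 0},
  {text: ';', level: 0},
];

// Joins the text of all tokens whose nesting level is at least minLevel.
function nestedText(list: TokenLike[], minLevel = 1): string {
  return list.filter(t => t.level >= minLevel).map(t => t.text).join('');
}

const inner = nestedText(tokens);        // the call's argument
const innermost = nestedText(tokens, 2); // the template parameter
console.log(inner);     // `Current time: ${new Date}`
console.log(innermost); // new Date
```

Note that in this data the brackets themselves carry the outer level: `(` and `)` are at level 0, and only the tokens between them are at level 1 or deeper.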

The [toString()](generated-doc/class.Token/README.md#-tostring-string) method returns the original JavaScript token (`this.text`), except for [TokenType.MORE\_REQUEST](generated-doc/enum.TokenType/README.md#more_request--12), for which it returns an empty string.
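
Judging by the expected output in the Example section, one consequence is that concatenating `toString()` over all tokens gives back the source text, because the `MORE_REQUEST` meta-token contributes nothing. A minimal sketch with plain objects standing in for `Token` instances follows; `TokenLike` and `tokenToString` are hypothetical, and the numeric types are the enum values listed under TokenType.

```ts
const MORE_REQUEST = 12; // value of TokenType.MORE_REQUEST

// Minimal stand-in for the documented Token fields (hypothetical type).
interface TokenLike {
  type: number;
  text: string;
}

// Mirrors the documented toString() rule; not the library's own code.
function tokenToString(token: TokenLike): string {
  return token.type === MORE_REQUEST ? '' : token.text;
}

// Tail of the token list from the Example section.
const tokens: TokenLike[] = [
  {type: 11, text: ')'},
  {type: 11, text: ';'},
  {type: 12, text: '\n'}, // MORE_REQUEST carrying the pending text
  {type: 0, text: '\n'},  // the same text, returned as the final WHITESPACE token
];

const rebuilt = tokens.map(tokenToString).join('');
console.log(JSON.stringify(rebuilt)); // ");\n"
```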

The [getValue()](generated-doc/class.Token/README.md#-getvalue-string) method converts a JavaScript token to its JavaScript value, if the value is a string.
- For [TokenType.COMMENT](generated-doc/enum.TokenType/README.md#comment--1) - it's the text after `//` or between `/*` and `*/`.
- For [TokenType.STRING](generated-doc/enum.TokenType/README.md#string--5) and all `TokenType.STRING_TEMPLATE*` types - it's the JavaScript value of the token.
- For [TokenType.MORE\_REQUEST](generated-doc/enum.TokenType/README.md#more_request--12) - empty string.
- For others, including [TokenType.NUMBER](generated-doc/enum.TokenType/README.md#number--4) - it's the original JavaScript token.

The [getNumberValue()](generated-doc/class.Token/README.md#-getnumbervalue-number--bigint) method returns the `Number` or `BigInt` value of the token for [TokenType.NUMBER](generated-doc/enum.TokenType/README.md#number--4) tokens. For other token types it returns `NaN`.
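
For orientation, the split between `number` and `bigint` follows JavaScript literal syntax: a literal with an `n` suffix is a `BigInt`. The helper below is a rough sketch of that rule using built-in conversions only. `literalToValue` is a hypothetical name, and this is not necessarily how jstok implements `getNumberValue()`; legacy octal literals such as `017`, for instance, are not handled.

```ts
// Converts the text of a numeric literal to its value (illustration only).
function literalToValue(text: string): number | bigint {
  // Numeric separators such as 1_000 are not accepted by Number() or BigInt().
  const digits = text.replace(/_/g, '');
  if (digits.endsWith('n')) {
    return BigInt(digits.slice(0, -1));
  }
  return Number(digits); // NaN if the text is not a numeric literal
}

console.log(literalToValue('0x1F'));       // 31
console.log(literalToValue('1_000'));      // 1000
console.log(typeof literalToValue('10n')); // bigint
console.log(literalToValue('abc'));        // NaN
```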

The [getRegExpValue()](generated-doc/class.Token/README.md#-getregexpvalue-regexp) method returns a `RegExp` object. For [TokenType.REGEXP](generated-doc/enum.TokenType/README.md#regexp--10) tokens it's the regular expression that this token represents.
For other token types this method returns a default empty `RegExp` object.

The [debug()](generated-doc/class.Token/README.md#-debug-string) method returns a string representation of this `Token` object that is ready to pass to `console.log()` for debugging.

```ts
// To download and run this example:
// curl 'https://raw.githubusercontent.com/jeremiah-shaulov/jstok/v2.0.1/README.md' | perl -ne '$y=$1 if /^```(.)?/; print $_ if $y&&$m; $m=$y&&($m||m~~)' > /tmp/example-pf4z.ts
// deno run --allow-read /tmp/example-pf4z.ts

import {jstok} from 'https://deno.land/x/jstok@v2.0.1/mod.ts';

const code = await Deno.readTextFile(new URL(import.meta.url).pathname);
const tokens = [...jstok(code)];
console.log(tokens.map(t => t.debug()).join(',\n') + ',');
```

## TokenType

> `const` `enum` TokenType

> {

>     [WHITESPACE](generated-doc/enum.TokenType/README.md#whitespace--0) = 0

>     [COMMENT](generated-doc/enum.TokenType/README.md#comment--1) = 1

>     [ATTRIBUTE](generated-doc/enum.TokenType/README.md#attribute--2) = 2

>     [IDENT](generated-doc/enum.TokenType/README.md#ident--3) = 3

>     [NUMBER](generated-doc/enum.TokenType/README.md#number--4) = 4

>     [STRING](generated-doc/enum.TokenType/README.md#string--5) = 5

>     [STRING\_TEMPLATE](generated-doc/enum.TokenType/README.md#string_template--6) = 6

>     [STRING\_TEMPLATE\_BEGIN](generated-doc/enum.TokenType/README.md#string_template_begin--7) = 7

>     [STRING\_TEMPLATE\_MID](generated-doc/enum.TokenType/README.md#string_template_mid--8) = 8

>     [STRING\_TEMPLATE\_END](generated-doc/enum.TokenType/README.md#string_template_end--9) = 9

>     [REGEXP](generated-doc/enum.TokenType/README.md#regexp--10) = 10

>     [OTHER](generated-doc/enum.TokenType/README.md#other--11) = 11

>     [MORE\_REQUEST](generated-doc/enum.TokenType/README.md#more_request--12) = 12

>     [ERROR](generated-doc/enum.TokenType/README.md#error--13) = 13

> }

- `WHITESPACE` - Any number of any whitespace characters. Two `WHITESPACE` tokens are never generated in a row.
- `COMMENT` - One single-line or multiline comment, or hashbang.
- `ATTRIBUTE` - Like `@Component`.
- `IDENT` - Can contain unicode letters. Private property names like `#flags` are also considered `IDENT`s.
- `NUMBER` - Number.
- `STRING` - String.
- `STRING_TEMPLATE` - Whole backtick-string, if it has no parameters.
- `STRING_TEMPLATE_BEGIN` - First part of a backtick-string, up to its first parameter. The contents of parameters will be tokenized separately, and returned as corresponding token types.
- `STRING_TEMPLATE_MID` - Part of backtick-string between two parameters.
- `STRING_TEMPLATE_END` - Last part of backtick-string.
- `REGEXP` - Regular expression literal.
- `OTHER` - Other tokens, like `+`, `++`, `?.`, etc.
- `MORE_REQUEST` - Before returning the last token found in the source string, [jstok()](generated-doc/function.jstok/README.md) generates this meta-token. If you then call `it.next(more)` with a nonempty string argument, this string will be appended to the last token, and the tokenization will continue.
- `ERROR` - This token type is returned in 2 situations: 1) an invalid character occurred; 2) an unbalanced bracket occurred.
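
When logging tokens, the numeric `type` can be hard to read. A minimal sketch of a lookup table built from the values listed above follows; `TOKEN_TYPE_NAMES` and `typeName` are hypothetical names, not part of jstok.

```ts
// Index in this array equals the numeric enum value listed above.
const TOKEN_TYPE_NAMES: readonly string[] = [
  'WHITESPACE',
  'COMMENT',
  'ATTRIBUTE',
  'IDENT',
  'NUMBER',
  'STRING',
  'STRING_TEMPLATE',
  'STRING_TEMPLATE_BEGIN',
  'STRING_TEMPLATE_MID',
  'STRING_TEMPLATE_END',
  'REGEXP',
  'OTHER',
  'MORE_REQUEST',
  'ERROR',
];

// Returns a readable name for a numeric token type.
function typeName(type: number): string {
  return TOKEN_TYPE_NAMES[type] ?? `UNKNOWN(${type})`;
}

console.log(typeName(3));  // IDENT
console.log(typeName(12)); // MORE_REQUEST
```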

## jstokStream() - Tokenize ReadableStream

This function tokenizes a `ReadableStream` of JavaScript or TypeScript source code.
It never generates [TokenType.MORE\_REQUEST](generated-doc/enum.TokenType/README.md#more_request--12).

> `function` [jstokStream](generated-doc/function.jstokStream/README.md)(source: ReadableStream\<Uint8Array>, tabWidth: `number`=4, nLine: `number`=1, nColumn: `number`=1, decoder: TextDecoder=defaultDecoder): AsyncGenerator\<[Token](generated-doc/class.Token/README.md), `void`, `any`>

It will start counting lines and chars from the provided `nLine` and `nColumn` values. When counting chars, it will respect the desired `tabWidth`.

If `decoder` is provided, it is used to convert bytes to text.

```ts
// To download and run this example:
// curl 'https://raw.githubusercontent.com/jeremiah-shaulov/jstok/v2.0.1/README.md' | perl -ne '$y=$1 if /^```(.)?/; print $_ if $y&&$m; $m=$y&&($m||m~~)' > /tmp/example-ksv8.ts
// deno run --allow-read /tmp/example-ksv8.ts

import {jstokStream} from 'https://deno.land/x/jstok@v2.0.1/mod.ts';

const fh = await Deno.open(new URL(import.meta.url).pathname, {read: true});
for await (const token of jstokStream(fh.readable))
{ console.log(token);
}
```
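
A note on why a decoder matters for streams: chunk boundaries can fall in the middle of a multi-byte character. The standard `TextDecoder` handles this when called with `{stream: true}`, as the sketch below shows on a hand-split UTF-8 sequence. Whether `jstokStream()` decodes in exactly this way is an assumption; the snippet uses only the built-in `TextEncoder` and `TextDecoder`.

```ts
const bytes = new TextEncoder().encode('let \u00E9 = 1;'); // "let é = 1;"

// 'é' takes two bytes in UTF-8 (0xC3 0xA9); cut the input between them.
const cut = bytes.indexOf(0xC3) + 1;
const chunks = [bytes.slice(0, cut), bytes.slice(cut)];

const decoder = new TextDecoder();
let text = '';
for (const chunk of chunks) {
  // With stream: true the decoder keeps an incomplete character
  // in its internal buffer until the next chunk arrives.
  text += decoder.decode(chunk, {stream: true});
}
text += decoder.decode(); // flush whatever is still buffered

console.log(text); // let é = 1;
```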