https://github.com/jeremiah-shaulov/jstok
JavaScript and TypeScript source code tokenizer
- Host: GitHub
- URL: https://github.com/jeremiah-shaulov/jstok
- Owner: jeremiah-shaulov
- License: mit
- Created: 2021-10-09T18:14:37.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-11-22T15:03:16.000Z (about 2 months ago)
- Last Synced: 2024-11-22T15:23:46.867Z (about 2 months ago)
- Language: TypeScript
- Size: 85 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# jstok - JavaScript and TypeScript source code tokenizer
[Documentation Index](generated-doc/README.md)

Lets you iterate over the tokens (code units) in JavaScript or TypeScript source code.
## Example
```ts
// To download and run this example:
// curl 'https://raw.githubusercontent.com/jeremiah-shaulov/jstok/v2.0.1/README.md' | perl -ne '$y=$1 if /^```(.)?/; print $_ if $y&&$m; $m=$y&&($m||m~~)' > /tmp/example-p9mn.ts
// deno run /tmp/example-p9mn.ts

import {jstok, TokenType} from 'https://deno.land/x/jstok@v2.0.1/mod.ts';
import {assertEquals} from 'jsr:@std/assert/equals';

const source =
`	// Comment
	console.log(\`Current time: \${new Date}\`);
`;

assertEquals
(	[...jstok(source)].map(v => Object.assign<Record<string, unknown>, unknown>({}, v)),
[ {nLine: 1, nColumn: 1, level: 0, type: TokenType.WHITESPACE, text: "\t"},
{nLine: 1, nColumn: 5, level: 0, type: TokenType.COMMENT, text: "// Comment"},
{nLine: 1, nColumn: 15, level: 0, type: TokenType.WHITESPACE, text: "\n\t"},
{nLine: 2, nColumn: 5, level: 0, type: TokenType.IDENT, text: "console"},
{nLine: 2, nColumn: 12, level: 0, type: TokenType.OTHER, text: "."},
{nLine: 2, nColumn: 13, level: 0, type: TokenType.IDENT, text: "log"},
{nLine: 2, nColumn: 16, level: 0, type: TokenType.OTHER, text: "("},
{nLine: 2, nColumn: 17, level: 1, type: TokenType.STRING_TEMPLATE_BEGIN, text: "`Current time: ${"},
{nLine: 2, nColumn: 34, level: 2, type: TokenType.IDENT, text: "new"},
{nLine: 2, nColumn: 37, level: 2, type: TokenType.WHITESPACE, text: " "},
{nLine: 2, nColumn: 38, level: 2, type: TokenType.IDENT, text: "Date"},
{nLine: 2, nColumn: 42, level: 1, type: TokenType.STRING_TEMPLATE_END, text: "}`"},
{nLine: 2, nColumn: 44, level: 0, type: TokenType.OTHER, text: ")"},
{nLine: 2, nColumn: 45, level: 0, type: TokenType.OTHER, text: ";"},
{nLine: 2, nColumn: 46, level: 0, type: TokenType.MORE_REQUEST, text: "\n"},
{nLine: 2, nColumn: 46, level: 0, type: TokenType.WHITESPACE, text: "\n"},
]
);

for (const token of jstok(source))
{	if (token.type != TokenType.MORE_REQUEST)
	{	console.log(token);
	}
}
```

## jstok() - Tokenize string
> `function` [jstok](generated-doc/function.jstok/README.md)(source: `string`, tabWidth: `number`=4, nLine: `number`=1, nColumn: `number`=1): Generator\<[Token](generated-doc/class.Token/README.md), `void`, `string`>

This function returns an iterator over the JavaScript or TypeScript tokens found in source code provided as a string.
It starts counting lines and columns from the provided `nLine` and `nColumn` values. When counting columns, it respects the given `tabWidth`.
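
As a rough illustration of how `tabWidth` can affect `nColumn`, here is a minimal sketch of position counting. The `advance()` helper is hypothetical and not part of jstok. It assumes 1-based positions and that a tab moves the column to the next tab stop. That assumption is consistent with the token positions in the example above (a leading tab with `tabWidth` 4 moves column 1 to column 5), but the documentation does not confirm it.

```ts
// Hypothetical helper (not part of jstok): returns the position right after `text`,
// starting from the given 1-based line and column.
// Assumption: a tab jumps to the next tab stop.
function advance(text: string, tabWidth=4, nLine=1, nColumn=1): {nLine: number, nColumn: number}
{	for (let i=0; i<text.length; i++)
	{	const c = text.charAt(i);
		if (c == '\n')
		{	// a new line resets the column
			nLine++;
			nColumn = 1;
		}
		else if (c == '\t')
		{	// move to the next tab stop
			nColumn += tabWidth - (nColumn - 1) % tabWidth;
		}
		else
		{	nColumn++;
		}
	}
	return {nLine, nColumn};
}

const afterTab = advance('\t');
console.log(`${afterTab.nLine}:${afterTab.nColumn}`); // 1:5

const afterComment = advance('\t// Comment');
console.log(`${afterComment.nLine}:${afterComment.nColumn}`); // 1:15

const afterNewLine = advance('\t// Comment\n\t');
console.log(`${afterNewLine.nLine}:${afterNewLine.nColumn}`); // 2:5
```

These positions match the `nLine` and `nColumn` values of the comment, whitespace and `console` tokens in the example above.
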
Before returning the last token in the source, it generates [TokenType.MORE\_REQUEST](generated-doc/enum.TokenType/README.md#more_request--12).
You can ignore it, or you can respond by calling `it.next(more)` on the iterator with a string argument that contains the continuation of the code.
This string will be concatenated with the contents of the [TokenType.MORE\_REQUEST](generated-doc/enum.TokenType/README.md#more_request--12) token, and the tokenization process will continue.

```ts
// To download and run this example:
// curl 'https://raw.githubusercontent.com/jeremiah-shaulov/jstok/v2.0.1/README.md' | perl -ne '$y=$1 if /^```(.)?/; print $_ if $y&&$m; $m=$y&&($m||m~~)' > /tmp/example-65ya.ts
// deno run /tmp/example-65ya.ts

import {jstok, TokenType} from 'https://deno.land/x/jstok@v2.0.1/mod.ts';

let source =
`	// Comment
	console.log(\`Current time: \${new Date}\`);
`;

// Returns the next 10 characters of the source (an empty string when exhausted)
function getNextPart()
{	const part = source.slice(0, 10);
	source = source.slice(10);
	return part;
}

const it = jstok(getNextPart());
let token;
L:while ((token = it.next().value))
{	// Answer each MORE_REQUEST with the next part of the source
	while (token.type == TokenType.MORE_REQUEST)
	{	token = it.next(getNextPart()).value;
		if (!token)
		{	break L;
		}
	}

	console.log(token);
}
```

This library cannot be used to check source code syntax.
However, it returns [TokenType.ERROR](generated-doc/enum.TokenType/README.md#error--13) in 2 cases:

1. if an invalid character occurred
2. if an unbalanced bracket occurred

## Token
> `class` Token
> {
> 🔧 [constructor](generated-doc/class.Token/README.md#-constructortext-string-type-tokentype-nline-number1-ncolumn-number1-level-number0)(text: `string`, type: [TokenType](generated-doc/enum.TokenType/README.md), nLine: `number`=1, nColumn: `number`=1, level: `number`=0)
> 📄 [text](generated-doc/class.Token/README.md#-text-string): `string`
> 📄 [type](generated-doc/class.Token/README.md#-type-tokentype): [TokenType](generated-doc/enum.TokenType/README.md)
> 📄 [nLine](generated-doc/class.Token/README.md#-nline-number): `number`
> 📄 [nColumn](generated-doc/class.Token/README.md#-ncolumn-number): `number`
> 📄 [level](generated-doc/class.Token/README.md#-level-number): `number`
> ⚙ [toString](generated-doc/class.Token/README.md#-tostring-string)(): `string`
> ⚙ [debug](generated-doc/class.Token/README.md#-debug-string)(): `string`
> ⚙ [getValue](generated-doc/class.Token/README.md#-getvalue-string)(): `string`
> ⚙ [getNumberValue](generated-doc/class.Token/README.md#-getnumbervalue-number--bigint)(): `number` | `bigint`
> ⚙ [getRegExpValue](generated-doc/class.Token/README.md#-getregexpvalue-regexp)(): RegExp
> }

- `text` - Original JavaScript token text.
- `type` - Token type.
- `nLine` - Line number where this token starts.
- `nColumn` - Column number on the line where this token starts.
- `level` - Nesting level. Entering `(`, `[` and `{` increments the level counter. The level is also incremented when entering `${` parameters in string templates.

[toString()](generated-doc/class.Token/README.md#-tostring-string) method returns the original JavaScript token text (`this.text`), except for [TokenType.MORE\_REQUEST](generated-doc/enum.TokenType/README.md#more_request--12), for which it returns an empty string.
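
The `level` field from the list above can be pictured as a bracket-depth counter. The sketch below is only an illustration with a hypothetical `bracketLevels()` helper; it is not how jstok is implemented. It works per character rather than per token and ignores strings, comments and string templates. Following the token list in the first example, it assumes that an opening bracket and its closing bracket are both reported at the outer level.

```ts
// Hypothetical helper (not part of jstok): nesting level of every character in `text`.
// Assumption: the bracket characters themselves belong to the outer level.
function bracketLevels(text: string): number[]
{	const levels: number[] = [];
	let level = 0;
	for (let i=0; i<text.length; i++)
	{	const c = text.charAt(i);
		if (c==')' || c==']' || c=='}')
		{	// leave the nested region before recording the closing bracket
			level = Math.max(0, level - 1);
		}
		levels.push(level);
		if (c=='(' || c=='[' || c=='{')
		{	// everything after the opening bracket is one level deeper
			level++;
		}
	}
	return levels;
}

console.log(bracketLevels('f(a[0])').join('')); // 0011210
```
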
[getValue()](generated-doc/class.Token/README.md#-getvalue-string) method converts the JavaScript token to its JavaScript value, if the value is a string.
- For [TokenType.COMMENT](generated-doc/enum.TokenType/README.md#comment--1) - it's the text after `//` or between `/*` and `*/`.
- For [TokenType.STRING](generated-doc/enum.TokenType/README.md#string--5) and all `TokenType.STRING_TEMPLATE*` types - it's the JavaScript value of the token.
- For [TokenType.MORE\_REQUEST](generated-doc/enum.TokenType/README.md#more_request--12) - empty string.
- For others, including [TokenType.NUMBER](generated-doc/enum.TokenType/README.md#number--4) - it's the original JavaScript token.

[getNumberValue()](generated-doc/class.Token/README.md#-getnumbervalue-number--bigint) method returns the `Number` or `BigInt` value of the token for [TokenType.NUMBER](generated-doc/enum.TokenType/README.md#number--4) tokens. For other token types it returns `NaN`.
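
To make the `number` | `bigint` return type more concrete, here is a minimal sketch of how the text of a numeric literal could be turned into a value. The `numberLiteralValue()` helper is hypothetical and not jstok's implementation. It covers only common literal forms (decimal, `0x`, `0o`, `0b`, numeric separators and the `n` suffix) and does not handle legacy octal literals such as `017`.

```ts
// Hypothetical helper (not part of jstok): value of a numeric literal given as text.
function numberLiteralValue(text: string): number | bigint
{	// numeric separators like 1_000 carry no value
	const digits = text.replace(/_/g, '');
	// integer literals with the "n" suffix are BigInt
	if (/^(0[xob][0-9a-f]+|\d+)n$/i.test(digits))
	{	return BigInt(digits.slice(0, -1));
	}
	// Number() understands decimal, hex, octal (0o) and binary (0b) forms, and gives NaN otherwise
	return Number(digits);
}

console.log(numberLiteralValue('0x1F')); // 31
console.log(numberLiteralValue('1_000')); // 1000
console.log(typeof numberLiteralValue('123n')); // bigint
```
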
[getRegExpValue()](generated-doc/class.Token/README.md#-getregexpvalue-regexp) method returns a `RegExp` object. For [TokenType.REGEXP](generated-doc/enum.TokenType/README.md#regexp--10) tokens it's the regular expression that this token represents.
For other token types this method returns just a default empty `RegExp` object.

[debug()](generated-doc/class.Token/README.md#-debug-string) method returns a string with a `console.log()`-ready representation of this `Token` object for debugging purposes.

```ts
// To download and run this example:
// curl 'https://raw.githubusercontent.com/jeremiah-shaulov/jstok/v2.0.1/README.md' | perl -ne '$y=$1 if /^```(.)?/; print $_ if $y&&$m; $m=$y&&($m||m~~)' > /tmp/example-pf4z.ts
// deno run --allow-read /tmp/example-pf4z.ts

import {jstok} from 'https://deno.land/x/jstok@v2.0.1/mod.ts';

const code = await Deno.readTextFile(new URL(import.meta.url).pathname);
const tokens = [...jstok(code)];
console.log(tokens.map(t => t.debug()).join(',\n') + ',');
```

## TokenType
> `const` `enum` TokenType
> {
> [WHITESPACE](generated-doc/enum.TokenType/README.md#whitespace--0) = 0
> [COMMENT](generated-doc/enum.TokenType/README.md#comment--1) = 1
> [ATTRIBUTE](generated-doc/enum.TokenType/README.md#attribute--2) = 2
> [IDENT](generated-doc/enum.TokenType/README.md#ident--3) = 3
> [NUMBER](generated-doc/enum.TokenType/README.md#number--4) = 4
> [STRING](generated-doc/enum.TokenType/README.md#string--5) = 5
> [STRING\_TEMPLATE](generated-doc/enum.TokenType/README.md#string_template--6) = 6
> [STRING\_TEMPLATE\_BEGIN](generated-doc/enum.TokenType/README.md#string_template_begin--7) = 7
> [STRING\_TEMPLATE\_MID](generated-doc/enum.TokenType/README.md#string_template_mid--8) = 8
> [STRING\_TEMPLATE\_END](generated-doc/enum.TokenType/README.md#string_template_end--9) = 9
> [REGEXP](generated-doc/enum.TokenType/README.md#regexp--10) = 10
> [OTHER](generated-doc/enum.TokenType/README.md#other--11) = 11
> [MORE\_REQUEST](generated-doc/enum.TokenType/README.md#more_request--12) = 12
> [ERROR](generated-doc/enum.TokenType/README.md#error--13) = 13
> }

- `WHITESPACE` - Any number of any whitespace characters. Two tokens of this type are never generated in a row.
- `COMMENT` - One single-line or multiline comment, or hashbang.
- `ATTRIBUTE` - Like `@Component`.
- `IDENT` - Can contain unicode letters. Private property names like `#flags` are also considered `IDENT`s.
- `NUMBER` - Number.
- `STRING` - String.
- `STRING_TEMPLATE` - Whole backtick-string, if it has no parameters.
- `STRING_TEMPLATE_BEGIN` - First part of a backtick-string, up to its first parameter. The contents of parameters will be tokenized separately, and returned as corresponding token types.
- `STRING_TEMPLATE_MID` - Part of backtick-string between two parameters.
- `STRING_TEMPLATE_END` - Last part of backtick-string.
- `REGEXP` - Regular expression literal.
- `OTHER` - Other tokens, like `+`, `++`, `?.`, etc.
- `MORE_REQUEST` - Before returning the last token found in the source string, [jstok()](generated-doc/function.jstok/README.md) generates this meta-token. If you then call `it.next(more)` with a nonempty string argument, this string will be appended to the last token, and the tokenization will continue.
- `ERROR` - This token type is returned in 2 situations: 1) an invalid character occurred; 2) an unbalanced bracket occurred.

## jstokStream() - Tokenize ReadableStream
This function tokenizes a `ReadableStream` of JavaScript or TypeScript source code.
It never generates [TokenType.MORE\_REQUEST](generated-doc/enum.TokenType/README.md#more_request--12).

> `function` [jstokStream](generated-doc/function.jstokStream/README.md)(source: ReadableStream\<Uint8Array>, tabWidth: `number`=4, nLine: `number`=1, nColumn: `number`=1, decoder: TextDecoder=defaultDecoder): AsyncGenerator\<[Token](generated-doc/class.Token/README.md), `void`, `any`>

It starts counting lines and columns from the provided `nLine` and `nColumn` values. When counting columns, it respects the given `tabWidth`.
If `decoder` is provided, it is used to convert bytes to text.

```ts
// To download and run this example:
// curl 'https://raw.githubusercontent.com/jeremiah-shaulov/jstok/v2.0.1/README.md' | perl -ne '$y=$1 if /^```(.)?/; print $_ if $y&&$m; $m=$y&&($m||m~~)' > /tmp/example-ksv8.ts
// deno run --allow-read /tmp/example-ksv8.ts

import {jstokStream} from 'https://deno.land/x/jstok@v2.0.1/mod.ts';

const fh = await Deno.open(new URL(import.meta.url).pathname, {read: true});
for await (const token of jstokStream(fh.readable))
{ console.log(token);
}
```
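
The `it.next(more)` protocol described for `jstok()` is what makes chunked input workable: a token that is cut off at the end of one chunk can be completed by the next one. The sketch below shows the idea with a toy word splitter that stands in for a real tokenizer. `toyTok()`, `ToyToken` and `tokenizeChunks()` are hypothetical names, and whether `jstokStream()` is implemented this way internally is an assumption, not something the documentation states.

```ts
type ToyToken = {type: 'WORD' | 'MORE_REQUEST', text: string};

// Toy stand-in for a tokenizer (NOT jstok): splits the source on spaces,
// and asks for more input before committing to the last word.
function *toyTok(source: string): Generator<ToyToken, void, string | undefined>
{	let rest = source;
	while (true)
	{	const pos = rest.indexOf(' ');
		if (pos != -1)
		{	// a word followed by a space is complete
			if (pos > 0)
			{	yield {type: 'WORD', text: rest.slice(0, pos)};
			}
			rest = rest.slice(pos + 1);
		}
		else
		{	// the last word may continue in the next chunk
			const more = yield {type: 'MORE_REQUEST', text: rest};
			if (!more)
			{	// no continuation: emit what is left and finish
				if (rest)
				{	yield {type: 'WORD', text: rest};
				}
				return;
			}
			rest += more;
		}
	}
}

// Driver that hides MORE_REQUEST from its caller by answering it with the next chunk
function tokenizeChunks(chunks: string[]): string[]
{	const words: string[] = [];
	let i = 0;
	const it = toyTok(chunks[i++] ?? '');
	let res = it.next();
	while (!res.done)
	{	if (res.value.type == 'MORE_REQUEST')
		{	res = it.next(chunks[i++]); // undefined when the chunks are exhausted
		}
		else
		{	words.push(res.value.text);
			res = it.next();
		}
	}
	return words;
}

console.log(tokenizeChunks(['con', 'sole lo', 'g']).join(',')); // console,log
```

The caller of `tokenizeChunks()` only ever sees complete words, even though both `console` and `log` were split across chunk boundaries.
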