https://github.com/rryi/tokens.jl

parse text into tokens, build memory-efficient token lists and trees,

lexer mutable-strings parser parsing string string-manipulation string-tree string-vector token tokenize

Host: GitHub
URL: https://github.com/rryi/tokens.jl
Owner: rryi
License: mit
Created: 2019-09-28T22:24:09.000Z (about 5 years ago)
Default Branch: master
Last Pushed: 2021-10-23T13:58:27.000Z (about 3 years ago)
Last Synced: 2024-03-23T22:37:08.725Z (8 months ago)
Topics: lexer, mutable-strings, parser, parsing, string, string-manipulation, string-tree, string-vector, token, tokenize
Language: Julia
Homepage:
Size: 1.89 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Tokens

[![Stable](https://img.shields.io/badge/docs-stable-blue.svg)](https://rryi.github.io/Tokens.jl/stable)
[![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://rryi.github.io/Tokens.jl/dev)
[![Build Status](https://travis-ci.com/rryi/Tokens.jl.svg?branch=master)](https://travis-ci.com/rryi/Tokens.jl)
[![Build Status](https://ci.appveyor.com/api/projects/status/github/rryi/Tokens.jl?svg=true)](https://ci.appveyor.com/project/rryi/Tokens-jl)
[![Codecov](https://codecov.io/gh/rryi/Tokens.jl/branch/master/graph/badge.svg)](https://codecov.io/gh/rryi/Tokens.jl)
[![Build Status](https://api.cirrus-ci.com/github/rryi/Tokens.jl.svg)](https://cirrus-ci.com/github/rryi/Tokens.jl)

Tokens.jl supplies tools to work with small text fragments like those produced by a lexer stage during source code processing of computer languages.
A lexer segments source text into "tokens", usually short sequences which are classified e.g. into identifiers, numbers, whitespace and so forth.
A token in the context of Tokens.jl consists of a category ID (enumeration with 16 predefined values) and a string.

Design goals are memory and runtime efficiency, in particular for very short strings. Strings of up to 7 code units are stored directly
in a minimal token structure of 8 bytes; tokens with longer strings use a separate buffer. A token vector uses one common buffer for all elements,
avoiding a heap allocation for each token created.
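
To illustrate how 7 code units and a category can share 8 bytes, here is a minimal sketch. The bit layout and the names `pack_token`, `unpack_category`, `unpack_length` and `unpack_string` are assumptions made for this example; they are not the encoding or API of Tokens.jl.

```julia
# Hypothetical layout (an assumption, not the Tokens.jl encoding):
#   bits 60-63: category (0-15)
#   bits 56-59: length in code units (0-7)
#   bits  0-55: up to 7 code units, first code unit in the highest byte

function pack_token(category::Integer, s::String)
    0 <= category <= 15 || throw(ArgumentError("category must be in 0:15"))
    n = ncodeunits(s)
    n <= 7 || throw(ArgumentError("at most 7 code units fit into 8 bytes"))
    bits = (UInt64(category) << 60) | (UInt64(n) << 56)
    for (i, cu) in enumerate(codeunits(s))
        # code unit i goes into byte 7 - i (byte 0 is the lowest)
        bits |= UInt64(cu) << (8 * (7 - i))
    end
    return bits
end

unpack_category(bits::UInt64) = Int(bits >> 60)
unpack_length(bits::UInt64) = Int((bits >> 56) & 0x0f)

function unpack_string(bits::UInt64)
    n = unpack_length(bits)
    return String(UInt8[(bits >> (8 * (7 - i))) & 0xff for i in 1:n])
end

packed = pack_token(3, "while")
println(unpack_category(packed), " ", unpack_string(packed))
```

In this layout unused bytes stay zero, so two packed tokens are equal exactly when category and contents match, and an equality check reduces to one integer comparison.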

Inspirations came from Nicholas Ormrod's talk on [strange details of std::string at Facebook](https://www.youtube.com/watch?v=kPR8h4-qZdk),
[ShortStrings](https://github.com/xiaodaigh/ShortStrings.jl) for interned strings, and
[WeakRefStrings](https://github.com/JuliaData/WeakRefStrings.jl) for string arrays using a shared content buffer.

# Current State

Tokens.jl compiles and passes its defined tests. However, test coverage is very incomplete and features are not mature. Expect breaking changes in upcoming releases before release 1.0.

# Basic structures

## AbstractToken <: AbstractString
A string with an attached token category. There are 16 categories with predefined meaning, closely related to the lexer methods in the Tokens package.

## DToken, BToken, HToken <: AbstractToken
The type central to this package comes in three flavours, identified with a prefix character.

B in BToken stands for "buffered". Token contents are stored in a separate string buffer, and a BToken instance holds a reference to it. Memory layout and behavior are very similar to a SubString. An instance consists of 16 bytes (without the buffer).
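
The buffered idea can be sketched with a plain struct that points into a shared string. The type `BufferedToken`, its fields and the function `contents` are hypothetical names for this example, and the sketch does not try to reproduce the 16-byte layout of the real BToken.

```julia
# Hypothetical sketch, not the Tokens.jl implementation.
struct BufferedToken
    category::UInt8   # token category, 0-15
    offset::Int       # number of code units preceding the token in `buffer`
    len::Int          # token length in code units
    buffer::String    # shared content buffer
end

# Return the token contents as a view into the buffer, without copying.
# Assumes that the token starts and ends on character boundaries.
function contents(t::BufferedToken)
    lo = t.offset + 1
    # SubString expects the index where the last character starts
    hi = thisind(t.buffer, t.offset + t.len)
    return SubString(t.buffer, lo, hi)
end

buf = "if x then y"
println(contents(BufferedToken(2, 5, 4, buf)))
```

Several tokens can reference the same `buf`, so creating a token copies no string data.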

D in DToken stands for "direct": token contents are encoded directly within the instance, and no separate buffer for contents is used. An instance consists of 8 bytes. Contents size is limited to 7 code units. Compared to BToken, it saves memory, increases data locality, and makes many operations much faster, e.g. comparisons.

H in HToken stands for "hybrid". HToken combines DToken and BToken in one type, similar to a Julia union type. An instance consists of 16 bytes. At runtime, it is cast to a DToken or BToken instance. In scenarios where the percentage of short tokens, which fit into a DToken, is high, it saves memory and has better overall performance.
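
One way to picture a hybrid token is a single struct whose first word carries a flag bit that selects the direct or the buffered interpretation at runtime. The sketch below rests on that assumption; `HybridToken`, `BUFFERED_FLAG`, `isdirect`, `contents` and the bit layout are invented for illustration and omit the category for brevity.

```julia
# Hypothetical sketch, not the Tokens.jl implementation.
const BUFFERED_FLAG = UInt64(1) << 63

struct HybridToken
    word::UInt64    # direct: length in bits 56-58, code units in bits 0-55
                    # buffered: flag in bit 63, length in bits 32-62, offset in bits 0-31
    buffer::String  # content buffer, empty for direct tokens
end

function HybridToken(s::String)
    n = ncodeunits(s)
    if n <= 7
        word = UInt64(n) << 56
        for (i, cu) in enumerate(codeunits(s))
            word |= UInt64(cu) << (8 * (7 - i))
        end
        return HybridToken(word, "")
    else
        # Here the string serves as its own buffer at offset 0;
        # a token vector would share one buffer among many tokens.
        return HybridToken(BUFFERED_FLAG | (UInt64(n) << 32), s)
    end
end

isdirect(t::HybridToken) = (t.word & BUFFERED_FLAG) == 0

function contents(t::HybridToken)
    if isdirect(t)
        n = Int((t.word >> 56) & 0x07)
        return String(UInt8[(t.word >> (8 * (7 - i))) & 0xff for i in 1:n])
    else
        offset = Int(t.word & 0xffffffff)
        n = Int((t.word >> 32) & 0x7fffffff)
        return String(codeunits(t.buffer)[offset+1:offset+n])
    end
end

for s in ("end", "function")
    t = HybridToken(s)
    println(contents(t), " direct=", isdirect(t))
end
```

Short tokens never touch the buffer in this sketch, which is where the locality benefit described above would come from.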

## SharedIO <: IO
An IOBuffer derivative with elaborate token support. It uses a "copy on write" approach, keeping track of the parts that are
referenced by tokens and token vectors.

## TokenVector <: AbstractVector
Memory-efficient vector implementation that uses one SharedIO for all of its token elements. Within the structure, each token consumes 8 bytes, plus its contents bytes if the contents size exceeds 7 code units.
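
The per-token cost stated above can be turned into a rough size estimate. The helper `token_vector_bytes` is a hypothetical name for this example. It applies only the two figures from the text (8 bytes per token, plus contents bytes beyond 7 code units) and ignores any overhead of the buffer or the vector itself.

```julia
# Rough estimate based on the figures in the text, not a measurement of Tokens.jl.
function token_vector_bytes(tokens)
    total = 0
    for t in tokens
        n = ncodeunits(t)
        # 8 bytes for the token; contents go to the buffer only if they exceed 7 code units
        total += 8 + (n > 7 ? n : 0)
    end
    return total
end

source = "function square(x) return x * x end"
tokens = split(source)            # whitespace split as a stand-in for a real lexer
println(token_vector_bytes(tokens))   # prints 73
```

Of the 7 fragments, only `function` (8 code units) and `square(x)` (9 code units) need buffer space, giving 7 * 8 + 8 + 9 = 73 bytes.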

## TokenTree
A compact tree structure on top of a TokenVector, intended as a syntax tree for tokens stored in ...

# AbstractToken interface