# recursive-descent-parser
Learning how to write a compiler

## Building
`npm run build` or `./build.sh`

## Methods described
There are several steps involved in running source code.

A series of passes over the data makes it easier to handle:

### Tokenize
The first step scans through the source code as a string and returns a series of tokens, each identified by their:
1. `type` - defines syntactic usage, such as identifier, keyword, operator, brackets, etc.
2. `data` - typically the string represented by the type, but it can be transformed by the preprocessor
3. line and char numbers (useful for debugging source)

For instance, a preprocessor could take several tokens:

`{type: "parenthesis", data:"("}`,

`{type: "parenthesis", data:")"}`,

`{type: "operator", data:"="}`,

`{type: "operator", data:">"}`

and turn them into

`{type: "arrow-function", data:"()=>"}`
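
To make the shape concrete, here is a minimal sketch of what such a token could look like in TypeScript. It only illustrates the fields described above and is not necessarily the exact type used in this repo:

```ts
//Hypothetical token shape based on the fields described above;
//the actual type in this repo may differ.
interface Token {
  type: string; //syntactic usage: "identifier", "keyword", "operator", "parenthesis", ...
  data: string; //the matched text, possibly rewritten by a preprocessor
  line: number; //line number in the source (useful for debugging)
  char: number; //char/column number of the token's first character
}
```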

### Preprocess
This part is still in the works, but it will essentially be a function that passes over tokens and returns a modified set.



What modifications it actually entails is up to the preprocessor (see the sketch after this list), but some examples are:
- source directives
- `.babelrc`
- special language features not supported by the parser, which can be broken down into lower-level code
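
As a minimal sketch (using the hypothetical `Token` shape from the Tokenize section, and a made-up function name), a pass performing the arrow-function merge described earlier might look like:

```ts
//Illustrative preprocessor pass: merges the token sequence "(", ")", "=", ">"
//into a single "arrow-function" token, as in the example above.
function mergeArrowFunctions(tokens: Token[]): Token[] {
  const out: Token[] = [];
  let i = 0;
  while (i < tokens.length) {
    const [a, b, c, d] = [tokens[i], tokens[i + 1], tokens[i + 2], tokens[i + 3]];
    if (a.data === "(" && b?.data === ")" && c?.data === "=" && d?.data === ">") {
      out.push({ type: "arrow-function", data: "()=>", line: a.line, char: a.char });
      i += 4; //skip the four tokens we just merged
    } else {
      out.push(a);
      i += 1;
    }
  }
  return out;
}
```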

### Parser
Creates a tree structure, called an Abstract Syntax Tree (AST), from a token array.

This is where the recursive descent part comes into play, and it's the part I came here to learn about.
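
For context only (this is not this repo's code), the core idea of recursive descent is one function per grammar rule, where rules that reference other rules become recursive calls. A toy parser for expressions like `1+2-3` could look like:

```ts
//Toy recursive-descent parser, purely illustrative.
//Grammar: expr -> term (("+" | "-") term)*   term -> number
type AstNode =
  | { kind: "number"; value: number }
  | { kind: "binary"; op: string; left: AstNode; right: AstNode };

function parseExpr(tokens: Token[], pos: { i: number }): AstNode {
  let left = parseTerm(tokens, pos);
  //keep consuming "+"/"-" operators, building the tree as we go
  while (tokens[pos.i] && (tokens[pos.i].data === "+" || tokens[pos.i].data === "-")) {
    const op = tokens[pos.i++].data;
    const right = parseTerm(tokens, pos);
    left = { kind: "binary", op, left, right };
  }
  return left;
}

function parseTerm(tokens: Token[], pos: { i: number }): AstNode {
  const t = tokens[pos.i++]; //expects a number token here
  return { kind: "number", value: Number(t.data) };
}
```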

### Interpreter / Codegen
I plan on implementing both an interpreter and a code generator.

They will take an abstract syntax tree and either:
- run it (interpreter), or
- compile it into some lower-level code, typically op codes or machine code (codegen)
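
For the interpreter half, a sketch of evaluating the toy AST above (again hypothetical, not this repo's code) is just a recursive walk:

```ts
//Illustrative tree-walking interpreter for the toy AST sketched above.
function evaluate(node: AstNode): number {
  switch (node.kind) {
    case "number":
      return node.value;
    case "binary": {
      const left = evaluate(node.left);
      const right = evaluate(node.right);
      return node.op === "+" ? left + right : left - right;
    }
  }
}
```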

## Implementation
In my process I've decided to take a language-agnostic approach, even though my end goal is probably something like `typescript/javascript`.

For instance, the tokenizer relies on a `Scanner`, which is where language syntax is actually handled, and the `tokenize` function will already be implemented for you.

To handle your own language, you'll need to implement a scanner subclass.

### Scanner
This is a class meant to be extended.

It provides functionality to implement scanning text in a more standard way, which should make debugging easier.

- addPass - for adding more syntax handling
```ts
addPass(name: string, pass: ScannerPass): this
```
Where `name` is the `token.type` assigned when the pass is successful, and `pass` is a [scanner pass](#ScannerPass).
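
Usage might look something like this, where `whitespacePass` and `identifierPass` are hypothetical [scanner passes](#ScannerPass):

```ts
//Hypothetical usage; addPass returns `this`, so registrations can be chained.
scanner
  .addPass("whitespace", whitespacePass)
  .addPass("identifier", identifierPass);
```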

### ScannerPass
Each scanner pass is meant to handle a single type of language syntax.

```ts
(data: string, offset: number): ScannerData
```
Where `data` is the source code, `offset` is the offset in the source to read from, and the return value is expected to be a [ScannerData](#ScannerData).

### ScannerData
```ts
{
  success: boolean //must be false when no data satisfying this pass's token type is found at offset
  readChars: number //number of chars that matched this type before one that didn't
  readLines: number //obsolete, this will be handled by internal code soon
  error?: string //optional - set when an error is positively identified, not necessarily every time success == false
}
```
Note that scanner data does not actually return the text that was read, only the char count.

This is to standardize the reading process, which should cause far fewer errors between implementations of languages.

Basically: don't read chars that don't fit your specification, and don't count any that don't.
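
Following those rules, a hypothetical identifier pass (names are illustrative, not this repo's code) might look like:

```ts
//Hypothetical ScannerPass: counts identifier chars ([A-Za-z_][A-Za-z0-9_]*)
//starting at offset, reporting only how many chars matched.
const identifierPass = (data: string, offset: number): ScannerData => {
  let readChars = 0;
  while (offset + readChars < data.length) {
    const ch = data[offset + readChars];
    const fits = readChars === 0 ? /[A-Za-z_]/.test(ch) : /[A-Za-z0-9_]/.test(ch);
    if (!fits) break; //stop at the first char that doesn't fit, and don't count it
    readChars++;
  }
  return {
    success: readChars > 0, //false when nothing at offset satisfies this token type
    readChars,
    readLines: 0
  };
};
```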