https://github.com/repcomm/recursive-descent-parser
Learning how to write a compiler
- Host: GitHub
- URL: https://github.com/repcomm/recursive-descent-parser
- Owner: RepComm
- Created: 2020-09-30T01:59:07.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2020-11-19T23:58:00.000Z (about 5 years ago)
- Last Synced: 2025-01-28T08:52:12.025Z (12 months ago)
- Language: TypeScript
- Size: 97.7 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: ReadMe.md
# recursive-descent-parser
Learning how to write a compiler
## Building
`npm run build` or `./build.sh`
## Methods described
There are several steps to running source code.
A series of passes over the data makes each step easier to handle:
### Tokenize
The first step scans through the source code as a string
and returns a series of tokens
identified by their:
1. `type` - defines syntactic usage, such as identifier, keyword, operator, brackets, etc.
2. `data` - typically the string represented by the type, but it can be transformed by the preprocessor
For instance, a preprocessor could take several tokens:
`{type: "parenthesis", data:"("}`,
`{type: "parenthesis", data:")"}`,
`{type: "operator", data:"="}`,
`{type: "operator", data:">"},`
And turn them into
`{type: "arrow-function", data:"()=>"}`,
3. line and char numbers (useful for debugging source)
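The tokenizing step above can be sketched as a small rule-driven loop. This is an illustrative sketch only: the `Token` shape follows the fields described above, but the rule list and regexes are assumptions, not the repository's actual API (the real project drives scanning through `Scanner` passes, described below).

```ts
// A minimal token shape matching the fields described above.
interface Token {
  type: string;
  data: string;
  line: number;
  char: number;
}

// Hypothetical tokenizer: try each rule at the current offset,
// emit a token for the first match, and track line/char numbers.
function tokenize(src: string): Token[] {
  const rules: Array<[string, RegExp]> = [
    ["whitespace", /^\s+/],
    ["identifier", /^[A-Za-z_][A-Za-z0-9_]*/],
    ["number", /^[0-9]+/],
    ["operator", /^[=+\-*/<>]/],
    ["parenthesis", /^[()]/],
  ];
  const tokens: Token[] = [];
  let offset = 0, line = 1, char = 1;
  while (offset < src.length) {
    let matched = false;
    for (const [type, re] of rules) {
      const m = re.exec(src.slice(offset));
      if (m) {
        if (type !== "whitespace") {
          tokens.push({ type, data: m[0], line, char });
        }
        // Advance line/char counters over the consumed text.
        for (const c of m[0]) {
          if (c === "\n") { line++; char = 1; } else { char++; }
        }
        offset += m[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error(`Unexpected character at line ${line}, char ${char}`);
  }
  return tokens;
}
```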
### Preprocess
This part is still in the works, but it will essentially
be a function that passes over tokens and returns a
modified set.
What modifications this actually entails is up to the preprocessor,
but some examples are:
- source directives
- `.babelrc`
- special language features
not supported by the parser that
can be broken down into lower-level constructs
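As a rough sketch, a preprocessor can be a plain function from tokens to tokens. The following hypothetical pass performs the arrow-function merge shown in the tokenize section; the `Token` shape and function name are assumptions for illustration, not the repository's API.

```ts
interface Token { type: string; data: string; }

// Hypothetical preprocessor pass: collapse the four-token sequence
// "(", ")", "=", ">" into a single "arrow-function" token.
function mergeArrowFunctions(tokens: Token[]): Token[] {
  const out: Token[] = [];
  for (let i = 0; i < tokens.length; i++) {
    const [a, b, c, d] = [tokens[i], tokens[i + 1], tokens[i + 2], tokens[i + 3]];
    if (a.data === "(" && b?.data === ")" && c?.data === "=" && d?.data === ">") {
      out.push({ type: "arrow-function", data: "()=>" });
      i += 3; // skip the three tokens we merged in
    } else {
      out.push(a);
    }
  }
  return out;
}
```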
### Parser
Creates a tree structure from a token array,
called an Abstract Syntax Tree (AST).
This is where the recursive descent part comes into play, and the part I came here to learn about.
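To make the "recursive descent" idea concrete, here is a minimal sketch for arithmetic expressions: one function per grammar rule, each calling the others recursively. The grammar, node shape, and names are illustrative assumptions, not the repository's parser.

```ts
interface Token { type: string; data: string; }
interface AstNode { type: string; value?: string; children?: AstNode[]; }

// Recursive descent over an assumed grammar:
//   expr   := term (("+" | "-") term)*
//   term   := factor (("*" | "/") factor)*
//   factor := number | "(" expr ")"
function parse(tokens: Token[]): AstNode {
  let pos = 0;
  const peek = () => tokens[pos];
  const next = () => tokens[pos++];

  function factor(): AstNode {
    const t = next();
    if (t.type === "number") return { type: "number", value: t.data };
    if (t.data === "(") {
      const inner = expr();
      next(); // consume ")"
      return inner;
    }
    throw new Error(`Unexpected token: ${t.data}`);
  }

  function term(): AstNode {
    let node = factor();
    while (peek() && (peek().data === "*" || peek().data === "/")) {
      const op = next().data;
      node = { type: "binary", value: op, children: [node, factor()] };
    }
    return node;
  }

  function expr(): AstNode {
    let node = term();
    while (peek() && (peek().data === "+" || peek().data === "-")) {
      const op = next().data;
      node = { type: "binary", value: op, children: [node, term()] };
    }
    return node;
  }

  return expr();
}
```

Because `term` binds tighter than `expr`, `1 + 2 * 3` parses with the `*` nested under the `+`, giving operator precedence for free from the call structure.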
### Interpreter / Codegen
I plan on implementing both an interpreter and code generator.
They will take an abstract syntax tree and either
- run it (interpreter), or
- compile it (codegen)
into some lower-level code
(typically opcodes, or machine code).
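The interpreter half can be sketched as a tree walk: recurse into children, then apply the node's operator. The `AstNode` shape here is an assumption carried over from the parser sketch above, not the repository's actual AST.

```ts
interface AstNode { type: string; value?: string; children?: AstNode[]; }

// Hypothetical tree-walking interpreter: evaluate number leaves
// directly and binary nodes by recursing into both children.
function evaluate(node: AstNode): number {
  if (node.type === "number") return Number(node.value);
  if (node.type === "binary") {
    const [l, r] = node.children!.map(evaluate);
    switch (node.value) {
      case "+": return l + r;
      case "-": return l - r;
      case "*": return l * r;
      case "/": return l / r;
    }
  }
  throw new Error(`Unknown node type: ${node.type}`);
}
```

A code generator would walk the same tree but emit instructions (push/add/mul, say) instead of computing values on the spot.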
## Implementation
In my process I've decided to take a language-agnostic
approach, even though my end goal is probably
something like `typescript/javascript`.
For instance, the tokenize process relies
on a `Scanner`, which is where language syntax is actually handled;
the `tokenize` function is already implemented for you.
To handle your own language, you'll need to implement
a scanner subclass.
### Scanner
This is a class meant to be extended.
It provides functionality for scanning text
in a more standard way, which should make debugging easier.
- addPass - for adding more syntax handling
```ts
addPass(name: string, pass: ScannerPass): this
```
Where `name` is the `token.type` used when the pass is successful,
and `pass` is a [scanner pass](#ScannerPass).
### ScannerPass
Each scanner pass is meant to handle a single type of
language syntax.
```ts
(data: string, offset: number): ScannerData
```
Where `data` is the source code,
`offset` is the offset in the source to read from,
and the return value is expected to be a [ScannerData](#ScannerData).
### ScannerData
```ts
{
  success: boolean   // must be false when the data at `offset` does not satisfy this pass's token type
  readChars: number  // chars that fit this type before we read something we didn't like
  readLines: number  // obsolete; this will be handled by internal code soon
  error?: string     // optional - for when an error is positively identified, not necessarily every time success == false
}
```
Note that scanner data does not actually return the text that was read, only the char count.
This is to standardize the reading process, which should cause far fewer errors
between implementations of languages.
Basically: stop reading at the first char that doesn't fit your specification, and don't count any chars that don't fit.
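Under these rules, a pass returns only counts. As a hedged example, here is what a hypothetical identifier pass might look like: the `ScannerData` fields follow the shape above, but the function itself and its character classes are illustrative assumptions, not code from the repository.

```ts
interface ScannerData {
  success: boolean;
  readChars: number;
  readLines: number;
  error?: string;
}

// Hypothetical identifier pass: count leading identifier characters
// at `offset`, returning only counts - never the text itself.
function identifierPass(data: string, offset: number): ScannerData {
  let readChars = 0;
  while (offset + readChars < data.length) {
    const ch = data[offset + readChars];
    const isStart = /[A-Za-z_]/.test(ch);
    const isPart = /[A-Za-z0-9_]/.test(ch);
    // First char must be a valid start char; later chars may be digits too.
    if (readChars === 0 ? !isStart : !isPart) break;
    readChars++;
  }
  // Identifiers never span lines, so readLines stays 0 here.
  return { success: readChars > 0, readChars, readLines: 0 };
}
```

The caller (the `tokenize` function driving the `Scanner`) would then slice `readChars` characters out of the source itself to build the token's `data`.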