https://github.com/shubhamai/grad

A custom programming language grad and its compiler, written in Rust

# grad

This project implements a custom programming language, `grad`, and its compiler, written in Rust. The compiler transforms source code into bytecode in several stages, and a custom stack-based virtual machine (VM) then interprets that bytecode.

## Getting Started

Try the language in the [playground](https://grad-lang.vercel.app).

### Example

```bash
cargo install grad
echo "let a = 10; print(a);" > example.grad
grad run example.grad
```

## Table of Contents

1. [Compiler Overview](#compiler-overview)
2. [Lexical Analysis](#lexical-analysis)
3. [Parsing](#parsing)
4. [Abstract Syntax Tree (AST)](#abstract-syntax-tree-ast)
5. [Code Generation](#code-generation)
6. [Virtual Machine](#virtual-machine)
7. [String Interning](#string-interning)
8. [Example: Program Compilation and Execution](#example-program-compilation-and-execution)
9. [Future Improvements](#future-improvements)

## Compiler Overview

The compiler follows these main stages:

1. [Lexical Analysis](./src/scanner.rs) - Tokenizes the input source code.
2. [Parsing](./src/ast.rs) - Builds an Abstract Syntax Tree (AST).
3. [Code Generation](./src/compiler.rs) - Transforms the AST into bytecode.
4. [Virtual Machine](./src/vm.rs) - Executes the generated bytecode.

## Lexical Analysis

Lexical analysis is performed by the `Lexer` struct, which splits the input source code into a series of [`Token`s](./src/scanner.rs).

```rust
pub struct Lexer {
    pub tokens: Vec<Token>,
}

#[derive(Debug, PartialEq, Clone)]
pub struct Token {
    pub token_type: TokenType,
    pub lexeme: String,
    pub span: std::ops::Range<usize>,
}
```

The `Lexer` uses [logos](https://github.com/maciejhirsz/logos) to identify token types such as keywords, identifiers, literals, and operators. It also returns the span (start and end positions) of each token in the source code.
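To make the token shape concrete, here is a minimal hand-rolled tokenizer that produces the same `Token { token_type, lexeme, span }` records. It is only a sketch: the real lexer is generated with logos, while this version uses the standard library alone, and the `TokenType` variants and the `tokenize` function are assumptions made for illustration.

```rust
use std::ops::Range;

// Assumed subset of token kinds; the real `TokenType` has more variants.
#[derive(Debug, PartialEq, Clone)]
pub enum TokenType {
    Let,
    Print,
    Identifier,
    Number,
    Equal,
    LeftParen,
    RightParen,
    Semicolon,
}

#[derive(Debug, PartialEq, Clone)]
pub struct Token {
    pub token_type: TokenType,
    pub lexeme: String,
    pub span: Range<usize>,
}

/// Splits ASCII source text into tokens, recording the byte span of each one.
/// Returns an error message when it meets a character it does not know.
pub fn tokenize(source: &str) -> Result<Vec<Token>, String> {
    let bytes = source.as_bytes();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let c = bytes[i] as char;
        if c.is_ascii_whitespace() {
            i += 1;
            continue;
        }
        let start = i;
        let token_type = if c.is_ascii_alphabetic() || c == '_' {
            // Consume the whole word, then decide keyword vs identifier.
            while i < bytes.len() && (bytes[i].is_ascii_alphanumeric() || bytes[i] == b'_') {
                i += 1;
            }
            match &source[start..i] {
                "let" => TokenType::Let,
                "print" => TokenType::Print,
                _ => TokenType::Identifier,
            }
        } else if c.is_ascii_digit() {
            while i < bytes.len() && bytes[i].is_ascii_digit() {
                i += 1;
            }
            TokenType::Number
        } else {
            i += 1;
            match c {
                '=' => TokenType::Equal,
                '(' => TokenType::LeftParen,
                ')' => TokenType::RightParen,
                ';' => TokenType::Semicolon,
                other => return Err(format!("unexpected character {other:?} at byte {start}")),
            }
        };
        tokens.push(Token {
            token_type,
            lexeme: source[start..i].to_string(),
            span: start..i,
        });
    }
    Ok(tokens)
}
```

Calling `tokenize("let a = 10; print(a);")` yields ten tokens, and the span of each token indexes straight back into the source string, which is what later stages need for error reporting.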

## Parsing

The parsing stage is implemented using a recursive descent parser with [Pratt parsing](https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing.html) for expressions.

```rust
pub struct Parser<'a> {
    lexer: &'a mut Lexer,
}
```

The parser uses methods like `parse_statement()`, `parse_expression()`, and various other parsing functions to build the Abstract Syntax Tree (AST).

The expression parsing uses the [Pratt parsing](https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing.html) technique for handling operator precedence:

```rust
fn expr_bp(lexer: &mut Lexer, min_bp: u8) -> ParseResult {
    // ... (Pratt parsing implementation by matklad)
}
```

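Each operator is assigned a binding power, so expressions that mix operators of different precedence and associativity are parsed correctly in a single pass, without a separate grammar rule for every precedence level.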
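The binding-power loop can be sketched as follows. This is an illustrative stand-alone version in the spirit of matklad's article, not the project's `expr_bp`: the `Tok`, `Expr`, and `Tokens` types, the single-character operators (with `^` standing in for `**`), and the `parse` helper are all assumptions.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Num(i64),
    Op(char),
    Eof,
}

#[derive(Debug)]
enum Expr {
    Num(i64),
    Binary(char, Box<Expr>, Box<Expr>),
}

/// Token stream stored in reverse so that `pop` yields the next token.
struct Tokens(Vec<Tok>);

impl Tokens {
    fn new(src: &str) -> Tokens {
        let mut toks = Vec::new();
        let mut chars = src.chars().peekable();
        while let Some(&c) = chars.peek() {
            if c.is_ascii_digit() {
                let mut n = 0i64;
                while let Some(d) = chars.peek().and_then(|c| c.to_digit(10)) {
                    n = n * 10 + i64::from(d);
                    chars.next();
                }
                toks.push(Tok::Num(n));
            } else {
                if !c.is_whitespace() {
                    toks.push(Tok::Op(c));
                }
                chars.next();
            }
        }
        toks.reverse();
        Tokens(toks)
    }
    fn next(&mut self) -> Tok {
        self.0.pop().unwrap_or(Tok::Eof)
    }
    fn peek(&self) -> Tok {
        self.0.last().cloned().unwrap_or(Tok::Eof)
    }
}

/// (left, right) binding powers; left > right makes `^` right-associative.
fn infix_binding_power(op: char) -> Option<(u8, u8)> {
    match op {
        '+' | '-' => Some((1, 2)),
        '*' | '/' => Some((3, 4)),
        '^' => Some((6, 5)),
        _ => None,
    }
}

fn expr_bp(tokens: &mut Tokens, min_bp: u8) -> Result<Expr, String> {
    let mut lhs = match tokens.next() {
        Tok::Num(n) => Expr::Num(n),
        Tok::Op('(') => {
            let inner = expr_bp(tokens, 0)?;
            if tokens.next() != Tok::Op(')') {
                return Err("expected ')'".to_string());
            }
            inner
        }
        other => return Err(format!("unexpected token {other:?}")),
    };
    loop {
        let op = match tokens.peek() {
            Tok::Op(op) => op,
            _ => break,
        };
        let Some((l_bp, r_bp)) = infix_binding_power(op) else { break };
        if l_bp < min_bp {
            break; // the operator binds too weakly; let the caller take it
        }
        tokens.next();
        let rhs = expr_bp(tokens, r_bp)?;
        lhs = Expr::Binary(op, Box::new(lhs), Box::new(rhs));
    }
    Ok(lhs)
}

fn to_sexpr(expr: &Expr) -> String {
    match expr {
        Expr::Num(n) => n.to_string(),
        Expr::Binary(op, l, r) => format!("({op} {} {})", to_sexpr(l), to_sexpr(r)),
    }
}

/// Parses `src` and renders the tree as an S-expression for inspection.
pub fn parse(src: &str) -> Result<String, String> {
    let mut tokens = Tokens::new(src);
    let expr = expr_bp(&mut tokens, 0)?;
    match tokens.next() {
        Tok::Eof => Ok(to_sexpr(&expr)),
        other => Err(format!("trailing token {other:?}")),
    }
}
```

With these binding powers, `parse("1 + 2 * 3")` groups the multiplication first and `parse("2 ^ 3 ^ 2")` groups to the right, which is the behaviour the binding-power pairs are meant to encode.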

## Abstract Syntax Tree (AST)

The AST is represented using the `ASTNode` enum:

```rust
pub enum ASTNode {
    IntNumber(i64),
    FloatNumber(f64),
    Identifier(String),
    Boolean(bool),
    String(String),
    Op(Ops, Vec<ASTNode>),
    Callee(String, Vec<ASTNode>),
    Let(String, Vec<ASTNode>),
    Assign(String, Vec<ASTNode>),
    If(Vec<ASTNode>, Vec<ASTNode>, Option<Vec<ASTNode>>),
    While(Vec<ASTNode>, Vec<ASTNode>),
    Print(Vec<ASTNode>),
    Function(String, Vec<String>, Vec<ASTNode>),
    Block(Vec<ASTNode>),
}
```

This structure allows for representing various language constructs, including literals, variables, function calls, control flow statements, and more.
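As a small illustration of how such a tree can be walked, the sketch below renders a node and its children as an indented outline. It uses a hypothetical, trimmed-down `Node` enum rather than the real `ASTNode`, and represents operators as plain strings instead of the `Ops` type, so treat the names as assumptions.

```rust
/// Hypothetical, trimmed-down counterpart of `ASTNode`; `Op` carries the
/// operator name as a plain string instead of the real `Ops` type.
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    IntNumber(i64),
    FloatNumber(f64),
    Identifier(String),
    Op(String, Vec<Node>),
    Let(String, Vec<Node>),
    Print(Vec<Node>),
}

/// One-line description of a node, without its children.
fn label(node: &Node) -> String {
    match node {
        Node::IntNumber(n) => format!("IntNumber({n})"),
        Node::FloatNumber(x) => format!("FloatNumber({x})"),
        Node::Identifier(name) => format!("Identifier({name})"),
        Node::Op(op, _) => format!("Op({op})"),
        Node::Let(name, _) => format!("Let({name})"),
        Node::Print(_) => "Print".to_string(),
    }
}

/// Child nodes, or an empty slice for leaves.
fn children(node: &Node) -> &[Node] {
    match node {
        Node::Op(_, kids) | Node::Let(_, kids) | Node::Print(kids) => kids.as_slice(),
        _ => &[],
    }
}

/// Appends an indented outline of `node` to `out`, two spaces per level.
pub fn render(node: &Node, depth: usize, out: &mut String) {
    out.push_str(&"  ".repeat(depth));
    out.push_str(&label(node));
    out.push('\n');
    for child in children(node) {
        render(child, depth + 1, out);
    }
}
```

Rendering a `Let` node whose value is an `Op` with two integer children produces a three-level outline, one line per node.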

## Code Generation

The code generation phase transforms the AST into bytecode that can be executed by the Virtual Machine. This process is handled by the `Compiler` struct:

```rust
pub struct Compiler {
    chunk: Chunk,
    interner: Interner,
    locals: Vec,
    local_count: usize,
    scope_depth: u8,
    functions: Vec,
    function_count: usize,
}
```

The compiler emits bytecode instructions represented by the `OpCode` enum:

```rust
#[derive(Debug, Clone, Copy, Serialize, Deserialize)]
#[repr(u8)]
pub enum OpCode {
    OpConstant,
    OpNil,
    OpTrue,
    OpFalse,
    // ... (other opcodes)
}
```

These instructions, along with their operands, are stored in a `Chunk`:

```rust
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Chunk {
    pub code: Vec<VectorType>,
    pub constants: Vec<ValueType>,
}
```

The `VectorType` enum lets the `code` vector hold both opcodes and the constant indices that serve as their operands.
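A minimal sketch of this layout is shown below. The `VectorType` variant names, the `f64` constants, and the `emit`, `emit_constant`, and `disassemble` helpers are assumptions chosen for illustration; only the idea of mixing opcodes and constant indices in one vector comes from the description above.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum OpCode {
    OpConstant,
    OpAdd,
    OpPrint,
    OpReturn,
}

/// Either an opcode or the constant-pool index that follows `OpConstant`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum VectorType {
    Code(OpCode),
    Constant(usize),
}

#[derive(Debug, Clone, Default)]
pub struct Chunk {
    pub code: Vec<VectorType>,
    pub constants: Vec<f64>, // assumed: plain numbers instead of `ValueType`
}

impl Chunk {
    pub fn emit(&mut self, op: OpCode) {
        self.code.push(VectorType::Code(op));
    }

    /// Stores `value` in the constant pool and emits `OpConstant <index>`.
    pub fn emit_constant(&mut self, value: f64) {
        self.constants.push(value);
        let index = self.constants.len() - 1;
        self.emit(OpCode::OpConstant);
        self.code.push(VectorType::Constant(index));
    }

    /// Renders one line per instruction, prefixed with its offset in `code`.
    pub fn disassemble(&self) -> Vec<String> {
        let mut lines = Vec::new();
        let mut offset = 0;
        while offset < self.code.len() {
            let (text, width) = match self.code[offset] {
                VectorType::Code(OpCode::OpConstant) => match self.code.get(offset + 1) {
                    Some(VectorType::Constant(index)) => match self.constants.get(*index) {
                        Some(value) => (format!("OP_CONSTANT {index} | {value}"), 2),
                        None => (format!("OP_CONSTANT {index} | <out of range>"), 2),
                    },
                    _ => ("OP_CONSTANT <missing operand>".to_string(), 1),
                },
                VectorType::Code(OpCode::OpAdd) => ("OP_ADD".to_string(), 1),
                VectorType::Code(OpCode::OpPrint) => ("OP_PRINT".to_string(), 1),
                VectorType::Code(OpCode::OpReturn) => ("OP_RETURN".to_string(), 1),
                VectorType::Constant(index) => (format!("<stray operand {index}>"), 1),
            };
            lines.push(format!("{offset:04} {text}"));
            offset += width;
        }
        lines
    }
}
```

Because an operand occupies its own slot in `code`, instruction offsets advance by two after `OP_CONSTANT` and by one after operand-free opcodes.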

## Virtual Machine

The Virtual Machine (VM) executes the generated bytecode. It is implemented in the `VM` struct:

```rust
pub struct VM {
    pub chunk: Chunk,
    ip: usize,
    stack: [ValueType; STACK_MAX],
    stack_top: usize,
    pub interner: Interner,
    globals: HashMap,
    call_frames: Vec,
    frame_index: usize,
}
```

The VM uses a stack-based architecture for executing instructions. It maintains a stack for operands and local variables, a global variable table, and call frames for function calls.

The main execution loop of the VM interprets each opcode and performs the corresponding operation:

```rust
pub fn run(&mut self) -> Result {
    // ... (main execution loop)
}
```
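The shape of such a loop can be sketched with a much smaller machine. Everything below is an assumption made for illustration: the `Instr` enum folds operands into the instruction, values are plain `f64`, printed text is collected in an `output` vector instead of going to stdout, and globals and call frames are left out.

```rust
const STACK_MAX: usize = 256;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Instr {
    Constant(usize),
    Add,
    Power,
    Print,
    Return,
}

pub struct Vm {
    code: Vec<Instr>,
    constants: Vec<f64>,
    ip: usize,
    stack: [f64; STACK_MAX],
    stack_top: usize,
    pub output: Vec<String>,
}

impl Vm {
    pub fn new(code: Vec<Instr>, constants: Vec<f64>) -> Vm {
        Vm { code, constants, ip: 0, stack: [0.0; STACK_MAX], stack_top: 0, output: Vec::new() }
    }

    fn push(&mut self, value: f64) -> Result<(), String> {
        if self.stack_top == STACK_MAX {
            return Err("stack overflow".to_string());
        }
        self.stack[self.stack_top] = value;
        self.stack_top += 1;
        Ok(())
    }

    fn pop(&mut self) -> Result<f64, String> {
        if self.stack_top == 0 {
            return Err("stack underflow".to_string());
        }
        self.stack_top -= 1;
        Ok(self.stack[self.stack_top])
    }

    /// Fetches, decodes, and executes instructions until `Return` or the end of the code.
    pub fn run(&mut self) -> Result<(), String> {
        while self.ip < self.code.len() {
            let instr = self.code[self.ip];
            self.ip += 1;
            match instr {
                Instr::Constant(index) => {
                    let value = *self
                        .constants
                        .get(index)
                        .ok_or_else(|| format!("no constant at index {index}"))?;
                    self.push(value)?;
                }
                Instr::Add => {
                    // The right operand was pushed last, so it is popped first.
                    let (rhs, lhs) = (self.pop()?, self.pop()?);
                    self.push(lhs + rhs)?;
                }
                Instr::Power => {
                    let (rhs, lhs) = (self.pop()?, self.pop()?);
                    self.push(lhs.powf(rhs))?;
                }
                Instr::Print => {
                    let value = self.pop()?;
                    self.output.push(value.to_string());
                }
                Instr::Return => return Ok(()),
            }
        }
        Ok(())
    }
}
```

Running the instruction sequence `Constant(0), Constant(1), Constant(2), Power, Add, Print, Return` with constants `[4.0, 2.0, 2.0]` leaves a single entry in `output`, and a malformed program such as a lone `Add` returns an error instead of panicking.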

## String Interning

To optimize string handling, the compiler uses string interning via the `Interner` struct:

```rust
pub struct Interner {
    pub map: HashMap,
    vec: Vec,
}
```

Each distinct string is stored once and assigned a unique index, so comparing two interned strings reduces to comparing their indices.
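A minimal interner along these lines might look like the sketch below. The key and index types (`String` and `usize`) and the `intern` and `lookup` method names are assumptions, since the struct's type parameters and methods are not shown here.

```rust
use std::collections::HashMap;

#[derive(Debug, Default)]
pub struct Interner {
    map: HashMap<String, usize>, // string -> index
    vec: Vec<String>,            // index -> string
}

impl Interner {
    /// Returns the index for `name`, inserting it on first sight.
    pub fn intern(&mut self, name: &str) -> usize {
        if let Some(&index) = self.map.get(name) {
            return index;
        }
        let index = self.vec.len();
        self.map.insert(name.to_string(), index);
        self.vec.push(name.to_string());
        index
    }

    /// Resolves an index back to its string, if the index is valid.
    pub fn lookup(&self, index: usize) -> Option<&str> {
        self.vec.get(index).map(String::as_str)
    }
}
```

Interning the same name twice returns the same index, so the compiler and VM can pass small integers around and resolve them to text only when needed, for example when printing.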

## Example: Program Compilation and Execution

Here's a simple program to demonstrate how it progresses through each stage of the compiler.

### Sample Program

```
print("Hello, world!");

let a = 4.0;
let b = 2**2;

print(a + b);
```

### Stage 1: Lexical Analysis

The lexer breaks down the program into tokens:

```
[
    Token { token_type: PRINT, lexeme: "print", span: 0..5 }
    Token { token_type: LeftParen, lexeme: "(", span: 5..6 }
    Token { token_type: String, lexeme: "\"Hello, world!\"", span: 6..21 }
    Token { token_type: RightParen, lexeme: ")", span: 21..22 }
    Token { token_type: SEMICOLON, lexeme: ";", span: 22..23 }
    Token { token_type: LET, lexeme: "let", span: 25..28 }
    Token { token_type: Identifier, lexeme: "a", span: 29..30 }
    ...
]
```

### Stage 2: Parsing and AST Generation

The parser creates an Abstract Syntax Tree:

```
Print
  String(""Hello, world!"")
Let(a)
  FloatNumber(4)
Let(b)
  Op(PostfixOp(StarStar))
    IntNumber(2)
    IntNumber(2)
Print
  Op(BinaryOp(Add))
    Identifier(a)
    Identifier(b)
```

### Stage 3: Code Generation

The compiler generates bytecode:

```
0000 OP_CONSTANT 0 | intr->"Hello, world!"
0002 OP_PRINT
0003 OP_CONSTANT 2 | 4
0005 OP_DEFINE_GLOBAL 1 | intr->a
0007 OP_CONSTANT 4 | 2
0009 OP_CONSTANT 5 | 2
0011 OP_POWER
0012 OP_DEFINE_GLOBAL 3 | intr->b
0014 OP_GET_GLOBAL 6 | intr->a
0016 OP_GET_GLOBAL 7 | intr->b
0018 OP_ADD
0019 OP_PRINT
0020 OP_RETURN
```
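In the listing, both operands are pushed before the operator that consumes them, which is what a post-order walk of the expression tree produces. The sketch below shows that idea on a hypothetical `Expr` type with single-character operators (`^` standing in for `**`); the `Instr` and `Chunk` definitions are simplified assumptions, not the project's `OpCode` and `Chunk`.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Number(f64),
    Binary(char, Box<Expr>, Box<Expr>),
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Instr {
    Constant(usize),
    Add,
    Multiply,
    Power,
}

#[derive(Debug, Default)]
pub struct Chunk {
    pub code: Vec<Instr>,
    pub constants: Vec<f64>,
}

/// Emits code for `expr`: operands first (left, then right), operator last.
pub fn compile(expr: &Expr, chunk: &mut Chunk) -> Result<(), String> {
    match expr {
        Expr::Number(value) => {
            // Every literal gets its own pool slot; no deduplication here.
            chunk.constants.push(*value);
            chunk.code.push(Instr::Constant(chunk.constants.len() - 1));
        }
        Expr::Binary(op, lhs, rhs) => {
            compile(lhs, chunk)?;
            compile(rhs, chunk)?;
            let instr = match *op {
                '+' => Instr::Add,
                '*' => Instr::Multiply,
                '^' => Instr::Power,
                other => return Err(format!("unsupported operator {other:?}")),
            };
            chunk.code.push(instr);
        }
    }
    Ok(())
}
```

Compiling the tree for `4 + 2 ^ 2` emits three constant loads followed by `Power` and then `Add`, so a stack machine evaluates the exponent before the sum without any explicit precedence handling at this stage.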

### Stage 4: Execution

The VM executes the bytecode, resulting in the output:

```
Hello, world!
8
```

## Future Improvements

While this compiler implements core functionality, there are several areas for potential improvements:

1. **Type System**: Implement a static type system with type inference for improved safety and performance.
2. **Optimization**: Add optimization passes to improve the generated bytecode's efficiency.
3. **Error Handling**: Enhance error reporting with more detailed messages and source code locations.
4. **Garbage Collection**: Implement a garbage collector for automatic memory management.
5. **REPL**: Implement a Read-Eval-Print Loop for interactive programming.