https://github.com/vonderklaas/tiny-lexer
A program written in pure C that performs lexical tokenization of a programming language, in this case 'tinylang'.
- Host: GitHub
- URL: https://github.com/vonderklaas/tiny-lexer
- Owner: vonderklaas
- Created: 2023-11-08T14:59:01.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-05-07T07:57:40.000Z (about 2 years ago)
- Last Synced: 2025-04-24T02:43:10.705Z (about 1 year ago)
- Topics: c, lexer, lexer-parser, lexical-analysis
- Language: C
- Homepage:
- Size: 50.8 KB
- Stars: 2
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
## README
### Description
Lexical tokenization is the conversion of text into (semantically or syntactically) meaningful lexical tokens belonging to categories defined by a lexer program. In the case of a natural language, those categories include nouns, verbs, adjectives, punctuation, etc. In the case of a programming language, the categories include identifiers, operators, grouping symbols, and data types.
### Examples
This is the source code:
```
a : integer = 0
a := 0
b : integer
b := 0
defun foo (a:integer, b:integer):integer {
}
```
These are the tokens it is broken down into:
```
Token 0: a
Token 1: :
Token 2: integer
Token 3: =
Token 4: 0
Token 5: a
Token 6: :
Token 7: =
Token 8: 0
Token 9: b
Token 10: :
Token 11: integer
Token 12: b
Token 13: :
Token 14: =
Token 15: 0
Token 16: defun
Token 17: foo
Token 18: (
Token 19: a
Token 20: :
Token 21: integer
Token 22: ,
Token 23: b
Token 24: :
Token 25: integer
Token 26: )
Token 27: :
Token 28: integer
Token 29: {
Token 30: }
```
### Compilation Stages
**Preprocessing** — ✅
Input: Source Code
Output: Modified Source Code
**Tokenization** — ✅
Input: Preprocessed Source Code
Output: Stream of Tokens
**Syntax Analysis** — (WIP)
Input: Tokens from Lexical Analysis (Tokenization)
Output: AST
**Semantic Analysis** — (WIP)
Input: AST
Output: Annotated AST with Semantic Information
**Intermediate Code Generation** — (WIP)
Input: Annotated AST
Output: IR
**Optimization** — (WIP)
Input: IR
Output: Optimized IR
**Code Generation** — (WIP)
Input: Optimized IR
Output: Machine Code or Assembly
**Linking**
Input: Compiled Machine Code
Output: Single Executable for Specific Architecture