{"id":16499172,"url":"https://github.com/hardbyte/rpython-post","last_synced_at":"2025-10-28T00:30:53.376Z","repository":{"id":79897805,"uuid":"157534473","full_name":"hardbyte/rpython-post","owner":"hardbyte","description":null,"archived":false,"fork":false,"pushed_at":"2018-11-14T11:01:50.000Z","size":895,"stargazers_count":6,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-09T07:09:45.907Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hardbyte.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-11-14T10:50:45.000Z","updated_at":"2022-10-20T13:29:05.000Z","dependencies_parsed_at":"2023-05-31T09:31:03.134Z","dependency_job_id":null,"html_url":"https://github.com/hardbyte/rpython-post","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hardbyte%2Frpython-post","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hardbyte%2Frpython-post/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hardbyte%2Frpython-post/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hardbyte%2Frpython-post/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hardbyte","download_url":"https://codeload.github.com/hardbyte/rpython-post/tar.gz/refs/heads/master","host":{"name":"GitHub","ur
l":"https://github.com","kind":"github","repositories_count":238574738,"owners_count":19494723,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-11T14:51:24.695Z","updated_at":"2025-10-28T00:30:47.985Z","avatar_url":"https://github.com/hardbyte.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"This is a tutorial style post that walks through using the RPython translation\ntoolchain to create a REPL that executes basic math expressions. \n\nWe will do that by scanning the user's input into tokens, compiling those \ntokens into bytecode and running that bytecode in our own virtual machine. Don't\nworry if that sounds horribly complicated, we are going to explain it step by\nstep. \n\nThis post is a bit of a diversion while on my journey to create a compliant \n[lox](http://www.craftinginterpreters.com/the-lox-language.html) implementation\nusing the [RPython translation toolchain](https://rpython.readthedocs.io). The \nmajority of this work is a direct RPython translation of the low level C \nguide from Bob Nystrom ([@munificentbob](https://twitter.com/munificentbob)) in the\nexcellent book [craftinginterpreters.com](https://www.craftinginterpreters.com)\nspecifically the chapters 14 – 17.\n\n\n## The road ahead\n\nAs this post is rather long I'll break it into a few major sections. In each section we will\nhave something that translates with RPython, and at the end it all comes together. 
\n\n- [REPL](#a-repl)\n- [Virtual Machine](#a-virtual-machine)\n- [Scanning the source](#scanning-the-source)\n- [Compiling Expressions](#compiling-expressions)\n- [End to end](#end-to-end)\n\n\n## A REPL\n\nSo if you're a Python programmer you might be thinking this is pretty trivial right?\n\nI mean if we ignore input errors, injection attacks etc couldn't we just do something\nlike this:\n\n```python\n\"\"\"\nA pure python REPL that can parse simple math expressions\n\"\"\"\nwhile True:\n    print(eval(raw_input(\"\u003e \")))\n```\n\nWell it does appear to do the trick:\n```\n$ python2 section-1-repl/main.py\n\u003e 3 + 4 * ((1.0/(2 * 3 * 4)) + (1.0/(4 * 5 * 6)) - (1.0/(6 * 7 * 8)))\n3.1880952381\n```\n\nSo can we just ask RPython to translate this into a binary that runs magically\nfaster?\n\nLet's see what happens. We need to add two functions for RPython to\nget its bearings (`entry_point` and `target`) and call the file `targetXXX`:\n\n[`targetrepl1.py`](section-1-repl/targetrepl1.py)\n\n```python\ndef repl():\n    while True:\n        print eval(raw_input('\u003e '))\n\n\ndef entry_point(argv):\n    repl()\n    return 0\n\n\ndef target(driver, *args):\n    return entry_point, None\n```\n\nWhich at translation time gives us this admonishment that accurately tells us\nwe are trying to call a Python built-in `raw_input` that is unfortunately not \nvalid RPython.\n\n```\n$ rpython ./1/targetrepl1.py\n...SNIP...\n[translation:ERROR] AnnotatorError: \n\nobject with a __call__ is not RPython: \u003cbuilt-in function raw_input\u003e\nProcessing block:\n block@18 is a \u003cclass 'rpython.flowspace.flowcontext.SpamBlock'\u003e \n in (target1:2)repl \n containing the following operations: \n       v0 = simple_call((builtin_function raw_input), ('\u003e ')) \n       v1 = simple_call((builtin_function eval), v0) \n       v2 = str(v1) \n       v3 = simple_call((function rpython_print_item), v2) \n       v4 = simple_call((function rpython_print_newline)) \n\n```\n\nOk 
so we can't use `raw_input` or `eval` but that doesn't faze us. Let's get \nthe input from a stdin stream and just print it out (no evaluation).\n \n\n[`targetrepl2.py`](section-1-repl/targetrepl2.py)\n```python\nfrom rpython.rlib import rfile\n\nLINE_BUFFER_LENGTH = 1024\n\n\ndef repl(stdin):\n    while True:\n        print \"\u003e \",\n        line = stdin.readline(LINE_BUFFER_LENGTH)\n        print line\n\n\ndef entry_point(argv):\n    stdin, stdout, stderr = rfile.create_stdio()\n    try:\n        repl(stdin)\n    except:\n        return 0\n\n\ndef target(driver, *args):\n    return entry_point, None\n\n```\n\nTranslate `targetrepl2.py` – we can add an optimization level if we\nare so inclined:\n\n```\n$ rpython --opt=2 section-1-repl/targetrepl2.py\n...SNIP...\n[Timer] Timings:\n[Timer] annotate                       ---  1.2 s\n[Timer] rtype_lltype                   ---  0.9 s\n[Timer] backendopt_lltype              ---  0.6 s\n[Timer] stackcheckinsertion_lltype     ---  0.0 s\n[Timer] database_c                     --- 15.0 s\n[Timer] source_c                       ---  1.6 s\n[Timer] compile_c                      ---  1.9 s\n[Timer] =========================================\n[Timer] Total:                         --- 21.2 s\n```\n\nNo errors!? Let's try it out:\n```\n$ ./target2-c \n1 + 2\n\u003e  1 + 2\n\n^C\n```\n\nAhh our first success – let's quickly deal with the flushing fail by using the \nstdout stream directly as well. 
Let's print out the input in quotes:\n\n```python\nfrom rpython.rlib import rfile\n\nLINE_BUFFER_LENGTH = 1024\n\n\ndef repl(stdin, stdout):\n    while True:\n        stdout.write(\"\u003e \")\n        line = stdin.readline(LINE_BUFFER_LENGTH)\n        print '\"%s\"' % line.strip()\n\n\ndef entry_point(argv):\n    stdin, stdout, stderr = rfile.create_stdio()\n    try:\n        repl(stdin, stdout)\n    except:\n        pass\n    return 0\n\n\ndef target(driver, *args):\n    return entry_point, None\n```\n\nTranslation works, and the test run too:\n\n```\n$ ./target3-c \n\u003e hello this seems better\n\"hello this seems better\"\n\u003e ^C\n```\n\nSo we are in a good place with taking user input and printing output... What about\nthe whole math evaluation thing we were promised? For that we can probably leave\nour RPython REPL behind for a while and connect it up at the end.\n\n## A virtual machine\n\nA virtual machine is the execution engine of our basic math interpreter. It will be very simple,\nonly able to do simple tasks like addition. I won't go into any depth to describe why we want\na virtual machine, but it is worth noting that many languages including Java and Python make \nthis decision to compile to an intermediate bytecode representation and then execute that with\na virtual machine. Alternatives are compiling directly to native machine code like (earlier versions of) the V8\nJavaScript engine, or at the other end of the spectrum executing an abstract syntax tree – \nwhich is what the [Truffle approach to building VMs](https://blog.plan99.net/graal-truffle-134d8f28fb69) is based on. \n\nWe are going to keep things very simple. 
We will have a stack where we can push and pop values,\nwe will only support floats, and our VM will only implement a few very basic operations.\n\n### OpCodes\n\nIn fact our entire instruction set is:\n\n    OP_CONSTANT\n    OP_RETURN\n    OP_NEGATE\n    OP_ADD\n    OP_SUBTRACT\n    OP_MULTIPLY\n    OP_DIVIDE\n\nSince we are targeting RPython we can't use the nice `enum` module from the Python standard\nlibrary, so instead we just define a simple class with class attributes.\n \nWe should start to get organized, so we will create a new file \n[`opcodes.py`](section-2-vm/opcodes.py) and add this:\n\n```python\nclass OpCode:\n    OP_CONSTANT = 0\n    OP_RETURN = 1\n    OP_NEGATE = 2\n    OP_ADD = 3\n    OP_SUBTRACT = 4\n    OP_MULTIPLY = 5\n    OP_DIVIDE = 6\n```\n\n### Chunks\n\nTo start with we need to get some infrastructure in place before we write the VM engine.\n\nFollowing [craftinginterpreters.com](https://www.craftinginterpreters.com/chunks-of-bytecode.html)\nwe start with a `Chunk` object which will represent our bytecode. In RPython we have access \nto Python-esque lists so our `code` object will just be a list of `OpCode` values – which are \njust integers. A list of ints, couldn't get much simpler.\n\n`section-2-vm/chunk.py`\n```python\nclass Chunk:\n    code = None\n\n    def __init__(self):\n        self.code = []\n\n    def write_chunk(self, byte):\n        self.code.append(byte)\n\n    def disassemble(self, name):\n        print \"== %s ==\\n\" % name\n        i = 0\n        while i \u003c len(self.code):\n            i = disassemble_instruction(self, i)\n```\n\n_From here on I'll only present minimal snippets of code instead of the whole lot, but \nI'll link to the repository with the complete example code. For example the \nvarious debugging helpers, including `disassemble_instruction`, aren't particularly interesting\nto include verbatim. 
See the [github repo](https://github.com/hardbyte/rpython-post/) for full details_\n\n\nWe need to check that we can create a chunk and disassemble it. The quickest way to do this\nis to use Python during development and debugging then every so often try to translate it.\n\nGetting the disassemble part through the RPython translator was a hurdle for me as I\nquickly found that many `str` methods such as `format` are not supported, and only very basic\n`%` based formatting is supported. I ended up creating helper functions for string manipulation\nsuch as:\n\n```python\ndef leftpad_string(string, width, char=\" \"):\n    l = len(string)\n    if l \u003e width:\n        return string\n    return char * (width - l) + string\n```\n\nLet's write a new `entry_point` that creates and disassembles a chunk of bytecode. We can\nset the target output name to `vm1` at the same time:\n\n[`targetvm1.py`](section-2-vm/targetvm1.py)\n```python\ndef entry_point(argv):\n    bytecode = Chunk()\n    bytecode.write_chunk(OpCode.OP_ADD)\n    bytecode.write_chunk(OpCode.OP_RETURN)\n    bytecode.disassemble(\"hello world\")\n    return 0\n\ndef target(driver, *args):\n    driver.exe_name = \"vm1\"\n    return entry_point, None\n```\n\nRunning this isn't going to be terribly interesting, but it is always nice to\nknow that it is doing what you expect:\n\n```\n$ ./vm1 \n== hello world ==\n\n0000 OP_ADD       \n0001 OP_RETURN    \n```\n\n\n### Chunks of data\n\nRef: http://www.craftinginterpreters.com/chunks-of-bytecode.html#constants\n\nSo our bytecode is missing a very crucial element – the values to operate on!\n\nAs with the bytecode we can store these constant values as part of the chunk\ndirectly in a list. Each chunk will therefore have a constant data component,\nand a code component. 
\n\nEdit the `chunk.py` file and add the new instance attribute `constants` as an\nempty list, and a new method `add_constant`.\n\n```python\n    def add_constant(self, value):\n        self.constants.append(value)\n        return len(self.constants) - 1\n\n```\n\nNow to use this new capability we can modify our example chunk\nto write in some constants before the `OP_ADD`:\n\n```python\n    bytecode = Chunk()\n    constant = bytecode.add_constant(1.0)\n    bytecode.write_chunk(OpCode.OP_CONSTANT)\n    bytecode.write_chunk(constant)\n\n    constant = bytecode.add_constant(2.0)\n    bytecode.write_chunk(OpCode.OP_CONSTANT)\n    bytecode.write_chunk(constant)\n\n    bytecode.write_chunk(OpCode.OP_ADD)\n    bytecode.write_chunk(OpCode.OP_RETURN)\n\n    bytecode.disassemble(\"adding constants\")\n```\n\n\nWhich still translates with RPython and when run gives us the following disassembled\nbytecode:\n\n```\n$ ./vm2\n== adding constants ==\n\n0000 OP_CONSTANT  (00)        '1'\n0002 OP_CONSTANT  (01)        '2'\n0004 OP_ADD       \n0005 OP_RETURN\n```\n\nWe won't go down the route of serializing the bytecode to disk, but this bytecode chunk\n(including the constant data) could be saved and executed on our VM later – like a Java\n`.class` file. Instead we will pass the bytecode directly to our VM after we've created\nit during the compilation process. \n\n### Emulation  \n\nSo those four instructions of bytecode combined with the constant value mapping\n`00 -\u003e 1.0` and `01 -\u003e 2.0` describe individual steps for our virtual machine\nto execute. 
One major point in favor of defining our own bytecode is we can \ndesign it to be really simple to execute – this makes the VM really easy to implement.\n\nAs I mentioned earlier this virtual machine will have a stack, so let's begin with that.\nNow the stack is going to be a busy little beast – as our VM takes instructions like \n`OP_ADD` it will pop off the top two values from the stack, and push the result of adding \nthem together back onto the stack. Although dynamically resizing Python lists \nare marvelous, they can be a little slow. RPython can take advantage of a constant sized\nlist which doesn't make our code much more complicated.\n\nTo do this we will define a constant sized list and track the `stack_top` directly. Note\nhow we can give the RPython translator hints by adding assertions about the state that\nthe `stack_top` will be in.\n \n\n```python\nclass VM(object):\n    STACK_MAX_SIZE = 256\n    stack = None\n    stack_top = 0\n\n    def __init__(self):\n        self._reset_stack()\n\n    def _reset_stack(self):\n        self.stack = [0] * self.STACK_MAX_SIZE\n        self.stack_top = 0\n\n    def _stack_push(self, value):\n        assert self.stack_top \u003c self.STACK_MAX_SIZE\n        self.stack[self.stack_top] = value\n        self.stack_top += 1\n\n    def _stack_pop(self):\n        assert self.stack_top \u003e= 0\n        self.stack_top -= 1\n        return self.stack[self.stack_top]\n\n    def _print_stack(self):\n        print \"         \",\n        if self.stack_top \u003c= 0:\n            print \"[]\",\n        else:\n            for i in range(self.stack_top):\n                print \"[ %s ]\" % self.stack[i],\n        print\n\n```\n\nNow we get to the main event, the hot loop, the VM engine. Hope I haven't built it up too\nmuch, it is actually really simple! 
We loop until the instructions tell us to stop \n(`OP_RETURN`), and dispatch to other simple methods based on the instruction.\n\n```python\n    def _run(self):\n        while True:\n            instruction = self._read_byte()\n\n            if instruction == OpCode.OP_RETURN:\n                print \"%s\" % self._stack_pop()\n                return InterpretResultCode.INTERPRET_OK\n            elif instruction == OpCode.OP_CONSTANT:\n                constant = self._read_constant()\n                self._stack_push(constant)\n            elif instruction == OpCode.OP_ADD:\n                self._binary_op(self._stack_add)    \n```\n\n\nNow the `_read_byte` method will have to keep track of which instruction we are up \nto. So add an instruction pointer (`ip`) to the VM with an initial value of `0`.\nThen `_read_byte` is simply getting the next bytecode (int) from the chunk's `code`:\n\n```python\n    def _read_byte(self):\n        instruction = self.chunk.code[self.ip]\n        self.ip += 1\n        return instruction\n``` \n\nIf the instruction is `OP_CONSTANT` we take the constant's address from the next byte\nof the chunk's `code`, retrieve that constant value and add it to the VM's stack.\n\n```python\n    def _read_constant(self):\n        constant_index = self._read_byte()\n        return self.chunk.constants[constant_index]\n```\n\nFinally our first arithmetic operation `OP_ADD`, what it has to achieve doesn't \nrequire much explanation: pop two values from the stack, add them together, push \nthe result. 
But since a few operations all have the same template we introduce a\nlayer of indirection – or abstraction – by introducing a reusable `_binary_op` \nhelper method.\n\n```python\n    @specialize.arg(1)\n    def _binary_op(self, operator):\n        op2 = self._stack_pop()\n        op1 = self._stack_pop()\n        result = operator(op1, op2)\n        self._stack_push(result)\n\n    @staticmethod\n    def _stack_add(op1, op2):\n        return op1 + op2\n\n``` \n\nNote we tell RPython to specialize `_binary_op` on the `operator` argument (index 1, since\n`self` is index 0). This causes\nRPython to make a copy of `_binary_op` for every value of that argument passed,\nwhich means that each copy contains a call to a particular operator, which can then be\ninlined.\n\nTo be able to run our bytecode the only thing left to do is to pass in the chunk \nand call `_run()`:\n\n```python\n    def interpret_chunk(self, chunk):\n        if self.debug_trace:\n            print \"== VM TRACE ==\"\n        self.chunk = chunk\n        self.ip = 0\n        try:\n            result = self._run()\n            return result\n        except:\n            return InterpretResultCode.INTERPRET_RUNTIME_ERROR\n```\n\n[`targetvm3.py`](./section-2-vm/targetvm3.py) connects the pieces:\n\n```python\ndef entry_point(argv):\n    bytecode = Chunk()\n    constant = bytecode.add_constant(1)\n    bytecode.write_chunk(OpCode.OP_CONSTANT)\n    bytecode.write_chunk(constant)\n    constant = bytecode.add_constant(2)\n    bytecode.write_chunk(OpCode.OP_CONSTANT)\n    bytecode.write_chunk(constant)\n    bytecode.write_chunk(OpCode.OP_ADD)\n    bytecode.write_chunk(OpCode.OP_RETURN)\n\n    vm = VM()\n    vm.interpret_chunk(bytecode)\n\n    return 0\n```\n\nI've added some trace debugging so we can see what the VM and stack are doing.\n\nThe whole thing translates with RPython, and when run gives us:\n\n```\n./vm3\n== VM TRACE ==\n          []\n0000 OP_CONSTANT  (00)        '1'\n          [ 1 ]\n0002 OP_CONSTANT  (01)        '2'\n          [ 1 ] [ 
2 ]\n0004 OP_ADD       \n          [ 3 ]\n0005 OP_RETURN    \n3\n```\n\nYes we just computed the result of `1+2`. Pat yourself on the back. \n\nAt this point it is probably worth checking that the translated executable is actually\nfaster than running our program directly in Python. For this trivial example under \n`Python2`/`pypy` this `targetvm3.py` file runs in the 20ms – 90ms region, and the \ncompiled `vm3` runs in \u003c5ms. Something useful must be happening during the translation.\n\nI won't go through the code adding support for our other instructions as they are\nvery similar and straightforward. Our VM is ready to execute our chunks of bytecode,\nbut we haven't yet worked out how to take the entered expression and turn that into\nthis simple bytecode. This is broken into two steps, scanning and compiling.\n\n## Scanning the source\n\n_All the source for this section can be found in \n[section-3-scanning](./section-3-scanning)._\n\nThe job of the scanner is to take the raw expression string and transform it into\na sequence of tokens. This scanning step will strip out whitespace and comments, \ncatch errors from invalid tokens and tokenize the string. For example the input \n`\"( 1 + 2 )\"` would get tokenized into `LEFT_PAREN, NUMBER(1), PLUS, NUMBER(2), RIGHT_PAREN`.\n\nAs with our `OpCodes` we will just define a simple Python class to define an `int`\nfor each type of token:\n\n```python\nclass TokenTypes:\n    ERROR = 0\n    EOF = 1\n    LEFT_PAREN = 2\n    RIGHT_PAREN = 3\n    MINUS = 4\n    PLUS = 5\n    SLASH = 6\n    STAR = 7\n    NUMBER = 8\n\n```\n\nA token has to keep some other information as well – keeping track of the `location` and \n`length` of the token will be helpful for error reporting. The `NUMBER` token clearly needs \nsome data about the value it is representing: we could include a copy of the source lexeme \n(e.g. 
the string `2.0`), or parse the value and store that, or – what we will do in this \nblog – use the `location` and `length` information as pointers into the original source \nstring. Every token type (except perhaps `ERROR`) will use this simple data structure: \n\n```python\nclass Token(object):\n\n    def __init__(self, start, length, token_type):\n        self.start = start\n        self.length = length\n        self.type = token_type\n```\n\nOur soon to be created scanner will create these `Token` objects which refer back to \naddresses in some source. If the scanner sees the source `\"( 1 + 2.0 )\"` it would emit\nthe following tokens:\n\n```python\nToken(0, 1, TokenTypes.LEFT_PAREN)\nToken(2, 1, TokenTypes.NUMBER)\nToken(4, 1, TokenTypes.PLUS)\nToken(6, 3, TokenTypes.NUMBER)\nToken(10, 1, TokenTypes.RIGHT_PAREN)\n```\n\n### Scanner\n\nLet's walk through the scanner [implementation](section-3-scanning/scanner.py) method\nby method. The scanner will take the source and pass through it once, creating tokens\nas it goes.\n\n```python\nclass Scanner(object):\n\n    def __init__(self, source):\n        self.source = source\n        self.start = 0\n        self.current = 0\n```\n\nThe `start` and `current` variables are character indices in the source string that point to \nthe current substring being considered as a token. \n\nFor example in the string `\"(51.05+2)\"` while we are tokenizing the number `51.05`\nwe will have `start` pointing at the `5`, and advance `current` character by character\nuntil the character is no longer part of a number. 
Midway through scanning the number \nthe `start` and `current` values might point to `1` and `4` respectively:\n\n\n| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | \n|---|---|---|---|---|---|---|---|---|\n|\"(\"|\"5\"|\"1\"|\".\"|\"0\"|\"5\"|\"+\"|\"2\"|\")\"| \n|   | ^ |   |   | ^ |   |   |   |   |\n\nFrom `current=4` the scanner peeks ahead and sees that the next character (`5`) is\na digit, so will continue to advance.\n\n| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | \n|---|---|---|---|---|---|---|---|---|\n|\"(\"|\"5\"|\"1\"|\".\"|\"0\"|\"5\"|\"+\"|\"2\"|\")\"| \n|   | ^ |   |   |   | ^ |   |   |   |\n\nWhen the scanner peeks ahead and sees the `\"+\"` it will create the number\ntoken and emit it. The method that carries out this tokenizing is `_number`:\n\n```python\n    def _number(self):\n        while self._peek().isdigit():\n            self.advance()\n\n        # Look for decimal point\n        if self._peek() == '.' and self._peek_next().isdigit():\n            self.advance()\n            while self._peek().isdigit():\n                self.advance()\n\n        return self._make_token(TokenTypes.NUMBER)\n```\n\nIt relies on a few helpers to look ahead at the upcoming characters:\n\n```python\n    def _peek(self):\n        if self._is_at_end():\n            return '\\0'\n        return self.source[self.current]\n\n    def _peek_next(self):\n        # guard the index one past `current`, not `current` itself\n        if self.current + 1 \u003e= len(self.source):\n            return '\\0'\n        return self.source[self.current+1]\n\n    def _is_at_end(self):\n        return len(self.source) == self.current\n```\n\nIf the character at `current` is still part of the number we want to call `advance`\nto move on by one character.\n\n```python\n    def advance(self):\n        self.current += 1\n        return self.source[self.current - 1]\n```\n\nOnce the `isdigit()` check fails in `_number()` we call `_make_token()` to emit the\ntoken with the `NUMBER` type.\n\n```python\n    def _make_token(self, token_type):\n        return Token(\n            start=self.start,\n     
       length=(self.current - self.start),\n            token_type=token_type\n        )\n```\n\nNote again that the token is linked to an index address in the source, rather than \nincluding the string value.\n\nOur scanner is pull based, a token will be requested via `scan_token`. First we skip \npast whitespace and depending on the characters emit the correct token:\n\n```python\n    def scan_token(self):\n        # skip any whitespace\n        while True:\n            char = self._peek()\n            if char in ' \\r\\t\\n':\n                self.advance()\n            else:\n                break\n        \n        self.start = self.current\n\n        if self._is_at_end():\n            return self._make_token(TokenTypes.EOF)\n\n        char = self.advance()\n\n        if char.isdigit():\n            return self._number()\n\n        if char == '(':\n            return self._make_token(TokenTypes.LEFT_PAREN)\n        if char == ')':\n            return self._make_token(TokenTypes.RIGHT_PAREN)\n        if char == '-':\n            return self._make_token(TokenTypes.MINUS)\n        if char == '+':\n            return self._make_token(TokenTypes.PLUS)\n        if char == '/':\n            return self._make_token(TokenTypes.SLASH)\n        if char == '*':\n            return self._make_token(TokenTypes.STAR)\n\n        return ErrorToken(\"Unexpected character\", self.current)\n``` \n\nIf this was a real programming language we were scanning, this would be the point where we \nadd support for different types of literals and any language identifiers/reserved words.\n\nAt some point we will need to parse the literal value for our numbers, but we leave that\njob for some later component; for now we'll just add a `get_token_string` helper. 
To make\nsure that RPython is happy to index arbitrary slices of `source` we add range assertions:\n\n```python\n    def get_token_string(self, token):\n        if isinstance(token, ErrorToken):\n            return token.message\n        else:\n            end_loc = token.start + token.length\n            assert end_loc \u003c len(self.source)\n            assert end_loc \u003e 0\n            return self.source[token.start:end_loc]\n\n```\n\nA simple entry point can be used to test our scanner with a hard coded \nsource string:\n\n[`targetscanner1.py`](./section-3-scanning/targetscanner1.py)\n```python\nfrom scanner import Scanner, TokenTypes, TokenTypeToName\n\n\ndef entry_point(argv):\n\n    source = \"(   1   + 2.0 )\"\n\n    scanner = Scanner(source)\n    t = scanner.scan_token()\n    while t.type != TokenTypes.EOF and t.type != TokenTypes.ERROR:\n        print TokenTypeToName[t.type],\n        if t.type == TokenTypes.NUMBER:\n            print \"(%s)\" % scanner.get_token_string(t),\n        print\n        t = scanner.scan_token()\n    return 0\n```\n\nRPython didn't complain, and lo it works:\n```\n$ ./scanner1 \nLEFT_PAREN\nNUMBER (1)\nPLUS\nNUMBER (2.0)\nRIGHT_PAREN\n```\n\nLet's connect our REPL to the scanner.\n\n[`targetscanner2.py`](./section-3-scanning/targetscanner2.py)\n```python\nfrom rpython.rlib import rfile\nfrom scanner import Scanner, TokenTypes, TokenTypeToName\n\nLINE_BUFFER_LENGTH = 1024\n\n\ndef repl(stdin, stdout):\n    while True:\n        stdout.write(\"\u003e \")\n        source = stdin.readline(LINE_BUFFER_LENGTH)\n\n        scanner = Scanner(source)\n        t = scanner.scan_token()\n        while t.type != TokenTypes.EOF and t.type != TokenTypes.ERROR:\n            print TokenTypeToName[t.type],\n            if t.type == TokenTypes.NUMBER:\n                print \"(%s)\" % scanner.get_token_string(t),\n            print\n            t = scanner.scan_token()\n\n\ndef entry_point(argv):\n    stdin, stdout, stderr = 
rfile.create_stdio()\n    try:\n        repl(stdin, stdout)\n    except:\n        pass\n    return 0\n\n```\n\nWith our REPL hooked up we can now scan tokens from arbitrary input:\n\n```\n$ ./scanner2\n\u003e (3 *4) - -3\nLEFT_PAREN\nNUMBER (3)\nSTAR\nNUMBER (4)\nRIGHT_PAREN\nMINUS\nMINUS\nNUMBER (3)\n\u003e ^C\n```\n\n## Compiling expressions\n\n### References\n\n- https://www.craftinginterpreters.com/compiling-expressions.html\n- http://effbot.org/zone/simple-top-down-parsing.htm\n\nThe final piece is to turn this sequence of tokens into our low level \nbytecode instructions for the virtual machine to execute. Buckle up, \nwe are about to write us a compiler.\n\nOur compiler will take a single pass over the tokens using \n[Vaughan Pratt’s](https://en.wikipedia.org/wiki/Vaughan_Pratt) \nparsing technique, and output a chunk of bytecode – if we do it\nright it will be compatible with our existing virtual machine.\n\nRemember the bytecode we defined above is really simple – by relying \non our stack we can transform a nested expression into a sequence of\nour bytecode operations.\n\nTo make this more concrete let's go through by hand translating an\nexpression into bytecode.\n\nOur source expression:\n```\n(3 + 2) - (7 * 2)\n```\n \nIf we were to make an abstract syntax tree we'd get something \nlike this:\n\n![AST](./images/ast.jpg)\n\nNow if we start at the first sub expression `(3+2)` we can clearly\nnote from the first open bracket that we *must* see a close bracket,\nand that the expression inside that bracket *must* be valid on its \nown. Not only that but regardless of the inside we know that the whole\nexpression still has to be valid. Let's focus on this first bracketed\nexpression, let our attention recurse into it so to speak.\n\nThis gives us a much easier problem – we just want to get our virtual\nmachine to compute `3 + 2`. 
In this bytecode dialect we would load the \ntwo constants, and then add them with `OP_ADD` like so:  \n\n```\nOP_CONSTANT  (00) '3.000000'\nOP_CONSTANT  (01) '2.000000'\nOP_ADD\n```\n\nThe effect of our vm executing these three instructions is that sitting\npretty at the top of the stack is the result of the addition. Winning.\n\nJumping back out from our bracketed expression, our next token is `MINUS`,\nat this point we have a fair idea that it must be used in an infix position. \nIn fact whatever token followed the bracketed expression it **must** be a \nvalid infix operator, if not the expression is over or had a syntax error. \n\nAssuming the best from our user (naive), we handle `MINUS` the same way\nwe handled the first `PLUS`. We've already got the first operand on the\nstack, now we compile the right operand and **then** write out the bytecode\nfor `OP_SUBTRACT`.\n\nThe right operand is another simple three instructions:\n\n```\nOP_CONSTANT  (02) '7.000000'\nOP_CONSTANT  (03) '2.000000'\nOP_MULTIPLY\n```\n\nThen we finish our top level binary expression and write an `OP_RETURN` to\nreturn the value at the top of the stack as the execution's result. Our\nfinal hand compiled program is:\n\n    \n```\nOP_CONSTANT  (00) '3.000000'\nOP_CONSTANT  (01) '2.000000'\nOP_ADD\nOP_CONSTANT  (02) '7.000000'\nOP_CONSTANT  (03) '2.000000'\nOP_MULTIPLY\nOP_SUBTRACT\nOP_RETURN\n```\n\nOk that wasn't so hard was it? 
Let's try to make our code do that.\n\nWe define a parser object which will keep track of where we are, and\nwhether things have all gone horribly wrong:\n\n```python\nclass Parser(object):\n    def __init__(self):\n        self.had_error = False\n        self.panic_mode = False\n        self.current = None\n        self.previous = None\n```\n\nThe compiler will also be a class. We'll need one of our `Scanner` instances\nto pull tokens from, and since the output is a bytecode `Chunk` let's go ahead\nand make one of those in our compiler initializer:\n\n```python\nclass Compiler(object):\n\n    def __init__(self, source):\n        self.parser = Parser()\n        self.scanner = Scanner(source)\n        self.chunk = Chunk()\n```\n\nSince we have this (empty) chunk of bytecode we will make a helper method\nto add individual bytes. Every instruction will pass from our compiler into\nan executable program through this simple method.\n\n```python\n    def emit_byte(self, byte):\n        self.current_chunk().write_chunk(byte)\n```\n\nTo quote from Bob Nystrom on the Pratt parsing technique:\n\n\u003e the implementation is a deceptively-simple handful of deeply intertwined code\n\nI don't actually think I can do justice to this section. Instead I suggest \nreading his treatment in \n[Pratt Parsers: Expression Parsing Made Easy](http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/)\nwhich explains the magic behind the parsing component. Our only major difference is \nthat instead of creating an AST we are going to directly emit bytecode for our VM.\n\nNow that I've absolved myself from taking responsibility in explaining this somewhat\ntricky concept, I'll discuss some of the code from \n[`compiler.py`](section-4-compiler/compiler.py), and walk through what happens \nfor a particular rule.\n\nI'll jump straight to the juicy bit: the table of parse rules. 
We define a `ParseRule`\nfor each token, and each rule comprises:\n- an optional handler for when the token is used as a _prefix_ (e.g. the minus in `(-2)`),\n- an optional handler for when the token is used _infix_ (e.g. the slash in `2/47`),\n- a precedence value (a number that determines what is of higher precedence)\n\n\n```python\nrules = [\n    ParseRule(None,              None,            Precedence.NONE),   # ERROR\n    ParseRule(None,              None,            Precedence.NONE),   # EOF\n    ParseRule(Compiler.grouping, None,            Precedence.CALL),   # LEFT_PAREN\n    ParseRule(None,              None,            Precedence.NONE),   # RIGHT_PAREN\n    ParseRule(Compiler.unary,    Compiler.binary, Precedence.TERM),   # MINUS\n    ParseRule(None,              Compiler.binary, Precedence.TERM),   # PLUS\n    ParseRule(None,              Compiler.binary, Precedence.FACTOR), # SLASH\n    ParseRule(None,              Compiler.binary, Precedence.FACTOR), # STAR\n    ParseRule(Compiler.number,   None,            Precedence.NONE),   # NUMBER\n]\n```\n\nThese rules really are the magic of our compiler. When we get to a particular\ntoken such as `MINUS` we check whether it is an infix operator, and if so its first\noperand has already been compiled. At all times we rely on the relative precedence, consuming \neverything with higher precedence than the operator we are currently evaluating.\n\nIn the expression:\n```\n2 + 3 * 4\n```\n\nThe `*` has higher precedence than the `+`, so `3 * 4` will be parsed together\nas the second operand to the first infix operator (the `+`), which follows\nthe [BEDMAS](https://en.wikipedia.org/wiki/Order_of_operations#Mnemonics) \norder of operations I was taught at high school.\n\nTo encode these precedence values we make another Python object moonlighting\nas an enum:\n\n```python\nclass Precedence(object):\n    NONE = 0\n    DEFAULT = 1\n    TERM = 2        # + -\n    FACTOR = 3      # * /\n    UNARY = 4       # ! 
- +\n    CALL = 5        # ()\n    PRIMARY = 6\n```\n\nWhat happens in our compiler when turning `-2.0` into bytecode? Assume we've just \npulled the token `MINUS` from the scanner. Every expression **has** to start with some\ntype of prefix – whether that is:\n- a bracket group `(`, \n- a number `2`, \n- or a prefix unary operator `-`. \n\nKnowing that, our compiler assumes there is a `prefix` handler in the rule table – in\nthis case it points us at the `unary` handler.\n\n```python\n    def parse_precedence(self, precedence):\n        # parses any expression of a given precedence level or higher\n        self.advance()\n        prefix_rule = self._get_rule(self.parser.previous.type).prefix\n        prefix_rule(self)\n``` \n\n`unary` is called:\n\n```python\n    def unary(self):\n        op_type = self.parser.previous.type\n        # Compile the operand\n        self.parse_precedence(Precedence.UNARY)\n        # Emit the operator instruction\n        if op_type == TokenTypes.MINUS:\n            self.emit_byte(OpCode.OP_NEGATE)\n```\n\nHere – before writing the `OP_NEGATE` opcode we recurse back into `parse_precedence`\nto ensure that _whatever_ follows the `MINUS` token is compiled – provided it has \nhigher precedence than `unary` – e.g. a bracketed group. \nCrucially at run time this recursive call will ensure that the result is left \non top of our stack. Armed with this knowledge, the `unary` method just\nhas to emit a single byte with the `OP_NEGATE` opcode.\n\n\n### Test compilation\n\nNow we can test our compiler by outputting disassembled bytecode\nof our user entered expressions. 
Create a new entry_point \n[`targetcompiler`](section-4-compiler/targetcompiler1.py):\n \n```python\nfrom rpython.rlib import rfile\nfrom compiler import Compiler\n\nLINE_BUFFER_LENGTH = 1024\n\n\ndef entry_point(argv):\n    stdin, stdout, stderr = rfile.create_stdio()\n\n    try:\n        while True:\n            stdout.write(\"\u003e \")\n            source = stdin.readline(LINE_BUFFER_LENGTH)\n            compiler = Compiler(source, debugging=True)\n            compiler.compile()\n    except:\n        pass\n    return 0\n```\n\nTranslate it and test it out:\n```\n$ ./compiler1 \n\u003e (2/4 + 1/2)\n== code ==\n\n0000 OP_CONSTANT  (00) '2.000000'\n0002 OP_CONSTANT  (01) '4.000000'\n0004 OP_DIVIDE    \n0005 OP_CONSTANT  (02) '1.000000'\n0007 OP_CONSTANT  (00) '2.000000'\n0009 OP_DIVIDE    \n0010 OP_ADD       \n0011 OP_RETURN\n```\n\nNow if you've made it this far you'll be eager to finally connect everything\ntogether by executing this bytecode with the virtual machine.\n\n## End to end\n\nAll the pieces slot together rather easily at this point, create a new \nfile [`targetcalc.py`](section-5-execution/targetcalc.py) and define our \nentry point:\n\n```python\nfrom rpython.rlib import rfile\nfrom compiler import Compiler\nfrom vm import VM\n\nLINE_BUFFER_LENGTH = 4096\n\n\ndef entry_point(argv):\n    stdin, stdout, stderr = rfile.create_stdio()\n    vm = VM()\n    try:\n        while True:\n            stdout.write(\"\u003e \")\n            source = stdin.readline(LINE_BUFFER_LENGTH)\n            if source:\n                compiler = Compiler(source, debugging=False)\n                compiler.compile()\n                vm.interpret_chunk(compiler.chunk)\n    except:\n        pass\n    return 0\n\n\ndef target(driver, *args):\n    driver.exe_name = \"calc\"\n    return entry_point, None\n``` \n\nLet's try catch it out with a double negative:\n\n```\n$ ./calc \n\u003e 2--3\n== VM TRACE ==\n          []\n0000 OP_CONSTANT  (00) '2.000000'\n          [ 2.000000 
]\n0002 OP_CONSTANT  (01) '3.000000'\n          [ 2.000000 ] [ 3.000000 ]\n0004 OP_NEGATE    \n          [ 2.000000 ] [ -3.000000 ]\n0005 OP_SUBTRACT  \n          [ 5.000000 ]\n0006 OP_RETURN    \n5.000000\n```\n\nOk well let's evaluate the first 50 terms of the \n[Nilakantha Series](https://en.wikipedia.org/wiki/Pi#Infinite_series):\n\n```\n$ ./calc\n\u003e 3 + 4 * ((1/(2 * 3 * 4)) + (1/(4 * 5 * 6)) - (1/(6 * 7 * 8)) + (1/(8 * 9 * 10)) - (1/(10 * 11 * 12)) + (1/(12 * 13 * 14)) - (1/(14 * 15 * 16)) + (1/(16 * 17 * 18)) - (1/(18 * 19 * 20)) + (1/(20 * 21 * 22)) - (1/(22 * 23 * 24)) + (1/(24 * 25 * 26)) - (1/(26 * 27 * 28)) + (1/(28 * 29 * 30)) - (1/(30 * 31 * 32)) + (1/(32 * 33 * 34)) - (1/(34 * 35 * 36)) + (1/(36 * 37 * 38)) - (1/(38 * 39 * 40)) + (1/(40 * 41 * 42)) - (1/(42 * 43 * 44)) + (1/(44 * 45 * 46)) - (1/(46 * 47 * 48)) + (1/(48 * 49 * 50)) - (1/(50 * 51 * 52)) + (1/(52 * 53 * 54)) - (1/(54 * 55 * 56)) + (1/(56 * 57 * 58)) - (1/(58 * 59 * 60)) + (1/(60 * 61 * 62)) - (1/(62 * 63 * 64)) + (1/(64 * 65 * 66)) - (1/(66 * 67 * 68)) + (1/(68 * 69 * 70)) - (1/(70 * 71 * 72)) + (1/(72 * 73 * 74)) - (1/(74 * 75 * 76)) + (1/(76 * 77 * 78)) - (1/(78 * 79 * 80)) + (1/(80 * 81 * 82)) - (1/(82 * 83 * 84)) + (1/(84 * 85 * 86)) - (1/(86 * 87 * 88)) + (1/(88 * 89 * 90)) - (1/(90 * 91 * 92)) + (1/(92 * 93 * 94)) - (1/(94 * 95 * 96)) + (1/(96 * 97 * 98)) - (1/(98 * 99 * 100)) + (1/(100 * 101 * 102)))\n\n== VM TRACE ==\n          []\n0000 OP_CONSTANT  (00) '3.000000'\n          [ 3.000000 ]\n0002 OP_CONSTANT  (01) '4.000000'\n...SNIP...\n0598 OP_CONSTANT  (101) '102.000000'\n          [ 3.000000 ] [ 4.000000 ] [ 0.047935 ] [ 1.000000 ] [ 10100.000000 ] [ 102.000000 ]\n0600 OP_MULTIPLY  \n          [ 3.000000 ] [ 4.000000 ] [ 0.047935 ] [ 1.000000 ] [ 1030200.000000 ]\n0601 OP_DIVIDE    \n          [ 3.000000 ] [ 4.000000 ] [ 0.047935 ] [ 0.000001 ]\n0602 OP_ADD       \n          [ 3.000000 ] [ 4.000000 ] [ 0.047936 ]\n0603 OP_MULTIPLY  \n          [ 3.000000 ] [ 0.191743 
]\n0604 OP_ADD       \n          [ 3.191743 ]\n0605 OP_RETURN    \n3.191743\n```\n\nWe just executed 606 bytes of bytecode to compute pi to 1dp!\n\nThis brings us to the end of this tutorial. To recap, we've walked through the whole \ncompilation process: from the user providing an expression string on the REPL, scanning\nthe source string into tokens, parsing the tokens while accounting for relative \nprecedence via a Pratt parser, generating bytecode, and finally executing the bytecode \non our own VM. RPython translated what we wrote into C and compiled it, meaning\nour resulting `calc` REPL is really fast.\n\n\u003e “The world is a thing of utter inordinate complexity and richness and strangeness that is absolutely awesome.”\n\u003e\n\u003e ― Douglas Adams \n\n\nMany thanks to Bob Nystrom for writing the book that inspired this post, and thanks to \nCarl Friedrich and Matt Halverson for reviewing.\n\n― Brian (@thorneynz)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhardbyte%2Frpython-post","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhardbyte%2Frpython-post","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhardbyte%2Frpython-post/lists"}