Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/artagnon/rhine-ml

🏞 an OCaml compiler for an untyped lisp
https://github.com/artagnon/rhine-ml

compiler llvm ocaml programming-language

Last synced: about 1 month ago
JSON representation

🏞 an OCaml compiler for an untyped lisp

Awesome Lists containing this project

README

        

# Rhine

Rhine is a Clojure-inspired Lisp on LLVM JIT featuring variable-length
untyped arrays, first-class functions, closures, and macros. While
Clojure hides the lower-level details by running atop the JVM, Rhine
aims to expose how common Lisp constructs map to hardware.

## Building

First, `opam switch 4.02.1` to make sure that you're running a
custom-built ocaml (for camlp4). First, run `brew install libffi`.
Then, run `opam install ocamlfind menhir core textutils ctypes`,
open a new shell to refresh env, and invoke `make`.

## Troubleshooting the build

There are a number of reasons for the build failing:

1. Silly things like `git submdoule --init` failing can be fixed
easily; just anchor the submodule to a valid commit and send a PR.

2. opam problems. If you run into one of these after following the
instructions presented above, open an issue. An upstream update
probably screwed something up.

3. Silly build issues usually arise from the build not being perfectly
parallel: due to races, the `-j8` picks the dependee too
earlier. Either go into `llvm-build/` and keep hitting `make -j8`
until it succeeds, or drop the `-j8` together, waiting for a longer
time for a predictable result.

4. Running into more serious build issues usually means that llvm
upstream has changed in a trivial way; you can attempt to fix this
yourself, or open an issue.

## How it works

An untyped system means that all values are boxed/unboxed from a
`value_t` structure at runtime:

```llvm
%value_t = type {
i32, ; type of data
i64, ; integer
i1, ; bool
i8*, ; string
%value_t**, ; array/fenv
i64, ; array/string length
double, ; double
%value_t* (i32, %value_t**, ...)*, ; function
i8 ; char
}
```

The overhead of boxing/unboxing is paid by all dynamic languages,
although multiple optimizations (including speculative optimization)
can reduce the overhead. Rhine currently only implements the basic
optimizations bundled with LLVM.

Rhine does automatic type conversions, so `(+ 3 4.2)` will do the
right thing. To implement this, IR is generated to inspect the types
of the operands (zeroth member of `value_t`), and a br (branch) is
used to take the right codepath. A possible optimization is to
generate a branchless codepath for all-integer arguments.

LLVM provides the [array
type](http://llvm.org/docs/LangRef.html#array-type) and [vector
type](http://llvm.org/docs/LangRef.html#vector-type). They cannot be
used since they are fixed-length; i.e. the length must be known at
compile-time. The problem is that a construct like `(cons 8 coll)`
generates a runtime length which is equal to the `(length coll)` + 1.
So, we malloc, getelementptr, and store by hand. It has the type
specified by the fourth member of `value_t`.

To implement first-class functions, note that all functions must have
the same type; i.e. the type of the function pointer (the seventh
member of `value_t`). How else would you implement:

```clojure
(defn map
[f coll]
(if coll
(cons (f (first coll))
(map f (rest coll)))
[]))

(defn map2
[f c1 c2]
(if (and c1 c2)
(cons (f (first c1) (first c2))
(map2 f (rest c1) (rest c2)))
[]))
```

Here, `f` takes one argument in `map`, but takes two arguments in
`map2`. A function pointer type embracing variable arguments is
implemented using the [varargs
framework](http://llvm.org/docs/LangRef.html#variable-argument-handling-intrinsics)
of LLVM. Note that `va_arg` doesn't work on x86, so Rhine extracts the
values by hand. The first argument gives the number of arguments, and
is used to implement varargs in the Rhine language. The second
argument is the closure environment (which has the same type as an
array).

Closures are simple to implement with this framework in place. First,
when a function is declared, parse out all the unbound variable names
(not present in arguments or `let`), sort the names, put it in a
hashtable for later reference, and codegen the code required to bind
the names from the `env` argument in order. At the callsite, look up
this hashtable, and pack all the corresponding environment variables
into the `env` argument. So, stuff like this will work:

```clojure
(defn quux [] (let [a aenv] (println a) (println env)))
(let [env 12 aenv 17] (quux))
```

But there's a problem because we have first-class functions. What
happens to this?

```clojure
(defn t [y] (+ x y))
(defn f [x] t)
(let [g (f 3)] (println (g 4)))
```

Here, the callsite for `t` is not in `f`, but in the anonymous
function at the end. But the anonymous function doesn't have the
environment variable `x` that `t` requires, `f` does. So, we have to
augment function pointers with the environment (resuing the fourth
argument of `value_t`).

It's important to realize that macros require that we go
back-and-forth between LLVM values and the OCaml codegen engine. How
else would you evaluate something like:

```clojure
(defmacro baz [x]
`[1 2 ~x])
(baz (+ 2 2))
```

Note that macro arguments must be passed unevaluated at the
callsite. The maco-expand stage now needs to codegen `[1 2
]`, where that `` itself needs to be codegen'ed
by evaluating the argument: the result from the compiler must be
returned as an AST object to OCaml. This requires some involved
construction of OCaml objects from C.

Another subtle point to note is that macros must be lifted out of the
program and macro-expanded at the beginning of the program. This is
because we can't suddenly codegen segments required for macroexpansion
in the middle of codegen'ing another function.

## Todo

- Lambdas. Requires lifting them out and codegen'ing them first.

- Garbage collection. LLVM provides several garbage collection
intrinsincs, but the main challenge is to make sure that the C
bindings don't leak memory.

- Custom optimizations.

- Copy-on-write for persistent data structures, and persistent
variables with `setq`. How do you access a variable's history?

- FFI to C. It's simply a question of defining a good interface,
because we already use malloc and memcpy internally in LLVM IR.

- Self-hosting compiler. Necessary to co-develop the language and
compiler.

- Optional typesystem. Use the :- syntax to provide type safety and
optimizations!

- Concurrency primitives. See Clojure's core.async.

- Support for programs spanning multiple files.

- Polished error reporting with file:line annotations.

## Notes

- LLVM codegen statements all have side-effects. In most places, order
in which variables are codegen'ed are unimportant, and lets can be
moved around; but in conditionals, statements must be codegen'ed
exactly in order: this means let statements can't be permuted,
leading to imperative-style OCaml code.

- Since LLVM IR is strongly typed, it is possible to inspect the
_types_ of llvales from OCaml at generation-time. However, there is
no way for the codegen-stage to inspect the _values_ of variables
while the program is running. This has the consequence that a loop
hinged upon an llvalue must be implemented in LLVM IR; hence, for
functions that require iterating on variable-length arrays, we end
up writing tedious LLVM IR generation code instead of an equivalent
OCaml code.

- Debugging errors from LLVM bindings is hard. Using ocamldebug does
not work, since the callstack leading up to an LLVM call in C++ is
not owned by OCaml. The alternative is to use lldb, but matching up
the C++ call with a line in the OCaml code is non-trivial.

- Implementing complex functions as builtins (by directly generating
IR) is perhaps more efficient than implementing them in Rhine, but
the gap is very small due to optimization passes applied on the
Rhine-generated IR. The marginal benefit outweighs the cost of
correctly weaving complex IR.