https://github.com/artagnon/rhine

🔬 a C++ compiler middle-end, using an LLVM backend
https://github.com/artagnon/rhine

c-plus-plus compiler compiler-design llvm programming-language

Last synced: about 2 months ago
JSON representation

🔬 a C++ compiler middle-end, using an LLVM backend

Host: GitHub
URL: https://github.com/artagnon/rhine
Owner: artagnon
License: mit
Archived: true
Created: 2015-04-02T14:40:18.000Z (about 10 years ago)
Default Branch: master
Last Pushed: 2022-02-12T08:47:36.000Z (about 3 years ago)
Last Synced: 2024-07-31T22:50:42.165Z (9 months ago)
Topics: c-plus-plus, compiler, compiler-design, llvm, programming-language
Language: C++
Homepage:
Size: 1.16 MB
Stars: 164
Watchers: 7
Forks: 8
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

        # rhine: a C++ compiler middle-end for a typed ruby

[![Build Status](https://travis-ci.org/artagnon/rhine.svg?branch=master)](https://travis-ci.org/artagnon/rhine)

rhine is designed to be a fast language utilizing the LLVM JIT featuring N-d

tensors, first-class functions, and type inference; specifying argument

types is enough. It has a full blown AST into which it embeds a UseDef graph.

rhine started off as [rhine-ml](https://github.com/artagnon/rhine-ml), and

rhine-ml was called rhine earlier.

- Effort put into rhine-ml: 2 months

- Effort put into rhine: 1 year, 1 month

## Language Features

```elixir

def bar(arithFn Function(Int -> Int -> Int)) do

  println $ arithFn 2 4

end

def addCandidate(alpha Int, beta Int) do

  ret $ alpha + beta

end

def subCandidate(gamma Int, delta Int) do

  ret $ gamma - delta

end

def main() do

  if false do

    bar addCandidate

  else

    bar subCandidate

  end

  mu = {{2}, {3}}

  println mu[1][0]

end

```

`Int` is a type annotation, and only argument types need to be annotated,

return type is inferred. `Function(Int -> Int -> Int)` is a function that takes

two integers and returns one integer, mixing in some Haskell syntax. `$` is

again from Haskell, which is basically like putting the RHS in parens.

rhine-ml, in contrast, has arrays, first-class functions, closures, variadic

arguments, macros. It's also much less buggy.

## The recursive-descent parser

rhine uses a handwritten recursive-descent parser, which is faster and reports

better errors, than the former Bison one. You will need to use a one-token

lookahead atleast, if you want to keep the code simple. This gives you one level

of:

```cpp

parseSymbol(); // Oops, the lexed token indicates that we're not in the right

               // function

parseInstruction(); // Ask it to use an existing token, not lex a new one

```

Another minor consideration is that newlines must be handled explicitly if you

want to substitute ; with a newline in the language.

```cpp

void Parser::getTok() {

  LastTok = CurTok;

  CurTok = Driver->Lexx->lex(&CurSema, &CurLoc);

  LastTokWasNewlineTerminated = false;

  while (CurTok == NEWLINE) {

    LastTokWasNewlineTerminated = true;

    CurTok = Driver->Lexx->lex(&CurSema, &CurLoc);

  }

}

```

## The AST

The AST is heavily inspired by LLVM IR, although it has some higher-level

concepts like `Tensor`. It's an SSA and has a UseDef graph embedded in it,

making analysis and transformation easy.

The main classes are `Type` and `Value`. All types like `IntType`, `FloatType`

inherit from `Type`, most of the others inherit from `Value`. A `BasicBlock` is

a `Value`, and so is `ConstantInt`.

A `BasicBlock` is a vector of `Instruction`, and this is how the AST is an SSA:

assignments are handled as a `StoreInst`; there is no real LHS, just RHS

references.

```cpp

StoreInst::StoreInst(Value *MallocedValue, Value *NewValue);

```

## UseDef in AST

`Value` is uniquified using LLVM's `FoldingSet`, and `Use` wraps it, so we can

replace one `Value` with another.

```cpp

/// A Use is basically a linked list of Value wrappers

class Use {

  Value *Val;

  Use *Prev;

  Use *Next;

   // Laid out in memory as [User] - [Use1] - [Use2]. Use2 has DistToUser 2

  unsigned DistToUser;

};

```

An `Instruction` is a `User`. `User` and its `Use` values are laid out

sequentially in memory, so it's possible to reach all the `Use` values from the

`User`. It's also possible to reach the `User` from any `Use`, using

`DistToUser`.

```cpp

class User : public Value {

protected:

  unsigned NumOperands;

};

class Instruction : User;

```

The `User` has a custom `new` to allocate memory for the `Use` instances

as well.

```cpp

  void *User::operator new(size_t Size, unsigned Us) {

    void *Storage = ::operator new (Us * sizeof(Use) + Size);

    auto Start = static_cast(Storage);

    auto End = Start + Us;

    for (unsigned Iter = 0; Iter < Us; Iter++) {

      new (Start + Iter) Use(Us - Iter);

    }

    auto Obj = reinterpret_cast(End);

    return Obj;

  }

};

```

## The Context

The Context is a somewhat large object that keeps the uniqified `Type` and

`Value` instances. It also keeps track of `Externals`, the external C functions

that are provided as part of a "standard library". Unique `llvm::Builder` and

`llvm::Context` objects, as well as the `DiagnosticPrinter` are exposed member

variables. Finally, it is necessary for symbol resolution, and keeps the

`ResolutionMap`.

## Symbol resolution

src/Transform/Resolve is an example of something that utilizes the UseDef embedded

in the AST.

```elixir

  B = A + 2

```

creates one `UnresolvedValue`, `A`, an `AddInst`, and a `MallocInst`,

which takes the string "B" and `AddInst` as operands.

The transform basically goes over all the `Instruction` in the `BasicBlock`,

resolves `UnresolvedValue` instances, and sets the `Use` to the resolved value.

It hence replaces the `Value` underneath the `Use`, and since the `Instruction`

is referencing `Use` instances, there are no dangling references.

```cpp

if (auto S = K->Map.get(V, Block)) {

  /// %S = 2;

  ///  ^

  /// Came from here (MallocInst, Argument, or Prototype)

  ///

  /// Foo(%S);

  ///      ^

  ///  UnresolvedValue; replace with %Replacement

  if (auto M = dyn_cast(S)) {

    if (dyn_cast(U->getUser()))

      U.set(M);

  }

}

```

## Type Inference

Type Inference is too simple. One `visit` function is overloaded for all

possible `Value` classes.

```cpp

Type *TypeInfer::visit(MallocInst *V) {

  V->setType(visit(V->getVal()));

  assert(!V->isUnTyped() && "unable to type infer MallocInst");

  return VoidType::get(K);

}

```

## Building

The desired directory structure is:

```

bin/ ; if you downloaded the tarball for this

    cmake

    ninja

    flex

src/

    rhine/

            README.md

            llvm/ ; git submodule update --init to get the sources

            llvm-build/

                        bin/

                            llvm-config ; you need to call this to build

            rhine-build/

                        rhine ; the executable

```

On an OSX where you have everything:

```sh

$ brew install flex

$ brew link --force flex

$ git submodule update --init

$ cd llvm-build

# rhine is buggy; without debugging symbols, you can't report a useful bug

$ cmake -GNinja -DCMAKE_BUILD_TYPE=Debug ../llvm

$ export PATH=`pwd`/bin:$PATH

$ cd ../rhine-build

$ cmake -GNinja ..

# this will run the packages unittests, which should all pass

$ ninja check

```

On a Linux where you have nothing (and no root privileges are required):

Get [git-lfs](https://git-lfs.github.com/), and fetch cmake-ninja-flex.tar.bz2

```sh

$ git lfs fetch

```

Untar it and set up environment variables.

```sh

$ tar xf cmake-ninja-flex.tar.bz2

$ cd cmake-ninja-flex

# for bash/zsh

$ export TOOLS_ROOT=`pwd`

$ export PATH=$TOOLS_ROOT:$PATH

# for csh

$ setenv TOOLS_ROOT `pwd`

$ setenv PATH $TOOLS_ROOT:$PATH

```

Then,

```sh

$ git submodule update --init

$ cd llvm-build

# rhine is buggy; without debugging symbols, you can't report a useful bug

$ cmake -GNinja -DCMAKE_BUILD_TYPE=Debug ../llvm

$ ninja

$ export PATH=`pwd`/bin:$PATH

$ cd ../rhine-build

# flex isn't picked up from $PATH

$ cmake -GNinja -DTOOLS_ROOT=$TOOLS_ROOT -DFLEX_EXECUTABLE=$TOOLS_ROOT/flex ..

# if there are build (usually link) errors, please open an issue

# tests are currently failing on Linux, need to look into it

$ ninja check

```

## Commentary

An inefficient untyped language is easy to implement. `println` taking 23 and

"twenty three" as arguments is a simple matter of switching on

type-when-unboxed. There's no need to rewrite the value in IR, and certainly no

need to come up with an overloading scheme.

[Crystal](http://crystal-lang.org/) made a good decision to start with Ruby. If

your idea is to self-host, then the original language's efficiency does not

matter. All you need is good generated assembly (which LLVM makes easy).

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/artagnon/rhine

Awesome Lists containing this project

README