https://github.com/maroontress/clione.java

A C17 lexical parser written in Java.
Host: GitHub
URL: https://github.com/maroontress/clione.java
Owner: maroontress
License: other
Created: 2022-01-22T11:31:46.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2022-01-24T17:30:07.000Z (over 4 years ago)
Last Synced: 2025-01-21T21:32:02.291Z (over 1 year ago)
Topics: c17, java, lexical-parser, parser
Language: Java
Homepage: https://maroontress.github.io/Clione-Java/
Size: 104 KB
Stars: 1
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

# Clione

Clione is a Java implementation of a lexical parser that tokenizes source code
written in C17 and other C-like programming languages.

The main facility is a tokenization API corresponding to the C preprocessor
layer. It covers trigraph replacement, line splicing, and tokenization, but not
macro expansion or directive handling.

## Example

[A typical usage example](src/test/java/com/example/TokenDemo.java) is as
follows:

```java
package com.example;

import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import com.maroontress.clione.LexicalParser;
import com.maroontress.clione.Token;

public final class TokenDemo {
    public static void main(String[] args) {
        var path = FileSystems.getDefault().getPath(args[0]);
        try (var parser = LexicalParser.of(Files.newBufferedReader(path))) {
            run(parser);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void run(LexicalParser parser) throws IOException {
        for (;;) {
            var maybeToken = parser.next();
            if (maybeToken.isEmpty()) {
                break;
            }
            var token = maybeToken.get();
            printToken(token, "");
        }
    }

    public static void printToken(Token token, String indent) {
        var type = token.getType();
        var value = token.getValue();
        var span = token.getSpan();
        var s = switch (type) {
            case DELIMITER, DIRECTIVE_END
                    -> "'" + value.replaceAll("\n", "\\\\n") + "'";
            default -> value;
        };
        System.out.printf("%s%s: %s: %s%n", indent, span, type, s);
        for (var child : token.getChildren()) {
            printToken(child, indent + "| ");
        }
    }
}
```

The input file [`helloworld.c`](src/test/resources/com/example/helloworld.c) is
as follows:

```c
#include <stdio.h>

int main(void)
{
    printf("hello world\n");
}
```

In this example, the result of "`java com.example.TokenDemo helloworld.c`" is
as follows:

```plaintext
L1:1--19: DIRECTIVE: #
| L1:2--8: DIRECTIVE_NAME: include
| L1:9: DELIMITER: ' '
| L1:10--18: STANDARD_HEADER: <stdio.h>
| L1:19: DIRECTIVE_END: '\n'
L2:1: DELIMITER: '\n'
L3:1--3: RESERVED: int
L3:4: DELIMITER: ' '
L3:5--8: IDENTIFIER: main
L3:9: PUNCTUATOR: (
L3:10--13: RESERVED: void
L3:14: PUNCTUATOR: )
L3:15: DELIMITER: '\n'
L4:1: PUNCTUATOR: {
L4:2--L5:4: DELIMITER: '\n    '
L5:5--10: IDENTIFIER: printf
L5:11: PUNCTUATOR: (
L5:12--26: STRING: "hello world\n"
L5:27: PUNCTUATOR: )
L5:28: PUNCTUATOR: ;
L5:29: DELIMITER: '\n'
L6:1: PUNCTUATOR: }
L6:2: DELIMITER: '\n'
```

## Tokens

The `LexicalParser` object creates and returns tokens from the stream of the
source file. Most tokens are taken verbatim from the source file, but trigraph
and digraph substitution and line splicing may produce tokens whose text does
not appear literally in the source file. When the parser finally reaches the
end of the source file, `next()` returns an empty result rather than a token
(the examples above test it with `isEmpty()`).

The `Token` objects that the `next()` method of a `LexicalParser` instance
returns are preprocessing tokens, so they must be validated before their
content is used. In other words, a token can be incomplete or invalid for its
token type. For example, a string literal or comment may be unterminated, a
preprocessing number may not represent a valid integer or floating-point
constant, and so on.

As in the example above, `Token` objects can have children, which means they
can form a tree structure. Among the tokens that the `next()` method returns,
only tokens of type `TokenType.DIRECTIVE` have children.

A `Token` object has a type, a span, and characters. The type is one of the
constants defined in `enum TokenType`, the span represents the range of the
source file where the token occurs, and the characters are the `SourceChar`
objects that compose it.

## Characters

The `SourceChar` object represents either a character that composes a token or
EOF. In some cases it also has one or more child characters, namely when it
represents:

- the character substituted for a digraph or trigraph sequence
- the character that follows a backslash (`\`) at the end of a line

[The following code](src/test/java/com/example/SourceCharDemo.java) shows an
example:

```java
package com.example;

import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.util.List;
import com.maroontress.clione.LexicalParser;
import com.maroontress.clione.SourceChar;
import com.maroontress.clione.Token;

public final class SourceCharDemo {
    public static void main(String[] args) {
        var path = FileSystems.getDefault().getPath(args[0]);
        try (var parser = LexicalParser.of(Files.newBufferedReader(path))) {
            run(parser);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void run(LexicalParser parser) throws IOException {
        for (;;) {
            var maybeToken = parser.next();
            if (maybeToken.isEmpty()) {
                break;
            }
            printToken(maybeToken.get());
        }
    }

    public static void printToken(Token token) {
        var type = token.getType();
        var value = token.getValue();
        var span = token.getSpan();
        var s = switch (type) {
            case DELIMITER, DIRECTIVE_END
                    -> "'" + value.replaceAll("\n", "\\\\n") + "'";
            default -> value;
        };
        System.out.printf("%s: %s: %s%n", span, type, s);
        printChars(token.getChars(), "  ");
    }

    // Prints each character with its span, then recurses into its children.
    private static void printChars(List<SourceChar> chars, String indent) {
        for (var c : chars) {
            var span = c.getSpan();
            var value = c.toChar();
            // Surrogates are shown as H(...) or L(...) with their hex value.
            var s = (value == '\n')
                    ? "'\\n'"
                    : Character.isHighSurrogate(value)
                    ? "H(0x" + Integer.toString((int) value, 16) + ")"
                    : Character.isLowSurrogate(value)
                    ? "L(0x" + Integer.toString((int) value, 16) + ")"
                    : String.valueOf(value);
            System.out.printf("%s%s: %s%n", indent, span, s);
            printChars(c.getChildren(), indent + "| ");
        }
    }
}
```

The input file [`main.c`](src/test/resources/com/example/main.c) is as follows:

```c
ma??/
in
```

In this example, the result of "`java com.example.SourceCharDemo main.c`" is as
follows:

```plaintext
L1:1--L2:2: IDENTIFIER: main
  L1:1: m
  L1:2: a
  L1:3--L2:1: i
  | L1:3--5: \
  | | L1:3: ?
  | | L1:4: ?
  | | L1:5: /
  | L1:6: '\n'
  | L2:1: i
  L2:2: n
⋮
```

The result illustrates that the character `i` in the identifier `main` has
child characters: a backslash (`\`), a newline (`\n`), and `i`. Furthermore,
the backslash character has child characters: `?`, `?`, and `/`. The trigraph
sequence `??/` is first replaced with a backslash, and then that backslash at
the end of the line triggers line splicing.

## Surrogate pairs

A character corresponds to a column, so one `char` value usually represents
one column. However, when a character is represented with a surrogate pair,
the two `char` values in the pair together represent one column. Here is an
example, [`emojicat.c`](src/test/resources/com/example/emojicat.c):

```c
char *cat = u8"🐱";
```

The result of "`java com.example.SourceCharDemo emojicat.c`" is as follows:

```plaintext
⋮
L1:13--17: STRING: u8"🐱"
  L1:13: u
  L1:14: 8
  L1:15: "
  L1:16: H(0xd83d)
  L1:16: L(0xdc31)
  L1:17: "
⋮
```

This example shows that the high and low surrogate characters are in the same
column.
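
The column rule can be reproduced with the standard library alone. The
following is a minimal sketch, not Clione's implementation; the class
`ColumnSketch` and its method `columns` are hypothetical names used only for
illustration. It counts one column per Unicode code point, so a surrogate pair
contributes a single column:

```java
final class ColumnSketch {
    /** Counts columns in a line: one per Unicode code point, not per char. */
    static int columns(String line) {
        return line.codePointCount(0, line.length());
    }

    public static void main(String[] args) {
        // The literal u8"🐱", with the cat written as the pair U+D83D U+DC31.
        var literal = "u8\"\uD83D\uDC31\"";
        System.out.println(literal.length());   // 6 char values
        System.out.println(columns(literal));   // 5 columns
    }
}
```

The five columns agree with the characters at `L1:13` through `L1:17` in the
output above.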

## Phases of translation

The lexical parser starts tokenization after trigraph replacement and line
splicing, according to the
[_phases of translation_][wikipedia-phases-of-translation].

### Newlines

Before anything else, the lexical parser replaces every newline sequence in the
stream with `\n`: line feed (LF), carriage return followed by line feed (CRLF),
and carriage return (CR), even when different newline styles are mixed in the
same stream. It treats `\n` as the newline (NL) character, regardless of
platform.
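
As an illustration of this step, here is a minimal sketch that works on a whole
string rather than on a stream. It is not Clione's code; `NewlineNormalizer`
and `normalize` are hypothetical names:

```java
final class NewlineNormalizer {
    /** Replaces CRLF and lone CR with '\n'; existing LF is left as it is. */
    static String normalize(String text) {
        // The pattern is CR optionally followed by LF, so CRLF is consumed
        // as one unit before a lone CR can match.
        return text.replaceAll("\r\n?", "\n");
    }

    public static void main(String[] args) {
        var mixed = "a\r\nb\rc\nd";
        System.out.println(normalize(mixed).equals("a\nb\nc\nd"));  // true
    }
}
```

A stream-based parser would do the same thing one character at a time, looking
one character ahead after each CR.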

### Trigraphs

After unifying newline characters, the lexical parser replaces
[trigraph sequences][wikipedia-trigraph] with new `SourceChar` objects. Each
new object represents the equivalent character and becomes the parent of the
three replaced characters. The following table lists all trigraphs:

| Trigraph | Equivalent |
| :---: | :---: |
| `??<` | `{` |
| `??>` | `}` |
| `??(` | `[` |
| `??)` | `]` |
| `??=` | `#` |
| `??/` | `\` |
| `??'` | `^` |
| `??!` | `\|` |
| `??-` | `~` |
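
To make the replacement rule concrete, the following sketch applies the nine
trigraphs defined by the C standard to a plain string. It is only an
illustration under simplifying assumptions: it discards the original
characters, whereas Clione keeps them as children of the new `SourceChar`.
`TrigraphSketch` and its methods are hypothetical names:

```java
final class TrigraphSketch {
    /** Maps the third character of a trigraph to its equivalent, or '\0'. */
    static char equivalent(char third) {
        return switch (third) {
            case '<' -> '{';
            case '>' -> '}';
            case '(' -> '[';
            case ')' -> ']';
            case '=' -> '#';
            case '/' -> '\\';
            case '\'' -> '^';
            case '!' -> '|';
            case '-' -> '~';
            default -> '\0';
        };
    }

    /** Scans left to right and replaces each trigraph sequence. */
    static String replaceTrigraphs(String text) {
        var out = new StringBuilder();
        var i = 0;
        while (i < text.length()) {
            if (text.startsWith("??", i) && i + 2 < text.length()) {
                var c = equivalent(text.charAt(i + 2));
                if (c != '\0') {
                    out.append(c);
                    i += 3;
                    continue;
                }
            }
            // Not a trigraph: copy one character and move on.
            out.append(text.charAt(i));
            ++i;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceTrigraphs("ma??/"));  // ma\
    }
}
```

Because the scan advances one character when no trigraph matches, an input such
as `???/` yields `?\`: the first `?` is copied and the remaining `??/` is
replaced.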

### Line splicing

After trigraph replacement, the lexical parser removes each backslash character
at the end of a line. To be more precise, it replaces the backslash, the
newline character, and the next character with a new `SourceChar` object. The
new object becomes the parent of the replaced characters and represents the
character that followed the backslash and newline characters.

A backslash-newline pair may occur two or more times in succession. In that
case, the new `SourceChar` object becomes the parent of the characters of every
pair and of the character that follows the last pair.
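
The effect of line splicing on the text itself can be sketched as follows. This
is a simplified illustration, not Clione's implementation: it assumes newlines
are already normalized to `\n`, and it simply deletes the pairs instead of
keeping them as child characters. `LineSpliceSketch` and `splice` are
hypothetical names:

```java
final class LineSpliceSketch {
    /** Removes every backslash that is immediately followed by '\n'. */
    static String splice(String text) {
        // String.replace works on literal text, so no regex escaping is needed.
        return text.replace("\\\n", "");
    }

    public static void main(String[] args) {
        // "ma\" at the end of a line, then "in" on the next line.
        System.out.println(splice("ma\\\nin"));     // main
        // Two consecutive backslash-newline pairs collapse together.
        System.out.println(splice("a\\\n\\\nb"));   // ab
    }
}
```

A backslash that is not followed by a newline is left untouched, which is why
escape sequences inside string literals survive this step.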

### Tokenization

After line splicing, the lexical parser starts to break the `SourceChar` stream
into `Token`s. Each `Token` object is one of the following:

- a delimiter (a sequence of whitespace characters)
- a comment
- a directive
- a preprocessing token (a standard header name, an identifier, a
  preprocessing number, a character constant, a string literal, an operator or
  punctuator, or an unknown token)

## Delimiters

A delimiter is a separator between tokens. Strictly speaking, it is not a
token, but the lexical parser returns the delimiter as a token. Some
applications may completely ignore delimiters (for example, code formatters).

The space, horizontal tab (HT), form feed (FF), vertical tab (VT), and NL
characters are delimiters within any non-directive line. The space and HT
characters are delimiters within any directive line.

> ☕ By the way, have you seen source code including FF and VT characters? In
> the past, people often printed source code on paper. In the 1980s, I saw some
> source code that included an FF character inserted between functions. It
> resulted in a page break, so each function started at the top of the page. As
> far as a VT character goes, I have never seen it in source code.

The token type of delimiters is `TokenType.DELIMITER`.

## Comments

A comment can also act as a delimiter, because C preprocessors replace each
comment with a space character.

There are two types of comments. One starts with `/*` and ends with `*/`. The
other starts with `//` and ends with a newline character. In either case, a
comment cannot start inside a character constant, a string literal, a standard
header name, or a filename.

The content of the token can be incomplete. For example, a `/*` comment may be
unterminated.

The token type of comments is `TokenType.COMMENT`.

## Identifiers

An identifier is a preprocessing token.

The first character of an identifier must be one of:

- an underscore character or an uppercase or lowercase letter (`[_A-Za-z]`)
- a universal character name (`\uXXXX` or `\UXXXXXXXX`, where `X` is a
  hexadecimal digit)
- another implementation-defined character

Each subsequent character must be one of the above or a digit (`[0-9]`).

The _other implementation-defined characters_ that `LexicalParser` accepts are
those of a [Unicode Identifier](https://unicode.org/reports/tr31/), as follows:

- The first character: a character for which the
  [Character.isUnicodeIdentifierStart(int)][isUnicodeIdentifierStart]
  method returns `true`
- Each subsequent character: a character for which the
  [Character.isUnicodeIdentifierPart(int)][isUnicodeIdentifierPart]
  method returns `true`

So, the lexical parser can parse the following C code:

```c
char *\U0001f431 = "cat";
```

However, it does NOT support the following code, because Unicode Identifier
does not include emoji characters such as 🐱:

```c
char *🐱 = "cat";
```

Note that recent mainstream C compilers (such as GCC and Clang) can compile
code in which an identifier contains emoji characters like this.
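
The difference can be checked directly with the two `Character` methods named
above. The sketch below is an approximation for illustration, not Clione's
code: it ignores universal character names, and `IdentifierSketch` and
`isIdentifier` are hypothetical names:

```java
final class IdentifierSketch {
    /** Tests a string against the start/part rules, code point by code point. */
    static boolean isIdentifier(String s) {
        if (s.isEmpty()) {
            return false;
        }
        var first = s.codePointAt(0);
        // An underscore is allowed first although it is not an identifier start.
        if (first != '_' && !Character.isUnicodeIdentifierStart(first)) {
            return false;
        }
        var i = Character.charCount(first);
        while (i < s.length()) {
            var c = s.codePointAt(i);
            if (!Character.isUnicodeIdentifierPart(c)) {
                return false;
            }
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(c);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isIdentifier("main"));          // true
        System.out.println(isIdentifier("1x"));            // false
        // U+1F431 (cat face) is a symbol, not a letter.
        System.out.println(isIdentifier("\uD83D\uDC31"));  // false
    }
}
```

Note that `Character.isUnicodeIdentifierPart(int)` also accepts digits,
underscores, combining marks, and ignorable characters, so this check is
somewhat more permissive than the ASCII pattern `[_A-Za-z][_A-Za-z0-9]*`.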

The token type of identifiers is `TokenType.IDENTIFIER`.

## Reserved words

Reserved words are lexically identical to identifiers, but they belong to the
set of keywords, which you can specify with the factory method of
`LexicalParser`.

The token type of reserved words is `TokenType.RESERVED`.

## Character constants

A character constant is a preprocessing token.

It consists of one or more characters enclosed in single quotes. The opening
quote may be preceded by one of the prefixes `L`, `u`, or `U`. It may contain
[escape sequences][wikipedia-escape-character]. It may not contain a newline
character.

The content of the token can be incomplete. For example, it may be
unterminated, or it may contain no character, two or more characters, or
invalid escape sequences inside the single quotes.

The token type of character constants is `TokenType.CHARACTER`.

## String literals

A string literal is a preprocessing token.

It consists of zero or more characters enclosed in double quotes. The opening
quote may be preceded by one of the prefixes `L`, `u`, `U`, or `u8`. It may
contain [escape sequences][wikipedia-escape-character]. It may not contain a
newline character.

The content of the token can be incomplete. For example, it may be
unterminated, or it may contain invalid escape sequences inside the double
quotes.

The token type of string literals is `TokenType.STRING`.

## Preprocessing numbers

A preprocessing number is a preprocessing token.

It covers every integer and floating-point constant, but it also admits other
forms that are neither.

The content of the token can be incomplete. For example, it may represent
neither a valid integer constant nor a valid floating-point constant.
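
To show how broad the category is, here is a sketch of the C17 preprocessing
number grammar written as a regular expression. It is an illustration only, not
the pattern Clione uses: it ignores universal character names, and
`PpNumberSketch` and `isPpNumber` are hypothetical names:

```java
final class PpNumberSketch {
    // An optional '.', one digit, then any run of: an exponent letter with a
    // sign (e+, E-, p+, P-, ...), a digit, a letter, an underscore, or a '.'.
    private static final String PP_NUMBER =
            "\\.?[0-9](?:[eEpP][+-]|[0-9A-Za-z_.])*";

    /** Returns true if the whole string is a single preprocessing number. */
    static boolean isPpNumber(String s) {
        return s.matches(PP_NUMBER);
    }

    public static void main(String[] args) {
        System.out.println(isPpNumber("0x1p-3"));  // true, a valid constant
        System.out.println(isPpNumber("1.2.3"));   // true, but not a constant
        System.out.println(isPpNumber("12abc"));   // true, but not a constant
        System.out.println(isPpNumber("1+2"));     // false, three tokens
    }
}
```

Inputs such as `1.2.3` and `12abc` are single preprocessing numbers even though
no later compiler phase accepts them, which is why the token content needs
validation.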

The token type of preprocessing numbers is `TokenType.NUMBER`.

## Operators and punctuators

Operator and punctuator tokens are preprocessing tokens. The following table
lists the valid tokens whose type is `TokenType.OPERATOR`:

```plaintext
+       -       *       /       %       ++      --      ==      !=
>       <       >=      <=      !       &&      ||      ~       &
|       ^       <<      >>      =       +=      -=      *=      /=
%=      &=      |=      ^=      <<=     >>=     ->      .       ?
```

Note that these are preprocessing tokens, not C operators. For example,
`sizeof` is an operator in C, but a reserved word (or an identifier) as a
preprocessing token.

The following table lists all valid tokens whose type is
`TokenType.PUNCTUATOR`:

```plaintext
(       )       [       ]       {       }       :
;       ,       ...     <:      :>      <%      %>
```

The lexical parser treats four tokens specially: `#`, `%:`, `##`, and `%:%:`.
Their type is `TokenType.OPERATOR` in directive lines. Otherwise, `#` and `%:`
are of type `TokenType.DIRECTIVE`, and `##` and `%:%:` are of type
`TokenType.UNKNOWN`, as follows:

| Tokens | In directive lines | Otherwise |
| :---: | :---: | :---: |
| `#` `%:` | `TokenType.OPERATOR` | `TokenType.DIRECTIVE` |
| `##` `%:%:` | `TokenType.OPERATOR` | `TokenType.UNKNOWN` |

The following table lists all tokens that are digraphs:

| Token | Equivalent |
| :---: | :---: |
| `<:` | `[` |
| `:>` | `]` |
| `<%` | `{` |
| `%>` | `}` |
| `%:` | `#` |
| `%:%:` | `##` |

The lexical parser replaces the digraphs with their equivalents. The
substituted characters have child characters that represent the replaced ones.

## Directives

A directive token consists of a number sign (or hash) character (`#`) and its
child tokens. The null directive has no child tokens.

The child tokens must include a directive name, arguments (depending on the
directive name), and the end of the directive (that is, a newline character).
They may also include delimiters and comments. The last child must be the end
of the directive.

The content of the child tokens can be incomplete. For example, they may
represent an invalid directive, or they may not end with the end of the
directive.

The token type of directives is `TokenType.DIRECTIVE`.

The content of a token that represents a directive name must be one of
`define`, `undef`, `include`, `if`, `ifdef`, `ifndef`, `else`, `elif`,
`endif`, `line`, `error`, or `pragma`. Its token type is
`TokenType.DIRECTIVE_NAME`.

The content of a token that represents the end of the directive must be a
newline character. Its token type is `TokenType.DIRECTIVE_END`.

### Include directives

When the directive name equals `include`, the argument must be one of:

- a standard header name between angle brackets (`<` and `>`)
- a filename between double quotes (`"` and `"`)
- any other form that expands to a standard header name or a filename after
  macro replacement

A standard header name and a filename are preprocessing tokens.

The content of the tokens can be incomplete. For example, they may be
unterminated.

The token types of standard header names and filenames are
`TokenType.STANDARD_HEADER` and `TokenType.FILENAME`, respectively.

## Unknown tokens

When the lexical parser encounters characters that do not fit any of the above
descriptions, it returns an unknown token containing them.

The token type of unknown tokens is `TokenType.UNKNOWN`.

## API Reference

- [com.maroontress.clione][apiref-maroontress.clione] module

[isUnicodeIdentifierPart]:
  https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/Character.html#isUnicodeIdentifierPart(int)
[isUnicodeIdentifierStart]:
  https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/Character.html#isUnicodeIdentifierStart(int)
[apiref-maroontress.clione]:
  https://maroontress.github.io/Clione-Java/api/latest/html/index.html
[wikipedia-trigraph]:
  https://en.wikipedia.org/wiki/Digraphs_and_trigraphs#C
[wikipedia-escape-character]:
  https://en.wikipedia.org/wiki/Escape_sequences_in_C#Table_of_escape_sequences
[wikipedia-phases-of-translation]:
  https://en.wikipedia.org/wiki/C_preprocessor#Phases