{"id":21257808,"url":"https://github.com/maroontress/clione.java","last_synced_at":"2025-03-15T06:23:21.399Z","repository":{"id":103784566,"uuid":"450798449","full_name":"maroontress/Clione.Java","owner":"maroontress","description":"A C17 lexical parser written in Java.","archived":false,"fork":false,"pushed_at":"2022-01-24T17:30:07.000Z","size":106,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-21T21:32:02.291Z","etag":null,"topics":["c17","java","lexical-parser","parser"],"latest_commit_sha":null,"homepage":"https://maroontress.github.io/Clione-Java/","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/maroontress.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-01-22T11:31:46.000Z","updated_at":"2022-02-02T06:18:42.000Z","dependencies_parsed_at":null,"dependency_job_id":"f71b41e7-9015-4b02-8c24-21b79be1f395","html_url":"https://github.com/maroontress/Clione.Java","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maroontress%2FClione.Java","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maroontress%2FClione.Java/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maroontress%2FClione.Java/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maroontress%2FClione.Java/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/owners/maroontress","download_url":"https://codeload.github.com/maroontress/Clione.Java/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243691256,"owners_count":20331932,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["c17","java","lexical-parser","parser"],"created_at":"2024-11-21T04:05:56.534Z","updated_at":"2025-03-15T06:23:21.393Z","avatar_url":"https://github.com/maroontress.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Clione\n\nClione is a Java implementation of a lexical parser that tokenizes source code\nwritten in C17 and other C-like programming languages.\n\nThe main facility is a tokenization API corresponding to the C preprocessor\nlayer. 
It includes the features of trigraph replacement, line splicing, and\ntokenization but does not include macro expansion and directive handling.\n\n## Example\n\n[A typical usage example](src/test/java/com/example/TokenDemo.java) would be as\nfollows:\n\n```java\npackage com.example;\n\nimport java.io.IOException;\nimport java.nio.file.FileSystems;\nimport java.nio.file.Files;\n\nimport com.maroontress.clione.LexicalParser;\nimport com.maroontress.clione.Token;\n\npublic final class TokenDemo {\n\n    public static void main(String[] args) {\n        var path = FileSystems.getDefault().getPath(args[0]);\n        try (var parser = LexicalParser.of(Files.newBufferedReader(path))) {\n            run(parser);\n        } catch (IOException e) {\n            e.printStackTrace();\n        }\n    }\n\n    public static void run(LexicalParser parser) throws IOException {\n        for (;;) {\n            var maybeToken = parser.next();\n            if (maybeToken.isEmpty()) {\n                break;\n            }\n            var token = maybeToken.get();\n            printToken(token, \"\");\n        }\n    }\n\n    public static void printToken(Token token, String indent) {\n        var type = token.getType();\n        var value = token.getValue();\n        var span = token.getSpan();\n        var s = switch (type) {\n            case DELIMITER, DIRECTIVE_END\n                    -\u003e \"'\" + value.replaceAll(\"\\n\", \"\\\\\\\\n\") + \"'\";\n            default -\u003e value;\n        };\n        System.out.printf(\"%s%s: %s: %s%n\", indent, span, type, s);\n        for (var child : token.getChildren()) {\n            printToken(child, indent + \"| \");\n        }\n    }\n}\n```\n\nAnd [`helloworld.c`](src/test/resources/com/example/helloworld.c) would be as\nfollows:\n\n```c\n#include \u003cstdio.h\u003e\n\nint main(void)\n{\n    printf(\"hello world\\n\");\n}\n```\n\nIn this example, the result of \"`java com.example.TokenDemo helloworld.c`\" is\nas 
follows:\n\n```plaintext\nL1:1--19: DIRECTIVE: #\n| L1:2--8: DIRECTIVE_NAME: include\n| L1:9: DELIMITER: ' '\n| L1:10--18: STANDARD_HEADER: \u003cstdio.h\u003e\n| L1:19: DIRECTIVE_END: '\\n'\nL2:1: DELIMITER: '\\n'\nL3:1--3: RESERVED: int\nL3:4: DELIMITER: ' '\nL3:5--8: IDENTIFIER: main\nL3:9: PUNCTUATOR: (\nL3:10--13: RESERVED: void\nL3:14: PUNCTUATOR: )\nL3:15: DELIMITER: '\\n'\nL4:1: PUNCTUATOR: {\nL4:2--L5:4: DELIMITER: '\\n    '\nL5:5--10: IDENTIFIER: printf\nL5:11: PUNCTUATOR: (\nL5:12--26: STRING: \"hello world\\n\"\nL5:27: PUNCTUATOR: )\nL5:28: PUNCTUATOR: ;\nL5:29: DELIMITER: '\\n'\nL6:1: PUNCTUATOR: }\nL6:2: DELIMITER: '\\n'\n```\n\n## Tokens\n\nThe `LexicalParser` object creates and returns tokens from the stream of the\nsource file. A token usually appears verbatim in the source file, but trigraph\nand digraph substitution and line splicing may produce tokens whose text does\nnot appear there literally. When the parser reaches the end of the source file,\n`next()` returns an empty value instead of a token (see the `isEmpty()` check in\nthe example above).\n\nThe `Token` objects that the `next()` method of a `LexicalParser` instance\nreturns are preprocessing tokens, so their content must be validated before\nuse. In other words, they can be incomplete or malformed for their token type.\nFor example, a string literal or comment may be unterminated, a preprocessing\nnumber may not represent a valid integer or floating-point constant, and so\non.\n\nAs in the example above, `Token` objects can have children, which means they\ncan form a tree structure. Among the tokens that the `next()` method returns,\nonly tokens of type `TokenType.DIRECTIVE` have children.\n\nThe `Token` object has its type, span, and characters. 
The type is one of the\nconstants defined in `enum TokenType`, the span represents the range of the\nsource file where the token occurs, and the characters are `SourceChar` objects\nthat compose it.\n\n## Characters\n\nThe `SourceChar` object represents a character that composes the token or EOF.\nIt may also have one or more child characters in some cases. For example, it is\nthe case that it represents:\n\n- the character which is substituted for any digraph or trigraph sequence\n- the character that follows a backslash (`\\`) at the end of the line\n\n[The following code](src/test/java/com/example/SourceCharDemo.java) shows an\nexample:\n\n```java\npackage com.example;\n\nimport java.io.IOException;\nimport java.nio.file.FileSystems;\nimport java.nio.file.Files;\nimport java.util.List;\n\nimport com.maroontress.clione.LexicalParser;\nimport com.maroontress.clione.SourceChar;\nimport com.maroontress.clione.Token;\n\npublic final class SourceCharDemo {\n\n    public static void main(String[] args) {\n        var path = FileSystems.getDefault().getPath(args[0]);\n        try (var parser = LexicalParser.of(Files.newBufferedReader(path))) {\n            run(parser);\n        } catch (IOException e) {\n            e.printStackTrace();\n        }\n    }\n\n    public static void run(LexicalParser parser) throws IOException {\n        for (;;) {\n            var maybeToken = parser.next();\n            if (maybeToken.isEmpty()) {\n                break;\n            }\n            printToken(maybeToken.get());\n        }\n    }\n\n    public static void printToken(Token token) {\n        var type = token.getType();\n        var value = token.getValue();\n        var span = token.getSpan();\n        var s = switch (type) {\n            case DELIMITER, DIRECTIVE_END\n                    -\u003e \"'\" + value.replaceAll(\"\\n\", \"\\\\\\\\n\") + \"'\";\n            default -\u003e value;\n        };\n        System.out.printf(\"%s: %s: %s%n\", span, type, s);\n        
printChars(token.getChars(), \"  \");\n    }\n\n    private static void printChars(List\u003cSourceChar\u003e chars, String indent) {\n        for (var c : chars) {\n            var span = c.getSpan();\n            var value = c.toChar();\n            var s = (value == '\\n')\n                    ? \"'\\\\n'\"\n                    : Character.isHighSurrogate(value)\n                    ? \"H(0x\" + Integer.toString((int) value, 16) + \")\"\n                    : Character.isLowSurrogate(value)\n                    ? \"L(0x\" + Integer.toString((int) value, 16) + \")\"\n                    : String.valueOf(value);\n            System.out.printf(\"%s%s: %s%n\", indent, span, s);\n            printChars(c.getChildren(), indent + \"| \");\n        }\n    }\n}\n```\n\nAnd\n[`main.c`](src/test/resources/com/example/main.c) would be as follows:\n\n```c\nma??/\nin\n```\n\nIn this example, the result of \"`java com.example.SourceCharDemo main.c`\" is as follows:\n\n```plaintext\nL1:1--L2:2: IDENTIFIER: main\n  L1:1: m\n  L1:2: a\n  L1:3--L2:1: i\n  | L1:3--5: \\\n  | | L1:3: ?\n  | | L1:4: ?\n  | | L1:5: /\n  | L1:6: '\\n'\n  | L2:1: i\n  L2:2: n\n⋮\n```\n\nThe result illustrates that the character `i` in the identifier `main` has\nchild characters: a backslash (`\\`), a newline (`\\n`), and `i`. Furthermore,\nthe backslash character has child characters: `?`, `?`, and `/`. Of course,\nwhat happens is that the trigraph sequence `??/` is replaced with a backslash\nat first, and then the backslash at the end of the line results in the line\nconcatenation.\n\n## Surrogate pairs\n\nA character corresponds to a column. So, one `char` value often represents one\ncolumn. However, in the case of a character represented with a surrogate pair,\nthe two `char` values in the pair represent one column. 
Here is an example\n[`emojicat.c`](src/test/resources/com/example/emojicat.c):\n\n```c\nchar *cat = u8\"🐱\";\n```\n\nThe result of \"`java com.example.SourceCharDemo emojicat.c`\" is as follows:\n\n```plaintext\n⋮\nL1:13--17: STRING: u8\"🐱\"\n  L1:13: u\n  L1:14: 8\n  L1:15: \"\n  L1:16: H(0xd83d)\n  L1:16: L(0xdc31)\n  L1:17: \"\n⋮\n```\n\nThis example shows that the high and low surrogate characters are in the same\ncolumn.\n\n## Phases of translation\n\nThe lexical parser starts tokenization after trigraph replacement and line\nsplicing, according to the\n[_phases of translation_][wikipedia-phases-of-translation].\n\n### Newlines\n\nBefore anything else, the lexical parser normalizes every newline sequence in\nthe stream, that is, line feed (LF), carriage return followed by line feed\n(CRLF), and carriage return (CR), to `\\n`, even if different newline styles are\nmixed in the stream. It treats `\\n` as the newline (NL) character, regardless\nof platform.\n\n### Trigraphs\n\nAfter unifying newline characters, the lexical parser replaces\n[trigraph sequences][wikipedia-trigraph] with new `SourceChar` objects. Each new\nobject becomes the parent of the characters it replaces and represents their\nequivalent. The following table lists all trigraphs:\n\n| Trigraph | Equivalent |\n| :---:  | :---: |\n| `??\u003c`  | `{`   |\n| `??\u003e`  | `}`   |\n| `??(`  | `[`   |\n| `??)`  | `]`   |\n| `??=`  | `#`   |\n| `??/`  | `\\`   |\n| `??'`  | `^`   |\n| `??!`  | `\\|`  |\n| `??-`  | `~`   |\n\n### Line splicing\n\nAfter trigraph replacement, the lexical parser removes each backslash character\nthat ends a line. To be more precise, it replaces the backslash, the newline\ncharacter, and the next character with a new\n`SourceChar` object. 
The new one becomes the parent of the replaced characters\nand represents the character that followed the backslash and newline\ncharacters.\n\nTwo or more backslash-newline pairs may occur consecutively. In that case, the\nnew character becomes the parent of every backslash and newline character in\nthose pairs and of the character that follows them.\n\n### Tokenization\n\nAfter line splicing, the lexical parser starts to break the `SourceChar` stream\ninto `Token`s. A `Token` object is one of the following:\n\n- a delimiter (a sequence of whitespace characters)\n- a comment\n- a directive\n- a preprocessing token (a standard header name, identifier, preprocessing\n  number, character constant, string literal, operator or punctuator, or\n  unknown token)\n\n## Delimiters\n\nA delimiter is a separator between tokens. Strictly speaking, it is not a\ntoken, but the lexical parser returns the delimiter as a token. Some\napplications may completely ignore delimiters (for example, code formatters).\n\nThe space, horizontal tab (HT), form feed (FF), vertical tab (VT), and NL\ncharacters are delimiters within any non-directive line. Only the space and HT\ncharacters are delimiters within a directive line.\n\n\u003e ☕ By the way, have you seen source code including FF and VT characters? In\n\u003e the past, people often printed source code on paper. In the 1980s, I saw some\n\u003e source code that included an FF character inserted between functions. It\n\u003e resulted in a page break, so each function started at the top of the page. As\n\u003e far as a VT character goes, I have never seen it in the source code.\n\nThe token type of delimiters is `TokenType.DELIMITER`.\n\n## Comments\n\nA comment can also act as a delimiter, because C preprocessors replace each\ncomment with a space character.\n\nThere are two types of comments. One starts with `/*` and ends with `*/`.\nThe other starts with `//` and ends with a newline character. 
No comment can be\ninside a character constant, a string literal, a standard header name, or a\nfilename in either case.\n\nThe content of the token can be incomplete. For example, a comment may be\nunterminated.\n\nThe token type of comments is `TokenType.COMMENT`.\n\n## Identifiers\n\nAn identifier is a preprocessing token.\n\nThe first character of an identifier name must be one of:\n\n- an underscore character or an uppercase or lowercase letter (`[_A-Za-z]`)\n- a universal character name (`\\uXXXX` or `\\UXXXXXXXX`, where `X` is a\n  hexadecimal digit)\n- another implementation-defined character\n\nEach subsequent character must be one of these or a digit (`[0-9]`).\n\nThe _other implementation-defined characters_ that `LexicalParser`'s\nimplementation defines are those of a\n[Unicode Identifier](https://unicode.org/reports/tr31/), as follows:\n\n- The first character: a character for which the\n  [Character.isUnicodeIdentifierStart(int)][isUnicodeIdentifierStart]\n  method returns `true`\n- Each subsequent character: a character for which the\n  [Character.isUnicodeIdentifierPart(int)][isUnicodeIdentifierPart]\n  method returns `true`\n\nSo, the lexical parser can parse the following C code:\n\n```c\nchar *\\U0001f431 = \"cat\";\n```\n\nHowever, it does NOT support the following code because Unicode Identifier does\nnot contain emoji characters such as 🐱:\n\n```c\nchar *🐱 = \"cat\";\n```\n\nNote that recent mainstream C compilers (such as GCC and Clang) 
can compile the\ncode where an identifier contains emoji characters like this.\n\nThe token type of identifiers is `TokenType.IDENTIFIER`.\n\n## Reserved words\n\nReserved words are lexically identical to identifiers, but they belong to the\nset of keywords, which you can specify with the factory method of\n`LexicalParser`.\n\nThe token type of reserved words is `TokenType.RESERVED`.\n\n## Character constants\n\nA character constant is a preprocessing token.\n\nIt consists of one or more characters enclosed in single quotes. The opening\nquote may be preceded by one of the prefixes `L`, `u`, or `U`. It may contain\n[escape sequences][wikipedia-escape-character]. It must not contain a newline\ncharacter.\n\nThe content of the token can be incomplete. For example, it may be\nunterminated, or the single quotes may enclose no character, two or more\ncharacters, or invalid escape sequences.\n\nThe token type of character constants is `TokenType.CHARACTER`.\n\n## String literals\n\nA string literal is a preprocessing token.\n\nIt consists of zero or more characters enclosed in double quotes. The opening\nquote may be preceded by one of the prefixes `L`, `u`, `U`, or `u8`. It may\ncontain [escape sequences][wikipedia-escape-character]. It must not contain a\nnewline character.\n\nThe content of the token can be incomplete. For example, it may be\nunterminated, or it may contain invalid escape sequences inside the double\nquotes.\n\nThe token type of string literals is `TokenType.STRING`.\n\n## Preprocessing numbers\n\nA preprocessing number is a preprocessing token.\n\nIt covers all integer and floating-point constants but also includes other\nforms that are neither (in the C17 grammar, `1.2.3` is a preprocessing number,\nfor example).\n\nThe content of the token can be incomplete. For example, it may represent\nneither a valid integer constant nor a valid floating-point constant.\n\nThe token type of preprocessing numbers is `TokenType.NUMBER`.\n\n## Operators and punctuators\n\nOperator or punctuator tokens are preprocessing tokens. 
The following table\nlists valid tokens of which the type is `TokenType.OPERATOR`:\n\n```plaintext\n+       -       *       /       %       ++      --      ==      !=\n\u003e       \u003c       \u003e=      \u003c=      !       \u0026\u0026      ||      ~       \u0026\n|       ^       \u003c\u003c      \u003e\u003e      =       +=      -=      *=      /=\n%=      \u0026=      |=      ^=      \u003c\u003c=     \u003e\u003e=     -\u003e      .       ?\n```\n\nNote that these are preprocessing tokens, not C operators. For example,\n`sizeof` is an operator in C, but a reserved word (or an identifier) as a\npreprocessing token.\n\nThe following table lists all valid tokens of which the type is\n`TokenType.PUNCTUATOR`:\n\n```plaintext\n(       )       [       ]       {       }       :\n;       ,       ...     \u003c:      :\u003e      \u003c%      %\u003e\n```\n\nThe lexical parser specially treats the four tokens: `#`, `%:`, `##`, and\n`%:%:`. The type of them is `TokenType.OPERATOR` in directive lines. Otherwise,\n`#` and `%:` are of type `TokenType.DIRECTIVE`, `##` and `%:%:` are of type\n`TokenType.UNKNOWN` as follows:\n\n| Tokens | In directive lines | Otherwise |\n|:---:|:---:|:---:|\n| `#` `%:`    | `TokenType.OPERATOR` | `TokenType.DIRECTIVE` |\n| `##` `%:%:` | `TokenType.OPERATOR` | `TokenType.UNKNOWN`   |\n\nThe following table lists all tokens that are digraphs:\n\n| Token  | Equivalent |\n| :---:  | :---: |\n| `\u003c:`   | `[`   |\n| `:\u003e`   | `]`   |\n| `\u003c%`   | `{`   |\n| `%\u003e`   | `}`   |\n| `%:`   | `#`   |\n| `%:%:` | `##`  |\n\nThe lexical parser replaces the digraphs with their equivalents. The\nsubstituted characters have the child characters that represent the replaced\nones.\n\n## Directives\n\nA directive token consists of a number sign (or hash) character (`#`) and the\nchild tokens. 
The null directive has no child tokens.\n\nThe child tokens must include a directive name, arguments (depending on the\ndirective name), and the end of the directive (that is, a newline character).\nThey may also include delimiters and comments. The last of them must be the end\nof the directive.\n\nThe content of the child tokens can be incomplete. For example, they may\nrepresent an invalid directive, they may not end with the end of the directive,\nand so on.\n\nThe token type of directives is `TokenType.DIRECTIVE`.\n\nThe tokens that represent the directive names must have content that is one\nof: `define`, `undef`, `include`, `if`, `ifdef`, `ifndef`, `else`, `elif`,\n`endif`, `line`, `error`, or `pragma`. Their token type is\n`TokenType.DIRECTIVE_NAME`.\n\nThe tokens that represent the end of the directive must have a newline\ncharacter as the content. Their token type is `TokenType.DIRECTIVE_END`.\n\n### Include directives\n\nWhen the directive name equals `include`, the argument must be either:\n\n- a standard header name between angle brackets (`\u003c` and `\u003e`)\n- a filename between double quotes (`\"` and `\"`)\n- any other form that expands to a standard header name or a filename after\n  macro replacement\n\nA standard header name and a filename are preprocessing tokens.\n\nThe content of the tokens can be incomplete. 
For example, they may not\nterminate, and so on.\n\nThe token types of standard header names and filenames are\n`TokenType.STANDARD_HEADER` and `TokenType.FILENAME`, respectively.\n\n## Unknown tokens\n\nWhen the lexical parser encounters characters that do not fit the above\ndescription, it returns an unknown token containing them.\n\nThe token type of unknown tokens is `TokenType.UNKNOWN`.\n\n## API Reference\n\n- [com.maroontress.clione][apiref-maroontress.clione] module\n\n[isUnicodeIdentifierPart]:\n  https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/Character.html#isUnicodeIdentifierPart(int)\n[isUnicodeIdentifierStart]:\n  https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/Character.html#isUnicodeIdentifierStart(int)\n[apiref-maroontress.clione]:\n  https://maroontress.github.io/Clione-Java/api/latest/html/index.html\n[wikipedia-trigraph]:\n  https://en.wikipedia.org/wiki/Digraphs_and_trigraphs#C\n[wikipedia-escape-character]:\n  https://en.wikipedia.org/wiki/Escape_sequences_in_C#Table_of_escape_sequences\n[wikipedia-phases-of-translation]:\n  https://en.wikipedia.org/wiki/C_preprocessor#Phases\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaroontress%2Fclione.java","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaroontress%2Fclione.java","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaroontress%2Fclione.java/lists"}