{"id":19564340,"url":"https://github.com/antlr/performance","last_synced_at":"2025-02-26T09:21:36.256Z","repository":{"id":66220456,"uuid":"456016119","full_name":"antlr/performance","owner":"antlr","description":"Test the performance of ANTLR parsers (initially just Java target)","archived":false,"fork":false,"pushed_at":"2022-11-11T19:06:42.000Z","size":12204,"stargazers_count":2,"open_issues_count":1,"forks_count":1,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-01-08T23:41:50.399Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/antlr.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-02-06T00:04:30.000Z","updated_at":"2024-03-02T00:42:12.000Z","dependencies_parsed_at":"2023-03-17T12:00:35.004Z","dependency_job_id":null,"html_url":"https://github.com/antlr/performance","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/antlr%2Fperformance","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/antlr%2Fperformance/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/antlr%2Fperformance/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/antlr%2Fperformance/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/antlr","download_url":"https://codeload.github.com/antlr/performance/tar.gz/refs/hea
ds/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240823421,"owners_count":19863434,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-11T05:21:33.688Z","updated_at":"2025-02-26T09:21:36.221Z","avatar_url":"https://github.com/antlr.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Test the performance of ANTLR parsers (initially just Java target)\n\nParser performance for the same grammar across code generation targets varies a great deal for two reasons:\n\n1. raw performance of the underlying implementation language, such as Python versus C++\n2. the runtime support library must be carefully tuned, as we did for the Java target; e.g., the [hash function](https://github.com/antlr/antlr4/blob/master/runtime/Java/src/org/antlr/v4/runtime/atn/ATNConfigSet.java#L47) used has to be appropriate for our unusual use case\n\nIt's also the case that different grammars for the same language can exhibit radically different performance. For example, in the [OOPSLA paper](https://dl.acm.org/doi/pdf/10.1145/2660193.2660202) we compared the performance of two different grammars for Java. The grammar from the Java language specification, converted one-to-one to ANTLR notation, performed much worse than one we hand-tuned to reduce lookahead requirements.\n\n*A note on testing the performance of ANTLR parsers.* ANTLR v4 generates ALL(\\*) parsers, which use a form of decision caching in order to improve future performance on the same or similar input statements.  
That implies there is a warm-up period associated with the parsers before they reach their final throughput, and of course, Java's JIT also has a warm-up period (if using the Java target).\n\n## Build and test Java Target on 3 grammars\n\nAt this point, the test script compares the performance of a tuned Java grammar (on all generated Java code from all three grammars), PostgreSQL, and SPARQL with sample input. For convenience, a snapshot of the upcoming ANTLR 4.10 release is in the `lib` dir and is used by the scripts.\n\n```bash\ncd /tmp\ngit clone git@github.com:antlr/performance.git\ncd performance/java\n```\n\nThen you can run the build script, which will download the grammar repository, pull out the 3 grammars of interest, and generate code via ANTLR:\n\n```bash\n$ ./build.sh \nDownload sample grammars\n  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n100   133  100   133    0     0    565      0 --:--:-- --:--:-- --:--:--   568\n100 27.1M    0 27.1M    0     0  5110k      0 --:--:--  0:00:05 --:--:-- 6674k\nUnzip sample grammars\nerror:  cannot create grammars-v4-master/molecule/examples/NiC2O4 -? 
2H2O.txt\n        Illegal byte sequence\nCopy sample grammars and input\nBuilding parsers\nwarning(146): PostgreSQLLexer.g4:2610:0: non-fragment lexer rule AfterEscapeStringConstantMode_NotContinued can match the empty string\nwarning(146): PostgreSQLLexer.g4:2629:0: non-fragment lexer rule AfterEscapeStringConstantWithNewlineMode_NotContinued can match the empty string\n\nCompiling parsers and test rig\nNote: TestANTLR4.java uses or overrides a deprecated API.\nNote: Recompile with -Xlint:deprecation for details.\n```\n\nHere is how to test the throughput:\n\n```bash\n$ ./throughput.sh \nSPARQL\n4 files\nParsed 4 files 24 lines 721 bytes in  151ms at       158 lines/sec 4,774 chars/sec\nParsed 4 files 24 lines 721 bytes in    7ms at     3,428 lines/sec 103,000 chars/sec\nParsed 4 files 24 lines 721 bytes in    6ms at     4,000 lines/sec 120,166 chars/sec\nParsed 4 files 24 lines 721 bytes in   17ms at     1,411 lines/sec 42,411 chars/sec\nParsed 4 files 24 lines 721 bytes in    7ms at     3,428 lines/sec 103,000 chars/sec\nParsed 4 files 24 lines 721 bytes in    9ms at     2,666 lines/sec 80,111 chars/sec\nParsed 4 files 24 lines 721 bytes in    7ms at     3,428 lines/sec 103,000 chars/sec\nParsed 4 files 24 lines 721 bytes in   10ms at     2,400 lines/sec 72,100 chars/sec\nParsed 4 files 24 lines 721 bytes in    6ms at     4,000 lines/sec 120,166 chars/sec\nParsed 4 files 24 lines 721 bytes in    6ms at     4,000 lines/sec 120,166 chars/sec\naverage parse 8.857ms, min 6.582ms, stddev=3.891ms (First 3 runs skipped for JIT warmup)\nJava\n16 files\nParsed 16 files 137,280 lines 4,341,964 bytes in 1455ms at    94,350 lines/sec 2,984,167 chars/sec\nParsed 16 files 137,280 lines 4,341,964 bytes in  429ms at   320,000 lines/sec 10,121,128 chars/sec\nParsed 16 files 137,280 lines 4,341,964 bytes in  312ms at   440,000 lines/sec 13,916,551 chars/sec\nParsed 16 files 137,280 lines 4,341,964 bytes in  327ms at   419,816 lines/sec 13,278,177 chars/sec\nParsed 16 files 
137,280 lines 4,341,964 bytes in  290ms at   473,379 lines/sec 14,972,289 chars/sec\nParsed 16 files 137,280 lines 4,341,964 bytes in  284ms at   483,380 lines/sec 15,288,605 chars/sec\nParsed 16 files 137,280 lines 4,341,964 bytes in  363ms at   378,181 lines/sec 11,961,333 chars/sec\nParsed 16 files 137,280 lines 4,341,964 bytes in  433ms at   317,043 lines/sec 10,027,630 chars/sec\nParsed 16 files 137,280 lines 4,341,964 bytes in  232ms at   591,724 lines/sec 18,715,362 chars/sec\nParsed 16 files 137,280 lines 4,341,964 bytes in  245ms at   560,326 lines/sec 17,722,302 chars/sec\naverage parse 310.571ms, min 232.870ms, stddev=70.249ms (First 3 runs skipped for JIT warmup)\npostgresql\n3 files\nParsed 3 files 5,232 lines 181,261 bytes in 5506ms at       950 lines/sec 32,920 chars/sec\nParsed 3 files 5,232 lines 181,261 bytes in 1203ms at     4,349 lines/sec 150,674 chars/sec\nParsed 3 files 5,232 lines 181,261 bytes in  812ms at     6,443 lines/sec 223,227 chars/sec\nParsed 3 files 5,232 lines 181,261 bytes in  683ms at     7,660 lines/sec 265,389 chars/sec\nParsed 3 files 5,232 lines 181,261 bytes in  737ms at     7,099 lines/sec 245,944 chars/sec\nParsed 3 files 5,232 lines 181,261 bytes in  694ms at     7,538 lines/sec 261,182 chars/sec\nParsed 3 files 5,232 lines 181,261 bytes in  619ms at     8,452 lines/sec 292,828 chars/sec\nParsed 3 files 5,232 lines 181,261 bytes in  620ms at     8,438 lines/sec 292,356 chars/sec\nParsed 3 files 5,232 lines 181,261 bytes in  623ms at     8,398 lines/sec 290,948 chars/sec\nParsed 3 files 5,232 lines 181,261 bytes in  603ms at     8,676 lines/sec 300,598 chars/sec\naverage parse 654.143ms, min 603.979ms, stddev=50.453ms (First 3 runs skipped for JIT warmup)\n```\n\n## Improving parser performance\n\nYou'll notice that the SQL parser has much lower throughput than the Java parser. 
Part of this is due to the complexity of the grammars:\n\n```bash\n$ wc grammars/*.g4\n     241     723    6877 grammars/JavaLexer.g4\n     750    1825   16397 grammars/JavaParser.g4\n    2650    6672   30064 grammars/PostgreSQLLexer.g4\n    5327   12002  103840 grammars/PostgreSQLParser.g4\n     505    1217    9337 grammars/Sparql.g4\n```\n\nBut there is usually room to improve performance by left-factoring common grammatical prefixes. Using the IntelliJ plug-in's built-in profiler, we can see that the amount of lookahead for even a small `create` statement is quite large.  Consider the highlighted section here:\n\n\u003cimg width=\"800\" alt=\"Screen Shot 2022-02-07 at 11 27 23 AM\" src=\"https://user-images.githubusercontent.com/178777/152858274-872c152c-da7e-46b4-9b92-40cad07cfac5.png\"\u003e\n\n(Open `PostgreSQLParser.g4` in IntelliJ, right-click on rule `root` and select `Test rule root`, enter sample input in the ANTLR Preview tool pane, then click on the profiler and click on the various headers to sort forward and backward.)\n\nThe parser needs 12 tokens of lookahead because of the way the grammar is expressed:\n\n\u003cimg width=\"150\" alt=\"Screen Shot 2022-02-07 at 11 28 33 AM\" src=\"https://user-images.githubusercontent.com/178777/152858185-cac8af97-3a6e-42cb-a077-27f4783c3134.png\"\u003e\n\nThere are multiple statements that either start with the same left prefix or are variations on a `create` statement. In this case, it looks like the parser is scanning the entire statement up to the semicolon before deciding between the various grammatical forms. SQL is a very complex (or at least big) language, so merging grammatical rules might put a larger burden on a semantic analyzer phase and might also make the grammar less readable. This is a trade-off to keep in mind when either designing languages or implementing grammars. 
:)\n\nFor example, suppose we have a simple rule, taken in isolation, whose two alternatives share a long common prefix:\n\n```\ncreate\n    :    'create' 'table' ID '(' ID 'integer' ')' ';'\n    |    'create' 'table' ID '(' ')' ';'\n    ;\n```\n\nALL(\\*) has no problem with that rule; it dynamically scans ahead until it can distinguish the alternatives of the decision it faces. The cost is that it must look ahead to the token beyond '(' before it can choose an alternative. In this case, that is 5 tokens of lookahead.\n\nWithout changing the language recognized, we can left-factor the rule and collapse its alternatives so that the parser needs only one token of lookahead to decide whether to match the optional `(ID 'integer')?` subrule:\n\n```\ncreate\n    :    'create' 'table' ID '(' (ID 'integer')? ')' ';'\n    ;\n```\n\nOf course, this grammar might be harder to read. The language is the same, but we have expressed it differently to improve performance: the usual trade-off in optimization.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fantlr%2Fperformance","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fantlr%2Fperformance","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fantlr%2Fperformance/lists"}