https://github.com/nikic/phlexy
Lexing experiments in PHP
https://github.com/nikic/phlexy
Last synced: 9 months ago
JSON representation
Lexing experiments in PHP
- Host: GitHub
- URL: https://github.com/nikic/phlexy
- Owner: nikic
- License: other
- Created: 2012-10-04T16:06:28.000Z (over 13 years ago)
- Default Branch: master
- Last Pushed: 2021-08-24T18:55:09.000Z (over 4 years ago)
- Last Synced: 2024-10-13T23:29:21.392Z (over 1 year ago)
- Language: PHP
- Size: 77.1 KB
- Stars: 161
- Watchers: 15
- Forks: 11
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
Phlexy
======
This project is a followup to [my post on fast lexing in PHP][lexing_blog_post]. It contains a few lexer implementations
(both stateless and stateful) and related performance tests.
Usage
-----
Lexers are created from a lexer definition using a factory class.
For example, if you want to create a MARK based stateless CSV lexer, you can use the following code:
```php
createLexer(array(
'[^",\r\n]+' => 0, // 0, 1, 2, 3 are the tokens
'"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"' => 1, // they should really be constants
',' => 2,
'\r?\n' => 3,
));
$tokens = $lexer->lex("hallo world,foo bar,more foo,more bar,\"rare , escape\",some more,stuff\n...");
```
Similarly a stateful lexer:
```php
createLexer($lexerDefinition, 'i');
```
For an example of a stateful lexer definition, you can look the [definition for lexing PHP source
code][php_lexer_definition].
Performance
-----------
A performance comparison for the different lexer implementations can be done using the [performance testing
script][performance_test_file]:
```
$ php-7.2 examples/performanceTests.php
Timing lexing of CVS data:
Took 0.55736708641052 seconds (Phlexy\Lexer\Stateless\Simple)
Took 0.526859998703 seconds (Phlexy\Lexer\Stateless\WithCapturingGroups)
Took 0.49272608757019 seconds (Phlexy\Lexer\Stateless\WithoutCapturingGroups)
Took 0.5570011138916 seconds (Phlexy\Lexer\Stateless\UsingPregReplace)
Took 0.46333193778992 seconds (Phlexy\Lexer\Stateless\UsingMarks)
Timing alphabet lexing of all "a":
Took 0.58650183677673 seconds (Phlexy\Lexer\Stateless\Simple)
Took 0.754310131073 seconds (Phlexy\Lexer\Stateless\WithCapturingGroups)
Took 0.70682787895203 seconds (Phlexy\Lexer\Stateless\WithoutCapturingGroups)
Took 0.76406478881836 seconds (Phlexy\Lexer\Stateless\UsingPregReplace)
Took 0.62837815284729 seconds (Phlexy\Lexer\Stateless\UsingMarks)
Timing alphabet lexing of all "z":
Took 0.79967403411865 seconds (Phlexy\Lexer\Stateless\Simple)
Took 0.30202317237854 seconds (Phlexy\Lexer\Stateless\WithCapturingGroups)
Took 0.29198718070984 seconds (Phlexy\Lexer\Stateless\WithoutCapturingGroups)
Took 0.36609601974487 seconds (Phlexy\Lexer\Stateless\UsingPregReplace)
Took 0.12433409690857 seconds (Phlexy\Lexer\Stateless\UsingMarks)
Timing alphabet lexing of random string:
Took 1.1720998287201 seconds (Phlexy\Lexer\Stateless\Simple)
Took 0.5946900844574 seconds (Phlexy\Lexer\Stateless\WithCapturingGroups)
Took 0.55696296691895 seconds (Phlexy\Lexer\Stateless\WithoutCapturingGroups)
Took 0.6708779335022 seconds (Phlexy\Lexer\Stateless\UsingPregReplace)
Took 0.33155107498169 seconds (Phlexy\Lexer\Stateless\UsingMarks)
Timing PHP lexing of this file:
Took 0.151211977005 seconds (Phlexy\Lexer\Stateful\Simple)
Took 0.025480031967163 seconds (Phlexy\Lexer\Stateful\UsingCompiledRegex)
Took 0.007037878036499 seconds (Phlexy\Lexer\Stateful\UsingMarks)
Timing PHP lexing of larger TestAbstract file:
Took 0.49794602394104 seconds (Phlexy\Lexer\Stateful\Simple)
Took 0.083348035812378 seconds (Phlexy\Lexer\Stateful\UsingCompiledRegex)
Took 0.019592046737671 seconds (Phlexy\Lexer\Stateful\UsingMarks)
```
`Stateless\Simple` and `Stateful\Simple` are trivial lexer implementations (which loop through the regular expressions).
`Stateless\WithoutCapturingGroups`, `Stateless\WithCapturingGroups` and `Stateful\UsingCompiledRegex` use the compiled
regex approach described in the blog post mentioned above.
`Stateless\UsingPregReplace` is an extension of the compiled regex approach, where the looping through the regular
expression is done by (mis)using `preg_replace_callback`.
`Stateless\UsingMarks` and `Stateful\UsingMark` use the `(*MARK)` mechanism that was exposed in PHP 5.5.
As the above performance measurments show, the `Simple` approach is a good bit slower than using a compiled regex approach. Mark based implementation perform much better than group offset based ones. The benefits increase with lexer size: For the CSV lexer there is relatively little difference, while for the PHP lexer the mark based implementation is 25x faster than the naive one.
[lexing_blog_post]: http://nikic.github.com/2011/10/23/Improving-lexing-performance-in-PHP.html
[php_lexer_definition]: https://github.com/nikic/Phlexy/blob/master/examples/phpLexerDefinition.php
[performance_test_file]: https://github.com/nikic/Phlexy/blob/master/examples/performanceTests.php