Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/dolthub/grammar-crawler


https://github.com/dolthub/grammar-crawler

Last synced: about 2 months ago
JSON representation

Awesome Lists containing this project

README

        

# Grammar Crawler

Grammar Crawler is a configurable SQL fuzzer that works by crawling an ANTLR4 grammar to generate a wide variety of
statements with valid syntax. It is database-agnostic, but ships with a copy of
the [MySQL 8 ANTLR grammar from Oracle's MySQL Workbench project](https://github.com/mysql/mysql-workbench/blob/8.0/library/parsers/grammars/MySQLParser.g4)
to facilitate testing with MySQL's grammar.

Grammar Crawler was built to help expand the testing for [Dolt DB](https://doltdb.com/) and to help measure compliance
with MySQL's syntax. Generated statements from Grammar Crawler are fed into a fork
of [SqlLogicTest](https://github.com/dolthub/sqllogictest) to validate them against MySQL, and then all successful
statements are added to the set of test statements that run against Dolt DB as part of nightly test runs. Grammar
crawler is still in early development, but has already helped identify gaps in Dolt DB's MySQL syntax compliance.

## Why build another SQL fuzzer?

SQL statement fuzzing is a well-established technique and there are many existing SQL fuzzers that are popular and work
well. At DoltHub, we already use multiple tools with fuzzing features as part of our testing (
i.e. [SqlLogicTest](https://github.com/dolthub/sqllogictest),
[DoltHub/fuzzer](https://github.com/dolthub/fuzzer)); Grammar Crawler takes a different approach and compliments those
existing tools.
It exploits the fact that there is a fully defined grammar for the syntax we want to support, and
allows us to thoroughly explore that space.

# Get Crawling 🐛

Before we cover any more details, let's see some quick examples of Grammar Crawler in action...

## Example: Generate All Drop Statements

This first example shows how to configure the crawler to perform a complete crawl on the `dropStatement` rule in the
MySQL grammar, with just a few rules skipped, completed statements output to stdout, and summary statistics printed
out at the end.

Open up the `DropStatementsGenerator` class in the repo, run the `main` method, and you'll see results like this:

```text
DROP UNDO TABLESPACE "text0";
DROP UNDO TABLESPACE "text1" ENGINE "text2" ENGINE "text3";
DROP UNDO TABLESPACE "text4" ENGINE "text5" ENGINE 'text6';
DROP UNDO TABLESPACE "text7" ENGINE "text8" ENGINE "text9";
DROP UNDO TABLESPACE "text10" ENGINE "text11" ENGINE `t5`;
...
DROP DATABASE IF EXISTS "text21636";
DROP DATABASE IF EXISTS `t3a33`;
DROP DATABASE IF EXISTS t3a34;

Total Statements: 14,900
Literal Element Coverage:
- Total: 47
- Used: 47
- Unused: 0
- Frequent: 31
- Coverage: 100.00%
```

## Example: Generate Create Table Statements

This next example is a bit more involved. The syntax for `create table` statements is much more complex than the syntax
for `drop` statements, so this example configures more skipped rules to prune down the crawl space. This example also
uses a different `CrawlStrategy` – instead of crawling every possible path (i.e. exploring the full syntax space), it
uses the `CoverageAwareCrawlStrategy` to try and make good decisions about whether to crawl a path in the grammar or not
based on the current coverage of the grammar. It also demonstrates the configuration for `crawler.setStatementLimit`,
which allows you to cap the maximum number of statements generated by the crawler. Finally, this example configures
multiple StatementWriters – one to write statements to stdout and a second to write SqlLogicTest protos to a file,
in a format ready to be consumed by SqlLogicTest.

Open up the `CreateTableStatementsGenerator` class in the repo, run the `main` method, and you'll see results similar to
this:

```text
CREATE TABLE `t1` (LIKE `c1`);
CREATE TABLE `t2` (LIKE c1);
CREATE TABLE `t3` LIKE `c1`;
...
CREATE TABLE `t4c4b3e` (c1 TIMESTAMP (12) UNIQUE, c2 CHAR VARYING (4) KEY) PASSWORD 'text7968898' ENGINE 'text7968899';
CREATE TABLE `t4c4b3f` (c1 TIMESTAMP (11) UNIQUE, c2 CHAR VARYING (4) KEY) PASSWORD 'text7968900' ENGINE `c3`;
CREATE TABLE `t4c4b40` (c1 TIMESTAMP (14) UNIQUE, c2 CHAR VARYING (4) KEY) PASSWORD 'text7968901' ENGINE c3;

Total Statements: 1,000,000
Literal Element Coverage:
- Total: 123
- Used: 112
- Unused: 11
- Coverage: 91.06%
```

# Crawler Components

## Crawler Configuration

The crawler can be configured to skip parts of the grammar, to control how the crawler selects paths in the grammar
graph to crawl, the maximum number of statements to generate, the output format for generated statements, and more.

## Crawl Strategies

The crawler supports three crawl strategies:

* **Full Crawl** – every path through the grammar graph will be explored, with some caveats (e.g. cycles are detected
and skipped). This mode works well for small grammars or small subsets of a grammar, but can quickly produce a LOT of
generated statement templates.
* **Random Crawl** – this naive random crawl strategy will randomly select whether or not to continue crawling a path
in the grammar graph. This is useful when you have a large solution space and only want a sample of generated
statements.
* **Coverage-Aware Crawl** – this crawl strategy uses information from the crawler on which literal elements have been
used in completed statements and attempts to maximize coverage of the grammar by preferring paths through the grammar
that will include unused literal elements.

## Reification

After the crawler finishes a complete path through the grammar graph and has generated the template for a valid
statement, it reifies the template into a valid SQL statement. This involves plugging in literal values for placeholders
in the template and doing any minor cleanup to the statement to help increase the chances of it executing cleanly. This
is a critical step for producing semantically valid statements instead of just syntactically valid statements. For
example, the reification logic should ideally understand that when plugging in a literal value for a "DEFAULT "
clause in the column specification of a table, the selected value needs to be compatible with the type of the column
being defined. Reification logic in grammar crawler is still simplistic and needs further refinement in order to
increase the crawler's ability to generate semantically correct statements.

## Statement Output

After a statement has been reified, it is sent to one or more `StatementWriter` implementations configured for the
Crawler. There are currently two implementations of `StatementWriter` available:

* `StdOutStatementWriter` – This implementation simply outputs the statement directly to StdOut.
* `SQLLogicProtoStatementWriter` – This implementation writes out a proto format suitable
for [SqlLogicTest](https://www.sqlite.org/sqllogictest/doc/trunk/about.wiki) to process.

# Contributions

We're happy to accept contributions if you want to use the grammar crawler in your work and need additional behavior.
Although we are focused exclusively on MySQL grammar compliance, there is nothing inherent in Grammar Crawler's design
that would prevent it from working with other grammars.

Feel free to cut issues or Pull Requests or [come join the Dolt DB Discord server](https://discord.com/invite/RFwfYpu)
and chat with us about ideas.