![sqlgrep](assets/logo_small.png)
# sqlgrep
Combines SQL with regular expressions to provide a new way to filter and process text files.

## Build
* Requires cargo (https://rustup.rs/).
* Build with: `cargo build --release`
* Build output in `target/release/sqlgrep`

## Example
First, a schema needs to be defined that will transform text lines into structured data:
```
CREATE TABLE connections(
    line = 'connection from ([0-9.]+) \\((.+)?\\) at ([a-zA-Z]+) ([a-zA-Z]+) ([0-9]+) ([0-9]+):([0-9]+):([0-9]+) ([0-9]+)',

    line[1] => ip TEXT,
    line[2] => hostname TEXT,
    line[9] => year INT,
    line[4] => month TEXT,
    line[5] => day INT,
    line[6] => hour INT,
    line[7] => minute INT,
    line[8] => second INT
);
```
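
To make the mapping concrete, consider an illustrative line in the ftpd-style format that the regex above targets (the line is made up for this example and not necessarily present in the test data):

```
connection from 82.68.222.194 (host.example.com) at Sun Jun 19 01:12:57 2005
```

This would produce a row with `ip = '82.68.222.194'`, `hostname = 'host.example.com'`, `month = 'Jun'`, `day = 19`, `hour = 1`, `minute = 12`, `second = 57` and `year = 2005`; the weekday group (group 3) is not mapped to any column.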

If we want to know the IP and hostname for all connections that have a hostname in the file `testdata/ftpd_data.txt` (with the table definition above stored in `testdata/ftpd.txt`), we can run:

```
sqlgrep -d testdata/ftpd.txt testdata/ftpd_data.txt -c "SELECT ip, hostname FROM connections WHERE hostname IS NOT NULL"
```

We can also do it "live" by following the file as it grows, similar to `tail -f` (note the `-f` argument):

```
sqlgrep -d testdata/ftpd.txt testdata/ftpd_data.txt -f -c "SELECT ip, hostname FROM connections WHERE hostname IS NOT NULL"
```

If we want to know how many connection attempts we get per hostname (i.e. a GROUP BY query), we can run:

```
sqlgrep -d testdata/ftpd.txt testdata/ftpd_data.txt -c "SELECT hostname, COUNT() AS count FROM connections GROUP BY hostname"
```

See the `testdata` folder and `src/integration_tests.rs` for more examples.

# Documentation
sqlgrep tries to follow the SQL standard, so normal SQL queries should generally work as expected. However, not every feature is supported yet.

## Queries
Supported features (see the combined example after this list):
* Where.
* Group by.
* Aggregates.
* Having.
* Inner & outer joins. The joined table is loaded completely in memory.
* Limits.
* Extract(x FROM y) for timestamps.
* Case expressions.
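
A sketch of how several of these features can combine against the `connections` table defined earlier (the query is illustrative, not taken from the test suite, and whether an aggregate may appear inside a `CASE` expression is an assumption):

```
SELECT hostname,
       COUNT() AS count,
       CASE WHEN COUNT() >= 10 THEN 'frequent' ELSE 'rare' END AS class
FROM connections
WHERE hostname IS NOT NULL
GROUP BY hostname
HAVING COUNT() >= 2
LIMIT 10
```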

Supported aggregates (see the example after this list):
* `count(x)`
* `min(x)`
* `max(x)`
* `sum(x)`
* `avg(x)`
* `stddev(x)`
* `variance(x)`
* `percentile(x, p)`: calculates the `p` percentile of `x`, where `p` is in the interval `[0.0, 1.0]`
* `bool_and(x)`
* `bool_or(x)`
* `array_agg(x)`
* `string_agg(x, delimiter)`
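
For example, against the `connections` table from above, a query along these lines (illustrative, not taken from the test data) combines a few of the aggregates:

```
SELECT hostname,
       percentile(day, 0.9) AS day_p90,
       array_agg(ip) AS ips
FROM connections
GROUP BY hostname
```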

Supported functions (see the example after this list):
* `least(INT|REAL|INTERVAL, INT|REAL|INTERVAL) => INT|REAL|INTERVAL`
* `greatest(INT|REAL|INTERVAL, INT|REAL|INTERVAL) => INT|REAL|INTERVAL`
* `abs(INT|REAL|INTERVAL) => INT|REAL|INTERVAL`
* `sqrt(REAL) => REAL`
* `pow(REAL, REAL) => REAL`
* `regex_matches(TEXT, TEXT) => BOOLEAN`
* `length(TEXT) => INT`
* `upper(TEXT) => TEXT`
* `lower(TEXT) => TEXT`
* `array_unique(ARRAY) => ARRAY`
* `array_length(ARRAY) => INT`
* `array_cat(ARRAY, ARRAY) => ARRAY`
* `array_append(ARRAY, ANY) => ARRAY`
* `array_prepend(ANY, ARRAY) => ARRAY`
* `now() => TIMESTAMP`
* `make_timestamp(INT, INT, INT, INT, INT, INT, INT) => TIMESTAMP`
* `date_trunc(TEXT, TIMESTAMP) => TIMESTAMP`
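
A sketch of some scalar functions in a query, again against the `connections` table above (illustrative only):

```
SELECT upper(hostname) AS host,
       length(hostname) AS host_length,
       greatest(hour, minute) AS hm_max
FROM connections
WHERE regex_matches(hostname, '.*com')
```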

## Special features
The input file can either be specified on the command line or as an additional argument to the `FROM` clause, as follows:
```
SELECT * FROM connections::'file.log';
```
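
Assuming the same table definition file as in the earlier examples, the data file can then be omitted from the command line; the exact invocation below is an assumption based on the description above rather than documented usage:

```
sqlgrep -d testdata/ftpd.txt -c "SELECT ip, hostname FROM connections::'testdata/ftpd_data.txt'"
```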

## Tables
### Syntax
```
CREATE TABLE <table name>(
    Separate pattern and column definition. The pattern can be used in multiple column definitions.
    <pattern name> = '<regex pattern>',
    <pattern name>[<group index>] => <column name> <column type>,

    Use regex splits instead of matches.
    <pattern name> = split '<regex pattern>',

    Inline regex. Will be bound to the first group.
    '<regex pattern>' => <column name> <column type>,

    Array pattern. Will create an array of fixed size based on the given patterns.
    <pattern name>[<group index>], <pattern name>[<group index>], ... => <column name> <element type>[],

    Timestamp pattern. Will create a timestamp (year, month, day, hour, minute, second). Each part is optional.
    <pattern name>[<group index>], <pattern name>[<group index>], ... => <column name> TIMESTAMP,

    Json pattern. Will access the given attribute.
    { .field1.field2 } => <column name> <column type>,
    { .field1[<array index>] } => <column name> <column type>
);
```
Multiple tables can be defined in the same file.
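
As a concrete sketch of the `split` form above (the log format, table name, columns, and the assumption that split parts are 1-indexed like regex groups are all made up for illustration):

```
CREATE TABLE requests(
    line = split '\\s+',

    line[1] => method TEXT,
    line[2] => path TEXT,
    line[3] => status INT
);
```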

### Supported types
* `TEXT`: String type.
* `INT`: 64-bit integer type.
* `REAL`: 64-bit float type.
* `BOOLEAN`: Boolean type. When extracting data, it indicates the _existence_ of a group.
* `<element type>[]`: Array types such as `real[]`.
* `TIMESTAMP`: Timestamp type.
* `INTERVAL`: Interval type.

### Modifiers
Modifiers are placed after the column type and add additional constraints/transforms when extracting the value for a column (see the sketch after this list).
* `NOT NULL`: The column cannot be `NULL`. If a not null column gets a null value, the row is skipped.
* `TRIM`: Trims surrounding whitespace from string values.
* `CONVERT`: Tries to convert a string value into the column's type.
* `DEFAULT <value>`: Use this value as the default instead of `NULL`.
* `MICROSECONDS`: The decimal second part is in microseconds, not milliseconds.
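
A sketch of modifiers applied to a trimmed-down variant of the earlier `connections` table (combining modifiers in this way is an assumption):

```
CREATE TABLE connections(
    line = 'connection from ([0-9.]+) \\((.+)?\\) at .+ ([0-9]+)$',

    line[1] => ip TEXT NOT NULL TRIM,
    line[2] => hostname TEXT DEFAULT 'unknown',
    line[3] => year INT
);
```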