An open API service indexing awesome lists of open source software.

https://github.com/viragtripathi/crdb-sql-audit

CLI tool to extract, deduplicate, and analyze SQL logs for CockroachDB compatibility
https://github.com/viragtripathi/crdb-sql-audit

cockroach cockroach-cloud cockroach-database cockroachdb postgres postgresql

Last synced: 5 months ago
JSON representation

CLI tool to extract, deduplicate, and analyze SQL logs for CockroachDB compatibility

Awesome Lists containing this project

README

          

[![PyPI version](https://img.shields.io/pypi/v/crdb-sql-audit)](https://pypi.org/project/crdb-sql-audit/)
[![Python version](https://img.shields.io/pypi/pyversions/crdb-sql-audit)](https://pypi.org/project/crdb-sql-audit/)
[![License](https://img.shields.io/pypi/l/crdb-sql-audit)](https://pypi.org/project/crdb-sql-audit/)
[![Build status](https://github.com/viragtripathi/crdb-sql-audit/actions/workflows/python-ci.yml/badge.svg)](https://github.com/viragtripathi/crdb-sql-audit/actions)
[![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://github.com/codespaces/new?repo=viragtripathi/crdb-sql-audit&machine=standardLinux32gb&devcontainer_path=.devcontainer%2Fdevcontainer.json)

# crdb-sql-audit

A powerful CLI tool to extract, deduplicate, and analyze SQL logs for **CockroachDB compatibility** using a flexible, rule-based engine.

## ๐Ÿš€ Features
- Works with **any SQL dialect** (PostgreSQL, MySQL, Oracle, etc.)
- Extracts SQL and function calls using customizable search terms (e.g. `execute`, `pg_`)
- Deduplicates repeated SQL statements from logs
- Analyzes SQL using a **YAML-based rule engine**
- Supports default compatibility rules (PostgreSQL โžœ CockroachDB)
- Allows **custom rule sets** via `--rules`
- Logs analysis output to both terminal and `crdb_sql_audit.log`
- Automatically detects SQL statement types (e.g. SELECT, DELETE)
- Friendly CLI with `--help` and `--version`
- Export full reports in multiple formats:
- `.sql`: Deduplicated queries
- `.csv`: Raw compatibility issue list
- `.md`: Developer-friendly Markdown report
- `.html`: Interactive browser report with sorting/filtering
- `.png`: Visual bar chart of issues

## ๐Ÿ–ผ Sample Output

| Report Type | Preview |
|-------------|-----------------------------------------------------------------------------------------------------------------------|
| HTML | ![HTML Report Screenshot](https://raw.githubusercontent.com/viragtripathi/crdb-sql-audit/main/docs/sample_report.png) |
| Chart | ![Bar Chart](https://raw.githubusercontent.com/viragtripathi/crdb-sql-audit/main/docs/sample_chart.png) |
| CSV | ![CSV Snippet](https://raw.githubusercontent.com/viragtripathi/crdb-sql-audit/main/docs/sample_csv.png) |
| SQL | ![SQL Snippet](https://raw.githubusercontent.com/viragtripathi/crdb-sql-audit/main/docs/sample_sql.png) |
| Markdown | ![Markdown Snippet](https://raw.githubusercontent.com/viragtripathi/crdb-sql-audit/main/docs/sample_md.png) |

## ๐Ÿ“ฆ Installation

### Option A: Quick Install from PyPI

```bash
pip install crdb-sql-audit
```

### Option B: Local Dev Install
```bash
git clone https://github.com/your-org/crdb-sql-audit.git
cd crdb-sql-audit
python -m venv venv
source venv/bin/activate
pip install .
```

### Option C: Build via `pyproject.toml`
```bash
python -m build
pip install dist/crdb_sql_audit-0.2.0-py3-none-any.whl
```

## ๐Ÿงช Usage

```bash
crdb-sql-audit \
--dir /path/to/logs \
--filters execute,pg_ \
--out output/report
```

You can also analyze a single file:

```bash
crdb-sql-audit \
--file /path/to/logfile.log \
--filters SELECT,INSERT \
--raw \
--out output/single_file_report
```

> โš ๏ธ You must provide either `--dir` or `--file`, but not both.

### ๐Ÿ”ง Additional Options

```bash
--dir Directory containing SQL log files (mutually exclusive with --file)
--file Single SQL log file (mutually exclusive with --dir)
--filters Comma-separated search keywords to extract SQL (default: 'LOG: execute', 'pg_', 'LOG: statement:')
--raw Treat each matching line as a raw SQL statement (default: False)
--rules Path to YAML rules file (optional, default: built-in PostgreSQL rules)
--out Output file prefix (default: crdb_audit_output/report)
--debug Enable debug-level logging
--help Show usage help
--version Show current version
```

### ๐Ÿ“˜ CLI Help Example

```bash
crdb-sql-audit --help
```

![CLI help screenshot](https://raw.githubusercontent.com/viragtripathi/crdb-sql-audit/main/docs/cli_help.png)

### Custom Rules Example

```bash
crdb-sql-audit \
--dir ./logs \
--filters execute,pg_ \
--rules ./rules/mysql_to_crdb.yaml \
--out output/mysql_report
```

> ๐Ÿ’ก This tool supports auditing **any SQL dialect** โ€” just provide a rule set for your source database (e.g., PostgreSQL, MySQL, Oracle).

## ๐Ÿ“ Output
```
output/
โ”œโ”€โ”€ report.sql # Deduplicated SQL
โ”œโ”€โ”€ report.csv # Compatibility issues
โ”œโ”€โ”€ report.md # Markdown summary
โ”œโ”€โ”€ report.html # Interactive dashboard
โ”œโ”€โ”€ report_chart.png # Visual chart of issues
โ”œโ”€โ”€ crdb_sql_audit.log # Full run log
```

## ๐Ÿงน Preparing Your Log Files

To analyze SQL logs effectively, we recommend the following preprocessing steps:

### 1. Extract SQL-related Lines
```bash
grep "execute" app.log > sql_only.log
# or to include pg_ built-in function usage:
grep -E "execute|pg_" app.log > sql_only.log
```

### 2. Split Into Manageable Chunks (Optional but Recommended)
```bash
split -b 50M sql_only.log chunks/sql_chunk_
```

### 3. Run the Audit
```bash
crdb-sql-audit --dir chunks --filters execute,pg_ --out output/report
```

### ๐Ÿ—œ Supported Log Formats

This tool automatically supports reading:

* โœ… Regular `.log` or `.txt` files
* โœ… Compressed files: `.gz`, `.xz`
* โœ… Folders with mixed log formats

You can pass these directly using `--file` or `--dir`:

```bash
crdb-sql-audit --file logs/app.log.gz --out output/report_from_gz
```

### ๐Ÿงช Raw Mode vs. Filtered Mode

This tool supports two modes of SQL log analysis:

| Mode | Behavior |
|-----------------------|-------------------------------------------------------------------------|
| `--filters` (default) | Filters log lines using keywords like `LOG: execute`, `pg_`, etc. |
| `--raw` | Analyzes every line as a potential SQL statement โ€” no filtering applied |

> โœ… Use `--raw` if you want the most complete coverage, especially for mixed-format or unknown logs.
> โš ๏ธ Warning: large logs + `--raw` + `--debug` may generate gigabytes of audit output.

## ๐Ÿ“š Rule Engine Format

Rules are written in YAML and matched against each SQL line. Example:
> ๐Ÿ’ก This is also the default rule if you don't provide `--rules` param.

```yaml
# postgres_to_crdb.yaml โ€” Comprehensive CRDB Compatibility Rules based on https://www.cockroachlabs.com/docs/v25.2/sql-feature-support

- id: malformed_dml_statements
match: '^(SELECT|INSERT|UPDATE|DELETE FROM)\s*$'
message: "Possibly malformed or incomplete SQL statement"
level: warning
tags: [syntax]

- id: special_char_in_identifier
match: '"[^\"]*#\w*"'
message: "Table name contains unsupported special character (#)"
level: error
tags: [table, identifier]

- id: pg_builtins
match: '^.*\bpg_\w+\s*\(.*$'
message: "PostgreSQL pg_* function not supported in CockroachDB"
level: error
tags: [function]

- id: with_cte
match: '^\s*WITH\s+'
message: "CTE (WITH clause) detected"
level: warning
tags: [cte, syntax]

- id: upsert_syntax
match: '^\s*UPSERT\s+'
message: "UPSERT syntax (CockroachDB supports but should be reviewed)"
level: info
tags: [upsert, insert]

- id: json_ops
match: '->|->>|::json[b]?' # Look for JSON navigation or cast
message: "JSON/JSONB usage detected"
level: info
tags: [json]

- id: row_values
match: '\(.*\).*IN\s*\(' # e.g., WHERE (a, b) IN ((1, 2))
message: "ROW VALUES in IN clause"
level: warning
tags: [rowvalues, comparison]

- id: window_function
match: '\bOVER\s*\('
message: "Window function usage (e.g., RANK, ROW_NUMBER)"
level: info
tags: [window, analytics]

- id: set_ops
match: '\s+(UNION|INTERSECT|EXCEPT)\s+'
message: "Set operation (UNION, INTERSECT, EXCEPT)"
level: info
tags: [setops]

- id: case_expr
match: '\bCASE\b.*\bWHEN\b.*\bTHEN\b'
message: "CASE expression detected"
level: info
tags: [case, conditional]

- id: time_interval
match: 'INTERVAL\s+[''\"]'
message: "TIME INTERVAL expression"
level: info
tags: [interval, time]

- id: group_by_rollup
match: 'GROUP BY ROLLUP\('
message: "ROLLUP clause used"
level: warning
tags: [aggregation, rollup]

- id: filter_clause
match: 'FILTER\s*\(\s*WHERE'
message: "FILTER clause used in aggregation"
level: warning
tags: [aggregation, filter]
```

> ๐Ÿ“ฆ Multiple rule sets can be created to target different SQL dialects (e.g., `postgres_to_crdb.yaml`, `mysql_to_crdb.yaml`, etc.)

## ๐Ÿงช Validate Your Regex Rules

### ๐Ÿ” Online (Recommended)
Use [regex101.com](https://regex101.com/?flavor=python) to test your patterns:
- Set the **flavor to Python**
- Paste your rule into the regex field
- Paste a sample SQL line into the test area

### ๐Ÿ In Python
You can also test your rules directly:
```python
import re
pattern = re.compile(r'^.*\bpg_\w+\s*\(.*$', re.IGNORECASE)
sql = "SELECT pg_backend_pid()"
print(bool(pattern.search(sql))) # โœ… True
```

### ๐Ÿ›  Validate with Shell
You can use basic Unix commands to check for patterns like pg_ functions directly in your log chunks:

| Task | Command |
|------------------------------------|---------------------------------------------------------------------------|
| Total matches across chunks | `grep -oE '\bpg_[a-zA-Z0-9_]+\(' chunks/* \| wc -l` |
| Unique function names | `grep -oE '\bpg_[a-zA-Z0-9_]+\(' chunks/* \| sort \| uniq` |
| Count occurrences of each function | `grep -oE '\bpg_[a-zA-Z0-9_]+\(' chunks/* \| sort \| uniq -c \| sort -nr` |
| Full SQL lines containing pg\_\* | `grep -E '\bpg_[a-zA-Z0-9_]+\(' chunks/*` |

Also, before or after running `crdb-sql-audit`, you can inspect your logs to see how often common filters appear.

For example, to count usage of PostgreSQL built-ins and log patterns:

```bash
{
echo "๐Ÿ” pg_* function usage:"
grep -oE '\bpg_[a-zA-Z0-9_]+\(' chunks/* | sort | uniq -c | sort -nr
echo ""
echo "๐Ÿ” PostgreSQL LOG prefixes:"
grep -oE 'LOG: execute|LOG: statement:|LOG: duration:' chunks/* | sort | uniq -c | sort -nr
}
````

This will show counts of:

* Each `pg_` function used (e.g. `pg_backend_pid(`)
* Number of log lines using `LOG: execute`, `LOG: statement:`, and `LOG: duration:`

> โœ… Useful for checking whether your filters (`--filters`) are likely to match anything in the input.

---

## ๐Ÿงช Running Tests

This project includes a test suite using sample logs and rules to validate behavior.

### ๐Ÿ”ง To run locally:

```bash
python tests/test_runner.py
```

### ๐Ÿงช What it does:

* Runs `crdb-sql-audit` on a small sample of PostgreSQL-style logs
* Uses `tests/rules/test_rules.yaml`
* Verifies that a CSV report is created with expected issues

โœ… This runs automatically in GitHub Actions on every commit to `main`.

---

๐Ÿ““ [Try it in a Jupyter notebook](notebooks/demo_crdb_sql_audit.ipynb)