https://github.com/tracktor/padmy

Host: GitHub
URL: https://github.com/tracktor/padmy
Owner: Tracktor
License: mit
Created: 2022-08-22T12:50:14.000Z (almost 4 years ago)
Default Branch: master
Last Pushed: 2025-03-25T11:15:25.000Z (over 1 year ago)
Last Synced: 2025-03-25T11:31:20.752Z (over 1 year ago)
Language: Python
Size: 326 KB
Stars: 5
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Padmy

CLI utility functions for PostgreSQL, such as **sampling** and **anonymization**.

## Installation

Run `poetry install` to install the Python packages.

## 1. Database Exploration

You can get information about a database by running:

```bash
poetry run cli analyze --db test --schemas test
```

or by using the Docker image:

```bash
docker run -it \
  --network host \
  tracktor/padmy:latest analyze --db test --schemas test
```

For instance, here is what the command outputs for the following table definitions:

```sql
CREATE TABLE table1
(
    id SERIAL PRIMARY KEY
);

CREATE TABLE table2
(
    id        SERIAL PRIMARY KEY,
    table1_id INT REFERENCES table1
);

CREATE TABLE table3
(
    id        SERIAL PRIMARY KEY,
    table1_id INT REFERENCES table1,
    table2_id INT REFERENCES table2
);

CREATE TABLE table4
(
    id        SERIAL PRIMARY KEY,
    table1_id INT REFERENCES table1
);

INSERT INTO table1(id)
SELECT generate_series(0, 10);
```

**Default**

![Network schema](./docs/explore-default.png)

**Network Schema** (if `--show-graphs` is specified)

![Network schema](./docs/explore-schema.png)
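
The foreign keys in the example above form a small dependency graph. As a rough illustration only (this is not padmy's implementation), the sketch below encodes those four tables by hand and uses the standard library `graphlib` module (Python 3.9+) to list them so that referenced tables come first. `FOREIGN_KEYS` and `load_order` are names made up for this example.

```python
from graphlib import TopologicalSorter

# Hand-written from the SQL above: each table maps to the set of tables
# it references through a foreign key.
FOREIGN_KEYS = {
    "table1": set(),
    "table2": {"table1"},
    "table3": {"table1", "table2"},
    "table4": {"table1"},
}


def load_order(foreign_keys):
    """Return the tables so that every referenced table precedes its dependents."""
    return list(TopologicalSorter(foreign_keys).static_order())


order = load_order(FOREIGN_KEYS)
print(order)  # table1 always comes first; table3 always comes after table2
```

An ordering like this is what makes it possible to copy parent rows before the rows that point at them.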

## 2. Sampling

You can quickly sample a database (i.e. take a subset of it) by running:

```bash
poetry run cli sample \
  --db test --to-db test-sampled \
  --sample 20 \
  --schemas public
```

This will sample the `test` database into a new `test-sampled` database, a copy of the original one, keeping (when possible, see [Known limitations](#known-limitations)) **20%** of the original database.

You can choose how to sample with more granularity by passing a configuration file.

Here is an example:

```yaml
# We want a default sampling size of 20% of each table count
sample: 20
# We want to sample `schema_1` and `schema_2`
schemas:
  - schema_1
  # We want a default size of 30% for the tables of this schema
  - name: schema_2
    sample: 30
tables:
  # We want a sample size of 10% for this table
  - schema: public
    table: table_3
    sample: 10
```
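
To make the layering of these settings concrete, here is a minimal sketch of how an effective sample size could be resolved for a given table. It assumes that a table-level value overrides a schema-level value, which in turn overrides the global default; that precedence is inferred from the comments in the configuration above, not taken from padmy's code. The `CONFIG` dictionary mirrors the YAML, and `effective_sample` is a hypothetical helper.

```python
# The YAML example above, written as a plain Python dictionary.
CONFIG = {
    "sample": 20,
    "schemas": ["schema_1", {"name": "schema_2", "sample": 30}],
    "tables": [{"schema": "public", "table": "table_3", "sample": 10}],
}


def effective_sample(config, schema, table):
    """Resolve the sample size (in percent) for one table.

    Assumed precedence: table entry > schema entry > global default.
    """
    for entry in config.get("tables", []):
        if entry["schema"] == schema and entry["table"] == table:
            return entry["sample"]
    for entry in config.get("schemas", []):
        # A schema is either a bare name or a mapping with its own settings.
        if isinstance(entry, dict) and entry["name"] == schema and "sample" in entry:
            return entry["sample"]
    return config["sample"]


print(effective_sample(CONFIG, "public", "table_3"))    # table-level value
print(effective_sample(CONFIG, "schema_2", "any_table"))  # schema-level value
print(effective_sample(CONFIG, "schema_1", "any_table"))  # global default
```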

## 3. Migration utils

**Setting up**

This library includes a migration utility to help you evolve your data model.

In order to use it, start by setting up the migration table:

```bash
poetry run cli -vv migrate setup --db postgres
```

This will create the `public.migration` table, which records every migration and rollback that is applied.

**Setting up the Schemas**

Now that we are all set up, let's create our first SQL file, which will create the schema:

```bash
poetry run cli -vv migrate new-sql 1 --sql-dir /tmp/sql
```

Add `CREATE SCHEMA IF NOT EXISTS general;` to the file (the `IF NOT EXISTS` clause keeps the file safe to run more than once).

Then apply the modifications to the database:

```bash
poetry run cli -vv migrate apply-sql --sql-dir /tmp/sql --db postgres
```

Notes:

- This will go through all the files in the `/tmp/sql` folder and run them in order.
- SQL files here **need to be idempotent**: running them again must not fail or change the result.
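
As a small illustration of what idempotent means here, the sketch below runs a script twice and reports whether the second run succeeds. It uses an in-memory SQLite database from the standard library purely as a stand-in so that it runs anywhere; it does not use padmy or PostgreSQL, and `runs_twice` is a name made up for this example.

```python
import sqlite3

# Safe to re-run: the statement is a no-op when the table already exists.
IDEMPOTENT_SQL = "CREATE TABLE IF NOT EXISTS test (id INTEGER PRIMARY KEY, foo INTEGER);"
# Not safe to re-run: the second execution fails because the table exists.
NOT_IDEMPOTENT_SQL = "CREATE TABLE test2 (id INTEGER PRIMARY KEY, foo TEXT);"


def runs_twice(sql):
    """Return True if the script can be executed twice in a row without error."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(sql)
        conn.executescript(sql)
        return True
    except sqlite3.OperationalError:
        return False
    finally:
        conn.close()


print(runs_twice(IDEMPOTENT_SQL))
print(runs_twice(NOT_IDEMPOTENT_SQL))
```

The same idea carries over to PostgreSQL through clauses such as `IF NOT EXISTS` and `CREATE OR REPLACE`.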

**Creating a first migration**

Now, let's create our first migration:

```bash
mkdir -p /tmp/migrations # You can choose a different folder to store your migrations
poetry run cli -vv migrate new --sql-dir /tmp/migrations
```

This will create two new files:

- **up**: `{timestamp}-{migration_id}-up.sql`, which contains the migration to apply to the database.
- **down**: `{timestamp}-{migration_id}-down.sql`, which contains the code to revert your changes.
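
The naming scheme above can be sketched as a small parser that pairs each **up** file with its **down** file. This is an illustration, not padmy's code: it assumes the timestamp is a run of digits and the migration id is alphanumeric, and the file names used below are invented for the example.

```python
import re
from collections import defaultdict

# Assumption: `{timestamp}` is numeric and `{migration_id}` is alphanumeric.
FILE_RE = re.compile(
    r"^(?P<timestamp>\d+)-(?P<migration_id>\w+)-(?P<kind>up|down)\.sql$"
)


def pair_migrations(filenames):
    """Group up/down files by migration, ordered by timestamp."""
    pairs = defaultdict(dict)
    for name in filenames:
        match = FILE_RE.match(name)
        if match is None:
            continue  # ignore files that do not follow the naming scheme
        key = (int(match["timestamp"]), match["migration_id"])
        pairs[key][match["kind"]] = name
    return [pairs[key] for key in sorted(pairs)]


# Hypothetical file names, listed out of order on purpose.
files = [
    "1700000500-b2-up.sql",
    "1700000000-a1-down.sql",
    "1700000000-a1-up.sql",
    "notes.txt",
]
migrations = pair_migrations(files)
print(migrations)
```

A migration whose dictionary lacks a `down` entry (the second one here) is the kind of gap a verification step would be expected to flag.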

Let's now modify the `up.sql` file with:

```sql
CREATE TABLE IF NOT EXISTS general.test
(
    id  int primary key,
    foo int
);

CREATE TABLE IF NOT EXISTS general.test2
(
    id  serial primary key,
    foo text
);
```

and check that the migration is valid:

```bash
poetry run cli -vv migrate verify --sql-dir /tmp/migrations
```

Because we did not add anything to the `down.sql` file, the command returns an error.

Let's modify it to make the command pass:

```sql
DROP TABLE general.test;
DROP TABLE general.test2;
```

Then run the verification again:

```bash
poetry run cli -vv migrate verify --sql-dir /tmp/migrations
```

We are all good!

**Optional**: You can also verify that the order of the migrations is correct by running:

```bash
poetry run cli -vv migrate verify-files --sql-dir /tmp/migrations --no-raise
```

## 4. Comparing database schemas

You can compare two databases by running:

```bash
poetry run cli -vv schema-diff --db tracktor --schemas schema_1,schema_2
```

If differences are found, the command prints them.
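
To give an idea of what a schema difference looks like, here is a minimal sketch that compares two schema definitions as text with the standard library `difflib` module. How padmy actually computes and formats its diff is not described here, so treat this purely as an illustration; `SCHEMA_A`, `SCHEMA_B` and `schema_diff` are made-up names.

```python
import difflib

# Two hypothetical schema dumps that differ in the type of one column.
SCHEMA_A = """\
CREATE TABLE general.test (
    id int PRIMARY KEY,
    foo int
);
"""

SCHEMA_B = """\
CREATE TABLE general.test (
    id int PRIMARY KEY,
    foo text
);
"""


def schema_diff(left, right):
    """Return the unified diff between two schema dumps as a list of lines."""
    return list(
        difflib.unified_diff(
            left.splitlines(),
            right.splitlines(),
            fromfile="db_a",
            tofile="db_b",
            lineterm="",
        )
    )


diff = schema_diff(SCHEMA_A, SCHEMA_B)
print("\n".join(diff))
```

Identical inputs produce an empty list, which matches the idea that output only appears when differences are found.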

### Known limitations

**Exact sample size**

Sometimes, we cannot guarantee that the sampled table will have exactly the expected size.

For instance, let's say we want **10%** of *table1* and **10%** of *table2*, given the following table definitions:

```sql
CREATE TABLE table1
(
    id SERIAL PRIMARY KEY
);

CREATE TABLE table2
(
    id        SERIAL PRIMARY KEY,
    table1_id INT NOT NULL REFERENCES table1
);

INSERT INTO table1(id)
VALUES (1);

INSERT INTO table2(table1_id)
SELECT 1
FROM generate_series(1, 10);
```

In this case, it is not possible to keep less than **100%** of *table1*, since its single row is referenced by the `table1_id` of every row in *table2*.
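
The arithmetic behind this example can be checked with a few lines of Python. This is a sketch of the reasoning only, not of padmy's sampling algorithm; the lists stand in for the two tables and `sample_with_parents` is a hypothetical helper.

```python
import math

table1_ids = [1]             # the single row of table1
table2_fks = [1] * 10        # table1_id of each of the 10 rows of table2


def sample_with_parents(children_fk, percent):
    """Take `percent` of the child rows and collect the parent ids they need."""
    n = max(1, math.ceil(len(children_fk) * percent / 100))
    sampled_children = children_fk[:n]
    required_parents = set(sampled_children)
    return sampled_children, required_parents


children, parents = sample_with_parents(table2_fks, 10)
kept_ratio = 100 * len(parents) / len(table1_ids)

print(len(children))  # rows kept from table2
print(kept_ratio)     # share of table1 that must be kept, in percent
```

Keeping a single row of *table2* already forces the only row of *table1* into the sample, so the requested 10% for *table1* cannot be honored.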

**Cyclic foreign keys**

Cyclic foreign keys (a table with a FK to another table that in turn references the first one) are not supported.

Here is an example:

```sql
CREATE TABLE table1
(
    id        SERIAL PRIMARY KEY,
    table2_id INT NOT NULL
);

CREATE TABLE table2
(
    id        SERIAL PRIMARY KEY,
    table1_id INT NOT NULL
);

ALTER TABLE table1
    ADD CONSTRAINT table1_table2_id_fk
        FOREIGN KEY (table2_id) REFERENCES table2;

ALTER TABLE table2
    ADD CONSTRAINT table2_table1_id_fk
        FOREIGN KEY (table1_id) REFERENCES table1;
```

![Cyclic dependencies](./docs/cyclic-deps.png)

You can display cyclic dependencies in a database by running:

```bash
poetry run cli -vv analyze --db test --schemas test --show-graph
```

(**Note:** you need to have the `network` extra installed.)
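
For a quick check outside of padmy, a cycle like the one above can also be detected with the standard library `graphlib` module (Python 3.9+). The sketch below is an illustration under that assumption and does not reflect how padmy detects cycles; the graph is written by hand and `find_cycle` is a made-up helper.

```python
from graphlib import CycleError, TopologicalSorter


def find_cycle(foreign_keys):
    """Return one cycle as a list of table names, or None if the graph is acyclic."""
    try:
        TopologicalSorter(foreign_keys).prepare()
    except CycleError as exc:
        # The detected cycle is the second element of the exception's args;
        # its first and last entries are the same node.
        return exc.args[1]
    return None


# Each table maps to the tables it references, as in the SQL example above.
CYCLIC = {"table1": {"table2"}, "table2": {"table1"}}
ACYCLIC = {"table1": set(), "table2": {"table1"}}

cycle = find_cycle(CYCLIC)
print(cycle)
print(find_cycle(ACYCLIC))
```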

**Self referencing foreign keys**

Foreign keys referencing another column in the same table are ignored.

```sql
CREATE TABLE table1
(
    id        SERIAL PRIMARY KEY,
    parent_id INT REFERENCES table1
);
```

# Annexes

## Showing Network in Jupyter

You can display the network visualization in Jupyter using [jupyter_dash]()

```python
from jupyter_dash import JupyterDash
from padmy.sampling import network, viz, sampling
import asyncpg

PG_URL = 'postgresql://postgres:postgres@localhost:5432/test'

app = JupyterDash(__name__)

# Explore the `public` schema of the `test` database.
# Top-level `async with` / `await` work inside a Jupyter cell.
db = sampling.Database(name='test')
async with asyncpg.create_pool(PG_URL) as pool:
    await db.explore(pool, ['public'])

# Convert the explored database into a graph and build the layout from it.
g = network.convert_db(db)
app.layout = viz.get_layout(g,
                            style={'width': '100%', 'height': '800px'},
                            layout='klay')

app.run_server(mode='jupyterlab')  # or mode='inline'
```