:toc:
:toclevels: 4
:toc-placement!:

= SQL Conventions
image::https://img.shields.io/badge/Slack-Join%20our%20tech%20community-17202A?logo=slack[link=https://join.slack.com/t/fgribreau/shared_invite/zt-edpjwt2t-Zh39mDUMNQ0QOr9qOj~jrg]
==== The only SQL convention you will ever need.
This convention is used at @cloud_iam_com, @Netwo, @OuestFrance, @hook0, @iAdvize, @Bringr, @Redsmin and @Oxmoto.
toc::[]
=== Data layer
* For SQL use https://www.postgresql.org[PostgreSQL], it’s the
https://insights.stackoverflow.com/survey/2018/#technology-most-loved-dreaded-and-wanted-databases[most
loved relational database (StackOverflow survey 2018)] and it’s a
multi-model database (K/V store, Document store (use jsonb), foreign
data wrapper, and much more). Any questions?

=== Application layer
* If your API mainly does data persistence, https://postgrest.com[PostgREST]
is the way to go; only implement the missing parts in another process. You
can then compose both APIs behind a reverse proxy.
* Otherwise, use a data-mapping library
(e.g. https://github.com/tpolecat/doobie[doobie]), not an ORM.

==== Queries
* Don’t use BETWEEN
(https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_BETWEEN_.28especially_with_timestamps.29[why])
* Prefer `=` to `LIKE`
____
LIKE compares characters, and can be paired with wildcard operators like %, whereas the = operator compares strings and numbers for exact matches. The = can take advantage of indexed columns. (https://www.metabase.com/learn/building-analytics/sql-templates/sql-best-practices[source])
____
* Prefer `EXISTS` to `IN`
____
If you just need to verify the existence of a value in a table, prefer EXISTS to IN, as the EXISTS process exits as soon as it finds the search value, whereas IN will scan the entire table. IN should be used for finding values in lists.
Similarly, prefer NOT EXISTS to NOT IN. (https://www.metabase.com/learn/building-analytics/sql-templates/sql-best-practices[source])
____
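To make both rules concrete, here is a small sketch (the `customer` and `order` tables are hypothetical):

[source,sql]
----
-- Hypothetical tables, for illustration only.

-- Prefer = to LIKE for exact matches: = can use an index on email.
SELECT customer__id FROM public.customer WHERE email = 'jane@example.org';
-- rather than: WHERE email LIKE 'jane@example.org'

-- Prefer EXISTS to IN for existence tests: EXISTS stops at the first match.
SELECT c.customer__id
FROM public.customer c
WHERE EXISTS (
  SELECT 1 FROM public."order" o WHERE o.customer__id = c.customer__id
);
-- rather than: WHERE c.customer__id IN (SELECT customer__id FROM public."order")
----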
=== DDL - Data Description Language

* `SET search_path = pg_catalog` to force every object declaration (besides triggers) to explicitly specify schema names. This lowers the number of bugs and gives developers a better understanding, because https://getnobullshit.com/[explicit > implicit].
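A minimal sketch of the effect (the `account` table is made up):

[source,sql]
----
-- With search_path pinned to pg_catalog, unqualified names fail fast,
-- so every declaration and query has to spell out its schema.
SET search_path = pg_catalog;

CREATE TABLE public.account (
  account__id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  email       text NOT NULL
);

-- SELECT * FROM account;       -- fails: relation "account" does not exist
SELECT * FROM public.account;   -- explicit schema, unambiguous
----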
=== Tables/Views
==== Table/Views name
* *singular* (e.g. `+team+` not `+teams+`) (https://launchbylunch.com/posts/2014/Feb/16/sql-naming-conventions/#singular-relations[Here is why])
* *snake_case* (e.g. `block_theme_version`)
* *double underscore* for `+n-n+` tables (e.g. `+user__organization+`)

==== Columns
* *snake_case* (e.g. `+created_at+`, not `+createdAt+` or `+CreatedAt+`), because in PostgreSQL https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS[keywords and unquoted identifiers are case insensitive], which is the source of many mistakes
* *double underscore* for PK and FK columns (e.g. (PK) `+user__id+`, (FK) `+organization__id+`)
** why?
*** leverage `using(column__id)`
*** easier to spot the table-name part of a PK/FK (the part before `__`) in snake_case columns
*** columns are case-sensitive in PostgreSQL but unquoted identifiers in SQL queries are case insensitive
* *`NOT NULL` by default*, NULL is the exception (think of it as the https://github.com/chrissrogers/maybe#why[maybe Monad])
* *No abbreviation* unless it's both well-known and very long like `i18n`
* *No reserved keywords* (https://www.postgresql.org/docs/8.1/sql-keywords-appendix.html[Complete list])
* *Use UUID* as PK and FK (https://www.clever-cloud.com/blog/engineering/2015/05/20/why-auto-increment-is-a-terrible-idea/[Here is why]), (https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_serial[do not use `serial`]); rely on `gen_random_uuid()` (https://shusson.info/post/benchmark-v4-uuid-generation-in-postgres[benchmark]). See the example table below.
* Note that when you use Postgres native UUID v4 type instead of bigserial, table size grows by 25% and insert rate drops to 25%.
* If you choose `bigserial`, then distinguish internal and external ids (e.g. GitLab's internal schema design uses an `iid` column for the ids that are publicly shared with the end user). Don't forget to add an index: `CREATE UNIQUE INDEX index_issues_on_project_id_and_iid ON public.issues USING btree (project_id, iid);`
* Use `text` or `citext` (variable unlimited length) with a check constraint instead of `varchar(n)` or `char(n)`.
* `text` type with a CHECK constraint lets you evolve the schema more easily than `character varying` or `varchar(n)` when you have length checks (https://shekhargulati.com/2022/07/08/my-notes-on-gitlabs-postgres-schema-design/[source])

==== Date time management
* Use `timestamptz` everywhere you need to store a date (e.g. `+created_at TIMESTAMPTZ DEFAULT now()+` (https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_timestamp_.28without_time_zone.29[Here is why])) and leverage the https://www.postgresql.org/docs/current/brin-intro.html[BRIN] index on it
* `+updated_at TIMESTAMPTZ DEFAULT now()+` unless you plan to leverage event-sourcing (https://www.morling.dev/blog/last-updated-columns-with-postgres/[learn more])
* `+deleted_at TIMESTAMPTZ DEFAULT NULL+`:
** unless you plan to leverage event-sourcing
** don't forget to take `+deleted_at+` into account in your unique constraints (http://stackoverflow.com/questions/8289100/create-unique-constraint-with-null-columns/8289253#8289253[how]); a partial unique index for this is shown in the example table below
* Comment each column: explain your rationale and your decisions, in plain English
* Boolean columns must start with either `+is+` or `+has+`.
* https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_char.28n.29[Don't use char(n)]
https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_char.28n.29_even_for_fixed-length_identifiers[even for fixed-length identifiers]
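Putting the column rules and the date-time rules above together, a sketch of a table that follows them (all names are made up for illustration and `public.organization` is assumed to exist):

[source,sql]
----
-- Illustrative only.
CREATE TABLE public.user_account (
  user_account__id uuid        PRIMARY KEY DEFAULT gen_random_uuid(),
  organization__id uuid        NOT NULL REFERENCES public.organization (organization__id)
                                 ON DELETE CASCADE ON UPDATE CASCADE,
  email            text        NOT NULL CHECK (length(email) <= 320),
  is_active        boolean     NOT NULL DEFAULT true,
  created_at       timestamptz NOT NULL DEFAULT now(),
  updated_at       timestamptz NOT NULL DEFAULT now(),
  deleted_at       timestamptz DEFAULT NULL
);

COMMENT ON COLUMN public.user_account.deleted_at IS 'Soft-delete marker; NULL means the row is live.';

-- BRIN index on the insertion-ordered timestamp.
CREATE INDEX user_account_created_at_idx ON public.user_account USING brin (created_at);

-- Unique email among non-deleted rows only (partial unique index).
CREATE UNIQUE INDEX user_account_email_key ON public.user_account (email) WHERE deleted_at IS NULL;
----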
=== Constraints

General rule is: `+{tablename}_{columnname(s)}_{suffix}+`
(e.g. `+table_name_column_name_a_pk+`) where the suffix is one of the
following:

* Primary Key constraint: `+pk+`
* Foreign key: `+fk+`
* Unique constraint: `+key+`
* Check constraint: `+chk+`
* Exclusion constraint: `+exl+`
* Any other kind of index: `+idx+`

==== PK - Primary Key
* `+{table_name}_{column_name}_pk+` in case of a single column PK
* `+{table_name}_{column_name1}_{column_name2}_{column_name3}_pk+` in case of
multiple columns as primary key (`+column_name1+`, `+column_name2+`,
`+column_name3+`)

==== FK - Foreign key
* `+{from_table_name}_{from_column_name}_{to_table_name}_{to_column_name}__fk+`
* Always specify `ON DELETE` and `ON UPDATE` in order to force *you* to think about the consequences for references

==== Unique
* `+{from_table_name}_{from_column_name}_key+` in case of a single column unique
constraint
* `+{from_table_name}_{from_column_name1}_{from_column_name2}_{from_column_name3}__key+` in case of
multiple columns as unique (`+column_name1+`, `+column_name2+`,
`+column_name3+`)
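For instance, on the `+user__organization+` join table from earlier (assuming `user` and `organization` tables exist), explicit constraint names following these rules could look like:

[source,sql]
----
-- Illustrative only: explicit constraint names.
CREATE TABLE public.user__organization (
  user__id         uuid NOT NULL,
  organization__id uuid NOT NULL,
  role             text NOT NULL,

  CONSTRAINT user__organization_user__id_organization__id_pk
    PRIMARY KEY (user__id, organization__id),
  CONSTRAINT user__organization_user__id_user_user__id__fk
    FOREIGN KEY (user__id) REFERENCES public."user" (user__id)
    ON DELETE CASCADE ON UPDATE CASCADE,
  CONSTRAINT user__organization_role_chk
    CHECK (role IN ('owner', 'member'))
);
----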
=== Functions

==== Name

There are 3 types of functions: `+notify+` functions, `+private+` functions and `+public+` functions.

* *notify*, format: `+notify_[schema_name]_[table_name]_[event]+` (e.g. `+notify_authentication_user_created(user_id)+`): should only format the notification message and use `pg_notify` underneath (a sketch follows this list). Beware of the
http://stackoverflow.com/a/41059797/745121[8000 characters limit], only
send metadata (ids), data should be asked by workers through the API. If
you really wish to send data then
https://github.com/xstevens/pg_kafka[pg_kafka] might be a better
alternative.
* *private*, format: `+_function_name+`
(e.g. `+_reset_failed_login+`): must never be exposed through the public
schema. Used mainly for consistency and business-rules
* *public*, format: `+function_name+` (e.g. `+log_in(email, password)+`): must be
exposed through the public schema.
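Below is a hedged sketch of a *notify* function (the channel name and payload shape are illustrative):

[source,sql]
----
-- Illustrative only: format the message and call pg_notify, nothing else.
-- The payload carries ids only, staying well under the 8000-character limit.
CREATE FUNCTION public.notify_authentication_user_created(user_id$ uuid)
RETURNS void AS $$
BEGIN
  PERFORM pg_notify(
    'authentication_user_created',
    json_build_object('user_id', user_id$)::text
  );
END;
$$ LANGUAGE plpgsql;
----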
==== Parameters

Every parameter name must end with `$`. This prevents any "Reference to XXX is ambiguous" issue.
===== Example
```sql
create function lib_fsm.transition_create(
from_state__id$ uuid,
event$ varchar(30),
to_state__id$ uuid,
description$ text default null
)
```
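And a tiny sketch of why the suffix matters (the table and function are made up): without `$`, `email` inside the body could refer either to the parameter or to the column.

[source,sql]
----
-- Illustrative only: email$ can never clash with the email column.
CREATE FUNCTION public.user_find(email$ text)
RETURNS uuid AS $$
  SELECT user__id FROM public."user" WHERE email = email$;
$$ LANGUAGE sql STABLE;
----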
=== Types

==== Enum types
Don't use enums: you will have issues over time because https://stackoverflow.com/a/25812436/745121[you cannot remove an element from an enum].
If your enum represents various states, leverage https://en.wikipedia.org/wiki/Finite-state_machine[a state machine], for instance with a library like https://github.com/netwo-io/lib_fsm[lib_fsm].
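If you do not need a full state machine, a dedicated reference table (also recommended further down in this document) gives the same safety while staying easy to evolve. A sketch with made-up names:

[source,sql]
----
-- Illustrative only: a reference table instead of an enum type.
CREATE TABLE public.order_status (
  order_status__id text PRIMARY KEY   -- 'pending', 'paid', 'shipped', ...
);

CREATE TABLE public."order" (
  order__id        uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  order_status__id text NOT NULL REFERENCES public.order_status (order_status__id)
                          ON DELETE RESTRICT ON UPDATE CASCADE
);

-- Adding or retiring a status is a plain INSERT/DELETE, no ALTER TYPE needed.
INSERT INTO public.order_status (order_status__id) VALUES ('pending'), ('paid'), ('shipped');
----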
==== Boolean

Always use `true` and `false`, without single quotes.
The PostgreSQL documentation says that `TRUE` and `FALSE` should be preferred because they are more SQL compliant but hey, LET'S STOP YELLING WHEN WE WRITE SQL, SHALL WE?
==== String
- Multi-line strings must be represented with dollar quoting, e.g. `$_$my string$_$`
==== JSONB
- prefer `+jsonb+` to `json` and SQL arrays: `jsonb` has better query performance and more efficient storage
- A `metadata` jsonb column is a great way to let the end-user store arbitrary key-value data to these objects. (e.g. https://documentation.hook0.com/docs/metadata https://stripe.com/docs/api/metadata )
Metadata key-value pairs must be https://www.getnobullshit.com/tech-lead/tout-limiter-dans-lespace-et-dans-le-temps[limited in space]; you can use a trigger for that:
[source,sql]
----
CREATE OR REPLACE FUNCTION validate_metadata()
RETURNS TRIGGER AS $$
DECLARE
    key TEXT;
    value TEXT;
    keys INT;
BEGIN
    keys := 0;
    FOR key, value IN (SELECT * FROM jsonb_each_text(NEW.metadata))
    LOOP
        keys := keys + 1;
        IF length(key::text) > 40 OR length(value::text) > 500 THEN
            RAISE 'Key and value must be at most 40 and 500 characters long respectively.';
        END IF;
        IF keys > 50 THEN
            RAISE 'A maximum of 50 keys are allowed in the metadata.';
        END IF;
    END LOOP;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER validate_metadata_trigger
BEFORE INSERT OR UPDATE ON your_table
FOR EACH ROW EXECUTE FUNCTION validate_metadata();
----

=== Triggers
==== Name
(translation in progress)
==== Columns
* use BCNF (above 3NF) (cf. normal forms)
* leverage `+using+`, so instead of:

[source,sql]
----
select from
table_1
inner join table_2
on table_1.table_1_id =
table_2.table_1_id
----

use:
[source,sql]
----
select from
table_1
inner join table_2
using (table_1_id)
----

* don't use PostgreSQL enums: you will have issues when you need to https://stackoverflow.com/a/25812436/745121[remove some values over time]. Use a dedicated table instead.
* use the right PostgreSQL types:

....
inet (IP address)
timestamp with time zone
point (2D point)
tstzrange (time range)
interval (duration)
....

* constraints should live inside your database as much as possible:
[source,sql]
----
create table reservation(
reservation_id uuid primary key,
dates tstzrange not null,
exclude using gist (dates with &&)
);
----

* use row-level security to enforce R/U/D access on each table's rows (see the sketch below)
(http://stackoverflow.com/questions/4107915/postgresql-default-constraint-names/4108266#4108266[source])
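A minimal row-level-security sketch (the `document` table, the `organization__id` column and the `app.current_organization_id` setting are assumptions):

[source,sql]
----
-- Illustrative only: each role only sees rows belonging to its organization.
ALTER TABLE public.document ENABLE ROW LEVEL SECURITY;

CREATE POLICY document_organization_isolation ON public.document
  USING (organization__id = current_setting('app.current_organization_id')::uuid);
----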
=== Policies
==== Name
todo.
=== SQL Formatter
```bash
docker run --rm --network=none guriandoro/sqlparse:0.3.1 "SELECT several, columns from a_table as a join another_table as b where a.id = 1;"
```

=== Configuration
==== `statement_timeout`
Since we do want to https://www.getnobullshit.com/[limit everything in space and time], configure `statement_timeout` on the role to let your database abort any statement that takes more than the specified amount of time (in ms).
```sql
-- Limit in time SQL queries => improve overall reliability
-- https://www.postgresql.org/docs/current/runtime-config-client.html
-- PostgreSQL WILL ABORT any statement that takes more than the specified amount of time (in milliseconds)
-- If you do have an issue with that, please first (from first to last):
-- - .. check that your query is relying on indices (did you use EXPLAIN (ANALYZE, BUFFERS) ?)
-- - .. consider materialized views
-- - .. ensure pg cache settings are OK
-- - .. ensure the disk is SSD and fast enough
-- - .. ensure the server has enough CPU & RAM
-- - .. check if it's for analytics purposes; if so, requesting a postgres replica might be a better idea
-- When all these above points were evaluated *then* we can all talk about increasing the values below :)
alter role APP_ROLE_THAT_DOES_THE_QUERY set statement_timeout to '250ms';
```

== Things to monitor
- https://www.percona.com/blog/2020/05/29/removing-postgresql-bottlenecks-caused-by-high-traffic/[Removing PostgreSQL Bottlenecks Caused by High Traffic]
____
Your cache hit ratio tells you how often your data is served from in
memory vs. having to go to disk. Serving from memory vs. going to disk
will be orders of magnitude faster, thus the more you can keep in memory
the better. Of course you could provision an instance with as much
memory as you have data, but you don’t necessarily have to. Instead
watching your cache hit ratio and ensuring it is at 99% is a good metric
for proper performance.
(https://www.citusdata.com/blog/2019/03/29/health-checks-for-your-postgres-database/[Source])
____

[source,sql]
----
SELECT
sum(heap_blks_read) as heap_read,
sum(heap_blks_hit) as heap_hit,
sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as ratio
FROM
pg_statio_user_tables;
----

____
Under the covers Postgres is essentially a giant append only log. When
you write data it appends to the log, when you update data it marks the
old record as invalid and writes a new one, when you delete data it just
marks it invalid. Later Postgres comes through and vacuums those dead
records (also known as tuples). All those unvacuumed dead tuples are
what is known as bloat. Bloat can slow down other writes and create
other issues. Paying attention to your bloat and when it is getting out
of hand can be key for tuning vacuum on your database.
(https://www.citusdata.com/blog/2019/03/29/health-checks-for-your-postgres-database/[Source])
____

[source,sql]
----
WITH constants AS (
SELECT current_setting('block_size')::numeric AS bs, 23 AS hdr, 4 AS ma
), bloat_info AS (
SELECT
ma,bs,schemaname,tablename,
(datawidth+(hdr+ma-(case when hdr%ma=0 THEN ma ELSE hdr%ma END)))::numeric AS datahdr,
(maxfracsum*(nullhdr+ma-(case when nullhdr%ma=0 THEN ma ELSE nullhdr%ma END))) AS nullhdr2
FROM (
SELECT
schemaname, tablename, hdr, ma, bs,
SUM((1-null_frac)*avg_width) AS datawidth,
MAX(null_frac) AS maxfracsum,
hdr+(
SELECT 1+count(*)/8
FROM pg_stats s2
WHERE null_frac<>0 AND s2.schemaname = s.schemaname AND s2.tablename = s.tablename
) AS nullhdr
FROM pg_stats s, constants
GROUP BY 1,2,3,4,5
) AS foo
), table_bloat AS (
SELECT
schemaname, tablename, cc.relpages, bs,
CEIL((cc.reltuples*((datahdr+ma-
(CASE WHEN datahdr%ma=0 THEN ma ELSE datahdr%ma END))+nullhdr2+4))/(bs-20::float)) AS otta
FROM bloat_info
JOIN pg_class cc ON cc.relname = bloat_info.tablename
JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname = bloat_info.schemaname AND nn.nspname <> 'information_schema'
), index_bloat AS (
SELECT
schemaname, tablename, bs,
COALESCE(c2.relname,'?') AS iname, COALESCE(c2.reltuples,0) AS ituples, COALESCE(c2.relpages,0) AS ipages,
COALESCE(CEIL((c2.reltuples*(datahdr-12))/(bs-20::float)),0) AS iotta -- very rough approximation, assumes all cols
FROM bloat_info
JOIN pg_class cc ON cc.relname = bloat_info.tablename
JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname = bloat_info.schemaname AND nn.nspname <> 'information_schema'
JOIN pg_index i ON indrelid = cc.oid
JOIN pg_class c2 ON c2.oid = i.indexrelid
)
SELECT
type, schemaname, object_name, bloat, pg_size_pretty(raw_waste) as waste
FROM
(SELECT
'table' as type,
schemaname,
tablename as object_name,
ROUND(CASE WHEN otta=0 THEN 0.0 ELSE table_bloat.relpages/otta::numeric END,1) AS bloat,
CASE WHEN relpages < otta THEN '0' ELSE (bs*(table_bloat.relpages-otta)::bigint)::bigint END AS raw_waste
FROM
table_bloat
UNION
SELECT
'index' as type,
schemaname,
tablename || '::' || iname as object_name,
ROUND(CASE WHEN iotta=0 OR ipages=0 THEN 0.0 ELSE ipages/iotta::numeric END,1) AS bloat,
CASE WHEN ipages < iotta THEN '0' ELSE (bs*(ipages-iotta))::bigint END AS raw_waste
FROM
index_bloat) bloat_summary
ORDER BY raw_waste DESC, bloat DESC
----

____
Postgres makes it simple to query for unused indexes so you can easily
give yourself back some performance by removing them
(https://www.citusdata.com/blog/2019/03/29/health-checks-for-your-postgres-database/[Source])
____

[source,sql]
----
SELECT
schemaname || '.' || relname AS table,
indexrelname AS index,
pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size,
idx_scan as index_scans
FROM pg_stat_user_indexes ui
JOIN pg_index i ON ui.indexrelid = i.indexrelid
WHERE NOT indisunique AND idx_scan < 50 AND pg_relation_size(relid) > 5 * 8192
ORDER BY pg_relation_size(i.indexrelid) / nullif(idx_scan, 0) DESC NULLS FIRST,
pg_relation_size(i.indexrelid) DESC;
----

____
pg_stat_statements is useful for monitoring your database query
performance. It records a lot of valuable stats about which queries are
run, how fast they return, how many times they're run, etc. Checking in on
this set of queries regularly can tell you where it is best to add indexes
or optimize your application so your query calls may not be so
excessive.
(https://www.citusdata.com/blog/2019/03/29/health-checks-for-your-postgres-database/[Source])
____

[source,sql]
----
SELECT query,
calls,
total_time,
total_time / calls as time_per,
stddev_time,
rows,
rows / calls as rows_per,
100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements
WHERE query not similar to '%pg_%'
and calls > 500
--ORDER BY calls
--ORDER BY total_time
order by time_per
--ORDER BY rows_per
DESC LIMIT 20;
----

== Schema design
* https://github.com/FGRibreau/stripe-schema[Stripe's own schema]
== Tools
* https://www.postgresql.org/docs/9.4/pgstatstatements.html[pg_stat_statements]
* https://github.com/darold/pgbadger[A fast PostgreSQL Log Analyzer]
* https://pganalyze.com[PostgreSQL Performance Monitoring]

== Migrations
- https://pythonspeed.com/articles/schema-migrations-server-startup/[How to do Zero-downtime migrations]
- https://medium.com/braintree-product-technology/postgresql-at-scale-database-schema-changes-without-downtime-20d3749ed680[Zero-downtime migrations best practices]

== Good practices
* https://hakibenita.com/sql-dos-and-donts[12 Common Mistakes and Missed Optimization Opportunities in SQL]
* https://pythonspeed.com/articles/schema-migrations-server-startup/[Don't apply migrations on application startup]

== Managed PostgreSQL Databases
* Google Cloud PostgreSQL
** Pros
** Cons
*** No support for plv8
*** Any features that require `superuser` privileges are not supported
*** `postgres` role is not a `superuser`
**** Can create roles
**** Can not select from tables that are restricted by default like `pg_shadow`
**** Thus can not edit `pg_catalog.pg_class` (in order to change row level security activation for example)
**** Can read from all necessary tables other than `pg_authid`
* Scaleway Managed PostgreSQL:
** Pros
*** multi-schema support
*** configuration options are editable
*** user/role management is self-service
** Cons
*** /
* OVH Cloud SQL
** Pros
*** /
** Cons
*** no multi-schema support