An open API service indexing awesome lists of open source software.

https://github.com/cbbrowne/pgcmp

Tool for comparing Postgres database schemas
https://github.com/cbbrowne/pgcmp

comparison postgresql schema shell

Last synced: 3 months ago
JSON representation

Tool for comparing Postgres database schemas

Awesome Lists containing this project

README

          

* pgcmp
** Purpose

This toolset is intended to perform comparisons ("reconciliation")
of schemas between databases to determine if they are /equivalent/.
There are several contexts where this is useful, each with a
somewhat different understanding of /equivalent/.

1. Build process
- The build process has database schema in two forms:
- Build From Scratch :: A deployable database is built from
scratch, from an empty database
- Upgrade From Earlier Version :: An older version is upgraded
to the latest version using a set of /upgrade scripts/.
- We need to ensure that the results of building from scratch
are /identical/ to those that come from upgrading the old
version using upgrade scripts.
- In this case, we do not expect /any/ variations, aside from the
fact that OIDs will vary between databases. These databases
will be built within the same *smoke test* environment, and
thus have the same sets of users, roles, and such, and there
should not be any additional components introduced (as might be
the case in production)
- It is therefore considered a build-breaking *error* to find
/any/ variation between the two database schemas
2. Validating upgrades
- When DBAs are planning to upgrade an instance, using the
/upgrade scripts/, it is essential that their starting point
represents a schema that is /equivalent/ to the *previous*
version. Otherwise, if the starting point is /wrong/, then the
upgrade scripts may not provide a proper upgrade to the new
version of the schema.
- In this environment, there are a number of /acceptable/ kinds
of differences that should be accommodated between the
"production" schema and a "development" schema:
- Replication :: Development does not include replication, but
in production, Slony schema, tables,
functions, and such will exist, and the
reconciliation should be able to exclude these
differences.
- DBA Tooling :: DBAs add additional tooling into production
databases, generally in their own schemas,
to help with monitoring.
- Supplementary Applications :: Some applications have add-ons,
and, when considering the main application, we may wish to
exclude add-ons.
- Specialized Ownership :: Often, different users are used in
production than are used in development.
- As a result, when reconciling the "old" schema version against
the version in production, it will be necessary to accept the
need for /reconciling entries/ so that DBAs are not left
puzzling over hundreds of differences that may be reasonably
/expected/.
3. Validating replicas *in sync*
- When deploying an upgrade, it is not sufficient to know that
the "master" node has an agreeable schema; it is essential that
/all/ replicas have suitable schemas.
- In the above analysis, it was necessary to accommodate there
being considerable difference between the "development" schema
and the "production" schema. When comparing replicas,
differences should be considered rather more dangerous.
Differences to accept/accommodate include:
- Node-specific deployment :: In some cases, add-ons are only installed on certain nodes in
the cluster.
- Origin/Replica differences :: Slony defines some triggers
that run only on origin nodes (/e.g./ - ~logtrigger()~),
and others that run only on replica nodes (/e.g./ -
~denyaccess()~).
** The Comparison Script
~pgcmp~ tooling consists of pair of scripts:

This script pulls data about the schema from the two databases that
are to be compared, loads that data into a third database where
that data is compared in order, and performs a reconciliation of
the similarities and differences.

- pgcmp-dump :: This tool extracts schema data from a database,
putting it into a data file for later analysis
- pgcmp :: This tool uses a pair of files generated by
~pgcmp-dump~, and looks for differences.

*** Security Considerations
This script requires /read-only/ access to schema data in
~INFORMATION_SCHEMA~ and ~pg_catalog~ in both databases that are
to be compared.

This script requires /write/ access to a third database in which
it constructs a series of tables used to perform the
reconciliation.
*** Running pgcmp components

**** pgcmp-dump

~pgcmp-dump~ uses a series of environment variables to control its
activity.

- Database connection information
- PGDB :: Database name
- PPGGHOST :: Database host
- PGUSER :: DB User
- PGPORT :: Port number
- PGPGURI :: Since PostgreSQL 9.2, [[https://www.postgresql.org/docs/9.2/static/libpq-connect.html][libpq supports PGURIs]] as a singular
connection string. ~pgcmp~ prefers the use of these
URIs. If the individual parameters above are offered,
they will be constructed into a URI; if ~PGURI~ is
populated, it will be used instead.
- PGCLABEL :: Indicates an identifer for the database's data to
report in analysis output
- PGCMPOUTPUT :: The name of the file in which to store output, that
is, the set of metadata about the database schema
that will be used later for comparison with the
schema of another database.
- PGCMPIGNORESLONYTRIGGERS :: defaults to ~false~; if ~true~, then ignore differences involving triggers of the form created by Slony-I

#+BEGIN_EXAMPLE
$ PGURI=postgresql://postgres@localhost/test1 PGCMPOUTPUT=/tmp/test-pgcmp-file1 PGCLABEL=db1 ../pgcmp-dump
pgcmp-dump - extracting schema data from database PGURI=[postgresql://postgres@localhost/test1]

Data file containing output: PGCMPOUTPUT[/tmp/test-pgcmp-file1]
Label: PGCLABEL=[db1]

Extracted schema data for postgresql://postgres@localhost/test1
-rw-r--r-- 1 cbbrowne cbbrowne 7350 Nov 3 12:12 /tmp/test-pgcmp-file1
#+END_EXAMPLE

**** pgcmp

~pgcmp~ uses a pair of input files as produced by ~pgcmp-dump~, loads
them into a comparison database, and checks for differences.

It uses a series of environment variables to control its activity.

- PGBINDIR :: Indicates where to find ~psql~

- Database connection information
- PGDB :: Database name
- PPGGHOST :: Database host
- PGUSER :: DB User
- PGPORT :: Port number
- PGURI :: Since PostgreSQL 9.2, [[https://www.postgresql.org/docs/9.2/static/libpq-connect.html][libpq supports PGURIs]] as a singular
connection string. ~pgcmp~ prefers the use of these
URIs. If the individual parameters above are offered,
they will be constructed into a URI; if ~PGURI~ is
populated, it will be used instead.
- PGCLABEL1 :: Label identifying data for the first database
- PGCLABEL2 :: Label identifying data for the second database
- PGCOMITSCHEMAS :: Indicates (as a SQL clause suitable to insert
VALUES into a table), a set of schema labels from
~pg_catalog.pg_namespace~ that should be omitted
from analysis. As identifiers, they may require
doublequotes consistent with the function
~pg_catalog.quote_ident()~. ~pg_catalog~ and
~information_schema~ were automatically left out
in ~pgcmp-dump~, so they do not need to be
included.
- Data file information
- PGCWORKDIR :: Work directory where work files are placed/found, defaults to ~/tmp~
- PGCEXPLANATIONS :: Contains records intended to explain expected variations
- Structure is a Postgres ~COPY~ of data for a table of expected
differences, defined as follows:
#+BEGIN_EXAMPLE
create table expected_differences (
object_type text,
object_name text,
difference_type text,
difference_cause text,
primary key(object_type, object_name, difference_type, difference_cause)
);
#+END_EXAMPLE
- PGCFULLOUTPUT :: Contains full output of all records, those that match and those that do not.
- Structure is a Postgres ~COPY~ of data from a table with the following structure:
#+BEGIN_EXAMPLE
create table fulloutput (
object_name text,
object_type text,
label1 text,
object_definition1 text,
label2 text,
object_definition2 text,
difference_type text,
difference_cause text,
importance integer
);
#+END_EXAMPLE
- PGCUNEXPLAINED :: Contains the set of records that did not match between schemas that did not have explanations provided.
- Structure is a Postgres COPY of data from a table with the following structure:
#+BEGIN_EXAMPLE
create table unexplained_items (
object_name text,
object_schema text,
object_type text,
label1 text,
id1 integer,
object_definition1 text,
label2 text,
id2 integer,
object_definition2 text,
difference_type text,
importance integer
);
#+END_EXAMPLE
- PGCBADEXPLAIN :: Contains the set of records that did not match between schemas where explanations offered did not match the problem.
- Structure is a Postgres COPY of data from a table with the following structure:
#+BEGIN_EXAMPLE
create table badexplanations_items (
object_type text,
object_schema text,
object_name text,
difference_type text,
difference_cause text,
importance integer
);
#+END_EXAMPLE
- PGCMPINPUT1 :: Input file containing metadata about the schema for the first database
- PGCMPINPUT2 :: Input file containing metadata about the schema for the second database

**** Interpretations of differences
- object_type :: Indicates which kind of object had a difference
- object_schema :: Indicates the namespace where the object occurs
- object_name :: Fully qualified name of the object
- difference_type :: Kind of difference, one of
| type | description |
|-------------------+-----------------------------------------|
| match | perfect match across the databases |
| mismatch | values differ between databases |
| missing in 1st DB | exists in 2nd database, but not 1st one |
| missing in 2nd DB | exists in 1st database, but not 2nd one |
- difference_cause :: Indicates whether a difference has been
explained satisfactorily via the ~explanations.txt~ data
- Not Yet Properly Explained :: Indicates the difference has not
been explained at all
- Still Not Properly Explained :: Indicates that
~explanations.txt~ indicates that the problem was not
explained. In effect, the administrator took the data
straight out of an earlier ~pgcmp~ run, and tried to use it
as an explanation, as opposed to analyzing it to determine
what should be done about it.
- Misexplained :: Indicates that the explanation in the
~explanations.txt~ file indicated a different
sort of ~difference_type~ than was found by
~pgcmp~.
- importance :: Indicates how crucial a discrepancy is. Low
~importance~ values indicate the most crucial items.
Generally:
- NULL :: no discrepancy found; no difference to be examined
- 1 :: mismatch between databases, likely indicating a problem to be fixed
- 2 :: object is present in one database, absent in the other
- 3 or more :: object is present in one database, absent in the
other, but is a "child" of some other missing
object. For instance, if a schema is missing, then
that would be indicated as ~importance=2~, the
tables, views, sequences, and such, in that schema
that are missing in one database would be marked
with ~importance=3~, and columns, indexes, and
other attributes of tables/views/sequences would be
marked with ~importance=4~. The items with
~importance~ of 3 or 4 are the consequence of the
~importance=2~ item; the focus of an analyst should
be in the high ~importance~ (1,2) items.

**** Input File: ~explanations.txt~

The input file, ~explanations.txt~, provides a set of *explanation*
items that allow an administrator to indicate explanations for
discrepancies that are considered acceptable.

For instance:
- Production includes Slony :: If comparing a /development/ schema
that does not include replication against a /production/ schema
where [[http://www.slony.info/][Slony]] has been installed, it is to be expected that all of
the Slony objects will comprise a set of "expected" differences.
- Production Monitoring :: DBAs may add in additional components such
as the ~pgstattuples~ contrib module, or even additional schemas
and tables.
- Production Users :: The production environment may be expected to
have additional users and roles not found in the /development/
schema.

#+BEGIN_EXAMPLE
create table expected_differences (
object_type text,
object_schema text,
object_name text,
difference_type text,
difference_cause text,
importance integer,
primary key(object_type, object_name, difference_type, difference_cause)
);
#+END_EXAMPLE

***** How To Populate ~explanations.txt~

The easiest way to populate this file is by running
~pgcmp.sh~ with an /empty/ set of explanations, and
then transforming the resulting set of unexplained items into
"explained" differences.

#+BEGIN_EXAMPLE
insert into expected_differences (object_type, object_schema,
object_name, difference_type, difference_cause) select object_type,
object_schema, object_name, difference_type, 'Slony objects only in
production' from unexplained_items where object_name like '_oxrspro%';

insert into expected_differences (object_type, object_schema,
object_name, difference_type, difference_cause) select object_type,
object_schema, object_name, difference_type, 'contrib objects only in
production' from unexplained_items where object_name like
'postgres_contrib%';

insert into expected_differences (object_type, object_schema,
object_name, difference_type, difference_cause) select object_type,
object_schema, object_name, difference_type, 'Conversion objects to be
removed from production' from unexplained_items where object_name like
'dotpro_conversion%';

\copy expected_differences to '/tmp/expected_differences.txt';
#+END_EXAMPLE

In subsequent runs, these differences become "expected"
differences, so that a DBA or QA analyst does not need to spend
their attention manually filtering out these expected
differences.

*** Outputs
The process has output in several forms:
- Brief report to standard output
- Files containing details
- Return codes useful for determining success/failure
**** Brief Report
Here is an example of running a comparison between two schemas:
- Parameters :: lists values for all the environment variables
- Extraction Summary :: lists information about the files of
extracted schema data
- SQL messages :: lists commands run against the comparison database
- Results Summary :: indicates, by object type, statistics on
matches, differences, and explanations. This is a summary
on the table ~fulloutput~.
- Inadequately Explained Items :: indicates specific objects
that were inadequately explained by the ~EXPLANATIONS~ data

**** File Output
The following files (based on contents of these environment
variables) are created and populated via COPY:
- FULLOUTPUT :: populated from table ~fulloutput~
- This contains a full list of all objects examined in both
databases, complete with objects, respective definitions,
and difference type and cause
- UNEXPLAINED :: populated from table ~unexplained_items~
- This lists all objects where there was some difference, but
no item found to explain the difference.
- BADEXPLAIN :: populated from table ~badexplanations_items~
- This lists all objects where there was some difference, and
an explanation, but the explanation did not properly explain
the difference. For instance, an object was missing from
the second database, but the explanation indicated that
there should have been a different definition (which
indicates that the object was expected to be found in both
databases).
**** Return Codes

If errors are encountered, the script ~pgcmp.sh~
will terminate with varying exit codes:

- exit 1 :: If data could not be extracted from the either of the source databases
- exit 1 :: If a connection is not established with the comparison database
- exit 2 :: If the comparison script does not run successfully
- exit 3 :: If not all object differences were adequately explained
- exit 0 :: If all runs to completion, and differences /were/
adequately explained

Thus, generally explaining this:
0. Comparison ran successfully, found no troublesome differences
1. Database connectivity problems
2. Error in processing comparison
3. Comparison ran, and found irreconcilable differences

These return codes should be useful if running scripts to do
automated schema analyses.
** Installation Requirements

~pgcmp~ is implemented as a shell script (expects to be run using
~/bin/bash~), and thus requires:

- Bash :: Available as ~/bin/bash~
- psql :: Available from a Postgres installation

If building an RPM file, this documentation, in ~README.org~, is
transformed to HTML using a Ruby script that uses a Ruby "gem"
called ~org-ruby~.

There are several ways the ~org-ruby~ gem may be installed:

- Manual Installation of Ruby Gem :: This Gem may be installed via
the command ~gem install org-ruby~. On many systems, this
would need to be run by the ~root~ user, perhaps using ~sudo~.

- Packaged Installation :: On Debian (and perhaps derivatives such
as Ubuntu), this gem is contained by the package ~ruby-org~,
and hence may be installed via ~apt-get install ruby-org~.