{"id":13705686,"url":"https://github.com/laurenz/pgreplay","last_synced_at":"2025-05-05T16:33:46.892Z","repository":{"id":45424084,"uuid":"14922217","full_name":"laurenz/pgreplay","owner":"laurenz","description":"pgreplay reads a PostgreSQL log file (*not* a WAL file), extracts the SQL statements and executes them in the same order and relative time against a PostgreSQL database cluster.","archived":false,"fork":false,"pushed_at":"2023-10-02T06:34:30.000Z","size":264,"stargazers_count":214,"open_issues_count":1,"forks_count":29,"subscribers_count":7,"default_branch":"master","last_synced_at":"2024-08-03T22:15:49.771Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/laurenz.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2013-12-04T11:37:54.000Z","updated_at":"2024-08-03T02:57:18.000Z","dependencies_parsed_at":"2024-01-14T19:14:35.092Z","dependency_job_id":"1f4749f5-9732-42d5-b03e-53fc17cb0bd0","html_url":"https://github.com/laurenz/pgreplay","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/laurenz%2Fpgreplay","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/laurenz%2Fpgreplay/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/laurenz%2Fpgreplay/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/laurenz%2Fpgreplay/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/owners/laurenz","download_url":"https://codeload.github.com/laurenz/pgreplay/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224455907,"owners_count":17314204,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T22:00:46.107Z","updated_at":"2024-11-13T13:30:49.186Z","avatar_url":"https://github.com/laurenz.png","language":"C","funding_links":[],"categories":["C"],"sub_categories":[],"readme":"pgreplay - record and replay real-life database workloads\n=========================================================\n\npgreplay reads a PostgreSQL log file (*not* a WAL file), extracts the\nSQL statements and executes them in the same order and with the original\ntiming against a PostgreSQL database.\n\nIf the execution of statements gets behind schedule, warning messages\nare issued that indicate that the server cannot handle the load in a\ntimely fashion.\n\nA final report gives you a useful statistical analysis of your workload\nand its execution.\n\nThe idea is to replay a real-world database workload as exactly as possible.\n\nThis is useful for performance tests, particularly in the following\nsituations:\n- You want to compare the performance of your PostgreSQL application\n  on different hardware or different operating systems.\n- You want to upgrade your database and want to make sure that the new\n  database version does not suffer from performance regressions that\n  affect you.\n\nMoreover, pgreplay can give you some feeling as to how your application\n*might* scale by allowing you 
to try to replay the workload at a higher\nspeed (if that is possible; see\n[implementation details](#implementation-details) below).\nBe warned, though, that 500 users working at double speed is not really\nthe same as 1000 users working at normal speed.\n\nWhile pgreplay will find out if your database application will encounter\nperformance problems, it does not provide a lot of help in the analysis of\nthe cause of these problems.  Combine pgreplay with a specialized analysis\nprogram like [pgBadger](https://pgbadger.darold.net/) for that.\n\nAs an additional feature, pgreplay lets you split the replay in two\nparts: you can parse the log file and create a \"replay file\", which\ncontains just the statements to be replayed and is hopefully much\nsmaller than the original log file.  \nSuch a replay file can then be run against a database.\n\npgreplay is written by Laurenz Albe and is inspired by \"Playr\"\nwhich never made it out of Beta.\n\nInstallation\n============\n\npgreplay needs PostgreSQL 8.0 or better.\n\nIt is supposed to compile without warnings and run on all platforms\nsupported by PostgreSQL.  \nSince I only got to test it on Linux, AIX, FreeBSD and Windows, there may be\nproblems with other platforms. I am interested in reports and fixes for\nthese platforms.  \nOn Windows, only the MinGW build environment is supported (I have no\nother compiler). That means that there is currently no 64-bit build\nfor Windows (but a 32-bit executable should work fine anywhere).\n\nTo build pgreplay, you will need the `pg_config` utility. 
If you installed\nPostgreSQL using installation packages, you will probably have to install\nthe development package that contains `pg_config` and the header files.\n\nIf `pg_config` is on the `PATH`, the installation process will look like this:\n\n- unpack the tarball\n- `./configure`\n- `make`\n- `make test`     (optional, described below)\n- `make install`  (as superuser)\n\nIf your PostgreSQL installation is in a nonstandard directory, you\nwill have to use the `--with-postgres=\u003cpath to location of pg_config\u003e`\noption of `configure`.\n\nUnless you link it statically, pgreplay requires the PostgreSQL client \nshared library on the system where it is run.\n\nThe following utilities are only necessary if you intend to develop pgreplay:\n- autoconf 2.62 or better to generate `configure`\n- GNU tar to `make tarball` (unless you want to roll it by hand)\n- groff to make the HTML documentation with `make html`\n\nDocker\n------\n\nThe `Dockerfile` provided with the software can be used as a starting\npoint for creating a container that runs pgreplay.  Adapt it as necessary.\n\nHere are commands to build and run the container:\n\n```\n# build the image\ndocker build -t laurenz/pgreplay -f Dockerfile .\n\n# and run it\ndocker run --rm -ti -v $(pwd):/app -w /app laurenz/pgreplay pgreplay -h\n```\n\nTesting\n-------\n\nYou can run a test on pgreplay before installing by running `make test`.\nThis will parse sample log files and check that the result is as\nexpected.\n\nThen an attempt is made to replay the log files and check if that\nworks as expected.  
For this you need psql installed and a PostgreSQL server\nrunning (on this or another machine) so that the following command\nwill succeed:\n\n    psql -U postgres -d postgres -l\n\nYou can set up the `PGPORT` and `PGHOST` environment variables and a password\nfile for the user if necessary.\n\nThere have to be login roles named `hansi` and `postgres` in the database,\nand both users must be able to connect without a password.  Only `postgres`\nwill be used to run actual SQL statements.  The regression test will create\na table `runtest` and use it, and it will drop the table when it is done.\n\nUsage\n=====\n\nFirst, you will need to record your real-life workload.\nFor that, set the following parameters in `postgresql.conf`:\n\n- `log_min_messages = error`  (or more)  \n   (if you know that you have no cancel requests, `log` will do)\n- `log_min_error_statement = log`  (or more)\n- `log_connections = on`\n- `log_disconnections = on`\n- `log_line_prefix = '%m|%u|%d|%c|'`  (if you don't use CSV logging)\n- `log_statement = 'all'`\n- `lc_messages` must be set to English (the encoding does not matter)\n- `bytea_output = escape`  (from version 9.0 on, only if you want to replay\n                            the log on 8.4 or earlier)\n\nIt is highly recommended that you use CSV logging, because anything that\nthe PostgreSQL server or any loaded modules write to standard error will\nbe written to the stderr log and might confuse the parser.\n\nThen let your users have their way with the database.\n\nMake sure that you have a `pg_dumpall` of the database cluster from the time\nof the start of your log file (or use the `-b` option with the time of your\nbackup).  
Alternatively, you can use point-in-time-recovery to clone your\ndatabase at the appropriate time.\n\nWhen you are done, restore the database (in the \"before\" state) to the\nmachine where you want to perform the load test and run pgreplay against\nthat database.\n\nTry to create a scenario as similar to your production system as\npossible (except for the change you want to test, of course).  For example,\nif your clients connect over the network, run pgreplay on a different\nmachine from where the database server is running.\n\nSince passwords are not logged (and pgreplay consequently has no way of\nknowing them), you have two options: either change `pg_hba.conf` on the\ntest database to allow `trust` authentication or (if that is unacceptable)\ncreate a password file as described by the PostgreSQL documentation.\nAlternatively, you can change the passwords of all application users\nto one single password that you supply to pgreplay with the `-W` option.\n\nLimitations\n===========\n\npgreplay can only replay what is logged by PostgreSQL.\nThis leads to some limitations:\n\n- `COPY` statements will not be replayed, because the copy data are not logged.\n  I could have supported `COPY TO` statements, but that would have imposed a\n  requirement that the directory structure on the replay system must be\n  identical to the original machine.\n  And if your application runs on the same machine as your database and they\n  interact on the file system, pgreplay will probably not help you much\n  anyway.\n- Fast-path API function calls are not logged and will not be replayed.\n  Unfortunately, this includes the Large Object API.\n- Since the log file is always written in the database encoding (which you\n  can specify with the `-E` switch of pgreplay), all `SET client_encoding`\n  statements will be ignored.\n- If your cluster contains databases with different encoding, the log file\n  will have mixed encoding as well.  
You cannot use pgreplay well in such\n  an environment, because many statements against databases whose\n  encoding does not match the `-E` switch will fail.\n- Since the preparation time of prepared statements is not logged (unless\n  `log_min_messages` is `debug2` or more), these statements will be prepared\n  immediately before they are first executed during replay.\n- All parameters of prepared statements are logged as strings, no matter\n  what type was originally specified during bind.\n  This can cause errors during replay with expressions like `$1 + $2`,\n  which will cause the error `operator is not unique: unknown + unknown`.\n\nWhile pgreplay makes sure that commands are sent to the server in the\norder in which they were originally executed, there is no way to guarantee\nthat they will be executed in the same order during replay:  Network\ndelay, processor contention and other factors may cause a later command\nto \"overtake\" an earlier one.  While this does not matter if the\ncommands don't affect each other, it can lead to SQL statements hitting\nlocks unexpectedly, causing replay to deadlock and \"hang\".\nThis is particularly likely if many different sessions change the same data\nrepeatedly in short intervals.\n\nYou can work around this problem by canceling the waiting statement with\npg_cancel_backend.  Replay should continue normally after that.\n\nImplementation details\n======================\n\npgreplay will track the \"session ID\" associated with each log entry (the\nsession ID uniquely identifies a database connection).\nFor each new session ID, a new database connection will be opened during\nreplay.  Each statement will be sent on the corresponding connection, so\ntransactions are preserved and concurrent sessions cannot get in each\nother's way.\n\nThe order of statements in the log file is strictly preserved, so there\ncannot be any race conditions caused by different execution speeds on\nseparate connections.  
On the other hand, that means that long running\nqueries on one connection may stall execution on concurrent connections,\nbut that's all you can get if you want to reproduce the exact same\nworkload on a system that behaves differently.\n\nAs an example, consider this (simplified) log file:\n\n    session 1|connect\n    session 2|connect\n    session 1|statement: BEGIN\n    session 1|statement: SELECT something(1)\n    session 2|statement: BEGIN\n    session 2|statement: SELECT something(2)\n    session 1|statement: SELECT something(3)\n    session 2|statement: ROLLBACK\n    session 2|disconnect\n    session 1|statement: COMMIT\n    session 1|disconnect\n\nThis will cause two database connections to be opened, so the `ROLLBACK` in\nsession 2 will not affect session 1.\nIf `SELECT something(2)` takes longer than expected (longer than it did in\nthe original), that will not stall the execution of `SELECT something(3)`\nbecause it runs on a different connection.  The `ROLLBACK`, however, has to\nwait for the completion of the long statement.  Since the order of statements\nis preserved, the `COMMIT` on session 1 will have to wait until the `ROLLBACK`\non session 2 has started (but it does not have to wait for the completion of\nthe `ROLLBACK`).\n\npgreplay is implemented in C and makes heavy use of asynchronous command\nprocessing (which is the reason why it is implemented in C).\nThis way a single process can handle many concurrent connections, which\nmakes it possible to get away without multithreading or multiprocessing.\n\nThis avoids the need for synchronization and many portability problems.\nBut since TINSTAAFL, the choice of C brings along its own portability\nproblems.  
Go figure.\n\nReplay file format\n------------------\n\nThe replay file is a binary file; integer numbers are stored in network\nbyte order.\n\nEach record in the replay file corresponds to one database operation\nand is constructed as follows:\n- 4-byte `unsigned int`: log file timestamp in seconds since 2000-01-01\n- 4-byte `unsigned int`: fractional part of log file timestamp in microseconds\n- 8-byte `unsigned int`: session id\n- 1-byte `unsigned int`: type of the database action:\n  - 0 is connect\n  - 1 is disconnect\n  - 2 is simple statement execution\n  - 3 is statement preparation\n  - 4 is execution of a prepared statement\n  - 5 is cancel request\n- The remainder of the record is specific to the action; strings are stored\n  with a preceding 4-byte unsigned int that contains the length.\n  Read the source for details.\n- Each record is terminated by a new-line character (byte 0x0A).\n\nSupport\n=======\n\nIf you have a problem or question, the preferred option is to [open an\nissue](https://github.com/laurenz/pgreplay/issues).\nThis requires a GitHub account.\n\nProfessional support can be bought from\n[CYBERTEC PostgreSQL International GmbH](https://www.cybertec-postgresql.com/).\n\nTODO list\n=========\n\nNothing currently.  Tell me if you have good ideas.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flaurenz%2Fpgreplay","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flaurenz%2Fpgreplay","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flaurenz%2Fpgreplay/lists"}