{"id":13640423,"url":"https://github.com/cybertec-postgresql/pg_squeeze","last_synced_at":"2025-06-19T03:05:28.533Z","repository":{"id":40523220,"uuid":"74355274","full_name":"cybertec-postgresql/pg_squeeze","owner":"cybertec-postgresql","description":"A PostgreSQL extension for automatic bloat cleanup","archived":false,"fork":false,"pushed_at":"2025-04-09T10:38:39.000Z","size":1066,"stargazers_count":547,"open_issues_count":5,"forks_count":34,"subscribers_count":20,"default_branch":"master","last_synced_at":"2025-04-20T02:33:31.365Z","etag":null,"topics":["postgresql","postgresql-extension"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cybertec-postgresql.png","metadata":{"files":{"readme":"README.md","changelog":"NEWS","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-11-21T11:05:49.000Z","updated_at":"2025-04-17T04:56:49.000Z","dependencies_parsed_at":"2024-01-14T09:11:59.040Z","dependency_job_id":"10a8fe57-a5c0-4880-819f-9cc50babb009","html_url":"https://github.com/cybertec-postgresql/pg_squeeze","commit_stats":null,"previous_names":[],"tags_count":46,"template":false,"template_full_name":null,"purl":"pkg:github/cybertec-postgresql/pg_squeeze","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cybertec-postgresql%2Fpg_squeeze","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cybertec-postgresql%2Fpg_squeeze/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cybertec-postgresql%2Fpg_squeeze/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/repositories/cybertec-postgresql%2Fpg_squeeze/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cybertec-postgresql","download_url":"https://codeload.github.com/cybertec-postgresql/pg_squeeze/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cybertec-postgresql%2Fpg_squeeze/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260676931,"owners_count":23045115,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["postgresql","postgresql-extension"],"created_at":"2024-08-02T01:01:11.003Z","updated_at":"2025-06-19T03:05:23.501Z","avatar_url":"https://github.com/cybertec-postgresql.png","language":"C","funding_links":[],"categories":["C","Extensions"],"sub_categories":[],"readme":"PostgreSQL extension that removes unused space from a table and optionally\nsorts tuples according to particular index (as if [CLUSTER][2] command was\nexecuted concurrently with regular reads / writes). In fact we try to replace\n[`pg_repack`][1] extension.\n\nWhile providing very similar functionality, `pg_squeeze` takes a different\napproach as it:\n\n1. Implements the functionality purely on server side.\n2. 
Utilizes recent improvements of PostgreSQL database server.\n\nWhile (1) makes both configuration and use simpler (compared to [pg_repack][1]\nwhich uses both server and client side code), it also allows for rather smooth\nimplementation of unattended processing using [background workers][3].\n\nAs for (2), one important difference (besides the use of background workers) is\nthat we use [logical decoding][4] instead of triggers to capture concurrent\nchanges.\n\n# INSTALL\n\nInstall PostgreSQL before proceeding. Make sure the `pg_config` binary is\navailable; it is typically included in `-dev` or `-devel` packages.\n\n```bash\ngit clone https://github.com/cybertec-postgresql/pg_squeeze.git\ncd pg_squeeze\nmake\nmake install\n```\n\nAdd these to `postgresql.conf`:\n\n```\nwal_level = logical\nmax_replication_slots = 1 # ... or add 1 to the current value.\nshared_preload_libraries = 'pg_squeeze' # ... or add the library to the existing ones.\n```\n\nRestart the cluster, and invoke:\n\n```\nCREATE EXTENSION pg_squeeze;\n```\n\n*Note: when upgrading a database cluster with pg_squeeze installed (either using\n`pg_dumpall`/restore or `pg_upgrade`), make sure that the new cluster has\n`pg_squeeze` in `shared_preload_libraries` *before* you upgrade, otherwise\nthe upgrade will fail.*\n\n# Register table for regular processing\n\nFirst, make sure that your table has either a primary key or a unique constraint.\nThis is necessary to process changes other transactions might make while\n`pg_squeeze` is doing its work.\n\nTo make the `pg_squeeze` extension aware of the table, you need to insert a\nrecord into the `squeeze.tables` table. Once added, statistics of the table are\nchecked periodically. Whenever the table meets criteria to be \"squeezed\", a\n\"task\" is added to a queue. 
The tasks are processed sequentially, in the order\nthey were created.\n\nThe simplest \"registration\" looks like:\n\n```\nINSERT INTO squeeze.tables (tabschema, tabname, schedule)\nVALUES ('public', 'foo', ('{30}', '{22}', NULL, NULL, '{3, 5}'));\n```\n\nAdditional columns can be specified optionally, for example:\n\n```\nINSERT INTO squeeze.tables (\n    tabschema,\n    tabname,\n    schedule,\n    free_space_extra,\n    vacuum_max_age,\n    max_retry\n)\nVALUES (\n    'public',\n    'bar',\n    ('{30}', '{22}', NULL, NULL, '{3, 5}'),\n    30,\n    '2 hours',\n    2\n);\n```\n\nThe following is the complete description of table metadata.\n\n* `tabschema` and `tabname` are schema and table name respectively.\n\n* `schedule` column tells when the table should be checked, and possibly\n  squeezed. The schedule is described by a value of the following composite\n  data type, which resembles a [crontab][6] entry:\n\n  ```\n  CREATE TYPE schedule AS (\n      minutes       minute[],\n      hours         hour[],\n      days_of_month dom[],\n      months        month[],\n      days_of_week  dow[]\n  );\n  ```\n\n  Here, `minutes` (0-59) and `hours` (0-23) specify the time of the check\n  within a day, while `days_of_month` (1-31), `months` (1-12) and\n  `days_of_week` (0-7, where both 0 and 7 stand for Sunday) determine the day\n  of the check.\n\n  The check is performed if `minutes`, `hours` and `months` all match the current\n  timestamp; a NULL value means any minute, hour and month respectively. As\n  for `days_of_month` and `days_of_week`, at least one of these needs to match\n  the current timestamp, or both need to be NULL for the check to take place.\n\n  For example, the entries above specify that the tables `public`.`foo` and\n  `public`.`bar` should be checked every Wednesday and Friday at 22:30.\n\n* `free_space_extra` is the minimum percentage of `extra free space` needed to\n  trigger processing of the table. 
The `extra` adjective refers to the fact\n  that free space derived from `fillfactor` is not a reason to squeeze the table.\n\n  For example, if `fillfactor` is equal to 60, then at least 40 percent of each\n  page should stay free during normal operation. If you want to ensure that 70\n  percent of free space makes pg_squeeze interested in the table, set\n  `free_space_extra` to 30 (that is 70 percent required to be free minus the 40\n  percent free due to the `fillfactor`).\n\n  Default value of `free_space_extra` is 50.\n\n* `min_size` is the minimum disk space in megabytes the table must occupy to be\n  eligible for processing. The default value is 8.\n\n* `vacuum_max_age` is the maximum time since the completion of the last VACUUM\n  to consider the free space map (FSM) fresh. Once this interval has elapsed,\n  the portion of dead tuples might be significant and so more effort than\n  simply checking the FSM needs to be spent to evaluate the potential effect\n  of `pg_squeeze`. The default value is 1 hour.\n\n* `max_retry` is the maximum number of extra attempts to squeeze a table if the\n  first processing of the corresponding task failed. A typical reason to retry\n  the processing is that the table definition was changed while the table was\n  being squeezed. If the maximum number of retries is reached, processing of the\n  table is considered complete. The next task is created as soon as the next\n  scheduled time is reached.\n\n  The default value of `max_retry` is 0 (i.e. do not retry).\n\n* `clustering_index` is an existing index of the processed table. Once the\n  processing is finished, tuples of the table will be physically sorted by the\n  key of this index.\n\n* `rel_tablespace` is an existing tablespace the table should be moved into.\n  NULL means that the table should stay where it is.\n\n* `ind_tablespaces` is a two-dimensional array in which each row specifies\n  tablespace mapping of an index. 
The first and the second columns represent\n  index name and tablespace name respectively. All indexes for which no mapping\n  is specified will stay in the original tablespace.\n\n  Regarding tablespaces, one special case is worth mentioning: if a tablespace is\n  specified for the table but not for indexes, the table gets moved to that\n  tablespace but the indexes stay in the original one (i.e. the tablespace of\n  the table is not the default for indexes as one might expect).\n\n* `skip_analyze` indicates that table processing should not be followed by\n  ANALYZE command. The default value is `false`, meaning ANALYZE is performed\n  by default.\n\n`squeeze.tables` **is the only table the user should modify. If you want to change\nanything else, make sure you perfectly understand what you are doing.**\n\n# Ad-hoc processing for any table\n\nIt's also possible to squeeze tables manually without registering\n(i.e. inserting the corresponding record into `squeeze.tables`), and without\nprior checking of the actual bloat.\n\nFunction signature:\n\n```\nsqueeze.squeeze_table(\n    tabchema name,\n    tabname name,\n    clustering_index name,\n    rel_tablespace name,\n    ind_tablespaces name[]\n)\n```\n\nSample execution:\n\n```\nSELECT squeeze.squeeze_table('public', 'pgbench_accounts');\n```\n\nNote that the function is not transactional: it only starts a background\nworker, tells it which table it should process and exits. Rollback of the\ntransaction the function was called in does not revert the changes done by the\nworker.\n\n# Enable / disable table processing\n\nTo enable processing of bloated tables, run this statement as superuser:\n\n```\nSELECT squeeze.start_worker();\n```\n\nThe function starts a background worker (`scheduler worker`) that periodically\nchecks which of the registered tables should be checked for bloat, and creates\na task for each. 
Another worker (`squeeze worker`) is launched whenever a task\nexists for a particular database.\n\nIf the scheduler worker is already running for the current database, the\nfunction does not report any error but the new worker will exit immediately.\n\nIf the workers are running for the current database, you can use the following\nstatement to stop them:\n\n```\nSELECT squeeze.stop_worker();\n```\n\n**Only the functions mentioned in this documentation are considered user\ninterface.  If you want to call any other one, make sure you perfectly\nunderstand what you're doing.**\n\nIf you want the background workers to start automatically during startup of the\nwhole PostgreSQL cluster, add entries like this to `postgresql.conf` file:\n\n```\nsqueeze.worker_autostart = 'my_database your_database'\nsqueeze.worker_role = postgres\n```\n\nNext time you start the cluster, two or more workers (i.e. one `scheduler\nworker` and one or more `squeeze workers`) will be launched for `my_database`\nand the same for `your_database`. If you take this approach, note that any\nworker will either refuse to start or will stop without doing any work if\neither:\n\n1. The `pg_squeeze` extension does not exist in the database, or\n\n2. `squeeze.worker_role` parameter specifies a role which does not have the\n   superuser privileges.\n\n*The functions/configuration variables explained above use singular form of the\nword `worker` although there are actually two workers. This is because only one\nworker existed in the previous versions of pg_squeeze, which ensured both\nscheduling and execution of the tasks. This implementation change is probably\nnot worth forcing all users to adjust their configuration files during\nupgrade.*\n\n# Control the impact on other backends\n\nAlthough the table being squeezed is available for both read and write\noperations by other transactions most of the time, an exclusive lock is needed to\nfinalize the processing. 
If pg_squeeze occasionally seems to block access to\ntables too much, consider setting `squeeze.max_xlock_time` GUC parameter. For\nexample:\n\n```\nSET squeeze.max_xlock_time TO 100;\n```\n\nThis tells pg_squeeze that the exclusive lock shouldn't be held for more than\n0.1 second (100 milliseconds). If more time is needed for the final stage, pg_squeeze releases\nthe exclusive lock, processes changes committed by other transactions in\nbetween and tries the final stage again. An error is reported if the lock duration\nis exceeded a few more times. If that happens, you should either increase the\nsetting or schedule processing of the problematic table for a different time of day,\nwhen the write activity is lower.\n\n# Running multiple workers per database\n\nIf you think that a single squeeze worker does not cope with the load,\nconsider setting the `squeeze.workers_per_database` configuration variable to\na value higher than 1. Then the `pg_squeeze` extension will be able to process\nmultiple tables at a time - one table per squeeze worker. However, be aware\nthat this setting affects all databases in which you actively use the\n`pg_squeeze` extension. The total number of all the squeeze workers in the\ncluster (including the \"scheduler workers\") cannot exceed the in-core\nconfiguration variable `max_worker_processes`.\n\n# Monitoring\n\n* `squeeze.log` table contains one entry per successfully squeezed table.\n\n  The columns `tabschema` and `tabname` identify the processed table. The\n  columns `started` and `finished` tell when the processing started and\n  finished. `ins_initial` is the number of tuples inserted into the new table\n  storage during the \"initial load stage\", i.e. the number of tuples present\n  in the table before the processing started. On the other hand, `ins`, `upd`\n  and `del` are the numbers of tuples inserted, updated and deleted by\n  applications during the table processing. 
(These \"concurrent data changes\"\n  must also be incorporated into the squeezed table, otherwise they'd get\n  lost.)\n\n* `squeeze.errors` table contains errors that happened during squeezing. A\n  common problem reported here is that someone changed the definition (e.g. added or\n  removed a column) of the table while its processing was in progress.\n\n* `squeeze.get_active_workers()` function returns a table of squeeze workers\n  which are currently processing tables in the current database.\n\n  The `pid` column contains the system PID of the worker process. The other\n  columns have the same meaning as their counterparts in the `squeeze.log`\n  table. While the `squeeze.log` table only shows information on the completed\n  squeeze operations, the `squeeze.get_active_workers()` function lets you\n  check the progress during the processing.\n\n# Unregister table\n\nIf a particular table should no longer be subject to periodic squeezing, simply\ndelete the corresponding row from `squeeze.tables` table.\n\nIt's also a good practice to unregister a table that you're going to drop,\nalthough the background worker does unregister non-existing tables\nperiodically.\n\n# Upgrade\n\nMake sure to install PostgreSQL and `pg_config`, see [install](#install)\nsection.\n\n```bash\nmake # Compile the newer version.\npg_ctl -D /path/to/cluster stop # Stop the cluster.\nmake install\npg_ctl -D /path/to/cluster start # Start the cluster.\n```\n\nConnect to each database containing `pg_squeeze` and run this command:\n\n```\nALTER EXTENSION pg_squeeze UPDATE;\n```\n\n# Upgrade from 1.2.x\n\n**As there's no straightforward way to migrate the scheduling\ninformation (see the notes on the `schedule` column of the `squeeze.tables`\ntable) automatically, and as the `schedule` column must not contain NULL\nvalues, the upgrade deletes the contents of the `squeeze.tables`\ntable. 
Please export the table contents to a file before you perform the\nupgrade and configure the checks of those tables again as soon as the upgrade\nis done.**\n\n\n# Concurrency\n\n1. The extension does not prevent other transactions from altering the table at\n   certain stages of the processing. If a \"disruptive command\" (i.e. `ALTER\n   TABLE`, `VACUUM FULL`, `CLUSTER` or `TRUNCATE`) manages to commit before the\n   squeeze could finish, the `squeeze_table()` function aborts and all changes\n   done to the table are rolled back. The `max_retry` column of\n   `squeeze.tables` table determines how many times the squeeze worker will\n   retry. Besides that, changing the schedule might help you to avoid disruptions.\n\n2. Like [`pg_repack`][1], `pg_squeeze` also changes visibility of rows and thus\n   allows for MVCC-unsafe behavior described in the first paragraph of\n   [mvcc-caveats][5].\n\n# Disk Space Requirements\n\nPerforming a full-table squeeze requires free disk space about twice as large\nas the target table and its indexes. For example, if the total size of the\ntables and indexes to be squeezed is 1GB, an additional 2GB of disk space is\nrequired.\n\n[1]: https://reorg.github.io/pg_repack/\n[2]: https://www.postgresql.org/docs/13/static/sql-cluster.html\n[3]: https://www.postgresql.org/docs/13/static/bgworker.html\n[4]: https://www.postgresql.org/docs/13/static/logicaldecoding.html\n[5]: https://www.postgresql.org/docs/13/static/mvcc-caveats.html\n[6]: https://www.freebsd.org/cgi/man.cgi?query=crontab\u0026sektion=5\u0026apropos=0\u0026manpath=FreeBSD+12.1-RELEASE+and+Ports\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcybertec-postgresql%2Fpg_squeeze","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcybertec-postgresql%2Fpg_squeeze","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcybertec-postgresql%2Fpg_squeeze/lists"}