* nix-postgres-docker
** Preface
#+begin_quote
I'd highly recommend reading the official docs, and reading this doc only as a supplement on how some of these things fit together in the context of ~wal-g~:
- [[https://www.postgresql.org/docs/current/runtime-config-wal.html][PostgreSQL: Documentation: 16: 20.5. Write Ahead Log]]
- [[https://www.postgresql.org/docs/current/wal-configuration.html][PostgreSQL: Documentation: 16: 30.5. WAL Configuration]]
- [[https://www.postgresql.org/docs/current/backup.html][PostgreSQL: Documentation: 16: Chapter 26. Backup and Restore]]
- [[https://www.postgresql.org/docs/current/continuous-archiving.html][PostgreSQL: Documentation: 16: 26.3. Continuous Archiving and Point-in-Time Recovery (PITR)]]
- [[https://www.postgresql.org/docs/current/warm-standby.html][PostgreSQL: Documentation: 16: 27.2. Log-Shipping Standby Servers]]
- [[https://www.postgresql.org/docs/current/warm-standby-failover.html][PostgreSQL: Documentation: 16: 27.3. Failover]]
- [[https://www.postgresql.org/docs/current/hot-standby.html][PostgreSQL: Documentation: 16: 27.4. Hot Standby]]
- [[https://www.postgresql.org/docs/current/runtime-config-replication.html][PostgreSQL: Documentation: 16: 20.6. Replication]]
- [[https://www.postgresql.org/docs/current/protocol-replication.html][PostgreSQL: Documentation: 16: 55.4. Streaming Replication Protocol]]
- [[https://www.postgresql.org/docs/current/different-replication-solutions.html][PostgreSQL: Documentation: 16: 27.1. Comparison of Different Solutions]]
#+end_quote

#+begin_quote
While I think I should already be using a managed pg or one of those baked distributions, selfhosting pg for my homelab has taught me many lessons that I wouldn't have learned otherwise, so I guess we shall continue doing so "for now". (that too in a container)

Deploying postgres on containers is [[https://www.reddit.com/r/PostgreSQL/comments/11nwf54/container_or_not/][definitely a matter]] of [[https://www.reddit.com/r/PostgreSQL/comments/1c2rbow/why_not_run_production_postgres_in_docker/][discussion]]. One of the issues I faced was docker messing with the [[https://www.instaclustr.com/blog/postgresql-docker-and-shared-memory/][dynamic shared memory configuration in postgres]], but otherwise I've been happy running postgres in a container. (I personally have been using [[https://github.com/hashicorp/nomad][nomad]] so far)
How to deploy, how to add a connection pooler, etc. are concerns that are out of scope for this repo at the moment. I'll keep this document focused on "how to set up WAL archival using WAL-G" when running postgres in a container.
Everything mentioned in this document is very much my experience playing around with wal-g in the particular scenario I had to use it in, so most things may change based on your scenario, scale, etc.
For actual commands that I've tested, check the ~./Justfile.wal-g.example~ file.
#+end_quote
** Table of contents :TOC_9:
- [[#nix-postgres-docker][nix-postgres-docker]]
- [[#preface][Preface]]
- [[#context-and-the-problem][Context and the problem]]
- [[#solution][Solution]]
- [[#creating-the-docker-image--breakdown-of-postgres-docker-image][Creating the docker image / Breakdown of postgres docker image.]]
- [[#setting-up-wal-g--references][Setting up ~wal-g~ & references]]
- [[#command-reference][Command reference]]
- [[#making-base-backup][Making base backup]]
- [[#running-wal-archival][Running WAL archival]]
- [[#restorationstandby][Restoration/Standby]]
- [[#postgres-restore-related-command-reference][postgres restore related command reference]]
- [[#cleanup-and-maintenance][Cleanup and Maintenance]]
- [[#restoration-crash-recovery-and-standby][Restoration, Crash Recovery and Standby]]
- [[#standby-servers-async][Standby Servers (Async)]]
- [[#via-streaming-replication-todo][via Streaming Replication (TODO)]]
- [[#via-wal-archival-pitr][via WAL Archival (PITR)]]
- [[#fetch-the-base-backup][Fetch the base backup]]
- [[#make-changes-to-postgresqlconf-and-pgdata-for-standby][Make changes to ~postgresql.conf~ and ~$PGDATA~ for standby]]
- [[#promote-as-needed][Promote as needed]]
- [[#standby-servers-sync-only-via-streaming-replication-todo][Standby Servers (Sync) (ONLY via Streaming Replication) (TODO)]]
- [[#crash-recovery-pitr][Crash Recovery (PITR)]]
- [[#via-streaming-replication-todo-1][via Streaming Replication (TODO)]]
- [[#via-wal-archival-pitr-1][via WAL Archival (PITR)]]
- [[#differences-with-standby-modeservers][Differences with standby mode/servers]]
- [[#fetch-the-base-backup-1][Fetch the base backup]]
- [[#make-changes-to-postgresqlconf-and-pgdata-for-recovery][Make changes to ~postgresql.conf~ and ~$PGDATA~ for recovery]]
- [[#crash-recovery--standby-todo][Crash Recovery + Standby (TODO)]]
- [[#related-concepts][Related concepts]]
- [[#failover][Failover]]
- [[#promotion--timeline][Promotion & Timeline]]
- [[#whether-to-set-archive_mode-when-recoverystandby][Whether to set ~archive_mode~ when recovery/standby]]
- [[#other-notesreferences][Other notes/references]]
- [[#aside-on-backup-ecosystem-in-postgres-todo][Aside on Backup ecosystem in Postgres (TODO)]]
- [[#replication-vs-backup][Replication vs Backup]]
- [[#everything-works-except-archive_command][Everything works except ~archive_command~!]]
- [[#previous-attempts][Previous attempts]]
- [[#creating-normal-dockerfile-official-docker-image][Creating normal Dockerfile (official docker image)]]
- [[#using-dockertoolsfromimage-official-docker-image][Using ~dockerTools.fromImage~ (official docker image)]]
- [[#to-explore-later][To explore later]]

** Context and the problem
To give you the context, I've been happily running postgres in my homelab without a backup (who needs one anyway), but then I planned a user-facing side project for which I decided I'd use the same pg cluster. Now, even though I'll get only 2 users, it's my duty to ensure their data remains intact despite the unfathomable amount of fuckery I am about to unleash on my homelab servers. So, in conclusion, I need a backup policy for the database.

I already have one actually: ~pg_dump~ > [[https://restic.net/][restic]] > [[https://en.wikipedia.org/wiki/Backblaze][Backblaze B2]], which backs up more than just a db. But it is not ideal (for various reasons) for a database that'll have a continuous flow of writes to it. For this, you need what postgres folks call "[[https://www.postgresql.org/docs/current/continuous-archiving.html][continuous archival]]" (many ways to do it, see the aside section below). For sqlite, there's [[https://litestream.io/][litestream]] which does a similar thing pretty brilliantly.
So after comparing various backup solutions/methods, I decided to go with [[https://github.com/wal-g/wal-g][wal-g]]. You could actually do it without any 3rd-party tool, but they definitely help. (See below for a comparison)
Previously I've been running a custom docker image based on the official postgres docker image to install a required extension, so it'd be nice to extend it! In the case of ~wal-g~, the running postgres service (the [[https://www.interdb.jp/pg/pgsql09/10.html][archiver (background) process]]) needs to call ~wal-g~ from the [[https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-ARCHIVE-COMMAND][archive_command]] postgres configuration. (This gets triggered every time there's a new ~WAL segment~, which can in turn be triggered by setting ~archive_timeout~ in case your db doesn't make enough commits to generate a new ~WAL segment~ but you still want continuous archival.) Basically, you need ~wal-g~ and ~postgres~ in the same runtime, with ~wal-g~ callable via ~$PATH~.
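As a concrete sketch (hedged; values are illustrative, and the S3-ish env vars from the archival section below must already be visible to the postgres process), the archival-related settings in ~postgresql.conf~ look roughly like this:
#+begin_src bash
# Illustrative sketch: append the archival settings to postgresql.conf.
# Adjust $PGDATA and the timeout to your setup.
cat >> "$PGDATA/postgresql.conf" <<'EOF'
archive_mode = on
archive_command = 'wal-g wal-push %p'  # wal-g must be on the postgres process's $PATH
archive_timeout = 300                  # seconds; forces a segment switch on quiet databases
EOF
#+end_src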
Now, to get ~wal-g~ running, I just need to install ~wal-g~ into the docker image (which I already have); that's an ~apt-get~. Simple, right? Yes! There are even some [[https://github.com/wal-g/wal-g/issues/473][off the shelf]] [[https://github.com/stephane-klein/playground-postgresql-walg/blob/60c483c7675899bdf3a4ad3f0d7627f3998432b8/docker-image/postgres-with-wal-g/Dockerfile#L20][community]] [[https://hub.docker.com/r/apkawa/wal-g][repos]] online for this exact use case.
But I am stupid, so I decided to make my life harder. I have the rest of my infra wired nicely in [[https://zero-to-nix.com/concepts/flakes][nix flakes]] and I run things on an arm machine and also locally (x86), so I thought to myself it'd be really nice to have "most" things in a flake. Honestly, I would've gone the first way if I knew doing it the nix way would put me in the trenches.
So now finally our problem becomes:
#+begin_quote
"Create a multi-arch production-ready(walg+compatibility w official image+custom extensions) postgres docker image using NixOS dockerTools AND make sure the auxiliary processes(eg. base backup shipping, cleanup etc.) are running as expected"
#+end_quote
** Solution
*** Creating the docker image / Breakdown of postgres docker image.
I initially attempted to base it off the official postgres docker image; this was not successful. So the alternative is building the image directly from Nix definitions. [[https://xeiaso.net/talks/2024/nix-docker-build/][This is usually pretty simple]], but for postgres things are a bit more complicated: you need the postgres user, postgres has initdb scripts, the official image makes use of [[https://github.com/tianon/gosu][gosu]], etc., which I don't fully understand. But it was doable; I did it in two different ways:
1. Based on the official postgres docker image: Most things work nicely, but I am not sure how I would install postgres extensions into this. It should be possible, just not super straightforward.
2. Based on nixpkgs postgres: Installing ~wal-g~ and other things alongside postgres extensions in the docker container is pretty straightforward. See ~./flake.nix~
Additionally, a github actions workflow is set up at ~.github/workflows~ to build this image for both ~amd64~ and ~arm64~.
*** Setting up ~wal-g~ & references
#+begin_quote
Gitlab has an amazing runbook for wal-g
- https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/patroni/postgresql-backups-wale-walg.md
- https://archive.ph/83BLw

My impression of ~wal-g~: while I think it's great software, it lacks heavily in documentation. I am not sure why such a widely adopted tool has such poor documentation (lots of undocumented cli features). A lot of conceptual things which are important to understand for something like managing backups are not mentioned at all. Fortunately, they have a super friendly telegram where the developers hang around and help resolve issues. I'd personally like to work on the docs but I'll let things sink in for a while.
#+end_quote
**** Command reference
Following are ~wal-g~ commands, what they do and when to run them.
| context | wal-g command | what it does | when to run? |
|-------------+----------------------+-------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------|
| base-backup | ~backup-list~ | List of successful "base backups", also will have delta backups | manually, to check (This won't list anything till your first base backup, even if ~archive_command~ is sending in WAL(s)) |
| | | ~backup-list --detail --pretty~ is useful as well | |
| | ~backup-push~ | create a full ~$PGDATA~ backup | periodic(auto), via ~systemd-timer~ |
| | | ~backup-push~ will decide on its own when to do delta or full | |
| | | base backup is always done in reference to a WAL LSN, this is helpful in many ways(restore, deletion) | |
| | ~backup-fetch~ | Restore PostgreSQL data directory from a full backup | manually, when shit hits the fan |
|-------------+----------------------+-------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------|
| delete | ~garbage~ (WAL) | removes unnecessary WAL(s), leaving only stuff that can be used for restore | periodic(auto), via ~systemd-timer~ |
| | ~retain~ (base-backup) | Can be ~FULL~ or ~FIND_FULL~, deletes base backups beyond the retained count and their dependent WALs | periodic(auto), via ~systemd-timer~ (I just keep the last 2 full backups: ~wal-g delete retain FULL 2~) |
| | ~target~ (base-backup) | delete specific base backup, does not delete WALs | I don't run this |
| | ~before~ (base-backup) | delete specific base backup, before a certain base backup | I don't run this |
| | ~everything~ | Deletes everything except permanent backups (~--permanent~) | I don't run this except for debugging |
|-------------+----------------------+-------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------|
| wal | ~wal-push~ | pushes WALs as part of ~archive_command~ | never run manually |
| | ~wal-fetch~ | fetches WALs as part of ~restore_command~ | never run manually; only used during restoration. The gitlab runbook mentions that they no longer use this, but use streaming replication directly |
|-------------+----------------------+-------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------|
| replication | ~wal-receive~ | I haven't explored this yet, but can be an alternative to ~wal-push~ | |
| | ~catchup-*~ | I haven't explored these yet | |
|-------------+----------------------+-------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------|
| verify | | There are verify and checksum-check related commands as well | |
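A hedged cheatsheet of the commands above (assuming the S3-ish env vars are exported; note that, as far as I can tell, ~wal-g delete~ is a dry run unless you pass ~--confirm~, so verify against your version):
#+begin_src bash
# wal-g reads AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_ENDPOINT, WALG_S3_PREFIX from the env
wal-g backup-list --detail --pretty   # inspect existing base backups
wal-g backup-push "$PGDATA"           # take a base backup (full or delta, wal-g decides)
wal-g delete retain FULL 2 --confirm  # keep the last 2 full backups and their dependent WALs
wal-g delete garbage --confirm        # drop WALs not needed by any retained backup
#+end_src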
**** Making base backup
- This could be done as a [[https://developer.hashicorp.com/nomad/tutorials/task-deps/task-dependencies-interjob][sidecar]] job in cloud-native environments, but since I have the machine to myself I like to set up a systemd timer. Other than the timer, I can now ssh into the machine and manually trigger the backup with ~sudo systemctl start walg-backup-push~. The systemd service looks something like:
- ~wal-g~ has an issue that makes it a little inconvenient to run base backups from the host when running postgres in a container: https://github.com/wal-g/wal-g/issues/1782
#+begin_src nix
# walg-backup-push will do a full backup and clean unnecessary wal files from the backup
systemd.services."walg-backup-push" = {
  script = ''
    set -eu
    export AWS_ENDPOINT=
    export WALG_S3_PREFIX=s3:///
    export AWS_ACCESS_KEY_ID=
    export AWS_SECRET_ACCESS_KEY=
    export PGPASSWORD=
    ${pkgs.wal-g}/bin/wal-g backup-push $WALG_PGDATA && ${pkgs.wal-g}/bin/wal-g delete garbage
  '';
  serviceConfig = {
    Type = "oneshot";
    User = "root";
  };
  environment = {
    WALG_PGDATA = "/var/lib/postgres/data";
    PGHOST = "localhost";
    PGPORT = "5432";
    PGUSER = "postgres";
  };
};
systemd.timers."walg-backup-push" = {
  wantedBy = [ "timers.target" ];
  timerConfig = {
    OnCalendar = "weekly"; # my db updates infrequently, OK with 1 week worth of WAL
    Persistent = true;
    Unit = "walg-backup-push.service";
  };
};
#+end_src
**** Running WAL archival
- Just set ~archive_command = 'wal-g wal-push %p'~ and we're all set. It needs the env vars ~AWS_ACCESS_KEY_ID~, ~AWS_SECRET_ACCESS_KEY~, ~AWS_ENDPOINT~, and ~WALG_S3_PREFIX~ if you're using an S3 backend (Backblaze in my case). You can set them however you want (e.g. I set them using nomad env vars). A sanity-check sketch follows this list.
- If WAL shipping (~archive_command~) fails for some reason, WAL files will pile up on the server until the disk runs full!
- There are a number of cases where the same WAL segment can be pushed more than once, especially after failures where postgres is not sure that the command completed.
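A quick way to sanity-check that archival is actually working; this is a sketch using the standard ~pg_stat_archiver~ view and ~pg_switch_wal()~, nothing wal-g specific:
#+begin_src bash
# Force a WAL segment switch so archive_command fires, then inspect archiver stats.
psql -U postgres -c "SELECT pg_switch_wal();"
psql -U postgres -c "SELECT archived_count, last_archived_wal, failed_count, last_failed_wal
                     FROM pg_stat_archiver;"
#+end_src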
**** Restoration/Standby
Restore and standby are pretty involved, so I've created a separate higher-level section for them.
***** postgres restore related command reference
#+begin_quote
- The official docs (pg 16) are a much better reference; I'm just dumping this here for an overview.
- As of pg17, only the date/time and named restore point options are very usable, since there are no tools to help you identify with any accuracy which transaction ID to use.
- ~date +'%Y-%m-%d %H:%M:%S.%N' | cut -b 1-23~ gives a millisecond-precision timestamp for ~recovery_target_time~ (the docs recommend including a numeric offset from UTC)
- ~pg_controldata -D $PGDATA~
- ~pg_resetwal~ : [[https://stackoverflow.com/questions/8799474/postgresql-error-panic-could-not-locate-a-valid-checkpoint-record][postgresql error PANIC: could not locate a valid checkpoint record - Stack Overflow]]
#+end_quote

| Context | Name | Description | Crash recovery mode | Standby mode |
|------------------------------+---------------------------+-----------------------------------------------------------------------------------+----------------------------------+------------------------------------------------------|
| | ~restore_command~ | | Required | Optional, can be combined with Streaming Replication |
|------------------------------+---------------------------+-----------------------------------------------------------------------------------+----------------------------------+------------------------------------------------------|
| Recovery Target("till this") | ~recovery_target~ | Only allowed value is ~immediate~, till consistent state | Yes(oneof) | N/A |
| | ~recovery_target_name~ | if ~pg_create_restore_point~ was used | Yes(oneof) | N/A |
| | ~recovery_target_time~ | numeric offset from UTC, time stamp up to which recovery will proceed. | Yes(oneof) | N/A |
| | ~recovery_target_lsn~ | If you know the exact lsn | Yes(oneof) | N/A |
| | ~recovery_target_xid~ | If you know the exact tx_id | Yes(oneof) | N/A |
|------------------------------+---------------------------+-----------------------------------------------------------------------------------+----------------------------------+------------------------------------------------------|
| | ~recovery_target_inclusive~ | Whether to stop just after (~on~) or just before the ~recovery_target_[time/lsn/xid]~ | Yes(default:on) | N/A |
| | ~recovery_target_timeline~ | The ~recovery_target_*~ can only belong to one timeline_id. | Yes(default:latest wal timeline) | N/A |
| | ~recovery_target_action~ | What happens when restore completes (~pause/shutdown/promote~). see doc for details | Yes(default:pause) | N/A |
**** Cleanup and Maintenance
- If you make a ~backup-push~ with the ~--permanent~ flag (or use ~backup-mark~), it will not be picked up by the ~delete~ command (see the sketch after this list).
- ~wal-g delete garbage~ can be run periodically to remove WAL files which are no longer needed to restore any of the retained backups.
- I think if you run ~wal-g delete retain~ then you don't really need to run ~garbage~, but you can if there are leftovers.
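For backups that should survive the retention policy (e.g. right before a risky migration), a minimal sketch using the ~--permanent~ flag mentioned above (verify the flag against your wal-g version):
#+begin_src bash
# A permanent backup is skipped by `wal-g delete` until unmarked via backup-mark.
wal-g backup-push --permanent "$PGDATA"
#+end_src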
** Restoration, Crash Recovery and Standby
#+begin_quote
- All of these need a base backup or [[https://stackoverflow.com/questions/77234780/postgres-replication-without-taking-a-new-base-backup-of-the-primary-server][to be configured to use pg_rewind]], I personally have not used ~pg_rewind~ yet.
- Everywhere streaming replication applies, it can be either [[https://www.postgresql.org/docs/current/wal-async-commit.html][sync or async]]; sync replication is not possible with WAL archival, since it operates at the commit level while WAL archival works at the WAL segment level.
- ~.history~ errors in the logs are expected when doing crash recovery: [[https://github.com/wal-g/wal-g/issues/628][wal-g/wal-g#628]]
- Order of operations matter: [[https://dba.stackexchange.com/questions/306490/failed-postgres-replication-from-master-to-standby][Failed Postgres Replication from master to standby]]
- ~hot_standby~ is now ~on~ by default. So if you create a standby server, you'll be able to do reads on it as archives are fetched. Essentially a read replica.
#+end_quote
| Combination | SR protocol | WAL Archival (PITR) / File based / Log shipping | Notes | Doubts |
|---------------------------------+-------------+------------------------------------------------------+---------------------------------------------------------------+-----------------------------------------------------------------|
| Warm Standby | No | ~standby.signal~ | primary & standby (~archive_command~ <=> ~restore_command~) | |
| Hot Standby | No | ~standby.signal~ + ~hot_standby=on~ (default) | warm standby + read-only | |
|---------------------------------+-------------+------------------------------------------------------+---------------------------------------------------------------+-----------------------------------------------------------------|
| Warm Standby | Yes | No | This is no backup, just replicas (~primary_conninfo~ et al) | |
| Hot Standby | Yes | No | This is no backup, just replicas (~primary_conninfo~ et al) | |
|---------------------------------+-------------+------------------------------------------------------+---------------------------------------------------------------+-----------------------------------------------------------------|
| Warm Standby | Yes | Yes | This is ideal (~primary_conninfo~ et al + ~restore_command~) | |
| Hot Standby | Yes | Yes | This is also ideal (~primary_conninfo~ et al + ~restore_command~) | |
|---------------------------------+-------------+------------------------------------------------------+---------------------------------------------------------------+-----------------------------------------------------------------|
| Crash Recovery (Normal) | No | Yes, use ~recovery.signal~ | This is the most basic scenario (~restore_command~) | Do we really need to disable ~archive_mode~ here during recovery? |
| Crash Recovery (w Standby) | No | Yes, use ~recovery.signal~ and use ~standby.signal~ both | In this scenario, standby takes precedence (~restore_command~) | |
| Crash Recovery (w Standby w SR) | Yes | Yes | WAL archive + SR + Instant promotion | I don't see this scenario being very useful/intuitive/practical |
*** Standby Servers (Async)
#+begin_quote
These can be either warm or hot
#+end_quote
**** via Streaming Replication (TODO)
- I've not done this yet
- You need to set ~primary_conninfo~. ~restore_command~ is not used here.
**** via WAL Archival (PITR)
#+begin_quote
This section is about running a "standby server" with ~restore_command~ + ~standby.signal~
#+end_quote
***** Fetch the base backup
#+begin_src bash
export AWS_ENDPOINT=
export WALG_S3_PREFIX=
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
export PATH_TO_PGDATA=
wal-g backup-fetch $PATH_TO_PGDATA LATEST
#+end_src
- This is more of a one time manual thing
- Find a place where you want to restore the base backup to, this will usually be the ~$PGDATA~ of the new postgres server to be restored/standby to be created.
- Use ~wal-g backup-fetch~ to fetch the base backup you want.
- In case of running postgres in a container:
  - You'd want to run ~backup-fetch~ on the host into whatever the ~$PGDATA~ directory should be. When that ~$PGDATA~ gets mounted into the container, make sure it's given appropriate permissions etc. (see the sketch after this list)
  - In case of using the official postgres image in a non-rootless/rootful environment, postgres uses [[https://github.com/tianon/gosu][gosu]], so you can in fact run ~backup-fetch~ as root and you should be fine. Whether this is a good idea? ¯\_(ツ)_/¯
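A sketch of that permission dance, assuming the official image where the ~postgres~ user is uid/gid 999 (the host path is hypothetical; verify the uid for your image):
#+begin_src bash
export PATH_TO_PGDATA=/srv/pg-restore/data    # hypothetical host dir, later mounted as $PGDATA
wal-g backup-fetch "$PATH_TO_PGDATA" LATEST
chown -R 999:999 "$PATH_TO_PGDATA"            # uid/gid of postgres inside the official image
chmod 700 "$PATH_TO_PGDATA"                   # postgres refuses to start on a group/world-readable PGDATA
#+end_src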
***** Make changes to ~postgresql.conf~ and ~$PGDATA~ for standby
- Add ~standby.signal~ inside ~$PGDATA~, this will tell postgres to start in standby mode. If you don't specify this and yet have a ~restore_command~, pg will safely error out.
- Update standby settings in ~postgresql.conf~
#+begin_src diff
--- a/workloads/nomad/pg/postgresql.conf
+++ b/workloads/nomad/pg/postgresql.conf
-archive_mode = on # enables archiving; off, on, or always
+archive_mode = off # enables archiving; off, on, or always
-#restore_command = '' # command to use to restore an archived WAL file
+restore_command = 'wal-g wal-fetch %f %p'
#+end_src
- *DO NOT SET* any of the other ~recovery_*~ config(s), those are meant to be set during recovery and will mess things up if set when setting up a standby.
- Start the postgres service (the container, if running via container).
- At this point, it may take a while depending on how long ago the base backup was taken / how many WALs it needs to catch up on. Once it catches up, it'll remain in standby mode.
- Remaining in standby mode means postgres will keep looking for the next WAL segment, which is probably not yet pushed by the primary's ~archive_command~; as soon as the next segment is available, it'll load it in. (see the check commands after this list)
- See [[https://github.com/wal-g/wal-g/issues/1775][this issue]] for an example of this behavior.
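To watch a standby catch up, I'd poke it with the standard recovery functions (a sketch; nothing wal-g specific):
#+begin_src bash
psql -U postgres -c "SELECT pg_is_in_recovery();"       # 't' while in standby/recovery mode
psql -U postgres -c "SELECT pg_last_wal_replay_lsn();"  # how far WAL replay has progressed
#+end_src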
***** Promote as needed
- Once you know that the "current" primary is no longer going to push new WAL logs/data, you can simply shut down the primary and promote the standby as needed.
- You can promote a standby by running ~pg_ctl promote~ or calling ~pg_promote()~. This will also delete the ~standby.signal~ file. If you're running postgres in a container, exec into the container as the ~postgres~ user and run ~pg_ctl promote~ (see the sketch after this list).
- After promoting you may want to remove/comment out the ~restore_command~ config; it won't be picked up by postgres anyway, but better to comment it out imo.
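For the container case, promotion boils down to something like this (container name and data directory path are illustrative):
#+begin_src bash
docker exec -u postgres my-pg-container pg_ctl promote -D /var/lib/postgresql/data
# or, from any client connection:
psql -U postgres -c "SELECT pg_promote();"
#+end_src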
*** Standby Servers (Sync) (ONLY via Streaming Replication) (TODO)
#+begin_quote
- Synchronous standby servers cannot happen without the postgres streaming replication protocol, hence they won't work with WAL archival.
- These can be either warm or hot
#+end_quote
- I've not done this yet
*** Crash Recovery (PITR)
#+begin_quote
- PITR, as I understand it, is not one single thing. It's the combination of the different recovery targets, the mechanism used to do the backups, and the concept of the [[https://www.highgo.ca/2021/11/01/the-postgresql-timeline-concept/][postgres timeline]]. With the concept of timeline ID, it is possible that the same LSN or the same WAL segments exist in multiple timelines.
- When using a ~recovery_target_*~ other than ~recovery_target~ and ~recovery_target_action~ is left at ~pause~ (the default), you can use ~pg_wal_replay_resume()~ to play/pause the recovery process, e.g. if you want to manually check whether some data was recovered. It's a bit like stepping through with a debugger. But usually this is not needed; simply using ~recovery_target~ is enough for me.
- During recovery mode, if recovery is terminated because of an external error, the server can simply be restarted and it will continue recovery
#+end_quote
**** via Streaming Replication (TODO)
- I've not done this yet
- I don't completely understand this scenario: after recovery, the recovering pg instance would implicitly be promoted, but doing this via streaming replication means the primary is live. So in which case do we even need this?
**** via WAL Archival (PITR)
***** Differences with standby mode/servers
This is similar to standby servers, with the following gotchas/differences (there'll be more, ofc):
- The settings(~postgresql.conf~) here are only needed for the duration of restore, once done they should be reset.
- Instead of ~standby.signal~, we'd add ~recovery.signal~
- Instead of the standby-mode behavior, where the server stays in "standby mode" until promoted, in archival/crash recovery the postgres server [[https://dba.stackexchange.com/questions/315999/postgresql-stops-continuous-recovery-without-error][will implicitly get promoted]] as soon as recovery is successful.
- Since recovery mode does implicit promotion, you cannot really run recovery multiple times; if that's a use case, you probably need a standby instead of recovery. See [[https://github.com/wal-g/wal-g/issues/1780][this issue]] for more info.
***** Fetch the base backup
This is the same as for standby servers.
***** Make changes to ~postgresql.conf~ and ~$PGDATA~ for recovery
- Add ~recovery.signal~ inside ~$PGDATA~, this will tell postgres to start in recovery mode.
- The server will enter recovery, and when the end of archived WAL is reached it will not keep waiting for more (unlike standby mode); instead it'll implicitly promote itself / do whatever ~recovery_target_action~ is set to.
- The ~recovery_*~ configuration controls how this behaves (e.g. you can set a stopping point earlier than the end of WAL)
- Update recovery settings in ~postgresql.conf~
#+begin_src diff
--- a/workloads/nomad/postgres/postgresql.conf
+++ b/workloads/nomad/postgres/postgresql.conf
-archive_mode = on # enables archiving; off, on, or always
+archive_mode = off # enables archiving; off, on, or always
-#restore_command = '' # command to use to restore an archived WAL file
+restore_command = 'wal-g wal-fetch %f %p'
#+end_src
- After adding these changes, one should be able to successfully recover from the WAL archive. A fuller sketch with an explicit recovery target follows.
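Putting it together, a hedged sketch of a PITR to a known-good timestamp (the target values are illustrative; see the recovery target table above for the other ~recovery_target_*~ options):
#+begin_src bash
touch "$PGDATA/recovery.signal"
cat >> "$PGDATA/postgresql.conf" <<'EOF'
restore_command = 'wal-g wal-fetch %f %p'
recovery_target_time = '2024-10-01 12:00:00.000'  # illustrative timestamp
recovery_target_action = 'promote'                # default is pause; see the table above
EOF
# then start the postgres service/container and watch the logs
#+end_src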
*** TODO Crash Recovery + Standby (TODO)
#+begin_quote
From [[https://www.postgresql.org/docs/current/runtime-config-wal.html#RUNTIME-CONFIG-WAL-ARCHIVE-RECOVERY][PostgreSQL: Documentation: 16: 20.5. Write Ahead Log]]:

"To start the server in targeted recovery mode, create a file called recovery.signal in the data directory. If both standby.signal and recovery.signal files are created, standby mode takes precedence. Targeted recovery mode ends when the archived WAL is fully replayed, or when recovery_target is reached. In this mode, the parameters from both this section and Section 20.5.6 will be used."
#+end_quote
This use case is somewhat tricky to me, and it can be combined with other things like streaming replication as well.
*** Related concepts
**** Failover
#+begin_quote
See [[https://www.postgresql.org/docs/current/warm-standby-failover.html][PostgreSQL: Documentation: 16: 27.3. Failover]]. I am just dumping important points from there and some personal notes
#+end_quote
- ~STONITH (Shoot The Other Node In The Head)~: If the primary server fails and the standby server becomes the new primary, and then the old primary restarts, you must have a mechanism for informing the old primary that it is no longer the primary. This is necessary to avoid situations where both systems think they are the primary, which will lead to confusion and ultimately data loss.
- PostgreSQL does not provide the system software required to identify a failure on the primary and notify the standby database server.
- To trigger failover of a log-shipping standby server, run ~pg_ctl promote~ or call ~pg_promote()~.
**** Promotion & Timeline
#+begin_quote
- Successful promotion will put the server in a new timeline, this is true for both crash recovery(implicit) and standby mode.
- Each timeline will have its own file naming scheme, so they can't overlap.
- Once promotion is done, you [[https://stackoverflow.com/questions/59875106/multiple-times-point-in-time-recovery-in-postgresql][cannot go back]]. Basically, do not promote a standby to primary if you want that standby to still be able to apply WALs from the original primary.
#+end_quote
- There should be only [[https://stackoverflow.com/questions/37154902/primary-and-standby-server-at-different-timelines-in-postgres][one valid "current" timeline]]
- ~SELECT timeline_id FROM pg_control_checkpoint();~
- For recovery
  - postgres implicitly (and automatically) gets promoted, and the ~recovery.signal~ file is deleted on successful restore, unless ~recovery_target_action~ is set to ~shutdown~.
- For standby
  - postgres will NOT automatically promote anything; you'd need to manually run ~pg_ctl promote~
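To check which timeline an instance is on (e.g. before/after a promotion), both of these are standard postgres tooling:
#+begin_src bash
psql -U postgres -c "SELECT timeline_id FROM pg_control_checkpoint();"
pg_controldata -D "$PGDATA" | grep -i timeline
#+end_src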
**** Whether to set ~archive_mode~ when recovery/standby
There is some confusion around whether to keep ~archive_mode~ on when running a recovery server/standby server.
- To be safe, I unset ~archive_command~ or set ~archive_mode=off~ for the standby/recovery postgres instance.
- I don't want it to be pushing ~archive_command~ immediately after getting restored(I'd want to verify things are working etc.)
- After I verify, I manually re-set ~archive_command~ and re-trigger a base-backup from the new replica/recovered instance.
- I am not sure what the best approach is; this is just something that works for me.
- Confusion: [[https://dba.stackexchange.com/questions/306490/failed-postgres-replication-from-master-to-standby][an instance where setting archive_command]] messed things up, but [[https://stackoverflow.com/questions/24007152/disabling-wal-archiving-during-pg-restore][this discussion]] recommends against disabling it. (I think it's talking about disabling ~archive_command~ on the primary, whereas in my case I am talking about not having ~archive_command~ on the restore/standby instance)
** Other notes/references
*** TODO Aside on Backup ecosystem in Postgres (TODO)
#+begin_quote
This section is todo, I'll update later.
#+end_quote
Now, I observe the postgres community from the sidelines and I don't think I have any "real" postgres experience; however, this is what I've figured out from reading the manual and the opinions of other folks on the interwebs.
#+begin_quote
"pgBackRest is more conservative and kind of reliable, wal-g is more performant, both are much better than Barman"I won't mention who said it, my real surname is ~Barman~ (check my github fr) and I am not offended.
#+end_quote
- [[https://lobste.rs/s/egqvg1/if_pg_dump_is_not_backup_tool_what_is][If pg_dump is not a backup tool, what is? | Lobsters]]
**** Replication vs Backup
*** Everything works except ~archive_command~!
#+begin_quote
This is a short debugging story, that finally had a happy ending. :)
#+end_quote
Creating the plain postgres container image from nixpkgs using ~dockerTools~ was simple. It builds off ~24.05~, which has a pinned postgres version that happens to be ~postgres 16.4.x~ built against ~glibc 2.39~, but I was already running postgres using the official debian-based docker image. When I deployed the custom image to replace the official one, I got a collation version mismatch [[https://community.fly.io/t/postgres-flex-database-postgres-has-a-collation-version-mismatch/14391][similar to this issue]]. I initially thought this was related to ~locale~, but later confirmed it was related to the glibc version the postgres binary was compiled with and run against. The usually suggested way out of this is to ~REINDEX~, but my tables didn't even have a backup yet, so I didn't want to take any risk whatsoever.

#+begin_src
WARNING: database "" has a collation version mismatch
DETAIL: The database was created using collation version 2.36, but the operating system provides version 2.39.
HINT: Rebuild all objects in this database that use the default collation and run ALTER DATABASE REFRESH COLLATION VERSION, or build PostgreSQL with the right library version.
#+end_src

There were many ways to fix this. I wanted to keep things simple, so I just brought back a previous failed attempt: instead of building from nixpkgs, I'd use ~fromImage~ with the official postgres image as the base and install nixpkgs ~wal-g~ on top of it.
This initially failed because of [[https://github.com/NixOS/nix/issues/1205#issuecomment-2161613130][this issue]] (the ~docker_entrypoint.sh~ in the official pg image had ~/usr/bin/env~) but we resolved it using ~dockerTools.buildLayeredImage.fakeRootCommands~. So now everything is working!
The old official postgres image is replaced by the custom docker image, there's no collation mismatch, and I am able to exec into the container where postgres is running and access the ~wal-g~ binary on the ~$PATH~. All G!
Now things were straightforward: I just had to set the ~archive_command~ to ~wal-g wal-push %p~ and I'd be good. So I did that.
And I got hit by the dreaded boi 127:
#+begin_src shell
archive command failed with exit code 127
#+end_src

This drove me nuts. Initially I thought the forked ~archiver~ process was somehow not able to access ~wal-g~. So I tried a plain ~cp~ as the ~archive_command~, since that's the simplest. That was failing too!
Then I tried:
- ~archive_command = 'echo test > /tmp/archive_test.txt'~ : fail
- ~archive_command = 'true'~ : fail!

Now, since ~/bin/true~ was failing, I realized I might have fu*ked up big time. To confirm that this was in fact my image, I tried setting ~archive_command = 'true'~ in the official postgres image and it worked. So I was sure that whatever was happening was my doing.
After some googling and claude, I found that you can trigger archival on demand by manually doing a WAL switch: ~SELECT pg_switch_wal();~. So now I could observe the postgres process at the syscall level right when it happens, and not get lost in a sea of syscalls. Let's bring in our old friend ~strace~.
- ~docker inspect -f '{{.State.Pid}}' ~ : get the pid of the running postgres container (main process)
- ~sudo strace -f -p -s 1024 -o /tmp/postgres_strace.log~
- Then manually trigger the WAL switch (and hence the ~archive_command~) and exit out of strace.
- Inspecting the log, I see:
#+begin_src
1053745 execve("/bin/sh", ["sh", "-c", "true"], 0xaaaaecdfef10 /* 65 vars */
1053746 close(3) = 0
1053745 <... execve resumed>) = -1 ENOENT (No such file or directory)
1053746 signalfd4(-1, [URG], 8, SFD_CLOEXEC|SFD_NONBLOCK
1053745 exit_group(127) = ?
#+end_src
- It's not able to find ~true~! Upon looking it up, ~/bin/true~ exists. Hmm.
- Look closer.
- It's not able to find ~sh~! Upon looking it up, ~/bin/sh~ does not exist :)
- Now, this is the same nix issue as the ~env~ one mentioned previously. So I added the fix in a similar manner using ~ln -sfn "${pkgs.bash}/bin/sh" /bin/sh~ and things finally started working.
*** Previous attempts
**** Creating normal Dockerfile (official docker image)
This image has an extension that I use but doesn't have wal-g in it.
#+begin_src Dockerfile
FROM postgres:16.2-bookworm AS builder

RUN apt-get update \
    && apt-get -y upgrade \
    && apt-get install -y --no-install-recommends \
    curl ca-certificates git \
    build-essential libpq-dev postgresql-server-dev-all
RUN update-ca-certificates

WORKDIR /srv
RUN git clone https://github.com/fboulnois/pg_uuidv7.git .
RUN for v in `seq 16`; do pg_buildext build-$v $v; done

# create tarball and checksums
RUN cp sql/pg_uuidv7--1.5.sql . && TARGETS=$(find * -name pg_uuidv7.so) \
    && tar -czvf pg_uuidv7.tar.gz $TARGETS pg_uuidv7--1.5.sql pg_uuidv7.control \
    && sha256sum pg_uuidv7.tar.gz $TARGETS pg_uuidv7--1.5.sql pg_uuidv7.control > SHA256SUMS

FROM postgres:16.2-bookworm AS runner
COPY --from=builder /srv/pg_uuidv7.tar.gz /srv/SHA256SUMS /srv/
COPY --from=builder /srv/${PG_MAJOR}/pg_uuidv7.so /usr/lib/postgresql/${PG_MAJOR}/lib
COPY --from=builder /srv/pg_uuidv7.control /usr/share/postgresql/${PG_MAJOR}/extension
COPY --from=builder /srv/pg_uuidv7--1.5.sql /usr/share/postgresql/${PG_MAJOR}/extension
#+end_src
**** Using ~dockerTools.fromImage~ (official docker image)
#+begin_quote
Update!
I think I've found the issue mentioned in the following text; it's related to the shebang at the top of the entrypoint file. Either way, I have dropped the idea of using ~fromImage~ for now and am going all in on building it from nix derivations. [[https://github.com/NixOS/nix/issues/1205#issuecomment-2161613130][Inconsistent treatment of /usr/bin/env in build sandbox vs. NixOS · Issue #1205 · NixOS/nix · GitHub]]

Update 2!

Because I was having difficulty building pg16.2 from nixpkgs, I ended up using this approach, as I had an existing db using pg16.2 and I faced a [[https://community.fly.io/t/postgres-flex-database-postgres-has-a-collation-version-mismatch/14391][collation]] issue due to the glibc version and I didn't want to REINDEX yet.
#+end_quote
- ~dockerTools~ allows you to pull from another image. Since the official image comes with the starter scripts that run initdb and do a lot of other things, I thought it would be nice to just use it. But it didn't work out.
- See [[https://discourse.nixos.org/t/building-on-dockerfile-based-images/29583][Building on dockerfile-based images - Help - NixOS Discourse]]

#+begin_src nix
# file: flake.nix
let
  # nix run nixpkgs#nix-prefetch-docker -- postgres --image-tag 16.2-bookworm --arch amd64 --os linux
  pg_amd64 = pkgs.dockerTools.pullImage {
    imageName = "postgres";
    imageDigest = "sha256:4aea012537edfad80f98d870a36e6b90b4c09b27be7f4b4759d72db863baeebb";
    sha256 = "1rizfs2f6l834cgym0jpp88g3r3mcrxn9fd58np91ny9fy29zyck";
    finalImageName = "postgres";
    finalImageTag = "16.2-bookworm";
    os = "linux";
    arch = "amd64";
  };
in {
  packages = {
    nix_postgres_docker = pkgs.dockerTools.buildLayeredImage {
      name = builtins.getEnv "IMAGE_NAME";
      tag = builtins.getEnv "IMAGE_TAG";
      fromImage = pg_amd64; # TODO make conditional
      contents = with pkgs; [ cacert postgresql16Packages.pg_uuidv7 ];
      config = {
        Cmd = ["postgres"];
        entrypoint = [ "docker-entrypoint.sh" ];
      };
    };
  };
};
#+end_src

While this seems like it works on the face of it, unfortunately (or fortunately) it doesn't. @NoobZ and @ManoftheSea from the unofficial NixOS discord channel talked me out of this and I stopped trying to make it work. If someone is interested in making this work, here is where I got stuck: "no such file or directory".
#+begin_src bash
λ just docker-build # success
λ just docker-load # success
λ just docker-run # fail
exec /usr/local/bin/docker-entrypoint.sh: no such file or directory
error: Recipe `docker-run` failed on line 36 with exit code 1
#+end_src
I am not sure why this happened; the arch is what it should be, and the base image is a docker image, so it should not be NixOS FHS issues.

Trying to install nixpkgs pg extensions into the correct location in the image would've been a real hassle.
** To explore later
- [ ] Streaming replication
- [ ] Compression of WAL logs, especially when setting ~archive_timeout~ before WAL hits the 16MB WAL segment size. It seems like the mostly-empty WAL segments (due to an early ~archive_timeout~) consume the full 16MB.