{"id":20078428,"url":"https://github.com/hexops-graveyard/pgtrgm_emperical_measurements","last_synced_at":"2025-07-26T12:11:10.898Z","repository":{"id":86747432,"uuid":"337222701","full_name":"hexops-graveyard/pgtrgm_emperical_measurements","owner":"hexops-graveyard","description":"Emperical measurements of pg_trgm performance at scale","archived":false,"fork":false,"pushed_at":"2021-02-21T02:24:20.000Z","size":5050,"stargazers_count":11,"open_issues_count":0,"forks_count":3,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-05-05T22:40:49.328Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hexops-graveyard.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2021-02-08T22:06:58.000Z","updated_at":"2023-07-04T04:28:45.000Z","dependencies_parsed_at":null,"dependency_job_id":"61dd88f0-62dc-4143-8cd3-d5680fb489e8","html_url":"https://github.com/hexops-graveyard/pgtrgm_emperical_measurements","commit_stats":null,"previous_names":["hexops/pgtrgm_emperical_measurements"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/hexops-graveyard/pgtrgm_emperical_measurements","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hexops-graveyard%2Fpgtrgm_emperical_measurements","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hexops-graveyard%2Fpgtrgm_emperical_measurements/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hexops-graveyard%2Fpgtrgm_emperical_measurements/releases","manifests_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/repositories/hexops-graveyard%2Fpgtrgm_emperical_measurements/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hexops-graveyard","download_url":"https://codeload.github.com/hexops-graveyard/pgtrgm_emperical_measurements/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hexops-graveyard%2Fpgtrgm_emperical_measurements/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267163823,"owners_count":24045730,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-26T02:00:08.937Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-13T15:14:24.703Z","updated_at":"2025-07-26T12:11:10.871Z","avatar_url":"https://github.com/hexops-graveyard.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Measuring the performance of pg_trgm\n\nThis repository contains **extensive, verbose, detailed** information about the behavior of pg_trgm at large scales, particularly for regex search.\n\nThis repository shares how we performed our empirical measurements, for reproducibility by others.\n\n## Article / paper\n\nBefore viewing this repository, you will likely prefer to read the article which is far less verbose and summarizes the findings more nicely:\n\n[\"Postgres regex search over 10,000 GitHub 
repositories\"](https://devlog.hexops.com/2021/postgres-regex-search-over-10000-github-repositories)\n\n## Overview\n\n- `cmd/corpusindex` small Go program which bulk inserts the corpus into Postgres\n- `cmd/githubscrape` small Go program that fetches the top 1,000 repositories for any language.\n- `cmd/visualize-docker-json-stats` cleans up `docker_stats_logs/` output for visualization using [the jp tool](https://github.com/sgreben/jp).\n- `docker_logs/` logs from the Docker container during execution.\n- `docker_stats_logs/` logs from `docker stats` during indexing/querying the corpus, showing CPU/memory usage over time.\n- `top_repos/` contains URLs to the top 1,000 repositories for a given language. In total, 20,578 repositories.\n- `query_logs/` the Postgres SQL queries we ultimately ran.\n- `capture-docker-stats.sh` captures `docker stats` output every 1s with timing info.\n- `clone-corpus.sh` clones all 20,578 repositories (concurrently.)\n- `extract-base-postgres-config.sh` extracts the base Postgres config from the Docker image.\n- `index-corpus.sh` used to invoke the `corpusindex` tool for every repository, once cloned.\n- `query-corpus.sh` runs detailed search queries over the corpus (invokes the other `query-corpus*` scripts.)\n- `run-postgres.sh` runs the Postgres server Docker image.\n\n## Cloning the corpus\n\nFirst run `./clone-corpus.sh` to download the corpus into `../corpus` (it uses the parent directory, because VS Code and most tooling will barf if a project contains a directory with that many files.)\n\nWARNING, this will:\n\n* Clone all 20,578 repositories concurrently, using most of your available CPU/memory/network resources.\n* Take 12-16 hours with a fast ~100 Mbps connection to GitHub's servers.\n* Consume ~265G of disk space.\n* Require that you have `gfind` installed (`brew install gfind`), otherwise replace `gfind` with `find` in the script.\n\nTo save you disk space, the script already trims the data down a lot, 
reducing the corpus size by about 66%:\n\n* Clones repos only with `--depth 1`\n* Deletes the entire `.git` directory after cloning repos, so only file contents are left. This reduces the corpus size by about 30% (412G -\u003e 290G, for 12,148 repos) \n* Deleting files greater than 1MB. GitHub only indexes files smaller than 384KB, for example - and this 1MB limit reduces the corpus size by _another_ whopping 51% (290G -\u003e 142G, for 12,148 repos) - wow.\n\nYou can use this command at any time to figure out how many repos have been cloned:\n\n```sh\necho \"progress: $(find ../corpus/ -maxdepth 4 -mindepth 4 | wc -l) repos cloned\"\n```\n\n## Setting up Docker\n\nIf you plan on using Docker and are on Mac OS, you are using a VM and this has performance implications. Be sure to:\n\n1. Max out the CPUs, Memory, and disk space available to Docker.\n2. Disable \"Use gRPC FUSE for file sharing\" in Experimental Features.\n\n## Initializing Postgres\n\nLaunch Postgres via `./run-postgres.sh`, and then get a `psql` prompt:\n\n```sh\ndocker exec -it postgres psql -U postgres\n```\n\nCreate the DB schema:\n\n```sql\nBEGIN;\nCREATE EXTENSION IF NOT EXISTS pg_trgm;\nCREATE TABLE IF NOT EXISTS files (\n    id bigserial PRIMARY KEY,\n    contents text NOT NULL,\n    filepath text NOT NULL\n);\nCOMMIT;\n```\n\n## Indexing the corpus\n\nIndex a single repository:\n\n```sh\nDATABASE=postgres://postgres@127.0.0.1:5432/postgres go run ./cmd/corpusindex/main.go ../corpus/c/github.com\\\\/linux-noah\\\\/noah/\n```\n\nIndex all repositories:\n\n```sh\ngo install ./cmd/corpusindex; DATABASE=postgres://postgres@127.0.0.1:5432/postgres ./index-corpus.sh\n```\n\n## Querying the corpus\n\n```\npostgres=# SELECT filepath FROM files WHERE contents ~ 'use strict';\n                                   filepath                                    \n-------------------------------------------------------------------------------\n 
../corpus/c/github.com\\/linux-noah\\/noah/.git/hooks/fsmonitor-watchman.sample\n(1 row)\n```\n\nThis will take around ~8 hours on a 2020 Macbook Pro i9 (8 physical cores, 16 virtual) w/ 16G memory.\n\nOn-disk size of the entire DB at this point will be 101G.\n\n## Create the Trigram index\n\n```sql\nCREATE INDEX IF NOT EXISTS files_contents_trgm_idx ON files USING GIN (contents gin_trgm_ops);\n```\n\n### Configuration attempt 1 (indexing failure, OOM)\n\nWith this configuration, the above `CREATE INDEX` command will take `11h34m` and ultimately OOM and fail:\n\n```\nlisten_addresses = '*'\nmax_connections = 100\nshared_buffers = 4GB\neffective_cache_size = 12GB\nmaintenance_work_mem = 16GB\ncheckpoint_completion_target = 0.9\nwal_buffers = 16MB\ndefault_statistics_target = 100\nrandom_page_cost = 1.1\neffective_io_concurrency = 200\nwork_mem = 5242kB\nmin_wal_size = 50GB\nmax_wal_size = 4GB\nmax_worker_processes = 8\nmax_parallel_workers_per_gather = 8\nmax_parallel_workers = 8\nmax_parallel_maintenance_workers = 8\n```\n\n```\npostgres=# CREATE INDEX IF NOT EXISTS files_contents_trgm_idx ON files USING GIN (contents gin_trgm_ops);\n\nserver closed the connection unexpectedly\n\tThis probably means the server terminated abnormally\n\tbefore or while processing the request.\nThe connection to the server was lost. 
Attempting reset: Failed.\n```\n\nPostgres logs indicate:\n\n```\n2021-01-30 05:04:11.045 GMT [276] LOG:  stats_timestamp 2021-01-30 05:04:11.773621+00 is later than collector's time 2021-01-30 05:04:11.036405+00 for database 0\n2021-01-30 08:00:56.721 GMT [276] LOG:  stats_timestamp 2021-01-30 08:00:56.707853+00 is later than collector's time 2021-01-30 08:00:56.702848+00 for database 0\n2021-01-30 08:24:57.919 GMT [276] LOG:  stats_timestamp 2021-01-30 08:24:57.922315+00 is later than collector's time 2021-01-30 08:24:57.917066+00 for database 13442\n2021-01-30 09:05:13.815 GMT [1] LOG:  server process (PID 290) was terminated by signal 9: Killed\n2021-01-30 09:05:13.815 GMT [1] DETAIL:  Failed process was running: CREATE INDEX IF NOT EXISTS files_contents_trgm_idx ON files USING GIN (contents gin_trgm_ops);\n2021-01-30 09:05:13.818 GMT [1] LOG:  terminating any other active server processes\n2021-01-30 09:05:13.823 GMT [275] WARNING:  terminating connection because of crash of another server process\n2021-01-30 09:05:13.823 GMT [275] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.\n2021-01-30 09:05:13.823 GMT [275] HINT:  In a moment you should be able to reconnect to the database and repeat your command.\n2021-01-30 09:05:13.854 GMT [980] FATAL:  the database system is in recovery mode\n2021-01-30 09:05:14.020 GMT [1] LOG:  all server processes terminated; reinitializing\n2021-01-30 09:08:44.448 GMT [981] LOG:  database system was interrupted; last known up at 2021-01-29 22:18:31 GMT\n2021-01-30 09:08:50.772 GMT [981] LOG:  database system was not properly shut down; automatic recovery in progress\n2021-01-30 09:08:50.876 GMT [981] LOG:  redo starts at 19/82EF3D98\n2021-01-30 09:08:50.877 GMT [981] LOG:  invalid record length at 19/82EF3EE0: wanted 24, got 0\n2021-01-30 09:08:50.877 GMT [981] LOG:  redo done at 
19/82EF3EA8\n2021-01-30 09:08:51.158 GMT [1] LOG:  database system is ready to accept connections\n```\n\nNo postgres _container_ restart will be observed because (interestingly) Postgres can handle the OOM without restarting the container and start itself again. One of the benefits of handling C allocation failures, I presume, but didn't investigate:\n\n```\n$ docker ps\nCONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS                      NAMES\neb087868cb00        postgres:13.1-alpine   \"docker-entrypoint.s…\"   43 hours ago        Up 43 hours         127.0.0.1:5432-\u003e5432/tcp   postgres\n```\n\nSee `docker_stats_logs/configuration-failure-1.log` for a JSON log stream of container `docker stats` captured during the `CREATE INDEX`.\n\nThere is evidence that indexing with that configuration -- for whatever reason -- for the vast majority of indexing time uses just 1-2 CPU cores, and peak ~11 GiB of memory according to `docker stats`.\n\nMemory usage in MiB as reported by `docker stats` over time rendered via:\n\n```\ncat ./docker_stats_logs/configuration-failure-1.log | go run ./cmd/visualize-docker-json-stats/main.go --trim-end=32000 | jq | jp -y '..MemUsageMiB'\n```\n\n\u003cimg width=\"981\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/107313722-56bbac80-6a50-11eb-94c7-8e13ea095053.png\"\u003e\n\nCPU usage percentage (150% indicates \"one and a half virtual CPU cores\") as reported by `docker stats` over time rendered via:\n\n```\ncat ./docker_stats_logs/configuration-failure-1.log | go run ./cmd/visualize-docker-json-stats/main.go --trim-end=32000 | jq | jp -y '..CPUPerc'\n```\n\n\u003cimg width=\"982\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/107313915-cc277d00-6a50-11eb-9282-62159a127966.png\"\u003e\n\nLess reliable charts from a Mac app (seems to have periodic data loss issues):\n\nMemory usage (purple == compressed, red==active, 
blue==wired):\n\n\u003cimg width=\"354\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/106368429-9a067480-6306-11eb-82f6-769733a425ee.png\"\u003e\n\nMemory pressure:\n\n\u003cimg width=\"356\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/106368408-80fdc380-6306-11eb-8169-865605b7815d.png\"\u003e\n\nMemory swap:\n\n\u003cimg width=\"355\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/106368339-13519780-6306-11eb-8d14-9b0deda6ed78.png\"\u003e\n\nDisk activity:\n\n\u003cimg width=\"360\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/106368350-2f553900-6306-11eb-81f5-7a017f0b9d50.png\"\u003e\n\nCPU activity:\n\n\u003cimg width=\"352\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/106368381-57449c80-6306-11eb-80c0-774f23832d71.png\"\u003e\n\nCPU load avg:\n\n\u003cimg width=\"346\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/106368392-6d525d00-6306-11eb-8538-d3e398486e22.png\"\u003e\n\n\n### Configuration attempt 2 (24h+ of indexing, then out of disk space)\n\nFor this attempt, we use a configuration provided by pgtune for a data warehouse with 10G memory to reduce the chance of OOMs:\n\n```\n# DB Version: 13\n# OS Type: linux\n# DB Type: dw\n# Total Memory (RAM): 10 GB\n# CPUs num: 8\n# Connections num: 20\n# Data Storage: ssd\n\nmax_connections = 20\nshared_buffers = 2560MB\neffective_cache_size = 7680MB\nmaintenance_work_mem = 1280MB\ncheckpoint_completion_target = 0.9\nwal_buffers = 16MB\ndefault_statistics_target = 500\nrandom_page_cost = 1.1\neffective_io_concurrency = 200\nwork_mem = 16MB\nmin_wal_size = 4GB\nmax_wal_size = 16GB\nmax_worker_processes = 8\nmax_parallel_workers_per_gather = 4\nmax_parallel_workers = 8\nmax_parallel_maintenance_workers = 4\n```\n\nIndexing took `~26h54m`, compared to the `~11h34m` in the previous attempt, starting at 2:39pm and ending at ~5:31pm the next day in an out-of-space 
failure.\n\nSee `docker_stats_logs/configuration-failure-2.log` for a full JSON stream of `docker stats` during indexing.\n\nSee `logs/configuration-failure-2.log` for the Postgres logs during this attempt.\n\nOf particular note is that, again, almost 100% of the time was spent with a single CPU core maxed out and the vast majority of the CPU in `Idle` state (red).\n\nMemory usage in MiB as reported by `docker stats` over time rendered via:\n\n```\ncat ./docker_stats_logs/configuration-failure-2.log | go run ./cmd/visualize-docker-json-stats/main.go | jq | jp -y '..MemUsageMiB'\n```\n\n\u003cimg width=\"980\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/107314104-350ef500-6a51-11eb-909f-2f1b524d29b2.png\"\u003e\n\nCPU usage percentage (150% indicates \"one and a half virtual CPU cores\") as reported by `docker stats` over time rendered via:\n\n```\ncat ./docker_stats_logs/configuration-failure-2.log | go run ./cmd/visualize-docker-json-stats/main.go | jq | jp -y '..CPUPerc'\n```\n\n\u003cimg width=\"980\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/107314168-507a0000-6a51-11eb-8a18-ec18752f7f16.png\"\u003e\n\nLess reliable charts from a Mac app (seems to have periodic data loss issues):\n\n\u003cimg width=\"597\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/106505762-fba11d00-6485-11eb-8c58-5954b1dfeb3a.png\"\u003e\n\nMemory pressure was mostly fine and remained under 75%:\n\n\u003cimg width=\"597\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/106506305-c5b06880-6486-11eb-9aef-0a0f086fa623.png\"\u003e\n\nMemory usage (purple == compressed, red==active, blue==wired) shows we never hit memory limits or even high usage:\n\n\u003cimg width=\"594\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/106506397-e5479100-6486-11eb-93eb-a4488efb15b9.png\"\u003e\n\nThe `docker stats` stream (`docker_stats_logs/configuration-failure-2.log`) shows memory usage 
throughout the 24h+ period never going above ~1.4G.\n\nDespite this, system swap was used somewhat heavily:\n\n\u003cimg width=\"600\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/106507665-9ac71400-6488-11eb-839f-13e0369865d2.png\"\u003e\n\nDisk usage during indexing tells us that the average was about ~250 MB/s for reads (blue) and writes (red):\n\n\u003cimg width=\"599\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/106507903-ec6f9e80-6488-11eb-88a8-78e5b7aacfd6.png\"\u003e\n\nIt should be noted that in-software disk speed tests (i.e. including disk encryption Mac OS is performing) show regular read and write speeds of ~860 MB/s with \u003c5% effect on CPU usage:\n\n\u003cimg width=\"591\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/106508609-d8786c80-6489-11eb-92c7-b69db1ee1daa.png\"\u003e\n\nIt should also be noted that postgres disk usage during this test, although eventually running out, rose from `101G` to `124G`:\n\n```\n$ du -sh .postgres/\n124G\t.postgres/\n```\n\n### Configuration attempt 3: reduced dataset\n\nFor this attempt, and to reduce the turnaround time on experiments, we use the same postgres configuration as in attempt 2 but we use a reduced dataset. 
Previously we had 19,441,820 files totalling ~178.2 GiB:\n\n```\npostgres=# select count(filepath) from files;\n  count   \n----------\n 19441820\n(1 row)\n\npostgres=# select SUM(octet_length(contents)) from files;\n     sum      \n--------------\n 191379114802\n(1 row)\n```\n\nWe drop half the files in the dataset, leaving:\n\n```\npostgres=# select count(filepath) from files;\n  count  \n---------\n 9720910\n(1 row)\n\npostgres=# select SUM(octet_length(contents)) from files;\n     sum     \n-------------\n 88123563320\n(1 row)\n```\n\nThis leaves ~82 GiB of raw text to index.\n\nIt also leaves ~228G of free disk space for use by the Postgres indexing (previously ~15G).\n\nIndex creation this time ran from 3:14pm MST to 7:44pm MST the next day, a total of 28h30m. However, the Macbook spent approximately 6h of that time in low-power (not sleep) mode, making the actual indexing time ~22h.\n\nTotal Postgres data size afterwards (again, less than 82 GiB due to compression):\n\n```\n$ du -sh .postgres/\n 73G\t.postgres/\n```\n\nMemory usage in MiB as reported by `docker stats` over time rendered via:\n\n```\ncat ./docker_stats_logs/configuration-3.log | go run ./cmd/visualize-docker-json-stats/main.go --trim-end=0 | jq | jp -y '..MemUsageMiB'\n```\n\n\u003cimg width=\"980\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/107315387-ce3f0b00-6a53-11eb-886c-410f000f73bd.png\"\u003e\n\nCPU usage percentage (150% indicates \"one and a half virtual CPU cores\") as reported by `docker stats` over time rendered via:\n\n```\ncat ./docker_stats_logs/configuration-3.log | go run ./cmd/visualize-docker-json-stats/main.go --trim-end=0 | jq | jp -y '..CPUPerc'\n```\n\n\u003cimg width=\"980\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/107315239-8324f800-6a53-11eb-9a5b-fcc61d1a7b59.png\"\u003e\n\n\n(We did not take measurements through the Mac app for indexing this time.)\n\n## Query performance\n\nRestart Postgres first, so that its memory 
caches are emptied.\n\nOnce it starts, capture docker stats:\n\n```sh\nOUT=docker_stats_logs/query-run-n.log ./capture-docker-stats.sh\n```\n\nSet a query timeout of 5 minutes on the database:\n\n```sql\nALTER DATABASE postgres SET statement_timeout = '300s';\n```\n\nThen begin querying the corpus:\n\n```\n./query-corpus.sh \u0026\u003e query_logs/query-run-1.log\n```\n\n### Query execution #1\n\nWe started queries at 12:42PM MST using:\n\n```\n./query-corpus.sh \u0026\u003e query_logs/query-run-1.log\n```\n\n- Find the exact SQL queries we ran in `query_logs/query-run-1.log`.\n- Find the `docker stats` measured during query execution in `docker_stats_logs/query-run-1.log`.\n\nCPU usage (150% == one and a half virtual cores) during query execution as visualized by:\n\n```sh\ncat ./docker_stats_logs/query-run-1.log | go run ./cmd/visualize-docker-json-stats/main.go --trim-end=9000 | jq | jp -y '..CPUPerc'\n```\n\n\u003cimg width=\"1001\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/107459155-c85c2f00-6b12-11eb-9b2a-27e0f1424ed6.png\"\u003e\n\nMemory usage in MiB during query execution as visualized by:\n\n```sh\ncat ./docker_stats_logs/query-run-1.log | go run ./cmd/visualize-docker-json-stats/main.go --trim-end=9000 | jq | jp -y '..MemUsageMiB'\n```\n\n\u003cimg width=\"996\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/107459238-fa6d9100-6b12-11eb-8692-4a68e421b2a6.png\"\u003e\n\n\nWe can see there were 1458 queries total with:\n\n```sh\n$ cat ./query_logs/query-run-1.log | go run ./cmd/visualize-query-log/main.go | jq '.[]' | jq -c -s '.[]' | wc -l\n    1458\n```\n\nAnd of those, we can see that 20 timed out:\n\n```sh\n$ cat ./query_logs/query-run-1.log | go run ./cmd/visualize-query-log/main.go | jq '.[] | select (.Timeout == true)' | jq -c -s '.[]' | wc -l\n      20\n```\n\nWe can also see that 1405 queries (96.4%) executed in under 1300ms (Y axis) and were generally planned in under 47ms (X 
axis):\n\n```sh\ncat ./query_logs/query-run-1.log | go run ./cmd/visualize-query-log/main.go | jq '.[] | select(.ExecutionTimeMs \u003c 1300) | select(.PlanningTimeMs \u003c 50)' | jq -c -s '.[]' | wc -l\n```\n\n```sh\ncat ./query_logs/query-run-1.log | go run ./cmd/visualize-query-log/main.go | jq '.[] | select(.ExecutionTimeMs \u003c 1300) | select(.PlanningTimeMs \u003c 50)' | jq -s | jp -x '..PlanningTimeMs' -y '..ExecutionTimeMs' -type scatter -canvas quarter\n```\n\n\u003cimg width=\"980\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/107490175-c0b67d80-6b46-11eb-9124-24910290ecc6.png\"\u003e\n\n\nHowever, there are outliers that were planned in under 100ms (X axis) but took between 1.3s and ~150s to execute (Y axis):\n\n```sh\ncat ./query_logs/query-run-1.log | go run ./cmd/visualize-query-log/main.go | jq '.[] | select(.ExecutionTimeMs \u003e 1300) | select(.PlanningTimeMs \u003c 100)' | jq -s | jp -x '..PlanningTimeMs' -y '..ExecutionTimeMs' -type scatter -canvas quarter\n```\n\n\u003cimg width=\"981\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/107491036-e1330780-6b47-11eb-845b-0dafaa1cff44.png\"\u003e\n\nWe can see the above accounted for 29 queries via:\n\n```\ncat ./query_logs/query-run-1.log | go run ./cmd/visualize-query-log/main.go | jq '.[] | select(.ExecutionTimeMs \u003e 1300) | select(.PlanningTimeMs \u003c 100)' | jq -c -s '.[]' | wc -l\n```\n\nWe can also see how many queries of each type we ran, e.g. 
via:\n\n```\n$ cat ./query_logs/query-run-1.log | go run ./cmd/visualize-query-log/main.go | jq -c '.[] | {Timeout: .Timeout, Limit: .Limit, Results: .Rows, Query: .Query}' | sort | uniq -c\n 100 {\"Timeout\":false,\"Limit\":10,\"Results\":10,\"Query\":\"123456789\"}\n   2 {\"Timeout\":false,\"Limit\":10,\"Results\":10,\"Query\":\"ac8ac5d63b66b83b90ce41a2d4061635\"}\n   4 {\"Timeout\":false,\"Limit\":10,\"Results\":10,\"Query\":\"bytes.Buffer\"}\n   2 {\"Timeout\":false,\"Limit\":10,\"Results\":10,\"Query\":\"d97f1d3ff91543[e-f]49.8b07517548877\"}\n 100 {\"Timeout\":false,\"Limit\":10,\"Results\":10,\"Query\":\"fmt\\\\.Error\"}\n 100 {\"Timeout\":false,\"Limit\":10,\"Results\":10,\"Query\":\"fmt\\\\.Print.*\"}\n 100 {\"Timeout\":false,\"Limit\":10,\"Results\":10,\"Query\":\"fmt\\\\.Println\"}\n 200 {\"Timeout\":false,\"Limit\":10,\"Results\":7,\"Query\":\"error\"}\n 100 {\"Timeout\":false,\"Limit\":10,\"Results\":7,\"Query\":\"var\"}\n 100 {\"Timeout\":false,\"Limit\":100,\"Results\":10,\"Query\":\"123456789\"}\n   2 {\"Timeout\":false,\"Limit\":100,\"Results\":10,\"Query\":\"ac8ac5d63b66b83b90ce41a2d4061635\"}\n   4 {\"Timeout\":false,\"Limit\":100,\"Results\":10,\"Query\":\"bytes.Buffer\"}\n   2 {\"Timeout\":false,\"Limit\":100,\"Results\":10,\"Query\":\"d97f1d3ff91543[e-f]49.8b07517548877\"}\n 100 {\"Timeout\":false,\"Limit\":100,\"Results\":10,\"Query\":\"fmt\\\\.Error\"}\n 100 {\"Timeout\":false,\"Limit\":100,\"Results\":10,\"Query\":\"fmt\\\\.Print.*\"}\n 100 {\"Timeout\":false,\"Limit\":100,\"Results\":10,\"Query\":\"fmt\\\\.Println\"}\n 200 {\"Timeout\":false,\"Limit\":100,\"Results\":7,\"Query\":\"error\"}\n 100 {\"Timeout\":false,\"Limit\":100,\"Results\":7,\"Query\":\"var\"}\n   2 {\"Timeout\":false,\"Limit\":1000,\"Results\":10,\"Query\":\"123456789\"}\n   2 {\"Timeout\":false,\"Limit\":1000,\"Results\":10,\"Query\":\"ac8ac5d63b66b83b90ce41a2d4061635\"}\n   4 {\"Timeout\":false,\"Limit\":1000,\"Results\":10,\"Query\":\"bytes.Buffer\"}\n   2 
{\"Timeout\":false,\"Limit\":1000,\"Results\":10,\"Query\":\"d97f1d3ff91543[e-f]49.8b07517548877\"}\n   2 {\"Timeout\":false,\"Limit\":1000,\"Results\":10,\"Query\":\"fmt\\\\.Error\"}\n   2 {\"Timeout\":false,\"Limit\":1000,\"Results\":10,\"Query\":\"fmt\\\\.Print.*\"}\n   2 {\"Timeout\":false,\"Limit\":1000,\"Results\":10,\"Query\":\"fmt\\\\.Println\"}\n   4 {\"Timeout\":false,\"Limit\":1000,\"Results\":7,\"Query\":\"error\"}\n   2 {\"Timeout\":false,\"Limit\":1000,\"Results\":7,\"Query\":\"var\"}\n   4 {\"Timeout\":true,\"Limit\":-1,\"Results\":10,\"Query\":\"error\"}\n   2 {\"Timeout\":true,\"Limit\":-1,\"Results\":10,\"Query\":\"fmt\\\\.Error\"}\n   1 {\"Timeout\":true,\"Limit\":-1,\"Results\":10,\"Query\":\"fmt\\\\.Println\"}\n   2 {\"Timeout\":true,\"Limit\":-1,\"Results\":9,\"Query\":\"123456789\"}\n   2 {\"Timeout\":true,\"Limit\":-1,\"Results\":9,\"Query\":\"ac8ac5d63b66b83b90ce41a2d4061635\"}\n   2 {\"Timeout\":true,\"Limit\":-1,\"Results\":9,\"Query\":\"bytes.Buffer\"}\n   2 {\"Timeout\":true,\"Limit\":-1,\"Results\":9,\"Query\":\"d97f1d3ff91543[e-f]49.8b07517548877\"}\n   2 {\"Timeout\":true,\"Limit\":-1,\"Results\":9,\"Query\":\"fmt\\\\.Print.*\"}\n   1 {\"Timeout\":true,\"Limit\":-1,\"Results\":9,\"Query\":\"fmt\\\\.Println\"}\n   2 {\"Timeout\":true,\"Limit\":-1,\"Results\":9,\"Query\":\"var\"}\n```\n\n### Query execution #2\n\nWe reran with a much higher number of queries, and with the timeout raised to 1h:\n\n```sql\nALTER DATABASE postgres SET statement_timeout = '3600s';\n```\n\nWe started at 3:06AM MST, and execution ran until 7:21PM MST the next day. At this point, `query-corpus-unlimited.sh` was configured to execute each query 100 times. Due to how slow these queries were, we estimated it would take ~11 days to complete, so we halted testing and reduced the number of queries to just 2 executions per query. 
These runs are recorded in `query_logs/query-run-[2-3].log` and `docker_stats_logs/query-run-[2-3].log` respectively.\n\nThe third attempt executed only partially: it ran until it reached the queries for unlimited/unbounded `var`:\n\n```\nEXPLAIN ANALYZE select count(id) from (select id from files where contents ~ 'var') as e;\n```\n\nOnce the first unbounded `var` query described above tried to run, we found the machine had unexpectedly powered off. MacOS ran out of resources (likely CPU, possibly memory, and definitely not disk space), causing MacOS's kernel watchdog extension to panic. This ultimately bricked the entire MacOS installation and required us to reinstall it: https://twitter.com/slimsag/status/1360513988514091010\n\n### Query execution #2 \u0026 3 performance\n\n#### What queries were run?\n\nTotal number of queries run:\n\n```\n$ cat ./query_logs/query-run-2.log ./query_logs/query-run-3.log | go run ./cmd/visualize-query-log/main.go | jq -c '.[] | {Timeout: .Timeout, Limit: .Limit, Results: .Rows, Query: .Query}' | wc -l\n   19936\n```\n\nExact number of each query type run:\n\n```\n$ cat ./query_logs/query-run-2.log ./query_logs/query-run-3.log | go run ./cmd/visualize-query-log/main.go | jq -c '.[] | {Timeout: .Timeout, Limit: .Limit, Results: .Rows, Query: .Query}' | sort | uniq -c | sort\n   2 {\"Timeout\":false,\"Limit\":-1,\"Results\":9,\"Query\":\"fmt\\\\.Error\"}\n   2 {\"Timeout\":false,\"Limit\":-1,\"Results\":9,\"Query\":\"fmt\\\\.Print.*\"}\n   2 {\"Timeout\":false,\"Limit\":-1,\"Results\":9,\"Query\":\"fmt\\\\.Println\"}\n   4 {\"Timeout\":false,\"Limit\":-1,\"Results\":17,\"Query\":\"error\"}\n   4 {\"Timeout\":false,\"Limit\":10,\"Results\":10,\"Query\":\"bytes.Buffer'\"}\n   4 {\"Timeout\":false,\"Limit\":1000,\"Results\":10,\"Query\":\"bytes.Buffer'\"}\n  18 {\"Timeout\":false,\"Limit\":-1,\"Results\":17,\"Query\":\"error'\"}\n 100 {\"Timeout\":false,\"Limit\":1000,\"Results\":10,\"Query\":\"123456789'\"}\n 100 
{\"Timeout\":false,\"Limit\":1000,\"Results\":10,\"Query\":\"ac8ac5d63b66b83b90ce41a2d4061635'\"}\n 100 {\"Timeout\":false,\"Limit\":1000,\"Results\":10,\"Query\":\"d97f1d3ff91543[e-f]49.8b07517548877'\"}\n 100 {\"Timeout\":false,\"Limit\":1000,\"Results\":10,\"Query\":\"fmt\\\\.Error'\"}\n 100 {\"Timeout\":false,\"Limit\":1000,\"Results\":10,\"Query\":\"fmt\\\\.Print.*'\"}\n 100 {\"Timeout\":false,\"Limit\":1000,\"Results\":10,\"Query\":\"fmt\\\\.Println'\"}\n 100 {\"Timeout\":false,\"Limit\":1000,\"Results\":7,\"Query\":\"var'\"}\n 200 {\"Timeout\":false,\"Limit\":1000,\"Results\":7,\"Query\":\"error'\"}\n1000 {\"Timeout\":false,\"Limit\":10,\"Results\":10,\"Query\":\"123456789'\"}\n1000 {\"Timeout\":false,\"Limit\":10,\"Results\":10,\"Query\":\"ac8ac5d63b66b83b90ce41a2d4061635'\"}\n1000 {\"Timeout\":false,\"Limit\":10,\"Results\":10,\"Query\":\"d97f1d3ff91543[e-f]49.8b07517548877'\"}\n1000 {\"Timeout\":false,\"Limit\":10,\"Results\":10,\"Query\":\"fmt\\\\.Error'\"}\n1000 {\"Timeout\":false,\"Limit\":10,\"Results\":10,\"Query\":\"fmt\\\\.Print.*'\"}\n1000 {\"Timeout\":false,\"Limit\":10,\"Results\":10,\"Query\":\"fmt\\\\.Println'\"}\n1000 {\"Timeout\":false,\"Limit\":10,\"Results\":7,\"Query\":\"var'\"}\n1000 {\"Timeout\":false,\"Limit\":100,\"Results\":10,\"Query\":\"123456789'\"}\n1000 {\"Timeout\":false,\"Limit\":100,\"Results\":10,\"Query\":\"ac8ac5d63b66b83b90ce41a2d4061635'\"}\n1000 {\"Timeout\":false,\"Limit\":100,\"Results\":10,\"Query\":\"bytes.Buffer'\"}\n1000 {\"Timeout\":false,\"Limit\":100,\"Results\":10,\"Query\":\"d97f1d3ff91543[e-f]49.8b07517548877'\"}\n1000 {\"Timeout\":false,\"Limit\":100,\"Results\":10,\"Query\":\"fmt\\\\.Error'\"}\n1000 {\"Timeout\":false,\"Limit\":100,\"Results\":10,\"Query\":\"fmt\\\\.Print.*'\"}\n1000 {\"Timeout\":false,\"Limit\":100,\"Results\":10,\"Query\":\"fmt\\\\.Println'\"}\n1000 {\"Timeout\":false,\"Limit\":100,\"Results\":7,\"Query\":\"var'\"}\n2000 
{\"Timeout\":false,\"Limit\":10,\"Results\":7,\"Query\":\"error'\"}\n2000 {\"Timeout\":false,\"Limit\":100,\"Results\":7,\"Query\":\"error'\"}\n```\n\n#### How many timed out?\n\n```\n$ cat ./query_logs/query-run-2.log ./query_logs/query-run-3.log | go run ./cmd/visualize-query-log/main.go | jq '.[] | select (.Timeout == true)' | jq -c -s '.[]' | wc -l\n       0\n```\n\n(Though we should also count the single final query that bricked our MacOS installation, as mentioned previously.)\n\n#### CPU/memory usage\n\nCPU usage (150% == one and a half virtual cores) during query execution as visualized by:\n\n```sh\ncat ./docker_stats_logs/query-run-2.log ./docker_stats_logs/query-run-3.log | go run ./cmd/visualize-docker-json-stats/main.go | jq | jp -y '..CPUPerc'\n```\n\n\u003cimg width=\"1253\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/107847580-0f178680-6daa-11eb-8474-afe2f057cedb.png\"\u003e\n\nMemory usage in MiB during query execution as visualized by:\n\n```sh\ncat ./docker_stats_logs/query-run-2.log ./docker_stats_logs/query-run-3.log | go run ./cmd/visualize-docker-json-stats/main.go | jq | jp -y '..MemUsageMiB'\n```\n\n\u003cimg width=\"1251\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/107847596-30787280-6daa-11eb-979c-b77c3711b948.png\"\u003e\n\nThe large spike towards the end is a result of beginning to execute `query-corpus-unlimited.sh` queries - i.e. ones without any `LIMIT`.\n\n#### Other measurements\n\nWe can determine how many queries executed within a given time bucket using e.g.:\n\n```\n$ cat ./query_logs/query-run-2.log ./query_logs/query-run-3.log | go run ./cmd/visualize-query-log/main.go | jq '.[] | select (.ExecutionTimeMs \u003c 25000)' | jq -c -s '.[]' | wc -l\n```\n\nOr this to get a scatter plot showing planning time for the query (X axis) vs. 
execution time for the query (Y axis):\n\n```\ncat ./query_logs/query-run-2.log ./query_logs/query-run-3.log | go run ./cmd/visualize-query-log/main.go | jq '.[]' | jq -s | jp -x '..PlanningTimeMs' -y '..ExecutionTimeMs' -type scatter -canvas quarter\n```\n\nOr this to do the same, but only include queries that executed in under 5s:\n\n```\ncat ./query_logs/query-run-2.log ./query_logs/query-run-3.log | go run ./cmd/visualize-query-log/main.go | jq '.[] | select(.ExecutionTimeMs \u003c 5000)' | jq -s | jp -x '..PlanningTimeMs' -y '..ExecutionTimeMs' -type scatter -canvas quarter\n```\n\nWe can also plot execution time (Y axis) vs. # of index rechecks (X axis):\n\n```\ncat ./query_logs/query-run-2.log ./query_logs/query-run-3.log | go run ./cmd/visualize-query-log/main.go | jq '.[]' | jq -s | jp -x '..IndexRechecks' -y '..ExecutionTimeMs' -type scatter -canvas quarter\n```\n\n\u003cimg width=\"1036\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/107849660-fc0cb280-6db9-11eb-9c10-cb7e74366ab7.png\"\u003e\n\n\n### Database startup time\n\nClean startups are almost instantaneous, taking less than a second. \n\nIf the DB is not shut down correctly (i.e. 
previously terminated during startup), startup takes a fairly hefty 10m12s to complete before the DB will accept any connections, as it begins a recovery process (which I assume involves reading a substantial portion of the DB from disk):\n\n```\nPostgreSQL Database directory appears to contain a database; Skipping initialization\n\n2021-02-08 21:45:48.452 GMT [1] LOG:  starting PostgreSQL 13.1 on x86_64-pc-linux-musl, compiled by gcc (Alpine 9.3.0) 9.3.0, 64-bit\n2021-02-08 21:45:48.454 GMT [1] LOG:  listening on IPv4 address \"0.0.0.0\", port 5432\n2021-02-08 21:45:48.454 GMT [1] LOG:  listening on IPv6 address \"::\", port 5432\n2021-02-08 21:45:48.531 GMT [1] LOG:  listening on Unix socket \"/var/run/postgresql/.s.PGSQL.5432\"\n2021-02-08 21:45:48.633 GMT [21] LOG:  database system was interrupted; last known up at 2021-02-03 06:16:10 GMT\n2021-02-08 21:47:51.157 GMT [27] FATAL:  the database system is starting up\n2021-02-08 21:47:56.383 GMT [33] FATAL:  the database system is starting up\n2021-02-08 21:48:13.198 GMT [39] FATAL:  the database system is starting up\n2021-02-08 21:48:43.088 GMT [45] FATAL:  the database system is starting up\n2021-02-08 21:52:43.672 GMT [51] FATAL:  the database system is starting up\n2021-02-08 21:53:32.048 GMT [58] FATAL:  the database system is starting up\n2021-02-08 21:54:07.696 GMT [64] FATAL:  the database system is starting up\n2021-02-08 21:55:36.446 GMT [21] LOG:  database system was not properly shut down; automatic recovery in progress\n2021-02-08 21:55:36.515 GMT [21] LOG:  redo starts at 2B/EE02EE8\n2021-02-08 21:55:36.518 GMT [21] LOG:  invalid record length at 2B/EE02FD0: wanted 24, got 0\n2021-02-08 21:55:36.518 GMT [21] LOG:  redo done at 2B/EE02F98\n2021-02-08 21:55:36.783 GMT [1] LOG:  database system is ready to accept connections\n```\n\n### Data size (total on disk)\n\nAfter indexing:\n\n```\n$ du -sh .postgres/\n 73G\t.postgres/\n```\n\nAfter `DROP INDEX files_contents_trgm_idx;`:\n\n```\n$ du -sh 
.postgres/\n 54G\t.postgres/\n```\n\n### Data size reported by Postgres\n\nAfter indexing:\n\n```\npostgres=# \\d+\n                                  List of relations\n Schema |     Name     |   Type   |  Owner   | Persistence |    Size    | Description \n--------+--------------+----------+----------+-------------+------------+-------------\n public | files        | table    | postgres | permanent   | 47 GB      | \n public | files_id_seq | sequence | postgres | permanent   | 8192 bytes | \n(2 rows)\n```\n\nAfter `DROP INDEX files_contents_trgm_idx;`:\n\n```\npostgres=# \\d+\n                                  List of relations\n Schema |     Name     |   Type   |  Owner   | Persistence |    Size    | Description \n--------+--------------+----------+----------+-------------+------------+-------------\n public | files        | table    | postgres | permanent   | 47 GB      | \n public | files_id_seq | sequence | postgres | permanent   | 8192 bytes | \n(2 rows)\n```\n\n## Table splitting\n\nWe did a brief experiment with table splitting to try and increase Postgres's use of multiple CPU cores, and to reduce the number of row rechecks caused by trigram false positive matches.\n\nFirst we got the total number of rows:\n\n```sql\npostgres=# select count(*) from files;\n  count  \n---------\n 9720910\n(1 row)\n```\n\nBased on this, we determined that we would try 200 tables each with 50,000 rows max. 
We generated the queries to create the tables:\n\n```sql\n$ go run ./cmd/tablesplitgen/main.go create\nCREATE TABLE files_000 AS SELECT * FROM files WHERE id \u003e 0 AND id \u003c 50000;\nCREATE TABLE files_001 AS SELECT * FROM files WHERE id \u003e 50000 AND id \u003c 100000;\nCREATE TABLE files_002 AS SELECT * FROM files WHERE id \u003e 100000 AND id \u003c 150000;\nCREATE TABLE files_003 AS SELECT * FROM files WHERE id \u003e 150000 AND id \u003c 200000;\nCREATE TABLE files_004 AS SELECT * FROM files WHERE id \u003e 200000 AND id \u003c 250000;\n...\n```\n\nWe ran those after unsetting the statement timeout:\n\n```sql\nALTER DATABASE postgres SET statement_timeout = default;\n```\n\nCreation of these tables took roughly 20-40s each, about two hours in total.\n\nWe then began recording Docker stats:\n\n```\nOUT=docker_stats_logs/split-index-1.log ./capture-docker-stats.sh\n```\n\nAnd used a program to generate and run the SQL statements for indexing each split table, e.g.:\n\n```sql\nCREATE INDEX IF NOT EXISTS files_000_contents_trgm_idx ON files_000 USING GIN (contents gin_trgm_ops);\n```\n\nIn parallel (8 at a time), for all 200 tables:\n\n```sh\nPARALLEL=8 go run ./cmd/tablesplitgen/main.go index \u0026\u003e ./index_logs/split-index-1.log\n```\n\nThis failed due to an OOM about halfway through, at around 117 tables indexed; see `index_logs/split-index-1.log`. 
We then reran with 6 workers and it succeeded (one of the benefits of this approach is that we did not have to reindex the 117 tables that had already completed).\n\nSee `docker_stats_logs/split-index-1.log` for how long it took and CPU/memory usage, which did consume multiple CPU cores.\n\n## Performance gains from table splitting\n\nWe took a representative sample of a slow query that previously took 27.6s:\n\n```\nquery: EXPLAIN ANALYZE select count(id) from (select id from files where contents ~ 'd97f1d3ff91543[e-f]49.8b07517548877' limit 1000) as e;\nTiming is on.\n                                                                     QUERY PLAN                                                                      \n-----------------------------------------------------------------------------------------------------------------------------------------------------\n Aggregate  (cost=1209.80..1209.81 rows=1 width=8) (actual time=27670.917..27670.972 rows=1 loops=1)\n   -\u003e  Limit  (cost=132.84..1197.80 rows=960 width=8) (actual time=27670.904..27670.948 rows=0 loops=1)\n         -\u003e  Bitmap Heap Scan on files  (cost=132.84..1197.80 rows=960 width=8) (actual time=27670.894..27670.907 rows=0 loops=1)\n               Recheck Cond: (contents ~ 'd97f1d3ff91543[e-f]49.8b07517548877'::text)\n               Rows Removed by Index Recheck: 9838\n               Heap Blocks: exact=5379\n               -\u003e  Bitmap Index Scan on files_contents_trgm_idx  (cost=0.00..132.60 rows=960 width=0) (actual time=38.235..38.239 rows=9838 loops=1)\n                     Index Cond: (contents ~ 'd97f1d3ff91543[e-f]49.8b07517548877'::text)\n Planning Time: 36.870 ms\n Execution Time: 27671.854 ms\n(10 rows)\n```\n\nWe then wrote a small program which starts N workers and executes a query in parallel against all 200 tables (in order) until it finds the specified limit of results. The query that previously took 27.6s now takes only 7.1s to determine that there are no results:\n\n```\n$ 
DATABASE=postgres://postgres@127.0.0.1:5432/postgres PARALLEL=8 go run ./cmd/tablesplitgen/main.go query 'd97f1d3ff91543[e-f]49.8b07517548877' 1000 200\n...\n0 results in 7140ms\n```\n\nWe found that higher numbers for `PARALLEL` generally improved query time, and so we raised postgres.conf `max_connections` to 128 to allow for higher number of parallel testing. Some brief testing showed that around 32 parallel connections, we no longer saw performance gains and the query executed in 5738ms.\n\n### Take 2\n\nWe also tried another query which was previously quite slow, taking 27.3s:\n\n```\nquery: EXPLAIN ANALYZE select count(id) from (select id from files where contents ~ 'ac8ac5d63b66b83b90ce41a2d4061635' limit 10) as e;\nTiming is on.\n                                                                      QUERY PLAN                                                                      \n------------------------------------------------------------------------------------------------------------------------------------------------------\n Aggregate  (cost=144.06..144.07 rows=1 width=8) (actual time=27379.079..27379.110 rows=1 loops=1)\n   -\u003e  Limit  (cost=132.84..143.94 rows=10 width=8) (actual time=27379.067..27379.087 rows=0 loops=1)\n         -\u003e  Bitmap Heap Scan on files  (cost=132.84..1197.80 rows=960 width=8) (actual time=27379.038..27379.050 rows=0 loops=1)\n               Recheck Cond: (contents ~ 'ac8ac5d63b66b83b90ce41a2d4061635'::text)\n               Rows Removed by Index Recheck: 10166\n               Heap Blocks: exact=5247\n               -\u003e  Bitmap Index Scan on files_contents_trgm_idx  (cost=0.00..132.60 rows=960 width=0) (actual time=41.966..41.970 rows=10166 loops=1)\n                     Index Cond: (contents ~ 'ac8ac5d63b66b83b90ce41a2d4061635'::text)\n Planning Time: 33.703 ms\n Execution Time: 27379.786 ms\n(10 rows)\n```\n\nWith table splitting, it takes just 7s now:\n\n```\n$ DATABASE=postgres://postgres@127.0.0.1:5432/postgres 
PARALLEL=32 go run ./cmd/tablesplitgen/main.go query 'ac8ac5d63b66b83b90ce41a2d4061635' 10 200\n...\n0 results in 6915ms\n```\n\n### Take 3\n\nWe also tried on a query that was relatively fast before, only 500ms:\n\n```\nquery: EXPLAIN ANALYZE select count(id) from (select id from files where contents ~ 'fmt\\.Print.*' limit 10) as e;\nTiming is on.\n                                                                        QUERY PLAN                                                                         \n-----------------------------------------------------------------------------------------------------------------------------------------------------------\n Aggregate  (cost=215.15..215.16 rows=1 width=8) (actual time=557.535..557.625 rows=1 loops=1)\n   -\u003e  Limit  (cost=204.15..215.02 rows=10 width=8) (actual time=278.488..557.549 rows=10 loops=1)\n         -\u003e  Bitmap Heap Scan on files  (cost=204.15..21133.72 rows=19245 width=8) (actual time=278.479..557.364 rows=10 loops=1)\n               Recheck Cond: (contents ~ 'fmt\\.Print.*'::text)\n               Rows Removed by Index Recheck: 345\n               Heap Blocks: exact=230\n               -\u003e  Bitmap Index Scan on files_contents_trgm_idx  (cost=0.00..199.33 rows=19245 width=0) (actual time=229.300..229.338 rows=228531 loops=1)\n                     Index Cond: (contents ~ 'fmt\\.Print.*'::text)\n Planning Time: 31.950 ms\n Execution Time: 560.100 ms\n(10 rows)\n```\n\nIt now executes in 177-255ms:\n\n```\n$ DATABASE=postgres://postgres@127.0.0.1:5432/postgres PARALLEL=32 go run ./cmd/tablesplitgen/main.go query 'fmt\\.Print.*' 10 200\n...\n0 results in 191ms\n```\n\n### More exhaustive split-table benchmarking\n\nWe clearly needed more representative data, so we did a full corpus query benchmark using:\n\n```sh\nOUT=docker_stats_logs/query-run-split-index-1.log ./capture-docker-stats.sh\n```\n\n```sql\nALTER DATABASE postgres SET statement_timeout = '3600s';\n```\n\n```\n./query-split-corpus.sh 
\u0026\u003e query_logs/query-run-split-index-1.log\n```\n\n#### Indexing perf\n\nWe saw many benefits of indexing multiple smaller tables:\n\n* When we encountered an OOM or ran out of disk space, none of the indexing progress up to that point was lost.\n* Queries could be (although we didn't) executed at the same time as indexing (on tables that have already been indexed).\n\n#### What queries were run?\n\nThe total number of queries run was 350 (to save time, we only ran a smaller number of searches - 10 per unique search - compared to our prior 20k. We found this still gave a generally reliable sample of performance):\n\n```\n$ cat ./query_logs/query-run-split-index-1.log | go run ./cmd/visualize-query-log/main.go | jq -c '.[] | {Timeout: .Timeout, Limit: .Limit, Results: .Rows, Query: .Query}' | wc -l\n     350\n```\n\nExact numbers of each query type run:\n\n```\n$ cat ./query_logs/query-run-split-index-1.log | go run ./cmd/visualize-query-log/main.go | jq -c '.[] | {Timeout: .Timeout, Limit: .Limit, Results: .Rows, Query: .Query}' | sort | uniq -c | sort\n   1 {\"Timeout\":false,\"Limit\":100,\"Results\":110,\"Query\":\"fmt\\\\.Println\"}\n   1 {\"Timeout\":false,\"Limit\":1000,\"Results\":1011,\"Query\":\"fmt\\\\.Println\"}\n   1 {\"Timeout\":false,\"Limit\":1000,\"Results\":1031,\"Query\":\"fmt\\\\.Print.*\"}\n   1 {\"Timeout\":false,\"Limit\":1000,\"Results\":1031,\"Query\":\"fmt\\\\.Println\"}\n   1 {\"Timeout\":false,\"Limit\":1000,\"Results\":1035,\"Query\":\"fmt\\\\.Print.*\"}\n   1 {\"Timeout\":false,\"Limit\":1000,\"Results\":1573,\"Query\":\"123456789\"}\n   1 {\"Timeout\":false,\"Limit\":1000,\"Results\":1640,\"Query\":\"fmt\\\\.Println\"}\n   1 {\"Timeout\":false,\"Limit\":1000,\"Results\":1784,\"Query\":\"fmt\\\\.Println\"}\n   1 {\"Timeout\":false,\"Limit\":1000,\"Results\":1995,\"Query\":\"bytes.Buffer\"}\n   2 {\"Timeout\":false,\"Limit\":1000,\"Results\":1004,\"Query\":\"fmt\\\\.Error\"}\n   3 
{\"Timeout\":false,\"Limit\":1000,\"Results\":1036,\"Query\":\"fmt\\\\.Print.*\"}\n   4 {\"Timeout\":false,\"Limit\":1000,\"Results\":1052,\"Query\":\"123456789\"}\n   5 {\"Timeout\":false,\"Limit\":1000,\"Results\":1000,\"Query\":\"123456789\"}\n   5 {\"Timeout\":false,\"Limit\":1000,\"Results\":1029,\"Query\":\"fmt\\\\.Print.*\"}\n   6 {\"Timeout\":false,\"Limit\":1000,\"Results\":1000,\"Query\":\"fmt\\\\.Println\"}\n   8 {\"Timeout\":false,\"Limit\":1000,\"Results\":1000,\"Query\":\"fmt\\\\.Error\"}\n   9 {\"Timeout\":false,\"Limit\":100,\"Results\":100,\"Query\":\"fmt\\\\.Println\"}\n   9 {\"Timeout\":false,\"Limit\":1000,\"Results\":1002,\"Query\":\"bytes.Buffer\"}\n  10 {\"Timeout\":false,\"Limit\":-1,\"Results\":127895,\"Query\":\"fmt\\\\.Error\"}\n  10 {\"Timeout\":false,\"Limit\":-1,\"Results\":22876,\"Query\":\"fmt\\\\.Println\"}\n  10 {\"Timeout\":false,\"Limit\":-1,\"Results\":37319,\"Query\":\"fmt\\\\.Print.*\"}\n  10 {\"Timeout\":false,\"Limit\":10,\"Results\":0,\"Query\":\"ac8ac5d63b66b83b90ce41a2d4061635\"}\n  10 {\"Timeout\":false,\"Limit\":10,\"Results\":0,\"Query\":\"d97f1d3ff91543[e-f]49.8b07517548877\"}\n  10 {\"Timeout\":false,\"Limit\":10,\"Results\":10,\"Query\":\"123456789\"}\n  10 {\"Timeout\":false,\"Limit\":10,\"Results\":10,\"Query\":\"bytes.Buffer\"}\n  10 {\"Timeout\":false,\"Limit\":10,\"Results\":10,\"Query\":\"fmt\\\\.Error\"}\n  10 {\"Timeout\":false,\"Limit\":10,\"Results\":10,\"Query\":\"fmt\\\\.Print.*\"}\n  10 {\"Timeout\":false,\"Limit\":10,\"Results\":10,\"Query\":\"fmt\\\\.Println\"}\n  10 {\"Timeout\":false,\"Limit\":10,\"Results\":10,\"Query\":\"var\"}\n  10 {\"Timeout\":false,\"Limit\":100,\"Results\":0,\"Query\":\"ac8ac5d63b66b83b90ce41a2d4061635\"}\n  10 {\"Timeout\":false,\"Limit\":100,\"Results\":0,\"Query\":\"d97f1d3ff91543[e-f]49.8b07517548877\"}\n  10 {\"Timeout\":false,\"Limit\":100,\"Results\":100,\"Query\":\"123456789\"}\n  10 {\"Timeout\":false,\"Limit\":100,\"Results\":100,\"Query\":\"bytes.Buffer\"}\n  10 
{\"Timeout\":false,\"Limit\":100,\"Results\":100,\"Query\":\"fmt\\\\.Error\"}\n  10 {\"Timeout\":false,\"Limit\":100,\"Results\":100,\"Query\":\"fmt\\\\.Print.*\"}\n  10 {\"Timeout\":false,\"Limit\":100,\"Results\":100,\"Query\":\"var\"}\n  10 {\"Timeout\":false,\"Limit\":1000,\"Results\":0,\"Query\":\"ac8ac5d63b66b83b90ce41a2d4061635\"}\n  10 {\"Timeout\":false,\"Limit\":1000,\"Results\":0,\"Query\":\"d97f1d3ff91543[e-f]49.8b07517548877\"}\n  10 {\"Timeout\":false,\"Limit\":1000,\"Results\":1000,\"Query\":\"var\"}\n  20 {\"Timeout\":false,\"Limit\":-1,\"Results\":1479452,\"Query\":\"error\"}\n  20 {\"Timeout\":false,\"Limit\":10,\"Results\":10,\"Query\":\"error\"}\n  20 {\"Timeout\":false,\"Limit\":100,\"Results\":100,\"Query\":\"error\"}\n  20 {\"Timeout\":false,\"Limit\":1000,\"Results\":1000,\"Query\":\"error\"}\n```\n\n#### How many timed out?\n\n```\n$ cat ./query_logs/query-run-split-index-1.log | go run ./cmd/visualize-query-log/main.go | jq '.[] | select (.Timeout == true)' | jq -c -s '.[]' | wc -l\n       0\n```\n\n(But we should also count the single query, mentioned previously, that bricked our macOS installation.)\n\n#### CPU/memory usage\n\nCPU usage (150% == one and a half virtual cores) during query execution as visualized by:\n\n```sh\n$ cat ./docker_stats_logs/query-run-split-index-1.log | go run ./cmd/visualize-docker-json-stats/main.go --trim-end=7700 | jq | jp -y '..CPUPerc'\n```\n\n\u003cimg width=\"1144\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/108114005-79078880-7055-11eb-9f55-bc4ca65c4808.png\"\u003e\n\nOf note here is that:\n\n1. We see an average of around 1600% CPU used - this indicates we are actually using 100% of the CPU (the i9 has 8 physical cores, 16 virtual). Compare this to our prior test, which typically only used 1 virtual CPU core.\n2. 
Due to the way Docker projects CPU usage based on delta measurements, during large spikes of CPU utilization the recorded measurement can exceed 1600% (\"16 virtual cores 100% utilized\").\n\nMemory usage in MiB during query execution as visualized by:\n\n```sh\n$ cat ./docker_stats_logs/query-run-split-index-1.log | go run ./cmd/visualize-docker-json-stats/main.go --trim-end=7700 | jq | jp -y '..MemUsageMiB'\n```\n\n\u003cimg width=\"1143\" alt=\"image\" src=\"https://user-images.githubusercontent.com/3173176/108115193-04354e00-7057-11eb-9782-8d3125c122e1.png\"\u003e\n\nOf note here is that we generally use more memory than before, around 380 MiB on average.\n\n#### Other measurements\n\nWe can determine how many queries executed within a given time bucket using e.g.:\n\n```\n$ cat ./query_logs/query-run-split-index-1.log | go run ./cmd/visualize-query-log/main.go | jq '.[] | select (.ExecutionTimeMs \u003c 25000)' | jq -c -s '.[]' | wc -l\n```\n\nNote that we did not record planning time or index rechecks for these queries.\n\n# Native Postgres tests\n\n## Postgres setup \u0026 configuration\n\nWe installed Postgres natively on the Mac to test for differences between the native filesystem and `osxfs` (the Docker driver). 
We could have tested with Docker volumes (instead of bind mounts) to achieve the same goal, but felt \"Postgres native on Mac\" would give a more direct answer to \"Does Docker introduce overheads?\"\n\nWe installed it through brew:\n\n```\nbrew install postgresql@13\n```\n\nTo ensure the test is otherwise identical to what we ran in Docker, we:\n\n* Confirmed that the same locale is in use in both environments: `en_US.utf8` in Docker and `en_US.UTF-8` on the Mac.\n* Confirmed that the Homebrew default of `initdb --locale=C -E UTF-8 /usr/local/var/postgres` matches the Docker image default locale.\n* Modified `/usr/local/var/postgres/postgresql.conf` to use the same exact [postgres.conf](postgres.conf) file as in our final split-table Docker test:\n\n```\nlisten_addresses = '*'\n\t\t\t\t\t# comma-separated list of addresses;\n\t\t\t\t\t# defaults to 'localhost'; use '*' for all\n\t\t\t\t\t# (change requires restart)\n\n# DB Version: 13\n# OS Type: linux\n# DB Type: dw\n# Total Memory (RAM): 10 GB\n# CPUs num: 8\n# Connections num: 20\n# Data Storage: ssd\n\nmax_connections = 128\nshared_buffers = 2560MB\neffective_cache_size = 7680MB\nmaintenance_work_mem = 1280MB\ncheckpoint_completion_target = 0.9\nwal_buffers = 16MB\ndefault_statistics_target = 500\nrandom_page_cost = 1.1\neffective_io_concurrency = 200\nwork_mem = 8GB\nmin_wal_size = 4GB\nmax_wal_size = 16GB\nmax_worker_processes = 8\nmax_parallel_workers_per_gather = 4\nmax_parallel_workers = 8\nmax_parallel_maintenance_workers = 4\n```\n\nWe launched Postgres using:\n\n```\npg_ctl -D /usr/local/var/postgres start\n```\n\nWe then modified the configuration to set:\n\n```\neffective_io_concurrency = 0\n```\n\nThis was required on macOS, as Postgres clearly stated when it refused to start:\n\n```\n[47671] DETAIL:  effective_io_concurrency must be set to 0 on platforms that lack posix_fadvise().\n[47671] FATAL:  configuration file \"/usr/local/var/postgres/postgresql.conf\" contains errors\n stopped waiting\npg_ctl: could not start server\nExamine the log output.\n```\n\nWe then 
created the postgres user:\n\n```\n/usr/local/opt/postgres/bin/createuser -s postgres\n```\n\n## DB schema creation\n\nCreate the DB schema in a `psql -U postgres` prompt:\n\n```sql\nBEGIN;\nCREATE EXTENSION IF NOT EXISTS pg_trgm;\nCREATE TABLE IF NOT EXISTS files (\n    id bigserial PRIMARY KEY,\n    contents text NOT NULL,\n    filepath text NOT NULL\n);\nCOMMIT;\n```\n\n## Inserting the corpus\n\nAnd began to insert all of the 20k repos of files into the DB (no index yet):\n\n```\ngo install ./cmd/corpusindex; DATABASE=postgres://postgres@127.0.0.1:5432/postgres time ./index-corpus.sh \u0026\u003e index_logs/native_postgres\n```\n\nJust as before, we count the number of files we've inserted into the DB and then drop the second half:\n\n```\npostgres=# select count(*) from files;\n  count   \n----------\n 19451299\n(1 row)\n```\n\n```\nDELETE FROM files WHERE id \u003e 9720910;\n```\n\nAnd we confirm the data size is effectively identical to before (off by a few files worth of data, since repo ordering within groups of ~10 are not guaranteed due to inserting in parallel):\n\n```\npostgres=# select count(filepath) from files;\n  count  \n---------\n 9720910\n(1 row)\n\npostgres=# select SUM(octet_length(contents)) from files;\n     sum     \n-------------\n 88201180685\n(1 row)\n```\n\n## Splitting the table\n\nWe again generate the queries to create the split 200 tables each with 50,000 rows max:\n\n```sql\n$ go run ./cmd/tablesplitgen/main.go create\nCREATE TABLE files_000 AS SELECT * FROM files WHERE id \u003e 0 AND id \u003c 50000;\nCREATE TABLE files_001 AS SELECT * FROM files WHERE id \u003e 50000 AND id \u003c 100000;\nCREATE TABLE files_002 AS SELECT * FROM files WHERE id \u003e 100000 AND id \u003c 150000;\nCREATE TABLE files_003 AS SELECT * FROM files WHERE id \u003e 150000 AND id \u003c 200000;\nCREATE TABLE files_004 AS SELECT * FROM files WHERE id \u003e 200000 AND id \u003c 250000;\n...\n```\n\nAnd ran those in a `psql` prompt.\n\nCreation of these 
tables was much faster than in Docker, around 2-8s each instead of 20-40s previously - taking only 15m in total instead of 2h before.\n\n## Creating the Trigram index\n\nWe again used the same program to generate and run the SQL statements for indexing each split table, e.g.:\n\n```sql\nCREATE INDEX IF NOT EXISTS files_000_contents_trgm_idx ON files_000 USING GIN (contents gin_trgm_ops);\n```\n\nIn parallel (8 at a time), for all 200 tables:\n\n```sh\nPARALLEL=8 NATIVE=true go run ./cmd/tablesplitgen/main.go index \u0026\u003e ./index_logs/native-index-1.log\n```\n\nWe recorded native CPU/memory/etc stats from `top` during this using:\n\n```sh\nOUT=native_stats_logs/index-1.log ./capture-native-stats.sh\n```\n\nIndexing was also much faster than in Docker, taking only 23m compared to ~3h with Docker.\n\n## Query benchmarking\n\nWe captured native CPU/memory/etc numbers using:\n\n```sh\nOUT=native_stats_logs/query-run-split-index-1.log ./capture-native-stats.sh\n```\n\nAnd used the exact same tests as before:\n\n```sql\nALTER DATABASE postgres SET statement_timeout = '3600s';\n```\n\n```\n./query-split-corpus.sh \u0026\u003e query_logs/query-run-split-index-native-1.log\n```\n\nIn total this completed in 14m28.178s, and just as in our earlier split-table test in Docker it ran 350 queries:\n\n```\n$ cat ./query_logs/query-run-split-index-native-1.log | go run ./cmd/visualize-query-log/main.go | jq -c '.[] | {Timeout: .Timeout, Limit: .Limit, Results: .Rows, Query: .Query}' | wc -l\n     350\n```\n\nUsing the following _before_ query:\n\n```\n$ cat ./query_logs/query-run-split-index-1.log | go run ./cmd/visualize-query-log/main.go | jq '.[] | select (.ExecutionTimeMs \u003c 25000)' | jq -c -s '.[]' | wc -l\n```\n\nAnd the following _after_ query:\n\n```\n$ cat ./query_logs/query-run-split-index-native-1.log | go run ./cmd/visualize-query-log/main.go | jq '.[] | select (.ExecutionTimeMs \u003c 25000)' | jq -c -s '.[]' | wc -l\n```\n\nWe can determine the following substantial improvements to query performance 
(negative change is good):\n\n| Change | Time bucket | Queries under bucket **before** | Queries under bucket **after** |\n|--------|-------------|---------------------------------|--------------------------------|\n| 0%     | 500s        | 350 of 350                      | 350 of 350                     |\n| -12%   | 100s        | 309 of 350                      | 350 of 350                     |\n| -12%   | 50s         | 309 of 350                      | 350 of 350                     |\n| -12%   | 40s         | 308 of 350                      | 350 of 350                     |\n| -12%   | 30s         | 308 of 350                      | 349 of 350                     |\n| -7%    | 25s         | 307 of 350                      | 330 of 350                     |\n| -8%    | 20s         | 302 of 350                      | 330 of 350                     |\n| -5%    | 10s         | 297 of 350                      | 311 of 350                     |\n| -26%   | 5s          | 237 of 350                      | 319 of 350                     |\n| -7%    | 2500ms      | 224 of 350                      | 240 of 350                     |\n| -9%    | 2000ms      | 219 of 350                      | 240 of 350                     |\n| -9%    | 1500ms      | 219 of 350                      | 240 of 350                     |\n| -16%   | 1000ms      | 200 of 350                      | 237 of 350                     |\n| -14%   | 750ms       | 190 of 350                      | 221 of 350                     |\n| -23%   | 500ms       | 170 of 350                      | 220 of 350                     |\n| -59%   | 250ms       | 88 of 350                       | 217 of 350                     |\n| -99%   | 100ms       | 1 of 350                        | 168 of 350                     |\n| -99%   | 50ms  
      | 1 of 350                        | 168 of 350                     |\n\nThe substantial things to note here are:\n\n1. Queries that were previously very slow saw a ~12% improvement. This is likely due to the IO operations needed when interfacing with the 200 separate tables.\n2. Queries that were previously in the middle ground saw meager ~5% improvements.\n3. Queries that were previously fairly fast (likely searching only over a few tables before returning) saw substantial 16-99% improvements.\n\nCPU/memory usage was comparable to our earlier test, consuming all CPU cores and roughly the same amount of memory. We recorded extensive samples of it throughout querying in `native_stats_logs/query-run-split-index-1.log`, but we did not write tooling to visualize it.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhexops-graveyard%2Fpgtrgm_emperical_measurements","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhexops-graveyard%2Fpgtrgm_emperical_measurements","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhexops-graveyard%2Fpgtrgm_emperical_measurements/lists"}