{"id":34497342,"url":"https://github.com/timescale/pg_textsearch","last_synced_at":"2026-04-23T01:00:41.759Z","repository":{"id":329228937,"uuid":"1014938055","full_name":"timescale/pg_textsearch","owner":"timescale","description":"PostgreSQL extension for BM25 relevance-ranked full-text search. Postgres OSS licensed.","archived":false,"fork":false,"pushed_at":"2026-04-16T07:52:54.000Z","size":25961,"stargazers_count":3656,"open_issues_count":17,"forks_count":96,"subscribers_count":10,"default_branch":"main","last_synced_at":"2026-04-16T09:07:33.586Z","etag":null,"topics":["bm25","c-extension","full-text-search","postgresql"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"postgresql","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/timescale.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2025-07-06T17:45:50.000Z","updated_at":"2026-04-16T04:11:08.000Z","dependencies_parsed_at":"2026-01-07T21:10:34.445Z","dependency_job_id":null,"html_url":"https://github.com/timescale/pg_textsearch","commit_stats":null,"previous_names":["timescale/pg_textsearch"],"tags_count":16,"template":false,"template_full_name":null,"purl":"pkg:github/timescale/pg_textsearch","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timescale%2Fpg_textsearch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timescale%2Fpg_textsearch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/repositories/timescale%2Fpg_textsearch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timescale%2Fpg_textsearch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/timescale","download_url":"https://codeload.github.com/timescale/pg_textsearch/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timescale%2Fpg_textsearch/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32161325,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-22T17:06:48.269Z","status":"ssl_error","status_checked_at":"2026-04-22T17:06:19.037Z","response_time":58,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bm25","c-extension","full-text-search","postgresql"],"created_at":"2025-12-24T01:00:50.266Z","updated_at":"2026-04-23T01:00:41.708Z","avatar_url":"https://github.com/timescale.png","language":"C","funding_links":[],"categories":["C","\u003ca name=\"C\"\u003e\u003c/a\u003eC"],"sub_categories":[],"readme":"# 
pg_textsearch\n\n[![CI](https://github.com/timescale/pg_textsearch/actions/workflows/ci.yml/badge.svg)](https://github.com/timescale/pg_textsearch/actions/workflows/ci.yml)\n[![Benchmarks](https://github.com/timescale/pg_textsearch/actions/workflows/benchmark.yml/badge.svg)](https://timescale.github.io/pg_textsearch/benchmarks/)\n[![Coverity Scan](https://scan.coverity.com/projects/32822/badge.svg)](https://scan.coverity.com/projects/pg_textsearch)\n\nModern ranked text search for Postgres.\n\n- Simple syntax: `ORDER BY content \u003c@\u003e 'search terms'`\n- BM25 ranking with configurable parameters (k1, b)\n- Works with Postgres text search configurations (english, french, german, etc.)\n- Expression indexes for JSONB fields, multi-column search, and text transformations\n- Partial indexes for scoped search and multilingual tables\n- Fast top-k queries via Block-Max WAND optimization\n- Parallel index builds for large tables\n- Supports partitioned tables\n- Best in class performance and scalability\n\n🚀 **Status**: v1.1.0 - Production ready.\n\n![Tapir and Friends](images/tapir_and_friends_v1.1.0.png)\n\n## Historical note\n\nThe original name of the project was Tapir - **T**extual **A**nalysis for **P**ostgres **I**nformation **R**etrieval.  We still use the tapir as our\nmascot and the name occurs in various places in the source code.\n\n## PostgreSQL Version Compatibility\n\npg_textsearch supports PostgreSQL 17 and 18.\n\n## Installation\n\n### Pre-built Binaries\n\nDownload pre-built binaries from the\n[Releases page](https://github.com/timescale/pg_textsearch/releases).\nAvailable for Linux and macOS (amd64 and arm64), PostgreSQL 17 and 18.\n\n### Build from Source\n\n```sh\ncd /tmp\ngit clone https://github.com/timescale/pg_textsearch\ncd pg_textsearch\nmake\nmake install # may need sudo\n```\n\n## Getting Started\n\npg_textsearch must be loaded via `shared_preload_libraries`. 
Add the following\nto `postgresql.conf` and restart the server:\n\n```\nshared_preload_libraries = 'pg_textsearch'  # add to existing list if needed\n```\n\nThen enable the extension (once per database):\n\n```sql\nCREATE EXTENSION pg_textsearch;\n```\n\nCreate a table with text content:\n\n```sql\nCREATE TABLE documents (id bigserial PRIMARY KEY, content text);\nINSERT INTO documents (content) VALUES\n    ('PostgreSQL is a powerful database system'),\n    ('BM25 is an effective ranking function'),\n    ('Full text search with custom scoring');\n```\n\nCreate a pg_textsearch index on the text column:\n\n```sql\nCREATE INDEX docs_idx ON documents USING bm25(content) WITH (text_config='english');\n```\n\n## Querying\n\nGet the most relevant documents using the `\u003c@\u003e` operator:\n\n```sql\nSELECT * FROM documents\nORDER BY content \u003c@\u003e 'database system'\nLIMIT 5;\n```\n\nNote: `\u003c@\u003e` returns the negative BM25 score since Postgres only supports `ASC` order index scans on operators. Lower (more negative) scores indicate better matches.\n\nThe index is automatically detected from the column. For explicit index specification:\n```sql\nSELECT * FROM documents\nORDER BY content \u003c@\u003e to_bm25query('database system', 'docs_idx')\nLIMIT 5;\n```\n\nSupported operations:\n- `text \u003c@\u003e 'query'` - Score text against a query (index auto-detected)\n- `text \u003c@\u003e bm25query` - Score text with explicit index specification\n\n### Verifying Index Usage\n\nCheck query plan with EXPLAIN:\n```sql\nEXPLAIN SELECT * FROM documents\nORDER BY content \u003c@\u003e 'database system'\nLIMIT 5;\n```\n\nFor small datasets, PostgreSQL may prefer sequential scans. 
Force index usage:\n```sql\nSET enable_seqscan = off;\n```\n\nNote: Even if EXPLAIN shows a sequential scan, `\u003c@\u003e` and `to_bm25query` always use the index for corpus statistics (document counts, average length) required for BM25 scoring.\n\n### Filtering with WHERE Clauses\n\nThere are two ways filtering interacts with BM25 index scans:\n\n**Pre-filtering** uses a separate index (B-tree, etc.) to reduce rows before scoring:\n```sql\n-- Create index on filter column\nCREATE INDEX ON documents (category_id);\n\n-- Query filters first, then scores matching rows\nSELECT * FROM documents\nWHERE category_id = 123\nORDER BY content \u003c@\u003e 'search terms'\nLIMIT 10;\n```\n\n**Post-filtering** applies the BM25 index scan first, then filters\nresults. Columns without their own index are filtered after the BM25\nscan:\n```sql\nSELECT * FROM documents\nWHERE length(content) \u003e 100\nORDER BY content \u003c@\u003e 'search terms'\nLIMIT 10;\n```\n\n**Performance considerations**:\n\n- **Pre-filtering tradeoff**: If the filter matches many rows (e.g., 100K+), scoring\n  all of them can be expensive. The BM25 index is most efficient when it can use\n  top-k optimization (ORDER BY + LIMIT) to avoid scoring every matching document.\n\n- **Post-filtering tradeoff**: The index returns top-k results *before* filtering.\n  If your WHERE clause eliminates most results, you may get fewer rows than\n  requested. 
Increase LIMIT to compensate, then re-limit in application code.\n\n- **Best case**: Pre-filter with a selective condition (matches \u003c10% of rows), then\n  let BM25 score the reduced set with ORDER BY + LIMIT.\n\nThis is similar to the [filtering behavior in pgvector](https://github.com/pgvector/pgvector?tab=readme-ov-file#filtering),\nwhere approximate indexes also apply filtering after the index scan.\n\n## Indexing\n\nCreate a BM25 index on your text columns:\n\n```sql\nCREATE INDEX ON documents USING bm25(content) WITH (text_config='english');\n```\n\n### Index Options\n\n- `text_config` - PostgreSQL text search configuration to use (required)\n- `k1` - term frequency saturation parameter (1.2 by default)\n- `b` - length normalization parameter (0.75 by default)\n\n```sql\nCREATE INDEX ON documents USING bm25(content) WITH (text_config='english', k1=1.5, b=0.8);\n```\n\nAlso supports different text search configurations:\n\n```sql\n-- English documents with stemming\nCREATE INDEX docs_en_idx ON documents USING bm25(content) WITH (text_config='english');\n\n-- Simple text processing without stemming\nCREATE INDEX docs_simple_idx ON documents USING bm25(content) WITH (text_config='simple');\n\n-- Language-specific configurations\nCREATE INDEX docs_fr_idx ON french_docs USING bm25(content) WITH (text_config='french');\nCREATE INDEX docs_de_idx ON german_docs USING bm25(content) WITH (text_config='german');\n```\n\n### Expression Indexes\n\nIndex expressions instead of plain columns — useful for JSONB fields,\nmulti-column concatenation, and text transformations:\n\n```sql\n-- JSONB field extraction\nCREATE INDEX ON events USING bm25 ((data-\u003e\u003e'description'))\n    WITH (text_config='english');\n\nSELECT * FROM events\nORDER BY (data-\u003e\u003e'description') \u003c@\u003e to_bm25query('network error', 'events_expr_idx')\nLIMIT 10;\n\n-- Multi-column search\nCREATE INDEX ON articles USING bm25 ((coalesce(title, '') || ' ' || coalesce(body, '')))\n    
WITH (text_config='english');\n\n-- Text transformation\nCREATE INDEX ON docs USING bm25 ((lower(content)))\n    WITH (text_config='simple');\n```\n\nThe expression must evaluate to `text` and use only IMMUTABLE functions.\nQueries must repeat the same expression in the `ORDER BY` clause.\n\n### Partial Indexes\n\nIndex a subset of rows by adding a `WHERE` clause. Partial indexes are\nsmaller and faster when queries always target a specific subset:\n\n```sql\nCREATE INDEX ON docs USING bm25 (content)\n    WITH (text_config='english')\n    WHERE status = 'published';\n\nSELECT * FROM docs\nWHERE status = 'published'\nORDER BY content \u003c@\u003e to_bm25query('search terms', 'docs_content_idx')\nLIMIT 10;\n```\n\nPartial indexes require explicit index naming via `to_bm25query()` — the\nimplicit `text \u003c@\u003e 'query'` syntax skips them.\n\nExpression and partial indexes can be combined:\n\n```sql\nCREATE INDEX ON events USING bm25 ((data-\u003e\u003e'message'))\n    WITH (text_config='english')\n    WHERE (data-\u003e\u003e'severity') = 'error';\n```\n\n### Multilingual Tables\n\nFor tables with documents in multiple languages, create one partial index\nper language, each with the appropriate text search configuration:\n\n```sql\nALTER TABLE docs ADD COLUMN lang CHAR(2) NOT NULL DEFAULT 'en';\n\nCREATE INDEX docs_en_idx ON docs USING bm25 (content)\n    WITH (text_config='english') WHERE lang = 'en';\nCREATE INDEX docs_de_idx ON docs USING bm25 (content)\n    WITH (text_config='german')  WHERE lang = 'de';\nCREATE INDEX docs_fr_idx ON docs USING bm25 (content)\n    WITH (text_config='french')  WHERE lang = 'fr';\n```\n\nEach index applies language-appropriate stemming and stop words. 
Query\nwith the matching predicate and index name:\n\n```sql\nSELECT * FROM docs\nWHERE lang = 'en'\nORDER BY content \u003c@\u003e to_bm25query('databases', 'docs_en_idx')\nLIMIT 10;\n```\n\n## Data Types\n\n### bm25query\n\nThe `bm25query` type represents queries for BM25 scoring with optional index context:\n\n```sql\n-- Create a bm25query with index name (required for WHERE clause and standalone scoring)\nSELECT to_bm25query('search query text', 'docs_idx');\n-- Returns: docs_idx:search query text\n\n-- Embedded index name syntax (alternative form using cast)\nSELECT 'docs_idx:search query text'::bm25query;\n-- Returns: docs_idx:search query text\n\n-- Create a bm25query without index name (only works in ORDER BY with index scan)\nSELECT to_bm25query('search query text');\n-- Returns: search query text\n```\n\n**Note**: In PostgreSQL 18, the embedded index name syntax using single colon (`:`) allows the\nquery planner to determine the index name even when evaluating SELECT clause expressions early.\nThis ensures compatibility across different query evaluation strategies.\n\n#### bm25query Functions\n\nFunction | Description\n--- | ---\nto_bm25query(text) → bm25query | Create bm25query without index name (for ORDER BY only)\nto_bm25query(text, text) → bm25query | Create bm25query with query text and index name\ntext \u003c@\u003e bm25query → double precision | BM25 scoring operator (returns negative scores)\nbm25query = bm25query → boolean | Equality comparison\n\n## Performance\n\npg_textsearch indexes use a memtable architecture for efficient writes. 
Like other index types, it's faster to create an index after loading your data.\n\n```sql\n-- Load data first\nINSERT INTO documents (content) VALUES (...);\n\n-- Then create index\nCREATE INDEX docs_idx ON documents USING bm25(content) WITH (text_config='english');\n```\n\n### Parallel Index Builds\n\npg_textsearch supports parallel index builds for faster indexing of large tables.\nPostgres automatically uses parallel workers based on table size and configuration.\n\n```sql\n-- Configure parallel workers (optional, uses server defaults otherwise)\nSET max_parallel_maintenance_workers = 4;\nSET maintenance_work_mem = '256MB';  -- At least 64MB required for parallel builds\n\n-- Create index (parallel workers used automatically for large tables)\nCREATE INDEX docs_idx ON documents USING bm25(content) WITH (text_config='english');\n```\n\n**Note:** The planner requires `maintenance_work_mem \u003e= 64MB` to enable parallel index\nbuilds. With insufficient memory, builds fall back to serial mode silently.\n\nYou'll see a notice when parallel build is used:\n```\nNOTICE:  parallel index build: launched 4 of 4 requested workers\n```\n\nFor partitioned tables, each partition builds its index independently with parallel\nworkers if the partition is large enough. This allows efficient indexing of very\nlarge partitioned datasets.\n\n### Performance Tuning\n\n#### Force-merging segments\n\nThe index stores data in multiple segments across levels (similar to an LSM\ntree). After bulk loads or sustained incremental inserts, multiple segments\nmay accumulate; consolidating them into one improves query speed by reducing\nthe number of segments scanned:\n\n```sql\nSELECT bm25_force_merge('docs_idx');\n```\n\nThis is analogous to Lucene's `forceMerge(1)`. It rewrites all segments into\na single segment and reclaims the freed pages. Best used after large batch\ninserts, not during ongoing write traffic.\n\n#### Use LIMIT with ORDER BY\n\nTop-k queries (`ORDER BY ... 
LIMIT n`) enable Block-Max WAND optimization,\nwhich skips blocks of postings that cannot contribute to the top results.\nWithout a LIMIT clause, the index falls back to scoring all matching\ndocuments up to `pg_textsearch.default_limit`.\n\n```sql\n-- Fast: BMW skips non-competitive blocks\nSELECT * FROM documents ORDER BY content \u003c@\u003e 'search terms' LIMIT 10;\n\n-- Slower: scores up to default_limit documents\nSELECT * FROM documents ORDER BY content \u003c@\u003e 'search terms';\n```\n\n#### Segment compression\n\nCompression is on by default and generally improves both index size and query\nperformance (fewer pages to read). Disable only if you observe that\ndecompression overhead is a bottleneck for your workload:\n\n```sql\nSET pg_textsearch.compress_segments = off;\n```\n\n#### Postgres settings that affect index builds\n\nSetting | Effect\n--- | ---\n`max_parallel_maintenance_workers` | Number of parallel workers for CREATE INDEX (default 2)\n`maintenance_work_mem` | Memory per worker; must be \u003e= 64MB for parallel builds\n\n#### pg_textsearch GUCs\n\nSetting | Default | Description\n--- | --- | ---\n`pg_textsearch.default_limit` | 1000 | Max documents scored when no LIMIT clause is present\n`pg_textsearch.compress_segments` | on | Compress posting blocks in new segments\n`pg_textsearch.segments_per_level` | 8 | Segments per level before automatic compaction (2-64)\n`pg_textsearch.memory_limit` | 2GB | Cap on shared memory used by memtables (0 = disable)\n`pg_textsearch.bulk_load_threshold` | 100000 | Terms per transaction before auto-spill (0 = disable)\n`pg_textsearch.memtable_spill_threshold` | 32000000 | **Deprecated.** Posting entries before auto-spill (0 = disable)\n\n\u003e **`memtable_spill_threshold` is deprecated.** It was the original\n\u003e mechanism for bounding memtable growth, triggering a spill when an\n\u003e index accumulated a fixed number of posting entries. 
The newer\n\u003e `memory_limit` GUC replaces it with byte-level estimation that\n\u003e accounts for term overhead and works across indexes. Both checks\n\u003e are evaluated (OR'd), so existing configurations continue to work.\n\u003e New deployments should use `memory_limit` only.\n\n#### Memory management\n\npg_textsearch keeps in-memory inverted indexes (memtables) in Postgres\ndynamic shared memory (DSA). Without a limit, heavy write workloads can\ngrow DSA until the OS OOM killer terminates the server.\n\n`memory_limit` caps this growth. It is the maximum amount of DSA memory\nthe extension will use for memtables. When usage approaches this cap,\nthe extension automatically spills memtables to on-disk segments. If\nusage still exceeds the cap, inserts fail with an ERROR rather than\nrisking an OOM kill.\n\nInternally, three thresholds are derived from `memory_limit`:\n\n| Threshold | Value | What happens |\n| --- | --- | --- |\n| Per-index soft limit | `memory_limit / 8` | Spills that index's memtable to a disk segment |\n| Global soft limit | `memory_limit / 2` | Evicts the largest memtable across all indexes |\n| Hard limit | `memory_limit` | Rejects the insert with an ERROR |\n\nDuring normal operation, the soft limits keep usage well below the hard\ncap. 
You only need to set `memory_limit`; the internal ratios are tuned\nfor typical workloads.\n\n```sql\n-- Tune for a smaller instance (e.g., 4 GB RAM)\nALTER SYSTEM SET pg_textsearch.memory_limit = '512MB';\nSELECT pg_reload_conf();\n\n-- Tune for a larger instance (e.g., 64 GB RAM)\nALTER SYSTEM SET pg_textsearch.memory_limit = '16GB';\nSELECT pg_reload_conf();\n```\n\nTo check current memory usage:\n\n```sql\nSELECT * FROM bm25_memory_usage();\n```\n\nVACUUM (including autovacuum's insert-threshold path) also spills the\nmemtable when it runs, so the amount of un-spilled state between\n`CREATE INDEX` and the next server restart stays bounded.\n\n**Crash recovery**: The memtable is rebuilt from the heap on startup, so no\ndata is lost if Postgres crashes before spilling to disk.\n\n## Monitoring\n\n```sql\n-- Check usage of BM25 indexes (selected by access method name)\nSELECT s.schemaname, s.relname, s.indexrelname, s.idx_scan, s.idx_tup_read, s.idx_tup_fetch\nFROM pg_stat_user_indexes s\nJOIN pg_class c ON c.oid = s.indexrelid\nJOIN pg_am a ON a.oid = c.relam\nWHERE a.amname = 'bm25';\n```\n\n## Examples\n\n### Basic Search\n\n```sql\nCREATE TABLE articles (id serial PRIMARY KEY, title text, content text);\nCREATE INDEX articles_idx ON articles USING bm25(content) WITH (text_config='english');\n\nINSERT INTO articles (title, content) VALUES\n    ('Database Systems', 'PostgreSQL is a powerful relational database system'),\n    ('Search Technology', 'Full text search enables finding relevant documents quickly'),\n    ('Information Retrieval', 'BM25 is a ranking function used in search engines');\n\n-- Find relevant documents\nSELECT title, content \u003c@\u003e 'database search' as score\nFROM articles\nORDER BY score;\n```\n\nAlso supports different languages and custom parameters:\n\n```sql\n-- Different languages\nCREATE INDEX fr_idx ON french_articles USING bm25(content) WITH (text_config='french');\nCREATE INDEX de_idx ON german_articles USING bm25(content) WITH (text_config='german');\n\n-- Custom parameters\nCREATE INDEX custom_idx ON documents USING 
bm25(content)\n    WITH (text_config='english', k1=2.0, b=0.9);\n```\n\n\n## Limitations\n\n### No Phrase Queries\n\nThe BM25 index stores term frequencies but not term positions, so it cannot\nnatively evaluate phrase queries like `\"database system\"`. You can emulate\nphrase matching by combining BM25 ranking with a post-filter:\n\n```sql\n-- BM25 ranks candidates; subquery over-fetches to account for\n-- post-filter eliminating non-phrase matches\nSELECT * FROM (\n    SELECT *, content \u003c@\u003e 'database system' AS score\n    FROM documents\n    ORDER BY score\n    LIMIT 100  -- over-fetch\n) sub\nWHERE content ILIKE '%database system%'\nORDER BY score\nLIMIT 10;\n```\n\nBecause the post-filter eliminates some results, the inner LIMIT should\nbe larger than the desired result count.\n\n### No Built-in Faceted Search\n\npg_textsearch does not provide dedicated faceting operators, but standard\nPostgres query machinery handles common faceting patterns:\n\n```sql\n-- Filter by category (assumes a B-tree index on category)\nSELECT * FROM documents\nWHERE category = 'engineering'\nORDER BY content \u003c@\u003e 'search terms'\nLIMIT 10;\n\n-- Compute facet counts over top search results\nSELECT category, count(*)\nFROM (\n    SELECT category FROM documents\n    ORDER BY content \u003c@\u003e 'search terms'\n    LIMIT 100\n) matches\nGROUP BY category;\n```\n\n### Insert/Update Performance\n\nThe memtable architecture is designed to support efficient writes, but\nsustained write-heavy workloads are not yet fully optimized. For initial\ndata loading, creating the index after loading data is faster than\nincremental inserts. This is an active area of development.\n\n### No Background Compaction\n\nSegment compaction currently runs synchronously during memtable spill\noperations. Write-heavy workloads may observe compaction latency during\nspills. 
Background compaction is planned for a future release.\n\n### Partitioned Tables\n\nBM25 indexes on partitioned tables use **partition-local statistics**. Each\npartition maintains its own:\n- Document count (`total_docs`)\n- Average document length (`avg_doc_len`)\n- Per-term document frequencies for IDF calculation\n\nThis means:\n- Queries targeting a single partition compute accurate BM25 scores using that\n  partition's statistics\n- Queries spanning multiple partitions return scores computed independently per\n  partition, which may not be directly comparable across partitions\n\n**Example**: If partition A has 1000 documents and partition B has 10 documents,\nthe term \"database\" would have different IDF values in each partition. Results\nfrom both partitions would have scores on different scales.\n\n**Recommendations**:\n- For time-partitioned data, query individual partitions when score comparability\n  matters\n- Use partitioning schemes where queries naturally target single partitions\n- Consider this behavior when designing partition strategies for search workloads\n\n```sql\n-- Query single partition (scores are accurate within partition)\nSELECT * FROM docs\nWHERE created_at \u003e= '2024-01-01' AND created_at \u003c '2025-01-01'\nORDER BY content \u003c@\u003e 'search terms'\nLIMIT 10;\n\n-- Cross-partition query (scores computed per-partition)\nSELECT * FROM docs\nORDER BY content \u003c@\u003e 'search terms'\nLIMIT 10;\n```\n\n### Word Length Limit\n\npg_textsearch inherits PostgreSQL's tsvector word length limit of 2047 characters.\nWords exceeding this limit are ignored during tokenization (with an INFO message).\nThis is defined by `MAXSTRLEN` in PostgreSQL's text search implementation.\n\nFor typical natural language text, this limit is never encountered. 
It may affect\ndocuments containing very long tokens such as base64-encoded data, long URLs, or\nconcatenated identifiers.\n\nThis behavior is similar to other search engines:\n- Elasticsearch: Truncates tokens (configurable via `truncate` filter, default 10 chars)\n- Tantivy: Truncates to 255 bytes by default\n\n### PL/pgSQL and Stored Procedures\n\nThe implicit `text \u003c@\u003e 'query'` syntax relies on planner hooks to automatically\ndetect the BM25 index. These hooks don't run inside PL/pgSQL DO blocks, functions,\nor stored procedures.\n\n**Inside PL/pgSQL**, use explicit index names with `to_bm25query()`:\n\n```sql\n-- This won't work in PL/pgSQL:\n-- SELECT * FROM docs ORDER BY content \u003c@\u003e 'search terms' LIMIT 10;\n\n-- Use explicit index name instead:\nSELECT * FROM docs\nORDER BY content \u003c@\u003e to_bm25query('search terms', 'docs_idx')\nLIMIT 10;\n```\n\nRegular SQL queries (outside PL/pgSQL) support both forms.\n\n## Troubleshooting\n\n```sql\n-- List available text search configurations\nSELECT cfgname FROM pg_ts_config;\n\n-- List BM25 indexes\nSELECT indexname FROM pg_indexes WHERE indexdef LIKE '%USING bm25%';\n```\n\n\n## Installation Notes\n\nIf your machine has multiple Postgres installations, specify the path to `pg_config`:\n\n```sh\nexport PG_CONFIG=/Library/PostgreSQL/18/bin/pg_config  # or 17\nmake clean \u0026\u0026 make \u0026\u0026 make install\n```\n\nIf you get compilation errors, install Postgres development files:\n\n```sh\n# Ubuntu/Debian\nsudo apt install postgresql-server-dev-17  # for PostgreSQL 17\nsudo apt install postgresql-server-dev-18  # for PostgreSQL 18\n```\n\n## Reference\n\n### Index Options\n\nOption | Type | Default | Description\n--- | --- | --- | ---\ntext_config | string | required | PostgreSQL text search configuration to use\nk1 | real | 1.2 | Term frequency saturation parameter (0.1 to 10.0)\nb | real | 0.75 | Length normalization parameter (0.0 to 1.0)\n\n### Text Search 
Configurations\n\nAvailable configurations depend on your Postgres installation:\n```\n# SELECT cfgname FROM pg_ts_config;\n  cfgname\n------------\n simple\n arabic\n armenian\n basque\n catalan\n danish\n dutch\n english\n finnish\n french\n german\n greek\n hindi\n hungarian\n indonesian\n irish\n italian\n lithuanian\n nepali\n norwegian\n portuguese\n romanian\n russian\n serbian\n spanish\n swedish\n tamil\n turkish\n yiddish\n(29 rows)\n```\nFurther language support is available via extensions such as [zhparser](https://github.com/amutu/zhparser).\n\n### Development Functions\n\nThese functions are for debugging and development use only. Their interface may\nchange in future releases without notice. Functions marked with † require\nsuperuser privileges.\n\nFunction | Description\n--- | ---\nbm25_force_merge(index_name) → void | Merge all segments into one (improves query speed)\nbm25_spill_index(index_name) → int4 | Force memtable spill to disk segment\nbm25_dump_index(index_name) † → text | Dump internal index structure (truncated)\nbm25_summarize_index(index_name) † → text | Show index statistics without content\n\nAdditional file-writing debug functions (`bm25_dump_index(text, text)` and\n`bm25_debug_pageviz`) are available in debug builds only (compile with\n`-DDEBUG_DUMP_INDEX`).\n\n```sql\n-- Merge all segments into one (best after bulk loads)\nSELECT bm25_force_merge('docs_idx');\n\n-- Force spill to disk (returns number of entries spilled)\nSELECT bm25_spill_index('docs_idx');\n\n-- Quick overview of index statistics\nSELECT bm25_summarize_index('docs_idx');\n\n-- Detailed dump for debugging (truncated output)\nSELECT bm25_dump_index('docs_idx');\n```\n\n## Extension Compatibility\n\npg_textsearch uses fixed LWLock tranche IDs 1001-1008 to support large numbers\nof indexes (e.g., partitioned tables with hundreds of partitions). 
If you use\nanother Postgres extension that also registers fixed tranche IDs in this range,\nwait event names in `pg_stat_activity` may be incorrect. Core Postgres tranches\nuse IDs below 100. If you encounter a conflict, please\n[open an issue](https://github.com/timescale/pg_textsearch/issues).\n\n## Contributing\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, code style, and\nhow to submit pull requests.\n\n- **Bug Reports**: [Create an issue](https://github.com/timescale/pg_textsearch/issues/new?labels=bug\u0026template=bug_report.md)\n- **Feature Requests**: [Request a feature](https://github.com/timescale/pg_textsearch/issues/new?labels=enhancement\u0026template=feature_request.md)\n- **General Discussion**: [Start a discussion](https://github.com/timescale/pg_textsearch/discussions)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftimescale%2Fpg_textsearch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftimescale%2Fpg_textsearch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftimescale%2Fpg_textsearch/lists"}