{"id":13699316,"url":"https://github.com/citusdata/postgresql-topn","last_synced_at":"2025-05-16T18:08:05.036Z","repository":{"id":47167318,"uuid":"107943269","full_name":"citusdata/postgresql-topn","owner":"citusdata","description":"TopN is an open source PostgreSQL extension that returns the top values in a database according to some criteria","archived":false,"fork":false,"pushed_at":"2024-10-18T13:35:58.000Z","size":174,"stargazers_count":241,"open_issues_count":10,"forks_count":24,"subscribers_count":32,"default_branch":"master","last_synced_at":"2025-04-03T18:15:16.773Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/citusdata.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-10-23T06:54:20.000Z","updated_at":"2025-03-22T17:58:04.000Z","dependencies_parsed_at":"2024-10-19T12:05:08.259Z","dependency_job_id":null,"html_url":"https://github.com/citusdata/postgresql-topn","commit_stats":{"total_commits":70,"total_committers":13,"mean_commits":5.384615384615385,"dds":0.6571428571428571,"last_synced_commit":"57d2a1dc5369b52ec0610a9abfa7867a2ee4742c"},"previous_names":[],"tags_count":13,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/citusdata%2Fpostgresql-topn","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/citusdata%2Fpostgresql-topn/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/citusdata%2Fpostgresql-topn/
releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/citusdata%2Fpostgresql-topn/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/citusdata","download_url":"https://codeload.github.com/citusdata/postgresql-topn/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248601061,"owners_count":21131607,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T20:00:30.499Z","updated_at":"2025-04-12T16:47:38.429Z","avatar_url":"https://github.com/citusdata.png","language":"C","readme":"[![Build Status](https://travis-ci.org/citusdata/postgresql-topn.svg?branch=master)](https://travis-ci.org/citusdata/postgresql-topn)\n# TopN\n\n`TopN` is an open source PostgreSQL extension that returns the top values in a database according to some criteria. TopN takes elements in a data set, ranks them according to a given rule, and picks the top elements in that data set. When doing this, TopN applies an approximation algorithm to provide fast results using few compute and memory resources.\n\nThe `TopN` extension becomes useful when you want to materialize top values, incrementally update these top values, and/or merge top values from different time intervals. If you're familiar with [the PostgreSQL HLL extension](https://github.com/citusdata/postgresql-hll), you can think of `TopN` as its cousin.\n\n## When to use TopN\nTopN becomes helpful when serving customer-facing dashboards or running analytical queries that need sub-second responses. 
Ranking events, users, or products in a given dimension becomes important for these workloads.\n\n`TopN` is used by customers in production to serve real-time analytics queries over terabytes of data.\n\n## Why use TopN\nCalculating TopN elements in a set by applying count, sort, and limit is simple. As data sizes increase, however, this method becomes slow and resource intensive.\n\nThe open source `TopN` extension enables you to serve instant and approximate results to TopN queries. To do this, you first materialize top values according to some criteria in a data type. You can then incrementally update these top values, or merge them on demand across different time intervals.\n\n`TopN` was originally created to help [Citus Data](https://www.citusdata.com) customers, who needed to scale out their PostgreSQL databases across dozens of machines. These customers needed to compute top values over terabytes of data in less than a second. We realized that the broader Postgres community could benefit from `TopN`, and decided to open source it for all users.\n\n## How does TopN work\nThe TopN approximation algorithm keeps a predefined number of frequent items and counters. If a new item already exists among these frequent items, the algorithm increases the item's frequency counter. Otherwise, the algorithm inserts the new item into the counter list when there is enough space. If there isn't enough space, the algorithm evicts the bottom half of all counters. Since we typically keep counters for many more items (e.g. 100*N) than we are actually interested in, the actual top N items are unlikely to get evicted and will typically have accurate counts.\n\nYou can increase the algorithm's accuracy by increasing the predefined number of frequent items/counters.\n\n# Build\n\nOnce you have PostgreSQL, you're ready to build TopN. For this, you will need to include the pg_config directory path in your make command. 
This path is typically the same as your PostgreSQL installation's bin/ directory path. For example:\n\n\tPATH=/usr/local/pgsql/bin/:$PATH make\n\tsudo PATH=/usr/local/pgsql/bin/:$PATH make install\n\nYou can run the regression tests as follows:\n\n    sudo make installcheck\n\n# Example\n\nIn this example, we use sample customer review data from Amazon. We're then going to analyze the most reviewed products based on different criteria.\n\nLet's start by downloading and decompressing the source data file.\n\n    wget http://examples.citusdata.com/customer_reviews_2000.csv.gz\n    gzip -d customer_reviews_2000.csv.gz\n\nNext, we're going to connect to PostgreSQL and create the `TopN` extension.\n\n```SQL\nCREATE EXTENSION topn;\n```\n\nLet's then create our example table and load data into it.\n\n```SQL\nCREATE TABLE customer_reviews\n(\n    customer_id TEXT,\n    review_date DATE,\n    review_rating INTEGER,\n    review_votes INTEGER,\n    review_helpful_votes INTEGER,\n    product_id CHAR(10),\n    product_title TEXT,\n    product_sales_rank BIGINT,\n    product_group TEXT,\n    product_category TEXT,\n    product_subcategory TEXT,\n    similar_product_ids CHAR(10)[]\n);\n\n\\COPY customer_reviews FROM 'customer_reviews_2000.csv' WITH CSV;\n```\n\nNow, we're going to create an aggregation table that captures the most popular products for each month. 
We're then going to materialize top products for each month.\n\n```SQL\n-- Create a roll-up table to capture most popular products\nCREATE TABLE popular_products\n(\n  review_date date UNIQUE,\n  agg_data jsonb\n);\n\n-- Create different summaries by grouping top reviews for each date (day, month, year)\nINSERT INTO popular_products\n    SELECT review_date, topn_add_agg(product_id)\n    FROM customer_reviews\n    GROUP BY review_date;\n```\n\nFrom this table, you can compute the most popular/reviewed product for each day, in the blink of an eye.\n\n```SQL\nSELECT review_date, (topn(agg_data, 1)).*\nFROM popular_products\nORDER BY review_date;\n```\n\nYou can also instantly find the top 10 reviewed products across any time interval, in this case January.\n\n```SQL\nSELECT (topn(topn_union_agg(agg_data), 10)).*\nFROM popular_products\nWHERE review_date \u003e= '2000-01-01' AND review_date \u003c '2000-02-01'\nORDER BY 2 DESC;\n```\n\nOr, you can quickly find the most reviewed product for each month in 2000.\n\n```SQL\nSELECT date_trunc('month', review_date) AS review_month,\n       (topn(topn_union_agg(agg_data), 1)).*\nFROM popular_products\nWHERE review_date \u003e= '2000-01-01' AND review_date \u003c '2001-01-01'\nGROUP BY review_month\nORDER BY review_month;\n```\n\n# Usage\n`TopN` provides the following user-defined functions and aggregates.\n\n### Data Type\n###### `JSONB`\nA PostgreSQL type to keep the frequent items and their frequencies.\n\n### Aggregates\n###### `topn_add_agg(textColumnName)`\nThis is the aggregate add function. It creates an empty `JSONB` and inserts the series of items from the given column to create an aggregate summary of these items. Note that the value must be of type `TEXT` or cast to `TEXT`.\n\n###### `topn_union_agg(topnTypeColumn)`\nThis is the aggregate for the union operation. 
It merges the `JSONB` counter lists and returns the final `JSONB`, which stores the overall result.\n\n### Functions\n###### `topn(jsonb, n)`\nGives the most frequent `n` elements and their frequencies as a set of rows from the given `JSONB`.\n\n###### `topn_add(jsonb, text)`\nAdds the given text value as a new counter into the `JSONB` and returns a new `JSONB` if there is enough space for one more counter. If not, the counter is added and then the counter list is pruned.\n\n###### `topn_union(jsonb, jsonb)`\nTakes the union of both `JSONB`s and returns a new `JSONB`.\n\n### Config settings\n###### `topn.number_of_counters`\nSets the number of counters to be tracked in a `JSONB`. If at any point the current number of counters exceeds `topn.number_of_counters` * 3, the list is pruned. The default value of `topn.number_of_counters` is 1000. When you increase this setting, `TopN` uses more space and provides more accurate estimates.\n\n# Compatibility\n`TopN` is compatible with the PostgreSQL 9.6, 10, 11, 12, 13, 14, 15, 16 and 17 releases. `TopN` is also compatible with all supported Citus releases, including Citus 6.x, 7.x, 8.x, and 9.x. If you need to run `TopN` on a different version of PostgreSQL or Citus, please open an issue. Opening a pull request (PR) is also highly appreciated.\n\n# Attributions\nThe `TopN` extension to Postgres [was created by](https://www.citusdata.com/blog/2018/03/27/topn-for-your-postgres-database/) and is maintained by the [Citus database team](https://www.citusdata.com/about/our-story/), now part of Microsoft. 
The Citus team also created [Citus](https://www.citusdata.com/download/), an [open source extension to Postgres](https://github.com/citusdata/citus) that transforms Postgres into a distributed database.\n","funding_links":[],"categories":["C"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcitusdata%2Fpostgresql-topn","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcitusdata%2Fpostgresql-topn","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcitusdata%2Fpostgresql-topn/lists"}