{"id":17570638,"url":"https://github.com/ClickHouse/clickpy","last_synced_at":"2025-03-07T21:30:45.472Z","repository":{"id":205064795,"uuid":"678452100","full_name":"ClickHouse/clickpy","owner":"ClickHouse","description":"PyPI analytics powered by ClickHouse","archived":false,"fork":false,"pushed_at":"2025-02-26T10:53:49.000Z","size":3334,"stargazers_count":74,"open_issues_count":14,"forks_count":10,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-02-28T12:31:35.107Z","etag":null,"topics":["analytics","clickhouse","pypi","pypi-packages","python","real-time"],"latest_commit_sha":null,"homepage":"https://clickpy.clickhouse.com","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ClickHouse.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-14T15:28:59.000Z","updated_at":"2025-02-26T10:53:54.000Z","dependencies_parsed_at":null,"dependency_job_id":"58228c10-5440-4ad4-87da-0b0a70d494a3","html_url":"https://github.com/ClickHouse/clickpy","commit_stats":{"total_commits":87,"total_committers":6,"mean_commits":14.5,"dds":0.09195402298850575,"last_synced_commit":"6fcf5f03037e2fa2ff5785edf02cee53f8363b76"},"previous_names":["clickhouse/clickpy"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ClickHouse%2Fclickpy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ClickHouse%2Fclickpy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ClickHouse%2Fclickpy/r
eleases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ClickHouse%2Fclickpy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ClickHouse","download_url":"https://codeload.github.com/ClickHouse/clickpy/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242467161,"owners_count":20133105,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analytics","clickhouse","pypi","pypi-packages","python","real-time"],"created_at":"2024-10-21T18:01:15.913Z","updated_at":"2025-03-07T21:30:45.464Z","avatar_url":"https://github.com/ClickHouse.png","language":"JavaScript","readme":"# ClickPy - New Python Package Analytics Service Powered by ClickHouse\n\nInterested in seeing how your package is being adopted? How often is it being installed? Which countries are popular? Or maybe you're just curious to see which packages are emerging or hot right now?\n\nClickPy, powered by ClickHouse, answers these questions with real-time analytics on PyPI package downloads. It is available as a public service and can also be run locally. All open-source and reproducible.\n\nAvailable at [clickpy.clickhouse.com](https://clickpy.clickhouse.com).\n\n![landing_page](./images/landing_page.png)\n\n![analytics](./images/analytics.png)\n\nEvery Python package download, e.g. `pip install`, anywhere, anytime, produces a row. 
The result is hundreds of billions of rows (closing in on a trillion, at 1.4 billion a day).\n\nThe downloads for Python modules are available in BigQuery - a row for every package download in the world and the largest BigQuery public dataset at about 700b rows. Wanting to do some serious analytics leads to a few frustrations, though:\n\n- speed for queries - BigQuery is great for complex SQL, less so for fast analytics.\n- cost :) especially if we want to offer this as a free service.\n\nWith ClickHouse we can provide cost-efficient and fast analytics for free.\n\n## Features\n\nData\n\n- 600+ billion rows\n- Almost 600k packages\n\nAnalytics via live dashboards\n\n- Top packages and recent releases\n- Emerging repos - most popular new packages released in the last 6 months\n- Needing a refresh - popular packages not updated in the last 6 months\n- Hot packages - biggest changes in downloads in the last 6 months\n- Download statistics for any Python package over time\n- For any package:\n    - Download statistics over time with drill-down\n    - Downloads by Python version over time\n    - Downloads by system over time\n    - Downloads by country\n    - File types by installer\n    - Slice and dice by version, time, Python version, installer, or country\n\nPowered by ClickHouse. App in NextJS.\n\n## Motivation\n\nMany of us learn best by example and doing. This app is for those wanting to build real-time analytics applications.\n\nReal-time analytics applications have a few requirements:\n\n- Billions of rows\n- Low latency queries allowing users to slice and dice with filters\n- High query concurrency\n- A great user experience\n\nAnyone building such an application faces similar challenges:\n\n- Which database to use? ClickHouse obviously :)\n- How to use ClickHouse to get the best performance? 
ClickPy is your example.\n\n## PyPI data\n\nPython is ubiquitous and the programming language we often get started with or turn to for quick tasks.\n\nThe Python Package Index, abbreviated as PyPI and also known as the Cheese Shop, is the official third-party software repository for Python. Python developers use this for hosting and installing packages. By default, pip uses PyPI to look for packages.\n\nEvery time a package is downloaded, a log entry is generated in a CDN log. This contains the details you would expect:\n\n- the package name\n- the version\n- IP address of download (obfuscated and resolved to country)\n- Python version used\n- installer mechanism\n- system used\n- and more\n\nPyPI does not display download statistics for a number of reasons described [here](https://packaging.python.org/en/latest/guides/analyzing-pypi-package-downloads/#id8) - not least, working with CDN logs is inefficient and hard.\n\nInstead, an [implementation of linehaul](https://github.com/pypi/linehaul-cloud-function) feeds this data to BigQuery, where it's [queryable as a public dataset](https://packaging.python.org/en/latest/guides/analyzing-pypi-package-downloads/#public-dataset).\n\nBigQuery is great as a data warehouse. But it's neither fast enough nor able to handle the concurrency required to power user-facing analytics.\n\nThe solution? ClickHouse - the fastest and most resource-efficient open-source database for real-time apps and analytics.\n\nThis requires us to export the BigQuery data to a GCS bucket and import it into ClickHouse.\n\n## How is this all so fast? What's the secret sauce?\n\nTwo main reasons:\n\n- ClickHouse was designed to be fast for analytics data. See [Why is ClickHouse so fast?](https://clickhouse.com/docs/en/concepts/why-clickhouse-is-so-fast)\n- Materialized views and dictionaries\n\n### What is a Materialized View in ClickHouse?\n\nIn its simplest form, a Materialized view is simply a query that triggers when an insert is made to a table. 
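The trigger-on-insert idea can be sketched as a toy Python model (purely illustrative - this is not how ClickHouse is implemented, and the package names/versions are example data):

```python
from collections import defaultdict

# Toy model of a materialized view: the source table holds raw rows; the
# "view" is a function that fires on every inserted block and forwards an
# aggregated result to a separate target table.
source_table = []                 # raw downloads: (project, version) tuples
target_table = defaultdict(int)   # aggregated counts: project -> downloads

def materialized_view(block):
    """Aggregate the inserted block and write the result to the target table."""
    for project, _version in block:
        target_table[project] += 1

def insert(block):
    source_table.extend(block)    # the normal insert into the source table
    materialized_view(block)      # the view triggers on that same insert

insert([("boto3", "1.28.0"), ("urllib3", "2.0.4"), ("boto3", "1.28.1")])
insert([("boto3", "1.28.1")])

print(target_table["boto3"])  # 3 - computed at insert time, not query time
```

Querying the small `target_table` is then cheap, regardless of how large `source_table` grows.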
\n\nKey to this is the idea that Materialized views don't hold any data themselves. They simply execute a query on the inserted rows and send the results to another \"target table\" for storage.\n\nImportantly, the query that runs can aggregate the rows into a smaller result set, allowing queries to run faster on the target table. This approach effectively moves work from **query time to insert time**.\n\n[![What is a Materialized View in ClickHouse?](http://img.youtube.com/vi/QUigKP7iy7Y/0.jpg)](http://www.youtube.com/watch?v=QUigKP7iy7Y \"What is a Materialized View in ClickHouse?\")\n\n#### A real example\n\nConsider our `pypi` table, where a row represents a download. Suppose we wish to identify the 5 most popular projects. A naive query might look like this:\n\n```sql\nSELECT\n    project,\n    count() AS c\nFROM pypi.pypi\nGROUP BY project\nORDER BY c DESC\nLIMIT 5\n\n┌─project────┬───────────c─┐\n│ boto3      │ 13564182186 │\n│ urllib3    │ 10994463491 │\n│ botocore   │  9937667176 │\n│ requests   │  8914244571 │\n│ setuptools │  8589052556 │\n└────────────┴─────────────┘\n\n5 rows in set. Elapsed: 182.068 sec. Processed 670.43 billion rows, 12.49 TB (3.68 billion rows/s., 68.63 GB/s.)\n```\n\nThis requires a full table scan. While 180s might be OK (and almost 4 billion rows/sec is fast!), it is not quick enough for ClickPy.\n\nA materialized view can help with this query (and many more!). ClickPy uses such a view, `pypi_downloads_mv`, shown below:\n\n```sql\nCREATE MATERIALIZED VIEW pypi.pypi_downloads_mv TO pypi.pypi_downloads\n(\n    `project` String,\n    `count` Int64\n) AS SELECT project, count() AS count\nFROM pypi.pypi\nGROUP BY project\n```\n\nThis view executes the aggregation `SELECT project, count() AS count FROM pypi.pypi GROUP BY project` on each block of rows as it is inserted. The result is sent to the \"target table\" `pypi.pypi_downloads`. 
This, in turn, has a special engine configuration:\n\n```sql\nCREATE TABLE pypi.pypi_downloads\n(\n    `project` String,\n    `count` Int64\n)\nENGINE = SummingMergeTree\nORDER BY project\n```\n\nThe [SummingMergeTree](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/summingmergetree) engine replaces all rows with the same `ORDER BY` key (`project` in this case) with one row containing summarized values for the numeric columns. Rows with the same `project` value will be asynchronously merged and their `count` summed - hence `SummingMergeTree`.\n\nTo query this table, we can use the query below:\n\n```sql\nSELECT\n    project,\n    sum(count) AS c\nFROM pypi.pypi_downloads\nGROUP BY project\nORDER BY c DESC\nLIMIT 5\n\n┌─project────┬───────────c─┐\n│ boto3      │ 13564182186 │\n│ urllib3    │ 10994463491 │\n│ botocore   │  9937667176 │\n│ requests   │  8914244571 │\n│ setuptools │  8589052556 │\n└────────────┴─────────────┘\n\n5 rows in set. Elapsed: 0.271 sec. Processed 599.09 thousand rows, 18.71 MB (2.21 million rows/s., 69.05 MB/s.)\nPeak memory usage: 59.71 MiB.\n```\n\n180s to 0.27s. Not bad.\n\nNote how we use `sum(count)` in case not all rows have been merged yet.\n\nThe above represents the simplest example of a Materialized view used by ClickPy. For others, see [ClickHouse.md](./ClickHouse.md). It also represents the case where our aggregation produces a count or sum. Other aggregations (e.g. averages, quantiles, etc.) are supported. 
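Why `sum(count)` is needed can be illustrated with a toy Python model of SummingMergeTree's semantics (the numbers are hypothetical; merges are asynchronous, so a query may see several partial rows per key):

```python
from itertools import groupby
from operator import itemgetter

# Each insert creates a "part" holding partial counts. Background merges
# eventually collapse rows sharing the ORDER BY key, but a query can run
# before that happens - here 'boto3' still appears in two unmerged parts.
parts = [
    [("boto3", 100), ("urllib3", 50)],  # part from one insert
    [("boto3", 25)],                    # a later insert, not yet merged
]

def query_counts(parts):
    """SELECT project, sum(count) ... GROUP BY project - correct pre- or post-merge."""
    rows = sorted(row for part in parts for row in part)
    return {project: sum(count for _, count in grp)
            for project, grp in groupby(rows, key=itemgetter(0))}

counts = query_counts(parts)
print(counts["boto3"])  # 125 - reading raw rows without sum() would return 100 or 25
```

After a merge collapses the parts into one row per key, the same query still returns 125, which is why aggregating over the target table is always safe.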
In fact, all ClickHouse aggregations can have their state stored by a materialized view!\n\nFor further details see:\n\n- [An intro to Materialized Views in ClickHouse](https://youtu.be/QUigKP7iy7Y?si=AvKnI-UtDbusbk-y) - this uses ClickPy as an example.\n- [Building Real-time Applications with ClickHouse Materialized Views](https://www.youtube.com/watch?v=j_kKKX1bguw) - meetup video, showing ClickPy as an example and how materialized views work.\n- [Using Materialized Views in ClickHouse](https://clickhouse.com/blog/using-materialized-views-in-clickhouse) - blog with examples.\n- [Materialized Views and Projections Under the Hood](https://www.youtube.com/watch?v=QDAJTKZT8y4) - a great video for those interested in internals.\n\n### Dictionaries\n\nDictionaries provide an in-memory key-value representation of our data, optimized for low-latency lookup queries. We can utilize this structure to improve the performance of queries in general, with JOINs particularly benefiting where one side of the JOIN represents a lookup table that fits into memory.\n\nIn ClickPy's case, we utilize a dictionary `pypi.last_updated_dict` to maintain the last time a package was updated. This is used in several queries to ensure they meet our latency requirements.\n\nFor further details on dictionaries, see the blog post [Using Dictionaries to Accelerate Queries](https://clickhouse.com/blog/faster-queries-dictionaries-clickhouse).\n\n### Powering the UI\n\nBroadly, when exploring the ClickPy interface, each visualization is powered by one materialized view. The full list of queries can be found in the file [clickhouse.js](./src/utils/clickhouse.js).\n\nConsider the list of \"Emerging repos\" on the landing page.\n\n![emerging_repos](./images/emerging_repos.png)\n\nThis simple visual is powered by two materialized views: `pypi_downloads_per_day` and `pypi_downloads_max_min`. 
For the full query, see [here](https://github.com/ClickHouse/clickpy/blob/12d565202b88b97b51d557da0bc777ad65d5ba60/src/utils/clickhouse.js#L380).\n\n#### Choosing the right query\n\nClickPy is an interactive application. Users can apply filters on the data. While these filters are currently quite limited, we plan to expand them in the future. While materialized views can power a static visualization, filters may mean that a specific view is no longer usable. For example, consider the following chart showing downloads over time (for the popular `boto3` package):\n\n![downloads_over_time](./images/downloads_over_time.png)\n\nInitially, this chart requires a simple query to the materialized view [`pypi_downloads_per_day`](./ClickHouse.md#pypi_downloads_per_day).\n\nHowever, if a filter is applied to the `version` column, this view is insufficient - it doesn't capture the `version` column in its aggregation. In this case, we switch to the [`pypi_downloads_per_day_by_version`](./ClickHouse.md#pypi_downloads_per_day_by_version) view.\n\nWhy not always use the latter view, you ask? Well, it contains more columns in its aggregation, and thus the target table has more rows, is larger, and queries are possibly a little slower. Small margins, yes, but important for the best user experience.\n\nSelecting the right view for a visualization involves a simple heuristic: we select the view with the fewest columns that covers the set of columns required by the query. 
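That heuristic can be sketched in a few lines of Python (the first two view names appear in this README; the third name and all column sets are illustrative assumptions, not the app's actual definitions):

```python
# Candidate materialized views and the columns their aggregations capture.
# Column sets here are examples only; the real definitions live in ClickHouse.md.
views = {
    "pypi_downloads_per_day": {"project", "date"},
    "pypi_downloads_per_day_by_version": {"project", "date", "version"},
    "pypi_downloads_per_day_by_version_by_country": {  # hypothetical name
        "project", "date", "version", "country_code",
    },
}

def pick_view(required_columns):
    """Return the smallest view covering every column the query filters on."""
    candidates = [name for name, cols in views.items() if required_columns <= cols]
    if not candidates:
        return None  # no view covers the query; fall back to the base table
    return min(candidates, key=lambda name: len(views[name]))

print(pick_view({"project", "date"}))             # pypi_downloads_per_day
print(pick_view({"project", "date", "version"}))  # pypi_downloads_per_day_by_version
```

Fewer columns means a smaller target table, so preferring the minimal covering view keeps queries as cheap as possible.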
The complete logic can be found [here](https://github.com/ClickHouse/clickpy/blob/12d565202b88b97b51d557da0bc777ad65d5ba60/src/utils/clickhouse.js#L28).\n\n## Deployment\n\nEither go to the public example at [clickpy.clickhouse.com](https://clickpy.clickhouse.com) or deploy yourself.\n\nFor the latter, you have two choices:\n\n - Export the BigQuery data yourself to GCS and import it into ClickHouse\n - Use the public instance of ClickHouse with read-only credentials (see below)*\n\nWe cover both options below.\n\n * This instance is sufficient to run the application but has quotas applied.\n\n### Dependencies\n\n- node \u003e= v16.15\n- npm \u003e= 9.1\n- ClickHouse \u003e= 23.8\n- Python \u003e= 3.8 (if loading data from GCS)\n\n### ClickHouse\n\n#### Public instance\n\nFor users wishing to make changes to just the app and use the existing ClickHouse instance with the data, the following credentials can be used:\n\n```\nhost: https://sql-clickhouse.clickhouse.com\nport: 443\nuser: demo\n```\n\nUsers can connect to this instance with the clickhouse-client and issue queries, e.g.\n\n```bash\nclickhouse client -h sql-clickhouse.clickhouse.com --user demo --secure\n```\n\nSee [App Configuration](#configuration).\n\n#### Self-hosted\n\n##### Creating tables and views\n\nClickPy relies on two main tables within a `pypi` database: `pypi` and `projects`. The `pypi` table holds the bulk of the data, with a row for every package download: over 600b rows. 
The `projects` table contains a row per package, with \u003c 1m rows.\n\nAs well as these two main tables, ClickPy relies on materialized views and dictionaries to provide the sub-second query performance across over 600 billion rows.\n\nUsers can either use the script `./scripts/create_tables.sh` to create the required views, dictionaries, and tables or perform this step by hand - see [ClickHouse.md](./ClickHouse.md) for full details on the table schemas and DDL required.\n\nThe `create_tables.sh` script assumes the ClickHouse instance is secured by SSL, using the `--secure` flag for the `clickhouse-client`. Modify as required.\n\n```bash\nCLICKHOUSE_USER=default CLICKHOUSE_HOST=example.clickhouse.com CLICKHOUSE_PASSWORD=password ./create_tables.sh\n```\n\nAll schemas assume the use of the MergeTree table engine. For users of [ClickHouse Cloud](https://clickhouse.cloud/), this will automatically replicate the data. Self-managed users may need to shard the data (as well as the associated target tables of the Materialized views) across multiple nodes, depending on the size of hardware available. This is left as an exercise for the user.\n\n```\nNote: Although the data is 15TB uncompressed, it is less than 50GB on disk compressed, making this application deployable on moderate hardware as a single node.\n```\n\nFor details on populating the database, see [Importing data](#importing-data) below.\n\n##### Exporting data\n\nFor users wanting to host the data themselves, the data must first be exported from BigQuery - ideally to Parquet - prior to import into ClickHouse. This is a significant export (15TB) and can take multiple hours to run. 
The following query will export the data into a single bucket:\n\n```sql\nDECLARE export_path string;\nSET export_path = CONCAT('gs://\u003cbucket\u003e/file_downloads-*.parquet');\n\nEXPORT DATA\nOPTIONS (\n    uri = (export_path),\n    format = 'PARQUET',\n    overwrite = true\n)\nAS (\nSELECT timestamp, \n    country_code, \n    url, \n    project, \n    file, \n    STRUCT\u003cname string, version string\u003e(details.installer.name, details.installer.version) as installer,\n    details.python as python,\n    STRUCT\u003cname string, version string\u003e(details.implementation.name, details.implementation.version) as implementation,\n    STRUCT\u003cname string, version string, id string, libc STRUCT\u003clib string, version string\u003e\u003e(details.distro.name, details.distro.version, details.distro.id,(details.distro.libc.lib, details.distro.libc.version)) as distro,\n    STRUCT\u003cname string, release string\u003e(details.system.name, details.system.release) as system,\n    details.cpu as cpu,\n    details.openssl_version as openssl_version,\n    details.setuptools_version as setuptools_version, details.rustc_version as rustc_version, tls_protocol, tls_cipher\n    FROM bigquery-public-data.pypi.file_downloads WHERE timestamp \u003e '2000-01-01 00:00:00'\n);\n```\n\nThe export can also be broken up using techniques described [here](https://clickhouse.com/docs/en/migrations/bigquery#1-export-table-data-to-gcs).\n\nFiles will be exported with a numeric suffix e.g. `file_downloads-000000000012.parquet`.\n\n##### Importing data\n\nThe `projects` table can be populated with a few simple `INSERT INTO SELECT` statements:\n\n```sql\nINSERT INTO projects SELECT *\nFROM s3('https://storage.googleapis.com/clickhouse_public_datasets/pypi/packages/packages-*.parquet')\n```\n\nThis data is up-to-date as of `2023-08-16`. 
For more recent versions of the data, users can export the `bigquery-public-data.pypi.distribution_metadata` table to GCS from BigQuery.\n\nFor the larger `pypi` table, we recommend the scripts provided [here](https://github.com/ClickHouse/examples/tree/main/large_data_loads).\n\nAlternatively, the following can be used as the basis for importing the data in chunks manually by using glob patterns. In the example below, we target the files with a numeric suffix beginning with `-00000000001*`. Provide the location of your bucket via `\u003cbucket\u003e`:\n\n```sql\nINSERT INTO pypi SELECT timestamp::Date as date, country_code, project, file.type as type, installer.name as installer, arrayStringConcat(arraySlice(splitByChar('.', python), 1, 2), '.') as python_minor, system.name as system, file.version as version FROM s3('https://\u003cbucket\u003e/file_downloads-00000000001*.parquet', 'Parquet', 'timestamp DateTime64(6), country_code LowCardinality(String), url String, project String, `file.filename` String, `file.project` String, `file.version` String, `file.type` String, `installer.name` String, `installer.version` String, python String, `implementation.name` String, `implementation.version` String, `distro.name` String, `distro.version` String, `distro.id` String, `distro.libc.lib` String, `distro.libc.version` String, `system.name` String, `system.release` String, cpu String, openssl_version String, setuptools_version String, rustc_version String, tls_protocol String, tls_cipher String') WHERE python_minor != '' AND system != '' SETTINGS input_format_null_as_default = 1, input_format_parquet_import_nested = 1\n```\n\nFor details on tuning insert performance, see [here](https://clickhouse.com/blog/supercharge-your-clickhouse-data-loads-part2).\n\n##### Data size\n\nWhile the export (as of 15/10/2023) is over 15TB of Parquet, it compresses extremely well in ClickHouse, by over 320x, to a total disk usage of less than 50GB.\n\n```sql\nSELECT\n    table,\n    
sum(rows) AS rows,\n    formatReadableSize(sum(data_compressed_bytes)) AS compressed_size,\n    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_size,\n    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio\nFROM system.parts\nWHERE (table LIKE 'pypi') AND active\nGROUP BY table\nORDER BY table DESC\n\n┌─table─┬─────────rows─┬─compressed_size─┬─uncompressed_size─┬──ratio─┐\n│ pypi  │ 670430346833 │ 47.75 GiB       │ 15.26 TiB         │ 327.25 │\n└───────┴──────────────┴─────────────────┴───────────────────┴────────┘\n\n1 row in set. Elapsed: 0.011 sec.\n```\n\n## Application\n\n### Configuration\n\nCopy the file `.env.example` to `.env.local`.\n\nModify the settings with your ClickHouse cluster details, e.g. if using the public instance:\n\n```\nCLICKHOUSE_HOST=https://sql-clickhouse.clickhouse.com\nCLICKHOUSE_USERNAME=demo\nCLICKHOUSE_PASSWORD=\nPYPI_DATABASE=pypi\n```\n\n### Running\n\nInstall dependencies:\n\n```bash\nnpm install\n```\n\nTo run locally:\n\n```bash\nnpm run dev\n```\n\n### Deploying to production\n\nThe easiest way to deploy the Next.js app is to use the [Vercel Platform](https://vercel.com/new) from the creators of Next.js.\n\nWe welcome contributions to help with deployment.\n\n## Contributing and Development\n\nThis is a standard Next.js project. To run the development server:\n\n```bash\nnpm run dev\n# or\nyarn dev\n# or\npnpm dev\n```\n\nOpen [http://localhost:3000](http://localhost:3000) with your browser to see the result.\n\nPlease fork and raise PRs to contribute. 
Changes and ideas are welcome.\n\n## License\n\nApache License 2.0\n","funding_links":[],"categories":["Integrations","JavaScript"],"sub_categories":["Data Visualization and Analysis"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FClickHouse%2Fclickpy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FClickHouse%2Fclickpy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FClickHouse%2Fclickpy/lists"}