{"id":13622586,"url":"https://github.com/ClickHouse/clickhub","last_synced_at":"2025-04-15T09:33:05.013Z","repository":{"id":154634416,"uuid":"621827738","full_name":"ClickHouse/clickhub","owner":"ClickHouse","description":"Github analytics powered by the world's fastest real-time analytics database","archived":false,"fork":false,"pushed_at":"2024-01-10T16:59:29.000Z","size":4022,"stargazers_count":13,"open_issues_count":2,"forks_count":1,"subscribers_count":6,"default_branch":"main","last_synced_at":"2024-08-01T21:54:11.588Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ClickHouse.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-31T13:25:22.000Z","updated_at":"2024-04-24T15:31:25.000Z","dependencies_parsed_at":null,"dependency_job_id":"b8fa12e8-80f2-4b09-9345-98b30a126b2f","html_url":"https://github.com/ClickHouse/clickhub","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ClickHouse%2Fclickhub","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ClickHouse%2Fclickhub/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ClickHouse%2Fclickhub/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ClickHouse%2Fclickhub/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ClickHouse","download_url":"https://codeload.github.
com/ClickHouse/clickhub/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223668236,"owners_count":17182893,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T21:01:21.341Z","updated_at":"2024-11-08T10:30:33.941Z","avatar_url":"https://github.com/ClickHouse.png","language":"Python","readme":"# ClickHub\n\nGitHub analytics with the world's fastest real-time analytics database.\n\n## Capabilities\n\n- Imports a github repo to ClickHouse (currently assumes tables are pre-created)\n- Job queue for repositories to import consumed by workers. Scales linearly.\n\nNote: repos are cloned locally. This can require significant disk space for a large number of repos.\n\n## Pre-requisites\n\n- python3.10+\n- git - authenticated with ssh keys\n- clickhouse-client\n- ClickHouse instance with support for KeeperMap and `keeper_map_strict_mode`\n\n## Installing\n\n`pip install -r requirements.txt`\n\nPre-create tables in ClickHouse. 
The default database is `git`.\n\n## Running\n\n\n```bash\nusage: clickhub.py [-h] [-c CONFIG] [-d] {schedule,start_worker,import,update_all_repos} ...\n\ngithub importer\n\npositional arguments:\n  {schedule,start_worker,import,update_all_repos}\n    schedule            Schedule a repo for import (add queue to queue)\n    start_worker        start a worker to consume from queue\n    import              import a repo\n    update_all_repos    schedule all current repos for update\n\noptions:\n  -h, --help            show this help message and exit\n  -c CONFIG, --config CONFIG\n                        config (default: config.yml)\n  -d, --debug           debug (default: False)\n```\n\n### Import a repository\n\nImports a repository. Note that the import runs on the local machine.\n\n```bash\npython clickhub.py import --repo_name \u003cname\u003e\n```\n\nCaution: ensure the repo isn't already being imported by a worker on the current machine. This command is intended only for adding a repo.\n\n### Schedule a repo\n\nAdds the repo to the work queue.\n\n```bash\npython clickhub.py schedule --repo_name \u003cname\u003e\n```\n\n### Start worker\n\nStarts a worker that consumes from the queue.\n\n```bash\npython clickhub.py start_worker\n```\n\n### Update all repos\n\nSchedules a job for every current repository. The set of current repositories is determined by the `repo_lookup_table` setting.\n\n```bash\npython clickhub.py update_all_repos\n```\n\n### Bulk schedule repos\n\nBulk schedules the repos listed in a file, one repo per line. 
40,000 repos are provided in \n[repos.txt](repos.txt).\n\n```bash\npython clickhub.py bulk_schedule --file repos.txt\n```\n\n## Config\n\nSee [config.yml](config.yml)\n\n```yaml\n# clickhouse details\nhost: ''\nport: 8443\nnative_port: 9440\nusername: default\npassword: ''\nsecure: true\n# location to clone repos\ndata_cache: '/opt/git_cache'\n# queue details\nmax_queue_length: 10000\n# period between worker polls\nsleep_time: 10\n# table on which we look up current repos\nrepo_lookup_table: 'git.commits'\n```\n\n\n## Table Schemas\n\n```sql\nCREATE TABLE git.commits\n(\n    `hash`           String,\n    `author`         LowCardinality(String),\n    `time`           DateTime,\n    `message`        String,\n    `files_added`    UInt32,\n    `files_deleted`  UInt32,\n    `files_renamed`  UInt32,\n    `files_modified` UInt32,\n    `lines_added`    UInt32,\n    `lines_deleted`  UInt32,\n    `hunks_added`    UInt32,\n    `hunks_removed`  UInt32,\n    `hunks_changed`  UInt32,\n    `repo_name`      LowCardinality(String),\n    `updated_at`     DateTime MATERIALIZED now()\n) ENGINE = ReplacingMergeTree\nORDER BY (repo_name, time, hash)\n```\n\n```sql\nCREATE TABLE git.file_changes\n(\n    `change_type`           Enum8('Add' = 1, 'Delete' = 2, 'Modify' = 3, 'Rename' = 4, 'Copy' = 5, 'Type' = 6),\n    `path`                  LowCardinality(String),\n    `old_path`              LowCardinality(String),\n    `file_extension`        LowCardinality(String),\n    `lines_added`           UInt32,\n    `lines_deleted`         UInt32,\n    `hunks_added`           UInt32,\n    `hunks_removed`         UInt32,\n    `hunks_changed`         UInt32,\n    `commit_hash`           String,\n    `author`                LowCardinality(String),\n    `time`                  DateTime,\n    `commit_message`        String,\n    `commit_files_added`    UInt32,\n    `commit_files_deleted`  UInt32,\n    `commit_files_renamed`  UInt32,\n    `commit_files_modified` UInt32,\n    `commit_lines_added`    
UInt32,\n    `commit_lines_deleted`  UInt32,\n    `commit_hunks_added`    UInt32,\n    `commit_hunks_removed`  UInt32,\n    `commit_hunks_changed`  UInt32,\n    `repo_name`             LowCardinality(String),\n    `updated_at`            DateTime MATERIALIZED now()\n) ENGINE = ReplacingMergeTree\nORDER BY (repo_name, time, commit_hash, path)\nSETTINGS index_granularity = 8192\n```\n\n```sql\nCREATE TABLE git.line_changes\n(\n    `sign`                       Int8,\n    `line_number_old`            UInt32,\n    `line_number_new`            UInt32,\n    `hunk_num`                   UInt32,\n    `hunk_start_line_number_old` UInt32,\n    `hunk_start_line_number_new` UInt32,\n    `hunk_lines_added`           UInt32,\n    `hunk_lines_deleted`         UInt32,\n    `hunk_context`               LowCardinality(String),\n    `line`                       LowCardinality(String),\n    `indent`                     UInt8,\n    `line_type`                  Enum8('Empty' = 0, 'Comment' = 1, 'Punct' = 2, 'Code' = 3),\n    `prev_commit_hash`           String,\n    `prev_author`                LowCardinality(String),\n    `prev_time`                  DateTime,\n    `file_change_type`           Enum8('Add' = 1, 'Delete' = 2, 'Modify' = 3, 'Rename' = 4, 'Copy' = 5, 'Type' = 6),\n    `path`                       LowCardinality(String),\n    `old_path`                   LowCardinality(String),\n    `file_extension`             LowCardinality(String),\n    `file_lines_added`           UInt32,\n    `file_lines_deleted`         UInt32,\n    `file_hunks_added`           UInt32,\n    `file_hunks_removed`         UInt32,\n    `file_hunks_changed`         UInt32,\n    `commit_hash`                String,\n    `author`                     LowCardinality(String),\n    `time`                       DateTime,\n    `commit_message`             String,\n    `commit_files_added`         UInt32,\n    `commit_files_deleted`       UInt32,\n    `commit_files_renamed`       UInt32,\n    `commit_files_modified` 
     UInt32,\n    `commit_lines_added`         UInt32,\n    `commit_lines_deleted`       UInt32,\n    `commit_hunks_added`         UInt32,\n    `commit_hunks_removed`       UInt32,\n    `commit_hunks_changed`       UInt32,\n    `repo_name`                  LowCardinality(String),\n    `updated_at`                 DateTime MATERIALIZED now()\n) ENGINE = ReplacingMergeTree\nORDER BY (repo_name, time, commit_hash, path, line_number_old, line_number_new)\n```\n\n```sql\nCREATE TABLE git.work_queue\n(\n    `repo_name` String,\n    `scheduled` DateTime,\n    `priority` Int32,\n    `worker_id` String,\n    `started_time` DateTime,\n)\nENGINE = KeeperMap('git_queue')\nPRIMARY KEY repo_name\n```\n\n```bash\nclickhouse-local --query \"SELECT c1::String as hash, c2::String as author, c3::DateTime('utc') as time, c4::String as message, c5::UInt32 as files_added, c6::UInt32 as files_deleted, c7::UInt32 as files_renamed, c8::UInt32 as files_modified, c9::UInt32 as lines_added, c10::UInt32 as lines_deleted, c11::UInt32 as hunks_added, c12::UInt32 as hunks_removed, c13::UInt32 as hunks_changed, 'ClickHouse/ClickHouse'::String as repo_name FROM file('commits.tsv') FORMAT Native\" | clickhouse-client --query \"INSERT INTO git.commits FORMAT Native\"\n```\n\n```bash\nclickhouse-local --query \"SELECT c1::Enum8('Add' = 1, 'Delete' = 2, 'Modify' = 3, 'Rename' = 4, 'Copy' = 5, 'Type' = 6) as change_type, c2::String as path, c3::String as old_path, c4::String as file_extension, c5::UInt32 as lines_added, c6::UInt32 as lines_deleted, c7::UInt32 as hunks_added, c8::UInt32 as hunks_removed, c9::UInt32 as hunks_changed, c10::String as commit_hash, c11::String as author, c12::DateTime as time, c13::String as commit_message, c14::UInt32 as commit_files_added, c15::UInt32 as commit_files_deleted, c16::UInt32 as commit_files_renamed, c17::UInt32 as commit_files_modified, c18::UInt32 as commit_lines_added, c19::UInt32 as commit_lines_deleted, c20::UInt32 as commit_hunks_added, c21::UInt32 
as commit_hunks_removed, c22::UInt32 as commit_hunks_changed, 'ClickHouse/ClickHouse'::String as repo_name FROM file('file_changes.tsv') FORMAT Native\" | clickhouse-client --query \"INSERT INTO git.file_changes FORMAT Native\"\n```\n\n```bash\nclickhouse-local --query \"SELECT c1::Int8 as sign, c2::UInt32 as line_number_old, c3::UInt32 as line_number_new, c4::UInt32 as hunk_num, c5::UInt32 as hunk_start_line_number_old, c6::UInt32 as hunk_start_line_number_new, c7::UInt32 as hunk_lines_added, c8::UInt32 as hunk_lines_deleted,  c9::String as hunk_context, c10::String as line, c11::UInt8 as indent, c12::Enum8('Empty' = 0, 'Comment' = 1, 'Punct' = 2, 'Code' = 3) as line_type, c13::String as prev_commit_hash, c14::String as prev_author, c15::DateTime as prev_time, c16::Enum8('Add' = 1, 'Delete' = 2, 'Modify' = 3, 'Rename' = 4, 'Copy' = 5, 'Type' = 6) as file_change_type, c17::String as path, c18::String as old_path, c19::String as file_extension, c20::UInt32 as file_lines_added, c21::UInt32 as file_lines_deleted, c22::UInt32 as file_hunks_added, c23::UInt32 as file_hunks_removed, c24::UInt32 as file_hunks_changed, c25::String as commit_hash, c26::String as author, c27::DateTime as time, c28::String as commit_message, c29::UInt32 as commit_files_added, c30::UInt32 as commit_files_deleted, c31::UInt32 as commit_files_renamed, c32::UInt32 as commit_files_modified, c33::UInt32 as commit_lines_added, c34::UInt32 as commit_lines_deleted, c35::UInt32 as commit_hunks_added, c36::UInt32 as commit_hunks_removed, c37::UInt32 as commit_hunks_changed, 'ClickHouse/ClickHouse'::String as repo_name FROM file('line_changes.tsv') FORMAT Native\" | clickhouse-client --query \"INSERT INTO git.line_changes FORMAT Native\"\n```\n\n```sql\nCREATE TABLE default.github_stars\n(\n    `repo_name` LowCardinality(String),\n    `stars`     UInt64\n) ENGINE = SummingMergeTree\nORDER BY repo_name;\n\nCREATE MATERIALIZED VIEW github_stars_mv TO github_stars AS\nSELECT repo_name,\n       
count() AS stars\nFROM github_events\nWHERE event_type = 'WatchEvent'\nGROUP BY repo_name;\n\nINSERT INTO github_stars\nSELECT repo_name, countIf(event_type = 'WatchEvent', 0) AS stars\nFROM github_events\nGROUP BY repo_name;\n```\n\n### Useful commands\n\n\n","funding_links":[],"categories":["Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FClickHouse%2Fclickhub","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FClickHouse%2Fclickhub","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FClickHouse%2Fclickhub/lists"}