{"id":27881786,"url":"https://github.com/src-d/uast2clickhouse","last_synced_at":"2025-07-21T03:31:52.176Z","repository":{"id":79454033,"uuid":"203394424","full_name":"src-d/uast2clickhouse","owner":"src-d","description":"Push flattened Babelfish UASTs to ClickHouse DB.","archived":false,"fork":false,"pushed_at":"2019-09-24T14:00:16.000Z","size":75,"stargazers_count":1,"open_issues_count":1,"forks_count":2,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-05-05T05:05:09.750Z","etag":null,"topics":["babelfish","clickhouse","code-as-data","uast"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/src-d.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2019-08-20T14:35:04.000Z","updated_at":"2019-11-13T16:33:37.000Z","dependencies_parsed_at":null,"dependency_job_id":"e2897504-dd04-41a0-abbf-c4f770c220c3","html_url":"https://github.com/src-d/uast2clickhouse","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/src-d/uast2clickhouse","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fuast2clickhouse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fuast2clickhouse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fuast2clickhouse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fuast2clickhouse/manifests","owner_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/owners/src-d","download_url":"https://codeload.github.com/src-d/uast2clickhouse/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fuast2clickhouse/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266235306,"owners_count":23897175,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["babelfish","clickhouse","code-as-data","uast"],"created_at":"2025-05-05T05:05:09.034Z","updated_at":"2025-07-21T03:31:52.169Z","avatar_url":"https://github.com/src-d.png","language":"Go","readme":"# uast2clickhouse\n\nA CLI tool to push [Babelfish's UASTs](https://docs.sourced.tech/babelfish/uast/uast-specification-v2) to [ClickHouse DB](https://clickhouse.yandex).\n\nIt is written in Go and has zero dependencies. 
The list of solved problems includes:\n\n- Normalizing the UAST even more strongly than in the Semantic mode.\n- Converting a tree structure to a linear list of \"interesting\" nodes.\n- Handling runtime errors which are typical of big data processing: OOMs, crashes, DB insertion failures, etc.\n- Running distributed and unattended.\n\n### Installation\n\nYou need a [Go compiler \u003e=1.11](https://golang.org/).\n\n```\nexport GO111MODULE=on\ngo build uast2clickhouse\n```\n\n### Usage\n\nInstall ClickHouse \u003e= 19.4 and initialize the DB schema:\n\n```\nclickhouse-client --query=\"CREATE TABLE uasts (\n  id Int32,\n  left Int32,\n  right Int32,\n  repo String,\n  lang String,\n  file String,\n  line Int32,\n  parents Array(Int32),\n  pkey String,\n  roles Array(Int16),\n  type String,\n  orig_type String,\n  uptypes Array(String),\n  value String\n) ENGINE = MergeTree() ORDER BY (repo, file, id);\n\nCREATE TABLE meta (\n   repo String,\n   siva_filenames Array(String),\n   file_count Int32,\n   langs Array(String),\n   langs_bytes_count Array(UInt32),\n   langs_lines_count Array(UInt32),\n   langs_files_count Array(UInt32),\n   commits_count Int32,\n   branches_count Int32,\n   forks_count Int32,\n   empty_lines_count Array(UInt32),\n   code_lines_count Array(UInt32),\n   comment_lines_count Array(UInt32),\n   license_names Array(String),\n   license_confidences Array(Float32),\n   stars Int32,\n   size Int64,\n   INDEX stars stars TYPE minmax GRANULARITY 1\n) ENGINE = MergeTree() ORDER BY repo;\"\n```\n\nThen run on each of the nodes:\n\n```\n./uast2clickhouse --heads heads.csv --db default:password@10.150.0.9/default /path/to/parquet\n```\n\nor\n\n```\n./uast2clickhouse --heads heads.csv --db default:password@10.150.0.9/default 10.150.0.9:11300\n```\n\n`heads.csv` contains the mapping from the HEAD UUIDs in Parquet to the actual repository names. 
If you\nwork with [PGA](https://github.com/src-d/datasets/tree/master/PublicGitArchive), [download it](https://drive.google.com/open?id=136vsGWfIwfd0IrAdfphIU6lkMmme4-Pj) or [generate it with list-pga-heads](https://github.com/src-d/datasets/tree/master/PublicGitArchive/list-pga-heads).\n`--db default:password@10.150.0.9/default` is the ClickHouse connection string.\n`10.150.0.9:11300` is a sample [beanstalkd](https://github.com/beanstalkd/beanstalkd) message queue address for distributed processing.\nYou should specify `--read-streams` and `--db-streams` to reach peak performance. `--read-streams` sets the number of\ngoroutines to read the Parquet file, and `--db-streams` sets the number of HTTP threads that upload the SQL insertions to ClickHouse.\nUsually `--db-streams` is bigger than `--read-streams`; bigger values for either increase the memory pressure.\n\n### Sample operation\n\nInput: UASTs extracted from PGA'19, 204068 Parquet files overall in a 6 TB Google Cloud volume.\nDB instance configuration: Google Cloud \"highcpu\" with 64 cores and 58 GB of RAM; 6 local NVMe SSDs joined in RAID0 and formatted to ext4 with the journal disabled; Ubuntu 18.04.\nWorker configuration: Ubuntu 18.04 with a 20 GB SSD disk and the UASTs volume attached read-only at `/mnt/uasts`.\n\n- Install and run `beanstalkd` on the DB instance. Build the [`beanstool` binary](https://github.com/src-d/beanstool) locally and `scp` it there.\n- List all the Parquet files with `find /mnt/uasts -name '*.parquet' | gzip \u003e tasks.gz` on one of the workers.\n- `scp tasks.gz` to the DB instance and run `zcat tasks.gz | xargs -n1 ./beanstool put --ttr 1000h -t default -b` to fill the queue.\n- Install and set up ClickHouse on the DB instance. There are sample [`/etc/clickhouse-server/config.xml`](config.xml) and [`/etc/clickhouse-server/users.xml`](users.xml) files.\n- Execute the pushing procedure in 4 stages.\n\n1. 16 workers, 2 cores, 4 GB RAM each. 
`./uast2clickhouse --read-streams 2 --db-streams 6 --heads heads.csv --db default:password@10.150.0.9/default 10.150.0.9:11300`. This succeeds for ~80% of the tasks. Then `./beanstool kick --num NNN -t default`.\n2. 16 workers, 2 cores, 4 GB RAM each. `./uast2clickhouse --read-streams 1 --db-streams 1 --heads heads.csv --db default:password@10.150.0.9/default 10.150.0.9:11300`. This succeeds for all but ~1k tasks.\n3. 16 workers, 2 cores, 16 GB RAM each (\"highmem\"). Same command. This leaves only ~10 tasks.\n4. 2 workers, 4 cores, 32 GB RAM each (\"highmem\"). Same command, full success.\n\n- Create the secondary DB indexes:\n\n```\nSET allow_experimental_data_skipping_indices = 1;\nALTER TABLE uasts ADD INDEX lang lang TYPE set(0) GRANULARITY 1;\nALTER TABLE uasts ADD INDEX type type TYPE set(0) GRANULARITY 1;\nALTER TABLE uasts ADD INDEX value_exact value TYPE bloom_filter() GRANULARITY 1;\nALTER TABLE uasts ADD INDEX left (repo, file, left) TYPE minmax GRANULARITY 1;\nALTER TABLE uasts ADD INDEX right (repo, file, right) TYPE minmax GRANULARITY 1;\nALTER TABLE uasts MATERIALIZE INDEX lang;\nALTER TABLE uasts MATERIALIZE INDEX type;\nALTER TABLE uasts MATERIALIZE INDEX value_exact;\nALTER TABLE uasts MATERIALIZE INDEX left;\nALTER TABLE uasts MATERIALIZE INDEX right;\nOPTIMIZE TABLE uasts FINAL;\n```\n\nThe whole procedure takes ~1 week.\n\n### Tests\n\nSadly, there are no tests at the moment. We are going to fix this.\n\n### License\n\nApache 2.0, see [LICENSE](LICENSE).\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsrc-d%2Fuast2clickhouse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsrc-d%2Fuast2clickhouse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsrc-d%2Fuast2clickhouse/lists"}