{"id":25476475,"url":"https://github.com/jonsafari/clustercat","last_synced_at":"2026-03-02T13:11:52.763Z","repository":{"id":26901564,"uuid":"30363238","full_name":"jonsafari/clustercat","owner":"jonsafari","description":"Fast Word Clustering Software","archived":false,"fork":false,"pushed_at":"2025-02-08T03:14:57.000Z","size":474,"stargazers_count":78,"open_issues_count":0,"forks_count":11,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-05-25T23:12:03.804Z","etag":null,"topics":["c","clusters","d3js","exchange-algorithm","machine-learning","python","word-clusters","word-embeddings"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jonsafari.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-02-05T15:42:21.000Z","updated_at":"2025-02-12T22:50:47.000Z","dependencies_parsed_at":"2024-08-16T22:52:19.075Z","dependency_job_id":"dd7d6bb0-b2b8-411a-8ddb-24cd74607b20","html_url":"https://github.com/jonsafari/clustercat","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/jonsafari/clustercat","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonsafari%2Fclustercat","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonsafari%2Fclustercat/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonsafari%2Fclustercat/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonsafari%2Fclustercat
/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jonsafari","download_url":"https://codeload.github.com/jonsafari/clustercat/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonsafari%2Fclustercat/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30003728,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-02T12:19:43.414Z","status":"ssl_error","status_checked_at":"2026-03-02T12:19:02.215Z","response_time":60,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["c","clusters","d3js","exchange-algorithm","machine-learning","python","word-clusters","word-embeddings"],"created_at":"2025-02-18T12:57:19.374Z","updated_at":"2026-03-02T13:11:52.724Z","avatar_url":"https://github.com/jonsafari.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ClusterCat: Fast, Flexible Word Clustering Software\n\n[![Build Status](https://travis-ci.org/jonsafari/clustercat.svg?branch=master)](https://travis-ci.org/jonsafari/clustercat)\n[![License: LGPL v3](https://img.shields.io/badge/License-LGPL%20v3-blue.svg)](http://www.gnu.org/licenses/lgpl-3.0)\n[![License: MPL 2.0](https://img.shields.io/badge/License-MPL%202.0-brightgreen.svg)](https://opensource.org/licenses/MPL-2.0)\n\n\n## 
Overview\n\nClusterCat induces word classes from unannotated text.\nIt is programmed in modern C, with no external libraries.\nA Python wrapper is also provided.\n\nWord classes are unsupervised part-of-speech tags, requiring no manually-annotated corpus.\nWords that share syntactic/semantic similarities are grouped together.\nThey are used in many dozens of applications within natural language processing, machine translation, neural net training, and related fields.\n\n\n## Installation\n### Linux\nYou can use either GCC 4.6+ or Clang 3.7+, but GCC is usually faster.\n\n      sudo apt update  \u0026\u0026  sudo apt install gcc make libomp-dev\n      make -j\n\n### macOS / OSX\nThe current version of Clang in Xcode doesn't fully support [OpenMP][], so instead install GCC from [Homebrew][]:\n\n      brew update  \u0026\u0026  brew install gcc@9 libomp  \u0026\u0026  xcode-select --install\n      make -j CC=/opt/homebrew/bin/gcc-9\n\n\n## Commands\nThe binary program `clustercat` gets compiled into the `bin` directory.\n\n**Clustering** preprocessed text (already tokenized, normalized, etc.) is pretty simple:\n\n      bin/clustercat [options] \u003c train.tok.txt \u003e clusters.tsv\n\nThe word-classes are induced by a bidirectional [predictive][] [exchange algorithm][].\nThe format of the output class file has each line consisting of `word`*TAB*`class` (a word type, then tab, then class).\n\nCommand-line argument **usage** may be obtained by running the program with the **`--help`** flag:\n\n      bin/clustercat --help\n\n\n## Python\nInstallation and usage details for the Python module are described in a separate [readme](python/README.md).\n\n\n## Features\n- Print **[word vectors][]** (a.k.a. word embeddings) using the `--word-vectors` flag.  The binary format is compatible with word2vec's tools.\n- Start training using an **existing word cluster mapping** from other clustering software (e.g. 
mkcls) using the `--class-file` flag.\n- Adjust the number of **threads** to use with the `--threads` flag.  The default is 8.\n- Adjust the **number of clusters** or vector dimensions using the `--classes` flag. The default is approximately the square root of the vocabulary size.\n- Includes **compatibility wrapper script ` bin/mkcls `** that can be run just like mkcls.  You can use more classes now :-)\n\n\n## Comparison\n| Training Set                                        | [Brown][] | ClusterCat | [mkcls][] | [Phrasal][] | [word2vec][] |\n| ------------                                        | --------- | ---------- | --------- | ----------- | ------------ |\n| 1 Billion English tokens,   800 clusters  | 12.5 hr   | **1.4** hr | 48.8 hr   | 5.1 hr      | 20.6 hr      |\n| 1 Billion English tokens,   1200 clusters | 25.5 hr   | **1.7** hr | 68.8 hr   | 6.2 hr      | 33.7 hr      |\n| 550 Million Russian tokens, 800 clusters  | 14.6 hr   | **1.5** hr | 75.0 hr   | 5.5 hr      | 12.0 hr      |\n\n\n## Visualization\nSee [bl.ocks.org][] for nice data visualizations of the clusters for various languages, including English, German, Persian, Hindi, Czech, Catalan, Tajik, Basque, Russian, French, and Maltese.\n\nFor example:\n\n ![French Clustering Thumbnail](visualization/d3/french_cluster_thumbnail.png)\n ![Russian Clustering Thumbnail](visualization/d3/russian_cluster_thumbnail.png)\n ![Basque Clustering Thumbnail](visualization/d3/basque_cluster_thumbnail.png)\n\nYou can generate your own graphics from ClusterCat's output.\nAdd the flag  `--print-freqs`  to ClusterCat, then type the command:\n\n      bin/flat_clusters2json.pl --word-labels \u003c clusters.tsv \u003e visualization/d3/clusters.json\n\nYou can either upload the [JSON][] file to [gist.github.com][], following instructions on the [bl.ocks.org](http://bl.ocks.org) front page, or you can view the graphic locally by running a minimal webserver in the `visualization/d3` directory:\n\n      python -m 
http.server 8116 2\u003e/dev/null \u0026\n\nThen open a tab in your browser to [localhost:8116](http://localhost:8116).\n\nThe default settings are sensible for normal usage, but for visualization you probably want far fewer word types and clusters: fewer than 10,000 word types and 120 clusters.\nYour browser will thank you.\n\n\n## Perplexity\nClusterCat reports perplexity from a bidirectional bigram class language model, which is richer than the unidirectional bigram models behind the perplexities reported by most other software.\nRicher models provide a better evaluation of the quality of clusters, having more sensitivity (power) to detect improvements.\nIf you want to directly compare the quality of clusters with a different program's output, you have a few options:\n\n1. Load another clustering using `--class-file`, and see what the other clustering's initial bidirectional bigram perplexity is before any words get exchanged.\n2. Use an external class-based language model.  These are usually two-sided (unlexicalized) models, so they favor two-sided clusterers.\n3. Evaluate on a downstream task.  This is best.\n\n\n## Contributions\nContributions are welcome, via [pull requests][].\n\n\n## Citation\nIf you use this software, please cite the following:\n\nDehdari, Jon, Liling Tan, and Josef van Genabith. 2016. [BIRA: Improved Predictive Exchange Word Clustering](http://www.aclweb.org/anthology/N16-1139.pdf).\nIn *Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL)*, pages 1169–1174, San Diego, CA, USA.  
Association for Computational Linguistics.\n\n    @inproceedings{dehdari-etal2016,\n     author    = {Dehdari, Jon  and  Tan, Liling  and  van Genabith, Josef},\n     title     = {{BIRA}: Improved Predictive Exchange Word Clustering},\n     booktitle = {Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL)},\n     month     = {June},\n     year      = {2016},\n     address   = {San Diego, CA, USA},\n     publisher = {Association for Computational Linguistics},\n     pages     = {1169--1174},\n     url       = {http://www.aclweb.org/anthology/N16-1139.pdf}\n    }\n\n[lgpl3]: https://www.gnu.org/copyleft/lesser.html\n[mpl2]: https://www.mozilla.org/MPL/2.0\n[c99]: https://en.wikipedia.org/wiki/C99\n[homebrew]: http://brew.sh\n[openmp]: https://en.wikipedia.org/wiki/OpenMP\n[predictive]: https://www.aclweb.org/anthology/P/P08/P08-1086.pdf\n[exchange algorithm]: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.2354\n[brown]: https://github.com/percyliang/brown-cluster\n[mkcls]: https://github.com/moses-smt/mgiza\n[phrasal]: https://github.com/stanfordnlp/phrasal\n[word2vec]: https://code.google.com/archive/p/word2vec/\n[word vectors]: https://en.wikipedia.org/wiki/Word_embedding\n[bl.ocks.org]: http://bl.ocks.org/jonsafari\n[JSON]: https://en.wikipedia.org/wiki/JSON\n[gist.github.com]: https://gist.github.com\n[pull requests]: https://help.github.com/articles/creating-a-pull-request\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjonsafari%2Fclustercat","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjonsafari%2Fclustercat","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjonsafari%2Fclustercat/lists"}