{"id":13400441,"url":"https://github.com/posterior/treecat","last_synced_at":"2025-10-29T05:23:15.458Z","repository":{"id":57458055,"uuid":"93913649","full_name":"posterior/treecat","owner":"posterior","description":"A Bayesian latent tree model of high-dimensional heterogeneous data","archived":false,"fork":false,"pushed_at":"2017-09-14T22:08:26.000Z","size":671,"stargazers_count":23,"open_issues_count":14,"forks_count":4,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-09-19T12:07:55.049Z","etag":null,"topics":["bayesian","machine-learning","mcmc","python","unsupervised-learning"],"latest_commit_sha":null,"homepage":"http://treecat.readthedocs.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/posterior.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-06-10T03:34:44.000Z","updated_at":"2022-01-20T03:18:52.000Z","dependencies_parsed_at":"2022-09-09T22:50:58.063Z","dependency_job_id":null,"html_url":"https://github.com/posterior/treecat","commit_stats":null,"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/posterior%2Ftreecat","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/posterior%2Ftreecat/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/posterior%2Ftreecat/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/posterior%2Ftreecat/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/posterior","download_url":"https://codeload.github.com/post
erior/treecat/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":221440184,"owners_count":16821599,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bayesian","machine-learning","mcmc","python","unsupervised-learning"],"created_at":"2024-07-30T19:00:52.066Z","updated_at":"2025-10-29T05:23:15.359Z","avatar_url":"https://github.com/posterior.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"![](https://cdn.rawgit.com/posterior/treecat/master/doc/logo.png)\n\n# TreeCat\n\n[![Docs](https://readthedocs.org/projects/treecat/badge/?version=latest)](http://treecat.readthedocs.io/en/latest/?badge=latest)\n[![Build Status](https://travis-ci.org/posterior/treecat.svg?branch=master)](https://travis-ci.org/posterior/treecat)\n[![Latest Version](https://badge.fury.io/py/pytreecat.svg)](https://pypi.python.org/pypi/pytreecat)\n[![DOI](https://zenodo.org/badge/93913649.svg)](https://zenodo.org/badge/latestdoi/93913649)\n\n## Intended Use\n\nTreeCat is an inference engine for machine learning and Bayesian inference.\nTreeCat is appropriate for analyzing medium-sized tabular data with\ncategorical and ordinal values, possibly with missing observations.\n\n|                        | TreeCat supports       |\n| ---------------------- | ---------------------- |\n| Feature Types          | categorical, ordinal   |\n| # Rows (n)             | 1000-100K              |\n| # Features (p)         | 10-1000                |\n| # Cells (n \u0026times; p)  | 10K-10M                |\n| # 
Categories           | 2-10ish                |\n| Max Ordinal            | 10ish                  |\n| Missing observations?  | yes                    |\n| Repeated observations? | yes                    |\n| Sparse data?           | no, use something else |\n| Unsupervised           | yes                    |\n| Semisupervised         | yes                    |\n| Supervised             | no, use something else |\n\n## Installing\n\nIf you already have [Numba](http://numba.pydata.org) installed,\nyou should be able to simply run\n\n```sh\npip install pytreecat\n```\n\nIf you're new to Numba, we recommend installing it using\n[Miniconda](https://conda.io/miniconda.html) or\n[Anaconda](https://www.continuum.io/downloads).\n\nIf you want to install TreeCat for development,\nthen clone the source code and create a new conda env:\n\n```sh\ngit clone git@github.com:posterior/treecat\ncd treecat\nconda env create -f environment.3.yml\nsource activate treecat3\npip install -e .\n```\n\n## Quick Start\n\n[comment]: # (When modifying this, also update readme_test.test_quickstart)\n\n1.  Format your data as a [`data.csv`](treecat/testdata/tiny_data.csv)\n    file with a header row.\n    It's fine to include extra columns that won't be used.\n\n    Contents of [`data.csv`](treecat/testdata/tiny_data.csv):\n\n    | title     | genre    | decade | rating |\n    | --------- | -------- | ------ | ------ |\n    | vertigo   | thriller | 1950s  | 5      |\n    | up        | family   | 2000s  | 3      |\n    | desk set  | comedy   | 1950s  | 4      |\n    | santapaws | family   | 2010s  |        |\n    | ...       | ...      | ...    | ...    |\n\n2.  
Generate two schema files\n    [`types.csv`](treecat/testdata/tiny_types.csv) and\n    [`values.csv`](treecat/testdata/tiny_values.csv)\n    using TreeCat's `guess-schema` command:\n\n    ```sh\n    $ treecat guess-schema data.csv types.csv values.csv\n    ```\n\n    Contents of [`types.csv`](treecat/testdata/tiny_types.csv):\n\n    | name   | type        | total | unique | singletons |\n    | ------ | ----------- | ----- | ------ | ---------- |\n    | title  |             |    11 |     11 |         11 |\n    | genre  | categorical |    11 |      7 |          4 |\n    | decade | categorical |    11 |      6 |          3 |\n    | rating | ordinal     |    10 |      5 |          2 |\n\n    Contents of [`values.csv`](treecat/testdata/tiny_values.csv):\n\n    | name   | value    | count |\n    | ------ | -------- | ----- |\n    | genre  | drama    |     3 |\n    | genre  | family   |     2 |\n    | genre  | fantasy  |     2 |\n    | decade | 1950s    |     3 |\n    | ...    | ...      |   ... |\n\n    You can manually fix any incorrectly guessed feature types,\n    or add/remove feature values.\n    TreeCat ignores any feature with an empty type field.\n\n3.  Import your CSV files into TreeCat's internal format.\n    We'll call our dataset `dataset.pkz` (a gzipped pickle file).\n\n    ```sh\n    $ treecat import-data data.csv types.csv values.csv '' dataset.pkz\n    ```\n\n    (The empty argument `''` is an optional structural prior that we ignore.)\n\n4.  Train an ensemble model on your dataset.\n    This typically takes ~15 minutes for a 1M-cell dataset.\n\n    ```sh\n    $ treecat train dataset.pkz model.pkz\n    ```\n\n5.  Load your trained model into a server:\n\n    ```python\n    from treecat.serving import serve_model\n    server = serve_model('dataset.pkz', 'model.pkz')\n    ```\n\n6.  
Run queries against the server.\n    For example, we can compute expectations\n    ```python\n    import numpy as np\n    samples = server.sample(100, evidence={'genre': 'drama'})\n    print(np.mean([s['rating'] for s in samples]))\n    ```\n    or explore feature structure through the latent correlation matrix\n    ```python\n    print(server.latent_correlation())\n    ```\n\n## Tuning Hyperparameters\n\nTreeCat requires tuning of two parameters:\n`learning_init_epochs` (analogous to the number of iterations) and\n`model_num_clusters` (the number of latent classes above each feature).\nThe easiest way to tune these is to run a grid search using the `treecat.validate` module\nwith a CSV file of example parameters.\n\nContents of [`tuning.csv`](treecat/testdata/tuning.csv):\n\n| model_num_clusters | learning_init_epochs |\n| ------------------ | -------------------- |\n|                  2 |                    2 |\n|                  2 |                    3 |\n|                  4 |                    2 |\n|                ... |                  ... 
|\n\n```sh\n# This reads parameters from tuning.csv and dumps results to tuning.pkz\n$ treecat.validate tune-csv dataset.pkz tuning.csv tuning.pkz\n```\n\nThe `tune-csv` command prints its results, but if you want to see them later, you can run\n\n```sh\n$ treecat.format cat tuning.pkz\n```\n\n## The Server Interface\n\nTreeCat's\n[server](https://github.com/posterior/treecat/blob/master/treecat/serving.py)\ninterface supports primitives for Bayesian inference and\ntools to inspect latent structure:\n\n- `server.sample(N, evidence=None)`\n  draws `N` samples from the joint posterior distribution over observable data,\n  optionally conditioned on `evidence`.\n\n- `server.logprob(rows, evidence=None)`\n  computes the posterior log probability of `rows`,\n  optionally conditioned on `evidence`.\n\n- `server.median(evidence)`\n  computes L1-loss-minimizing estimates, conditioned on `evidence`.\n\n- `server.observed_perplexity()`\n  computes the [perplexity](https://en.wikipedia.org/wiki/Perplexity)\n  (a soft measure of cardinality) of each observed feature.\n\n- `server.latent_perplexity()`\n  computes the perplexity of the latent class behind each observed feature.\n\n- `server.latent_correlation()`\n  computes the latent-latent correlation between each pair of latent variables.\n\n- `server.estimate_tree()`\n  computes a maximum a posteriori estimate of the latent tree structure.\n\n- `server.sample_tree(N)`\n  draws `N` samples from the posterior distribution over latent tree structures.\n\n## The Model\n\nTreeCat's generative model is closest to Zhang and Poon's Latent Tree Analysis [1],\nwith the notable difference that TreeCat fixes exactly one latent node per observed node.\nTreeCat is historically a descendant of Mansinghka et al.'s CrossCat [2], a model in which latent nodes (\"views\" or \"kinds\") are completely independent.\nTreeCat addresses the same kind of high-dimensional categorical distribution\nthat Dunson and Xing's mixture-of-product-multinomial models [3] 
address.\nWhile TreeCat currently supports only categorical and ordinal feature types,\nit is straightforward to generalize to other feature types with conjugate\npriors such as real (normal-inverse-chi-squared), integer (gamma-Poisson), and\nangular (von Mises).\nThis generalization places it in the same class of models for high-dimensional heterogeneous data as that of Valera et al. [4].\n\nLet `V` be a set of vertices (one vertex per feature).\u003cbr /\u003e\nLet `M` be the number of latent classes at each vertex (`model_num_clusters`).\u003cbr /\u003e\nLet `C[v]` be the dimension of the `v`th feature.\u003cbr /\u003e\nLet `N` be the number of datapoints.\u003cbr /\u003e\nLet `K[n,v]` be the number of observations of feature `v` in row `n`\n(e.g. 1 for a categorical variable, 0 for missing data, or\n`k` for an ordinal value with minimum 0 and maximum `k`).\n\nTreeCat is the following generative model:\n```python\nE ~ UniformSpanningTree(V)    # An undirected tree.\nfor v in V:\n    Pv[v] ~ Dirichlet(size = [M], alpha = 1/2)\nfor (u,v) in E:\n    Pe[u,v] ~ Dirichlet(size = [M,M], alpha = 1/(2*M))\n    assume(Pv[u] == sum(Pe[u,v], axis = 1))\n    assume(Pv[v] == sum(Pe[u,v], axis = 0))\nfor v in V:\n    for i in 1:M:\n        Q[v,i] ~ Dirichlet(size = [C[v]])\nfor n in 1:N:\n    for v in V:\n        X[n,v] ~ Categorical(Pv[v])\n    for (u,v) in E:\n        (X[n,u],X[n,v]) ~ Categorical(Pe[u,v])\n    for v in V:\n        Z[n,v] ~ Multinomial(Q[v,X[n,v]], count = K[n,v])\n```\nwhere we've avoided adding an arbitrary root to the tree, and instead presented\nthe model as a manifold with overlapping variables and constraints.\n\n## The Inference Algorithm\n\nThis package implements fully Bayesian MCMC inference using subsample-annealed\ncollapsed Gibbs sampling. 
There are two pieces of latent state that are sampled:\n\n- Latent class assignments for each row for each vertex (feature).\n  These are sampled by a single-site collapsed Gibbs sampler with a linear\n  subsample-annealing schedule.\n\n- The latent tree structure is sampled by randomly removing an edge\n  and replacing it. Since removing an edge splits the graph into two\n  connected components, the only replacement locations that are feasible\n  are those that re-connect the graph.\n\nThe single-site Gibbs sampler uses dynamic programming to simultaneously sample\nthe complete latent assignment vector for each row. A dynamic programming\nprogram is created each time the tree structure changes. This program is\ninterpreted by various virtual machines for different purposes (training the\nmodel, sampling from the posterior, computing log probability of the posterior).\nThe virtual machine for training is JIT-compiled using Numba.\n\n## References\n\n1. Nevin L. Zhang, Leonard K. M. Poon (2016) \u003cbr /\u003e\n   [Latent Tree Analysis](https://arxiv.org/pdf/1610.00085.pdf)\n2. Vikash Mansinghka, Patrick Shafto, Eric Jonas, Cap Petschulat, Max Gasner, Joshua B. Tenenbaum (2015) \u003cbr /\u003e\n   [CrossCat: A Fully Bayesian Nonparametric Method for Analyzing Heterogeneous, High Dimensional Data](https://arxiv.org/pdf/1512.01272)\n3. David B. Dunson, Chuanhua Xing (2012) \u003cbr /\u003e\n   [Nonparametric Bayes Modeling of Multivariate Categorical Data](https://dx.doi.org/10.1198%2Fjasa.2009.tm08439)\n4. Isabel Valera, Melanie F Pradier, Zoubin Ghahramani (2017) \u003cbr /\u003e\n   [General Latent Feature Modeling for Data Exploration Tasks](https://arxiv.org/pdf/1707.08352)\n\n## License\n\nCopyright (c) 2017 Fritz Obermeyer. 
\u003cbr /\u003e\nTreeCat is licensed under the [Apache 2.0 License](/LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fposterior%2Ftreecat","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fposterior%2Ftreecat","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fposterior%2Ftreecat/lists"}