{"id":13697138,"url":"https://github.com/gregversteeg/bio_corex","last_synced_at":"2026-02-10T02:39:20.582Z","repository":{"id":50119970,"uuid":"81851076","full_name":"gregversteeg/bio_corex","owner":"gregversteeg","description":"A flexible version of CorEx developed for bio-data challenges that handles missing data, continuous/discrete variables, multi-CPU, overlapping structure, and includes visualizations","archived":false,"fork":false,"pushed_at":"2021-10-06T14:48:42.000Z","size":15431,"stargazers_count":137,"open_issues_count":14,"forks_count":30,"subscribers_count":13,"default_branch":"master","last_synced_at":"2024-08-03T18:21:48.717Z","etag":null,"topics":["information-theory","machine-learning","python","unsupervised-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gregversteeg.png","metadata":{"files":{"readme":"readme.md","changelog":"Changelog.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-02-13T17:20:36.000Z","updated_at":"2024-01-04T16:11:25.000Z","dependencies_parsed_at":"2022-09-04T10:31:49.375Z","dependency_job_id":null,"html_url":"https://github.com/gregversteeg/bio_corex","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gregversteeg%2Fbio_corex","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gregversteeg%2Fbio_corex/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gregversteeg%2Fbio_corex/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gregversteeg%2Fbio_corex/manife
sts","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gregversteeg","download_url":"https://codeload.github.com/gregversteeg/bio_corex/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224373629,"owners_count":17300532,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["information-theory","machine-learning","python","unsupervised-learning"],"created_at":"2024-08-02T18:00:53.002Z","updated_at":"2026-02-10T02:39:18.680Z","avatar_url":"https://github.com/gregversteeg.png","language":"Python","funding_links":[],"categories":["Models"],"sub_categories":["Embedding based Topic Models"],"readme":"# Bio CorEx: recover latent factors with Correlation Explanation (CorEx)\n\nThe principle of Total *Cor*-relation *Ex*-planation has recently been introduced as a way to reconstruct latent factors\nthat are informative about relationships in data. This project consists of python code to build these representations.\nWhile the methods are domain-agnostic, the version of CorEx presented here was designed to handle challenges inherent\nin several biomedical problems: missing data, continuous variables, and severely under-sampled data. \n\nA preliminary version of the technique is described in this paper.      \n[*Discovering Structure in High-Dimensional Data Through Correlation Explanation*](http://arxiv.org/abs/1406.1222), \nNIPS 2014.     
\nThis version uses theoretical developments described here:      \n[*Maximally Informative Hierarchical Representations of High-Dimensional Data*](http://arxiv.org/abs/1410.7404), \nAISTATS 2015.       \nFinally, the Bayesian approach implemented here resulted from work with Shirley Pepke and is described here:     \n[*Comprehensive discovery of subsample gene expression components by information explanation: therapeutic implications in cancer*](http://biorxiv.org/content/early/2016/09/19/043257), \nin BMC Medical Genomics (accepted).     \nYou can also see applications of this code to neuroscience data\n[*here*](http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=2600367), \n[*here*](https://www.researchgate.net/profile/Madelaine_Daianu2/publication/299377637_Relative_Value_of_Diverse_Brain_MRI_and_Blood-Based_Biomarkers_for_Predicting_Cognitive_Decline_in_the_Elderly/links/56f2b0bd08aea5a8982ff958.pdf), \nand [*here.*](https://www.researchgate.net/profile/Madelaine_Daianu2/publication/305726530_Information-Theoretic_Clustering_of_Neuroimaging_Metrics_Related_to_Cognitive_Decline_in_the_Elderly/links/57a8ab9d08aed76703f87777.pdf)\n\nFor sparse binary data, try [*CorEx topic*](http://github.com/gregversteeg/corex_topic/). \nThere is also a [linear CorEx](https://github.com/gregversteeg/LinearCorex) that is quite fast, and in the paper, [Low Complexity Gaussian Latent Factor Models and a Blessing of Dimensionality](https://arxiv.org/abs/1706.03353), we showed that CorEx exhibits unique advantages in under-sampled, high-dimensional data. \n\n### Dependencies\n\nCorEx only requires numpy and scipy. If you use OS X, I recommend installing the [Scipy Superpack](http://fonnesbeck.github.io/ScipySuperpack/).\n\nThe visualization capabilities in vis_corex.py require other packages: \n* matplotlib - Already in scipy superpack.\n* seaborn\n* pandas\n* [networkx](http://networkx.github.io)  - A network manipulation library. 
\n* [graphviz](http://www.graphviz.org) (Optional, for compiling produced .dot files into pretty graphs. The command line \ntools are called from vis_corex. Graphviz should be compiled with the triangulation library *gts* for best visual results).\n\n### Install\n\nTo install, download using [this link](https://github.com/gregversteeg/bio_corex/archive/master.zip) \nor clone the project by executing this command in your target directory:\n```\ngit clone https://github.com/gregversteeg/bio_corex.git\n```\nUse *git pull* to get updates. The code is under development. \nPlease contact me about issues.  \n\n### Issue to be resolved\n\nThere's a line in the corex.py that says:\n```\nsig_ml = sig_ml.clip(0.25)\n```\nThis was added after the 2015 AISTATS paper and it helped to avoid numerical errors on gene expression datasets. \nHowever, thanks to a diligent user, we discovered that the finance results in the AISTATS paper don't reproduce if you have this line.\nRemoving it returns the original behavior (but also leads to errors on some datasets). \nI plan to make this line into an optional hyper-parameter, and then add a discussion to Troubleshooting to invoke this if there are numerical errors. \nAlso see the discussion in the pull requests from a contributor who found in his dataset that making this quantity bigger was sometimes helpful. \nIn the meantime, I want people to be aware of it, and consider commenting out the line if you are getting poor results or making it bigger if you are getting numerical errors.\n\n## Example usage with command line interface\n\nThese examples use data included in the tests folder. `python vis_corex.py -h` gives an overview of the options. \n\n`python vis_corex.py data/test_data.csv`\n\nIn this simple example there are five variables, v1...v5 where v1, v2, v3 are in one cluster and v4, v5 are in another. \nLooking in \"corex_output/graphs\" you should see pdf files with the graphs (with graphviz). 
\n\n`python vis_corex.py data/test_big5.csv --layers=5,1 --missing=-1 -v --no_row_names -o big5`\n\nThis reads the CSV file containing some Big-5 personality survey data. It uses 5 hidden units (and associated clusters)\n at the first layer, and 1 at the second layer. \n Option -v gives verbose outputs. By default, it is assumed that the first column and row are labels, \nthis expectation can be changed with options. There are a few missing values specified with -1. Note that for discrete\ndata, the discrete values each variable takes have to be like 0,1,...\nFinally, all the output files are placed in the directory \"big5\". \n\nLooking in the directory \"big5/graphs\", you should see a pdf that shows the questions clustered into five groups. See\nthe [full raw data](http://personality-testing.info/_rawdata/BIG5.zip) for information on questions. \"Text_files\" \nsummarizes the clusters and gives the latent factor (or personality trait in this case) associated with each cluster, \nfor each sample (survey taker). \n\nHere are the options for the gene expression data cited above. \n```\npython vis_corex.py data/matrix.tcga_ov.geneset1.log2.varnorm.RPKM.txt --delimiter=' ' --layers=200,30,8,1 --dim_hidden=3 --max_iter=100 --missing=-1e6 -c -b -v -o output_folder --ram=8 --cpu=4\n```\nThe delimiter is set because the data is separated by spaces, not the default commas. \nThe layers match our specification for the paper. dim_hidden says that each latent factor can take three states\ninstead of the default of two. \nThe default expectation is discrete data. The -c option is used for continuous data. The *v* is for verbose output.\nThis dataset has a small number of samples so we turned on bayesian smoothing with -b, although this slows down computation.\nThe ram is approximate ram in GB of your machine and setting cpu=n lets you use however many cpus/cores are on your machine. \nThis took me over a day to run. 
You can try something faster by using layers=20,5,1 and reducing max_iter. \n\nAlso look in the \"relationships\" folder to see pairwise relationships between variables in the same group. You should \nsee that the variables are strongly dependent. The plot marker corresponds to the latent factor Yj for that group, and\nthe point color corresponds to p(yj|x) for that point.\n\n\n## Python API usage\n\n### Example\n\nThe API design is based on the scikit-learn package. You define a model (`model=Corex(with options here)`) then use\n the `model.fit(data)` method to fit it on data, then you can transform new data with `model.transform(new_data)`. \n The model has many other methods to access mutual information, measures of TC, and more. \n```python\nimport numpy as np\nimport corex as ce\n\nX = np.array([[0,0,0,0,0], # A matrix with rows as samples and columns as variables.\n              [0,0,0,1,1],\n              [1,1,1,0,0],\n              [1,1,1,1,1]], dtype=int)\n\nlayer1 = ce.Corex(n_hidden=2, dim_hidden=2, marginal_description='discrete', smooth_marginals=False)  \n# Define the number of hidden factors to use (n_hidden=2). \n# And each latent factor is binary (dim_hidden=2)\n# marginal_description can be 'discrete' or 'gaussian' if your data is continuous\n# smooth_marginals = True turns on Bayesian smoothing\nlayer1.fit(X)  # Fit on data. \n\nlayer1.clusters  # Each variable/column is associated with one Y_j\n# array([0, 0, 0, 1, 1])\nlayer1.labels[:, 0]  # Labels for each sample for Y_0\n# array([0, 0, 1, 1])\nlayer1.labels[:, 1]  # Labels for each sample for Y_1\n# array([0, 1, 0, 1])\nlayer1.tcs  # TC(X;Y_j) (all info measures reported in nats). \n# array([ 1.385,  0.692])\n# TC(X_Gj) \u003e=TC(X_Gj ; Y_j)\n# For this example, TC(X1,X2,X3)=1.386, TC(X4,X5) = 0.693\n```\n\nSuppose you want to transform some test data (that wasn't used in training). Assume that `X_test` is a matrix of such data. 
\n*Make sure the columns of `X_test` exactly match the columns of X.*  If the test data doesn't include all the same columns, you can specify missing values. Check `layer1.missing_values` or set missing_values=(some number) at training time so that you know how to set missing values in training data. For instance, if the default missing_values=-1 is in effect, you can put a -1 in your test data for any column that is missing. \nContinuing the example above, you would generate labels on test data as follows. \n```python\nX_test = np.array([[1,1,1,0,1]])  # 1 sample/row of data with the same 5 columns as example above\ny = layer1.transform(X_test, details=False)\n# array([[1, 0]]), I.e., Y_0 = 1 and Y_1 = 0\np, log_z = layer1.transform(X_test, details=True)\n# p is p(yj | x), the probability of each factor taking a certain value.\n# The shape is n_hidden, number of samples, dim_hidden\n#array([[[ 0. ,  1. ]],\n#      [[ 0.5,  0.5]]])\n# So Y_0 takes values 1 with probability 1 (it matches the training example)\n# But Y_1 takes values 0 or 1 with probability 0.5 (it is a mix of the two training examples)\ntest_tcs = np.mean(log_z[:,:,0], axis=1)\n# array([ 1.385, -6.216])\n# Compare this value to layer1.tcs = array([ 1.385,  0.692])\n# This tells us that the \"test TC\" for factor Y_0 is similar to the training data, \n# but the \"test TC\" for factor Y_1 is completely different. \n```\nI interpret negative values in the test TC as meaning *correlations that appeared in the training data do not appear in the test data*. But we haven't studied this phenomenon in depth. If you get a bunch of negative values for test TC, though, it signals a significant difference between your training and testing data. \n\n\n### Data format\n\nYou can specify the type of the variables by passing the option `marginal_description='discrete'` for discrete variables or\n`marginal_description='gaussian'` for continuous variables. 
\nFor the discrete version of CorEx, you must input a matrix of integers whose rows represent samples and whose columns\nrepresent different variables. The values must be integers `{0,1,...,k-1}` where k represents the maximum number of \nvalues that each variable, x_i can take. By default, entries equal to -1 are treated as missing. This can be \naltered by passing a *missing_values* argument when initializing CorEx. \n\"smooth_marginals\" tells whether to use Bayesian shrinkage estimators for marginal distributions to reduce noise.\nIt is turned on by default but is off in the example above (since it only has 4 samples, the smoothing will mess it up).\n\n### CorEx outputs\n\nAs shown in the example, `clusters` gives the variable clusters for each hidden factor `Y_j` and \n`labels` gives the labels for each sample for each `Y_j`. \nProbabilistic labels can be accessed with `p_y_given_x`. \n\nThe total correlation explained by each hidden factor, `TC(X;Y_j)`, is accessed with `tcs`. Outputs are sorted\nso that Y_0 is always the component that explains the highest TC. \nLike point-wise mutual information, you can define a point-wise total correlation measure for an individual sample, `x^l`     \n`TC(X = x^l;Y_j) == log Z_j(x)`   \nThis quantity is accessed with `log_z`. This represents the correlations explained by `Y_j` for an individual sample.\nA low (or even negative!) number can be obtained. This can be interpreted as a measure of how surprising an individual\nobservation is. This can be useful for anomaly detection. \n\nSee the main section of vis_corex.py for more ideas of how to do visualization.\n\n## Details\n\n### Computational complexity\n\nThis version has time and memory requirements like O(num. samples * num. variables * num. hidden units). By implementing\n mini-batch updates, we could eliminate the dependence on the number of samples. Sorry I haven't gotten to this yet. 
I\n have been able to run examples with thousands of variables, thousands of samples, and 100 latent factors on my laptop.\n It might also be important to check that your numpy implementation is linked to a good linear algebra library like \n lapack or BLAS. \n\n### Hierarchical CorEx\nThe simplest extension is to stack CorEx representations on top of each other. \n```\nlayer1 = ce.Corex(n_hidden=100)\nlayer2 = ce.Corex(n_hidden=10)\nlayer3 = ce.Corex(n_hidden=1)\nY1 = layer1.fit_transform(X)\nY2 = layer2.fit_transform(Y1.labels)\nY3 = layer3.fit_transform(Y2.labels)\n```\nThe sum of total correlations explained by each layer provides a successively tighter lower bound on TC(X) (see AISTATS paper). \n To assess how large your representations should be, look at quantities\nlike layer.tcs. Do all the Y_j's explain some correlation (i.e., all the TCs are significantly larger than 0)? If not\nyou should probably use a smaller representation.\n\n### Missing values\nYou can mark missing values by specifying, e.g., missing_values=-1 when initializing CorEx. CorEx seems robust to missing data.\nThis hasn't been extensively tested yet though, and we don't really understand the \neffect of data missing not at random. \n\n### Getting better results\nYou can use  the option smooth_marginals to turn on the use of Bayesian smoothing methods (off by default) for \nestimating the marginal distributions. This is slower, but reduces spurious correlations, especially if the number\nof samples is small (less than 200) or the number of variables or dim_hidden are big. \n\nAlso note that CorEx can find different local optima after different random restarts. You can run it k times and take\nthe best solution with the \"repeat\" option. \n\nWarning: in recent experiments on gene expression that contained lots of zero counts, we got bad results. (The paper had removed columns that included zero counts.)  
I'm not sure what the underlying cause is (bad data versus some issue that CorEx has with zero-inflated data), but I strongly recommend removing columns/genes with lots of zeros. \n\n### Troubleshooting visualization\n\nFor Mac users: \n\nTo get the visualization of the hierarchy looking nice sometimes takes a little effort. To get graphs to compile correctly do the following. \nUsing \"brew\" to install, you need to do \"brew install gts\" followed by \"brew install --with-gts graphviz\". \nThe (hacky) way that the visualizations are produced is the following. The code, vis_corex.py, produces a text file called \"graphs/graph.dot\". This just encodes the edges between nodes in dot format. Then, the code calls a command line utility called sfdp that is part of graphviz, \n```\nsfdp tree.dot -Tpdf -Earrowhead=none -Nfontsize=12  -GK=2 -Gmaxiter=1000 -Goverlap=False -Gpack=True -Gpackmode=clust -Gsep=0.02 -Gratio=0.7 -Gsplines=True -o nice.pdf\n```\nThese dot files can also be opened with OmniGraffle if you would like to be able to manipulate them by hand. \nIf you want, you can try to recompile graphs yourself with different options to make them look nicer. Or you can edit the dot files to get effects like colored nodes, etc. \n\nAlso, note that you can color nodes in the graphs by prepending a color to column label names in the CSV file.\nFor instance, blue_column_1_label_name will show column_1_label_name in blue in the graphs folder. Any matplotlib colors are allowed. \nSee the BIG5 data file and graphs produced by the command line utility. \n\nFor Ubuntu users:\n\nCredits: https://gitlab.com/graphviz/graphviz/issues/1237\n\n1. Remove any existing installation with `conda uninstall graphviz`. (If you did not install with Conda, you might need to do `sudo apt purge graphviz` and/or `pip uninstall graphviz`).\n    \n2. Run `sudo apt install libgts-dev`\n\n3. Run `sudo pkg-config --libs gts`\n    \n4. Run `sudo pkg-config --cflags gts`\n\n5. 
Download `graphviz-2.40.1.tar.gz` from [here](https://graphviz.gitlab.io/pub/graphviz/stable/SOURCES/graphviz.tar.gz)\n\n6. Navigate to the directory containing the download, and extract it with `tar -xvf graphviz-2.40.1.tar.gz` (or whatever the download is named, if it is a newer version).\n\n7. `cd` into the extracted folder (i.e., `cd graphviz-2.40.1`) and run `sudo ./configure --with-gts`\n\n8. Run `sudo make` in the folder\n\n9. Run `sudo make install` in the folder\n\n10. Reinstall library using `pip install graphviz`\n    \n\n### Other files that are produced\n*text_files/groups.txt*\nLists the variables in each group.\n\n*text_files/labels.txt*\nGives a column for each latent factor (in layer 1) and a row for each patient/sample. The entry is the value of the latent factor (0,…dim_hidden-1)\n\n*text_files/cont_labels.txt*\nGives a continuous number to sort each patient with respect to each latent factor. \n\n*relationships*\nFor each latent factor, it shows pairwise plots between the top genes in each group. Each point corresponds to a sample/patient and the color corresponds to the learned latent factor. \n\n### All options\nWhen you run vis_corex.py with the -h option, you get all the command line options. \n```\npython vis_corex.py -h\nUsage: vis_corex.py [options] data_file.csv \nIt is assumed that the first row and first column of the data CSV file are labels.\nUse options to indicate otherwise.\n\nOptions:\n  -h, --help            show this help message and exit\n\n  Input Data Format Options:\n    -c, --continuous    Input variables are continuous (default assumption is\n                        that they are discrete).\n    -t, --no_column_names\n                        We assume the top row is variable names for each\n                        column. 
This flag says that data starts on the first\n                        row and gives a default numbering scheme to the\n                        variables (1,2,3...).\n    -f, --no_row_names  We assume the first column is a label or index for\n                        each sample. This flag says that data starts on the\n                        first column.\n    -m MISSING, --missing=MISSING\n                        Treat this value as missing data. Default is -1e6. \n    -d DELIMITER, --delimiter=DELIMITER\n                        Separator between entries in the data, default is ','.\n\n  CorEx Options:\n    -l LAYERS, --layers=LAYERS\n                        Specify number of units at each layer: 5,3,1 has 5\n                        units at layer 1, 3 at layer 2, and 1 at layer 3\n    -k DIM_HIDDEN, --dim_hidden=DIM_HIDDEN\n                        Latent factors take values 0, 1..k. Default k=2\n    -b, --bayesian_smoothing\n                        Turn on Bayesian smoothing when estimating marginal\n                        distributions (p(x_i|y_j)). 
Slower, but reduces\n                        appearance of spurious correlations if the number of\n                        samples is \u003c 200 or if dim_hidden is large.\n    -r REPEAT, --repeat=REPEAT\n                        Run r times and return solution with best TC.\n\n  Output Options:\n    -o OUTPUT, --output=OUTPUT\n                        A directory to put all output files.\n    -v, --verbose       Print rich outputs while running.\n    -e MAX_EDGES, --edges=MAX_EDGES\n                        Show at most this many edges in graphs.\n    -q, --regraph       Don't re-run corex, just re-generate outputs (perhaps\n                        with edges option changed).\n\n  Computational Options:\n    -a RAM, --ram=RAM   Approximate amount of RAM to use (in GB).\n    -p CPU, --cpu=CPU   Number of cpus/cores to use.\n    -w MAX_ITER, --max_iter=MAX_ITER\n                        Max number of iterations to use.\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgregversteeg%2Fbio_corex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgregversteeg%2Fbio_corex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgregversteeg%2Fbio_corex/lists"}