{"id":18418971,"url":"https://github.com/smrfeld/python_prob_pca_tutorial","last_synced_at":"2025-04-13T06:12:10.449Z","repository":{"id":115478562,"uuid":"324631837","full_name":"smrfeld/python_prob_pca_tutorial","owner":"smrfeld","description":"Tutorial for probabilistic PCA in Python and Mathematica","archived":false,"fork":false,"pushed_at":"2020-12-26T20:41:59.000Z","size":1225,"stargazers_count":2,"open_issues_count":0,"forks_count":2,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-02-16T06:44:59.246Z","etag":null,"topics":["mathematica","pca","python","tutorial"],"latest_commit_sha":null,"homepage":"https://medium.com/practical-coding/the-simplest-generative-model-you-probably-missed-c840d68b704","language":"Mathematica","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/smrfeld.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-12-26T20:41:13.000Z","updated_at":"2024-10-06T20:45:49.000Z","dependencies_parsed_at":null,"dependency_job_id":"de644f2a-62ac-4f4b-aa55-7ed73f6f1b0d","html_url":"https://github.com/smrfeld/python_prob_pca_tutorial","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smrfeld%2Fpython_prob_pca_tutorial","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smrfeld%2Fpython_prob_pca_tutorial/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smrfeld%2Fpython_prob_pca_tutorial/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smrfeld%2Fpython_prob_pca_tutorial/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/smrfeld","download_url":"https://codeload.github.com/smrfeld/python_prob_pca_tutorial/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248670436,"owners_count":21142904,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["mathematica","pca","python","tutorial"],"created_at":"2024-11-06T04:15:16.032Z","updated_at":"2025-04-13T06:12:10.408Z","avatar_url":"https://github.com/smrfeld.png","language":"Mathematica","readme":"# Tutorial on probabilistic PCA in Python and Mathematica\n\n[You can read a complete tutorial on Medium here.](https://medium.com/practical-coding/the-simplest-generative-model-you-probably-missed-c840d68b704)\n\n## Running\n\n* Python: `python prob_pca.py`. The figures are output to the [figures_py](figures_py) directory.\n* Mathematica: Run the notebook `prob_pca.nb`. The figures are output to the [figures_ma](figures_ma) directory.\n\n## Description\n\nYou can find more information in the original paper: [\"Probabilistic principal component analysis\" by Tipping \u0026 Bishop](https://www.jstor.org/stable/2680726?seq=1#metadata_info_tab_contents).\n\n### Import data\n\nLet's import and plot some 2D data:\n```\ndata = import_data()\n\nd = data.shape[1]\n\nprint(\"\\n---\\n\")\n\nmu_ml = np.mean(data,axis=0)\nprint(\"Data mean:\")\nprint(mu_ml)\n\ndata_cov = np.cov(data,rowvar=False)\nprint(\"Data cov:\")\nprint(data_cov)\n```\nYou can see the plotting functions in the complete repo. Visualizing the data shows the 2D distribution:\n\n\u003cimg src=\"figures_ma/data_scatter.png\" alt=\"drawing\" width=\"400\"/\u003e\n\n\u003cimg src=\"figures_ma/data.png\" alt=\"drawing\" width=\"400\"/\u003e\n\n### Max likelihood\n\nAssume the latent distribution has the form:\n\n\u003cimg src=\"figures_math/latent.png\" alt=\"drawing\" height=\"100%\"/\u003e\n\nThe visibles are sampled from the conditional distribution:\n\n\u003cimg src=\"figures_math/vis_from_hid.png\" alt=\"drawing\" height=\"100%\"/\u003e\n\nFrom these, the marginal distribution for the visibles is:\n\n\u003cimg src=\"figures_math/marginal.png\" alt=\"drawing\" height=\"100%\"/\u003e\n\u003cimg src=\"figures_math/c_mat.png\" alt=\"drawing\" height=\"100%\"/\u003e\n\nWe are interested in maximizing the log likelihood:\n\n\u003cimg src=\"figures_math/log_likelihood.png\" alt=\"drawing\" height=\"100%\"/\u003e\n\nwhere `S` is the covariance matrix of the data. The solution is obtained using the eigendecomposition of the data covariance matrix `S`:\n\n\u003cimg src=\"figures_math/eigendecomp.png\" alt=\"drawing\" height=\"100%\"/\u003e\n\nWhere the columns of `U` are the eigenvectors, and `D` is a diagonal matrix with the eigenvalues `\\lambda`. Let the number of dimensions of the dataset be `d` (in this case, `d=2`). Let the number of latent variables be `q`, then the maximum likelihood solution is:\n\n\u003cimg src=\"figures_math/ml_sigma.png\" alt=\"drawing\" height=\"100%\"/\u003e\n\u003cimg src=\"figures_math/ml_weight.png\" alt=\"drawing\" height=\"100%\"/\u003e\n\nand `\\mu` are just set to the mean of the data.\n\nHere the eigenvalues `\\lambda` are sorted from high to low, with `\\lambda_1` being the largest and `\\lambda_d` the smallest. Here `U_q` is the matrix where columns are the corresponding eigenvectors of the `q` largest eigenvalues, and `D_q` is a diagonal matrix with the largest eigenvalues. `R` is an arbitrary rotation matrix - here we can simply take `R=I` for simplicity (see Bishop for a detailed discussion). We have discarded the dimensions beyond `q` - the ML variance `\\sigma^2` is then the average variance of these discarded dimensions. \n\nHere is the corresponding Python code to calculate these max-likelihood solutions:\n```\n# No hidden variables \u003c no visibles = d\nq = 1\n\n# Variance\nlambdas, eigenvecs = np.linalg.eig(data_cov)\nidx = lambdas.argsort()[::-1]   \nlambdas = lambdas[idx]\neigenvecs = - eigenvecs[:,idx]\nprint(eigenvecs)\n# print(eigenvecs @ np.diag(lambdas) @ np.transpose(eigenvecs))\n\nvar_ml = (1.0 / (d-q)) * sum([lambdas[j] for j in range(q,d)])\nprint(\"Var ML:\")\nprint(var_ml)\n\n# Weight matrix\nuq = eigenvecs[:,:q]\nprint(\"uq:\")\nprint(uq)\n\nlambdaq = np.diag(lambdas[:q])\nprint(\"lambdaq\")\nprint(lambdaq)\n\nweight_ml = uq * np.sqrt(lambdaq - var_ml * np.eye(q))\nprint(\"Weight matrix ML:\")\nprint(weight_ml)\n```\n\n### Sampling latent variables\n\nAfter determining the ML parameters, we can sample the hidden units from the visible according to:\n\n\u003cimg src=\"figures_math/hid_from_vis.png\" alt=\"drawing\" height=\"100%\"/\u003e\n\u003cimg src=\"figures_math/m_mat.png\" alt=\"drawing\" height=\"100%\"/\u003e\n\nYou can implement it in Python as follows:\n```\nact_hidden = sample_hidden_given_visible(\n    weight_ml=weight_ml,\n    mu_ml=mu_ml,\n    var_ml=var_ml,\n    visible_samples=data\n    )\n```\nwhere we have defined:\n```\ndef sample_hidden_given_visible(\n    weight_ml : np.array, \n    mu_ml : np.array,\n    var_ml : float,\n    visible_samples : np.array\n    ) -\u003e np.array:\n\n    q = weight_ml.shape[1]\n    m = np.transpose(weight_ml) @ weight_ml + var_ml * np.eye(q)\n\n    cov = var_ml * np.linalg.inv(m)\n    act_hidden = []\n    for data_visible in visible_samples:\n        mean = np.linalg.inv(m) @ np.transpose(weight_ml) @ (data_visible - mu_ml)\n        sample = np.random.multivariate_normal(mean,cov,size=1)\n        act_hidden.append(sample[0])\n    \n    return np.array(act_hidden)\n```\n\nThe result is data which looks a lot like the standard normal distribution:\n\n\u003cimg src=\"figures_ma/sample_hidden_from_visible.png\" alt=\"drawing\" width=\"400\"/\u003e\n\n### Sample new data points\n\nWe can sample new data points by first drawing new samples from the hidden distribution (a standard normal):\n```\nmean_hidden = np.full(q,0)\ncov_hidden = np.eye(q)\n\nno_samples = len(data)\nsamples_hidden = np.random.multivariate_normal(mean_hidden,cov_hidden,size=no_samples)\n```\n\n\u003cimg src=\"figures_ma/sample_hidden_std_normal.png\" alt=\"drawing\" width=\"400\"/\u003e\n\nand then sample new visible samples from those:\n```\nact_visible = sample_visible_given_hidden(\n    weight_ml=weight_ml,\n    mu_ml=mu_ml,\n    var_ml=var_ml,\n    hidden_samples=samples_hidden\n    )\n\nprint(\"Covariance visibles (data):\")\nprint(data_cov)\nprint(\"Covariance visibles (sampled):\")\nprint(np.cov(act_visible,rowvar=False))\n\nprint(\"Mean visibles (data):\")\nprint(np.mean(data,axis=0))\nprint(\"Mean visibles (sampled):\")\nprint(np.mean(act_visible,axis=0))\n```\nwhere we have defined:\n```\ndef sample_visible_given_hidden(\n    weight_ml : np.array, \n    mu_ml : np.array,\n    var_ml : float,\n    hidden_samples : np.array\n    ) -\u003e np.array:\n\n    d = weight_ml.shape[0]\n\n    act_visible = []\n    for data_hidden in hidden_samples:\n        mean = weight_ml @ data_hidden + mu_ml\n        cov = var_ml * np.eye(d)\n        sample = np.random.multivariate_normal(mean,cov,size=1)\n        act_visible.append(sample[0])\n    \n    return np.array(act_visible)\n```\n\nThe result are data points that closely resemble the data distribution:\n\u003cimg src=\"figures_ma/sample_visible_from_hidden_scatter.png\" alt=\"drawing\" width=\"400\"/\u003e\n\n\u003cimg src=\"figures_ma/sample_visible_from_hidden.png\" alt=\"drawing\" width=\"400\"/\u003e\n\n### Rescaling the latent distribution\n\nFinally, we can rescale the latent variables to have any Gaussian distribution:\n\n\u003cimg src=\"figures_math/latent_resc.png\" alt=\"drawing\" height=\"100%\"/\u003e\n\nFor example:\n```\nmean_hidden = np.array([120.0])\ncov_hidden = np.array([[23.0]])\nno_samples = len(data)\nsamples_hidden = np.random.multivariate_normal(mean_hidden,cov_hidden,size=no_samples)\n```\n\n\u003cimg src=\"figures_ma/sample_hidden_from_rescaled_normal.png\" alt=\"drawing\" width=\"400\"/\u003e\n\nWe can simply transform the parameters and then **still** sample new valid visible samples from those:\n\n\u003cimg src=\"figures_math/ml_resc.png\" alt=\"drawing\" height=\"100%\"/\u003e\n\nNota that `\\sigma^2` is unchanged from before.\n\nIn Python we can do the rescaling:\n```\nweight_ml_rescaled = weight_ml @ np.linalg.inv(sl.sqrtm(cov_hidden))\nmu_ml_rescaled = mu_ml - weight_ml_rescaled @ mean_hidden\n\nprint(\"Mean ML rescaled:\")\nprint(mu_ml_rescaled)\n\nprint(\"Weight matrix ML rescaled:\")\nprint(weight_ml_rescaled)\n```\nand then repeat the sampling with the new weights \u0026 mean:\n```\nact_visible = sample_visible_given_hidden(\n    weight_ml=weight_ml_rescaled,\n    mu_ml=mu_ml_rescaled,\n    var_ml=var_ml,\n    hidden_samples=samples_hidden\n    )\n```\n\nAgain, the samples look like they could come from the original data distribution:\n\n\u003cimg src=\"figures_ma/rescaled_sample_visible_from_hidden_scatter.png\" alt=\"drawing\" width=\"400\"/\u003e\n\n\u003cimg src=\"figures_ma/rescaled_sample_visible_from_hidden.png\" alt=\"drawing\" width=\"400\"/\u003e","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsmrfeld%2Fpython_prob_pca_tutorial","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsmrfeld%2Fpython_prob_pca_tutorial","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsmrfeld%2Fpython_prob_pca_tutorial/lists"}