{"id":25007855,"url":"https://github.com/samashi47/em-algorithm","last_synced_at":"2026-04-20T13:03:14.096Z","repository":{"id":222681410,"uuid":"745873024","full_name":"Samashi47/EM-algorithm","owner":"Samashi47","description":"Expectation–Maximization (EM) algorithm implementation in R and Python, and a comparison with K-means.","archived":false,"fork":false,"pushed_at":"2024-03-21T23:43:58.000Z","size":5550,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-05T02:55:54.591Z","etag":null,"topics":["clustering","em-algorithm","kmeans","machine-learning","python3","r"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Samashi47.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-01-20T12:05:28.000Z","updated_at":"2024-07-23T17:03:50.000Z","dependencies_parsed_at":"2024-02-15T17:05:36.318Z","dependency_job_id":null,"html_url":"https://github.com/Samashi47/EM-algorithm","commit_stats":null,"previous_names":["samashi47/em-algorithm"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Samashi47%2FEM-algorithm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Samashi47%2FEM-algorithm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Samashi47%2FEM-algorithm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Samashi47%2FEM-algorithm/manifests","owner_url":"https:
//repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Samashi47","download_url":"https://codeload.github.com/Samashi47/EM-algorithm/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246262613,"owners_count":20749175,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clustering","em-algorithm","kmeans","machine-learning","python3","r"],"created_at":"2025-02-05T02:56:00.834Z","updated_at":"2026-04-20T13:03:09.032Z","avatar_url":"https://github.com/Samashi47.png","language":"Jupyter Notebook","readme":"# EM-algorithm - Academic project\n\nAn academic study and implementation of the expectation–maximization (EM) algorithm in Python and R.\n\nTo start off, clone the project:\n```shell\ngit clone https://github.com/Samashi47/EM-algorithm.git\n```\n\nThen:\n```shell\ncd EM-algorithm\n```\n\n# Python implementation\n\nAfter cloning the project, go to the `Python-implementation` folder:\n\n```shell\ncd Python-implementation\n```\n\nThen, create your virtual environment:\n\n**Windows**\n\n```shell\npy -3 -m venv .venv\n```\n\n**macOS/Linux**\n\n```shell\npython3 -m venv .venv\n```\n\nThen activate it:\n\n**Windows**\n\n```shell\n.venv\\\Scripts\\\activate\n```\n\n**macOS/Linux**\n\n```shell\n. .venv/bin/activate\n```\n\nRun the following command to install the dependencies:\n\n```shell\npip3 install -r requirements.txt\n```\n\n**To run the code:**\n1. Select the kernel in the Jupyter notebook in the **Python-implementation** folder.\n2. 
Run the cells.\n\n# R implementation\n\n\u003e [!NOTE]\n\u003e\n\u003e Here we assume that R is fully installed and configured on your computer.\n\u003e\n\u003e R Markdown doesn't require any further configuration to run in RStudio or VS Code, but for a richer experience in VS Code (live preview, generating HTML, LaTeX, and PDF files) you need a TeX distribution and Pandoc on your computer. You can install Pandoc from https://pandoc.org/installing.html\n\nTo start with the R implementation, first install the required packages from the R console (the base packages ship with R, so only the add-on packages need installing):\n```R\ninstall.packages(c(\"plyr\", \"mvtnorm\", \"ggplot2\"))\n```\nThen you are ready to run the implementations in the .rmd files chunk by chunk.\n\n**Use**\n\nTo use the implementation, you first need to initialize starting values for the means, covariances, and mixture probabilities.\n\n1. The mean is a matrix of dimensions (number of clusters, number of columns used for clustering), holding the mean of each column for each cluster.\n2. The cov is a tensor of dimensions (number of columns used for clustering, number of columns used for clustering, number of clusters), holding one covariance matrix of the dataset's columns per cluster.\n3. 
The probs is a vector of length equal to the number of clusters, giving the probability that a given data point belongs to each cluster.\n\nTo do that in code, we first generate a list of means for each column, and a covariance matrix between columns:\n```R\nlibrary(plyr)\n\n# Create starting values\nMu = daply(iris2, NULL, function(x) colMeans(x)) + runif(4, 0, 0.5)\nCov = dlply(iris2, NULL, function(x) var(x) + diag(runif(4, 0, 0.5)))\n```\n\n```R\ncolumn.names \u003c- colnames(iris2)\nrow.names \u003c- c(\"Cluster 1\", \"Cluster 2\", \"Cluster 3\")\n```\n\nThen we create a 2D array of means for the number of clusters, with noise added so that no two rows are identical, and a tensor of covariance matrices for the number of clusters:\n```R\ninitMu = array(c(Mu[1] + 0.1, Mu[1] + 0.2, Mu[1] + 0.3, Mu[2] + 0.1, Mu[2] + 0.2, Mu[2] + 0.3, Mu[3] + 0.1, Mu[3] + 0.2, Mu[3] + 0.3, Mu[4] + 0.1, Mu[4] + 0.2, Mu[4] + 0.4), dim = c(3, 4), dimnames = list(row.names, column.names))\ninitCov \u003c- list('Cluster 1' = Cov[[1]], 'Cluster 2' = Cov[[1]], 'Cluster 3' = Cov[[1]])\n```\n\nFor the probabilities, we can initialize them manually:\n```R\ninitProbs = c(.1, .2, .7)\n```\n\nOr randomly, normalizing so they sum to 1:\n```R\ninitProbs \u003c- runif(3)\ninitProbs \u003c- initProbs / sum(initProbs)\n```\n\nFinally, we collect the initialized parameters in a variable called `initParams`:\n```R\ninitParams \u003c- list(mu = initMu, var = initCov, probs = initProbs)\n```\n\nAnd run the algorithm with:\n```R\nresults = gaussmixEM(params = initParams, X = as.matrix(iris2), clusters = 3, tol = 1e-10, maxits = 1500, showits = TRUE)\nprint(results)\n```\n\n# References\n\n- Martin Haugh. The EM Algorithm. Published 2015. https://www.columbia.edu/~mh2078/MachineLearningORFE/EM_Algorithm.pdf\n\n- Henrik Hult. Lecture 8. https://www.math.kth.se/matstat/gru/Statistical%20inference/Lecture8.pdf\n\n- Sean Borman. The Expectation Maximization Algorithm: A Short Tutorial. Published July 18, 2004. https://www.lri.fr/~sebag/COURS/EM_algorithm.pdf\n\n- Tengyu Ma and Andrew Ng. 
CS229 Lecture notes. Published May 13, 2019. https://cs229.stanford.edu/notes2020spring/cs229-notes8.pdf\n\n- Keng B. The Expectation-Maximization Algorithm. Bounded Rationality. Published October 7, 2016. https://bjlkeng.io/posts/the-expectation-maximization-algorithm/","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsamashi47%2Fem-algorithm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsamashi47%2Fem-algorithm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsamashi47%2Fem-algorithm/lists"}