
# EM-algorithm - Academic project

An academic study and implementation of the expectation–maximization (EM) algorithm in Python and R, with a comparison against K-means.

To start off, clone the project:
```shell
git clone https://github.com/Samashi47/EM-algorithm.git
```

Then:
```shell
cd EM-algorithm
```

# Python implementation

After cloning the project, go to the `Python-implementation` folder:

```shell
cd Python-implementation
```

Then, create your virtual environment:

**Windows**

```shell
py -3 -m venv .venv
```

**macOS/Linux**

```shell
python3 -m venv .venv
```

Then activate it:

**Windows**

```shell
.venv\Scripts\activate
```

**macOS/Linux**

```shell
. .venv/bin/activate
```

Install the dependencies with:

```shell
pip3 install -r requirements.txt
```

**To run the code:**
1. Open the Jupyter notebook in the **Python-implementation** folder and select the newly created `.venv` environment as the kernel.
2. Run the cells.
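
If Jupyter itself is not pulled in by `requirements.txt`, you may need to install it into the environment first; then launch the notebook server from the activated environment:

```shell
pip3 install notebook
jupyter notebook
```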

# R implementation

> [!NOTE]
>
> Here we assume that R is fully installed and configured on your computer.
>
> R Markdown doesn't require any further configuration to run in RStudio or VS Code, but for a richer experience in VS Code (live preview, generating HTML, LaTeX, and PDF files) you need a TeX distribution and pandoc on your computer. You can install pandoc from https://pandoc.org/installing.html

To start with the R implementation, first install the required packages from the R console:
```R
# base, methods, datasets, utils, grDevices, graphics, and stats ship with R;
# only the add-on packages need installing
install.packages(c("plyr", "mvtnorm", "ggplot2"))
```
You are then ready to run the implementations in the `.Rmd` files chunk by chunk.
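
Alternatively, you can render a whole document rather than running it chunk by chunk (a sketch assuming the `rmarkdown` package and pandoc are available; the filename below is hypothetical, substitute the actual `.Rmd` file from the repository):

```R
install.packages("rmarkdown")      # once, if not already installed
rmarkdown::render("analysis.Rmd")  # hypothetical filename; use the repo's actual .Rmd file
```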

**Usage**

To use the implementation, you first need to initialize starting values for the mean, covariance, and probabilities.

1. The mean is a matrix of dimensions (number of desired clusters, number of columns used for clustering), holding the mean of each column for each cluster.
2. The covariance is a tensor of dimensions (number of columns used for clustering, number of columns used for clustering, number of desired clusters), holding a covariance matrix between the dataset's columns for each cluster; in the code below it is represented as a named list of per-cluster covariance matrices.
3. The probabilities form a vector of length (number of desired clusters), giving the probability that a given data point belongs to each cluster; they should sum to 1.

To do that in code, we first compute the means of each column and the covariance matrix between columns, perturbed with a little uniform noise:
```R
library(plyr)
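# 'iris2' is not defined in this README; here we assume the four numeric columns of iris
iris2 <- iris[, 1:4]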

# Create starting values: perturb the sample moments with small uniform noise
Mu = daply(iris2, NULL, function(x) colMeans(x)) + runif(4, 0, 0.5)    # column means + noise
Cov = dlply(iris2, NULL, function(x) var(x) + diag(runif(4, 0, 0.5))) # covariance + diagonal noise
```

```R
# Dimension names for the matrix of initial means (3 clusters x 4 columns)
column.names <- colnames(iris2)
row.names <- c("Cluster 1", "Cluster 2", "Cluster 3")
```

Then we create a 2D array of means for the desired number of clusters, adding a little noise so that no two rows are identical, and a list of covariance matrices, one per cluster:
```R
initMu = array(c(Mu[1] + 0.1, Mu[1] + 0.2, Mu[1] + 0.3,
                 Mu[2] + 0.1, Mu[2] + 0.2, Mu[2] + 0.3,
                 Mu[3] + 0.1, Mu[3] + 0.2, Mu[3] + 0.3,
                 Mu[4] + 0.1, Mu[4] + 0.2, Mu[4] + 0.4),
               dim = c(3, 4), dimnames = list(row.names, column.names))
initCov <- list('Cluster 1' = Cov[[1]], 'Cluster 2' = Cov[[1]], 'Cluster 3' = Cov[[1]])
```

For the probabilities, we can initialize them manually (note that they sum to 1):
```R
initProbs = c(.1, .2, .7)
```

Or randomly, normalizing so that the weights sum to 1:
```R
initProbs = sort(runif(3, min = 0.1, max = 0.9))
initProbs = initProbs / sum(initProbs)  # rescale to a valid mixing distribution
```

Finally, we bundle the initialized parameters into a variable called `initParams`:
```R
initParams <- list(mu = initMu, var = initCov, probs = initProbs)
```
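
Before running the algorithm, a quick sanity check of the shapes described above can catch initialization mistakes early (a minimal sketch; the exact structure `gaussmixEM` expects is defined in the `.Rmd` files):

```R
stopifnot(all(dim(initParams$mu) == c(3, 4)))      # one row of means per cluster
stopifnot(length(initParams$var) == 3,             # one covariance matrix per cluster
          all(dim(initParams$var[[1]]) == c(4, 4)))
stopifnot(length(initParams$probs) == 3,
          abs(sum(initParams$probs) - 1) < 1e-8)   # mixing weights sum to 1
```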

And run the algorithm with:
```R
results = gaussmixEM(params = initParams, X = as.matrix(iris2), clusters = 3,
                     tol = 1e-10, maxits = 1500, showits = TRUE)
print(results)
```
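
Since the project also compares EM against K-means, a quick baseline with base R's `kmeans` on the same data might look like this (a sketch; cluster labels are arbitrary on both sides, so compare assignments only up to relabeling):

```R
set.seed(42)                                   # reproducible initialization
km <- kmeans(iris2, centers = 3, nstart = 25)  # K-means with 25 random restarts
table(km$cluster)                              # cluster sizes, for a rough comparison
```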

# References

- Martin Haugh. The EM Algorithm. Published 2015. https://www.columbia.edu/~mh2078/MachineLearningORFE/EM_Algorithm.pdf

- Henrik Hult. Lecture 8. https://www.math.kth.se/matstat/gru/Statistical%20inference/Lecture8.pdf

- Sean Borman. The Expectation Maximization Algorithm: A Short Tutorial. Published July 18, 2004. https://www.lri.fr/~sebag/COURS/EM_algorithm.pdf

- Tengyu Ma and Andrew Ng. CS229 Lecture Notes. Published May 13, 2019. https://cs229.stanford.edu/notes2020spring/cs229-notes8.pdf

- Brian Keng. The Expectation-Maximization Algorithm. Bounded Rationality. Published October 7, 2016. https://bjlkeng.io/posts/the-expectation-maximization-algorithm/