{"id":25067015,"url":"https://github.com/shoyamanishi/mlnotes","last_synced_at":"2026-02-14T00:36:53.899Z","repository":{"id":187800792,"uuid":"251182169","full_name":"ShoYamanishi/MLNotes","owner":"ShoYamanishi","description":"Some notes worth sharing that I made through my self-study on machine learning.","archived":false,"fork":false,"pushed_at":"2020-05-29T05:40:26.000Z","size":7796,"stargazers_count":5,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-02-06T20:35:28.752Z","etag":null,"topics":["deep-neural-networks","machine-learning","self-study"],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ShoYamanishi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2020-03-30T02:32:28.000Z","updated_at":"2024-03-22T18:18:05.000Z","dependencies_parsed_at":"2023-08-12T07:13:16.426Z","dependency_job_id":null,"html_url":"https://github.com/ShoYamanishi/MLNotes","commit_stats":null,"previous_names":["shoyamanishi/mlnotes"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ShoYamanishi%2FMLNotes","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ShoYamanishi%2FMLNotes/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ShoYamanishi%2FMLNotes/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ShoYamanishi%2FMLNotes/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ShoYamanishi","download_url":"https://codeload.githu
b.com/ShoYamanishi/MLNotes/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246481004,"owners_count":20784458,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-neural-networks","machine-learning","self-study"],"created_at":"2025-02-06T20:29:03.050Z","updated_at":"2025-09-22T03:32:23.933Z","avatar_url":"https://github.com/ShoYamanishi.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# MLNOTES\n\nThis is a collection of expository documents about machine learning and neural networks\nthat I made during the Corona quarantine,\nwhich gave me a convenient time to sort out scribbles I had made\nin the past and to reorganize them into TeX documents.\nI learned the material mainly from the following books:\n\n* Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 1st edition, 2007.\n\n* Yoshua Bengio, Ian Goodfellow, Aaron Courville. Deep Learning. MIT Press, 2016.\n\n* Simon J. D. Prince. Computer Vision: Models, Learning, and Inference. Cambridge University Press, 2012.\n\n* David Barber. Bayesian Reasoning and Machine Learning. 
Cambridge University Press, 2011.\n\n## Index\n\n* [PRML 1.58](docs/expectation_of_sample_variance/exp_sample_var.pdf)\n* [Basics Cheat Sheet](docs/basics_cheat_sheat/basics_cheat_sheat.pdf)\n* [EM Algorithm](docs/em_algorithm/em_alg.pdf)\n* [Variational Inference and Mean Field Approximation](docs/variational_inference/variational_inference.pdf)\n* [Latent Dirichlet Allocation](docs/latent_direchlet_allocation/lda.pdf)\n* [Sampling](docs/sampling/sampling.pdf)\n* [Graphical Models and Belief Propagation](docs/graphical_models/graphical_models.pdf)\n* [Sequential Models (HMM, Baum-Welch, Viterbi, Kalman Filter, RTS-Smoother)](docs/baum_welch_viterbi_kalman/baum_welch_viterbi_kalman.pdf)\n* [Expectation Propagation](docs/expectation_propagation/expectation_propagation.pdf)\n* [DNN, CNN, RNN, LSTM, Attention, and Transformer](docs/dnn_cnn_rnn_lstm_attention_transformer/dnn_cnn_rnn_lstm_attention.pdf)\n\nI made them because it was simply fun, and because it solidified my understanding of\nthose topics.\nThe official purpose would be:\n\n* To provide a quick refresher to my future self without going through books and articles and without using paper and a pencil to fill the gaps between equations.\n\n* To help other independent-minded self-learners with important topics of machine learning.\n\nThe documents are written so that the gaps between the equations along the derivations are minimized, to avoid using paper and a pencil to fill them,\nat the cost of redundancy and length.\n\n# Topics\n\n## [PRML 1.58](docs/expectation_of_sample_variance/exp_sample_var.pdf)\n\nThis is my take on why the sample variance is biased.\nEquation (1.58) in PRML is apparently the first problem many readers,\nincluding me, encounter.\nIt is on p. 27, Chapter 1, Introduction.\nIt was not apparent to me why it holds, and no proof is given in the book.\n\n## [Basics Cheat Sheet](docs/basics_cheat_sheat/basics_cheat_sheat.pdf)\n\nDeriving the MLE and the loss functions for training from KL 
divergence.\nFacts about simple linear Gaussian regression, binary classification, and\nmulti-class classification.\n\n\n## [EM Algorithm](docs/em_algorithm/em_alg.pdf)\n\nThis is a personal note, a memory aid on EM for my future self.\nIt is also supplementary material that explains why the maximization\nstep works for MLE with i.i.d. samples,\nwhich was not obvious to me from the textbooks.\nPRML Chap. 9.4 explains the EM algorithm with the standard plot of\nlower bounds, with θ along the horizontal axis.\nComputer Vision by S. J. D. Prince, Chap. 7.3, has a better and more succinct\nexplanation of EM.\nThe purpose of this document is to explain the EM algorithm without\nsignificant gaps throughout the course of the deductions, at the cost of lengthiness,\nand to explain why the M-step works for MLE,\nwhich is omitted in the textbooks.\n\n\n## [Variational Inference and Mean Field Approximation](docs/variational_inference/variational_inference.pdf)\n\nThis is a quick refresher on variational inference and mean field\napproximation.\nThis document is meant to be self-contained and to enhance the\ncourse notes by D. Blei with my own annotations, which fill the gaps\nbetween deductions so that my future self can refresh this topic quickly\nwithout a pencil and paper.\nVariational inference and the mean field approximation are explained in\nPRML[2] 10.1 and Barber[1] 28.3, 28.4,\nbut I like the course notes by D. 
Blei, available online at Princeton, best,\nas they give the flow of explanation from the problem setting down to the\noptimality of the factorized approximator for the exponential families.\nHowever, they are a bit too terse for me.\n\n## [Latent Dirichlet Allocation](docs/latent_direchlet_allocation/lda.pdf)\n\nThis is an expository document on latent Dirichlet allocation in the full\nBayesian setting for the β matrix.\nIt is an outcome of my own self-study of the original article,\nand it has turned out to be very good, streamlined study material\nfor the EM algorithm and variational Bayes, worth documenting\nfor my own better understanding.\nThe main purpose of this document is to quickly and effortlessly refresh\nmy memory in the future, as my own memory aid.\nAnother purpose is to fill the gap between the original article and the blog\narticles available on the Internet.\nThe original article is too terse, and it took me a lot of paper-and-pencil\nwork to comprehend the contents.\nIt also touches only briefly on the full Bayesian treatment.\nOn the other hand, the blog articles focus on a quick grasp of the concept\nwith rich illustrations, and rigorous mathematical treatment is usually\nomitted.\nThis document is characterized as follows.\n\n* Comprehensive mathematical treatment from the modeling down to just before implementation, for both training and inference.\n\n* Fully Bayesian β with prior η.\n\n* Small gaps between two adjacent equations throughout the course of the deductions, at the cost of lengthiness.\n\n* Discussion of the possibility of using the columns of β as word embeddings with non-informative priors.\n\n\n## [Sampling](docs/sampling/sampling.pdf)\n\nThis is personal expository material on sampling, mainly for my future self\nto quickly refresh the topics.\nIt also contains explanations of the topics that were unclear to me\nin the books and articles.\nThe following topics are covered.\n\n* Basic sampling: from a uniform distribution to a 
particular distribution.\n\n* Rejection sampling\n\n* Importance sampling\n\n* Uni/Bivariate Gaussian distribution: Box-Muller algorithm\n\n* Univariate Gaussian distribution with rejection sampling: Ziggurat algorithm\n\n* Multivariate full-covariance Gaussian distribution MCMC (Metropolis-Hastings)\n\n* Gibbs sampling\n\n* MCMC with Hamiltonian dynamics\n\nThe highlights are the comprehensive explanation of the Ziggurat algorithm, which is\nused to sample from multivariate normal distributions, and a rigorous formulation\nof MCMC using Hamiltonian dynamics.\nMost articles and books put emphasis on Hamiltonian dynamics and its numerical\nintegration, but a more rigorous formulation of MCMC using Hamiltonian dynamics is often\nnot well explained.\nIn particular, the formulation of a proposal function, an acceptance function,\nthe transition probability distribution, and ergodicity are not well discussed.\nThis document focuses on those topics rather than on the Hamiltonian dynamics and\nthe numerical simulation.\n\n\n## [Graphical Models and Belief Propagation](docs/graphical_models/graphical_models.pdf)\nThis document summarizes some key points about probabilistic graphical models.\n\n* Conditional independence\n\n* D-Separation\n\n* Markov Blanket\n\nIt also treats belief propagation in chains and trees for marginal distributions and MAP.\n\n\n## [Sequential Models (HMM, Baum-Welch, Viterbi, Kalman Filter, RTS-Smoother)](docs/baum_welch_viterbi_kalman/baum_welch_viterbi_kalman.pdf)\n\nThese are personal notes, a memory aid on Hidden Markov Models and Linear Dynamical Systems.\nSpecifically, they cover the following topics.\n\n* Baum-Welch EM algorithm\n* Viterbi algorithm\n* Kalman Filter\n* Rauch-Tung-Striebel smoother and EM algorithm\n\nChapter 13 of PRML is an excellent source for HMM (Baum-Welch, Viterbi)\nand the Kalman Filter, but not so good for the Kalman smoother (RTS smoother).\nIn particular, the derivation of p(z_n, z_{n+1} | x_1, ..., x_N),\nwhich is required for the EM algorithm, is a bit shaky 
between (13.103) and (13.104).\nTo derive the RTS smoother, I used the excellent course notes by Professor Särkkä of Aalto Univ.\nAlso, Chap. 24 of Barber contains comprehensive material on LDS,\nbut it is a bit difficult to understand, and I personally do not like the notation style.\n\n## [Expectation Propagation](docs/expectation_propagation/expectation_propagation.pdf)\n\nThis is an expository document on expectation propagation for my future self.\nIt aims to be self-contained.\nIt covers the following three topics:\n\n* general expectation propagation with the exponential family\n\n* detailed explanation of the clutter problem\n\n* detailed explanation of loopy belief propagation\n\n\nDuring my own learning I found the following subtle but important points, which are not well explained\nin the existing literature. This document puts the emphasis on those points.\n\n### KL-divergence takes proper (normalized) density functions.\nThe algorithm depends on the minimization of the KL divergence, to which two proper (normalized) densities\nmust be given, but we approximate a conditional q(θ) ~= p(x|θ) where x is observed.\nThis is not normalized, and a careful conversion is needed when applying the KL divergence.\n\n### Distinction between *moment* and *natural* parameters\nThe algorithm operates on\nthe moments, which are not necessarily the natural parameters for the underlying model.\nFor example, a Gaussian distribution takes the 1st-order moment\nas the mean parameter, but the 2nd-order moment is different from the variance.\n\n### Careful treatment of normalization coefficients (partition function)\nThroughout the algorithm,\nfactors are added to and removed from the current approximations. For those operations the normalization\ncoefficients are carefully maintained.\n\n### *Moment matching* requires some tricks.\n
The moment matching for the example clutter problem requires\nsome tricks, which are not explained well in the existing literature.\n\nMinka is the original and seminal article on expectation propagation.\nIt is too concise as study material because a lot of details are omitted. It presents\nthe clutter problem, but the updated moments are presented without details.\nPRML follows the same style as Minka, but\nthe details of updating the approximation while maintaining the normalization coefficient (partition function)\nare omitted.\nBarber briefly touches on belief propagation in relation to expectation propagation\nin section 28.7.\nThe course notes by Honkela at Helsinki Univ. give a very nice explanation.\nHowever, the treatment of the normalization coefficients is not thorough.\nThe lecture video by Simon Barthelmé\nat Centre International de Rencontres Mathématiques gives a good explanation of cavity, hybrid,\nnatural parameters and moment parameters.\n\nNone of the materials above are detailed enough for normies like me to study this topic, and that\nwas my motivation to write this up for my future self and possibly others.\n\n## [DNN, CNN, RNN, LSTM, Attention, and Transformer](docs/dnn_cnn_rnn_lstm_attention_transformer/dnn_cnn_rnn_lstm_attention.pdf)\nThis document describes the following.\n\n* DNN backprop mechanism\n\n* CNN forward propagation and backprop in a multi-channel 2-dimensional convolution kernel with step size $s$.\n\n* RNN backprop\n\n* The rationale for LSTM as a generalization of leaky units with learned parameters to cope with the vanishing gradient problem.\n\n* Traditional attention on top of a bi-directional RNN\n\n* Transformer's multi-head attention part.\n\n## Contrastive Divergence\n[planned]\n\n## 
RBM\n[planned]","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshoyamanishi%2Fmlnotes","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshoyamanishi%2Fmlnotes","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshoyamanishi%2Fmlnotes/lists"}