{"id":22319359,"url":"https://github.com/jayinai/ml-interview","last_synced_at":"2025-04-12T17:45:37.245Z","repository":{"id":37768176,"uuid":"89018481","full_name":"jayinai/ml-interview","owner":"jayinai","description":"Preparing for machine learning interviews","archived":false,"fork":false,"pushed_at":"2022-11-07T06:14:54.000Z","size":72,"stargazers_count":902,"open_issues_count":3,"forks_count":220,"subscribers_count":36,"default_branch":"master","last_synced_at":"2025-04-03T20:11:10.494Z","etag":null,"topics":["interview-preparation","machine-learning"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jayinai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-04-21T19:48:17.000Z","updated_at":"2025-03-07T12:20:27.000Z","dependencies_parsed_at":"2022-09-16T16:00:14.176Z","dependency_job_id":null,"html_url":"https://github.com/jayinai/ml-interview","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jayinai%2Fml-interview","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jayinai%2Fml-interview/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jayinai%2Fml-interview/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jayinai%2Fml-interview/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jayinai","download_url":"https://codeload.github.com/jayinai/ml-interview/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://
github.com","kind":"github","repositories_count":248609545,"owners_count":21132915,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["interview-preparation","machine-learning"],"created_at":"2024-12-04T00:05:46.469Z","updated_at":"2025-04-12T17:45:37.222Z","avatar_url":"https://github.com/jayinai.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# This repo is deprecated; check out the latest [Nailing Machine Learning Concepts](https://github.com/jayinai/nail-ml-concept)\n\nThis repository covers how to prepare for machine learning interviews, mainly\nin the format of questions \u0026 answers. Aside from machine learning knowledge,\nother crucial aspects include:\n\n* [Explain your resume](#explain-your-resume)\n* [SQL](#sql)\n\nGo directly to [machine learning](#machine-learning)\n\n\n## Explain your resume\n\nYour resume should specify interesting ML projects you were involved in,\nand **quantitatively** show your contribution. Consider the following comparison:\n\n\u003e Trained a machine learning system\n\nvs.\n\n\u003e Trained a deep vision system (SqueezeNet) that has 1/30 model size, 1/3 training\n\u003e time, 1/5 inference time, and 2x faster convergence compared with traditional\n\u003e ConvNet (e.g., ResNet)\n\nWe can all tell which one will catch the interviewer's eye and better\nshowcase your ability.\n\nIn the interview, be sure to clearly explain what you've done. 
Spend some time going\nover your resume before the interview.\n\n\n## SQL\n\nAlthough you don't have to be a SQL expert for most machine learning positions,\ninterviewers might ask you some SQL-related questions, so it helps to refresh\nyour memory beforehand. Some good SQL resources are:\n\n* [W3schools SQL](https://www.w3schools.com/sql/)\n* [SQLZOO](http://sqlzoo.net/)\n\n\n## Machine learning\n\nFirst, it's always a good idea to review [Chapter 5](http://www.deeplearningbook.org/contents/ml.html) \nof the deep learning book, which covers machine learning basics.\n\n\n* [Linear regression](#linear-regression)\n* [Logistic regression](#logistic-regression)\n* [KNN](#knn)\n* [SVM](#svm)\n* [Naive Bayes]\n* [Decision tree](#decision-tree)\n* [Bagging](#bagging)\n* [Random forest](#random-forest)\n* [Boosting](#boosting)\n* [Stacking](#stacking)\n* [Clustering]\n* [MLP](#mlp)\n* [CNN](#cnn)\n* [RNN and LSTM](#rnn-and-lstm)\n* [word2vec](#word2vec)\n* [Generative vs discriminative](#generative-vs-discriminative)\n* [Parametric vs Nonparametric](#paramteric-vs-nonparametric)\n\n\n\n### Linear regression\n\n* how to learn the parameters: minimize the cost function\n* how to minimize the cost function: gradient descent\n* regularization: \n    - L1 (lasso): can shrink certain coefficients to exactly zero, thus performing feature selection\n    - L2 (ridge): shrinks all coefficients toward zero without setting any exactly to zero; often a strong default, though which penalty works better depends on the data\n    - combined (Elastic Net): a weighted mix of the L1 and L2 penalties\n* assumes a linear relationship between the features and the label\n* can add polynomial and interaction features to add non-linearity\n\n![lr](http://scikit-learn.org/stable/_images/sphx_glr_plot_cv_predict_thumb.png)\n\n[back to top](#machine-learning)\n\n\n### Logistic regression\n\n* Generalized linear model (GLM) for classification problems\n* Apply the sigmoid function to the output of a linear model, squeezing the output\ninto the range (0, 1) \n* Threshold to make a prediction: if the output \u003e .5, predict 1; otherwise predict 0\n* 
a special case of softmax regression, which handles multi-class problems\n\n[back to top](#machine-learning)\n\n### KNN\n\nGiven a data point, we find its K nearest data points (neighbors) using a chosen\ndistance metric (e.g., Euclidean distance). For classification, we take the majority label\nof the neighbors; for regression, we take the mean of their label values.\n\nNote that KNN technically requires no training; all computation happens at\ninference time. This can be computationally expensive since each test example\nneeds to be compared with every training example to see how close they are.\n\nApproximate methods can speed up inference by\npartitioning the training data into regions.\n\nNote that when K is 1 or another small number, the model is prone to overfitting (high variance), while\nwhen K equals the number of data points or another large number, the model is prone to underfitting (high bias).\n\n![KNN](https://cambridgecoding.files.wordpress.com/2016/03/training_data_only_99_1.png?w=610)\n\n[back to top](#machine-learning)\n\n\n### SVM\n\n* can perform linear or nonlinear classification, as well as outlier detection (unsupervised)\n* large margin classifier: we want not just any decision boundary, but the one\nthat is as far from the closest training points as possible\n* the closest training examples are called the support vectors, since they are the points\nthat determine where the decision boundary is drawn\n* SVMs are sensitive to feature scaling\n\n![svm](https://qph.ec.quoracdn.net/main-qimg-675fedee717331e478ecfcc40e2e4d38)\n\n\n[back to top](#machine-learning)\n\n\n### Decision tree\n\n* Non-parametric, supervised learning algorithm\n* Given the training data, a decision tree algorithm divides the feature space into\nregions. 
For inference, we first find which\nregion the test data point falls in, and take the mean label value (regression)\nor the majority label (classification) of the training points in that region.\n* **Construction**: top-down; at each step, choose a variable to split the data such that the \ntarget values within each region are as homogeneous as possible. Two common\nmetrics are Gini impurity and information gain; the choice rarely matters much in practice.\n* Advantage: simple to understand \u0026 interpret, mirrors human decision making\n* Disadvantage: \n    - can overfit easily (and generalize poorly) if we don't limit the depth of the tree\n    - can be non-robust: a small change in the training data can lead to a totally different tree\n    - instability: sensitive to training set rotation due to its orthogonal decision boundaries\n\n![decision tree](http://www.fizyka.umk.pl/~wduch/ref/kdd-tut/d-tree-iris.gif)\n\n[back to top](#machine-learning)\n\n\n### Bagging\n\nTo address overfitting, we can use an ensemble method called bagging (bootstrap aggregating),\nwhich reduces the variance of the underlying learning algorithm. Bagging can be applied\nto decision trees or other algorithms.\n\nHere is a [great illustration](http://scikit-learn.org/stable/auto_examples/ensemble/plot_bias_variance.html#sphx-glr-auto-examples-ensemble-plot-bias-variance-py) of a single estimator vs. bagging\n\n![bagging](http://scikit-learn.org/stable/_images/sphx_glr_plot_bias_variance_001.png)\n\n* Bagging is when sampling is performed *with* replacement. 
When sampling is performed *without* replacement, it's called pasting.\n* Bagging is popular not only because it boosts performance, but also because individual learners can be trained in parallel and scale well\n* Ensemble methods work best when the learners are as independent from one another as possible\n* Voting: soft voting (predict probabilities and average over all individual learners) often works better than hard voting\n* out-of-bag instances (about 37% of the training set for each learner) can act as a validation set for bagging\n\n\n\n[back to top](#machine-learning)\n\n\n### Random forest\n\nRandom forest improves bagging further by adding some randomness. In random forest,\non top of bagging, only a random subset of features is considered at each split when constructing a tree.\nThe benefit is that random forest **decorrelates** the trees. \n\nFor example, suppose we have a dataset. There is one very predictive feature, and a couple\nof moderately predictive features. In bagged trees, most of the trees\nwill use this very predictive feature in the top split, making most of the trees\nlook similar, **and highly correlated**. Averaging many highly correlated results won't lead\nto a large reduction in variance compared with uncorrelated results. \nIn random forest, for each split we only consider a subset of the features and therefore\nreduce the variance even further by introducing more uncorrelated trees.\n\nI wrote a [notebook](notebooks/bag-rf-var.ipynb) to illustrate this point.\n\nIn practice, tuning random forest entails having a large number of trees (the more the better, but\nalways consider the computation constraint). Also tune `min_samples_leaf` (the minimum number of\nsamples at a leaf node) to control the tree size and overfitting. Always cross-validate the parameters. \n\n**Feature importance**\n\nIn a decision tree, important features are likely to appear closer to the root of the tree. 
We can get\na feature's importance for random forest by computing the average depth at which it appears across all\ntrees in the forest.\n\n\n[back to top](#machine-learning)\n\n\n### Boosting\n\n**How it works**\n\nBoosting builds on weak learners in an iterative fashion. In each iteration,\na new learner is added, while all existing learners are kept unchanged. All learners\nare weighted based on their performance (e.g., accuracy), and after a weak learner\nis added, the data are re-weighted: examples that are misclassified gain weight,\nwhile examples that are correctly classified lose weight. Thus, future weak learners\nfocus more on examples that previous weak learners misclassified.\n\n\n**Difference from random forest (RF)**\n\n* RF grows trees **in parallel**, while Boosting is sequential\n* RF reduces variance, while Boosting reduces errors by reducing bias\n\n\n**XGBoost (Extreme Gradient Boosting)**\n\n\n\u003e XGBoost uses a more regularized model formalization to control overfitting, which gives it better performance\n\n[back to top](#machine-learning)\n\n\n### Stacking\n\n* Instead of using trivial functions (such as hard voting) to aggregate the predictions from individual learners, train a model to perform this aggregation\n* First, split the training set into two subsets: the first subset is used to train the learners in the first layer\n* Next, the first-layer learners are used to make predictions (meta features) on the second subset, and those predictions are used to train another model (to obtain the weights of the different learners) in the second layer\n* We can train multiple models in the second layer, but this entails splitting the original dataset into 3 parts\n\n![stacking](http://www.kdnuggets.com/wp-content/uploads/backward-propagation-stacker-models.jpg)\n\n[back to top](#machine-learning)\n\n\n### MLP\n\nA feedforward neural network where we have multiple layers. 
In each layer we\ncan have multiple neurons, and each neuron in the next layer is a linear/nonlinear\ncombination of all the neurons in the previous layer. In order to train the network\nwe backpropagate the errors layer by layer. In theory, an MLP with enough hidden units can approximate any continuous function.\n\n![mlp](http://neuroph.sourceforge.net/tutorials/images/MLP.jpg)\n\n[back to top](#machine-learning)\n\n### CNN\n\nThe Conv layer is the building block of a Convolutional Network. The Conv layer consists\nof a set of learnable filters (such as 5 * 5 * 3, width * height * depth). During the forward\npass, we slide (or more precisely, convolve) each filter across the input and compute the dot \nproduct at every position. Learning again happens when the network backpropagates the error layer by layer.\n\nInitial layers capture low-level features such as angles and edges, while later\nlayers learn combinations of the low-level features from the previous layers \nand can therefore represent higher-level features, such as shapes and object parts.\n\n![CNN](http://www.kdnuggets.com/wp-content/uploads/dnn-layers.jpg)\n\n[back to top](#machine-learning)\n\n### RNN and LSTM\n\nRNN is another paradigm of neural network where we have different layers of cells,\nand each cell takes as input not only the cell from the previous layer, but also the previous\ncell within the same layer. This gives RNNs the power to model sequences. \n\n![RNN](http://karpathy.github.io/assets/rnn/diags.jpeg)\n\nThis seems great, but in practice vanilla RNNs are hard to train due to exploding/vanishing gradients, which \nare caused by repeated multiplication by the same matrix. To solve this, we can use \na variation of RNN, called long short-term memory (LSTM), which is capable of learning\nlong-term dependencies. 
\n\nThe math behind LSTM can be pretty complicated, but intuitively LSTM introduces:\n\n* input gate\n* output gate\n* forget gate\n* memory cell (internal state)\n\nLSTM resembles human memory: it forgets old stuff (old internal state * forget gate) \nand learns from new input (input node * input gate)\n\n![lstm](http://deeplearning.net/tutorial/_images/lstm_memorycell.png)\n\n[back to top](#machine-learning)\n\n\n### word2vec\n\n* Shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words\n* Takes as input a large corpus and produces a vector space, typically of several hundred\ndimensions, in which each word in the corpus is assigned a vector\n* The key idea is context: words that often occur in the same context should have similar (or sometimes opposite)\nmeanings.\n* Two flavors\n    - continuous bag of words (CBOW): the model predicts the current word given a window of surrounding context words\n    - skip gram: predicts the surrounding context words using the current word\n\n![word2vec](https://deeplearning4j.org/img/countries_capitals.png)\n\n[back to top](#machine-learning)\n\n\n### Generative vs discriminative\n\n* Discriminative algorithms model *p(y|x; w)*, that is, given an example x and the learned\nparameters w, the probability that y belongs to a specific class. A discriminative algorithm\ndoesn't care about how the data was generated; it simply categorizes a given example\n* Generative algorithms try to model *p(x|y)* (together with the class prior *p(y)*), that is, the distribution of features given\nthat the example belongs to a certain class. A generative algorithm models how the data was\ngenerated.\n\n\u003e Given a training set, an algorithm like logistic regression or\n\u003e the perceptron algorithm (basically) tries to find a straight line—that is, a\n\u003e decision boundary—that separates the elephants and dogs. 
Then, to classify\n\u003e a new animal as either an elephant or a dog, it checks on which side of the\n\u003e decision boundary it falls, and makes its prediction accordingly.\n\n\u003e Here’s a different approach. First, looking at elephants, we can build a\n\u003e model of what elephants look like. Then, looking at dogs, we can build a\n\u003e separate model of what dogs look like. Finally, to classify a new animal, we\n\u003e can match the new animal against the elephant model, and match it against\n\u003e the dog model, to see whether the new animal looks more like the elephants\n\u003e or more like the dogs we had seen in the training set.\n\n[back to top](#machine-learning)\n\n\n### Paramteric vs Nonparametric\n\n* A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model.\n* A model where the number of parameters is not determined prior to training is called a nonparametric model. Nonparametric does not mean that such models have NO parameters! On the contrary, nonparametric models (can) become more and more complex with an increasing amount of data.\n\n[back to top](#machine-learning)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjayinai%2Fml-interview","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjayinai%2Fml-interview","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjayinai%2Fml-interview/lists"}