- For a virtualization solution leveraging Hyper-V, try WSL [Windows Subsystem for Linux]: https://gist.github.com/leodutra/a6cebe11db5414cdaedc6e75ad194a00
- Accessing the NYU HPC [Prince Cluster - SoHo, DUMBO]: https://wikis.nyu.edu/display/NYUHPC/High+Performance+Computing+at+NYU

## NOTES / IMPORTANT INSIGHTS
- https://github.com/donnemartin/data-science-ipython-notebooks
- Generating click-bait using Recurrent Neural Networks: word-by-word prediction [char-rnn] + minimizing prediction error.
- Limited-memory BFGS: https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm
- Wide and Deep Recommender Systems: https://arxiv.org/abs/1606.07792
- Distributed TensorFlow and Horovod.
- Dependency injection for end-to-end machine learning projects.
- A guide to writing clean machine learning code: https://towardsdatascience.com/clean-machine-learning-code-bd32bd0e9212

## 6-14-2020
- AutoHAS: Differentiable Hyper-parameter and Architecture Search: https://arxiv.org/pdf/2006.03656.pdf
- Hindsight Experience Replay: https://arxiv.org/pdf/1707.01495.pdf

## 6-8-2020
- Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video: https://papers.nips.cc/paper/8299-unsupervised-scale-consistent-depth-and-ego-motion-learning-from-monocular-video
- Stand-Alone Self-Attention in Vision Models: https://papers.nips.cc/paper/8302-stand-alone-self-attention-in-vision-models

## 6-6-2020
- Computer Vision on Mars: https://www.ri.cmu.edu/pub_files/pub4/matthies_larry_2007_1/matthies_larry_2007_1.pdf
- "Valuation", Aswath Damodaran @ NYU Stern: https://www.youtube.com/watch?v=uH-ffKIgb38

## 6-5-2020
## 6-4-2020
## 6-2-2020
- The Curious Case of Neural Text Degeneration: https://arxiv.org/pdf/1904.09751.pdf
- Beam Search Strategies for Neural Machine Translation: https://www.aclweb.org/anthology/W17-3207.pdf

## 5-31-2020
- An overview of Bayesian analysis: http://lethalletham.com/Letham_bayesian_analysis.pdf
- Re-Examining Linear Embeddings for High-Dimensional Bayesian Optimization: https://arxiv.org/abs/2001.11659
- Forecasting at scale (time series) - Facebook Prophet: https://peerj.com/preprints/3190.pdf

## 5-29-2020
- Approximation Schemes for ReLU Regression: https://arxiv.org/pdf/2005.12844.pdf
- Distributed Algorithms for Covering, Packing and Maximum Weighted Matching: https://arxiv.org/pdf/2005.13628.pdf

## 5-27-2020
## 5-25-2020
- The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds: https://dl.acm.org/doi/pdf/10.14778/3389133.3389135

## 5-24-2020
- NLP and knowledge graphs - generate word embeddings from knowledge graphs.
- While training with Adam gives fast convergence, the resulting model often generalizes worse than one trained with SGD with momentum. Another issue is that even though Adam has adaptive learning rates, its performance still improves with a good learning rate schedule. Especially early in training, it is beneficial to use a lower learning rate to avoid divergence: at the start the model weights are random, so the resulting gradients are not very reliable, and a learning rate that is too large can make the model take overly large steps and never settle on decent weights. Once the model overcomes these initial stability issues, the learning rate can be increased to speed up convergence. This process is called learning rate warm-up; one version of it is described in the paper "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour".
- What is perplexity? What is its place in NLP?

```
Perplexity expresses the degree of confusion a model has when predicting: more entropy = more confusion. It is used to evaluate language models in NLP. A good language model assigns a higher probability to the right prediction, and therefore has lower perplexity.
```

- What is the problem with ReLU?

```
1] Exploding gradients (solved by gradient clipping).
2] Dying ReLU: no learning if the activation is 0 (solved by parametric ReLU).
3] The mean and variance of the activations are not 0 and 1 (partially solved by subtracting around 0.5 from the activation; better explained in the fastai videos).
```
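The warm-up idea above can be sketched as a schedule function. The base rate, ramp length, and inverse-square-root decay below are illustrative choices, not the exact schedule from the paper:

```python
# Linear learning-rate warm-up followed by inverse-sqrt decay.
# base_lr and warmup_steps are illustrative values, not from the paper.
def warmup_lr(step, base_lr=0.1, warmup_steps=1000):
    if step < warmup_steps:
        # Ramp linearly from ~0 to base_lr while gradients are still noisy.
        return base_lr * (step + 1) / warmup_steps
    # After warm-up, decay so later steps take smaller, safer updates.
    return base_lr * (warmup_steps / step) ** 0.5

print(warmup_lr(0))     # tiny rate at the very first step
print(warmup_lr(999))   # peak rate at the end of warm-up
print(warmup_lr(4000))  # decayed rate late in training
```

In practice the schedule is queried once per optimizer step; most frameworks wrap this in a learning-rate-scheduler object.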
## 5-23-2020
- Programming Tensor Cores in CUDA: https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
- Bias-variance decompositions using XGBoost: https://devblogs.nvidia.com/bias-variance-decompositions-using-xgboost/
- Run XGBoost: decreasing test error.

```
import os.path
import sys
import time

import numpy as np
import pandas
import xgboost as xgb

if sys.version_info[0] >= 3:
    from urllib.request import urlretrieve
else:
    from urllib import urlretrieve

data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz"
dmatrix_train_filename = "higgs_train.dmatrix"
dmatrix_test_filename = "higgs_test.dmatrix"
csv_filename = "HIGGS.csv.gz"
train_rows = 10500000
test_rows = 500000
num_round = 1000

plot = True

# Return train/test xgboost DMatrix objects, cached on disk.
def load_higgs():
    if (os.path.isfile(dmatrix_train_filename)
            and os.path.isfile(dmatrix_test_filename)):
        dtrain = xgb.DMatrix(dmatrix_train_filename)
        dtest = xgb.DMatrix(dmatrix_test_filename)
        if dtrain.num_row() == train_rows and dtest.num_row() == test_rows:
            print("Loading cached dmatrix...")
            return dtrain, dtest

    if not os.path.isfile(csv_filename):
        print("Downloading higgs file...")
        urlretrieve(data_url, csv_filename)

    # Column 0 is the label; columns 1-28 are the features.
    df_higgs_train = pandas.read_csv(csv_filename, dtype=np.float32,
                                     nrows=train_rows, header=None)
    dtrain = xgb.DMatrix(df_higgs_train.iloc[:, 1:29],
                         df_higgs_train.iloc[:, 0])
    dtrain.save_binary(dmatrix_train_filename)
    df_higgs_test = pandas.read_csv(csv_filename, dtype=np.float32,
                                    skiprows=train_rows, nrows=test_rows,
                                    header=None)
    dtest = xgb.DMatrix(df_higgs_test.iloc[:, 1:29],
                        df_higgs_test.iloc[:, 0])
    dtest.save_binary(dmatrix_test_filename)
    return dtrain, dtest


dtrain, dtest = load_higgs()
param = {}
param['objective'] = 'binary:logitraw'
param['eval_metric'] = 'error'
param['tree_method'] = 'gpu_hist'
param['silent'] = 1

print("Training with GPU ...")
tmp = time.time()
gpu_res = {}
xgb.train(param, dtrain, num_round, evals=[(dtest, "test")],
          evals_result=gpu_res)
gpu_time = time.time() - tmp
print("GPU Training Time: %s seconds" % (str(gpu_time)))

print("Training with CPU ...")
param['tree_method'] = 'hist'
tmp = time.time()
cpu_res = {}
xgb.train(param, dtrain, num_round, evals=[(dtest, "test")],
          evals_result=cpu_res)
cpu_time = time.time() - tmp
print("CPU Training Time: %s seconds" % (str(cpu_time)))

if plot:
    import matplotlib.pyplot as plt
    min_error = min(min(gpu_res["test"][param['eval_metric']]),
                    min(cpu_res["test"][param['eval_metric']]))
    gpu_iteration_time = [x / (num_round * 1.0) * gpu_time
                          for x in range(0, num_round)]
    cpu_iteration_time = [x / (num_round * 1.0) * cpu_time
                          for x in range(0, num_round)]
    plt.plot(gpu_iteration_time, gpu_res['test'][param['eval_metric']],
             label='Tesla P100')
    plt.plot(cpu_iteration_time, cpu_res['test'][param['eval_metric']],
             label='2x Haswell E5-2698 v3 (32 cores)')
    plt.legend()
    plt.xlabel('Time (s)')
    plt.ylabel('Test error')
    plt.axhline(y=min_error, color='r', linestyle='dashed')
    plt.margins(x=0)
    plt.ylim((0.23, 0.35))
    plt.show()
```

## 5-22-2020
- Variational Inference for NLP: http://nlp.cs.berkeley.edu/tutorials/variational-tutorial-slides.pdf

## 5-20-2020
- Quantum computing for epidemiology - ideal for exponentially growing problems? https://www.youtube.com/watch?v=zOGNoDO7mcU
- A very interesting idea at the intersection of Computer Science and Physics.

## Research Labs
- Microsoft Station Q, University of California Santa Barbara (UCSB)
- Google Quantum AI Lab
- IBM Quantum Computation Center

## 5-16-2020
- Word alignment and the Expectation Maximization algorithm: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.421.5497&rep=rep1&type=pdf

## 4-22-2020
- Sequence-to-sequence modelling: how to measure quality for (say) machine translation / text transduction tasks; how good is the BLEU metric?

## 4-6-2020
- https://github.com/ashwinpn/ML-Learnings/blob/master/RL.pdf
- DeepFake detection discussion: https://www.kaggle.com/c/deepfake-detection-challenge/discussion/140236

## 4-5-2020
- Expectation Maximization (Baum-Welch) for probabilistic parsing (PCFG, lexicalized PCFG)

## 4-4-2020
- BERT for NLP

## 3-21-2020
- https://towardsdatascience.com/recommendation-system-series-part-2-the-10-categories-of-deep-recommendation-systems-that-189d60287b58
- "It is well-established that neural networks are able to approximate any continuous function with arbitrary precision by varying the activation choices and combinations"
- **https://dl.acm.org/doi/10.1145/2792838.2800187**
- **https://dl.acm.org/doi/10.1145/3240323.3240357**
- https://recsys.acm.org/recsys20/call/#content-tab-1-1-tab
- https://recsys.acm.org/recsys20/challenge/
- https://recsys.acm.org/recsys19/
- https://towardsdatascience.com/recommender-systems-with-deep-learning-architectures-1adf4eb0f7a6

Basic recap: k-NN, Naive Bayes, SVM, decision forests, data mining, clustering, and classification.

- https://www.slideshare.net/moustaki/time-context-and-causality-in-recommender-systems
- https://www.slideshare.net/linasbaltrunas9/context-aware-recommendations-at-netflix
- https://netflixtechblog.com/to-be-continued-helping-you-find-shows-to-continue-watching-on-7c0d8ee4dab6
- http://news.mit.edu/2017/better-recommendation-algorithm-1206

## 3-20-2020
- [McKean-Vlasov process](https://en.wikipedia.org/wiki/McKean%E2%80%93Vlasov_process) - in the context of Monte Carlo methods.
1. Monte Carlo methods are ideal for sampling when we have elements which interact with each other - hence their applicability to physics problems.
- Sound advice on visualizations: https://medium.com/nightingale/ten-considerations-before-you-create-another-chart-about-covid-19-27d3bd691be8
- COVID observations
1. See https://covid19-dash.github.io/
2. Check out https://upload.wikimedia.org/wikipedia/commons/8/86/Average_yearly_temperature_per_country.png
3. Then see https://en.wikipedia.org/wiki/List_of_countries_by_median_age#/media/File:Median_age_by_country,_2016.svg

## 3-19-2020
- The pitfalls of A/B testing
1. Sequential testing leads to a considerable number of errors when forming conclusions - interactions between different elements need to be taken into account too when making data-driven decisions.
2. The test should be allowed to run to the end - since we are analysing randomized samples, the results halfway through and the results at the end can be polar opposites of each other (!).
3. "The smaller the improvement, the less reliable the results."
4. Retest (at least a couple more times). Even with a statistically significant result, there is still a fairly large probability of a false positive.
- Data visualization pitfalls
1. https://junkcharts.typepad.com/
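Point 2 above (reading results halfway through) can be demonstrated with a small self-contained simulation: an A/A test with no real difference is "checked" every 100 samples, and stopping at the first nominally significant z-statistic produces far more false positives than a single look at the end. The sample sizes, peek interval, and 1.96 threshold are illustrative:

```python
import random

def z_stat(successes_a, successes_b, n):
    # Two-proportion z statistic for equal sample sizes n per arm.
    pa, pb = successes_a / n, successes_b / n
    p = (successes_a + successes_b) / (2 * n)
    se = (2 * p * (1 - p) / n) ** 0.5
    return 0.0 if se == 0 else abs(pa - pb) / se

def run_aa_test(rng, n=2000, peek_every=100):
    # Both arms have the SAME true rate, so any "significant" result is false.
    a = b = 0
    peeked_significant = False
    for i in range(1, n + 1):
        a += rng.random() < 0.5
        b += rng.random() < 0.5
        if i % peek_every == 0 and z_stat(a, b, i) > 1.96:
            peeked_significant = True  # an early-stopping analyst quits here
    final_significant = z_stat(a, b, n) > 1.96
    return peeked_significant, final_significant

rng = random.Random(0)
runs = 500
peek_fp = final_fp = 0
for _ in range(runs):
    p, f = run_aa_test(rng)
    peek_fp += p
    final_fp += f
print(f"false positives with peeking:      {peek_fp / runs:.2%}")
print(f"false positives, single final look: {final_fp / runs:.2%}")
```

The single-look rate stays near the nominal 5%, while the peeking rate is several times higher, which is exactly why a test "should be allowed to run till the end".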
- Bayesian Inference
- Self-attention in Transformers: one of the things to ponder if you want to understand the success of BERT.
- Dealing with outliers and bad data:
[QR Factorization and Singular Value Decomposition](https://www.cs.princeton.edu/courses/archive/fall11/cos323/notes/cos323_f11_lecture09_svd.pdf)
1. Robust regression: [https://stats.idre.ucla.edu/r/dae/robust-regression/](https://stats.idre.ucla.edu/r/dae/robust-regression/)
2. Least absolute deviations
3. Iteratively reweighted least squares

## 3-18-2020
- [Sentence BLEU score vs. corpus BLEU score](https://stackoverflow.com/questions/40542523/nltk-corpus-level-bleu-vs-sentence-level-bleu-score)
- How vector embeddings of words capture context (Gensim): "King" - "Man" + "Woman" = "Queen"
- w.r.t. calculating P(context word | center word): better alternatives to Singular Value Decomposition (SVD)?
 1. [SVD tutorial](https://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm)
 2. [Krylov-Schur approach to the truncated SVD](http://www.cs.ox.ac.uk/files/721/NA-08-03.pdf)
- Transfer learning considerations
1. What factors influence **when and how to fine-tune?** - the size of the dataset and its similarity to the original dataset.
2. Pre-trained network weights: https://github.com/BVLC/caffe/wiki/Model-Zoo

## 3-17-2020
- Ian Goodfellow, [Generative Adversarial Networks (GANs) | Artificial Intelligence (AI) Podcast](https://www.youtube.com/watch?v=Z6rxFNMGdn0)
1. Deep learning can, in simple words, be described as taking a thought and refining it again and again, rather than deductive reasoning.
2. Important questions regarding AI - how can we program machines to experience qualitative states of experience, read as consciousness and self-awareness?
3. Speech recognition is a very interesting and complex problem, concisely described in the paper ["Hidden Voice Commands"](https://people.eecs.berkeley.edu/~daw/papers/voice-usenix16.pdf). Interestingly, the attack generated sounds that a human would NEVER make (compare AlphaGo).

- [AlphaGo development documentary](https://www.youtube.com/watch?v=WXuK6gekU1Y)
1. AlphaGo also played moves that a human Go player would never have been expected to play = LEARNING.

- NLP
1. Check out wikification.


## Flow-GANs
Combining maximum likelihood and adversarial learning:
[Flow-GAN](https://arxiv.org/abs/1705.08868)

## Variational Autoencoders
- [Tutorial on Variational Autoencoders](https://arxiv.org/abs/1606.05908)
- Variational Autoencoders (VAEs) are powerful generative models.
- They are one of the most popular approaches to unsupervised learning of complicated distributions.
- VAEs have already shown promise in generating many kinds of complicated data, including handwritten digits, faces, house numbers, CIFAR images, physical models of scenes, segmentations, and predictions of the future from static images.

## Read up on BERT
Bidirectional Encoder Representations from Transformers - NLP pre-training.

## XGBoost
eXtreme Gradient Boosting - has given some of the best recent results on problems involving structured data.

## Gradient Boosting
- Why does AdaBoost work so well?
- Gradient boosting is an ensemble of decision trees, i.e. it builds a strong classifier from a combination of weak classifiers (decision stumps).
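A minimal sketch of that idea: each round fits a depth-1 stump to the current residuals (the negative gradient of squared loss) and adds it with shrinkage. The toy 1-D data and learning rate below are illustrative:

```python
# Least-squares gradient boosting with depth-1 stumps on toy 1-D data.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.0, 1.0, 1.0, 3.0, 3.0]

def fit_stump(xs, residuals):
    # Try every split point; predict the residual mean on each side.
    best = None
    for split in xs:
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lmean if x <= split else rmean)) ** 2
                  for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda x: lmean if x <= split else rmean

pred = [0.0] * len(xs)
lr = 0.5  # shrinkage: each stump only partially corrects the residuals
for _ in range(50):
    residuals = [y - p for y, p in zip(ys, pred)]  # negative gradient
    stump = fit_stump(xs, residuals)
    pred = [p + lr * stump(x) for p, x in zip(pred, xs)]

print([round(p, 2) for p in pred])  # approaches ys
```

Real implementations (XGBoost, LightGBM) use deeper trees, regularized split criteria, and second-order gradient information, but the fit-to-residuals loop is the same.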
## Miscellany
- Keras-on-Theano optimizers - SAGA, liblinear (log loss for high-dimensional data), Adam (incremental gradient descent)
- Adam is basically RMSProp plus a momentum term.
- You can add Nesterov Accelerated Gradient (NAG) to make it better:
  [Incorporating Nesterov Momentum into Adam](http://cs229.stanford.edu/proj2015/054_report.pdf)
  [NAG](https://blogs.princeton.edu/imabandit/2013/04/01/acceleratedgradientdescent/)
- Yet the Adam optimizer in some cases performs poorly compared to vanilla SGD?
- Does ReLU always provide a better non-linearity?

## Reinforcement Learning
The agent learns from the environment and receives rewards/penalties as the result of its actions. Its objective is to devise a policy function that maximizes cumulative reward.
It is different from supervised and unsupervised learning.
It is based on Markov Decision Processes, but model-free paradigms such as Q-learning perform better, especially on complex tasks.
- Monte Carlo policy gradient (REINFORCE, actor-critic)
- Problems arise with gradient values and variance; we need to define a baseline and use the Bellman equation.

Exploration (exploring new states) vs. exploitation (maximizing overall reward):
- Normal greedy approach: focus only on exploitation.
- Epsilon-greedy approach: explore with probability epsilon, and exploit (act greedily) with probability 1 - epsilon.
- Deep Q-Networks (DQN)
  When the number of states/actions becomes too large, it is more efficient to use neural networks.
  In the case of DQN, instead of a Bellman update, we rewrite the Bellman equation in squared-error form, which becomes our cost function.
- Policy improvement methods
- Temporal difference methods
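A minimal tabular Q-learning sketch tying the pieces above together - epsilon-greedy action selection and a Bellman update - on a toy 5-state corridor. The environment and hyper-parameters are made up for illustration:

```python
import random

# Toy corridor: states 0..4, actions 0 (left) / 1 (right); reward 1 at state 4.
N_STATES, GOAL = 5, 4

def step(state, action):
    nxt = max(0, min(GOAL, state + (1 if action == 1 else -1)))
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

rng = random.Random(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # illustrative hyper-parameters

for _ in range(500):  # episodes
    s, done = 0, False
    while not done:
        # Epsilon-greedy: explore with probability epsilon (ties broken randomly).
        if rng.random() < epsilon or Q[s][0] == Q[s][1]:
            a = rng.randrange(2)
        else:
            a = Q[s].index(max(Q[s]))
        s2, r, done = step(s, a)
        # Bellman update toward r + gamma * max_a' Q(s', a').
        target = r + (0.0 if done else gamma * max(Q[s2]))
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2

# The greedy policy should point right in every state before the goal.
print([Q[s].index(max(Q[s])) for s in range(GOAL)])
```

DQN replaces the table `Q` with a neural network and turns the same Bellman target into a squared-error loss, as noted above.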
## Transfer Learning
Use a model trained on one problem for predictive modelling on another problem.
For instance, say you have an image classification task. You can start from the VGG model, conveniently provided by Oxford's Visual Geometry Group. You would definitely need to change the last few layers for your task, and other changes would require hypothesis testing / domain knowledge.

Transfer learning really improves efficiency when we need to perform a supervised learning task but lack the significantly large, labeled dataset required to tackle the problem from scratch.

## Visualization
- Matplotlib is still popular in general.
- Pandas can also be used for visualization.
- Plotly.js and D3.js for beautiful output that can be rendered in browsers.
- Bokeh has become popular of late; it has bindings in Python, Lua, Julia, Java, and Scala.

## Regularization Techniques
Regularization is used to reduce overfitting.

- L1, L2 regularization: penalties on the weights.
- ElasticNet - L1 + L2 regularization combined.
- Adversarial learning - problems faced: some tasks which are very easy for humans have been found to be very difficult for a computer. For example, if you introduce a little noise into the photo of a lion, it may no longer be recognized as a lion (or worse, not as an animal at all). Thus, you deliberately add noise to an extended dataset to improve robustness. This is called jittering.
- Dropout - randomly drop some network units during training to reduce overfitting.
- Tikhonov regularization / ridge regression - regularization of [ill-posed](https://en.wikipedia.org/wiki/Well-posed_problem) problems.

## Probabilistic Graphical Models
- Inferential learning
- Markov Random Fields
- Conditional Random Fields
- Bayesian Networks

## Stochastic Gradient Descent
- What is the ideal batch size?
- Dealing with vanishing gradients (very small values of dL/dw)

## CNNs
- Pooling and strides are used for downsampling the feature map.
- AlexNet, GoogLeNet, VGG, DenseNet.

## Convergence
- Vanilla SGD achieves O(1/t) convergence for smooth convex functions.
- Nesterov Accelerated Gradient (NAG) achieves O(1/t²) convergence for smooth convex functions.
- Newton methods achieve O(1/t³) convergence for smooth convex functions.
- Arora, Mianjy, et al. - study convex-relaxation-based formulations of optimization problems.

## Expectation Maximization
- Baum-Welch
- Forward-Backward algorithm
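The EM loop behind Baum-Welch is easier to see on the classic two-coin example: each trial flips one of two biased coins (we don't know which) ten times, the E-step computes per-trial responsibilities, and the M-step re-estimates each coin's bias. This is a minimal sketch of generic EM, not Baum-Welch itself; the flip counts and initial guesses below are the common textbook setup:

```python
from math import comb

heads = [5, 9, 8, 4, 7]      # heads out of n flips per trial (illustrative)
n = 10
theta_a, theta_b = 0.6, 0.5  # initial guesses for the two coin biases

def likelihood(h, theta):
    # Binomial likelihood of h heads in n flips for a coin with bias theta.
    return comb(n, h) * theta ** h * (1 - theta) ** (n - h)

for _ in range(50):  # EM iterations
    # E-step: posterior probability that each trial came from coin A
    # (uniform prior over the two coins).
    resp = []
    for h in heads:
        la, lb = likelihood(h, theta_a), likelihood(h, theta_b)
        resp.append(la / (la + lb))
    # M-step: responsibility-weighted maximum-likelihood bias updates.
    theta_a = sum(r * h for r, h in zip(resp, heads)) / sum(r * n for r in resp)
    theta_b = (sum((1 - r) * h for r, h in zip(resp, heads))
               / sum((1 - r) * n for r in resp))

print(round(theta_a, 3), round(theta_b, 3))
```

Baum-Welch follows the same E/M alternation, with the forward-backward algorithm supplying the E-step responsibilities over hidden state sequences.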