https://github.com/cdvel/ml-tutorials
Machine Learning Tutorials in Python (Notes and Implementations)
Last synced: 8 months ago
- Host: GitHub
- URL: https://github.com/cdvel/ml-tutorials
- Owner: cdvel
- Created: 2015-03-06T09:18:27.000Z (about 11 years ago)
- Default Branch: master
- Last Pushed: 2015-03-25T09:49:19.000Z (about 11 years ago)
- Last Synced: 2025-06-04T05:43:48.960Z (12 months ago)
- Language: Python
- Homepage:
- Size: 172 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# ml-tutorials
1. [k-Nearest neighbors](http://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/)
- Finds the k most similar instances using a distance measure such as Euclidean or Hamming distance
* Instance-based
- Models the problem using data instances (rows) to make predictive decisions.
- In kNN all observations are retained as part of the model, which makes it an extreme form of instance-based learning.
* Competitive learning
- Internally, model elements (instances) compete to make a predictive decision.
- An objective similarity measure makes each instance compete to be the most similar to a given unseen instance, and so to contribute to the prediction.
* Lazy learning
- The algorithm doesn't build a model until a prediction is required
- Uses only the data relevant to the unseen instance (a localized model)
- Can be computationally expensive to repeat over larger training sets
- Makes no assumption about the data other than that a distance measure can be calculated. Non-parametric: it doesn't assume a functional form, so it can capture non-linear relationships.
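The notes above can be illustrated with a minimal sketch. This is not the linked tutorial's code: the function names (`euclidean_distance`, `k_nearest`, `knn_predict`) and the toy data are assumptions made for illustration, and a simple majority vote is assumed as the way neighbors contribute to the prediction.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(train, query, k):
    """Return the k training rows closest to `query`.

    Each row is (features, label). Nothing is fitted in advance: the whole
    training set is kept and scanned at prediction time (lazy learning).
    """
    return sorted(train, key=lambda row: euclidean_distance(row[0], query))[:k]


def knn_predict(train, query, k=3):
    """Majority vote among the labels of the k nearest neighbors."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
train = [
    ((1.0, 1.1), "a"),
    ((1.2, 0.9), "a"),
    ((0.9, 1.0), "a"),
    ((5.0, 5.2), "b"),
    ((5.1, 4.8), "b"),
    ((4.9, 5.0), "b"),
]

print(knn_predict(train, (1.1, 1.0), k=3))  # a
```

Every call to `knn_predict` sorts the full training set, which is where the cost on larger training sets comes from.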
2. [Naive Bayes](http://machinelearningmastery.com/naive-bayes-classifier-scratch-python/)
- Suits classification problems. Uses probabilities of each attribute belonging to each class to make a prediction
- Fast and effective supervised learning for probabilistic prediction
* The Model
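- A summary of the data in the training set:
1. Mean of each attribute, per class value
2. Standard deviation of each attribute, per class value (number of attributes × number of class values summaries in total)
3. The summaries are used to calculate the probability of a specific attribute value belonging to each class value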
* Assumptions
- Attributes are independent of each other given the class
- Numerical attributes are normally distributed
* Conditional probability
- Probability of an attribute value given a class
- The product of the _conditional probabilities_ for each attribute gives an (unnormalized) probability of an instance belonging to a class
* Prediction
- Calculate probabilities of instance belonging to class
- Pick class with highest probability
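The model and prediction steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the linked tutorial's implementation: all function names and the toy data are hypothetical, numeric attributes are assumed to be normally distributed, the sample standard deviation from `statistics.stdev` is used, and class priors are ignored so that the score is just the product of per-attribute densities.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def summarize_by_class(rows):
    """Map each class label to a list of (mean, stdev) pairs, one per attribute.

    Each row is (features, label).
    """
    grouped = defaultdict(list)
    for features, label in rows:
        grouped[label].append(features)
    return {
        label: [(mean(col), stdev(col)) for col in zip(*feats)]
        for label, feats in grouped.items()
    }


def gaussian_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    exponent = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
    return exponent / (math.sqrt(2 * math.pi) * sigma)


def class_scores(summaries, features):
    """Product of per-attribute densities for each class (priors ignored)."""
    scores = {}
    for label, stats in summaries.items():
        score = 1.0
        for x, (mu, sigma) in zip(features, stats):
            score *= gaussian_pdf(x, mu, sigma)
        scores[label] = score
    return scores


def nb_predict(summaries, features):
    """Pick the class with the highest score."""
    scores = class_scores(summaries, features)
    return max(scores, key=scores.get)


# Hypothetical toy data: (features, label).
rows = [
    ((1.0, 2.0), "a"), ((1.2, 1.8), "a"), ((0.8, 2.2), "a"),
    ((5.0, 6.0), "b"), ((5.5, 6.5), "b"), ((4.5, 5.5), "b"),
]
summaries = summarize_by_class(rows)
print(nb_predict(summaries, (1.1, 2.1)))  # a
```

Note that the scores are densities rather than normalized probabilities; only their ranking across classes matters for the prediction.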
* Implementations
- Encoding using [Log Probabilities](https://en.wikipedia.org/wiki/Log_probability):
- Reduces risk of floating point underflow (values too small to be represented)
- More efficient by using summation of log probabilities instead of product of probabilities
- Observations with the Iris dataset:
- Less accurate
- Faster
- Log = (0.5 s - 0.8 s, accuracy = 66%) vs Prob = (0.6 s - 0.8 s, accuracy = 74%)
- Numerical stability (accuracy is more consistent)
- Categorical attributes: probabilities estimated from frequency ratios
- Numeric attributes: probabilities estimated from a normal distribution
- [How to get the best from Naive Bayes](http://machinelearningmastery.com/better-naive-bayes/)
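As a small illustration of the log-probability note above, the sketch below multiplies many small likelihoods and compares the result with the sum of their logs. The values are made up for the demonstration and do not come from the Iris experiment.

```python
import math

# Hypothetical per-attribute likelihoods for one instance under one class.
likelihoods = [1e-5] * 100

# Direct product: the true value (1e-500) is too small for a float.
product = 1.0
for p in likelihoods:
    product *= p

# Sum of logs: stays comfortably within floating point range.
log_sum = sum(math.log(p) for p in likelihoods)

print(product)  # 0.0
print(log_sum)  # a large negative number, roughly -1151
```

Because the logarithm is monotonic, comparing classes by their summed log probabilities should pick the same class as comparing the products, as long as the products themselves do not underflow.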