{"id":18418116,"url":"https://github.com/ishaansathaye/ml","last_synced_at":"2026-05-05T12:31:31.608Z","repository":{"id":141545283,"uuid":"256889620","full_name":"ishaansathaye/ML","owner":"ishaansathaye","description":"Machine Learning Documentation and Libraries","archived":false,"fork":false,"pushed_at":"2022-08-07T15:39:09.000Z","size":22480,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-13T08:56:02.118Z","etag":null,"topics":["ai","classification","clustering","data-science","ibm","machine-learning","machine-learning-algorithms","python","sql","theory"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ishaansathaye.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-04-19T01:33:55.000Z","updated_at":"2022-03-12T06:31:41.000Z","dependencies_parsed_at":null,"dependency_job_id":"f7b488b6-2d92-43ed-9538-19985ee6a3ef","html_url":"https://github.com/ishaansathaye/ML","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ishaansathaye/ML","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ishaansathaye%2FML","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ishaansathaye%2FML/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ishaansathaye%2FML/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ishaansathaye%2FML/m
anifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ishaansathaye","download_url":"https://codeload.github.com/ishaansathaye/ML/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ishaansathaye%2FML/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32649514,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-05T11:29:49.557Z","status":"ssl_error","status_checked_at":"2026-05-05T11:29:48.587Z","response_time":54,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","classification","clustering","data-science","ibm","machine-learning","machine-learning-algorithms","python","sql","theory"],"created_at":"2024-11-06T04:12:36.001Z","updated_at":"2026-05-05T12:31:31.588Z","avatar_url":"https://github.com/ishaansathaye.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Machine Learning Documentation\n---------------------------\n## Regression Algorithms\n- Ordinal Regression\n- Poisson Regression\n- Fast forest quantile Regression\n- Linear, Polynomial, Lasso, Stepwise, Ridge Regression\n- Bayesian linear Regression\n- Neural network Regression\n- Decision forest Regression\n- Boosted decision tree Regression\n- KNN (K-nearest neighbors)\n---------------------------\n## 
Classification\n### Evaluation Metrics: Classification\n- **Jaccard Index (Best value at 1)**\n    - J = (correctYPredictions) / (realElements + predictedElements - correctYPredictions)\n- **F1 Score (Best value at 1)**\n    - Confusion Matrix shows a chart of correct and wrong predictions\n    - Precision = TruePositive / (TruePositive + FalsePositive)\n    - Recall = TruePositive / (TruePositive + FalseNegative)\n    - F1 Score = 2*(prec * rec)/(prec + rec)\n\u003cp align=\"center\"\u003e\u003cimg src=\"images/confm.png\" width=\"250\"\u003e\u003c/p\u003e\n\n- **Log loss (Best value at 0, lower is better)**\n    - Predicted output is a probability value between 0 and 1\n    - Log Loss Equation: \n\u003cp align=\"center\"\u003e\u003cimg src=\"images/logloss.png\" width=\"250\"\u003e\u003c/p\u003e\n\n### Classification Algorithms\n- **Decision Trees**\n    - Testing attributes or features: internal nodes are tests, branches are results of the test, and each leaf node assigns a case (e.g., a patient) to a class\n    - Attributes should split data so that there is less impurity\n    - Aiming for pure nodes: impurity should go down as the tree grows -\u003e less entropy\n- **Naive Bayes**\n- **Linear Discriminant Analysis**\n- **k-Nearest Neighbor** - classifying cases based on their similarity to other cases\n    - On a scatter plot, the closest case can be associated with the unknown case that the algorithm is predicting\n    - Taking a majority vote over several nearest neighbors (e.g., k = 5) is more reliable than using only the single closest case\n    - k = number of cases nearest to the unknown case\n    - Calculate the distance between cases using a distance formula (e.g., Euclidean distance)\n- **Logistic Regression**\n    - Binary classification: 0 or 1\n    - Returns a probability score between 0 and 1\n    - Sigmoid Function = Logistic Function -\u003e output approaches 1 for large positive inputs and 0 for large negative inputs\n    - Training Process: 1) Initialize theta 2) Calculate y_hat 3) Find error (actual - predicted) 4) Change theta to reduce cost function 5) Go to 2 and start again (Use gradient 
descent to reduce the cost, and use accuracy to decide when to stop iterating)\n    - Logistic Regression Cost Function \u003cp align=\"center\"\u003e\u003cimg src=\"images/logcost.png\" width=\"250\"\u003e\u003c/p\u003e\n    - To find the parameters, minimize the cost function using gradient descent\n- **Neural Networks**\n- **Support Vector Machines** - supervised algorithm that classifies cases by finding a separator\n    - SVM outputs a hyperplane that separates cases and can be used to classify unknown cases\n    - Data transformation - mapping data into a higher-dimensional space where it becomes separable -\u003e Kernel Functions (linear, polynomial, RBF, and Sigmoid)\n    - Finding the Hyperplane with support vectors closest to the margin lines: \u003cp align=\"center\"\u003e\u003cimg src=\"images/svmhype.png\" width=\"250\"\u003e\u003c/p\u003e\n    - Advantages: accurate in high-dimensional spaces and memory efficient\n    - Disadvantages: prone to over-fitting, no direct probability estimates, practical mainly for small datasets\n    - Applications: Image Recognition, Text Categorization, Detecting Spam, Sentiment Analysis, Gene Expression Classification, and other machine learning techniques\n---------------------------\n## Clustering\n- Cluster -\u003e a group of objects that are similar to other objects in that cluster and dissimilar to data points in other clusters \u003cp align=\"center\"\u003e\u003cimg src=\"images/cluster.png\" width=\"250\"\u003e\u003c/p\u003e\n- Form of unsupervised learning\n- Applications: Exploratory Data Analysis, Summary Generation, Outlier Detection, Finding Duplicates, Pre-processing Step\n\n### Clustering Algorithms\n- Partition-based Clustering (Relatively Efficient and forms sphere-like clusters) (For Medium and Large Databases)\n    - **k-Means** (unsupervised)\n        - Divides data into non-overlapping subsets\n        - k = number of clusters (choose random centroids of clusters)\n        - Reduce error: SSE = sum of squared differences between each point and its centroid\n        - Compute new centroids of 
clusters by taking the mean of the data points in each cluster\n        - Repeat assigning points to the nearest centroid and recomputing centroids until the algorithm converges and the centroids no longer move\n        - *Accuracy*\n            - External -\u003e compare with ground truth if available\n            - Internal -\u003e Average the distance between data points within a cluster\n            - Choosing k -\u003e graph of k vs mean distance of data points to cluster centroid (best when distance is low) __Find the elbow point of the graph, where the rate of decrease changes sharply, for the best k__\n    - **k-Median**\n    - **Fuzzy c-Means**\n- Hierarchical Clustering (Intuitive and good for small datasets) -\u003e builds a hierarchy of clusters where each node is a cluster and consists of the clusters of its daughter nodes\n    - **Agglomerative** (collect things)\n        - Bottom-up: pairs of clusters are merged as one moves up the hierarchy\n        - Steps: *1)* Create n clusters, one for each data point *2)* Compute the proximity matrix *3)* Repeat -\u003e Merge the two closest clusters and update the proximity matrix -\u003e Until only a single cluster remains\n        - *Single-Linkage Clustering*\n            - Minimum distance between points of the two clusters\n        - *Complete-Linkage Clustering*\n            - Maximum distance between points of the two clusters\n        - *Average Linkage Clustering*\n            - Average distance between points of the two clusters\n        - *Centroid Linkage Clustering*\n            - Distance between cluster centroids\n    - **Divisive**\n        - Top-down: clusters are divided\n    - Partition-Based (k-Means) vs Hierarchical Clustering: \u003cp align=\"center\"\u003e\u003cimg src=\"images/hier.png\" width=\"1000\"\u003e\u003c/p\u003e\n- Density-based Clustering (Produces arbitrarily shaped clusters) (Good for spatial clusters or when the dataset contains noise)\n    - **DBSCAN** (Density-Based Spatial Clustering of Applications with Noise)\n        - Common clustering algorithm that works based on the density of objects\n        - Parameters\n    
        - R (Radius of neighborhood) = a neighborhood of radius R that contains enough points is a dense area\n            - M (Min number of neighbors) = minimum number of data points that we want in a neighborhood to define a cluster\n        - Core Points -\u003e data points that have at least M points within their neighborhood of radius R\n        - Border Points -\u003e data points that have fewer than M points in their neighborhood but lie within the neighborhood of a core point\n        - Clusters are formed around at least one core point and can connect multiple core points\n---------------------------\n## Recommender Systems\n- Captures patterns in people's behavior and uses them to predict what they might want or like\n- **Content-Based**\n    - Figures out favorite aspects and makes recommendations to show things that share those aspects\n    - Multiply the user's ratings by the items' genre matrix to get a weighted genre matrix\n- **Collaborative Filtering**\n    - Finds groups of similar users and shows recommendations based on what those users like\n    - *User-Based*\n        - Based on users' neighborhood\n        - Have a User Rating Matrix and then learn the similarity weights\n        - Create the weighted ratings matrix\n    - *Item-Based*\n        - Based on items' similarity\n        - Not based on content, but based on similarity between items\n    - Challenges\n        - Data Sparsity\n            - Users in general rate only a limited number of items\n        - Cold Start\n            - Recommendation to new users or new items\n        - Scalability\n            - Increase in the number of users or items\n- *Implementing Recommender System*\n    - Memory-Based\n        - Uses the entire user-item dataset to generate recommendations\n        - Uses statistical techniques\n    - Model-Based\n        - Develops a model of users to learn their preferences\n        - Uses machine learning 
techniques","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fishaansathaye%2Fml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fishaansathaye%2Fml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fishaansathaye%2Fml/lists"}