{"id":16865781,"url":"https://github.com/mebjas/ml-experiments","last_synced_at":"2025-03-18T17:38:23.800Z","repository":{"id":72001704,"uuid":"79041602","full_name":"mebjas/ml-experiments","owner":"mebjas","description":"some random machine learning experiments","archived":false,"fork":false,"pushed_at":"2017-02-22T19:51:28.000Z","size":8138,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-05-23T08:04:01.661Z","etag":null,"topics":["classifier","machine-learning","python","sklearn"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mebjas.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2017-01-15T15:37:12.000Z","updated_at":"2017-03-22T17:05:00.000Z","dependencies_parsed_at":"2023-03-07T16:15:50.628Z","dependency_job_id":null,"html_url":"https://github.com/mebjas/ml-experiments","commit_stats":{"total_commits":19,"total_committers":1,"mean_commits":19.0,"dds":0.0,"last_synced_commit":"682c7d6f1ecd6aef8a5e7002d22eb591160942b3"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mebjas%2Fml-experiments","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mebjas%2Fml-experiments/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mebjas%2Fml-experiments/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mebjas%2Fml-experiments/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/owners/mebjas","download_url":"https://codeload.github.com/mebjas/ml-experiments/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244271846,"owners_count":20426624,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["classifier","machine-learning","python","sklearn"],"created_at":"2024-10-13T14:48:29.015Z","updated_at":"2025-03-18T17:38:23.777Z","avatar_url":"https://github.com/mebjas.png","language":"Python","readme":"# Machine Learning Experiments Server\nThis is my first hands-on experimentation with machine learning algorithms and techniques. I'll keep updating my summaries here:\n\n### Experiment 1: Classifier to classify a Github Issue as `enhancement` or `bug` based purely on issue title.\n#### Quick Summary: Mined more than `100,000` issues from Github open source repositories. Most of them were labelled `enhancement` or `bug`. Tried a couple of algorithms and techniques on them. Here's what I have learnt so far.\n - The resulting accuracy seemed to go up with training data size (number of rows).\n - For some ML algorithms, training time also went up with training data size. Some algorithms also seemed to take prediction time proportional to the training data size, for example `Gaussian Naive Bayes` and `SVM`. 
Prediction time was pretty much constant for tree-based algorithms like `decision tree`, `random forest`, and `adaboost with decision tree as weak learner`.\n - Here's the best accuracy I could achieve so far with different algorithms (parameters, training time, and data size omitted).\n \n  Algorithm     | Accuracy (%)  |\n  ------------- |---------------|\n  SVM           | 80.08         |\n  AdaBoost      | 74.82         |\n  Naive Bayes   | 68.84         |\n  ***Random Forest*** | ***85.8***         |\n  Decision Tree | 77.52         |\n  \n  This made me an obvious fan of the `Random Forest Ensemble`, considering both speed and accuracy.\n  \n - In my case ***feature selection only seemed to improve the accuracy of the `random forest classifier` by a small margin***. Best results were observed when `95th percentile` feature selection was applied. Without any feature selection in the pipeline, accuracy was `85.74%` for the same amount of data and the same parameters (`n_estimators = 15`, `criterion=entropy`). **However, training time dropped to `129s` with `95%ile` feature selection, compared to `1077s` when it was not applied: nearly 8 times faster.** One point of interest: accuracy (marginally, though) and training time differed between `100%ile` feature selection and no feature selection at all. With `100%ile` selection it took `130s` in place of `1077s`.\n \n - ***Accuracy is not the only metric to consider - metrics like precision, recall \u0026 f1_score are important too***. At its best the classifier got the following numbers. Note that there were two labels, so precision, recall \u0026 fbeta each have two values (one per label).\n \n  Random Forest | Value          |\n  ------------- | -------------- |\n  accuracy      | `85.34%`       |\n  precision     | `[0.87, 0.82]` |\n  recall        | `[0.92, 0.73]` |\n  fbeta_score   | `[0.89, 0.77]` |\n \n - PCA didn't improve the accuracy of the classifier; in initial experiments it reduced it by around 5%. 
As can be expected, it also took a toll on other metrics like precision, fbeta score \u0026 recall, and it increased processing time.\n \n - Also, in initial experiments, stemming and stopword removal didn't seem to bring much improvement. They seemed to bring down the metrics in some cases.\n\n#### TODOS / Things to test\n - [x] Feature Cleaning pipeline\n    - [x] Stemming of issue text\n    - [x] Removing stopwords\n - [x] Principal Component Analysis\n - [x] GridSearchCV to find best parameters for the classifier.\n - [ ] Classifier features based on POS tagging of the issue title text.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmebjas%2Fml-experiments","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmebjas%2Fml-experiments","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmebjas%2Fml-experiments/lists"}