{"id":15698625,"url":"https://github.com/ggeop/binary-classification-ml","last_synced_at":"2025-08-25T16:17:37.186Z","repository":{"id":102403963,"uuid":"133703994","full_name":"ggeop/Binary-Classification-ML","owner":"ggeop","description":"Machine Learning Project, build a Binary Classification Function in Python.","archived":false,"fork":false,"pushed_at":"2018-10-12T21:04:12.000Z","size":78,"stargazers_count":6,"open_issues_count":0,"forks_count":3,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-05-09T01:44:16.151Z","etag":null,"topics":["binary-classification","jupyter-notebook","numpy","pandas","python","sklearn"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ggeop.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-05-16T17:51:31.000Z","updated_at":"2021-11-08T12:56:02.000Z","dependencies_parsed_at":null,"dependency_job_id":"762861be-eb3f-4aec-8acb-ee48bc3e6404","html_url":"https://github.com/ggeop/Binary-Classification-ML","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ggeop/Binary-Classification-ML","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ggeop%2FBinary-Classification-ML","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ggeop%2FBinary-Classification-ML/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ggeop%2FBinary-Classification-ML/releases","manifests_url":"https://repos.ecosyste.ms/api
/v1/hosts/GitHub/repositories/ggeop%2FBinary-Classification-ML/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ggeop","download_url":"https://codeload.github.com/ggeop/Binary-Classification-ML/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ggeop%2FBinary-Classification-ML/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272093814,"owners_count":24872244,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-25T02:00:12.092Z","response_time":1107,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["binary-classification","jupyter-notebook","numpy","pandas","python","sklearn"],"created_at":"2024-10-03T19:31:21.216Z","updated_at":"2025-08-25T16:17:37.134Z","avatar_url":"https://github.com/ggeop.png","language":"Python","readme":"![alt text](https://github.com/ggeop/Binary-Classification-ML/blob/master/img/TextClassificationExample.png)\nImage from:https://developers.google.com/machine-learning/guides/text-classification/\n\n# Binary-Classification-ML\nIn this project, we are going to build a function that will take in a Pandas data frame containing data for a binary classification problem. 
Our function will try out and tune many different models on the input data frame and, at the end, return the model it thinks is best, along with an estimate of its performance on new, unseen data. To achieve this mighty task we are going to build several helper functions that our main function will have access to.\n\n## Setup\n\n```\n#Libraries\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport sklearn\nimport sklearn.linear_model\nimport sklearn.ensemble\nimport sklearn.metrics\n#Needed by specify_models() below\nimport sklearn.neighbors\nimport sklearn.svm\nimport sklearn.tree\n```\n\n### Extract Function\n\n```\ndef extract_x_and_y(df, y_column):\n    y=df[y_column]\n    #Use drop() so the caller's data frame is not mutated\n    x=df.drop(columns=[y_column])\n    return(x,y)\n```\n\n### Split Function\n\n```\ndef split_x_and_y(X, y, test_size = 0.2, random_state = 42):\n    #The training set keeps (1 - test_size) of the rows\n    train_size=int(len(X)*(1-test_size))\n    \n    #Make our results reproducible\n    np.random.seed(random_state)\n    \n    #Randomly select the rows for the training dataset\n    rows_array=np.random.choice(len(X),size=train_size,replace=False)\n    \n    #Create x,y train datasets\n    X_train=X.iloc[rows_array]\n    y_train=y.iloc[rows_array]\n    \n    #The remaining rows form the test dataset\n    total_rows=np.arange(len(X))\n    test_rows=np.delete(total_rows,rows_array)\n    \n    #Create x,y test datasets\n    X_test=X.iloc[test_rows]\n    y_test=y.iloc[test_rows]\n    \n    return(X_train,y_train,X_test,y_test)\n```\n\n### Model Classifiers\n\nCreate a function specify_models() that takes no parameters and returns a list of model definitions, one for each of the classifiers below, where each model definition is the dictionary structure described previously.\n\n```\ndef specify_models():\n    \n    knear={'name':'K Nearest Neighbors Classifier',\n           'class':sklearn.neighbors.KNeighborsClassifier(),\n           'parameters':{'n_neighbors':range(1,12)}\n          }\n    \n    
svc_linear={'name':'Support Vector Classifier with Linear Kernel',\n               'class':sklearn.svm.LinearSVC(),\n               'parameters':{'C':[0.001,0.01,0.1,1,10,100]}\n              }\n    \n    sv_radial={'name':'Support Vector Classifier with Radial Kernel',\n               'class':sklearn.svm.SVC(kernel='rbf'),\n               'parameters':{'C':[0.001,0.01,0.1,1,10,100],'gamma':[0.001,0.01,0.1,1,10,100]}\n              }\n    \n    #The l1 penalty requires the liblinear (or saga) solver in recent scikit-learn\n    loglas={'name':\"Logistic Regression with LASSO\",\n            'class':sklearn.linear_model.LogisticRegression(penalty='l1',solver='liblinear'),\n            'parameters':{'C':[0.001,0.01,0.1,1,10,100]}\n           }\n    \n    sgdc={'name':\"Stochastic Gradient Descent Classifier\",\n          'class':sklearn.linear_model.SGDClassifier(),\n          'parameters':{'max_iter':[100,1000],'alpha':[0.0001,0.001,0.01,0.1]}\n         }\n    \n    decis_tree={'name':\"Decision Tree Classifier\",\n                'class':sklearn.tree.DecisionTreeClassifier(),\n                'parameters':{'max_depth':range(3,15)}\n               }\n    \n    ranfor={'name':\"Random Forest Classifier\",\n            'class':sklearn.ensemble.RandomForestClassifier(),\n            'parameters':{'n_estimators':[10,20,50,100,200]}\n           }\n    \n    extrerantree={'name':\"Extremely Randomized Trees Classifier\",\n                  'class':sklearn.ensemble.ExtraTreesClassifier(),\n                  'parameters':{'n_estimators':[10,20,50,100,200]}\n                 }\n    \n    return [knear,svc_linear,sv_radial,loglas,sgdc,decis_tree,ranfor,extrerantree]\n\n```\n\n### TRAIN THE MODEL\n\nWhat we have right now is a list of dictionaries. Each dictionary essentially has the ingredients for us to train a model and tune the right parameters for that model. 
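Before automating this, it may help to see one of these dictionaries driving a grid search by hand. The following is a minimal, self-contained sketch; the breast-cancer dataset is just a convenient stand-in here, and the variable names are illustrative.

```python
# One model definition, in the same shape as the dictionaries above,
# fed through GridSearchCV manually. Minimal sketch, not the project's API.
import sklearn.datasets
import sklearn.linear_model
from sklearn.model_selection import GridSearchCV

model_dict = {'name': 'Logistic Regression with LASSO',
              'class': sklearn.linear_model.LogisticRegression(penalty='l1',
                                                               solver='liblinear'),
              'parameters': {'C': [0.01, 0.1, 1, 10]}}

# Stand-in dataset for the illustration
X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)

# GridSearchCV tries every parameter combination with k-fold cross validation
clf = GridSearchCV(estimator=model_dict['class'],
                   param_grid=model_dict['parameters'],
                   cv=5, scoring='f1')
clf.fit(X, y)
print(model_dict['name'], clf.best_params_)
```

After fitting, `clf.best_params_` holds the winning parameter combination and `clf.best_score_` the mean cross-validated f1 score, which is exactly the information the function below will extract.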
So, what we need now is a function, train_model(), that takes in the following parameters:\n\n    model_dict : We will pass the dictionaries from the list you just created to this parameter, one by one\n    X : The input data\n    y : The target variable\n    metric : The name of a metric to use for evaluating performance during cross validation. Please give this parameter a default value of 'f1', which is the F measure.\n    k : The number of folds to use with cross validation; the default should be 5\n\nThis function should essentially just call GridSearchCV(), correctly passing in the right information from all the different input parameters. The function should then return:\n\n    name : The human-readable name for the model type that was trained\n    best_model : The best model that was trained\n    best_score : The best score (for the metric provided) that was found\n\n\n```\nfrom sklearn.model_selection import GridSearchCV\n\ndef train_model(model_dict, X, y, metric = 'f1', k = 5):\n    name=model_dict['name']\n    param_grid=model_dict['parameters']\n    clf=GridSearchCV(estimator=model_dict['class'], param_grid=param_grid, cv=k, scoring=metric)\n    clf.fit(X,y)\n    #Return the refitted best estimator rather than the GridSearchCV wrapper\n    best_model=clf.best_estimator_\n    best_score=clf.best_score_\n    return(name, best_model, best_score)\n```\n\n### Central Component\n\n```\ndef train_all_models(models, X, y, metric = 'accuracy', k = 5):\n    #Initialize the list of results\n    final_list=list()\n    \n    for model_dict in models:\n        tr_model=train_model(model_dict, X, y, metric = metric, k = k)\n        final_list.append(tr_model)\n        \n    #Sort the results by score, best model first\n    final_list=sorted(final_list, key=lambda result: result[2], reverse=True)\n    return(final_list)\n```\n\n### Classifier Function\n\n```\ndef auto_train_binary_classifier(df, y_column, models, test_size = 0.2, random_state = 42, \n                                 metric = 'f1', k = 5):\n    \n    #Use the first function to split df into data and response\n    
extr_df=extract_x_and_y(df, y_column)\n    \n    #Use the second function to split the data into training and test sets\n    split_df=split_x_and_y(extr_df[0], extr_df[1], \n                           test_size = test_size, \n                           random_state = random_state\n                          )\n    \n    #Train all the models; the returned list is sorted best first\n    final_models=train_all_models(models, split_df[0], split_df[1], metric = metric, k = k)\n    \n    #Take the best model, its name and its score (index 0 of the sorted list)\n    best_model_name=final_models[0][0]\n    best_model=final_models[0][1]\n    train_set_score=final_models[0][2]\n    \n    ##################################\n    # Test set performance\n    ##################################\n    \n    #Note: the test set is always scored with accuracy, regardless of the CV metric\n    predicted=best_model.predict(split_df[2])\n    test_set_score=sklearn.metrics.accuracy_score(split_df[3], predicted)\n    \n    return(best_model_name, best_model, train_set_score, test_set_score)\n```\n\n### Testing\n\nThis section is an opportunity for you to test what you have implemented in this assignment. There are no more questions in this assignment; this section is only there to help you. In the code below, we load a data set into a Pandas dataframe and call your auto_train_binary_classifier() function to see the result. 
Use this as an opportunity to check that your function returns the output you expect.\n\n```\nfrom sklearn.datasets import load_breast_cancer\ncancer = load_breast_cancer()\ncancer_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)\ncancer_df['target'] = pd.Series(cancer.target)\n\n# The next commands will only work once you've implemented the functions above.\nmodels = specify_models()\nbest_model_name, best_model, train_set_score, test_set_score = auto_train_binary_classifier(cancer_df, 'target', models)\nprint(best_model_name)\nprint(best_model)\nprint(train_set_score)\nprint(test_set_score)\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fggeop%2Fbinary-classification-ml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fggeop%2Fbinary-classification-ml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fggeop%2Fbinary-classification-ml/lists"}