{"id":19772037,"url":"https://github.com/danieldacosta/heartdisease-classification","last_synced_at":"2025-04-30T17:33:03.796Z","repository":{"id":39731646,"uuid":"192971483","full_name":"DanielDaCosta/HeartDisease-Classification","owner":"DanielDaCosta","description":"Machine Learning Classification Model","archived":false,"fork":false,"pushed_at":"2023-03-25T01:06:54.000Z","size":70,"stargazers_count":5,"open_issues_count":4,"forks_count":1,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-04-06T03:41:16.754Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DanielDaCosta.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-06-20T18:48:25.000Z","updated_at":"2023-07-12T03:41:17.000Z","dependencies_parsed_at":"2023-01-23T12:30:46.751Z","dependency_job_id":null,"html_url":"https://github.com/DanielDaCosta/HeartDisease-Classification","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanielDaCosta%2FHeartDisease-Classification","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanielDaCosta%2FHeartDisease-Classification/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanielDaCosta%2FHeartDisease-Classification/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanielDaCosta%2FHeartDisease-Classification/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DanielDaCosta","download_url":"https://c
odeload.github.com/DanielDaCosta/HeartDisease-Classification/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251751354,"owners_count":21637911,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-12T05:05:10.186Z","updated_at":"2025-04-30T17:33:03.555Z","avatar_url":"https://github.com/DanielDaCosta.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Heart Disease Classification\n\nClassification model built on the `Mission_Prediction_Dataset.csv` data that predicts the presence of a disease in a patient. \nThe model was created as part of the CYBERLABS Mission: Disease Classification.\n\n## Requirements\nInstall the following packages: `pip install -r requirements.txt`\n* Python 3.7\n* Numpy: V1.16.4\n* Pandas: V0.24.2\n* Seaborn: V0.9.0\n* Keras: V2.2.4\n* Keras-Applications: V1.0.8\n* Keras-Preprocessing: V1.1.0\n* Scikit-learn: V0.21.2\n* Scipy: V1.3.0\n* Matplotlib: V3.1.0\n\n## Developing Process\n### Preprocessing \nThe first step was to analyze the dataset. After verifying that there were no missing values, the next step was to \nunderstand how the features were distributed. As some of the variables had similar distributions, the dataset was reorganized\nso that these features would be side by side.\n\nThe output histogram showed that the data was well balanced: 54.45% sick and 45.54% not sick.\n\nTo analyze each feature's contribution to the output, the correlation matrix was plotted. 
Every feature was correlated\nwith the output and there was no multicollinearity problem. Outliers, however, were visible in a scatter plot\nof the data. For more precise detection, the Z-score was used to find and remove them.\n\nBecause the features have different scales, the dataset was normalized to make training less sensitive to them.\nFeatures with Gaussian distributions were rescaled by standardization.\nThe binary features did not need normalization, so their values were left unchanged. The remaining features were normalized with a MinMaxScale function.\n\n### Creating ML Model\n\n#### Neural Network\n\nA Neural Network classification model was used for this problem. It was chosen for its versatility and because I was\nalready familiar with this kind of Machine Learning model. Given the complexity of the data, \nwith multiple variables, a Multilayer Perceptron with two hidden layers was the best configuration to use. \n\nThe inputs ranged roughly between -1 and 1 (some values were higher due to standardization), so the 'tanh' function was chosen as the hidden layers' \nactivation function. Since this is a classification problem, a 'sigmoid' was used in the output\nlayer.\n\nTo prevent overfitting, Dropout and L1 and L2 regularization were tried, but the results were not good \nenough because of the small neural network configuration (only 7 neurons in each layer). The best result was obtained using \nearly stopping. Model performance was measured with the accuracy metric, as requested.\n\n30% of the data was used to test the model, 10% as a validation set, and the rest for training.\n\nThe model has an average accuracy of 84.366% on the test set and 87.55% on the training set. \nIt's a reasonable result but not the best one. 
The model proved difficult to optimize since the neural\nnetwork was small and so was the dataset.\n\n#### Support Vector Machines\n\nTo compare against the result obtained with the Neural Network, another ML model was used: a\nSupport Vector Machine (SVM). SVM was chosen because it is capable of classification, is easy to implement, and \ntakes the same type of input data as the previous model. 30% of the data was used to \ntest the model and the rest for training.\n\nBecause of the data's complexity and since the target has only two classes, a 'sigmoid' kernel was used. The other kernels\n(linear, Gaussian, and polynomial) were also tested, but they did not perform as well.\n\nThe model has an average accuracy of 84.36%, almost the same as the previous model, which shows the consistency of \nthe results.\n\n\n## Output.txt file\n\nThis file contains a list of ten prediction results from both models, along with the average accuracy of each.\n\n## Execution\n\nThe path to the CSV file, *Mission_Prediction_Dataset.csv*, should be added in the function pd.read_csv(\"*Add path here*\"), \nline 21. Then run the Python file 'Daniel_Prediction.py' with all the required libraries installed; Python 3.7 was used. The output contains the following figures, 16 in total. 
Comment out the last line of the code, line 166, if you do not want the figures to be displayed:\n\n* Distribution graphs of each feature (thirteen figures)\n* Histogram of the output data\n* Correlation matrix \n* Neural Network loss function of training and test sets\n\nThe Neural Network results (confusion matrix and accuracy) and SVM results (confusion matrix and accuracy) will be \nshown on the screen, in that order.\n\n## Acknowledgements\n\nThe dataset came from a Kaggle competition that can be found [here](https://www.kaggle.com/cdabakoglu/heart-disease-classifications-machine-learning).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanieldacosta%2Fheartdisease-classification","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdanieldacosta%2Fheartdisease-classification","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanieldacosta%2Fheartdisease-classification/lists"}