{"id":20663076,"url":"https://github.com/kivanc57/feature_comparison","last_synced_at":"2026-06-06T15:01:52.321Z","repository":{"id":180768519,"uuid":"665667771","full_name":"kivanc57/feature_comparison","owner":"kivanc57","description":"This project explores the relationship between features and diagnosis in cancer data. Using methods like boxplots, scatterplots, PCA, k-means clustering, and logistic regression, we analyze and visualize data to understand health indicators.","archived":false,"fork":false,"pushed_at":"2025-01-28T12:13:24.000Z","size":3491,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-10T08:56:12.306Z","etag":null,"topics":["boxplot","clustering","correlation","data-science","data-visualization","descriptive-statistics","explanatory-data-analysis","pearson-correlation","r","scatter-plot","spearman"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kivanc57.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-07-12T18:08:21.000Z","updated_at":"2025-01-28T12:13:28.000Z","dependencies_parsed_at":"2024-11-16T21:18:29.695Z","dependency_job_id":null,"html_url":"https://github.com/kivanc57/feature_comparison","commit_stats":null,"previous_names":["kivanc57/r_feature_comparison","kgordu/r_feature_comparison","kivanc57/feature_comparison"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/kivanc57/feature_comparison","repository_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/repositories/kivanc57%2Ffeature_comparison","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kivanc57%2Ffeature_comparison/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kivanc57%2Ffeature_comparison/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kivanc57%2Ffeature_comparison/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kivanc57","download_url":"https://codeload.github.com/kivanc57/feature_comparison/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kivanc57%2Ffeature_comparison/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33986901,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-06T02:00:07.033Z","response_time":107,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["boxplot","clustering","correlation","data-science","data-visualization","descriptive-statistics","explanatory-data-analysis","pearson-correlation","r","scatter-plot","spearman"],"created_at":"2024-11-16T19:16:32.392Z","updated_at":"2026-06-06T15:01:52.295Z","avatar_url":"https://github.com/kivanc57.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Cancer Diagnosis Analysis\n\n## Aim and Dataset 🚀\nThe objective of this 
project is to examine the relationship between various features in a dataset, focusing on the impact of the diagnosis variable, which consists of Malignant (M) and Benign (B) categories. The study employs Exploratory Data Analysis (EDA) and Descriptive Statistics to clarify how measurement results relate to health conditions. The analysis includes boxplots, scatterplots, correlation charts, k-means clustering, and logistic regression.\n\n**Dataset Details:**\n- **Format:** CSV\n- **Rows:** 569\n- **Columns:** 32 (excluding the \"X\" column with `NaN` values)\n\n### Key Steps:\n📱 **Data Preparation** 📱\n   - Load and preprocess the dataset.\n   - Remove unnecessary columns and scale numerical data.\n\n🌟 **Descriptive Statistics** 🌟\n   - Generate summary statistics for the features.\n\n⚡ **Visualization** ⚡\n   - **Boxplots:** Display the distribution and outliers of features.\n   - **Scatterplots:** Illustrate the relationship between features like `radius_mean` and `perimeter_mean`.\n\n🚨 **Correlation Analysis** 🚨\n   - **Pearson’s and Spearman’s Correlation:** Analyze correlations and their significance between features.\n\n🔥 **Principal Component Analysis (PCA)** 🔥\n   - Perform PCA to reduce dimensionality and visualize feature relationships.\n\n🌱 **Clustering** 🌱\n   - **K-Means Clustering:** Determine optimal cluster numbers and visualize clusters.\n\n🔔 **Logistic Regression** 🔔\n   - Compare features like `compactness_mean` and `radius_se` with the diagnosis variable to analyze their relationship.\n\n## R Script\n```R\n# Load required libraries\nlibrary(tidyverse)\nlibrary(cluster)\nlibrary(Hmisc)\nlibrary(plotly)\nlibrary(ggfortify)\nlibrary(factoextra)\nlibrary(NbClust)\nlibrary(ggpubr)\nlibrary(dplyr)\nlibrary(PerformanceAnalytics)\nlibrary(ggplot2)\n\n# Load and prepare data\ndata \u003c- read.csv(\"/Users/admin/Desktop/Workplace/Data/cancer.csv\", sep=\",\", row.names=1, stringsAsFactors = T)\ndata_rest \u003c- data[, -32]\nnumerical_data \u003c- 
data_rest[, -1]\nnumerical_data_scaled \u003c- scale(numerical_data)\n\n# Export data for Excel: round numeric columns, keep the remaining columns as they are\nexport_data \u003c- data %\u003e% mutate_if(is.numeric, round, digits = 3)\nwrite.table(export_data, file='/Users/admin/Desktop/data.txt', sep=',', row.names = T)\n\n# Descriptive statistics summary, copied to the clipboard (pbcopy is macOS-only)\nsummary_data \u003c- round(apply(numerical_data, 2, summary),3)\nclip \u003c- pipe(\"pbcopy\", \"w\")\nwrite.table(summary_data, file=clip, sep = '\\t', row.names = FALSE)\nclose(clip)\n\n# Boxplot visualization\npng(\"/Users/admin/Desktop/boxplot.png\", width = 20, height = 10, units = 'in', res = 300)\nboxplot(numerical_data_scaled, col = rainbow(ncol(numerical_data)), notch = TRUE, xlab = \"Features\", ylab = \"Values\")\ndev.off()\n\n# Scatter plot visualization\npng(\"/Users/admin/Desktop/scatter_plot.png\", width = 4, height = 4, units = 'in', res = 300)\nggplot(data = data_rest, mapping = aes(x = radius_mean, y = perimeter_mean)) +\n  geom_point(mapping = aes(color = diagnosis)) +\n  labs(x = \"radius_mean\", y = \"perimeter_mean\")\ndev.off()\n\n# Correlation analysis\nchart.Correlation(numerical_data, histogram=TRUE, pch=\"+\", method = \"pearson\")\nchart.Correlation(numerical_data, histogram=TRUE, pch=\"+\", method = \"spearman\")\n# Round the correlation coefficients to two decimals, then mask values below 0.05\ncorr_matrix \u003c- round(cor(numerical_data), 2)\ncorr_matrix[corr_matrix \u003c 0.05] \u003c- NA\n\n# PCA\nPCA_result \u003c- prcomp(numerical_data, center = TRUE, scale. 
= TRUE)\nsummary(PCA_result)\nPCA_plot \u003c- autoplot(PCA_result, data = data_rest, colour = 'diagnosis', label.size = 3, shape = FALSE,\n              loadings = TRUE, loadings.colour = 'blue',\n              loadings.label = TRUE, loadings.label.size = 2)\nggplotly(PCA_plot)\n\n# K-Means Clustering (run on the scaled data throughout)\nset.seed(1)\nNbClust(numerical_data_scaled, distance = \"euclidean\", min.nc = 2, max.nc = 10, method = \"kmeans\")\nkm.res \u003c- kmeans(numerical_data_scaled, 2)\nfviz_cluster(km.res, data = numerical_data_scaled, ellipse.type = \"norm\", repel = TRUE)\nggsave(\"kmeans_graph.png\")\npam.res \u003c- pam(numerical_data_scaled, 2)\nfviz_cluster(pam.res, geom = \"point\", ellipse.type = \"norm\")\n```\n\n## Screenshots 🚗\n### k-Means Graph\n![k-Means](/screenshots/kmean.png?raw=true)\n\n---\n\n### Boxplot Graph of Each Feature\n![Boxplot](/screenshots/boxplot.png?raw=true)\n\n---\n\n### Correlation Graph of Each Feature\n![Correlation](/screenshots/correlation.png?raw=true)\n\n---\n\n### Logistic Regression Graph of compactness_mean and diagnosis columns\n![Regression of compactness_mean/diagnosis](/screenshots/regression1.png?raw=true)\n\n---\n\n### Logistic Regression Graph of radius_se and diagnosis columns\n![Regression of radius_se/diagnosis](/screenshots/regression2.png?raw=true)\n\n## Conclusions 📣\nThrough exploratory data analysis and statistical modeling, this project examines the relationships between features and how they relate to the diagnosis variable. Correlation analysis, PCA, and clustering together highlight the key patterns in the data.\n\n* A more comprehensive report is available in the project folder as `report.docx`.\n\n## Dataset 📰\n* The original dataset used in this project is the 'Breast Cancer Wisconsin (Diagnostic) Data Set', available [here](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data?resource=download). 
\n\n## License 📍\nThis project is licensed under the GNU General Public License v3.0 (GPL-3.0) - see the [LICENSE](LICENSE) file for details.\n\n## Contact 📩\n* **Email**: kivancgordu@hotmail.com\n* **Version**: 1.0.0\n* **Date**: 23-06-2024\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkivanc57%2Ffeature_comparison","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkivanc57%2Ffeature_comparison","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkivanc57%2Ffeature_comparison/lists"}