{"id":19947848,"url":"https://github.com/aarryasutar/credit_eda","last_synced_at":"2026-04-13T15:32:58.318Z","repository":{"id":250438502,"uuid":"834479132","full_name":"aarryasutar/Credit_EDA","owner":"aarryasutar","description":"This project focuses on cleaning and analyzing a loan application dataset to gain insights into the factors influencing loan defaults. Through systematic data cleaning, visualization, and merging with previous application data, it provides a robust foundation for further predictive modeling.","archived":false,"fork":false,"pushed_at":"2024-07-27T12:04:39.000Z","size":1493,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-11-28T03:48:18.638Z","etag":null,"topics":["binning","boxplot","correlation-matrix","data-cleaning","data-splitting","dataframe","feature-engineering","heatmap","jupyter-notebook","matplotlib","numpy","pandas","python","scikit-learn","seaborn"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aarryasutar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-27T11:49:40.000Z","updated_at":"2024-08-08T18:36:47.000Z","dependencies_parsed_at":"2024-07-27T13:12:02.212Z","dependency_job_id":null,"html_url":"https://github.com/aarryasutar/Credit_EDA","commit_stats":null,"previous_names":["aarryasutar/credit_eda"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/aarryasutar/Credit_EDA","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aarryasutar%2FCredit_EDA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aarryasutar%2FCredit_EDA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aarryasutar%2FCredit_EDA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aarryasutar%2FCredit_EDA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aarryasutar","download_url":"https://codeload.github.com/aarryasutar/Credit_EDA/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aarryasutar%2FCredit_EDA/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31759421,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-13T15:25:13.801Z","status":"ssl_error","status_checked_at":"2026-04-13T15:25:09.162Z","response_time":93,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["binning","boxplot","correlation-matrix","data-cleaning","data-splitting","dataframe","feature-engineering","heatmap","jupyter-notebook","matplotlib","numpy","pandas","python","scikit-learn","seaborn"],"created_at":"2024-11-13T00:37:42.675Z","updated_at":"2026-04-13T15:32:58.267Z","avatar_url":"https://github.com/aarryasutar.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Cleaning and Analysis of Application Data\n\n## Introduction\nThis project involves the cleaning and analysis of a dataset called `application_data.csv`, which contains information on loan applications. The goal is to clean the data by handling missing values, removing unwanted columns, and exploring various features through visualizations.\n\n## Importing Libraries\nNecessary libraries like `numpy`, `pandas`, `matplotlib`, `seaborn`, and `warnings` are imported to handle data manipulation, visualization, and suppressing warnings.\n\n## Data Loading\nThe application data is read into a DataFrame using `pd.read_csv(\"application_data.csv\")`.\n\n## Initial Data Exploration\n- **Shape**: The initial number of rows and columns is checked using `df.shape`.\n- **Info**: Detailed information about the DataFrame is obtained using `df.info('all')`.\n- **Null Values**: The percentage of missing values in each column is calculated.\n\n## Handling Missing Values\n- **Dropping Columns with \u003e40% Missing Values**: Columns with more than 40% missing values are dropped.\n- **Null Values Verification**: The percentage of missing values is checked again after dropping columns.\n\n## Column-Specific Missing Values Treatment\n- **AMT_ANNUITY**: Missing values are filled with the median value.\n- **Rows with \u003e40% Missing Values**: Rows having more than 40% missing values are dropped.\n- **Unwanted Columns**: Several columns deemed unnecessary are dropped from the dataset.\n\n## Handling 'XNA' Values\n- **CODE_GENDER**: Rows with 'XNA' values are updated to 'F' based on the majority value.\n- **ORGANIZATION_TYPE**: Rows with 'XNA' values are dropped.\n\n## Data Type Conversion\nNumeric columns are converted to the appropriate data types using `pd.to_numeric`.\n\n## Binning Income and Credit Amounts\n- **AMT_INCOME_TOTAL**: Binned into various ranges for better analysis.\n- **AMT_CREDIT**: Binned into various ranges for better analysis.\n\n## Dataset Splitting\nThe dataset is split into two based on the `TARGET` column:\n- `target0_df`: Clients without payment difficulties.\n- `target1_df`: Clients with payment difficulties.\n\n## Imbalance Calculation\nThe imbalance ratio between the majority (`target0_df`) and minority (`target1_df`) classes is calculated.\n\n## Visualization\nSeveral visualizations are created to understand the distribution of various features:\n- **Income Range Distribution**: Plotted by `CODE_GENDER`.\n- **Income Type Distribution**: Plotted by `CODE_GENDER`.\n- **Contract Type Distribution**: Plotted by `CODE_GENDER`.\n- **Organization Type Distribution**: Plotted on a logarithmic scale.\n\n## Correlation Analysis\nCorrelation matrices are computed for both target classes and visualized using heatmaps.\n\n## Outlier Detection\nBox plots are used to detect outliers in various features:\n- **Income Amount**: Distribution visualized for `target0_df`.\n- **Credit Amount**: Distribution visualized for both `target0_df` and `target1_df`.\n- **Annuity Amount**: Distribution visualized for both `target0_df` and `target1_df`.\n\n## Additional Visualizations\n- **Credit Amount vs Education Status**: Visualized using box plots.\n- **Income Amount vs Education Status**: Visualized using box plots.\n\n## Previous Application Data\n- **Data Loading**: Previous application data is read into `df1`.\n- **Missing Data Handling**: Columns with more than 40% missing values are dropped.\n- **'XNA' and 'XAP' Values**: Rows with these values are removed.\n- **Data Merging**: The cleaned previous application data is merged with the application data.\n- **Column Renaming**: Columns are renamed for better understanding.\n\n## Visualization of Merged Data\n- **Contract Status Distribution**: Visualized with purposes.\n- **Purposes Distribution by Target**: Visualized with a count plot.\n- **Credit Amount vs Loan Purpose**: Visualized using box plots.\n- **Credit Amount vs Housing Type**: Visualized using bar plots.\n\n## Conclusion\nThe data cleaning and analysis process involves handling missing values, updating specific columns, binning continuous variables, splitting the dataset, visualizing distributions, and merging with previous application data. This comprehensive approach ensures a clean dataset for further analysis and modeling.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faarryasutar%2Fcredit_eda","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faarryasutar%2Fcredit_eda","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faarryasutar%2Fcredit_eda/lists"}