Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/jovicdev97/financial-data-analytics

using numpy and pandas to analyze a synthetic loan dataset with python
https://github.com/jovicdev97/financial-data-analytics

data-analysis matlabplot numpy pandas plotting python seaborn

Last synced: about 17 hours ago
JSON representation

using numpy and pandas to analyze a synthetic loan dataset with python

Awesome Lists containing this project

README

        

### Source
- Dataset: Synthetic Loan Dataset
- Platform: Kaggle
- Link: [financial-risk-for-loan-approval](https://www.kaggle.com/datasets/lorenzozoppelletto/financial-risk-for-loan-approval)
- Type: Synthetic/Generated Data
- Records: 20,000
- Features: 36 columns

### Synthetic Data
- Protecting individual privacy
- Avoiding ethical concerns related to financial data
- Allowing open sharing and collaboration
- Maintaining realistic data patterns while eliminating sensitive information

### Dataset Features
- Application details (date, loan amount, duration)
- Personal information (age, employment status, education)
- Financial metrics (annual income, credit score, interest rates)
- Risk assessment (risk score, loan approval status)

## Analysis Features

### 1. Data Loading and Initial Exploration
- Loading dataset using Pandas
- Basic data examination with head() function
- Data cleaning and preprocessing

### 2. Array Operations with NumPy
- Creating and manipulating different data type arrays
- Filtering operations
- Statistical calculations

### 3. Financial Analysis
- Debt-to-Income ratio calculations
- Monthly payment analysis
- Interest rate examination
- Credit risk assessment

### 4. Data Visualization
- Line chart: Interest rates over time
- Bar chart: Distribution of employment status
- Histogram: Annual income distribution
- Box plot: Interest rates by education level
- Scatter plot: Credit score vs interest rate correlation

## Key Insights
- Interest rates remain relatively stable over the analyzed time period
- Most loan applicants are employed
- Majority of applicants have annual income under $100,000
- Higher credit scores correlate with lower interest rates
- Education level shows minimal impact on base interest rates

## Technical Requirements
### Dependencies
- Python 3.x
- NumPy >= 1.19.2
- Pandas >= 1.2.0
- Matplotlib >= 3.3.2
- optional: (Seaborn >= 0.11.0)

### Hardware Requirements
- Minimum 4GB RAM
- 1GB free disk space

## Installation
```bash
# Create virtual environment (optional but strongly (!) recommended)
python -m venv env
source env/bin/activate # On Windows: env\Scripts\activate

# Install required packages
pip install numpy pandas matplotlib seaborn

# Clone repository
git clone https://github.com/jovicdev97/loan-analysis.git
cd loan-analysis
# Usage
Clone this repository
Place the Loan.csv dataset in the project directory
Run the Jupyter notebook:
bash
jupyter notebook loan_analysis.ipynb

# Project Structure
basic

loan-analysis/

├── data/
│ └── Loan.csv

├── notebooks/
│ └── loan_analysis.ipynb

├── README.md
└── requirements.txt
```

# DATA
SOURCE OF DATA IS KAGGLE
Kaggle https://www.kaggle.com/datasets/lorenzozoppelletto/financial-risk-for-loan-approval

The original authos provides the Python Snippet to generate the provided data we are using in this project:

```
import pandas as pd
import numpy as np
from scipy import stats
from datetime import datetime, timedelta

# Number of samples
num_samples = 2000

# Seed for reproducibility
np.random.seed(42)

def generate_correlated_features(num_samples):
# Generate base features
age = np.random.normal(40, 12, num_samples).clip(18, 80).astype(int)
experience = (age - 18 - np.random.normal(4, 2, num_samples).clip(0)).clip(0).astype(int)
education_level = np.random.choice(['High School', 'Associate', 'Bachelor', 'Master', 'Doctorate'], num_samples, p=[0.3, 0.2, 0.3, 0.15, 0.05])

# Education affects income and credit score
edu_impact = {'High School': 0, 'Associate': 0.1, 'Bachelor': 0.2, 'Master': 0.3, 'Doctorate': 0.4}
edu_factor = np.array([edu_impact[level] for level in education_level])

# Generate correlated income, credit score, and employment status
base_income = np.random.lognormal(10.5, 0.6, num_samples) * (1 + edu_factor) * (1 + experience / 100)
income_noise = np.random.normal(0, 0.1, num_samples)
annual_income = (base_income * (1 + income_noise)).clip(15000, 300000).astype(int)

credit_score_base = 300 + 300 * stats.beta.rvs(5, 1.5, size=num_samples)
credit_score = (credit_score_base + edu_factor * 100 + experience * 1.5 + income_noise * 100).clip(300, 850).astype(int)

employment_status_probs = np.column_stack([
0.9 - edu_factor * 0.3, # Employed
0.05 + edu_factor * 0.2, # Self-Employed
0.05 + edu_factor * 0.1 # Unemployed
])
employment_status = np.array(['Employed', 'Self-Employed', 'Unemployed'])[np.argmax(np.random.random(num_samples)[:, np.newaxis] < employment_status_probs.cumsum(axis=1), axis=1)]

return age, experience, education_level, annual_income, credit_score, employment_status

def generate_time_based_features(num_samples):
start_date = datetime(2018, 1, 1)
dates = [start_date + timedelta(days=i) for i in range(num_samples)]
return dates

age, experience, education_level, annual_income, credit_score, employment_status = generate_correlated_features(num_samples)
application_dates = generate_time_based_features(num_samples)

data = {
'ApplicationDate': application_dates,
'Age': age,
'AnnualIncome': annual_income,
'CreditScore': credit_score,
'EmploymentStatus': employment_status,
'EducationLevel': education_level,
'Experience': experience,
'LoanAmount': np.random.lognormal(10, 0.5, num_samples).astype(int),
'LoanDuration': np.random.choice([12, 24, 36, 48, 60, 72, 84, 96, 108, 120], num_samples, p=[0.05, 0.1, 0.2, 0.2, 0.2, 0.1, 0.05, 0.05, 0.025, 0.025]),
'MaritalStatus': np.random.choice(['Single', 'Married', 'Divorced', 'Widowed'], num_samples, p=[0.3, 0.5, 0.15, 0.05]),
'NumberOfDependents': np.random.choice([0, 1, 2, 3, 4, 5], num_samples, p=[0.3, 0.25, 0.2, 0.15, 0.07, 0.03]),
'HomeOwnershipStatus': np.random.choice(['Own', 'Rent', 'Mortgage', 'Other'], num_samples, p=[0.2, 0.3, 0.4, 0.1]),
'MonthlyDebtPayments': np.random.lognormal(6, 0.5, num_samples).astype(int),
'CreditCardUtilizationRate': np.random.beta(2, 5, num_samples),
'NumberOfOpenCreditLines': np.random.poisson(3, num_samples).clip(0, 15).astype(int),
'NumberOfCreditInquiries': np.random.poisson(1, num_samples).clip(0, 10).astype(int),
'DebtToIncomeRatio': np.random.beta(2, 5, num_samples),
'BankruptcyHistory': np.random.choice([0, 1], num_samples, p=[0.95, 0.05]),
'LoanPurpose': np.random.choice(['Home', 'Auto', 'Education', 'Debt Consolidation', 'Other'], num_samples, p=[0.3, 0.2, 0.15, 0.25, 0.1]),
'PreviousLoanDefaults': np.random.choice([0, 1], num_samples, p=[0.9, 0.1]),
'PaymentHistory': np.random.poisson(24, num_samples).clip(0, 60).astype(int),
'LengthOfCreditHistory': np.random.randint(1, 30, num_samples),
'SavingsAccountBalance': np.random.lognormal(8, 1, num_samples).astype(int),
'CheckingAccountBalance': np.random.lognormal(7, 1, num_samples).astype(int),
'TotalAssets': np.random.lognormal(11, 1, num_samples).astype(int),
'TotalLiabilities': np.random.lognormal(10, 1, num_samples).astype(int),
'MonthlyIncome': annual_income / 12,
'UtilityBillsPaymentHistory': np.random.beta(8, 2, num_samples),
'JobTenure': np.random.poisson(5, num_samples).clip(0, 40).astype(int),
}

# Create DataFrame
df = pd.DataFrame(data)

# Ensure TotalAssets is always greater than or equal to the sum of SavingsAccountBalance and CheckingAccountBalance
df['TotalAssets'] = np.maximum(df['TotalAssets'], df['SavingsAccountBalance'] + df['CheckingAccountBalance'])

# Add more complex derived features
min_net_worth = 1000 # Set a minimum net worth
df['NetWorth'] = np.maximum(df['TotalAssets'] - df['TotalLiabilities'], min_net_worth)

# More realistic interest rate based on credit score, loan amount, and loan duration
df['BaseInterestRate'] = 0.03 + (850 - df['CreditScore']) / 2000 + df['LoanAmount'] / 1000000 + df['LoanDuration'] / 1200
df['InterestRate'] = df['BaseInterestRate'] * (1 + np.random.normal(0, 0.1, num_samples)).clip(0.8, 1.2)

df['MonthlyLoanPayment'] = (df['LoanAmount'] * (df['InterestRate']/12)) / (1 - (1 + df['InterestRate']/12)**(-df['LoanDuration']))
df['TotalDebtToIncomeRatio'] = (df['MonthlyDebtPayments'] + df['MonthlyLoanPayment']) / df['MonthlyIncome']

# Create a more complex loan approval rule
def loan_approval_rule(row):
score = 0
score += (row['CreditScore'] - 600) / 250 # Credit score factor
score += (100000 - row['AnnualIncome']) / 100000 # Income factor
score += (row['TotalDebtToIncomeRatio'] - 0.4) * 2 # DTI factor
score += (row['LoanAmount'] - 10000) / 90000 # Loan amount factor
score += (row['InterestRate'] - 0.05) * 10 # Interest rate factor
score += 0.5 if row['BankruptcyHistory'] == 1 else 0 # Bankruptcy penalty
score += 0.3 if row['PreviousLoanDefaults'] == 1 else 0 # Previous default penalty
score += 0.2 if row['EmploymentStatus'] == 'Unemployed' else 0 # Employment status factor
score -= 0.1 if row['HomeOwnershipStatus'] in ['Own', 'Mortgage'] else 0 # Home ownership factor
score -= row['PaymentHistory'] / 120 # Payment history factor
score -= row['LengthOfCreditHistory'] / 60 # Length of credit history factor
score -= row['NetWorth'] / 500000 # Net worth factor

# Age factor (slight preference for middle-aged applicants)
score += abs(row['Age'] - 40) / 100

# Experience factor
score -= row['Experience'] / 200

# Education factor
edu_score = {'High School': 0.2, 'Associate': 0.1, 'Bachelor': 0, 'Master': -0.1, 'Doctorate': -0.2}
score += edu_score[row['EducationLevel']]

# Seasonal factor (higher approval rates in spring/summer)
month = row['ApplicationDate'].month
score -= 0.1 if 3 <= month <= 8 else 0

# Random factor to add some unpredictability
score += np.random.normal(0, 0.1)

return 1 if score < 1 else 0 # Adjust this threshold to change overall approval rate

df['LoanApproved'] = df.apply(loan_approval_rule, axis=1)

# Add some noise and outliers
noise_mask = np.random.choice([True, False], num_samples, p=[0.01, 0.99])
df.loc[noise_mask, 'AnnualIncome'] = (df.loc[noise_mask, 'AnnualIncome'] * np.random.uniform(1.5, 2.0, noise_mask.sum())).astype(int)

low_net_worth_mask = df['NetWorth'] == min_net_worth
df.loc[low_net_worth_mask, 'NetWorth'] += np.random.randint(0, 10000, size=low_net_worth_mask.sum())

# Print some statistics
print(f"Loan Approval Rate: {df['LoanApproved'].mean():.2%}")
print(f"Average Credit Score: {df['CreditScore'].mean():.0f}")
print(f"Average Annual Income: ${df['AnnualIncome'].mean():.0f}")
print(f"Average Loan Amount: ${df['LoanAmount'].mean():.0f}")
print(f"Average Total Debt-to-Income Ratio: {df['TotalDebtToIncomeRatio'].mean():.2f}")
print(f"Average Interest Rate: {df['InterestRate'].mean():.2%}")

def assign_credit_score_risk(credit_score):
if credit_score >= 750: return 1
elif 700 <= credit_score < 750: return 2
elif 650 <= credit_score < 700: return 3
elif 600 <= credit_score < 650: return 4
else: return 5

def assign_dti_risk(dti):
if dti < 0.20: return 1
elif 0.20 <= dti < 0.30: return 2
elif 0.30 <= dti < 0.40: return 3
elif 0.40 <= dti < 0.50: return 4
else: return 5

def assign_payment_history_risk(payment_history):
if payment_history >= 99: return 1
elif 97 <= payment_history < 99: return 2
elif 95 <= payment_history < 97: return 3
elif 90 <= payment_history < 95: return 4
else: return 5

def assign_bankruptcy_risk(bankruptcy_history):
return 5 if bankruptcy_history else 1

def assign_previous_defaults_risk(previous_defaults):
if previous_defaults == 0: return 1
elif previous_defaults == 1: return 3
else: return 5

def assign_utilization_risk(utilization):
if utilization < 0.20: return 1
elif 0.20 <= utilization < 0.40: return 2
elif 0.40 <= utilization < 0.60: return 3
elif 0.60 <= utilization < 0.80: return 4
else: return 5

def assign_credit_history_risk(length_of_history):
if length_of_history >= 10: return 1
elif 7 <= length_of_history < 10: return 2
elif 5 <= length_of_history < 7: return 3
elif 3 <= length_of_history < 5: return 4
else: return 5

def assign_income_risk(annual_income):
if annual_income >= 120000: return 1
elif 80000 <= annual_income < 120000: return 2
elif 50000 <= annual_income < 80000: return 3
elif 30000 <= annual_income < 50000: return 4
else: return 5

def assign_employment_risk(employment_status):
if employment_status == 'Employed': return 1
elif employment_status == 'Self-employed': return 2
elif employment_status == 'Part-time': return 3
else: return 4 # Unemployed or other

def assign_net_worth_risk(net_worth):
if net_worth >= 500000: return 1
elif 250000 <= net_worth < 500000: return 2
elif 100000 <= net_worth < 250000: return 3
elif 50000 <= net_worth < 100000: return 4
else: return 5

# Refined overall risk calculation
def calculate_overall_risk(row):
base_score = (
assign_credit_score_risk(row['CreditScore']) * 3 +
assign_dti_risk(row['DebtToIncomeRatio']) * 2 +
assign_payment_history_risk(row['PaymentHistory']) * 2 +
assign_bankruptcy_risk(row['BankruptcyHistory']) * 3 +
assign_previous_defaults_risk(row['PreviousLoanDefaults']) * 3 +
assign_utilization_risk(row['CreditCardUtilizationRate']) +
assign_credit_history_risk(row['LengthOfCreditHistory']) +
assign_income_risk(row['AnnualIncome']) +
assign_employment_risk(row['EmploymentStatus']) +
assign_net_worth_risk(row['NetWorth']) * 2
)

# Adjust score based on loan approval status
if row['LoanApproved'] == 1: # Assuming 1 means approved
base_score *= 0.8 # Reduce risk score for approved loans

return base_score

# Apply the refined risk calculation
df['RiskScore'] = df.apply(calculate_overall_risk, axis=1)

# Save to CSV
df.to_csv('focused_synthetic_loan_data.csv', index=False)
print("\nFocused synthetic data saved to 'focused_synthetic_loan_data.csv'")

# Display final feature count
print(f"\nTotal number of features (including label): {len(df.columns)}")
print("\nFeatures:")
for column in df.columns:
print(f"- {column}")
```