https://github.com/SimranShaikh20/Credit-Card-fraud-Detection
Fraud detection using machine learning
- Host: GitHub
- URL: https://github.com/SimranShaikh20/Credit-Card-fraud-Detection
- Owner: SimranShaikh20
- Created: 2024-06-29T11:31:31.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-10-16T13:23:50.000Z (over 1 year ago)
- Last Synced: 2024-10-18T10:15:39.781Z (over 1 year ago)
- Language: Jupyter Notebook
- Size: 295 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Hackathon_Project
Fraudulent transaction detection using machine learning.
The Kaggle dataset is available here:
https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
## Project Aims
This project aims to convey several key points:
1. **Practical Application of Machine Learning**:
Demonstrates a real-world application of machine learning in financial security, specifically in detecting credit card fraud.
2. **Handling Imbalanced Datasets**:
Showcases how to deal with imbalanced datasets, a common challenge in fraud detection. The code uses undersampling of the majority class (legitimate transactions) to balance the dataset.
3. **Basic Machine Learning Workflow**:
Illustrates the fundamental steps in a machine learning project:
- Data loading and preprocessing
- Splitting data into features and target
- Dividing data into training and testing sets
- Model selection, training, and evaluation
4. **Use of Popular Data Science Libraries**:
Demonstrates the use of common Python libraries for data science and machine learning:
- pandas for data manipulation
- scikit-learn for machine learning tasks
- numpy for numerical operations
5. **Simple Model Implementation**:
Uses Logistic Regression, a straightforward and interpretable model, as a starting point for fraud detection.
6. **Model Evaluation**:
Shows how to evaluate a model's performance using accuracy scores on both training and test data.
7. **Interactive Web Application**:
Integrates with Streamlit to create a simple web interface for the model, allowing users to input data and receive predictions.
8. **Reproducibility**:
Includes a link to the dataset and provides the code, emphasizing reproducibility in data science.
9. **Potential for Expansion**:
While the current implementation is basic, it provides a foundation that can be built upon with more advanced techniques.
10. **Importance of Fraud Detection**:
Highlights the significance of fraud detection in the financial sector, addressing a real-world problem that affects many people and businesses.
### Data Preprocessing
1. **Data Loading**:
- The dataset is loaded from 'creditcard.csv' using pandas:
```python
import pandas as pd
data = pd.read_csv("creditcard.csv")
```
2. **Class Separation**:
- Legitimate and fraudulent transactions are separated:
```python
# Split rows by label: Class 0 = legitimate, Class 1 = fraudulent
legit = data[data['Class'] == 0]
fraud = data[data['Class'] == 1]
```
- This separation allows for analysis of class imbalance.
3. **Handling Class Imbalance**:
- Undersampling of the majority class (legitimate transactions) is performed:
```python
legit_s = legit.sample(n=len(fraud), random_state=2)
data = pd.concat([legit_s, fraud], axis=0)
```
- This creates a balanced dataset for training, with as many legitimate rows as fraudulent ones.
4. **Feature and Target Separation**:
- Features (x) and the target variable (y) are split from the balanced dataset, so the undersampling step must run first:
```python
x = data.drop('Class', axis=1)
y = data['Class']
```
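The class separation and undersampling steps above can be tried without downloading the Kaggle file. The following is a minimal sketch on a small synthetic DataFrame; the `toy` frame, its `Amount` values, and the 980/20 class split are assumptions made for illustration and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for creditcard.csv (hypothetical data):
# 1,000 rows, of which 20 are labelled as fraud (Class == 1).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Amount": rng.uniform(1, 500, size=1000),
    "Class": [1] * 20 + [0] * 980,
})

# Separate the two classes, as in the README.
legit = toy[toy["Class"] == 0]
fraud = toy[toy["Class"] == 1]

# Undersample the majority class down to the size of the minority class.
legit_s = legit.sample(n=len(fraud), random_state=2)
balanced = pd.concat([legit_s, fraud], axis=0)

# Class counts before and after balancing.
before = toy["Class"].value_counts().to_dict()
after = balanced["Class"].value_counts().to_dict()
print("before:", before)
print("after:", after)
```

In this sketch the counts go from 980 legitimate and 20 fraudulent rows to 20 of each. The trade-off of undersampling is that most legitimate rows are discarded, so the model sees only a small slice of normal behaviour.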
### Model Training
1. **Train-Test Split**:
- Data is split into training and testing sets:
```python
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=2)
```
- 20% of data is reserved for testing.
- Stratification ensures that the class distribution is maintained in both sets.
2. **Model Selection**:
- Logistic Regression is chosen as the classification algorithm:
```python
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
```
- This is a good baseline model for binary classification tasks.
3. **Model Training**:
- The model is trained on the training data:
```python
model.fit(x_train, y_train)
```
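The three training steps above fit together as in the sketch below. It runs on synthetic data from `make_classification` rather than the Kaggle file, and it raises `max_iter` above the scikit-learn default of 100; both choices are assumptions for illustration and are not part of the original code. A higher `max_iter` may help avoid convergence warnings when features are not scaled.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, roughly balanced stand-in for the undersampled dataset (hypothetical data).
x, y = make_classification(n_samples=400, n_features=10, random_state=2)

# Hold out 20% for testing; stratify=y keeps the class ratio similar in both splits.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=2
)

# max_iter=1000 is an assumption, not taken from the original README.
model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)

print("train shape:", x_train.shape, "test shape:", x_test.shape)
print("coefficient shape:", model.coef_.shape)
```

After fitting, `model.coef_` holds one weight per feature, which is what makes Logistic Regression comparatively easy to inspect.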
### Model Evaluation
1. **Accuracy Calculation**:
- The model's performance is evaluated using accuracy scores:
```python
from sklearn.metrics import accuracy_score
# accuracy_score expects (y_true, y_pred)
train_acc = accuracy_score(y_train, model.predict(x_train))
test_acc = accuracy_score(y_test, model.predict(x_test))
```
- Both training and testing accuracies are calculated to assess overfitting.
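To make the overfitting check concrete, the sketch below computes both accuracies and the gap between them. It again uses synthetic data and `max_iter=1000`, which are assumptions for illustration; the `gap` variable is a hypothetical name and does not appear in the original code. Accuracy is symmetric in its two arguments, but scikit-learn documents the order as true labels first, then predictions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical); the README describes using
# the undersampled Kaggle dataset at this point.
x, y = make_classification(n_samples=400, n_features=10, random_state=2)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=2
)
model = LogisticRegression(max_iter=1000).fit(x_train, y_train)

# accuracy_score takes the true labels first, then the predictions.
train_acc = accuracy_score(y_train, model.predict(x_train))
test_acc = accuracy_score(y_test, model.predict(x_test))

# A large positive gap suggests the model fits the training set
# noticeably better than unseen data.
gap = train_acc - test_acc
print(f"train={train_acc:.3f} test={test_acc:.3f} gap={gap:+.3f}")
```

A small gap in either direction is expected from sampling noise alone, so it is the size of the gap, not its mere existence, that hints at overfitting.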
### Conclusion
This implementation provides a solid foundation for credit card fraud detection. The use of undersampling to balance the dataset and Logistic Regression as the classification algorithm offers a good starting point. The separate calculation of training and testing accuracies allows for basic assessment of model generalization.

This project serves as an introductory example of applying machine learning to a critical financial problem, demonstrating how relatively simple techniques can be used to approach complex real-world issues. It provides a starting point for understanding and implementing fraud detection systems.
Thank you!