Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/amanovishnu/credit-card-draud-detection
This project uses machine learning to detect fraudulent credit card transactions with the Local Outlier Factor and Isolation Forest algorithms. We evaluate performance and visualize the data distribution.
isolation-factor-algorithm local-outlier-factor machine-learning numpy pandas seaborn-plots
Last synced: 14 days ago
- Host: GitHub
- URL: https://github.com/amanovishnu/credit-card-draud-detection
- Owner: amanovishnu
- License: gpl-3.0
- Archived: true
- Created: 2018-12-06T10:19:53.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2018-12-20T18:26:22.000Z (about 6 years ago)
- Last Synced: 2025-01-03T02:57:50.531Z (14 days ago)
- Topics: isolation-factor-algorithm, local-outlier-factor, machine-learning, numpy, pandas, seaborn-plots
- Language: Jupyter Notebook
- Homepage:
- Size: 67 MB
- Stars: 2
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Credit Card Fraud Detection
Throughout the financial sector, machine learning algorithms are being developed to detect fraudulent transactions. In this project, we will do exactly that. Using a dataset of nearly 28,500 credit card transactions and multiple unsupervised anomaly detection algorithms, we will identify transactions with a high probability of being credit card fraud. We will build and deploy the following two machine learning algorithms:
* Local Outlier Factor (LOF)
* Isolation Forest Algorithm

Furthermore, using metrics such as precision, recall, and F1-score, we will investigate why the classification accuracy of these algorithms can be misleading.
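To see why accuracy alone is uninformative here, consider a quick editorial sketch (not part of the original notebook) that uses the fraud/valid counts appearing later in this notebook (49 fraud and 28,432 valid transactions in the 10% sample): a trivial model that labels every transaction as valid already scores over 99.8% accuracy while catching no fraud at all.

```python
# Illustrative sketch (counts assumed from the 10% sample used later in this notebook):
# a "predict everything as valid" baseline looks excellent on accuracy alone.
n_valid, n_fraud = 28432, 49
y_true = [0] * n_valid + [1] * n_fraud
y_pred = [0] * (n_valid + n_fraud)          # trivially predict "valid" for every row

accuracy = float(sum(t == p for t, p in zip(y_true, y_pred))) / len(y_true)
print('Accuracy: {:.4f}'.format(accuracy))  # ~0.9983 despite catching zero fraud
print('Fraud recall: {}'.format(0.0))       # no fraud case is ever flagged
```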
In addition, we will explore the use of data visualization techniques common in data science, such as parameter histograms and correlation matrices, to gain a better understanding of the underlying distribution of data in our data set. Let's get started!
## 1. Importing Necessary Libraries
To start, let's print out the version numbers of all the libraries we will be using in this project. This serves two purposes - it ensures we have installed the libraries correctly and ensures that this tutorial will be reproducible.
```python
import sys
import numpy
import pandas
import matplotlib
import seaborn
import scipy

print('Python: {}'.format(sys.version))
print('Numpy: {}'.format(numpy.__version__))
print('Pandas: {}'.format(pandas.__version__))
print('Matplotlib: {}'.format(matplotlib.__version__))
print('Seaborn: {}'.format(seaborn.__version__))
print('Scipy: {}'.format(scipy.__version__))
```
Python: 2.7.13 |Continuum Analytics, Inc.| (default, May 11 2017, 13:17:26) [MSC v.1500 64 bit (AMD64)]
Numpy: 1.14.0
Pandas: 0.21.0
Matplotlib: 2.1.0
Seaborn: 0.8.1
Scipy: 1.0.0
```python
# import the necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

## 2. The Data Set
In the following cells, we will import our dataset from a .csv file as a Pandas DataFrame. We will then begin exploring it to gain an understanding of the type, quantity, and distribution of the data. For this purpose, we will use Pandas' built-in describe method, as well as parameter histograms and a correlation matrix.
```python
# Load the dataset from the csv file using pandas
data = pd.read_csv('creditcard.csv')
```

```python
# Start exploring the dataset
print(data.columns)
```

Index([u'Time', u'V1', u'V2', u'V3', u'V4', u'V5', u'V6', u'V7', u'V8', u'V9',
u'V10', u'V11', u'V12', u'V13', u'V14', u'V15', u'V16', u'V17', u'V18',
u'V19', u'V20', u'V21', u'V22', u'V23', u'V24', u'V25', u'V26', u'V27',
u'V28', u'Amount', u'Class'],
dtype='object')
```python
# Print the shape of the data
data = data.sample(frac=0.1, random_state = 1)
print(data.shape)
print(data.describe())

# V1 - V28 are the results of a PCA dimensionality reduction to protect user identities and sensitive features
```

(28481, 31)
Time V1 V2 V3 V4 \
count 28481.000000 28481.000000 28481.000000 28481.000000 28481.000000
mean 94705.035216 -0.001143 -0.018290 0.000795 0.000350
std 47584.727034 1.994661 1.709050 1.522313 1.420003
min 0.000000 -40.470142 -63.344698 -31.813586 -5.266509
25% 53924.000000 -0.908809 -0.610322 -0.892884 -0.847370
50% 84551.000000 0.031139 0.051775 0.178943 -0.017692
75% 139392.000000 1.320048 0.792685 1.035197 0.737312
max 172784.000000 2.411499 17.418649 4.069865 16.715537
V5 V6 V7 V8 V9 \
count 28481.000000 28481.000000 28481.000000 28481.000000 28481.000000
mean -0.015666 0.003634 -0.008523 -0.003040 0.014536
std 1.395552 1.334985 1.237249 1.204102 1.098006
min -42.147898 -19.996349 -22.291962 -33.785407 -8.739670
25% -0.703986 -0.765807 -0.562033 -0.208445 -0.632488
50% -0.068037 -0.269071 0.028378 0.024696 -0.037100
75% 0.603574 0.398839 0.559428 0.326057 0.621093
max 28.762671 22.529298 36.677268 19.587773 8.141560
... V21 V22 V23 V24 \
count ... 28481.000000 28481.000000 28481.000000 28481.000000
mean ... 0.004740 0.006719 -0.000494 -0.002626
std ... 0.744743 0.728209 0.645945 0.603968
min ... -16.640785 -10.933144 -30.269720 -2.752263
25% ... -0.224842 -0.535877 -0.163047 -0.360582
50% ... -0.029075 0.014337 -0.012678 0.038383
75% ... 0.189068 0.533936 0.148065 0.434851
max ... 22.588989 6.090514 15.626067 3.944520
V25 V26 V27 V28 Amount \
count 28481.000000 28481.000000 28481.000000 28481.000000 28481.000000
mean -0.000917 0.004762 -0.001689 -0.004154 89.957884
std 0.520679 0.488171 0.418304 0.321646 270.894630
min -7.025783 -2.534330 -8.260909 -9.617915 0.000000
25% -0.319611 -0.328476 -0.071712 -0.053379 5.980000
50% 0.015231 -0.049750 0.000914 0.010753 22.350000
75% 0.351466 0.253580 0.090329 0.076267 78.930000
max 5.541598 3.118588 11.135740 15.373170 19656.530000
Class
count 28481.000000
mean 0.001720
std 0.041443
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
[8 rows x 31 columns]
```python
# Plot histograms of each parameter
data.hist(figsize = (20, 20))
plt.show()
```

![png](output_7_0.png)
```python
# Determine the number of fraud cases in the dataset
Fraud = data[data['Class'] == 1]
Valid = data[data['Class'] == 0]

outlier_fraction = len(Fraud)/float(len(Valid))
print(outlier_fraction)

print('Fraud Cases: {}'.format(len(data[data['Class'] == 1])))
print('Valid Transactions: {}'.format(len(data[data['Class'] == 0])))
```
0.00172341024198
Fraud Cases: 49
Valid Transactions: 28432
```python
# Correlation matrix
corrmat = data.corr()
fig = plt.figure(figsize = (12, 9))
sns.heatmap(corrmat, vmax = .8, square = True)
plt.show()
```

![png](output_9_0.png)
```python
# Get all the columns from the dataFrame
columns = data.columns.tolist()

# Filter the columns to remove data we do not want
columns = [c for c in columns if c not in ["Class"]]

# Store the variable we'll be predicting on
target = "Class"

X = data[columns]
Y = data[target]

# Print shapes
print(X.shape)
print(Y.shape)
```

(28481, 30)
(28481L,)
## 3. Unsupervised Outlier Detection
Now that we have processed our data, we can begin deploying our machine learning algorithms. We will use the following techniques:
**Local Outlier Factor (LOF)**
The anomaly score of each sample is called the Local Outlier Factor. It measures the local deviation of density of a given sample with respect to its neighbors. It is local in that the anomaly score depends on how isolated the object is with respect to the surrounding neighborhood.
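To make the idea concrete, here is a small editorial sketch (not part of the original notebook) on assumed toy 2-D data: the point far from the dense cluster receives a much more negative LOF score and is flagged as an outlier.

```python
# Minimal LOF sketch on synthetic points (illustration only, not the notebook's dataset)
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X_toy = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [0.0, 0.2], [5.0, 5.0]])
lof = LocalOutlierFactor(n_neighbors=2)
print(lof.fit_predict(X_toy))           # -1 marks predicted outliers, e.g. the last point
print(lof.negative_outlier_factor_)     # the isolated point scores far below -1
```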
**Isolation Forest Algorithm**
The IsolationForest 'isolates' observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. Since recursive partitioning can be represented by a tree structure, the number of splits required to isolate a sample is equivalent to the path length from the root node to the terminating node. This path length, averaged over a forest of such random trees, is a measure of normality and our decision function. Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies.
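As a small editorial illustration (again on assumed synthetic data, not the notebook's dataset), a point that is easy to isolate gets a clearly lower decision-function score than points inside a dense cluster:

```python
# Minimal Isolation Forest sketch on synthetic points (illustration only)
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(1)
X_toy = np.vstack([rng.normal(0, 0.5, size=(100, 2)),   # dense cluster of inliers
                   [[6.0, 6.0]]])                       # one easily isolated point
iso = IsolationForest(n_estimators=100, random_state=1).fit(X_toy)
print(iso.decision_function([[6.0, 6.0]]))   # clearly negative: short average path length
print(iso.decision_function([[0.0, 0.0]]))   # near zero or positive: deep inside the cluster
```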
```python
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# define random states
state = 1

# define outlier detection tools to be compared
classifiers = {
    "Isolation Forest": IsolationForest(max_samples=len(X),
                                        contamination=outlier_fraction,
                                        random_state=state),
    "Local Outlier Factor": LocalOutlierFactor(
        n_neighbors=20,
        contamination=outlier_fraction)
}
```

```python
# Fit the model
plt.figure(figsize=(9, 7))
n_outliers = len(Fraud)

for i, (clf_name, clf) in enumerate(classifiers.items()):

    # fit the data and tag outliers
    if clf_name == "Local Outlier Factor":
        y_pred = clf.fit_predict(X)
        scores_pred = clf.negative_outlier_factor_
    else:
        clf.fit(X)
        scores_pred = clf.decision_function(X)
        y_pred = clf.predict(X)

    # Reshape the prediction values to 0 for valid, 1 for fraud.
    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1

    n_errors = (y_pred != Y).sum()

    # Run classification metrics
    print('{}: {}'.format(clf_name, n_errors))
    print(accuracy_score(Y, y_pred))
    print(classification_report(Y, y_pred))
```
Local Outlier Factor: 97
0.9965942207085425
precision recall f1-score support
0 1.00 1.00 1.00 28432
1 0.02 0.02 0.02 49
avg / total 1.00 1.00 1.00 28481
Isolation Forest: 71
0.99750711000316
precision recall f1-score support
0 1.00 1.00 1.00 28432
1 0.28 0.29 0.28 49
avg / total 1.00 1.00 1.00 28481
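Both reports show near-perfect overall accuracy, yet precision and recall for the fraud class (label 1) remain low, which is exactly the misleading-accuracy effect discussed at the start. As a hedged editorial sketch (not part of the original notebook), one way to look past accuracy is to print the confusion matrix for the last fitted classifier, assuming Y and y_pred from the loop above are still in scope:

```python
# Editorial sketch: inspect the confusion matrix instead of relying on accuracy.
# Assumes Y (true labels) and y_pred (from the loop above) are still in scope.
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(Y, y_pred).ravel()
print('True negatives (valid kept valid): {}'.format(tn))
print('False positives (valid flagged as fraud): {}'.format(fp))
print('False negatives (fraud missed): {}'.format(fn))
print('True positives (fraud caught): {}'.format(tp))
```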