{"id":16420145,"url":"https://github.com/geekquad/fraud-detection","last_synced_at":"2025-07-23T09:06:39.382Z","repository":{"id":110635458,"uuid":"274933965","full_name":"geekquad/Fraud-Detection","owner":"geekquad","description":"A Person Of Interest identifier based on ENRON CORPUS data.","archived":false,"fork":false,"pushed_at":"2020-07-22T19:52:11.000Z","size":1148,"stargazers_count":28,"open_issues_count":1,"forks_count":8,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-23T18:44:33.638Z","etag":null,"topics":["enron-dataset","enron-emails","evaluation","feature-selection","fraud-detection","kmeans","outlier-removal","person-of-interest","regression-analysis","validation"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/geekquad.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-06-25T14:13:40.000Z","updated_at":"2025-04-01T10:00:05.000Z","dependencies_parsed_at":"2023-04-01T09:20:12.082Z","dependency_job_id":null,"html_url":"https://github.com/geekquad/Fraud-Detection","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/geekquad/Fraud-Detection","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/geekquad%2FFraud-Detection","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/geekquad%2FFraud-Detection/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/geekquad%2FFraud-Detection/releases","manifests_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/geekquad%2FFraud-Detection/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/geekquad","download_url":"https://codeload.github.com/geekquad/Fraud-Detection/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/geekquad%2FFraud-Detection/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266649176,"owners_count":23962181,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-23T02:00:09.312Z","response_time":66,"last_error":null,"robots_txt_status":null,"robots_txt_updated_at":null,"robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["enron-dataset","enron-emails","evaluation","feature-selection","fraud-detection","kmeans","outlier-removal","person-of-interest","regression-analysis","validation"],"created_at":"2024-10-11T07:27:01.214Z","updated_at":"2025-07-23T09:06:39.353Z","avatar_url":"https://github.com/geekquad.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Fraud-Detection\nThe Enron fraud is the largest case of corporate fraud in American history. Founded in 1985, Enron Corporation went bankrupt by end of 2001 due to widespread corporate fraud and corruption. 
Before its fall, Fortune magazine had named Enron “America’s most innovative company” for six consecutive years.\n\n**Dataset**: \u003ca href=\"https://www.cs.cmu.edu/~./enron/\"\u003e https://www.cs.cmu.edu/~./enron/ \u003c/a\u003e \n\u003chr\u003e\u003c/hr\u003e\n\n## \u003cu\u003e Goal of the Project: \u003c/u\u003e\nThe goal of the project is to go through the thought process of data exploration (learning, cleaning and preparing the data), \nfeature selection/engineering (selecting the features that most influence the target, \ncreating new features that explain the target better than the existing ones, and \nreducing the dimensionality of the data using principal component analysis (PCA)), \nand picking/tuning one of the supervised machine learning algorithms and validating it to get an accurate person of interest identifier model.\n\n## \u003cu\u003e Data Exploration \u003c/u\u003e\nThe features in the data fall into three major types, namely \n- financial features, \n- email features \n- POI labels.\n\nThere are 143 samples with 20 features and a binary classification (\"poi\").\nAmong 146 samples, there are\n- 18 POI and \n- 128 non-POI.\n\n\u003chr\u003e \u003c/hr\u003e\n\n## Optimize Feature Selection/Engineering\nWhile working on the project, I experimented with different features and models. One strategy was to standardize the features and \napply principal component analysis followed by a GaussianNB classifier; another was to use a decision tree classifier, including choosing the \nfeatures by their feature importances and tuning the model.\n\n\u003cimg src=\"https://github.com/geekquad/Fraud-Detection/blob/master/img/feature.png\"\u003e\n\n### Create new features\nFor both strategies I tried creating new features as fractions of almost all financial variables (e.g. fractional bonus \nas the fraction of bonus to total_payments, etc.). 
The logic behind the email features was to compute the fraction of emails sent to POIs \nout of all sent emails, and the fraction of emails received from POIs out of all received emails.\nI ended up using one new feature, fraction_to_POI.\n\u003chr\u003e \u003c/hr\u003e\n\n## \u003cu\u003e Pick and Tune an Algorithm: \u003c/u\u003e\nI experimented with 7 machine learning algorithms:\n- Naive Bayes (GaussianNB)\n- SVC\n- RandomForestClassifier\n- ExtraTreesClassifier\n- AdaBoostClassifier\n- LogisticRegression\n- LinearSVC\n\n### Comparing Classifiers based on cross-validation scores:\n- 1st tier: SVC, RandomForestClassifier\n- 2nd tier: GaussianNB, ExtraTreesClassifier, AdaBoostClassifier\n- 3rd tier: LogisticRegression, LinearSVC\n\n### Tuning the algorithm:\nThe bias-variance tradeoff is one of the key dilemmas in machine learning. High-bias algorithms have little capacity to learn, while high-variance algorithms \ngeneralize poorly to data they have not seen before. A predictive model should be tuned to reach a compromise between the two. The process of changing the parameters of an algorithm is \nalgorithm tuning, and it lets us find the golden mean and the best result. If I don't tune the algorithm well, I don't get the best result I could.\nAn algorithm can be tuned manually by iteratively changing the parameters and tracking the results, or GridSearchCV can be used to do this automatically.\nI tuned the parameters of my decision tree classifier sequentially, parameter by parameter, and got the best F1 using these parameters\n\u003chr\u003e \u003c/hr\u003e\n\n## Validate and Evaluate\n### Usage of Evaluation Metrics\nIn the project I used the F1 score as the key measure of the algorithms' performance. 
It considers both the precision and the recall of the test to compute the score.\nPrecision is the ability of the classifier not to label as positive a sample that is negative.\nRecall is the ability of the classifier to find all positive samples.\nThe F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0.\n\n### Validation Strategy\nValidation is the process of evaluating model performance. A classic mistake is to train the model on too small a data set, or to validate the model on the same data it was trained on.\nThere are a number of strategies for validating a model. One is to split the available data into train and test sets; another is to perform cross-validation: split the data into k bins of equal size, run a learning experiment with one bin held out for testing, repeat this k times, and take the average test result.\n\u003chr\u003e \u003c/hr\u003e\n\n## \u003cu\u003e Conclusions: \u003c/u\u003e\nBefore the start of this project I was completely sure that building a machine learning model is about choosing the right algorithm \nfrom the black box and some magic. Working on the person of interest identifier, I went iteratively through the process \nof data exploration, outlier detection and algorithm tuning, and spent most of the time on data preparation. The model performance rose \nsignificantly after missing value imputation, extra feature creation and feature selection, and less after algorithm tuning, which shows me \nonce again how important it is to fit the model with good data.\nThis experience might be applied to other fraud detection tasks. I think the model could be improved further by \nusing and tuning alternative algorithms like Random Forest.\n\n## Limitations of the study:\nIt’s important to identify and acknowledge the limitations of the study. My conclusions are based only on the provided \ndata set, which represents just 143 persons. 
To establish real causation, I would need to gather all financial and email information \nabout everyone at Enron, which is most probably not possible. Missing email values were imputed with the median, so the modes of the distributions \nof the email features shifted to the medians. The algorithms were tuned sequentially: I changed one parameter to achieve better performance \nand then switched to another parameter, so there is a chance that other parameter combinations might give a better model.\n\n## References:\n- Enron data set: \u003ca href=\"https://www.cs.cmu.edu/~./enron/\"\u003e https://www.cs.cmu.edu/~./enron/ \u003c/a\u003e\n- FindLaw financial data: \u003ca href=\"http://www.findlaw.com\"\u003e http://www.findlaw.com \u003c/a\u003e \n- Visualization of POI: \u003ca href=\"http://www.nytimes.com/packages/html/national/20061023_ENRON_TABLE/index.html\"\u003e http://www.nytimes.com/packages/html/national/20061023_ENRON_TABLE/index.html \u003c/a\u003e\n- Enron on Wikipedia: \u003ca href=\"https://en.wikipedia.org/wiki/Enron\"\u003e https://en.wikipedia.org/wiki/Enron\u003c/a\u003e\n- F1 score on Wikipedia: \u003ca href=\"https://en.wikipedia.org/wiki/F1_score\"\u003e https://en.wikipedia.org/wiki/F1_score \u003c/a\u003e\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgeekquad%2Ffraud-detection","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgeekquad%2Ffraud-detection","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgeekquad%2Ffraud-detection/lists"}