{"id":15664271,"url":"https://github.com/deep-diver/enron-data-analysis","last_synced_at":"2026-01-08T13:06:43.904Z","repository":{"id":82125095,"uuid":"122282544","full_name":"deep-diver/Enron-Data-Analysis","owner":"deep-diver","description":"Data Analysis and Machine Learning on Enron Data","archived":false,"fork":false,"pushed_at":"2018-10-21T08:05:46.000Z","size":1084,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-05T06:25:57.391Z","etag":null,"topics":["data-analysis","enron-data","exploratory-data-analysis","machine-learning"],"latest_commit_sha":null,"homepage":null,"language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/deep-diver.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-02-21T02:15:19.000Z","updated_at":"2019-07-15T08:05:25.000Z","dependencies_parsed_at":"2023-03-12T10:30:49.022Z","dependency_job_id":null,"html_url":"https://github.com/deep-diver/Enron-Data-Analysis","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deep-diver%2FEnron-Data-Analysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deep-diver%2FEnron-Data-Analysis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deep-diver%2FEnron-Data-Analysis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deep-diver%2FEnron-Data-Analysis/manifests","owner_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/deep-diver","download_url":"https://codeload.github.com/deep-diver/Enron-Data-Analysis/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246273554,"owners_count":20750906,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","enron-data","exploratory-data-analysis","machine-learning"],"created_at":"2024-10-03T13:41:53.554Z","updated_at":"2026-01-08T13:06:43.892Z","avatar_url":"https://github.com/deep-diver.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Analysis and Machine Learning on Enron Dataset\n\nThis notebook shows how data analysis can be done on the Enron dataset. The goal of the analysis is to find the machine learning algorithm with the best precision and recall values. Each algorithm's job is to correctly classify POIs (persons of interest) in the dataset. POIs are the people I am interested in, since I think they are strongly related to the [Enron Scandal](https://en.wikipedia.org/wiki/Enron_scandal). POIs were chosen manually and are provided by Udacity's \"[Intro to Machine Learning](https://www.udacity.com/course/intro-to-machine-learning--ud120)\" course. You can think of this notebook as part of the final project assignment for the course.\n\n# The way it is organized\n\n1. Choose features of my interest\n2. Perform basic data analysis\n3. Find outliers, and remove them when needed\n4. Run various machine learning algorithms\n5. Compare the results\n6. 
Confirm the best result\n\n# Machine Learning Part\n\n1. Run a basic DecisionTree classifier on the raw data\n2. Run a basic DecisionTree classifier on the data with outliers removed\n3. Define a function to measure accuracy, precision, and recall metrics\n4. Define a function to run a Pipeline with SelectKBest and GridSearchCV\n5. Run different kinds of ML algorithms with a number of different parameters\n - Decision Tree Classifier\n - Adaboost Classifier\n - Random Forest Classifier\n - Support Vector Machine Classifier\n - Gaussian Naive Bayes Classifier\n\n# Result\n![F1 Score Result](f1.png)\n\u003cdiv style=\"text-align: center\"\u003e F1 Score Result \u003c/div\u003e\n\u003cbr/\u003e\u003cbr/\u003e\n\n![Accuracy Score Result](acc.png)\n\u003cdiv style=\"text-align: center\"\u003e Accuracy Score Result \u003c/div\u003e\n\n\n# Conclusion\nThe best model I could find is 'Adaboost'. It did the best job with the parameters below.\n\n- feature list: 'poi', 'bonus'\n- algorithm: 'SAMME.R'\n- learning rate: 0.05\n- number of estimators: 30\n\nThe scores are:\n\n- accuracy: 0.827\n- f1: 0.7159\n\nThis model achieved the best f1 score compared to the other models: DecisionTree, Gaussian Naive Bayes, Random Forest, and Support Vector Machine. While having the best f1 score, it uses only 2 features. I think this could mean the model is not overfitted much. 
Furthermore, it also achieved the best accuracy among the models using the same number of features.\n\n# Reference\nconceptual study\n- https://www.udacity.com\n\nprogramming reference\n- https://pandas.pydata.org\n- http://scikit-learn.org/stable/\n\nerror reference\n- https://stackoverflow.com/questions/30442259/why-does-not-gridsearchcv-give-best-score-scikit-learn\n- https://stackoverflow.com/questions/25222123/why-are-the-grid-scores-higher-than-the-score-for-full-training-set-sklearn?rq=1\n- https://stackoverflow.com/questions/26097921/choosing-random-state-for-sklearn-algorithms\n- https://stackoverflow.com/questions/28064634/random-state-pseudo-random-numberin-scikit-learn\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeep-diver%2Fenron-data-analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdeep-diver%2Fenron-data-analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeep-diver%2Fenron-data-analysis/lists"}