{"id":15830054,"url":"https://github.com/imoxto/email-spam-detection","last_synced_at":"2025-04-01T11:13:44.170Z","repository":{"id":103506687,"uuid":"522163816","full_name":"imoxto/email-spam-detection","owner":"imoxto","description":null,"archived":false,"fork":false,"pushed_at":"2022-09-12T06:55:54.000Z","size":3225,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-10-12T11:11:58.111Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/imoxto.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-08-07T09:03:05.000Z","updated_at":"2022-09-12T10:46:01.000Z","dependencies_parsed_at":null,"dependency_job_id":"96dc9d24-13b2-447d-811a-34483fa092ed","html_url":"https://github.com/imoxto/email-spam-detection","commit_stats":{"total_commits":44,"total_committers":4,"mean_commits":11.0,"dds":0.06818181818181823,"last_synced_commit":"e0ac71f6092f3e1f441dedb90c7c720334c9f4f1"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/imoxto%2Femail-spam-detection","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/imoxto%2Femail-spam-detection/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/imoxto%2Femail-spam-detection/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/imoxto%2Femail-spam-detection
/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/imoxto","download_url":"https://codeload.github.com/imoxto/email-spam-detection/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246628229,"owners_count":20808106,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-05T11:04:25.775Z","updated_at":"2025-04-01T11:13:44.140Z","avatar_url":"https://github.com/imoxto.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Email Spam Detection\n### Dataset\n\n\n```python\n\nimport numpy as np\nimport pandas as pd\n\n# data from https://www.kaggle.com/datasets/rockinjas123/spam-ham-emails\ndata = pd.read_csv('emails.csv')\n\ndata\n```\n\n\n\n\n\u003cdiv\u003e\n\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003etext\u003c/th\u003e\n      \u003cth\u003espam\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e0\u003c/th\u003e\n      \u003ctd\u003eSubject: naturally irresistible your corporate...\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e1\u003c/th\u003e\n      \u003ctd\u003eSubject: the stock trading gunslinger  fanny i...\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2\u003c/th\u003e\n      
\u003ctd\u003eSubject: unbelievable new homes made easy  im ...\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e3\u003c/th\u003e\n      \u003ctd\u003eSubject: 4 color printing special  request add...\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e4\u003c/th\u003e\n      \u003ctd\u003eSubject: do not have money , get software cds ...\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e...\u003c/th\u003e\n      \u003ctd\u003e...\u003c/td\u003e\n      \u003ctd\u003e...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e5723\u003c/th\u003e\n      \u003ctd\u003eSubject: re : research and development charges...\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e5724\u003c/th\u003e\n      \u003ctd\u003eSubject: re : receipts from visit  jim ,  than...\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e5725\u003c/th\u003e\n      \u003ctd\u003eSubject: re : enron case study update  wow ! a...\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e5726\u003c/th\u003e\n      \u003ctd\u003eSubject: re : interest  david ,  please , call...\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e5727\u003c/th\u003e\n      \u003ctd\u003eSubject: news : aurora 5 . 2 update  aurora ve...\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e5728 rows × 2 columns\u003c/p\u003e\n\u003c/div\u003e\n\n\n\n### Data Pre-Processing\n\n\n```python\nprint(f\"Rows Before Pre-Pocessing: {len(data.index)}\")\n\n# drops duplicate rows. 
no need to have too many rows of the same values\ndata.drop_duplicates(inplace=True)\n\n# with only 2 columns, a row with a null in either one serves no purpose, so drop it\ndata.dropna(axis='index', inplace=True)\n\nprint(f\"Rows After Pre-Pocessing: {len(data.index)}\")\n```\n\n    Rows Before Pre-Pocessing: 5728\n    Rows After Pre-Pocessing: 5695\n\n\n\n```python\nfrom sklearn.model_selection import train_test_split\n\nx = data.text.values\ny = data.spam.values\nxTrain, xTest, yTrain, yTest = train_test_split(x, y, train_size = 0.8, stratify = y)\n```\n\n\n```python\nfrom sklearn.feature_extraction.text import CountVectorizer\n\ncv = CountVectorizer()\n\n# convert words to numbers by counting word frequencies, e.g. \"i hate spam. spam is bad\" -\u003e hate = 1, spam = 2, is = 1, bad = 1 (the default tokenizer drops one-letter tokens such as \"i\")\nxTrain_cv = cv.fit_transform(xTrain)\nxTest_cv = cv.transform(xTest)\n```\n\n##### Correlation\n\n\n```python\n\ncvDf = pd.DataFrame(xTrain_cv.todense(), columns=cv.get_feature_names_out())\ncvDf['spam@unique'] = yTrain\ncorrelation = cvDf.corrwith(cvDf['spam@unique'])\ncorrelation\n```\n\n\n\n\n    00            -0.044479\n    000            0.095617\n    0000           0.123365\n    000000         0.026358\n    00000000      -0.013604\n                     ...   
 \n    zzmacmac      -0.008329\n    zzn            0.026358\n    zzncacst      -0.008329\n    zzzz           0.042336\n    spam@unique    1.000000\n    Length: 33471, dtype: float64\n\n\n\n\n```python\nimport matplotlib.pyplot as plt\n\nN, bins, patches  = plt.hist(correlation)\n\n\npatches[1].set_facecolor('blue')\npatches[2].set_facecolor('green')\npatches[3].set_facecolor('red')\n# rest are default colors\n\nplt.xlabel('Correlation')\nplt.ylabel('Count of columns')\nplt.title('Count of columns in each correlation group')\n\nplt.show()\n```\n\n\n    \n![png](esd_files/esd_8_0.png)\n    \n\n\nAlmost all of the word frequencies (x-variables) have close to zero linear correlation with whether the email is spam or not (y), so no single word is a strong predictor on its own. This suggests that the x-variables are not independent of each other and are most likely dependent on one another.\n\n##### Scaling\n\n\n```python\nfrom sklearn.preprocessing import MaxAbsScaler\n\nmas = MaxAbsScaler()\nmas.fit(xTrain_cv)\nxTrainScaled = mas.transform(xTrain_cv)\nxTestScaled = mas.transform(xTest_cv)\n# each column is divided by its maximum absolute value from the training data, so the training counts end up between 0 and 1\n```\n\n### Modelling Data\n\n\n```python\ndef modelData(model, scaled=True):\n  x = xTrainScaled if scaled else xTrain_cv\n  y = yTrain\n  model.fit(x,y)\n  return model\n```\n\n\n```python\nfrom sklearn.metrics import precision_score, f1_score, recall_score, roc_curve, roc_auc_score, ConfusionMatrixDisplay\n\n\ndef display_score(trained_model, scaled = True, probability = True):\n  x = xTestScaled if scaled else xTest_cv\n  y = yTest\n  y_pred = trained_model.predict(x)\n  name = str(type(trained_model).__name__)\n  name += \" \u003cScaled Data\u003e\" if scaled else \"\"\n\n  acc = trained_model.score(x, y)\n  prec = precision_score(y, y_pred)\n  f1scre = f1_score(y, y_pred)\n  recal = recall_score(y, y_pred)\n  \n  print(f\"Accuracy: {acc}\\nPrecision: {prec}\\nF1 Score: {f1scre}\\nRecall Score: {recal}\")\n\n  fpr, tpr = None, None\n  if probability:\n    y_score = 
trained_model.predict_proba(x)\n    y_score = y_score[:, 1]\n    rocAuc = roc_auc_score(y, y_score)\n    fpr, tpr, _ = roc_curve(y, y_score)\n    print(f\"ROC AUC score: {rocAuc}\")\n  print()\n  mat = ConfusionMatrixDisplay.from_predictions( y, y_pred)\n  plt.title(f\"Confusion matrix for {name}\")\n  plt.show()\n  print()\n  return {\"name\":name , \"acc\": acc, \"prec\":prec, \"f1scre\":f1scre, \"recal\":recal, \"fpr\":fpr, \"tpr\":tpr, \"mat\": mat}\n```\n\n\n```python\ncustomTests = [\n  \"Hello sir! When is the deadline for CSE422 project report submission?\",\n  \"Dear Students, The University is happy to offer a student transport service and wishes to keep the price as low as feasible while covering the cost of the service. You all know that the price of fuel has increased significantly: the price of diesel has increased by 42.5% (Previous price- Tk.80, New price Tk.114).Bus fares have already been raised across the country. Despite the increase in the fuel price, the University will keep the student transport fares unchanged for the remainder of the current semester. There will be a need, however, to increase the fare to Tk. 90 from Tk.70 with effect from the start of the fall semester. The University hopes that you will understand the pressures that have led to this change. Best regards, Office of the Registrar\",\n  \"Click here to get free discord nitro\"\n  ]\ncustomTests_cv = cv.transform(customTests)\ncustomTestsScaled = mas.transform(customTests_cv)\n\ndef getCustomTestResults(model, scaled=True):\n\n  results = model.predict(customTestsScaled if scaled else customTests_cv)\n  assert len(results) == len(customTests), f\"length of results, {len(results)} and tests, {len(customTests)} are unequal\"\n  print(\"Custom string results:\")\n  for i in range(len(results)):\n    spam = \"spam\" if results[i] == 1 else \"ok\"\n    if len(customTests[i]) \u003e 80:\n      print(f\"{spam}  --\u003e  \\\"{customTests[i][0:35]} ... 
{customTests[i][-35:]}\\\"\")\n    else:\n      print(f\"{spam}  --\u003e  \\\"{customTests[i]}\\\"\")\n```\n\n##### Multinomial Naive Bayes\n\n\n```python\nfrom sklearn.naive_bayes import MultinomialNB as MNB\n\n# scaled\nmnbModelScaled = modelData(MNB())\n\nmnbS = display_score(mnbModelScaled)\n\ngetCustomTestResults(mnbModelScaled)\n```\n\n    Accuracy: 0.9780509218612818\n    Precision: 1.0\n    F1 Score: 0.9521988527724665\n    Recall Score: 0.9087591240875912\n    ROC AUC score: 0.9949917724990506\n    \n\n\n\n    \n![png](esd_files/esd_16_1.png)\n    \n\n\n    \n    Custom string results:\n    ok  --\u003e  \"Hello sir! When is the deadline for CSE422 project report submission?\"\n    ok  --\u003e  \"Dear Students, The University is ha ... st regards, Office of the Registrar\"\n    ok  --\u003e  \"Click here to get free discord nitro\"\n\n\n\n```python\nmnbModel = modelData(MNB(), False)\n\nmnb = display_score(mnbModel, False)\n\ngetCustomTestResults(mnbModel, False)\n```\n\n    Accuracy: 0.9885864793678666\n    Precision: 0.9851301115241635\n    F1 Score: 0.9760589318600369\n    Recall Score: 0.9671532846715328\n    ROC AUC score: 0.9965275726762582\n    \n\n\n\n    \n![png](esd_files/esd_17_1.png)\n    \n\n\n    \n    Custom string results:\n    ok  --\u003e  \"Hello sir! When is the deadline for CSE422 project report submission?\"\n    ok  --\u003e  \"Dear Students, The University is ha ... 
st regards, Office of the Registrar\"\n    spam  --\u003e  \"Click here to get free discord nitro\"\n\n\n##### Support Vector Classifier\n\n\n```python\nfrom sklearn.svm import SVC\n\n# scaled\nsvcModelScaled = modelData( SVC(kernel=\"linear\", probability = True) )\n\nsvcS = display_score(svcModelScaled, probability= True )\n\ngetCustomTestResults(svcModelScaled )\n```\n\n    Accuracy: 0.9648814749780509\n    Precision: 0.9717741935483871\n    F1 Score: 0.9233716475095787\n    Recall Score: 0.8795620437956204\n    ROC AUC score: 0.9945107801358593\n    \n\n\n\n    \n![png](esd_files/esd_19_1.png)\n    \n\n\n    \n    Custom string results:\n    ok  --\u003e  \"Hello sir! When is the deadline for CSE422 project report submission?\"\n    ok  --\u003e  \"Dear Students, The University is ha ... st regards, Office of the Registrar\"\n    ok  --\u003e  \"Click here to get free discord nitro\"\n\n\n\n```python\nsvcModel = modelData( SVC(kernel=\"linear\", probability=True) , False)\n\nsvc = display_score(svcModel, False, True)\n\ngetCustomTestResults(svcModel, False)\n```\n\n    Accuracy: 0.9877085162423178\n    Precision: 0.9814814814814815\n    F1 Score: 0.9742647058823529\n    Recall Score: 0.9671532846715328\n    ROC AUC score: 0.9964220918948568\n    \n\n\n\n    \n![png](esd_files/esd_20_1.png)\n    \n\n\n    \n    Custom string results:\n    ok  --\u003e  \"Hello sir! When is the deadline for CSE422 project report submission?\"\n    ok  --\u003e  \"Dear Students, The University is ha ... 
st regards, Office of the Registrar\"\n    spam  --\u003e  \"Click here to get free discord nitro\"\n\n\n##### Random Forest Classifier\n\n\n```python\nfrom sklearn.ensemble import RandomForestClassifier as RFC\n\n# scaled\nrfcModelScaled = modelData(RFC(n_estimators=50))\n\nrfcS = display_score(rfcModelScaled)\n\ngetCustomTestResults(rfcModelScaled)\n```\n\n    Accuracy: 0.9675153643546971\n    Precision: 1.0\n    F1 Score: 0.9275929549902152\n    Recall Score: 0.864963503649635\n    ROC AUC score: 0.9982785536475255\n    \n\n\n\n    \n![png](esd_files/esd_22_1.png)\n    \n\n\n    \n    Custom string results:\n    ok  --\u003e  \"Hello sir! When is the deadline for CSE422 project report submission?\"\n    ok  --\u003e  \"Dear Students, The University is ha ... st regards, Office of the Registrar\"\n    ok  --\u003e  \"Click here to get free discord nitro\"\n\n\n\n```python\nrfcModel = modelData(RFC(n_estimators=50), False)\n\nrfc = display_score(rfcModel, False)\n\ngetCustomTestResults(rfcModel, False)\n```\n\n    Accuracy: 0.9648814749780509\n    Precision: 1.0\n    F1 Score: 0.9212598425196851\n    Recall Score: 0.8540145985401459\n    ROC AUC score: 0.9989304248765876\n    \n\n\n\n    \n![png](esd_files/esd_23_1.png)\n    \n\n\n    \n    Custom string results:\n    ok  --\u003e  \"Hello sir! When is the deadline for CSE422 project report submission?\"\n    ok  --\u003e  \"Dear Students, The University is ha ... 
st regards, Office of the Registrar\"\n    spam  --\u003e  \"Click here to get free discord nitro\"\n\n\n### Results\n\n\n```python\n# create data\ndef algoResArray(algo):\n  return [ algo[\"name\"], algo[\"acc\"], algo[\"prec\"], algo[\"f1scre\"], algo[\"recal\"] ]\n\ndef displayScore(algo1, algo2):\n  df = pd.DataFrame([\n    algoResArray(algo1), \n    algoResArray(algo2),\n  ],\n\n  columns=[ 'Algorithm', \"accuracy\", \"precision\", \"f1 score\", \"recall\" ])\n\n  df.plot(\n    x='Algorithm',\n    kind='bar',\n    stacked=False,\n    title='Algorithm score comparison',\n    ylim=(0.8,1)\n  )\n\ndisplayScore(mnb, mnbS)\ndisplayScore(svc, svcS)\ndisplayScore(rfc, rfcS)\n```\n\n\n    \n![png](esd_files/esd_25_0.png)\n    \n\n\n\n    \n![png](esd_files/esd_25_1.png)\n    \n\n\n\n    \n![png](esd_files/esd_25_2.png)\n    \n\n\n##### ROC-Curve\n\n\n```python\nplt.plot(mnb[\"fpr\"], mnb[\"tpr\"], label= \"Multinomial Naive Bayes\")\n\nplt.plot(svc[\"fpr\"], svc[\"tpr\"], label= \"Support Vector Classifier\")\n\nplt.plot(rfc[\"fpr\"], rfc[\"tpr\"], label= \"Random Forest Classifier\")\n\nplt.title('ROC Curves for different algorithms')\nplt.xlabel('False Positive Rate')\nplt.ylabel('True Positive Rate')\nplt.legend()\nplt.show()\n```\n\n\n    \n![png](esd_files/esd_27_0.png)\n    \n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fimoxto%2Femail-spam-detection","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fimoxto%2Femail-spam-detection","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fimoxto%2Femail-spam-detection/lists"}