{"id":13856667,"url":"https://github.com/davidsvy/Neural-Scam-Artist","last_synced_at":"2025-07-13T19:32:06.676Z","repository":{"id":63883099,"uuid":"418363284","full_name":"davidsvy/Neural-Scam-Artist","owner":"davidsvy","description":"Web Scraping, Document Deduplication \u0026 GPT-2 Fine-tuning with a newly created scam dataset.","archived":false,"fork":false,"pushed_at":"2021-10-30T15:57:53.000Z","size":193,"stargazers_count":23,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-08-06T03:02:00.348Z","etag":null,"topics":["dataset","deduplication","fine-tuning","fraud","gpt2","huggingface","lsh","minhash","nlp","pytorch","readability","scam","transformer","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/davidsvy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-10-18T05:55:18.000Z","updated_at":"2024-04-19T06:55:34.000Z","dependencies_parsed_at":"2022-11-28T10:51:02.820Z","dependency_job_id":null,"html_url":"https://github.com/davidsvy/Neural-Scam-Artist","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidsvy%2FNeural-Scam-Artist","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidsvy%2FNeural-Scam-Artist/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidsvy%2FNeural-Scam-Artist/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidsvy%2FNeural-Scam-Artist/manifest
s","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/davidsvy","download_url":"https://codeload.github.com/davidsvy/Neural-Scam-Artist/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225912205,"owners_count":17544120,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset","deduplication","fine-tuning","fraud","gpt2","huggingface","lsh","minhash","nlp","pytorch","readability","scam","transformer","web-scraping"],"created_at":"2024-08-05T03:01:08.099Z","updated_at":"2024-11-22T14:30:29.500Z","avatar_url":"https://github.com/davidsvy.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003e\n  \u003cb\u003eNeural Scam Artist\u003c/b\u003e\u003cbr\u003e\n\u003c/h1\u003e\n\n\n\u003cp align=\"center\"\u003e\n      \u003ca href=\"https://www.python.org/\"\u003e\n        \u003cimg src=\"https://img.shields.io/badge/python-3.7-blue.svg\" /\u003e\u003c/a\u003e\n       \u003ca href= \"https://pytorch.org/\"\u003e\n        \u003cimg src=\"https://img.shields.io/badge/PyTorch-1.9-FF0000.svg\" /\u003e\u003c/a\u003e\n       \u003ca href= \"https://github.com/davidsvy/Neural-Scam-Artist/blob/main/LICENSE\"\u003e\n        \u003cimg src=\"https://img.shields.io/badge/license-MIT-white.svg\" /\u003e\u003c/a\u003e\n\u003c/p\u003e\n\nTL;DR\\\nA dataset of scam emails is scraped from an anti-fraud website. The dataset is then deduplicated\nusing MinHash and LSH. 
The deduplicated dataset is used for fine-tuning GPT-2.\n\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/davidsvy/Neural-Scam-Artist/blob/master/assets/comic.jpg?raw=true\" /\u003e\n\u003c/p\u003e\n\n\n\n\u003cp align=\"center\"\u003e\n  Comic stolen from \u003ca href=\"https://www.agent-x.com.au/\"\u003eAgent-X Comics\u003c/a\u003e.\n  \n\u003c/p\u003e\n\n\n\n:book: Table of Contents\n===\n\n\u003c!--ts--\u003e\n  * [➤ Project Description](#project-description)\n    * [➤ Objective](#objective)\n    * [➤ Web Scraper](#web-scraper)\n    * [➤ Deduplication](#deduplication)\n    * [➤ GPT-2](#gpt-2)\n  * [➤ Shared Files](#shared-files)\n  * [➤ Requirements](#requirements)\n  * [➤ Installation](#installation)\n  * [➤ Usage](#usage)\n\u003c!--te--\u003e\n\n\u003ca  id=\"project-description\"\u003e\u003c/a\u003e\n\n:cloud: Project Description\n===\n\n\u003ca  id=\"objective\"\u003e\u003c/a\u003e\nObjective\n---\n\nThe goal of this project is to create a new dataset of fraudulent emails that can advance\nresearch on intelligent email assistants.\n\n\u003ca  id=\"web-scraper\"\u003e\u003c/a\u003e\nWeb Scraper\n---\n\nData is scraped from the website [https://antifraudintl.org/](https://antifraudintl.org/). \nFirst, a set of thread URLs is collected and stored. Then, each thread is searched for \nemails. For each thread, at most one email is kept, as the rest are duplicates. Metadata \n(Subject, Date, etc.) is removed. The resulting dataset is stored in a CSV file.\n\n\u003ca  id=\"deduplication\"\u003e\u003c/a\u003e\nDeduplication\n---\nTo avoid the quadratic complexity of comparing every pair of documents, a cheap alternative is selected: MinHash and LSH using the [datasketch library](https://github.com/ekzhu/datasketch). For each document, this method \nefficiently locates its nearest neighbors. Because this leads to a large number of false\nnegatives (i.e. duplicate documents that are classified as non-duplicates), the approach is\nextended by creating a duplicate graph. 
Nodes in this graph represent documents and are connected\nwith an edge if their respective documents have been classified as duplicates. To deduplicate the \ndataset, [connected components](https://en.wikipedia.org/wiki/Component_(graph_theory)) of the \ngraph are located, and only a single node is kept from each component. A \n[readability criterion](https://en.wikipedia.org/wiki/Readability) is used for selection.\n\n\u003ca  id=\"gpt-2\"\u003e\u003c/a\u003e\nGPT-2\n---\n\nA small pretrained GPT-2 model from the \n[Huggingface library](https://huggingface.co/transformers/model_doc/gpt2.html#gpt2lmheadmodel)\nis fine-tuned on the deduplicated dataset. A collection of ~~cherry-picked~~ randomly selected \ngenerated samples can be found [here](https://github.com/davidsvy/Neural-Scam-Artist/blob/main/generated_samples/generated_samples.txt).\n\n\u003ca  id=\"shared-files\"\u003e\u003c/a\u003e\n:file_folder: Shared Files\n===\n\n\n| Resource | Size | #Samples | Link |\n|-------------------|---|---|---|\n| **Full dataset**          | 128.5 MB  | 85,160  | [Link](https://drive.google.com/file/d/1CoZp1F0FqB3pOqYlQ7X9XqCChppZFUps/view?usp=sharing)  |\n| **Deduplicated dataset**  | 74.2 MB   | 58,227  | [Link](https://drive.google.com/file/d/19JXTPTqV9gaKzHqGdbyyEuEfXD2l5DCc/view?usp=sharing)  |\n| **Thread URLs**           | 6.4 MB    | 95,324  | [Link](https://drive.google.com/file/d/1AmVIqCnWzSCqexTv02wOBAnPhiTkgHkP/view?usp=sharing)  |\n| **GPT-2 Checkpoints**     | ~1.5 GB   |   | [Link](https://drive.google.com/drive/folders/1RUV2gPbGUetBFlIJZ9_W-ARB70x_9s-L?usp=sharing)  |\n\n\n\n\n\u003ca  id=\"requirements\"\u003e\u003c/a\u003e\n:toolbox: Requirements\n===\nSee `requirements.txt`.\n\n\u003ca  id=\"installation\"\u003e\u003c/a\u003e\n:gear: Installation\n===\n```\n$ git clone https://github.com/davidsvy/Neural-Scam-Artist\n$ cd Neural-Scam-Artist\n$ pip install -r requirements.txt\n```\n\n\u003ca  id=\"usage\"\u003e\u003c/a\u003e\n:roll_of_paper: 
Usage\n===\n\nTo generate the dataset (~3 hours on Colab):\n```\n$ python create_dataset.py [-c configs/create_dataset.yaml]\n```\n\nTo deduplicate the dataset (~30 minutes on Colab):\n```\n$ python deduplicate_dataset.py [-c configs/deduplicate_dataset.yaml]\n```\n\nTo train GPT-2 (~3 hours/epoch on Colab with K80):\n```\n$ python gpt2_train.py [-c configs/gpt2_train.yaml]\n```\n\nTo generate text with GPT-2:\n```\n$ python gpt2_sample.py [-c configs/gpt2_sample.yaml]\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidsvy%2FNeural-Scam-Artist","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdavidsvy%2FNeural-Scam-Artist","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidsvy%2FNeural-Scam-Artist/lists"}