{"id":21806855,"url":"https://github.com/roguh/email-classifier","last_synced_at":"2026-05-18T11:36:53.459Z","repository":{"id":89927825,"uuid":"88011462","full_name":"roguh/email-classifier","owner":"roguh","description":"A machine learning project for classifying emails into spam and ham. Uses C for preprocessing emails, Python for parsing and extracting low-level features of emails, and a few lines of Matlab for running several classic classifiers.","archived":false,"fork":false,"pushed_at":"2017-04-17T01:17:20.000Z","size":718,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-01-26T04:42:08.395Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/roguh.png","metadata":{"files":{"readme":"README.org","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-04-12T05:00:07.000Z","updated_at":"2019-09-11T20:15:05.000Z","dependencies_parsed_at":"2023-05-30T13:30:44.186Z","dependency_job_id":null,"html_url":"https://github.com/roguh/email-classifier","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/roguh%2Femail-classifier","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/roguh%2Femail-classifier/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/roguh%2Femail-classifier/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/roguh%2
Femail-classifier/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/roguh","download_url":"https://codeload.github.com/roguh/email-classifier/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244759961,"owners_count":20505716,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-27T12:30:40.294Z","updated_at":"2026-05-18T11:36:43.446Z","avatar_url":"https://github.com/roguh.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"#+BEGIN_COMMENT\nadd:\nlots of time, why does this matter?\nhow to pre-process online data?\ndatasets, pre-processing, description of the vector space\n#+END_COMMENT\n\n* Datasets\n\n** Dataset Metrics\n| Name        | Size   | Samples | Spam   |\n|-------------+--------+---------+--------|\n| CSDMC2010   |   70M  | 4327    | 31.85% |\n| Spam Track  |   964M | 92189   | 57.26% |\n| enron2      |   26M  | 5857    | 25.54% |\n| enron3      |   30M  | 5512    | 27.21% |\n| enron5      |   23M  | 5175    | 71.01% |\n| enron6      |   27M  | 6000    | 75.00% |\n| csmining    |   67M  |         |        |\n| lingspam    |   61M  |         |        |\n| PU          |   63M  |         |        |\n\n*** Duplicates\nAll CSDMC2010 samples are unique.\nThere are 12291 duplicated samples in the Spam Track dataset\nand 5192 duplicated samples in the enron datasets.\nAlso, the CSDMC2010 samples and the Spam Track samples are disjoint\n(this is based on a comparison of MD5 checksums, not the contents of each\nfile, so the same email may still appear in slightly different forms in\nboth datasets).\n\nI 
haven't checked the last three datasets in as much detail.\n\n*** Difficulty\nThe CS Mining dataset is particularly interesting as it classifies\nits ham and spam samples into varying difficulty classes.\nI'm planning on simulating an 'adversarial' stream of emails by\nrandomly generating a sequence of emails that ramps up in difficulty\nfrom easy to hard.\n\n** Mapping Variable-Length Emails to Vectors\nThis code generates 24 datasets, one for each feature type.\nA feature is an n-gram: a sequence of words or characters in a document.\n\nThe feature types are the 3 x 2 x 2 x 2 = 24 combinations of:\n(1-gram, 2-gram, 3-gram) x (by character, by word) x (case sensitive, case\ninsensitive) x (no HTML tags, include HTML tags)\n\n** Converting Emails to Lists of Features\nEvery email document is converted to a *.data* feature file of the following form:\n\n#+BEGIN_SRC\n1\n// Sample file. First line is the document's label:\n// 1 if spam, 0 if ham\n// The rest of the non-commented lines are features.\nSVM\nis\nsuccessful\nSVM learning\nlearning is\nis successful\n#+END_SRC\n\n*** Implementation\nThe Python script =extract_features.py= converts each *.eml* email file\ninto a *payloadN.data* file, where N is an integer from 1 to the number\nof emails. 
The script extracts features by applying one or more of the\nfollowing operations to the email payload and subject:\n- Removing English stop words\n- Parsing email content as HTML in order to remove HTML tags\n- Converting all data to lowercase to make matching case insensitive\n- Splitting data into whitespace-separated words\n- Forming n-grams based on either words or characters\n- Adding the sender as a feature (verbatim)\n\n** Converting Lists of Features to Vectors\nEvery *.data* feature file is converted to a sparse matrix of the following\nform:\n\n| Document index (row) | Feature index (column) | Count (value) |\n|----------------------+------------------------+---------------|\n| 1 to n               | 1 to u                 | an integer    |\n\nHere u is the number of unique features, n is the number of documents, and\ncount is either the number of times the feature appears or a number with an\nabsolute value less than the number of times the feature appears.\n\n*** Dimensionality Reduction\nIf desired, the sparse matrix can be modified to contain fewer columns than\nthe number of unique features.\nIn this case, there may be collisions: different features may have the same\nfeature index.\n\nThe script =count_collisions.sh= can count the number of features that have\nthe same index in the sparse matrix.\n\n**** Finding Features Given a Feature Index\nA rainbow table is created for the entire corpus.\nThis file maps features and document indices to the feature's index in the\nsparse matrix.\n\n*** Implementation\nThe C program =generate_matrix.c= takes\n\n#+BEGIN_COMMENT\nC program (output file, rainbow hash output, payload names, range of file IDs (1-100, 101-200))\n\nshould also generate different testing sets (cross-validation...)\n\nScan each .data file and add every feature to a bag of words.\nWrite the bag of words as a sparse matrix to a .dat file.\n\n**** Bag of Words\nMaps features (word n-grams or character n-grams)\nto R^u\n\nO(u) space\ngenerates sparse matrix of size O(nu) where n is 
the number of docs\nn is the number of docs\nu is the number of unique features\n\nvector[hash(f)]++\nis replaced with\nadd_or_increment(vector, f)\n\nkeep array of (hash value, doc, index, count) of length U\n\nhash and hash2 are different hash functions\n  (FarmHash, SipHash, Pearson's Hash)\n\nto map feature -\u003e index:\n returns an index from 1 to U\n\n hash value = hash(feature)\n lookup feature's (hash value % current size) in array\n present?\n   // Only do this if max hash size \u003c U\n   // Weinberger, Dasgupta, Langford, et al. 2009\n   // Helps 'balance out' collisions\n   if hash2(feature) == 1\n     count += 1\n   else\n     count -= 1\n   // Otherwise, just do:\n   count += 1\n   return index\n\n absent?\n   current size++\n   index++\n   set count to 1\n   do not rehash if table is at max size\n   rehash in case of collision or current capacity size reached\n     (create new table with 2x or 4x size (check hash),\n      move to correct index based on true hash values)\n   add the feature to array and return index\n\n\nto process document w:\n  use previous hash table\n  feature -\u003e index\n  write to .dat: \"w index\"\n  write rainbow table: \"index  w:line_number  feature\"\n\ngenerates a sparse matrix \"data.dat\"\nload data.dat\ndata = spconvert(data)\n#+END_COMMENT\n\n#+BEGIN_COMMENT\n** Misc\n- `enron1/ham/2825.2000-11-13.farmer.ham.txt` \"i believe texas should re - establish itself as a republic and i can go to the barricades . 
now that gets my juices going .\"\n- 'New Mexico' only appears in `spam/` in the enron1 dataset\n\n** Machine Learning!\nread files F1 (csvread)\nlearn (SVM, ROSVM, SGD SVM, Naive Bayesian)\nevaluate model on files F2\nprint evaluation (time, memory, error, regret, iterations)\n#+END_COMMENT\n\n** More Info on Datasets\nenron, csmining, lingspam, and PU came from csmining.org\n\nSources:\n- enron2: kaminski-v + SpamAssassin\u0026HoneyPot (05/2001 - 07/2005)\n- enron3: kitchen-l  + BG (08/2004 - 07/2005)\n- enron5: beck-s     + SpamAssassin\u0026HoneyPot (05/2001 - 07/2005)\n- enron6: lokay-m    + BG (08/2004 - 07/2005)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Froguh%2Femail-classifier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Froguh%2Femail-classifier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Froguh%2Femail-classifier/lists"}