{"id":13572234,"url":"https://github.com/mailgun/talon","last_synced_at":"2025-05-14T20:08:19.856Z","repository":{"id":18977399,"uuid":"22198589","full_name":"mailgun/talon","owner":"mailgun","description":null,"archived":false,"fork":false,"pushed_at":"2023-10-18T14:47:20.000Z","size":305,"stargazers_count":1277,"open_issues_count":70,"forks_count":288,"subscribers_count":89,"default_branch":"master","last_synced_at":"2025-04-13T14:07:25.022Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"nchammas/flintrock","license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mailgun.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2014-07-24T04:12:04.000Z","updated_at":"2025-04-08T19:36:18.000Z","dependencies_parsed_at":"2024-01-07T00:06:49.240Z","dependency_job_id":null,"html_url":"https://github.com/mailgun/talon","commit_stats":{"total_commits":162,"total_committers":33,"mean_commits":4.909090909090909,"dds":0.5740740740740741,"last_synced_commit":"71d9b6eb78e985bcdfbf99b69c20c001b4b818c4"},"previous_names":[],"tags_count":45,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mailgun%2Ftalon","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mailgun%2Ftalon/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mailgun%2Ftalon/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mailgun%2Ftalon/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mailgun","download_url":"https://c
odeload.github.com/mailgun/talon/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248724637,"owners_count":21151561,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T14:01:17.329Z","updated_at":"2025-04-13T14:07:28.828Z","avatar_url":"https://github.com/mailgun.png","language":"Python","readme":"talon\n=====\n\nMailgun library to extract message quotations and signatures.\n\nIf you have ever tried to parse message quotations or signatures, you know that the absence of any formatting standards in this area can make the task a nightmare. Hopefully this library will make your life much easier. The name of the project is inspired by TALON, a multipurpose robot designed to perform missions ranging from reconnaissance to combat and to operate in a number of hostile environments. That’s what a good quotations and signature parser should be like :smile:\n\nUsage\n-----\n\nHere’s how you initialize the library and extract a reply from a text\nmessage:\n\n.. code:: python\n\n    import talon\n    from talon import quotations\n\n    talon.init()\n\n    text = \"\"\"Reply\n\n    -----Original Message-----\n\n    Quote\"\"\"\n\n    reply = quotations.extract_from(text, 'text/plain')\n    # or call the plain-text extractor directly\n    reply = quotations.extract_from_plain(text)\n    # reply == \"Reply\"\n\nTo extract a reply from HTML:\n\n.. 
code:: python\n\n    html = \"\"\"Reply\n    \u003cblockquote\u003e\n\n      \u003cdiv\u003e\n        On 11-Apr-2011, at 6:54 PM, Bob \u0026lt;bob@example.com\u0026gt; wrote:\n      \u003c/div\u003e\n\n      \u003cdiv\u003e\n        Quote\n      \u003c/div\u003e\n\n    \u003c/blockquote\u003e\"\"\"\n\n    reply = quotations.extract_from(html, 'text/html')\n    # or call the HTML extractor directly\n    reply = quotations.extract_from_html(html)\n    # reply == \"\u003chtml\u003e\u003cbody\u003e\u003cp\u003eReply\u003c/p\u003e\u003c/body\u003e\u003c/html\u003e\"\n\nOften the best way is the easiest one. Here’s how you can extract\na signature from an email message without any\nfancy machine learning:\n\n.. code:: python\n\n    from talon.signature.bruteforce import extract_signature\n\n\n    message = \"\"\"Wow. Awesome!\n    --\n    Bob Smith\"\"\"\n\n    text, signature = extract_signature(message)\n    # text == \"Wow. Awesome!\"\n    # signature == \"--\\nBob Smith\"\n\nThis is quick and works like a charm 90% of the time. For the other 10% you can use\nthe power of machine learning algorithms:\n\n.. code:: python\n\n    import talon\n    # don't forget to init the library first\n    # it loads machine learning classifiers\n    talon.init()\n\n    from talon import signature\n\n\n    message = \"\"\"Thanks Sasha, I can't go any higher and is why I limited it to the\n    homepage.\n\n    John Doe\n    via mobile\"\"\"\n\n    # bind the result to ``sig`` so the ``signature`` module is not shadowed\n    text, sig = signature.extract(message, sender='john.doe@example.com')\n    # text == \"Thanks Sasha, I can't go any higher and is why I limited it to the\\nhomepage.\"\n    # sig == \"John Doe\\nvia mobile\"\n\nFor machine learning, talon currently uses the `scikit-learn`_ library to build SVM\nclassifiers. The core of the machine learning algorithm lies in the\n``talon.signature.learning`` package. 
It defines a set of features to\napply to a message (``featurespace.py``), how data sets are built\n(``dataset.py``), and the classifier’s interface (``classifier.py``).\n\nCurrently the data used for training is taken from our personal email\nconversations and from the `ENRON`_ dataset. As a result of applying our set\nof features to the dataset we provide the files ``classifier`` and\n``train.data``, which don’t contain any personal information but can be\nused to load the trained classifier. Those files should be regenerated every\ntime the feature set or data set is changed.\n\nTo regenerate the model files, you can run\n\n.. code:: sh\n\n    python train.py\n\nor\n\n.. code:: python\n\n    from talon.signature import EXTRACTOR_FILENAME, EXTRACTOR_DATA\n    from talon.signature.learning.classifier import train, init\n    train(init(), EXTRACTOR_DATA, EXTRACTOR_FILENAME)\n\nOpen-source Dataset\n-------------------\n\nRecently we started the `forge`_ project to create an open-source, annotated dataset of raw emails. In the project we\nused a subset of `ENRON`_ data, cleansed of private, health and financial information by `EDRM`_. At the moment over 190\nemails are annotated. Contributions and collaboration on the project are welcome. Once the dataset is ready we plan to\nstart using it for talon.\n\n.. _scikit-learn: http://scikit-learn.org\n.. _ENRON: https://www.cs.cmu.edu/~enron/\n.. _EDRM: http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set\n.. _forge: https://github.com/mailgun/forge\n\nTraining on your dataset\n------------------------\n\ntalon comes with a pre-processed dataset and a pre-trained classifier. To retrain the classifier on your own dataset of raw emails, structure and annotate them in the same way the `forge`_ project does. Then do:\n\n.. 
code:: python\n\n    from talon.signature.learning.dataset import build_extraction_dataset\n    from talon.signature.learning import classifier as c\n\n    build_extraction_dataset(\"/path/to/your/P/folder\", \"/path/to/talon/signature/data/train.data\")\n    c.train(c.init(), \"/path/to/talon/signature/data/train.data\", \"/path/to/talon/signature/data/classifier\")\n\nNote that for signature extraction you need only the folder of positive samples with annotated signature lines (the P folder).\n\n.. _forge: https://github.com/mailgun/forge\n\nResearch\n--------\n\nThe library is inspired by the following research papers and projects:\n\n-  http://www.cs.cmu.edu/~vitor/papers/sigFilePaper_finalversion.pdf\n-  http://www.cs.cornell.edu/people/tj/publications/joachims_01a.pdf\n","funding_links":[],"categories":["Python","资源列表","Email"],"sub_categories":["电子邮件"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmailgun%2Ftalon","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmailgun%2Ftalon","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmailgun%2Ftalon/lists"}