{"id":13418233,"url":"https://github.com/s/preprocessor","last_synced_at":"2025-04-13T07:51:01.388Z","repository":{"id":49609247,"uuid":"50229657","full_name":"s/preprocessor","owner":"s","description":"Elegant and Easy Tweet Preprocessing in Python","archived":false,"fork":false,"pushed_at":"2023-04-17T09:55:05.000Z","size":103,"stargazers_count":300,"open_issues_count":16,"forks_count":63,"subscribers_count":10,"default_branch":"master","last_synced_at":"2024-04-26T07:00:48.394Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://preprocessor.readthedocs.org/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/s.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2016-01-23T08:05:59.000Z","updated_at":"2024-04-25T02:00:19.000Z","dependencies_parsed_at":"2022-09-14T16:41:44.273Z","dependency_job_id":"65487012-9b06-457a-ba34-eb7a2e73358b","html_url":"https://github.com/s/preprocessor","commit_stats":{"total_commits":122,"total_committers":7,"mean_commits":"17.428571428571427","dds":0.05737704918032782,"last_synced_commit":"efd2eb5919187a1a6713f8bb3fb19861d6133392"},"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/s%2Fpreprocessor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/s%2Fpreprocessor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/s%2Fpreprocessor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/s%2Fpreprocessor/manifests","owner_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/owners/s","download_url":"https://codeload.github.com/s/preprocessor/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248681494,"owners_count":21144700,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-30T22:00:59.978Z","updated_at":"2025-04-13T07:51:01.366Z","avatar_url":"https://github.com/s.png","language":"Python","funding_links":[],"categories":["Uncategorized"],"sub_categories":["Uncategorized"],"readme":"************\nPreprocessor\n************\n\n.. image:: https://travis-ci.org/s/preprocessor.svg?branch=master\n    :target: https://travis-ci.org/s/preprocessor\n\n.. image:: https://img.shields.io/pypi/dm/tweet-preprocessor.svg\n    :target: https://pypi.python.org/pypi/tweet-preprocessor/\n\n.. image:: https://badge.fury.io/py/tweet-preprocessor.svg\n    :target: https://pypi.python.org/pypi/tweet-preprocessor/\n\n.. image:: https://img.shields.io/github/license/s/preprocessor.svg\n    :target: https://github.com/s/preprocessor/blob/master/LICENSE.md\n\n.. image:: https://img.shields.io/pypi/pyversions/tweet-preprocessor.svg\n    :target: https://pypi.python.org/pypi/tweet-preprocessor/\n\n\nPreprocessor is a preprocessing library for tweet data written in\nPython. When building machine learning systems on tweet and text data, a\npreprocessing step is required, both to improve data quality and to reduce dimensionality. 
\n\nThis library makes it easy to clean, parse or tokenize tweets so you don't have to write the same helper functions over and over again.\n\nFeatures\n========\n\nCurrently supports cleaning, tokenizing and parsing:\n\n-  URLs\n-  Hashtags\n-  Mentions\n-  Reserved words (RT, FAV)\n-  Emojis\n-  Smileys\n-  Numbers\n-  ``JSON`` and ``.txt`` file support\n\nPreprocessor ``v0.6.0`` supports\n``Python 3.4+ on Linux, macOS and Windows``. Tests run on the\nfollowing setups:\n\n::\n\n    Linux Xenial with Python 3.4.8, 3.5.6, 3.6.7, 3.7.1, 3.8.0, 3.8.3+\n    macOS with Python 3.7.5, 3.8.0\n    Windows with Python 3.5.4, 3.6.8\n\nUsage\n=====\n\nBasic cleaning:\n---------------\n\n.. code:: python\n\n    \u003e\u003e\u003e import preprocessor as p\n    \u003e\u003e\u003e p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')\n    'Preprocessor is'\n\nTokenizing:\n-----------\n\n.. code:: python\n\n    \u003e\u003e\u003e p.tokenize('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')\n    'Preprocessor is $HASHTAG$ $EMOJI$ $URL$'\n\nParsing:\n--------\n\n.. code:: python\n\n    \u003e\u003e\u003e parsed_tweet = p.parse('Preprocessor is #awesome https://github.com/s/preprocessor')\n    \u003e\u003e\u003e parsed_tweet\n    \u003cpreprocessor.parse.ParseResult instance at 0x10f430758\u003e\n    \u003e\u003e\u003e parsed_tweet.urls\n    [(25:58) =\u003e https://github.com/s/preprocessor]\n    \u003e\u003e\u003e parsed_tweet.urls[0].start_index\n    25\n    \u003e\u003e\u003e parsed_tweet.urls[0].match\n    'https://github.com/s/preprocessor'\n    \u003e\u003e\u003e parsed_tweet.urls[0].end_index\n    58\n\nFully customizable:\n-------------------\n\n.. 
code:: python\n\n    \u003e\u003e\u003e p.set_options(p.OPT.URL, p.OPT.EMOJI)\n    \u003e\u003e\u003e p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')\n    'Preprocessor is #awesome'\n\nBy default, Preprocessor applies all of the options unless you\nspecify a subset with ``set_options``.\n\nProcessing files:\n-----------------\n\nPreprocessor currently supports processing ``.json`` and ``.txt``\nfiles. See the examples below for the expected input format.\n\nExample JSON file\n~~~~~~~~~~~~~~~~~\n\n.. code:: json\n\n    [\n        \"Preprocessor now supports files. https://github.com/s/preprocessor\",\n        \"#preprocessing is a crucial part of @ML projects.\",\n        \"@RT @Twitter raw text data usually has lots of #residue. http://t.co/g00gl\"\n    ]\n\nExample Text file\n~~~~~~~~~~~~~~~~~\n\n::\n\n    Preprocessor now supports files. https://github.com/s/preprocessor\n    #preprocessing is a crucial part of @ML projects.\n    @RT @Twitter raw text data usually has lots of #residue. http://t.co/g00gl\n\nPreprocessing JSON file:\n~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. code:: python\n\n    # JSON example\n    \u003e\u003e\u003e input_file_name = \"sample_json.json\"\n    \u003e\u003e\u003e p.clean_file(input_file_name, options=[p.OPT.URL, p.OPT.MENTION])\n    Saved the cleaned tweets to:/tests/artifacts/24052020_013451892752_vkeCMTwBEMmX_clean_file_sample.json\n\nPreprocessing text file:\n~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. 
code:: python\n\n    # Text file example\n    \u003e\u003e\u003e input_file_name = \"sample_txt.txt\"\n    \u003e\u003e\u003e p.clean_file(input_file_name, options=[p.OPT.URL, p.OPT.MENTION])\n    Saved the cleaned tweets to:/tests/artifacts/24052020_013451908865_TE9DWX1BjFws_clean_file_sample.txt\n\nAvailable Options:\n~~~~~~~~~~~~~~~~~~\n\n+------------------+---------------------+\n| Option Name      | Option Short Code   |\n+==================+=====================+\n| URL              | p.OPT.URL           |\n+------------------+---------------------+\n| Mention          | p.OPT.MENTION       |\n+------------------+---------------------+\n| Hashtag          | p.OPT.HASHTAG       |\n+------------------+---------------------+\n| Reserved Words   | p.OPT.RESERVED      |\n+------------------+---------------------+\n| Emoji            | p.OPT.EMOJI         |\n+------------------+---------------------+\n| Smiley           | p.OPT.SMILEY        |\n+------------------+---------------------+\n| Number           | p.OPT.NUMBER        |\n+------------------+---------------------+\n\nInstallation\n============\n\nUsing pip:\n\n.. code:: bash\n\n    $ pip install tweet-preprocessor\n\n\nUsing Anaconda:\n\n.. code:: bash\n\n    $ conda install -c saidozcan tweet-preprocessor\n\nUsing manual installation:\n\n.. code:: bash\n\n    $ python setup.py build\n    $ python setup.py install\n\nContributing\n============\n\nWant to contribute to Preprocessor? That's great! Please\nfollow the steps below:\n\n#. Create a bug report or a feature idea using the templates on the Issues\n   page.\n\n#. Fork the repository and make your changes.\n\n#. Open a PR and make sure your PR has tests and all the checks pass.\n\n#. And that's all!\n\n.. 
|image| image:: https://travis-ci.org/s/preprocessor.svg?branch=master\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fs%2Fpreprocessor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fs%2Fpreprocessor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fs%2Fpreprocessor/lists"}