{"id":13468905,"url":"https://github.com/vi3k6i5/flashtext","last_synced_at":"2025-05-13T18:04:39.118Z","repository":{"id":38239555,"uuid":"100404920","full_name":"vi3k6i5/flashtext","owner":"vi3k6i5","description":"Extract Keywords from sentence or Replace keywords in sentences.","archived":false,"fork":false,"pushed_at":"2025-04-13T22:35:59.000Z","size":450,"stargazers_count":5648,"open_issues_count":69,"forks_count":603,"subscribers_count":141,"default_branch":"master","last_synced_at":"2025-05-04T15:02:43.467Z","etag":null,"topics":["data-extraction","keyword-extraction","nlp","search-in-text","word2vec"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vi3k6i5.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2017-08-15T18:03:01.000Z","updated_at":"2025-05-03T03:00:12.000Z","dependencies_parsed_at":"2022-07-12T17:13:19.720Z","dependency_job_id":"2564c2e8-6a1a-4085-aac9-d93014e3546a","html_url":"https://github.com/vi3k6i5/flashtext","commit_stats":{"total_commits":93,"total_committers":7,"mean_commits":"13.285714285714286","dds":0.5698924731182795,"last_synced_commit":"b316c7e9e54b6b4d078462b302a83db85f884a94"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vi3k6i5%2Fflashtext","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vi3k6i5%2Fflashtext/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vi3
k6i5%2Fflashtext/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vi3k6i5%2Fflashtext/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vi3k6i5","download_url":"https://codeload.github.com/vi3k6i5/flashtext/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252531876,"owners_count":21763292,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-extraction","keyword-extraction","nlp","search-in-text","word2vec"],"created_at":"2024-07-31T15:01:21.619Z","updated_at":"2025-05-05T23:01:40.975Z","avatar_url":"https://github.com/vi3k6i5.png","language":"Python","readme":"=========\nFlashText\n=========\n\n.. image:: https://api.travis-ci.org/vi3k6i5/flashtext.svg?branch=master\n   :target: https://travis-ci.org/vi3k6i5/flashtext\n   :alt: Build Status\n\n.. image:: https://readthedocs.org/projects/flashtext/badge/?version=latest\n   :target: http://flashtext.readthedocs.io/en/latest/?badge=latest\n   :alt: Documentation Status\n\n.. image:: https://badge.fury.io/py/flashtext.svg\n   :target: https://badge.fury.io/py/flashtext\n   :alt: Version\n\n.. image:: https://coveralls.io/repos/github/vi3k6i5/flashtext/badge.svg?branch=master\n   :target: https://coveralls.io/github/vi3k6i5/flashtext?branch=master\n   :alt: Test coverage\n\n.. 
image:: https://img.shields.io/github/license/mashape/apistatus.svg?maxAge=2592000\n   :target: https://github.com/vi3k6i5/flashtext/blob/master/LICENSE\n   :alt: license\n\n\nThis module can be used to replace keywords in sentences or extract keywords from sentences. It is based on the `FlashText algorithm \u003chttps://arxiv.org/abs/1711.00046\u003e`_.\n\n\nInstallation\n------------\n::\n\n    $ pip install flashtext\n\n\nAPI doc\n-------\n\nDocumentation can be found at `FlashText Read the Docs\n\u003chttp://flashtext.readthedocs.io/\u003e`_.\n\n\nUsage\n-----\nExtract keywords\n    \u003e\u003e\u003e from flashtext import KeywordProcessor\n    \u003e\u003e\u003e keyword_processor = KeywordProcessor()\n    \u003e\u003e\u003e # keyword_processor.add_keyword(\u003cunclean name\u003e, \u003cstandardised name\u003e)\n    \u003e\u003e\u003e keyword_processor.add_keyword('Big Apple', 'New York')\n    \u003e\u003e\u003e keyword_processor.add_keyword('Bay Area')\n    \u003e\u003e\u003e keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')\n    \u003e\u003e\u003e keywords_found\n    \u003e\u003e\u003e # ['New York', 'Bay Area']\n\nReplace keywords\n    \u003e\u003e\u003e keyword_processor.add_keyword('New Delhi', 'NCR region')\n    \u003e\u003e\u003e new_sentence = keyword_processor.replace_keywords('I love Big Apple and new delhi.')\n    \u003e\u003e\u003e new_sentence\n    \u003e\u003e\u003e # 'I love New York and NCR region.'\n\nCase Sensitive example\n    \u003e\u003e\u003e from flashtext import KeywordProcessor\n    \u003e\u003e\u003e keyword_processor = KeywordProcessor(case_sensitive=True)\n    \u003e\u003e\u003e keyword_processor.add_keyword('Big Apple', 'New York')\n    \u003e\u003e\u003e keyword_processor.add_keyword('Bay Area')\n    \u003e\u003e\u003e keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.')\n    \u003e\u003e\u003e keywords_found\n    \u003e\u003e\u003e # ['Bay Area']\n\nSpan of 
keywords extracted\n    \u003e\u003e\u003e from flashtext import KeywordProcessor\n    \u003e\u003e\u003e keyword_processor = KeywordProcessor()\n    \u003e\u003e\u003e keyword_processor.add_keyword('Big Apple', 'New York')\n    \u003e\u003e\u003e keyword_processor.add_keyword('Bay Area')\n    \u003e\u003e\u003e keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.', span_info=True)\n    \u003e\u003e\u003e keywords_found\n    \u003e\u003e\u003e # [('New York', 7, 16), ('Bay Area', 21, 29)]\n\nGet Extra information with keywords extracted\n    \u003e\u003e\u003e from flashtext import KeywordProcessor\n    \u003e\u003e\u003e kp = KeywordProcessor()\n    \u003e\u003e\u003e kp.add_keyword('Taj Mahal', ('Monument', 'Taj Mahal'))\n    \u003e\u003e\u003e kp.add_keyword('Delhi', ('Location', 'Delhi'))\n    \u003e\u003e\u003e kp.extract_keywords('Taj Mahal is in Delhi.')\n    \u003e\u003e\u003e # [('Monument', 'Taj Mahal'), ('Location', 'Delhi')]\n    \u003e\u003e\u003e # NOTE: replace_keywords feature won't work with this.\n\nNo clean name for Keywords\n    \u003e\u003e\u003e from flashtext import KeywordProcessor\n    \u003e\u003e\u003e keyword_processor = KeywordProcessor()\n    \u003e\u003e\u003e keyword_processor.add_keyword('Big Apple')\n    \u003e\u003e\u003e keyword_processor.add_keyword('Bay Area')\n    \u003e\u003e\u003e keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.')\n    \u003e\u003e\u003e keywords_found\n    \u003e\u003e\u003e # ['Big Apple', 'Bay Area']\n\nAdd Multiple Keywords simultaneously\n    \u003e\u003e\u003e from flashtext import KeywordProcessor\n    \u003e\u003e\u003e keyword_processor = KeywordProcessor()\n    \u003e\u003e\u003e keyword_dict = {\n    \u003e\u003e\u003e     \"java\": [\"java_2e\", \"java programing\"],\n    \u003e\u003e\u003e     \"product management\": [\"PM\", \"product manager\"]\n    \u003e\u003e\u003e }\n    \u003e\u003e\u003e # {'clean_name': ['list of unclean 
names']}\n    \u003e\u003e\u003e keyword_processor.add_keywords_from_dict(keyword_dict)\n    \u003e\u003e\u003e # Or add keywords from a list:\n    \u003e\u003e\u003e keyword_processor.add_keywords_from_list([\"java\", \"python\"])\n    \u003e\u003e\u003e keyword_processor.extract_keywords('I am a product manager for a java_2e platform')\n    \u003e\u003e\u003e # output ['product management', 'java']\n\nTo Remove keywords\n    \u003e\u003e\u003e from flashtext import KeywordProcessor\n    \u003e\u003e\u003e keyword_processor = KeywordProcessor()\n    \u003e\u003e\u003e keyword_dict = {\n    \u003e\u003e\u003e     \"java\": [\"java_2e\", \"java programing\"],\n    \u003e\u003e\u003e     \"product management\": [\"PM\", \"product manager\"]\n    \u003e\u003e\u003e }\n    \u003e\u003e\u003e keyword_processor.add_keywords_from_dict(keyword_dict)\n    \u003e\u003e\u003e print(keyword_processor.extract_keywords('I am a product manager for a java_2e platform'))\n    \u003e\u003e\u003e # output ['product management', 'java']\n    \u003e\u003e\u003e keyword_processor.remove_keyword('java_2e')\n    \u003e\u003e\u003e # you can also remove keywords from a list/ dictionary\n    \u003e\u003e\u003e keyword_processor.remove_keywords_from_dict({\"product management\": [\"PM\"]})\n    \u003e\u003e\u003e keyword_processor.remove_keywords_from_list([\"java programing\"])\n    \u003e\u003e\u003e keyword_processor.extract_keywords('I am a product manager for a java_2e platform')\n    \u003e\u003e\u003e # output ['product management']\n\nTo check Number of terms in KeywordProcessor\n    \u003e\u003e\u003e from flashtext import KeywordProcessor\n    \u003e\u003e\u003e keyword_processor = KeywordProcessor()\n    \u003e\u003e\u003e keyword_dict = {\n    \u003e\u003e\u003e     \"java\": [\"java_2e\", \"java programing\"],\n    \u003e\u003e\u003e     \"product management\": [\"PM\", \"product manager\"]\n    \u003e\u003e\u003e }\n    \u003e\u003e\u003e 
keyword_processor.add_keywords_from_dict(keyword_dict)\n    \u003e\u003e\u003e print(len(keyword_processor))\n    \u003e\u003e\u003e # output 4\n\nTo check if term is present in KeywordProcessor\n    \u003e\u003e\u003e from flashtext import KeywordProcessor\n    \u003e\u003e\u003e keyword_processor = KeywordProcessor()\n    \u003e\u003e\u003e keyword_processor.add_keyword('j2ee', 'Java')\n    \u003e\u003e\u003e 'j2ee' in keyword_processor\n    \u003e\u003e\u003e # output: True\n    \u003e\u003e\u003e keyword_processor.get_keyword('j2ee')\n    \u003e\u003e\u003e # output: Java\n    \u003e\u003e\u003e keyword_processor['colour'] = 'color'\n    \u003e\u003e\u003e keyword_processor['colour']\n    \u003e\u003e\u003e # output: color\n\nGet all keywords in dictionary\n    \u003e\u003e\u003e from flashtext import KeywordProcessor\n    \u003e\u003e\u003e keyword_processor = KeywordProcessor()\n    \u003e\u003e\u003e keyword_processor.add_keyword('j2ee', 'Java')\n    \u003e\u003e\u003e keyword_processor.add_keyword('colour', 'color')\n    \u003e\u003e\u003e keyword_processor.get_all_keywords()\n    \u003e\u003e\u003e # output: {'colour': 'color', 'j2ee': 'Java'}\n\nFor detecting Word Boundary currently any character other than this `\\\\w` `[A-Za-z0-9_]` is considered a word boundary.\n\nTo set or add characters as part of word characters\n    \u003e\u003e\u003e from flashtext import KeywordProcessor\n    \u003e\u003e\u003e keyword_processor = KeywordProcessor()\n    \u003e\u003e\u003e keyword_processor.add_keyword('Big Apple')\n    \u003e\u003e\u003e print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))\n    \u003e\u003e\u003e # ['Big Apple']\n    \u003e\u003e\u003e keyword_processor.add_non_word_boundary('/')\n    \u003e\u003e\u003e print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))\n    \u003e\u003e\u003e # []\n\n\nTest\n----\n::\n\n    $ git clone https://github.com/vi3k6i5/flashtext\n    $ cd flashtext\n    $ pip install pytest\n   
 $ python setup.py test\n\n\nBuild Docs\n----------\n::\n\n    $ git clone https://github.com/vi3k6i5/flashtext\n    $ cd flashtext/docs\n    $ pip install sphinx\n    $ make html\n    $ # open _build/html/index.html in browser to view it locally\n\n\nWhy not Regex?\n--------------\n\nIt's a custom algorithm based on `Aho-Corasick algorithm\n\u003chttps://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm\u003e`_ and `Trie Dictionary\n\u003chttps://en.wikipedia.org/wiki/Trie Dictionary\u003e`_.\n\n.. image:: https://github.com/vi3k6i5/flashtext/raw/master/benchmark.png\n   :target: https://twitter.com/RadimRehurek/status/904989624589803520\n   :alt: Benchmark\n\n\nTime taken by FlashText to find terms in comparison to Regex.\n\n.. image:: https://thepracticaldev.s3.amazonaws.com/i/xruf50n6z1r37ti8rd89.png\n\n\nTime taken by FlashText to replace terms in comparison to Regex.\n\n.. image:: https://thepracticaldev.s3.amazonaws.com/i/k44ghwp8o712dm58debj.png\n\nLink to code for benchmarking the `Find Feature \u003chttps://gist.github.com/vi3k6i5/604eefd92866d081cfa19f862224e4a0\u003e`_ and `Replace Feature \u003chttps://gist.github.com/vi3k6i5/dc3335ee46ab9f650b19885e8ade6c7a\u003e`_.\n\nThe idea for this library came from the following `StackOverflow question\n\u003chttps://stackoverflow.com/questions/44178449/regex-replace-is-taking-time-for-millions-of-documents-how-to-make-it-faster\u003e`_.\n\n\nCitation\n----------\n\nThe original paper published on `FlashText algorithm \u003chttps://arxiv.org/abs/1711.00046\u003e`_.\n\n::\n\n    @ARTICLE{2017arXiv171100046S,\n       author = {{Singh}, V.},\n        title = \"{Replace or Retrieve Keywords In Documents at Scale}\",\n      journal = {ArXiv e-prints},\n    archivePrefix = \"arXiv\",\n       eprint = {1711.00046},\n     primaryClass = \"cs.DS\",\n     keywords = {Computer Science - Data Structures and Algorithms},\n         year = 2017,\n        month = oct,\n       adsurl = 
{http://adsabs.harvard.edu/abs/2017arXiv171100046S},\n      adsnote = {Provided by the SAO/NASA Astrophysics Data System}\n    }\n\nAn accompanying article was published on `Medium freeCodeCamp \u003chttps://medium.freecodecamp.org/regex-was-taking-5-days-flashtext-does-it-in-15-minutes-55f04411025f\u003e`_.\n\n\nContribute\n----------\n\n- Issue Tracker: https://github.com/vi3k6i5/flashtext/issues\n- Source Code: https://github.com/vi3k6i5/flashtext/\n\n\nLicense\n-------\n\nThe project is licensed under the MIT license.\n","funding_links":[],"categories":["Python","资源列表","Python (1887)","文本数据和NLP","Feature Extraction"],"sub_categories":["文本处理","Text/NLP"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvi3k6i5%2Fflashtext","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvi3k6i5%2Fflashtext","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvi3k6i5%2Fflashtext/lists"}