Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/vi3k6i5/flashtext
Extract Keywords from sentence or Replace keywords in sentences.
data-extraction keyword-extraction nlp search-in-text word2vec
Last synced: 4 days ago
- Host: GitHub
- URL: https://github.com/vi3k6i5/flashtext
- Owner: vi3k6i5
- License: mit
- Created: 2017-08-15T18:03:01.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2024-07-03T08:05:36.000Z (5 months ago)
- Last Synced: 2024-10-29T11:28:19.717Z (about 1 month ago)
- Topics: data-extraction, keyword-extraction, nlp, search-in-text, word2vec
- Language: Python
- Size: 437 KB
- Stars: 5,596
- Watchers: 142
- Forks: 599
- Open Issues: 69
Metadata Files:
- Readme: README.rst
- License: LICENSE
Awesome Lists containing this project
- awesome-python-machine-learning-resources - GitHub - 49% open · ⏱️ 03.05.2020): (Text Data and NLP)
- starred-awesome - flashtext - Extract Keywords from sentence or Replace keywords in sentences. (Python)
- awesome-hacking-lists - vi3k6i5/flashtext - Extract Keywords from sentence or Replace keywords in sentences. (Python)
README
=========
FlashText
=========

.. image:: https://api.travis-ci.org/vi3k6i5/flashtext.svg?branch=master
    :target: https://travis-ci.org/vi3k6i5/flashtext
    :alt: Build Status

.. image:: https://readthedocs.org/projects/flashtext/badge/?version=latest
    :target: http://flashtext.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status

.. image:: https://badge.fury.io/py/flashtext.svg
    :target: https://badge.fury.io/py/flashtext
    :alt: Version

.. image:: https://coveralls.io/repos/github/vi3k6i5/flashtext/badge.svg?branch=master
    :target: https://coveralls.io/github/vi3k6i5/flashtext?branch=master
    :alt: Test coverage

.. image:: https://img.shields.io/github/license/mashape/apistatus.svg?maxAge=2592000
    :target: https://github.com/vi3k6i5/flashtext/blob/master/LICENSE
    :alt: license

This module can be used to replace keywords in sentences or extract keywords from sentences. It is based on the `FlashText algorithm `_.

Installation
------------
::

    $ pip install flashtext

API doc
-------

Documentation can be found at `FlashText Read the Docs `_.

Usage
-----
Extract keywords
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> # keyword_processor.add_keyword(<unclean name>, <clean name>)
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')
>>> keywords_found
>>> # ['New York', 'Bay Area']

Replace keywords
>>> keyword_processor.add_keyword('New Delhi', 'NCR region')
>>> new_sentence = keyword_processor.replace_keywords('I love Big Apple and new delhi.')
>>> new_sentence
>>> # 'I love New York and NCR region.'

Case Sensitive example
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor(case_sensitive=True)
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.')
>>> keywords_found
>>> # ['Bay Area']

Span of keywords extracted
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.', span_info=True)
>>> keywords_found
>>> # [('New York', 7, 16), ('Bay Area', 21, 29)]

Get Extra information with keywords extracted
>>> from flashtext import KeywordProcessor
>>> kp = KeywordProcessor()
>>> kp.add_keyword('Taj Mahal', ('Monument', 'Taj Mahal'))
>>> kp.add_keyword('Delhi', ('Location', 'Delhi'))
>>> kp.extract_keywords('Taj Mahal is in Delhi.')
>>> # [('Monument', 'Taj Mahal'), ('Location', 'Delhi')]
>>> # NOTE: the replace_keywords feature won't work with this.

No clean name for Keywords
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.')
>>> keywords_found
>>> # ['Big Apple', 'Bay Area']

Add Multiple Keywords simultaneously
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_dict = {
>>> "java": ["java_2e", "java programing"],
>>> "product management": ["PM", "product manager"]
>>> }
>>> # {'clean_name': ['list of unclean names']}
>>> keyword_processor.add_keywords_from_dict(keyword_dict)
>>> # Or add keywords from a list:
>>> keyword_processor.add_keywords_from_list(["java", "python"])
>>> keyword_processor.extract_keywords('I am a product manager for a java_2e platform')
>>> # output ['product management', 'java']

To Remove keywords
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_dict = {
>>> "java": ["java_2e", "java programing"],
>>> "product management": ["PM", "product manager"]
>>> }
>>> keyword_processor.add_keywords_from_dict(keyword_dict)
>>> print(keyword_processor.extract_keywords('I am a product manager for a java_2e platform'))
>>> # output ['product management', 'java']
>>> keyword_processor.remove_keyword('java_2e')
>>> # you can also remove keywords from a list/ dictionary
>>> keyword_processor.remove_keywords_from_dict({"product management": ["PM"]})
>>> keyword_processor.remove_keywords_from_list(["java programing"])
>>> keyword_processor.extract_keywords('I am a product manager for a java_2e platform')
>>> # output ['product management']

To check Number of terms in KeywordProcessor
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_dict = {
>>> "java": ["java_2e", "java programing"],
>>> "product management": ["PM", "product manager"]
>>> }
>>> keyword_processor.add_keywords_from_dict(keyword_dict)
>>> print(len(keyword_processor))
>>> # output 4

To check if term is present in KeywordProcessor
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('j2ee', 'Java')
>>> 'j2ee' in keyword_processor
>>> # output: True
>>> keyword_processor.get_keyword('j2ee')
>>> # output: Java
>>> keyword_processor['colour'] = 'color'
>>> keyword_processor['colour']
>>> # output: color

Get all keywords in dictionary
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('j2ee', 'Java')
>>> keyword_processor.add_keyword('colour', 'color')
>>> keyword_processor.get_all_keywords()
>>> # output: {'colour': 'color', 'j2ee': 'Java'}

When detecting word boundaries, any character outside `\\w` (`[A-Za-z0-9_]`) is currently treated as a word boundary.
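
The boundary rule above can be illustrated with a small standard-library sketch. It is not FlashText's own code: the names ``DEFAULT_WORD_CHARS`` and ``is_whole_word`` are hypothetical, and the character set is assumed to be exactly the ``[A-Za-z0-9_]`` class described above.

```python
import string

# Assumed default: letters, digits and underscore count as word characters,
# mirroring the [A-Za-z0-9_] class described in the text.
DEFAULT_WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def is_whole_word(text, start, end, word_chars=DEFAULT_WORD_CHARS):
    """Return True if text[start:end] has a boundary on both sides."""
    before_ok = start == 0 or text[start - 1] not in word_chars
    after_ok = end == len(text) or text[end] not in word_chars
    return before_ok and after_ok


sentence = "I love Big Apple/Bay Area."
start = sentence.index("Big Apple")
end = start + len("Big Apple")

# With the default set, "/" is a boundary, so "Big Apple" stands alone.
default_result = is_whole_word(sentence, start, end)

# Treating "/" as a word character glues "Apple/Bay" into one token,
# so "Big Apple" is no longer a whole-word match.
extended_result = is_whole_word(sentence, start, end, DEFAULT_WORD_CHARS | {"/"})
```

Passing the set as a parameter keeps the rule easy to vary, which is the same kind of adjustment that adding a character to the word characters makes.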
To set or add characters as part of word characters
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple')
>>> print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))
>>> # ['Big Apple']
>>> keyword_processor.add_non_word_boundary('/')
>>> print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))
>>> # []

Test
----
::

    $ git clone https://github.com/vi3k6i5/flashtext
    $ cd flashtext
    $ pip install pytest
    $ python setup.py test

Build Docs
----------
::

    $ git clone https://github.com/vi3k6i5/flashtext
    $ cd flashtext/docs
    $ pip install sphinx
    $ make html
    $ # open _build/html/index.html in browser to view it locally

Why not Regex?
--------------

It's a custom algorithm based on the `Aho-Corasick algorithm `_ and a `Trie Dictionary `_.

.. image:: https://github.com/vi3k6i5/flashtext/raw/master/benchmark.png
    :target: https://twitter.com/RadimRehurek/status/904989624589803520
    :alt: Benchmark

Time taken by FlashText to find terms in comparison to Regex.

.. image:: https://thepracticaldev.s3.amazonaws.com/i/xruf50n6z1r37ti8rd89.png

Time taken by FlashText to replace terms in comparison to Regex.

.. image:: https://thepracticaldev.s3.amazonaws.com/i/k44ghwp8o712dm58debj.png

Link to code for benchmarking the `Find Feature `_ and `Replace Feature `_.
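
To make the trie idea in this section concrete, here is a minimal standard-library sketch of trie-based keyword extraction. It is an illustration under stated assumptions, not FlashText's implementation: it omits Aho-Corasick failure links, matching is case-insensitive, word characters are assumed to be ``[A-Za-z0-9_]``, and the names ``build_trie``, ``extract_keywords``, ``WORD_CHARS`` and ``_END`` are hypothetical.

```python
import string

# Assumption for this sketch: letters, digits and underscore are word
# characters; every other character is a word boundary.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")
_END = "_end_"  # hypothetical sentinel key marking the end of a keyword


def build_trie(mapping):
    """Build a character trie from {unclean keyword: clean name}."""
    root = {}
    for keyword, clean_name in mapping.items():
        node = root
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[_END] = clean_name
    return root


def extract_keywords(trie, sentence):
    """Return clean names of the longest keyword matches, left to right."""
    text = sentence.lower()
    found = []
    i, n = 0, len(text)
    while i < n:
        # A match may only start at the beginning of a word.
        if i > 0 and text[i - 1] in WORD_CHARS:
            i += 1
            continue
        node, j = trie, i
        best_name, best_end = None, i
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            # Accept only if the keyword ends at a word boundary.
            if _END in node and (j == n or text[j] not in WORD_CHARS):
                best_name, best_end = node[_END], j
        if best_name is not None:
            found.append(best_name)
            i = best_end  # resume scanning after the matched keyword
        else:
            i += 1
    return found


trie = build_trie({"Big Apple": "New York", "Bay Area": "Bay Area"})
result = extract_keywords(trie, "I love Big Apple and Bay Area.")
```

In this sketch the work done at each position is bounded by the length of the longest keyword, not by how many keywords are stored, whereas a regex built as one large alternation may have to try many alternatives at each position.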
The idea for this library came from the following `StackOverflow question `_.

Citation
--------

The original paper published on the `FlashText algorithm `_.

::

    @ARTICLE{2017arXiv171100046S,
       author = {{Singh}, V.},
        title = "{Replace or Retrieve Keywords In Documents at Scale}",
      journal = {ArXiv e-prints},
    archivePrefix = "arXiv",
       eprint = {1711.00046},
     primaryClass = "cs.DS",
     keywords = {Computer Science - Data Structures and Algorithms},
         year = 2017,
        month = oct,
       adsurl = {http://adsabs.harvard.edu/abs/2017arXiv171100046S},
      adsnote = {Provided by the SAO/NASA Astrophysics Data System}
    }

The article published on `Medium freeCodeCamp `_.

Contribute
----------

- Issue Tracker: https://github.com/vi3k6i5/flashtext/issues
- Source Code: https://github.com/vi3k6i5/flashtext/

License
-------

The project is licensed under the MIT license.