Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/vi3k6i5/synonym-extractor
Extract synonyms, keywords from sentences using modified implementation of Aho Corasick algorithm
algorithm datastructures nlp python synonyms
- Host: GitHub
- URL: https://github.com/vi3k6i5/synonym-extractor
- Owner: vi3k6i5
- License: mit
- Created: 2017-07-02T07:40:52.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-08-17T10:00:15.000Z (over 7 years ago)
- Last Synced: 2024-12-02T22:36:09.928Z (about 2 months ago)
- Topics: algorithm, datastructures, nlp, python, synonyms
- Language: Python
- Homepage: http://synonym-extractor.readthedocs.io/en/latest/
- Size: 33.2 KB
- Stars: 40
- Watchers: 7
- Forks: 9
- Open Issues: 0
Metadata Files:
- Readme: README.rst
- License: LICENSE
README
This project has moved to Flash Text.
-----------------------------------------

synonym-extractor
=================

Synonym Extractor is a Python library that is loosely based on the Aho-Corasick algorithm.

The idea is to extract words that we care about from a given sentence in one pass.
Say I have a vocabulary of 10K words and I want to get all the words from that set that are present in a sentence. A simple regex match takes a lot of time because it has to loop over all 10K words.
Hence we use a simpler yet much faster algorithm to get the desired result.
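The one-pass idea can be sketched with a character trie. The block below is a minimal illustration and not this library's implementation: the names ``build_trie``, ``extract``, ``WORD_CHARS`` and ``END`` are hypothetical, the word-boundary rule (ASCII letters, digits and underscore) is an assumption, and the failure links of the full Aho-Corasick algorithm are left out.

```python
import string

# Characters treated as part of a word; anything else counts as a boundary.
# This boundary rule is an assumption of the sketch.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")
END = "_end_"  # hypothetical marker key; holds the clean name at a trie node


def build_trie(synonyms):
    """Build a character trie from a {synonym: clean_name} mapping."""
    root = {}
    for synonym, clean_name in synonyms.items():
        node = root
        for ch in synonym:
            node = node.setdefault(ch, {})
        node[END] = clean_name
    return root


def extract(trie, sentence):
    """Scan the sentence left to right and return the clean names found."""
    found = []
    i, n = 0, len(sentence)
    while i < n:
        # Only try to match at the first character of a word.
        if sentence[i] not in WORD_CHARS or (i > 0 and sentence[i - 1] in WORD_CHARS):
            i += 1
            continue
        node, j = trie, i
        match_end, match_name = None, None
        while j < n and sentence[j] in node:
            node = node[sentence[j]]
            j += 1
            # Keep the longest match that ends on a word boundary.
            if END in node and (j == n or sentence[j] not in WORD_CHARS):
                match_end, match_name = j, node[END]
        if match_name is not None:
            found.append(match_name)
            i = match_end
        else:
            # No synonym starts here: skip the rest of the current word.
            while i < n and sentence[i] in WORD_CHARS:
                i += 1
    return found


trie = build_trie({"NY": "new york", "new-york": "new york", "SF": "san francisco"})
print(extract(trie, "I love SF and NY. new-york is the best."))
# prints ['san francisco', 'new york', 'new york']
```

In this sketch the work done at each position is bounded by the length of the longest synonym, not by the number of synonyms in the vocabulary, which is why the lookup does not have to loop over every word in the set.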
Installation
------------
::

    pip install synonym-extractor

Usage
------
::

    # import module
    from synonym.extractor import SynonymExtractor

    # Create an object of SynonymExtractor
    synonym_extractor = SynonymExtractor()

    # add synonyms
    synonym_names = ['NY', 'new-york', 'SF']
    clean_names = ['new york', 'new york', 'san francisco']

    for synonym_name, clean_name in zip(synonym_names, clean_names):
        synonym_extractor.add_to_synonym(synonym_name, clean_name)

    synonyms_found = synonym_extractor.get_synonyms_from_sentence('I love SF and NY. new-york is the best.')

    synonyms_found
    >> ['san francisco', 'new york', 'new york']

Algorithm
----------

synonym-extractor is based on the Aho-Corasick algorithm.

Documentation
-------------

Documentation can be found at `Read the Docs <http://synonym-extractor.readthedocs.io/en/latest/>`_.

Why
------

Say you have a corpus where similar words appear frequently, e.g.::

    Last weekend I was in NY.
    I am traveling to new york next weekend.

If you train a word2vec model on this corpus, or do any other sort of NLP, it will treat NY and new york as 2 different words.

Instead, you can create a synonym dictionary like::

    NY=>new york
    new york=>new york

Then you can extract NY and new york as the same text.

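As a small illustration of the synonym dictionary idea, the sketch below maps both surface forms to the same clean name. It uses only the standard ``re`` module rather than this library, and ``SYNONYMS``, ``PATTERN`` and ``normalize`` are hypothetical names chosen for the example. It is one possible regex formulation, not necessarily the one the author benchmarked.

```python
import re

# Synonym dictionary from the example: surface form -> clean name.
SYNONYMS = {"NY": "new york", "new york": "new york"}

# One alternation pattern; longer forms come first so they win over shorter ones.
PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(s) for s in sorted(SYNONYMS, key=len, reverse=True))
    + r")\b"
)


def normalize(text):
    """Replace every known surface form with its clean name."""
    return PATTERN.sub(lambda m: SYNONYMS[m.group(0)], text)


corpus = [
    "Last weekend I was in NY.",
    "I am traveling to new york next weekend.",
]
print([normalize(doc) for doc in corpus])
# prints ['Last weekend I was in new york.', 'I am traveling to new york next weekend.']
```

After normalization both sentences contain the same text, new york, so a downstream model sees one term instead of two.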
To do the same with regex it will take a lot of time:

============  ==========  =  =========  =================
Docs count    # Synonyms  :  Regex      synonym-extractor
============  ==========  =  =========  =================
1.5 million   2K          :  16 hours   NA
2.5 million   10K         :  15 days    15 mins
============  ==========  =  =========  =================

The idea for this library came from the following StackOverflow question.

License
-------

The project is licensed under the MIT license.