https://github.com/georgesalkhouri/l3wtransformer
A word hashing method based on vectors of letter n-grams. Currently transforms text into sequences of numbers.
- Host: GitHub
- URL: https://github.com/georgesalkhouri/l3wtransformer
- Owner: GeorgesAlkhouri
- License: mit
- Created: 2017-07-08T22:02:42.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2018-02-27T15:35:41.000Z (over 8 years ago)
- Last Synced: 2024-10-14T09:40:58.399Z (over 1 year ago)
- Topics: bag-of-words, data-science, feature-extraction, letter-trigram-word-hashing, python, text-processing
- Language: Python
- Homepage:
- Size: 22.5 KB
- Stars: 10
- Watchers: 2
- Forks: 1
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
README
l3wtransformer
==============
> A word hashing method to reduce the dimensionality of the bag-of-words term vectors. It is based on letter n-gram. Given a word (e.g. good), it first adds word starting and ending marks to the word (e.g. #good#). Then, breaks the word into letter n-grams (e.g. letter trigrams: #go, goo, ood, od#). Finally, the word is represented using a vector of letter n-grams.
[Huang et al. 2013, Learning Deep Structured Semantic Models for Web Search using Clickthrough Data]
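
The letter n-gram step described in the quote can be sketched in a few lines of Python. This is only an illustration of the idea, not code taken from this package; the function name `letter_ngrams` and its parameters are assumptions made for the example.

```python
def letter_ngrams(word, n=3, mark="#"):
    """Return the letter n-grams of `word` after adding boundary marks.

    Hypothetical helper for illustration; not part of l3wtransformer.
    """
    marked = f"{mark}{word}{mark}"
    # Slide a window of width n over the marked word.
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


print(letter_ngrams("good"))  # ['#go', 'goo', 'ood', 'od#']
```

With `n=3` this reproduces the trigrams listed in the quote for the word "good".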
---
This implementation transforms **text into sequences of numbers**, where the numbers indicate descending word frequency.
For example:
*Lorem ipsum dolor sit amet, consectetuer adipiscing elit...* is transformed into *23, 1, 80, 86, 47, 50001, 21, 59, 83, 93, 14, 50003, 4, 7*
In addition, a flag is added after each word to indicate whether the word is lower case, upper case, mixed case, or has an initial capital.
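
One way to tell the four casing categories apart is sketched below. The function name `case_flag`, the string labels, and the handling of edge cases (words without letters, single capital letters) are assumptions for illustration; how the package itself decides, and which numeric values it uses for the flags, is not specified here.

```python
def case_flag(word):
    """Classify the casing of `word` (hypothetical helper, not the package's code)."""
    # Assumption: words without any letters (e.g. "123") count as lower case.
    if word.islower() or not any(c.isalpha() for c in word):
        return "lower"
    if word.isupper():
        return "upper"
    if word[0].isupper() and word[1:].islower():
        return "initial"
    return "mixed"


for w in ["example", "NASA", "First", "iPhone"]:
    print(w, case_flag(w))
```

In this sketch a single capital letter such as "A" is reported as upper case rather than as an initial capital; a real implementation may resolve that ambiguity differently.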
### To do
A planned implementation will support the transformation from **text into bag-of-words vectors**.
Install
-------
```
pip install l3wtransformer
```
Usage
-----
```
from l3wtransformer import L3wTransformer

# Create the transformer and fit it on a small corpus.
l3wt = L3wTransformer()
l3wt.fit_on_texts(['First example.', 'And one more!'])

# Transform new texts into sequences of numbers.
l3wt.texts_to_sequences(['One example', '2nd exa.'])
# [[5, 18, 17, 50001, 2, 10, 24, 6, 15, 20, 50003], [16, 50003, 2, 10, 50003]]
```
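
To give an intuition for the fit-then-transform pattern above, here is a minimal, self-contained sketch that assigns each letter trigram an index by descending frequency. It is a simplified stand-in, not the package's implementation: the class `TrigramIndexer`, the whitespace tokenization, the lowercasing, and the choice to silently drop unseen trigrams are all assumptions, and it omits punctuation handling and case flags.

```python
from collections import Counter


def letter_trigrams(word):
    """Letter trigrams of `word` with '#' boundary marks."""
    marked = f"#{word}#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]


class TrigramIndexer:
    """Hypothetical stand-in for illustration; not the L3wTransformer class."""

    def __init__(self):
        self.index = {}

    def fit_on_texts(self, texts):
        # Count every trigram in the (lowercased, whitespace-split) corpus.
        counts = Counter(
            tri
            for text in texts
            for word in text.lower().split()
            for tri in letter_trigrams(word)
        )
        # Index 1 goes to the most frequent trigram; ties keep first-seen order.
        ranked = counts.most_common()
        self.index = {tri: rank for rank, (tri, _) in enumerate(ranked, start=1)}

    def texts_to_sequences(self, texts):
        # Assumption: trigrams not seen during fitting are dropped.
        return [
            [
                self.index[tri]
                for word in text.lower().split()
                for tri in letter_trigrams(word)
                if tri in self.index
            ]
            for text in texts
        ]


indexer = TrigramIndexer()
indexer.fit_on_texts(["good good food"])
print(indexer.texts_to_sequences(["good mood"]))  # [[3, 4, 1, 2, 1, 2]]
```

Here `ood` and `od#` occur three times each and so receive the smallest indices, while the unseen trigrams of "mood" (`#mo`, `moo`) are left out of the output.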