https://github.com/georgesalkhouri/l3wtransformer

A word hashing method based on vectors of letter n-grams. Currently transforms text into sequences of numbers.

Host: GitHub
URL: https://github.com/georgesalkhouri/l3wtransformer
Owner: GeorgesAlkhouri
License: mit
Created: 2017-07-08T22:02:42.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2018-02-27T15:35:41.000Z (over 8 years ago)
Last Synced: 2024-10-14T09:40:58.399Z (almost 2 years ago)
Topics: bag-of-words, data-science, feature-extraction, letter-trigram-word-hashing, python, text-processing
Language: Python
Homepage:
Size: 22.5 KB
Stars: 10
Watchers: 2
Forks: 1
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE

README

l3wtransformer
==============

> A word hashing method to reduce the dimensionality of the bag-of-words term vectors. It is based on letter n-gram. Given a word (e.g. good), it first adds word starting and ending marks to the word (e.g. #good#). Then, breaks the word into letter n-grams (e.g. letter trigrams: #go, goo, ood, od#). Finally, the word is represented using a vector of letter n-grams.

[Huang et al. 2013, Learning Deep Structured Semantic Models for Web Search using Clickthrough Data]
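The steps in the quote can be sketched in a few lines of Python. This is a minimal illustration of letter n-gram hashing, not the package's own code; the function name `word_to_ngrams` and its parameters `n` and `mark` are assumptions made for this example.

```python
def word_to_ngrams(word, n=3, mark="#"):
    """Return the letter n-grams of `word` after adding boundary marks.

    Hypothetical helper for illustration; not part of the l3wtransformer API.
    """
    # Step 1: add the word starting and ending marks, e.g. good -> #good#.
    marked = f"{mark}{word}{mark}"
    # Step 2: slide a window of width n over the marked word.
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


print(word_to_ngrams("good"))  # ['#go', 'goo', 'ood', 'od#']
```

A word of length k yields k + 2 - n + 1 n-grams, so the vocabulary is bounded by the number of distinct letter n-grams rather than the number of distinct words.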

---

This implementation transforms **text into sequences of numbers**, with the numbers indicating descending word frequency.

For example:

*Lorem ipsum dolor sit amet, consectetuer adipiscing elit...* is transformed into *23, 1, 80, 86, 47, 50001, 21, 59, 83, 93, 14, 50003, 4, 7*

In addition, a flag is appended after each word to indicate whether the word is lower case, upper case, mixed case, or capitalized only at its first letter.
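The following sketch shows one way such a sequence could be built: letter trigrams are indexed by descending frequency, and a case flag follows each word. It is an illustration under stated assumptions, not the package's implementation. All function names are hypothetical, the flags are kept as strings because the numeric codes the package uses are not documented here, and dropping unseen trigrams is an assumption.

```python
from collections import Counter


def word_ngrams(word, n=3, mark="#"):
    # Hypothetical helper: mark the word boundaries, then slide a window.
    marked = f"{mark}{word}{mark}"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


def case_flag(word):
    # Hypothetical flag names; the package emits numeric codes instead.
    if word == word.lower():
        return "lower"
    if word == word.upper():
        return "upper"
    if word[:1].isupper() and word[1:] == word[1:].lower():
        return "initial_cap"
    return "mixed"


def fit_ngram_index(texts, n=3):
    # Count every trigram of every lower-cased, whitespace-separated word.
    counts = Counter(
        gram
        for text in texts
        for word in text.split()
        for gram in word_ngrams(word.lower(), n)
    )
    # Index 1 goes to the most frequent trigram, 2 to the next, and so on.
    return {gram: i for i, (gram, _) in enumerate(counts.most_common(), start=1)}


def text_to_sequence(text, index, n=3):
    seq = []
    for word in text.split():
        # Assumption: trigrams not seen during fitting are skipped.
        seq.extend(index[g] for g in word_ngrams(word.lower(), n) if g in index)
        # The flag is computed on the original, un-lowered word.
        seq.append(case_flag(word))
    return seq


index = fit_ngram_index(["First example.", "And one more!"])
print(text_to_sequence("One example", index))
```

Lower-casing before indexing keeps the trigram vocabulary small, while the flag preserves the casing information that lower-casing would otherwise discard.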

### To do

Planned: an implementation that supports transforming **text into bag-of-words vectors**.

Install
-------

```
pip install l3wtransformer
```

Usage
-----

```
from l3wtransformer import L3wTransformer

l3wt = L3wTransformer()
l3wt.fit_on_texts(['First example.', 'And one more!'])
l3wt.texts_to_sequences(['One example', '2nd exa.'])
# [[5, 18, 17, 50001, 2, 10, 24, 6, 15, 20, 50003], [16, 50003, 2, 10, 50003]]
```
