https://github.com/4ai/embedding_features
- Host: GitHub
- URL: https://github.com/4ai/embedding_features
- Owner: 4AI
- License: mit
- Created: 2019-03-09T05:22:18.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2019-03-24T05:33:34.000Z (about 7 years ago)
- Last Synced: 2025-04-23T17:44:49.121Z (about 1 year ago)
- Language: Python
- Size: 189 KB
- Stars: 5
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# embedding_features
> A library to extract word embedding features to train your linear model.
> It provides a sklearn-like API, so you can easily combine it with sklearn models.
# Algorithms
The embedding algorithms we support:
- [x] word2vec
- [x] fasttext
Both `word2vec` and `fasttext` are implemented with [gensim](https://github.com/RaRe-Technologies/gensim).
## Parameters
### Word2vecFeature
```python
embedding_features.fasttext.Word2vecFeature (
n_dim=100, # embedding size
min_count=1, # min frequency of token
window=5, # context window
n_jobs=-1, # workers
pretrained_file=None, # pretrained word2vec binary model file
save_file=None # path to save trained word2vec model
)
```
### FasttextFeature
```python
embedding_features.fasttext.FasttextFeature (
n_dim=100, # embedding size
min_count=1, # min frequency of token
window=5, # context window
n_jobs=-1, # workers
pretrained_file=None, # pretrained word2vec binary model file
save_file=None # path to save trained word2vec model
)
```
# Install
```bash
git clone https://github.com/4AI/embedding_features.git
cd embedding_features
python setup.py install
```
# Get Started
To extract embedding features, import the feature class you need from `embedding_features` and prepare the input data.
```python
from embedding_features.fasttext import FasttextFeature
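# load_data is not defined in this README; supply your own helper that returns inputs X and labels y
X, y = load_data('examples/corpus/mpqa.txt')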
```
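`load_data` is not defined anywhere in this README. The following is a minimal sketch of such a helper, assuming (this is not confirmed by the source) that each line of the corpus file holds a label and a sentence separated by a tab. The file format and the return types are assumptions; adjust them to the actual corpus, for example if the library expects tokenized input.

```python
from typing import List, Tuple


def load_data(path: str) -> Tuple[List[str], List[str]]:
    """Hypothetical loader: reads 'label<TAB>text' lines into (X, y)."""
    X, y = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line.strip():
                continue  # skip blank lines
            # Assumed layout: everything before the first tab is the label.
            label, _, text = line.partition('\t')
            X.append(text)
            y.append(label)
    return X, y
```

With a loader like this, `X` is a list of raw sentences and `y` a list of string labels, which sklearn classifiers accept directly as targets.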
If you want to split your data into training and test sets, sklearn makes this easy.
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```
Now we can fit the embedding features on the training set:
```python
fea = FasttextFeature()
fea.fit(X_train)
```
As in sklearn, you can fit and transform at the same time.
```python
train_vecs = fea.fit_transform(X_train)
```
After calling `fit` or `fit_transform` on the training set, use `transform()` to convert the test set into vectors.
```python
test_vecs = fea.transform(X_test)
```
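The README does not say how the per-token vectors are combined into a single vector per text. A common approach, shown here only as an illustration and not as this library's confirmed behavior, is to average the vectors of the known tokens. The toy `embeddings` table and the `mean_pool` helper are hypothetical names introduced for this sketch.

```python
from typing import Dict, List


def mean_pool(tokens: List[str], embeddings: Dict[str, List[float]], n_dim: int) -> List[float]:
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * n_dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(n_dim)]


# Toy 2-dimensional embeddings, made up for illustration only.
embeddings = {'good': [1.0, 0.0], 'movie': [0.0, 1.0]}

# 'tonight' is not in the table, so only 'good' and 'movie' are averaged.
doc_vec = mean_pool('good movie tonight'.split(), embeddings, n_dim=2)
```

Averaging yields a fixed-length vector regardless of sentence length, which is what a linear model or an SVM needs as input.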
Now that we have vector representations of the training and test sets, we can train a model and evaluate it.
```python
from sklearn.svm import SVC
clf = SVC(kernel='rbf', verbose=True)
clf.fit(train_vecs, y_train)
# score() returns the mean accuracy on the given test data
score = clf.score(test_vecs, y_test)
```
# More detail
You can save the trained embedding model so that it can be loaded next time.
```python
fea = FasttextFeature(save_file='./model.bin')
fea.fit(X_train)
```
You can load a pretrained embedding model rather than training on the training set:
```python
fea = FasttextFeature(pretrained_file='/path/to/pretrained_model.bin')
train_vecs = fea.transform(X_train)
test_vecs = fea.transform(X_test)
```
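The two options above can be combined so that a model is trained and saved on the first run and reloaded afterwards. This is a sketch that assumes a file written through `save_file` can be read back through `pretrained_file`; `feature_kwargs` is a hypothetical helper and is not part of the library.

```python
from pathlib import Path


def feature_kwargs(model_path: str) -> dict:
    """Pick constructor keyword arguments depending on whether a saved model exists."""
    if Path(model_path).exists():
        return {'pretrained_file': model_path}  # reuse the saved model
    return {'save_file': model_path}  # train later and save to this path


kwargs = feature_kwargs('./model.bin')
# fea = FasttextFeature(**kwargs)
# Call fea.fit(X_train) only when 'save_file' was chosen; otherwise go straight to transform().
```

Keeping the decision in one place avoids retraining the embedding model on every run.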
# License
[MIT](https://github.com/4AI/embedding_features/blob/master/LICENSE)