Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/huu4ontocord/keyedvectorsann

Genism word2vec + Pysparnn ANN + Trimmed GoogleNewsVec = Fast and lightweight NLP tool
https://github.com/huu4ontocord/keyedvectorsann

Last synced: 10 days ago
JSON representation

Genism word2vec + Pysparnn ANN + Trimmed GoogleNewsVec = Fast and lightweight NLP tool

Host: GitHub
URL: https://github.com/huu4ontocord/keyedvectorsann
Owner: huu4ontocord
Created: 2017-03-17T23:18:51.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2017-03-18T05:02:26.000Z (over 7 years ago)
Last Synced: 2024-02-16T02:29:21.226Z (9 months ago)
Language: Python
Homepage:
Size: 51.8 KB
Stars: 3
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

        # KeyedVectorsANN

Genism word2vec + Pysparnn ANN + Trimmed GoogleNewsVec = Fast and lightweight NLP tool  

This software is an extension of gensim's KeyedVectors using pysparnn's approximate nearest neighber indexer. It depends on gensim, numpy, sklearn and scipy.  

It also includes a utility to load Google News' vector and collapse down to a manageable size.  

Copyright (C) 2017 Hiep Huu Nguyen   

Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html  

See additional licenses at https://github.com/facebookresearch/pysparnn and https://radimrehurek.com/gensim/  

Download the the google vector file from here: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit  

TODO: Extend gensim's word2vec to use KeyedVectorsANN. Refactor to use the pysparnn module when compatible.  

You can create the model based on Google News' 3,000,000 word vectors. This will result in a vocab of ~16K words with an additional ~44K "synonyms" of compound words.  

Model creation:

~~~

import gensim, time

from KeyedVectorsANN import *

t = time.time()    

kv = prepareANNModel("./nativedata/GoogleNews-vectors-negative300.bin", "GoogleNewsANN.bin", createSynonyms=True)

print ("finished creating model ...", time.time() - t, "seconds")        

~~~

Testing accuracy:

~~~ 

import gensim, time

from KeyedVectorsANN import *

t = time.time()    

#assumes you have already created the ANN file by calling:

#kv = prepareANNModel("./nativedata/GoogleNews-vectors-negative300.bin", "GoogleNewsANN.bin", createSynonyms=True)

kv = KeyedVectorsANN.load('GoogleNewsANN.bin')

#you can set the k_clusters variable to get higher accuracy at the expense of speed. 

#1 is fastest with lowest accuracy. Default is set to 10.

#kv.indexer.k_clusters = 1 

#simple analogy lookup

#print (kv.most_similar(['dog',], [], 10))

#you can also pass k_clusters= as a parameter to most_similar

#do an accuracy test

acc_data = kv.accuracy_indexer("./nativedata/questions-words.txt")

for section in acc_data:

  if len(section['correct']) +  len(section['incorrect']) > 0:

    if section['section'] == 'total':

      print (section['section'], len(section['correct'])/(len(section['correct']) +  len(section['incorrect'])), len(section['correct']),  len(section['correct']) + len(section['incorrect']))

    else:

      print (section['section'], len(section['correct'])/(len(section['correct']) +  len(section['incorrect'])))

print ("finished ann accuracy ...", time.time() - t, "seconds")        

~~~ 

Results in:

 

~~~

capital-common-countries 0.9263157894736842

capital-world 0.9041731066460588

currency 0.4262948207171315

city-in-state 0.7990165949600492

family 0.8201581027667985

gram1-adjective-to-adverb 0.3326612903225806

gram2-opposite 0.44950738916256155

gram3-comparative 0.7507507507507507

gram4-superlative 0.8368983957219251

gram5-present-participle 0.7395833333333334

gram6-nationality-adjective 0.9441571871768356

gram7-past-tense 0.6730769230769231

gram8-plural 0.7942942942942943

gram9-plural-verbs 0.732183908045977

total 0.7252097774534841 9939 13705

finished ann accuracy ... 45.34215688705444 seconds

~~~