Various accrued torch utilities including Word2Vec model loader
- Host: GitHub
- URL: https://github.com/dpressel/torchure
- Owner: dpressel
- License: mit
- Created: 2016-04-22T14:07:15.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2016-12-28T21:01:27.000Z (almost 9 years ago)
- Last Synced: 2025-01-03T21:27:50.682Z (10 months ago)
- Language: Lua
- Size: 15.6 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# torchure
A tiny module of accrued Torch utilities. It includes a Word2Vec model loader for using existing embeddings in Torch, and a derived nn.LookupTable implementation that loads its weights from a word2vec binary.
About
-----
Load and use word embeddings. Currently only word2vec binaries are supported. There are two ways to use them. The first is to load just a representation of the embeddings into a "Word2VecModel":
```
th> require 'torchure';
[0.0000s]
th> wv = Word2VecModel('/data/xdata/oct-s140clean-uber.cbow-bin')
[14.4513s]
th> wv.vsz, wv.dsz
949887 150
[0.0001s]
th> hellov = wv:lookup('hello')
```
If a word isn't present, by default the lookup returns a zero vector. If you pass true as the second argument, it returns nil instead:
```
th> wv:lookup('kjlasgjklwljk', true)
[0.0000s]
th> wv:lookup('kjlasgjklwljk', false)
0
0
0
0
0
0
...
```
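For example, the nil return makes it easy to check how much of a token list is covered by the pretrained vectors. This is just a sketch, assuming wv is the Word2VecModel loaded above and using an illustrative token list:
```
-- count how many tokens have a pretrained vector, using the nil return for OOV words
local toks = {'hello', 'world', 'kjlasgjklwljk'}  -- illustrative tokens
local hits = 0
for _, w in ipairs(toks) do
    if wv:lookup(w, true) ~= nil then
        hits = hits + 1
    end
end
print(string.format('coverage: %d/%d tokens', hits, #toks))
```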
The approach above might be useful if you are not planning on fine-tuning your embeddings and just want to preprocess your data. For instance, in the case of classification, you can simply form a tensor (or batch of tensors). This example loads temporal vectors from a TSV file, where the first field is the label, followed by a sentence. It returns a table with sub-tables for x (the feature vectors) and y (the labels). Each feature vector has a row for each word and a column for each embedding dimension.
```
function loadTemporal(file, w2v, filtsz, mxlen)
    local ts = {}
    local yt = {}
    local xt = {}
    local dsz = w2v.dsz
    local fsz = filtsz or 0
    local mxw = mxlen or 128
    -- zero-pad both ends so a temporal convolution of width filtsz fits
    local halffiltsz = math.floor(fsz / 2)
    local tsfile = io.open(file, 'r')
    local linenum = 0
    for line in tsfile:lines() do
        linenum = linenum + 1
        local labelText = line:split('\t')
        if #labelText < 2 then
            print('Skipping invalid line ' .. linenum .. ': ' .. line)
            goto continue
        end
        local num = tonumber(labelText[1])
        local y = torch.FloatTensor({num})
        local toks = labelText[2]:split(' ')
        local mx = math.min(#toks, mxw)
        local siglen = mx + (2 * halffiltsz)
        -- one row per (padded) word position, one column per embedding dimension
        local x = torch.zeros(siglen, dsz)
        for i = 1, mx do
            local w = toks[i]
            x[{i + halffiltsz}] = w2v:lookup(w)
        end
        table.insert(yt, y)
        table.insert(xt, x)
        ::continue::
    end
    tsfile:close()
    ts.y = yt
    ts.x = xt
    return ts
end
```
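A possible way to call this loader is sketched below; the file paths, filter width, and maximum length are placeholders, and wv is a Word2VecModel as above:
```
require 'torchure';
local wv = Word2VecModel('/path/to/word2vec.bin')          -- placeholder path
local ts = loadTemporal('/path/to/train.tsv', wv, 5, 100)  -- placeholder path, filtsz=5, mxlen=100
print(#ts.x .. ' examples loaded')
print(ts.x[1]:size())  -- siglen x dsz feature matrix for the first example
print(ts.y[1])         -- its label as a 1-element tensor
```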
If you want to use a model directly as pretraining for an nn.LookupTable, you can use the Word2VecLookupTable instead. The usage is almost the same, except that this can be used both to look up word vectors and as a layer dropped directly into a neural network.
```
th> wv = Word2VecLookupTable('/data/xdata/oct-s140clean-uber.cbow-bin')
[12.4409s]
th> wv.vsz, wv.dsz
949887 150
[0.0001s]
th> hellov = wv:lookup('hello')
[0.0001s]
th> wv:lookup('kjlasgjklwljk')
[0.0000s]
```
Now we can use the table directly as a LookupTable, for example, in a convolutional network:
```
function createCNN(w2v, cmotsz, filtsz, nc, gpu)
    local seq = nn.Sequential()
    seq:add(w2v)                                               -- pretrained embeddings as the first layer
    seq:add(nn.TemporalConvolution(w2v.dsz, cmotsz, filtsz))   -- convolution over time
    seq:add(nn.Max())                                          -- max-over-time pooling
    seq:add(nn.Linear(cmotsz, nc))
    seq:add(nn.LogSoftMax())
    return gpu and seq:cuda() or seq
end

local w2v = Word2VecLookupTable(embeddings)
print('Loaded word embeddings')
local cnn = createCNN(w2v, 200, 5, labelsz, gpu)
```
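To sanity-check the network end to end, here is a minimal sketch of one forward/backward pass on the CPU (gpu=false); the word indices, gold label, and learning rate are all illustrative, and mapping tokens to row indices of the lookup table is not shown:
```
local crit = nn.ClassNLLCriterion()
local indices = torch.LongTensor({1, 2, 3, 4, 5, 6})  -- placeholder word indices (need at least filtsz of them)
local gold = 2                                        -- illustrative class label in 1..labelsz
local pred = cnn:forward(indices)                     -- log-probabilities over the labels
local loss = crit:forward(pred, gold)
cnn:zeroGradParameters()
cnn:backward(indices, crit:backward(pred, gold))
cnn:updateParameters(0.01)                            -- one plain SGD step
print('loss: ' .. loss)
```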
Installing
----------
*From git*
- clone the repository
- execute:
```
luarocks make torchure-scm-1.rockspec
```
*From web*
- execute:
```
luarocks install https://raw.githubusercontent.com/dpressel/torchure/master/torchure-scm-1.rockspec
```