# torchure
Tiny module of accrued Torch utilities. This includes a Word2Vec model loader for using existing embeddings in Torch, and a derived implementation of nn.LookupTable that loads its weights from a word2vec binary.

About
-----

Load and use word embeddings. Currently this just reads word2vec binaries. There are two ways to use the embeddings. You can load a read-only representation of them into a `Word2VecModel`:

```
th> require 'torchure';
[0.0000s]
th> wv = Word2VecModel('/data/xdata/oct-s140clean-uber.cbow-bin')
[14.4513s]
th> wv.vsz, wv.dsz
949887 150
[0.0001s]
th> hellov = wv:lookup('hello')
```
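Since `lookup` returns an ordinary Torch tensor, you can work with the result directly. As a minimal sketch (assuming only the `lookup` API shown above), here is a cosine similarity between two word vectors:

```
-- Sketch: cosine similarity between two embedding vectors.
-- Assumes wv is the Word2VecModel loaded above.  Note that the
-- default zero vector returned for an OOV word makes its norm
-- zero, so guard against that in real code.
local a = wv:lookup('hello')
local b = wv:lookup('world')
print(torch.dot(a, b) / (a:norm() * b:norm()))
```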

If a word isn't present, by default, it gives back a zero vector. If you pass `true` as the second parameter, it will return `nil` instead:

```
th> wv:lookup('kjlasgjklwljk', true)
[0.0000s]
th> wv:lookup('kjlasgjklwljk', false)
0
0
0
0
0
0
```
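The `nil` return is handy when you want to treat out-of-vocabulary tokens specially. A minimal sketch (using only the two-argument `lookup` shown above) that counts OOV tokens while vectorizing a sentence:

```
-- Sketch: count OOV tokens in a sentence, leaving their rows as zeros.
-- Assumes wv is a loaded Word2VecModel.
local toks = {'hello', 'kjlasgjklwljk', 'world'}
local oov = 0
local x = torch.zeros(#toks, wv.dsz)
for i, w in ipairs(toks) do
    local z = wv:lookup(w, true)
    if z == nil then
        oov = oov + 1    -- row i stays all zeros
    else
        x[i] = z
    end
end
print(oov .. ' OOV tokens out of ' .. #toks)
```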

The approach above may be useful if you do not plan to fine-tune your embeddings and just want to preprocess your data. For instance, in the case of classification, you can simply form a tensor (or a batch of tensors). The example below loads temporal vectors from a TSV file where each line contains a label, a tab, and then a sentence. It returns a table with sub-tables for x (the feature vectors) and y (the labels). Each feature vector has a row per word and a column per embedding dimension; the optional filter-size argument zero-pads each signal by half the filter width on each side, so that a temporal convolution of that width can be applied later.

```
-- Load a TSV of (label \t sentence) lines into parallel tables of
-- temporal feature tensors (x) and label tensors (y).  The optional
-- filtsz argument zero-pads each signal by half the filter width on
-- each side; mxlen caps the number of tokens used per sentence.
function loadTemporal(file, w2v, filtsz, mxlen)
    local ts = {}
    local yt = {}
    local xt = {}

    local dsz = w2v.dsz
    local fsz = filtsz or 0
    local mxw = mxlen or 128
    local halffiltsz = math.floor(fsz / 2)

    local tsfile = io.open(file, 'r')
    local linenum = 1

    for line in tsfile:lines() do
        local labelText = line:split('\t')
        if #labelText < 2 then
            print('Skipping invalid line ' .. linenum .. ': ' .. line)
            goto continue
        end

        local num = tonumber(labelText[1])
        local y = torch.FloatTensor({num})
        local toks = labelText[2]:split(' ')

        local mx = math.min(#toks, mxw)
        local siglen = mx + (2 * halffiltsz)
        local x = torch.zeros(siglen, dsz)
        for i = 1, mx do
            -- OOV words come back as zero vectors by default
            x[{i + halffiltsz}] = w2v:lookup(toks[i])
        end
        table.insert(yt, y)
        table.insert(xt, x)

        ::continue::
        linenum = linenum + 1
    end
    tsfile:close()

    ts.y = yt
    ts.x = xt
    return ts
end
```
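As a usage sketch (the file path here is hypothetical), the returned table can then be inspected or fed to a temporal model one example at a time:

```
-- Sketch: load a training TSV with a filter width of 5.
local ts = loadTemporal('/data/xdata/train.tsv', wv, 5)
print(#ts.x .. ' examples loaded')
-- Each ts.x[i] is a (padded length x dsz) tensor ready for
-- nn.TemporalConvolution; ts.y[i] holds the matching label.
print(ts.x[1]:size(1), ts.x[1]:size(2))
```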

If you want to use a model directly as pretraining for an nn.LookupTable, you can use the Word2VecLookupTable instead. The usage is almost the same, except this can be used both as a way to look up word vectors and as a layer placed directly into a neural network.

```
th> wv = Word2VecLookupTable('/data/xdata/oct-s140clean-uber.cbow-bin')
[12.4409s]
th> wv.vsz, wv.dsz
949887 150
[0.0001s]
th> hellov = wv:lookup('hello')
[0.0001s]
th> wv:lookup('kjlasgjklwljk')
[0.0000s]
```
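One caveat when using it as a layer: like any nn.LookupTable, it consumes tensors of word indices rather than strings, so token-to-index conversion happens in your preprocessing. The sketch below assumes a hypothetical `wv.vocab` word-to-index map; check the actual Word2VecLookupTable fields before relying on it:

```
-- Sketch: turn tokens into an index tensor for the lookup table.
-- wv.vocab is a HYPOTHETICAL word -> index map, named here only
-- for illustration; torchure's real vocabulary API may differ.
local function toIndices(wv, toks, unkIdx)
    local ix = torch.LongTensor(#toks)
    for i, w in ipairs(toks) do
        ix[i] = wv.vocab[w] or unkIdx   -- fall back for OOV words
    end
    return ix
end
```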
Now we can use the table directly as a LookupTable, for example, in a convolutional network:

```
-- Build a simple convolutional classifier over word embeddings.
-- w2v is a Word2VecLookupTable acting as the first (lookup) layer;
-- seq:cuda() requires the cunn package.
function createCNN(w2v, cmotsz, filtsz, nc, gpu)
    local seq = nn.Sequential()
    seq:add(w2v)
    seq:add(nn.TemporalConvolution(w2v.dsz, cmotsz, filtsz))
    seq:add(nn.Max())    -- max-over-time pooling
    seq:add(nn.Linear(cmotsz, nc))
    seq:add(nn.LogSoftMax())

    return gpu and seq:cuda() or seq
end

local w2v = Word2VecLookupTable(embeddings)
print('Loaded word embeddings')
local cnn = createCNN(w2v, 200, 5, labelsz, gpu)
```
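To train the resulting network, the standard Torch nn machinery applies. A minimal single-example training step (a sketch, assuming 1-based class labels and an index tensor produced by your own preprocessing; on GPU the inputs would need :cuda() as well):

```
require 'nn'

-- Sketch: one stochastic gradient step with an NLL criterion.
-- 'indices' is a LongTensor of word ids and 'label' a class id;
-- both come from your own preprocessing.
local crit = nn.ClassNLLCriterion()
local pred = cnn:forward(indices)
local loss = crit:forward(pred, label)
cnn:zeroGradParameters()
cnn:backward(indices, crit:backward(pred, label))
cnn:updateParameters(0.01)   -- learning rate chosen arbitrarily
print('loss: ' .. loss)
```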

Installing
----------

*From git*

- clone the repository

- execute:
```
luarocks make torchure-scm-1.rockspec
```

*From web*

- execute:
```
luarocks install https://raw.githubusercontent.com/dpressel/torchure/master/torchure-scm-1.rockspec
```