https://github.com/vzhong/embeddings

Fast, DB Backed pretrained word embeddings for natural language processing.

Host: GitHub
URL: https://github.com/vzhong/embeddings
Owner: vzhong
License: mit
Created: 2017-02-27T04:05:12.000Z (almost 8 years ago)
Default Branch: master
Last Pushed: 2023-10-23T11:09:11.000Z (over 1 year ago)
Last Synced: 2025-02-01T12:06:22.874Z (14 days ago)
Topics: deep-learning, neural-network, nlp
Language: Python
Homepage:
Size: 46.9 KB
Stars: 222
Watchers: 4
Forks: 31
Open Issues: 1
Metadata Files:
- Readme: README.rst
- License: LICENSE

Awesome Lists containing this project

awesome-deeplearning-resources - Available pretrained word embeddings

README

Embeddings
==========

.. image:: https://readthedocs.org/projects/embeddings/badge/?version=latest
    :target: http://embeddings.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status

.. image:: https://travis-ci.org/vzhong/embeddings.svg?branch=master
    :target: https://travis-ci.org/vzhong/embeddings

Embeddings is a Python package that provides pretrained word embeddings for natural language processing and machine learning.

Instead of loading a large file into memory to query embeddings, ``embeddings`` is backed by a database, which makes it fast to load and query:

.. code-block:: python

    >>> %timeit GloveEmbedding('common_crawl_840', d_emb=300)
    100 loops, best of 3: 12.7 ms per loop

    >>> %timeit GloveEmbedding('common_crawl_840', d_emb=300).emb('canada')
    100 loops, best of 3: 12.9 ms per loop

    >>> g = GloveEmbedding('common_crawl_840', d_emb=300)

    >>> %timeit -n1 g.emb('canada')
    1 loop, best of 3: 38.2 µs per loop

Installation
------------

.. code-block:: sh

    pip install embeddings  # from pypi
    pip install git+https://github.com/vzhong/embeddings.git  # from github

Usage
-----

Upon first use, the embeddings are downloaded to disk in the form of a SQLite database.
This may take a long time for large embeddings such as GloVe.
Subsequent lookups are queried directly against the database.

Embedding databases are stored in the ``$EMBEDDINGS_ROOT`` directory (defaults to ``~/.embeddings``). Note that this location is probably **undesirable** if your home directory is on NFS, as it would slow down database queries significantly.
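
As a minimal sketch of the rule stated above, the snippet below resolves the storage directory as ``$EMBEDDINGS_ROOT`` if it is set and ``~/.embeddings`` otherwise. The helper name ``resolve_embeddings_root`` and the ``/tmp/embeddings`` path are hypothetical and only illustrate the documented default; they are not part of the package's API.

.. code-block:: python

    import os


    def resolve_embeddings_root(environ=os.environ):
        """Hypothetical helper mirroring the documented default, not the package's own code."""
        default = os.path.join(os.path.expanduser('~'), '.embeddings')
        return environ.get('EMBEDDINGS_ROOT', default)


    # Redirect storage to a local (non-NFS) disk; the path is only an illustration.
    os.environ['EMBEDDINGS_ROOT'] = '/tmp/embeddings'
    print(resolve_embeddings_root())

Setting the variable before any embedding object is created is the conservative choice, since this sketch does not assume when the package reads it.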

.. code-block:: python

    from embeddings import GloveEmbedding, FastTextEmbedding, KazumaCharEmbedding, ConcatEmbedding

    # The first instantiation downloads each embedding; later runs reuse the local database.
    g = GloveEmbedding('common_crawl_840', d_emb=300, show_progress=True)
    f = FastTextEmbedding()
    k = KazumaCharEmbedding()
    # Combine the three embeddings into a single embedding object.
    c = ConcatEmbedding([g, f, k])

    for w in ['canada', 'vancouver', 'toronto']:
        print('embedding {}'.format(w))
        print(g.emb(w))
        print(f.emb(w))
        print(k.emb(w))
        print(c.emb(w))
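
To illustrate why a database-backed embedding is cheap to open and query, here is a self-contained sketch built on the standard-library ``sqlite3`` module. The class name ``TinySQLiteEmbedding``, the table schema, and the JSON serialization are assumptions made for this example; the schema the package actually uses is not shown here, and the vector is a toy value rather than a real GloVe entry.

.. code-block:: python

    import json
    import sqlite3


    class TinySQLiteEmbedding:
        """Illustrative stand-in for a database-backed embedding; the schema is an assumption."""

        def __init__(self, path=':memory:'):
            # Opening a connection is cheap: no vectors are read until queried.
            self.db = sqlite3.connect(path)
            self.db.execute(
                'CREATE TABLE IF NOT EXISTS embeddings (word TEXT PRIMARY KEY, emb TEXT)'
            )

        def insert(self, word, vector):
            # Vectors are serialized as JSON text to keep the sketch dependency-free.
            self.db.execute(
                'INSERT OR REPLACE INTO embeddings (word, emb) VALUES (?, ?)',
                (word, json.dumps(vector)),
            )
            self.db.commit()

        def emb(self, word, default=None):
            # The primary-key index turns this into a single-row lookup.
            row = self.db.execute(
                'SELECT emb FROM embeddings WHERE word = ?', (word,)
            ).fetchone()
            return json.loads(row[0]) if row else default


    e = TinySQLiteEmbedding()
    e.insert('canada', [0.1, 0.2, 0.3])  # toy vector, not a real GloVe value
    print(e.emb('canada'))
    print(e.emb('atlantis'))  # unknown words fall back to the default

Because each lookup touches only one indexed row, start-up cost does not depend on the vocabulary size, which is consistent with the timings reported at the top of this README.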

Docker
------

If you use Docker, an image prepopulated with the Common Crawl 840 GloVe embeddings and Kazuma Hashimoto's character n-gram embeddings is available at ``vzhong/embeddings``.

To mount volumes from this container, set ``$EMBEDDINGS_ROOT`` in your container to ``/opt/embeddings``.

For example:

.. code-block:: bash

    docker run --volumes-from vzhong/embeddings -e EMBEDDINGS_ROOT='/opt/embeddings' myimage python train.py

Contribution
------------

Pull requests are welcome!