# Swivel in Tensorflow

This is a [TensorFlow](http://www.tensorflow.org/) implementation of the
[Swivel algorithm](http://arxiv.org/abs/1602.02215) for generating word
embeddings.

### This is source{d}'s fork, which is different from the [original](https://github.com/tensorflow/models/tree/master/swivel). See "Changes in this fork".

Swivel works as follows:

1. Compute the co-occurrence statistics from a corpus; that is, determine how
   often a word *c* appears in the context (e.g., "within ten words") of a focus
   word *f*.  This results in a sparse *co-occurrence matrix* whose rows
   represent the focus words, and whose columns represent the context
   words. Each cell value is the number of times the focus and context words
   were observed together.
2. Re-organize the co-occurrence matrix and chop it into smaller pieces.
3. Assign a random *embedding vector* of fixed dimension (say, 300) to each
   focus word and to each context word.
4. Iteratively attempt to approximate the
   [pointwise mutual information](https://en.wikipedia.org/wiki/Pointwise_mutual_information)
   (PMI) between words with the dot product of the corresponding embedding
   vectors.

Note that the resulting co-occurrence matrix is very sparse (i.e., contains many
zeros) since most words won't have been observed in the context of other words.
In the case of very rare words, it seems reasonable to assume that you just
haven't sampled enough data to spot their co-occurrence yet.  On the other hand,
if we've failed to observe two common words co-occurring, it seems likely that
they are *anti-correlated*.

Swivel attempts to capture this intuition by using both the observed and the
un-observed co-occurrences to inform the way it iteratively adjusts vectors.
Empirically, this seems to lead to better embeddings, especially for rare words.

# Contents

This release includes the following programs.

* `prep.py` is a program that takes a text corpus and pre-processes it for
  training. Specifically, it computes a vocabulary and token co-occurrence
  statistics for the corpus.  It then outputs the information into a format that
  can be digested by the TensorFlow trainer.
* `swivel.py` is a TensorFlow program that generates embeddings from the
  co-occurrence statistics.  It uses the files created by `prep.py` as input,
  and generates two text files as output: the row and column embeddings.
* `text2bin.py` combines the row and column vectors generated by Swivel into a
  flat binary file that can be quickly loaded into memory to perform vector
  arithmetic.  This can also be used to convert embeddings from
  [Glove](http://nlp.stanford.edu/projects/glove/) and
  [word2vec](https://code.google.com/archive/p/word2vec/) into a form that can
  be used by the following tools.
* `nearest.py` is a program that you can use to manually inspect binary
  embeddings.
* `eval.mk` is a GNU makefile that will retrieve and normalize several common
  word similarity and analogy evaluation data sets.
* `wordsim.py` performs word similarity evaluation of the resulting vectors.
* `analogy` performs analogy evaluation of the resulting vectors.
* `fastprep` is a C++ program that works much more quickly than `prep.py`, but
  also has some additional dependencies to build.

# Building Embeddings with Swivel

To build your own word embeddings with Swivel, you'll need the following:

* A large corpus of text; for example, the
  [dump of English Wikipedia](https://dumps.wikimedia.org/enwiki/).
* A working [TensorFlow](http://www.tensorflow.org/) installation.
* A machine with plenty of disk space and, ideally, a beefy GPU card.  (We've
  experimented with the
  [Nvidia Titan X](http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan-x),
  for example.)

You'll then run `prep.py` (or `fastprep`) to prepare the data for Swivel and run
`swivel.py` to create the embeddings. The resulting embeddings will be output
into two large text files: one for the row vectors and one for the column
vectors.
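Since the output files are plain tab-separated text (one token per line, followed by its *dim* floating-point values, as described under "Training the embeddings" below), they are easy to load directly. A minimal Python sketch, with illustrative file names that are not part of the toolkit:

```python
# Minimal sketch: load a Swivel TSV embedding file into a dict of vectors.
# The format (token, then dim tab-separated floats per line) is described
# in the "Training the embeddings" section; file names are illustrative.

def load_tsv_embeddings(path):
    """Return {token: [float, ...]} from a tab-delimited embedding file."""
    vecs = {}
    with open(path, encoding="utf-8") as fin:
        for line in fin:
            parts = line.rstrip("\n").split("\t")
            vecs[parts[0]] = [float(x) for x in parts[1:]]
    return vecs

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)
```

For anything beyond quick inspection, though, the binary format produced by `text2bin.py` (below) is much faster to load.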
You can use those "as is", or convert them into a binary file using
`text2bin.py` and then use the tools here to experiment with the resulting
vectors.

## Preparing the data for training

Once you've downloaded the corpus (e.g., to `/tmp/wiki.txt`), run `prep.py` to
prepare the data for training:

    ./prep.py --output_dir /tmp/swivel_data --input /tmp/wiki.txt

By default, `prep.py` will make one pass through the text file to compute a
"vocabulary" of the most frequent words, and then a second pass to compute the
co-occurrence statistics.  The following options allow you to control this
behavior:

| Option | Description |
|:--- |:--- |
| `--min_count <n>` | Only include words in the generated vocabulary that appear at least *n* times. |
| `--max_vocab <n>` | Admit at most *n* words into the vocabulary. |
| `--vocab <filename>` | Use the specified filename as the vocabulary instead of computing it from the corpus.  The file should contain one word per line. |

The `prep.py` program is pretty simple.  Notably, it does almost no text
processing: it does no case translation and simply breaks text into tokens by
splitting on spaces. Feel free to experiment with the `words` function if you'd
like to do something more sophisticated.

Unfortunately, `prep.py` is pretty slow.  Also included is `fastprep`, a C++
equivalent that works much more quickly.  Building `fastprep.cc` is a bit more
involved: it requires you to pull and build the TensorFlow source code in order
to provide the libraries and headers that it needs.  See `fastprep.mk` for more
details.

## Training the embeddings

When `prep.py` completes, it will have produced a directory containing the data
that the Swivel trainer needs to run.  Train embeddings as follows:

    ./swivel.py --input_base_path /tmp/swivel_data \
       --output_base_path /tmp/swivel_data

There are a variety of parameters that you can fiddle with to customize the
embeddings; some that you may want to experiment with include:

| Option | Description |
|:--- |:--- |
| `--embedding_size <dim>` | The dimensionality of the embeddings that are created.  By default, 300-dimensional embeddings are created. |
| `--num_epochs <n>` | The number of iterations through the data that are performed.  By default, 40 epochs are trained. |

As mentioned above, access to a beefy GPU will dramatically reduce the amount of
time it takes Swivel to train embeddings.

When complete, you should find `row_embedding.tsv` and `col_embedding.tsv` in
the directory specified by `--output_base_path`.  These files are tab-delimited
files that contain one embedding per line.  Each line contains the token
followed by *dim* floating point numbers.

## Exploring and evaluating the embeddings

There are also some simple tools you can use to explore the embeddings.  These
tools work with a simple binary vector format that can be `mmap`-ed into memory
along with a separate vocabulary file.
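For illustration, a reader for such a memory-mapped layout might look like the sketch below. This is hypothetical, not the project's loader: it *assumes* `vecs.bin` is a flat array of little-endian float32 values with one contiguous row per line of `vocab.txt`; consult `text2bin.py` for the authoritative format.

```python
# Hypothetical reader for the binary vector format. ASSUMPTION: vecs.bin
# is a flat array of little-endian float32 values, one contiguous row per
# vocabulary word, in vocab.txt order. Check text2bin.py for the real layout.
import mmap
import struct

def load_binary_embeddings(vocab_path, vecs_path):
    """Return (vocab list, list of float tuples) read via mmap."""
    with open(vocab_path, encoding="utf-8") as fin:
        vocab = [line.strip() for line in fin if line.strip()]
    with open(vecs_path, "rb") as fin:
        buf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    dim = (len(buf) // 4) // len(vocab)  # infer dimensionality from file size
    rows = [struct.unpack_from("<%df" % dim, buf, i * dim * 4)
            for i in range(len(vocab))]
    return vocab, rows
```

The appeal of this kind of format is that the OS pages vectors in lazily, so even very large embedding tables open instantly.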
Use `text2bin.py` to generate these files:

    ./text2bin.py -o vecs.bin -v vocab.txt /tmp/swivel_data/*_embedding.tsv

You can do some simple exploration using `nearest.py`:

    ./nearest.py -v vocab.txt -e vecs.bin
    query> dog
    dog
    dogs
    cat
    ...
    query> man woman king
    king
    queen
    princess
    ...

To evaluate the embeddings using common word similarity and analogy datasets,
use `eval.mk` to retrieve the data sets and build the tools:

    make -f eval.mk
    ./wordsim.py -v vocab.txt -e vecs.bin *.ws.tab
    ./analogy --vocab vocab.txt --embeddings vecs.bin *.an.tab

The word similarity evaluation compares the embeddings' estimate of "similarity"
with human judgement using
[Spearman's rho](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient)
as the measure of correlation.  (Bigger numbers are better.)

The analogy evaluation tests how well the embeddings can predict analogies like
"man is to woman as king is to queen".

Note that `eval.mk` forces all evaluation data into lower case.  From there,
both the word similarity and analogy evaluations assume that the eval data and
the embeddings use consistent capitalization: if you train embeddings using
mixed case and evaluate them using lower case, things won't work well.

# Contact

source{d}'s Machine Learning Team: machine-learning@sourced.tech

# Changes in this fork

* Tailored for a single machine (but multiple GPUs)
* TensorBoard support
* High-performance **fastprep**
* Code style, logging changes
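As a footnote on the evaluation metric: Spearman's rho, which the word similarity evaluation above reports, is just the Pearson correlation of ranks. A dependency-free sketch for intuition (it ignores tied scores, which real implementations handle by averaging ranks):

```python
# Minimal sketch of Spearman's rho for intuition about the word-similarity
# score. ASSUMPTION: no tied values; production code averages tied ranks.

def spearman_rho(xs, ys):
    """Rank correlation between two equal-length lists of scores."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

A rho of 1.0 means the embeddings rank word pairs exactly as the human judges did; 0 means no monotone relationship at all.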