{"id":18634356,"url":"https://github.com/goldmansachs/mrword2vec","last_synced_at":"2025-04-11T07:33:05.223Z","repository":{"id":43273210,"uuid":"191587255","full_name":"goldmansachs/MRWord2Vec","owner":"goldmansachs","description":"A MapReduce / Hadoop implementation of Word2Vec","archived":false,"fork":false,"pushed_at":"2022-03-10T19:19:36.000Z","size":54,"stargazers_count":16,"open_issues_count":2,"forks_count":11,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-03-25T11:11:20.352Z","etag":null,"topics":["java","mapreduce","word2vec"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/goldmansachs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-06-12T14:28:46.000Z","updated_at":"2024-03-31T14:22:37.000Z","dependencies_parsed_at":"2022-09-06T07:31:04.497Z","dependency_job_id":null,"html_url":"https://github.com/goldmansachs/MRWord2Vec","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/goldmansachs%2FMRWord2Vec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/goldmansachs%2FMRWord2Vec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/goldmansachs%2FMRWord2Vec/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/goldmansachs%2FMRWord2Vec/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/goldmansachs","download_url":"https://codeload.github.com/goldmansachs/MRWord2Vec/tar.gz/refs/heads/master","host":{
"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248358873,"owners_count":21090447,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["java","mapreduce","word2vec"],"created_at":"2024-11-07T05:18:18.459Z","updated_at":"2025-04-11T07:33:04.774Z","avatar_url":"https://github.com/goldmansachs.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MRWord2Vec\n\n## Introduction\n\nMRWord2Vec is a [MapReduce](https://en.wikipedia.org/wiki/MapReduce) implementation of \n[Word2Vec](https://en.wikipedia.org/wiki/Word2vec).\n\nIt's a Java library, and can be used to train Word2Vec models in two ways:\n- in the MapReduce framework, and\n- on a single machine.\n\nThe novelty of this library is the MapReduce implementation and the extremely \nlow memory footprint.\nThe amount of memory needed is approximately twice the amount of memory needed \nto store the Word2Vec vectors.\nThis is achieved by implementing incremental training - training on one sentence at a time.\n\n\n## Quick-start\nThis section will demonstrate how to quickly get this library up and running.\nFor a more comprehensive guide, read the rest of the sections.\n\nThe following shell script will train a Word2Vec model and compute the nearest neighbours:\n```\n# The location of the MRWord2Vec's jar file.\nmrword2vec_jar=[path_to_mrword2vec.jar]\n# The location of the directory containing all the dependency jars. 
\n# This will, for example, include the jar file for jblas.\ndependency_jars=[path_to_directory_containing_all_dependency_jars]\n# Setting up the HADOOP_CLASSPATH by adding all the dependencies and the jar file of this library.\nHADOOP_CLASSPATH=$HADOOP_CLASSPATH:$(echo $dependency_jars/*.jar | tr ' ' ':'):$mrword2vec_jar\n# The text file used for training. See the section on Input to learn how this file is formatted.\ninputFile=[path_to_training_text_file]\n# The directory where the model is to be saved.\nmodelPath=[path_where_the_model_is_to_be_saved]\n\n# Running the class Word2VecDriver. This trains the model.\nhadoop jar $mrword2vec_jar com.gs.mrword2vec.mapreduce.train.Word2VecDriver -libjars=$(echo $dependency_jars/*.jar | tr ' ' ',') -queue_name=my_queue_name -input_path=$inputFile -output_dir=$modelPath -min_count=100 -max_vocab_size=100000 -num_parts=100 -num_reducers=5 -iterations=10 -mapper_memory_mb=6144\n# Computing the nearest neighbours.\nhadoop jar $mrword2vec_jar com.gs.mrword2vec.NearestNeighbours -D modelDir=$modelPath -D k=20\n```\n\nThe command above will run the MapReduce jobs to train a Word2Vec model on the specified\ntraining file and will save the learned model parameters.\nIt will run 10 MapReduce jobs (since `iterations` is set to 10), with each\njob using 100 mappers (since `num_parts` is set to 100) and 5 reducers.\nSince there are 100 mappers, 100 independent Word2Vec models are trained, one in each mapper.\nEach of these models is trained on 1/100th of the data.\n\nThe last command is then used to get the top k nearest neighbours for\nevery word in the vocabulary of the trained model.\nWhen it completes, a file will be created storing the nearest neighbours for every\nvocabulary word in the same directory as the saved model (inside `modelPath`).\n\n\n## Background\n\n#### Word2Vec\nWord2Vec is a model to learn real-vector representations of words from text data.\nThese are low-dimensional (i.e., of dimension much smaller than the 
vocabulary size) vectors with\nsemantic meaning, unlike one-hot vectors.\nSimilar words will have similar vectors, where similarity of two vectors is \ncomputed using [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity).\n\nTo learn these vector representations Word2Vec uses text data.\nWord2Vec is an unsupervised algorithm, and needs text data in the form of sentences.\n\nThe model this library implements is called the skip-gram model. \nThe training objective of skip-gram is to find word vectors that maximize the \nprobability of accurately predicting the context given the word.\n\nHierarchical softmax is used to speed up the computation. \n(Another technique to do that is negative sampling, which isn't implemented here.)\n\n#### Distributed Word2Vec\nThe idea, used here, of sharding the data and averaging the resulting vectors, appears\nin Spark's implementation of Word2Vec.\nThe algorithm is as follows:\n\n1. Start with a model whose weights are randomly initialized.\n2. Partition the data into N parts.\n3. Repeat K times:\n    1. Spawn N mappers, with each mapper training a Word2Vec model on its shard, \n    starting from the previous model, doing gradient descent.\n    2. Combine all the N models, by taking an _average_ of all vectors for each word.\n    3. Save the model.\n\nThere is a trade-off here between N and K. \nThe larger the N, the faster the training, since each mapper trains on 1/N of the data.\nHowever, too large an N and the quality will suffer.\nSimilarly, the larger the K, the better the model. However, larger K necessitates \ngreater training time. 
If K = N, then there is no\nsaving in time by distributing the training.\n\n## Input\nThe input to this library is a single text file containing pre-processed data.\nEach line of this file must contain exactly one sentence,\nthe words of which are separated by spaces.\nFor example, consider the following small text file:\n```text\nWord2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.\n```\n\nAfter pre-processing it would look like:\n```text\nword2vec is a group of related models that are used to produce word embeddings\nthese models are shallow two-layer neural networks that are trained to reconstruct linguistic contexts of words\nword2vec takes as its input a large corpus of text and produces a vector space typically of several hundred dimensions with each unique word in the corpus being assigned a corresponding vector in the space\nword vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space\n```\n\nHere, we have put each sentence on a different line, removed all punctuation, \nand converted everything to lower case since we don't want to learn separate word \nembeddings for lower case and upper case words. \n\n\n## Usage\n\n#### MapReduce framework\n\nThe class used to run MRWord2Vec is `Word2VecDriver.java`.\nThis class takes some arguments and runs the MapReduce jobs.\n\n##### Arguments\n\n1. **queue\\_name** : A required argument. 
Example usage: `-queue_name=my_queue_name`. \nThe name of the queue on which to run the MapReduce jobs.\n2. **input\\_path** : A required argument. Example usage: `-input_path=/user/my_name/input_file`. \nThe HDFS path of the text file to be used for training. The file must contain one sentence per line. \nA sentence is a sequence of words separated by spaces. \nThe text must be pre-processed, i.e., all words should be in lower case and there should be no punctuation. \nThis file will also be used to build the vocabulary. \nHence the input file may contain more distinct words than desired for the vocabulary.\n3. **output\\_dir** : A required argument. Example usage: `-output_dir=/user/my_name/output_dir`. \nThe HDFS path of the directory where the model will be saved. \nDuring training, a temporary directory, `temp`, is created inside `output_dir`, \nwhere model parameters are saved. \nAfter the training is complete, the temporary directory is deleted and `output_dir` \nwill contain two files, one containing the Huffman tree encoding of the vocabulary words \nand the other containing the model parameters (with edge weights stored as matrices). \nThese two files are necessary to compute nearest neighbours, if so desired.\n4. **num\\_parts** : An optional argument. Default = 10. Example usage: `-num_parts=100`. \nThe number of parts into which to split the input corpus. \nThis is equal to the number of mappers spawned, since each mapper is responsible for training on one split. \nThe higher the `num_parts`, the faster the training will be, although too large a value will hurt the quality of the model \n(see the section Distributed Word2Vec above).\n5. **num\\_reducers** : An optional argument. Default = 10. Example usage: `-num_reducers=10`. \nThe number of reducers to be spawned while training. The reducers are responsible for averaging the vectors. \nSet this number depending on the vocabulary size, with the default value being sufficient for most cases.\n6. **mapper\\_memory\\_mb** : An optional argument. Default = 3072. 
Example usage: `-mapper_memory_mb=6144`. \nThe size of the memory, in megabytes, allocated to each mapper used in training. \nThe memory usage of mappers while training is proportional to the product of vocabulary size and vector size. \nSet this number accordingly.\n7. **reducer\\_memory\\_mb** : An optional argument. Default = 3072. Example usage: `-reducer_memory_mb=6144`. \nThe size of the memory, in megabytes, allocated to each reducer \n(used for averaging the vectors once the mappers have finished training for that iteration). \nThe memory usage of the reducers is proportional to the product of the number of mappers and the vector size. \nThe default value should suffice for most cases.\n8. **max\\_vocab\\_size** : An optional argument. Default = 10,000. Example usage: `-max_vocab_size=50000`. \nThis argument limits the vocabulary by choosing the `max_vocab_size` most frequent words from \nthe training data file specified by `input_path`. \nThis is one of the two parameters one can use to limit the vocabulary size, the other being `min_count`. \n9. **min\\_count** : An optional argument. Default = 10. Example usage: `-min_count=100`. \nAll words with frequency less than `min_count` are discarded. \nThis is one of the two parameters one can use to limit the vocabulary size, the other being `max_vocab_size`.\n10. **iterations** : An optional argument. Default = 1. Example usage: `-iterations=10`. \nThis specifies the number of times the model is trained on the training data in a distributed manner. \nThis corresponds to K in the algorithm described in the section Distributed Word2Vec above. \nYou will need to experiment to find the best value of this hyperparameter. \nIn our tests, 10 iterations worked well when `num_parts` was 100 for a data set of 1 TB.\n11. **vector\\_size** : An optional argument. Default = 300. Example usage: `-vector_size=200`. \nThe dimensionality of the word2vec vectors. 
\nLarger vectors will usually give better results, but beyond roughly 300 dimensions the marginal gain \nin quality is offset by the increase in training time. \nFor most cases the default value of 300 will suffice.\n\nAfter adding all the dependencies to `HADOOP_CLASSPATH`, run the driver class like this:\n```text\nhadoop jar [path_to_mrword2vec.jar] com.gs.mrword2vec.mapreduce.train.Word2VecDriver -libjars=$(echo [path_of_directory_containing_all_dependency_jars]/*.jar | tr ' ' ',') -queue_name=my_queue_name -input_path=[path_to_training_text_file] -output_dir=[path_to_save_model] -min_count=100 -max_vocab_size=100000 -num_parts=100 -num_reducers=5 -iterations=10 -mapper_memory_mb=6144\n```\n\n#### Single machine framework\n\nIt's also possible to train a Word2Vec model on a single machine (without Hadoop).\nThe class whose API is relevant for this is `Word2Vec.java`. For example:\n```\n// Creating a Word2Vec object with vector size = 300, min_count = 15, \n// max_vocab_size = 2000, and 3 epochs. \nWord2Vec word2vec = new Word2Vec(300, 15, 2000, 3);\n// sentences is of type List\u003cString[]\u003e.\nword2vec.train(sentences);\nword2vec.save(\"path_to_save_the_model\");\n```\nTo save memory, the model can instead be trained incrementally with the function `trainSentence`: \nread one sentence at a time from a file and call `trainSentence` on it, \nso that the whole corpus never has to be held in memory.\n```\nfor(int i = 0; i \u003c N; i++) {\n  String[] sentence = getNextSentence();\n  word2vec.trainSentence(sentence);\n}\n```\n\nA saved model can be read using the `read` function.\n```\nString locationOfSavedModelDir = \"/user/my_user/modelDir\";\nWord2Vec word2vec = new Word2Vec();\nword2vec.read(locationOfSavedModelDir);\n```\n\n## Dependencies\n\n1. Apache HBase Client\n```\n\u003cgroupId\u003eorg.apache.hbase\u003c/groupId\u003e\n\u003cartifactId\u003ehbase-client\u003c/artifactId\u003e\n\u003cversion\u003e1.1.2\u003c/version\u003e\n```\n2. 
Apache Hadoop Common\n```\n\u003cgroupId\u003eorg.apache.hadoop\u003c/groupId\u003e\n\u003cartifactId\u003ehadoop-common\u003c/artifactId\u003e\n\u003cversion\u003e2.7.3\u003c/version\u003e\n```\n3. Apache Hadoop MapReduce Core\n```\n\u003cgroupId\u003eorg.apache.hadoop\u003c/groupId\u003e\n\u003cartifactId\u003ehadoop-mapreduce-client-core\u003c/artifactId\u003e\n\u003cversion\u003e2.7.3\u003c/version\u003e\n```\n4. jblas\n```\n\u003cgroupId\u003eorg.jblas\u003c/groupId\u003e\n\u003cartifactId\u003ejblas\u003c/artifactId\u003e\n\u003cversion\u003e1.2.4\u003c/version\u003e\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoldmansachs%2Fmrword2vec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoldmansachs%2Fmrword2vec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoldmansachs%2Fmrword2vec/lists"}