Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/oxford-cs-deepnlp-2017/practical-1
Oxford Deep NLP 2017 course - Practical 1: word2vec
https://github.com/oxford-cs-deepnlp-2017/practical-1
deep-learning natural-language-processing nlp oxford word2vec
Last synced: 5 days ago
JSON representation
Oxford Deep NLP 2017 course - Practical 1: word2vec
- Host: GitHub
- URL: https://github.com/oxford-cs-deepnlp-2017/practical-1
- Owner: oxford-cs-deepnlp-2017
- Created: 2017-01-19T17:28:23.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2021-03-08T17:43:27.000Z (almost 4 years ago)
- Last Synced: 2025-01-09T08:08:18.772Z (12 days ago)
- Topics: deep-learning, natural-language-processing, nlp, oxford, word2vec
- Language: Jupyter Notebook
- Homepage: https://www.cs.ox.ac.uk/teaching/courses/2016-2017/dl/
- Size: 15.2 MB
- Stars: 253
- Watchers: 27
- Forks: 146
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Practical 1: word2vec
[Brendan Shillingford, Yannis Assael, Chris Dyer]For this practical, you'll be provided with a partially-complete IPython notebook, an interactive web-based Python computing environment that allows us to mix text, code, and interactive plots.
We will be training word2vec models on TED Talk and Wikipedia data, using the word2vec implementation included in the Python package `gensim`. After training the models, we will analyze and visualize the learned embeddings.
## Setup and installation
On a lab workstation, clone the practical repository and run the `. install-python.sh` shell script in a terminal to install Anaconda with Python 3, and the packages required for this practical.Run `ipython notebook` in the repository directory and open the `practical.ipynb` notebook in your browser.
## Preliminaries
### Preprocessing
The code for downloading the dataset and preprocessing it is prewritten to save time. However, it is expected that you'll need to perform such a task in future practicals, given raw data. Read it and make sure you understand it. Often, one uses a library like `nltk` to simplify this task, but we have not done so here and instead opted to use regular expressions via Python's `re` module.### Word frequencies
Make a list of the most common words and their occurence counts. Take a look at the top 40 words. You may want to use the `sklearn.feature_extraction.text` module's `CountVectorizer` class or the `collections` module's `Counter` class.Take the top 1000 words, and plot a histogram of their counts. The plotting code for an interactive histogram is already given in the notebook.
**Handin:** show the frequency distribution histogram.
## Training Word2Vec
Now that we have a processed list of sentences, let's run the word2vec training. Begin by reading the gensim documentation for word2vec at , to figure out how to use the `Word2Vec` class. Learn embeddings in $\mathbb R^{100}$ using CBOW (which is the default). Other options should be default except `min_count=10` so that infrequent words are ignored. The training process should take under half a minute.If your trained `Word2Vec` instance is called `model_ted`, you should be able to check the vocabulary size using `len(model_ted.vocab)`, which should be around 14427. Try using the `most_similar()` method to return a list of the most similar words to "man" and "computer".
**Handin:** find a few more words with interesting and/or surprising nearest neighbours.
**Handin:** find an interesting cluster in the t-SNE plot.
**Optional, for enthusiastic students:** try manually retrieving two word vectors using the indexing operator as described in gensim's documentation, then computer their cosine distances (recall it is defined as $d(x,y) = \frac{\langle x, y \rangle}{\|x\|\|y\|}$). You may be interested in `np.dot()` and `np.linalg.norm()`, see the numpy documentation for details. Compare this to the distance computed by gensim's functions.
## Comparison to vectors trained on WikiText-2 data
We have provided downloading/preprocessing code (similar to the previous code) for the WikiText-2 dataset. The code uses a random subsample of the data so it is comparable in size to the TED Talk data.Repeat the same analysis as above but on this dataset.
**Handin:** find a few words with similar nearest neighbours.
**Handin:** find an interesting cluster in the t-SNE plot.
**Handin:** Are there any notable differences between the embeddings learned on data compared to those learned on the TED Talk data?
## (Optional, for enthusiastic students) Clustering
If you have extra time, try performing a k-means clustering (e.g. using `sklearn.cluster.kmeans`) on the embeddings, tuning the number of clusters until you get interesting or meaningful clusters.# Handin
See the bolded "**Handin:**" parts above. On paper or verbally, show a practical demonstrator your response to these to get signed off.