https://github.com/prashantranjan09/elmo-tutorial
A short tutorial on Elmo training (Pre trained, Training on new data, Incremental training)
- Host: GitHub
- URL: https://github.com/prashantranjan09/elmo-tutorial
- Owner: PrashantRanjan09
- Created: 2018-07-06T08:51:48.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2020-06-20T06:41:13.000Z (almost 5 years ago)
- Last Synced: 2024-04-27T23:38:39.185Z (about 1 year ago)
- Topics: allen, allennlp, elmo, elmo-tutorial, tutorial, word-embeddings, word-vectors
- Language: Jupyter Notebook
- Size: 396 KB
- Stars: 151
- Watchers: 6
- Forks: 38
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Elmo-Tutorial
This is a short tutorial on using Deep contextualized word representations (ELMo), as described in the paper https://arxiv.org/abs/1802.05365.
This tutorial can help in using:

* **Pre-trained ELMo model** - refer to _Elmo_tutorial.ipynb_
* **Training an ELMo model on your new data from scratch**

To train and evaluate a biLM, you need to provide:
* a vocabulary file
* a set of training files
* a set of heldout files

The vocabulary file is a text file with one token per line. It must also include the special tokens `<S>`, `</S>` and `<UNK>`.
The vocabulary file should be sorted in descending order by token count in your training data. The first three lines should be the special tokens:
`<S>`,
`</S>` and
`<UNK>`.

The training data should be randomly split into many training files, each containing one slice of the data. Each file contains pre-tokenized and whitespace-separated text, one sentence per line.
**Don't include the `<S>` or `</S>` tokens in your training data.**
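For illustration, here is a minimal sketch (not part of this repository) of building such a vocabulary file from the training files; the file paths are placeholders:

```python
import glob
from collections import Counter

# Count tokens across the (placeholder) training files.
counts = Counter()
for path in glob.glob('/path/to/training-files/*'):
    with open(path, encoding='utf-8') as f:
        for line in f:
            counts.update(line.split())

# bilm-tf expects the three special tokens on the first lines,
# followed by the remaining tokens in descending order of count.
with open('/path/to/vocab_file', 'w', encoding='utf-8') as out:
    out.write('<S>\n</S>\n<UNK>\n')
    for token, _ in counts.most_common():
        out.write(token + '\n')
```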
Once done, git clone **https://github.com/allenai/bilm-tf.git** and run (the paths below are placeholders for your own files):

```
python bin/train_elmo.py --train_prefix='/path/to/training-files/*' --vocab_file /path/to/vocab_file --save_dir /output_path/to/checkpoint
```
To get the weights file, run:

```
python bin/dump_weights.py --save_dir /output_path/to/checkpoint --outfile /output_path/to/weights.hdf5
```

The training run writes an `options.json` into the save dir, and the command above dumps the weights file. Together, the options file and the weights file are what you need to create an ELMo model.
For more information, refer to **Elmo_tutorial.ipynb**.
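As a rough illustration of what you can do with the two files, here is a minimal sketch (not part of this repository) that loads them with the allennlp 0.x `ElmoEmbedder`, which is the API the tutorial notebook targets; the paths are placeholders and the last dimension depends on the projection size of your trained model:

```python
from allennlp.commands.elmo import ElmoEmbedder

# Placeholder paths: the options.json from the save dir and the dumped weights file.
elmo = ElmoEmbedder(
    options_file='/output_path/to/checkpoint/options.json',
    weight_file='/output_path/to/weights.hdf5',
)

# embed_sentence returns one array per biLM layer: shape (3, num_tokens, dim).
vectors = elmo.embed_sentence(['Deep', 'contextualized', 'word', 'representations'])
print(vectors.shape)
```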
## Incremental Learning/Training
To incrementally train an existing model with new data:

git clone https://github.com/allenai/bilm-tf.git

Once done, replace _train_elmo.py_ within allenai/bilm-tf/bin/ with **train_elmo_updated.py** provided in the root of this repository.
**Updated changes** (_train_elmo_updated.py_):

```python
    tf_save_dir = args.save_dir
    tf_log_dir = args.save_dir
    train(options, data, n_gpus, tf_save_dir, tf_log_dir, restart_ckpt_file)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--save_dir', help='Location of checkpoint files')
    parser.add_argument('--vocab_file', help='Vocabulary file')
    parser.add_argument('--train_prefix', help='Prefix for train files')
    parser.add_argument('--restart_ckpt_file', help='latest checkpoint file to start with')
```
This takes an argument (`--restart_ckpt_file`) that accepts the path of the checkpointed file.

Next, replace _training.py_ within allenai/bilm-tf/bilm/ with **training_updated.py** provided in the root of this repository.
Also, make sure to put your embedding layer name in line 758 of **training_updated.py**: `exclude = ['the embedding layer name you want to remove']`
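If you are not sure of the exact variable name of the embedding layer, here is a minimal sketch (not part of the repository) that lists every variable stored in a TF1 checkpoint; the checkpoint path is a placeholder:

```python
import tensorflow as tf  # bilm-tf targets TensorFlow 1.x

# Placeholder checkpoint prefix inside your save_dir.
reader = tf.train.NewCheckpointReader('/output_path/to/checkpoint/model.ckpt')
for name, shape in reader.get_variable_to_shape_map().items():
    print(name, shape)
```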
**Updated changes** (_training_updated.py_):

```python
# load the checkpoint data if needed
if restart_ckpt_file is not None:
    reader = tf.train.NewCheckpointReader(restart_ckpt_file)
    cur_vars = reader.get_variable_to_shape_map()
    exclude = ['the embedding layer name you want to remove']
    variables_to_restore = tf.contrib.slim.get_variables_to_restore(exclude=exclude)
    loader = tf.train.Saver(variables_to_restore)
    # loader = tf.train.Saver()
    loader.save(sess, '/tmp')
    loader.restore(sess, '/tmp')
    with open(os.path.join(tf_save_dir, 'options.json'), 'w') as fout:
        fout.write(json.dumps(options))

summary_writer = tf.summary.FileWriter(tf_log_dir, sess.graph)
```
The code reads the checkpoint file, collects the current variables in the graph, excludes the layers listed in the _exclude_ variable, and restores the remaining variables along with their associated weights.

For training, run (again with placeholder paths):

```
python bin/train_elmo_updated.py --train_prefix='/path/to/training-files/*' --vocab_file /path/to/vocab_file --save_dir /output_path/to/checkpoint --restart_ckpt_file /path/to/latest/checkpoint
```
In _train_elmo_updated.py_ within bin, set these options based on your data:

```python
batch_size = 128  # batch size for each GPU
n_gpus = 3

# number of tokens in training data
n_train_tokens =

options = {
    'bidirectional': True,
    'dropout': 0.1,
    'all_clip_norm_val': 10.0,
    'n_epochs': 10,
    'n_train_tokens': n_train_tokens,
    'batch_size': batch_size,
    'n_tokens_vocab': vocab.size,
    'unroll_steps': 20,
    'n_negative_samples_batch': 8192,
}
```
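If you are unsure what to put for `n_train_tokens`, a minimal sketch (not part of the repo) for counting the whitespace-separated tokens in your training files could look like this; the glob pattern is a placeholder:

```python
import glob

# Count whitespace-separated tokens across all (placeholder) training files.
n_train_tokens = sum(
    len(line.split())
    for path in glob.glob('/path/to/training-files/*')
    for line in open(path, encoding='utf-8')
)
print(n_train_tokens)
```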
## Visualisation

Visualization of the word vectors obtained with ELMo:

* t-SNE
* TensorBoard
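As an illustration of the t-SNE option, here is a minimal sketch (not part of the repository) that embeds a few tokens with `ElmoEmbedder`, averages the three biLM layers, and projects the vectors to 2D; the tokens and paths are placeholders:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from allennlp.commands.elmo import ElmoEmbedder

# Placeholder paths to the trained model's options and weights.
elmo = ElmoEmbedder(options_file='options.json', weight_file='weights.hdf5')

tokens = ['king', 'queen', 'man', 'woman', 'paris', 'france']
# Embed each token on its own and average the three biLM layers.
vectors = np.stack([elmo.embed_sentence([t]).mean(axis=0)[0] for t in tokens])

# Project to 2D with t-SNE and plot the tokens.
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(vectors)
plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), token in zip(coords, tokens):
    plt.annotate(token, (x, y))
plt.show()
```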

### Using the ELMo embedding layer in subsequent models

If you want to use an ELMo embedding layer in a subsequent model build, refer to: https://github.com/PrashantRanjan09/WordEmbeddings-Elmo-Fasttext-Word2Vec
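For example, here is a minimal sketch (not part of either repository) of plugging the trained biLM into a downstream PyTorch model via the allennlp 0.x `Elmo` module; the paths are placeholders:

```python
from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholder paths to the trained model's options and weights.
elmo = Elmo(
    options_file='/output_path/to/checkpoint/options.json',
    weight_file='/output_path/to/weights.hdf5',
    num_output_representations=1,
    dropout=0.0,
)

sentences = [['First', 'sentence', '.'], ['Another', 'one']]
character_ids = batch_to_ids(sentences)                      # (batch, max_len, 50)
output = elmo(character_ids)
embeddings = output['elmo_representations'][0]               # (batch, max_len, dim)
print(embeddings.shape)
```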