{"id":18905045,"url":"https://github.com/yoonkim/lstm-char-cnn","last_synced_at":"2025-04-13T11:47:48.553Z","repository":{"id":33703920,"uuid":"37357050","full_name":"yoonkim/lstm-char-cnn","owner":"yoonkim","description":"LSTM language model with CNN over characters","archived":false,"fork":false,"pushed_at":"2016-08-24T12:53:43.000Z","size":73124,"stargazers_count":827,"open_issues_count":14,"forks_count":220,"subscribers_count":61,"default_branch":"master","last_synced_at":"2025-04-04T05:45:32.301Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Lua","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yoonkim.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-06-13T03:59:17.000Z","updated_at":"2025-03-21T16:05:44.000Z","dependencies_parsed_at":"2022-09-13T05:31:43.572Z","dependency_job_id":null,"html_url":"https://github.com/yoonkim/lstm-char-cnn","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yoonkim%2Flstm-char-cnn","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yoonkim%2Flstm-char-cnn/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yoonkim%2Flstm-char-cnn/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yoonkim%2Flstm-char-cnn/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yoonkim","download_url":"https://codeload.github.com/yoonkim/lstm-char-cnn/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositori
es_count":248710409,"owners_count":21149186,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-08T09:10:23.984Z","updated_at":"2025-04-13T11:47:48.526Z","avatar_url":"https://github.com/yoonkim.png","language":"Lua","funding_links":[],"categories":["Uncategorized","Model Zoo"],"sub_categories":["Uncategorized","Recurrent Networks"],"readme":"## Character-Aware Neural Language Models\nCode for the paper [Character-Aware Neural Language Models](http://arxiv.org/abs/1508.06615) \n(AAAI 2016).\n\nA neural language model (NLM) built on character inputs only. Predictions\nare still made at the word level. The model employs a convolutional neural network (CNN)\nover characters, whose output is used as input to a long short-term memory (LSTM)\nrecurrent neural network language model (RNN-LM). It also optionally\npasses the output from the CNN through a [Highway Network](http://arxiv.org/abs/1507.06228), \nwhich improves performance.\n\nMuch of the base code is from \n[Andrej Karpathy's excellent character RNN implementation](https://github.com/karpathy/char-rnn).\n\n### Requirements\nCode is written in Lua and requires Torch. It also requires\nthe `nngraph` and the `luautf8` packages, which can be installed via:\n```\nluarocks install nngraph\nluarocks install luautf8\n```\nGPU usage will additionally require the `cutorch` and `cunn` packages:\n```\nluarocks install cutorch\nluarocks install cunn\n```\n\n`cudnn` will result in a good (8x-10x) speed-up for convolutions, so it is\nhighly recommended. 
This will make the training time of a character-level model \nsomewhat competitive with a word-level model (1500 tokens/sec vs 3000 tokens/sec for \nthe large character/word-level models described below).\n\n```\ngit clone https://github.com/soumith/cudnn.torch.git\ncd cudnn.torch\nluarocks make cudnn-scm-1.rockspec\n```\n### Data\nData should be put into the `data/` directory, split into `train.txt`,\n`valid.txt`, and `test.txt`.\n\nEach line of each .txt file should be a sentence. The English Penn \nTreebank (PTB) data (Tomas Mikolov's pre-processed version with vocab size equal to 10K,\nwidely used by the language modeling community) is given as the default.\n\nThe paper also runs the models on non-English data (Czech, French, German, Russian, and Spanish), from the ICML 2014\npaper [Compositional Morphology for Word Representations and Language Modelling](http://arxiv.org/abs/1405.4273)\nby Jan Botha and Phil Blunsom. This can be downloaded from [Jan's website](https://bothameister.github.io).\n\nFor ease of use, we provide a script to download the non-English data (`get_data.sh`). \nThe script also saves the downloaded data into the relevant folders.\n\n#### Note on PTB\nThe PTB data above does not have end-of-sentence tokens for each sentence, and hence these must be\nmanually appended. This can be done by adding `-EOS '+'` to the command line (you \ncan use characters other than `+` to represent an end-of-sentence token; we recommend a single\nunused character).\n\nThe non-English data already have end-of-sentence tokens for each line, so you should add\n`-EOS ''` to the command line.\n\n#### Unicode in Lua\nLua is Unicode-agnostic (each string is just a sequence of bytes), so we use\nthe `luautf8` package to deal with languages where a character can be more than one byte\n(e.g. Russian). 
Many thanks to [vseledkin](https://github.com/vseledkin) for alerting us\nto the fact that a previous version of the code did not take this into account!\n\n### Model\nHere are some example scripts. Add `-gpuid 0` to each line to use a GPU (which is\nrequired to get any reasonable speed with the CNN), and `-cudnn 1` to use the\n`cudnn` package. Scripts to reproduce the results of the paper can be found in `run_models.sh`.\n\n#### Character-level models\nLarge character-level model (LSTM-CharCNN-Large in the paper).\nThis is the default and should get a perplexity of ~82 on valid and ~79 on test. Takes ~5 hours with `cudnn`.\n```\nth main.lua -savefile char-large -EOS '+'\n```\nSmall character-level model (LSTM-CharCNN-Small in the paper).\nThis should get ~96 on valid and ~93 on test. Takes ~2 hours with `cudnn`.\n```\nth main.lua -savefile char-small -rnn_size 300 -highway_layers 1 -kernels '{1,2,3,4,5,6}' -feature_maps '{25,50,75,100,125,150}' -EOS '+'\n```\n\n#### Word-level models\nLarge word-level model (LSTM-Word-Large in the paper).\nThis should get ~89 on valid and ~85 on test.\n```\nth main.lua -savefile word-large -word_vec_size 650 -highway_layers 0 -use_chars 0 -use_words 1 -EOS '+'\n```\nSmall word-level model (LSTM-Word-Small in the paper).\nThis should get ~101 on valid and ~98 on test.\n```\nth main.lua -savefile word-small -word_vec_size 200 -highway_layers 0 -use_chars 0 -use_words 1 -rnn_size 200 -EOS '+'\n```\n\n#### Combining both\nNote that if `-use_chars` and `-use_words` are both set to 1, the model\nwill concatenate the output from the CNN with the word embedding. 
We've\nfound this model to underperform a purely character-level model, though.\n\n### Evaluation\nBy default `main.lua` will evaluate the model on test data after training,\nbut this will use the last epoch's model, and also will be slow due to\nthe way the data is set up.\n\nEvaluation on test can be performed via the following script:\n```\nth evaluate.lua -model model_file.t7 -data_dir data/ptb -savefile model_results.t7\n```\nWhere `model_file.t7` is the path to the best performing (on validation) model.\nThis will also save some basic statistics (e.g. perplexity by token) in\n`model_results.t7`.\n\n### Hierarchical Softmax\nTraining on a larger vocabulary (e.g. 100K+) will require hierarchical softmax (HSM)\nto train at a reasonable speed. You can use the `-hsm` option to do this.\nFor example `-hsm 500` will randomly split the vocabulary into 500 clusters of\n(approximately) equal size. `-hsm 0` is the default and will not use HSM.\n`-hsm -1` will automatically choose the number of clusters for you, by choosing the integer\nclosest to sqrt(|V|).\n\n### Batch Size\nIf training on bigger datasets you should probably use a \nlarger batch size (e.g. `-batch_size 100`).\n\n### Licence\nMIT\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyoonkim%2Flstm-char-cnn","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyoonkim%2Flstm-char-cnn","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyoonkim%2Flstm-char-cnn/lists"}