{"id":13857023,"url":"https://github.com/astorfi/sequence-to-sequence-from-scratch","last_synced_at":"2025-04-15T16:58:01.807Z","repository":{"id":108139720,"uuid":"150924181","full_name":"astorfi/sequence-to-sequence-from-scratch","owner":"astorfi","description":":speech_balloon: Sequence to Sequence from Scratch Using Pytorch","archived":false,"fork":false,"pushed_at":"2019-08-29T19:47:49.000Z","size":15159,"stargazers_count":122,"open_issues_count":0,"forks_count":23,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-03-28T22:42:03.870Z","etag":null,"topics":["deep-learning","machine-learning","python","pytorch"],"latest_commit_sha":null,"homepage":"https://sequence-to-sequence-from-scratch.readthedocs.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/astorfi.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2018-09-30T02:56:40.000Z","updated_at":"2025-01-16T18:03:51.000Z","dependencies_parsed_at":null,"dependency_job_id":"a08d5792-a2fd-4aed-bb5f-2ce8cd7eb799","html_url":"https://github.com/astorfi/sequence-to-sequence-from-scratch","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/astorfi%2Fsequence-to-sequence-from-scratch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/astorfi%2Fsequence-to-sequence-from-scratch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/astorfi%2Fsequence-to-sequence-from-scratch/releases","manifests_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/repositories/astorfi%2Fsequence-to-sequence-from-scratch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/astorfi","download_url":"https://codeload.github.com/astorfi/sequence-to-sequence-from-scratch/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249116145,"owners_count":21215140,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","machine-learning","python","pytorch"],"created_at":"2024-08-05T03:01:22.968Z","updated_at":"2025-04-15T16:58:01.779Z","avatar_url":"https://github.com/astorfi.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\n      \n\n##################\nTable of Contents\n##################\n.. contents::\n  :local:\n  :depth: 4\n\n***************\nDocumentation\n***************\n.. image:: https://badges.frapsoft.com/os/v2/open-source.png?v=103\n    :target: https://github.com/ellerbrock/open-source-badge/\n.. 
image:: https://img.shields.io/twitter/follow/amirsinatorfi.svg?label=Follow\u0026style=social\n      :target: https://twitter.com/amirsinatorfi\n\n\n==============================\nSequence to Sequence Modeling\n==============================\n\nIn this project we explain sequence to sequence modeling using [`PyTorch \u003chttps://pytorch.org/\u003e`_].\n\n------------------------------------------------------------\nWhat is the problem?\n------------------------------------------------------------\n\nMachine Translation (MT) is one of the areas of NLP that has been profoundly affected by advances in deep learning.\nIn fact, progress in MT can be divided into a pre-deep learning era and a deep learning era. Evidence of this can\nbe found in reference books of the NLP community such as “Speech and Language Processing” [jurafsky2000speech]_. The second edition of\nthis book was published in 2008 and chapter 25 is dedicated to machine translation, but there is not a single mention of\ndeep learning usage for MT. However, today we know that the top performing machine translation systems are solely\nbased on neural networks, which led to the term Neural Machine Translation (NMT).\n\nWhen we use the term neural machine translation, we are talking about applying different deep learning techniques\nfor the task of machine translation. It was after the success of neural networks in image classification tasks\nthat researchers started to use neural networks in machine translation. Around 2013, research groups started to achieve\nbreakthrough results in NMT and boosted state-of-the-art performance. 
Unlike traditional statistical machine translation,\nNMT is based on an end-to-end neural network that increases the performance of machine translation systems\n[bahdanau2014neural]_.\n\nWe dedicate this project to a core deep learning based model for sequence-to-sequence modeling and in particular machine translation: an Encoder-Decoder architecture\nbased on Long Short-Term Memory (LSTM) networks.\n\n------------------------------------------------------------\nWhat makes the problem a problem?\n------------------------------------------------------------\n\nAlthough the scope of sequence to sequence modeling is broader than just the machine translation task,\nthe main focus of seq-2-seq research has been on MT due to its great importance in real-world\nproblems. Furthermore, machine translation is a bridge toward universal human-machine conversation.\n\n------------------------------------------------------------\nWhat is the secret sauce here?\n------------------------------------------------------------\n\nHere, we tried to achieve some primary goals, as we hope to make this work unique compared to the many other available tutorials:\n\n  1. We called this repo ``\"from scratch\"`` because we do NOT assume\n  any background for the reader in terms of implementation.\n\n  2. Instead of using high-level package modules,\n  simple RNN architectures are used for demonstration purposes.\n  This helps the reader to ``understand everything from scratch``.\n  The downside, however, is the relatively low speed of training.\n  This should not cause much trouble as we train a very small model.\n\n  3. 
The difference between ``uni-directional LSTMs`` and ``bi-directional LSTMs``\n  has been clarified using the simple encoder-decoder implementation.\n\n------------------------------------------------------------\nWho cares?\n------------------------------------------------------------\n\nThis tutorial has been provided for the developers/researchers who really want\nto start from scratch and learn everything ``spoon-by-spoon``. The goal is to\ngive as much detail as possible so that others do NOT have to spend time\nfiguring out hidden and yet very important details.\n\n\n============\nModel\n============\n\nThe goal here is to create a **sequence-to-sequence mapping** model which is going to be built on an\nEncoder-Decoder network. The model encodes the information into a specific representation. This representation\nis later mapped to a target output sequence. This transition makes the model learn the relationship\nbetween the two sequences. In other words, a meaningful connection between the two sequences is created. Two important\nsequence to sequence modeling examples are ``Machine Translation`` and ``Autoencoders``. Here, we can do both just by\nchanging the ``input-output`` language sequences.\n\n------------------\nWord Embedding\n------------------\n\nAs the very first step, we should know what the ``input-output sequences`` are and how we should ``represent the data``\nfor the model to understand it. Clearly, it should be a sequence of words in the input and the equivalent\nsequence in the output. In the case of an autoencoder, both input and output sentences\nare the same.\n\nA learned representation for context elements is called ``word embedding``, in which words with similar meaning, ideally,\nbecome highly correlated in the representation space as well. 
One of the main incentives behind word embedding representations\nis their high generalization power as opposed to sparse, higher dimensional representations [goldberg2017neural]_. Unlike the traditional\nbag-of-words representation, in which different words have quite different representations regardless of their usage,\nin learning a distributed representation the usage of words in context is of great importance, which leads to\nsimilar representations for words that are correlated in meaning. There are different approaches for creating word embeddings. Please\nrefer to the great PyTorch tutorial titled [`WORD EMBEDDINGS: ENCODING LEXICAL SEMANTICS \u003chttps://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial\u003e`_]\nfor more details.\n\n------------------------------------------------------------\nEncoder\n------------------------------------------------------------\n\nThe encoder generates a single output vector that embodies the meaning of the input sequence. The general procedure is as follows:\n\n    1. In each step, a word is fed to a network, which generates an output and a hidden state.\n    2. For the next step, the hidden state and the next word are fed to the same network (the same weights W) to produce the next output and hidden state.\n    3. In the end, the last output is the representation of the input sentence (called the \"context vector\").\n\nThe ``EncoderRNN`` class is dedicated to the encoder structure. The encoder in our code\ncan be a ``unidirectional/bidirectional LSTM``. A *Bidirectional* LSTM consists of *two\nindependent LSTMs*: one takes the input sequence in normal time order and the other one\nis fed with the input sequence in reverse time order. The outputs of the two\ncan be concatenated at each time step (usually the *last hidden states* are concatenated\nand returned). The resulting feature vector represents the initial hidden state of the decoder. The\narchitecture of a bi-LSTM is shown below:\n\n.. 
figure:: _img/bilstm.png\n   :scale: 50\n   :alt: Bidirectional LSTM architecture\n\n**NOTE:** As can be observed from the *colors* in the figure, two ``independent`` sets of\nweights ``MUST`` be used for the forward and backward passes. Otherwise, the network will\nassume the backward pass follows the forward pass!!\n\nThe encoder will generally be initialized as below:\n\n.. code-block:: python\n\n  def __init__(self, hidden_size, input_size, batch_size, num_layers=1, bidirectional=False):\n     \"\"\"\n     * For nn.LSTM, same input_size \u0026 hidden_size is chosen.\n     :param input_size: The size of the input vocabulary.\n     :param hidden_size: The hidden size of the RNN.\n     :param batch_size: The batch_size for mini-batch optimization.\n     :param num_layers: Number of RNN layers. Default: 1\n     :param bidirectional: If the encoder is a bi-directional LSTM. Default: False\n     \"\"\"\n     super(EncoderRNN, self).__init__()\n     self.batch_size = batch_size\n     self.num_layers = num_layers\n     self.bidirectional = bidirectional\n     self.hidden_size = hidden_size\n\n     # The input should be transformed to a vector that can be fed to the network.\n     self.embedding = nn.Embedding(input_size, embedding_dim=hidden_size)\n\n     # The LSTM layer for the input\n     self.lstm = nn.LSTM(input_size=hidden_size, hidden_size=hidden_size, num_layers=num_layers)\n\n\n**NOTE:** We ``do NOT`` generate the whole LSTM/Bi-LSTM architecture using PyTorch. Instead, we just use\nthe LSTM cells to show **what exactly is going on in the encoding/decoding** phases!\n\nThe initialization of the LSTM is a little bit different compared to the LSTM described in\n[`Understanding LSTM Networks \u003chttp://colah.github.io/posts/2015-08-Understanding-LSTMs/\u003e`_].\nBoth the cell state and the hidden state must be initialized as below:\n\n.. 
code-block:: python\n\n  def initHidden(self):\n\n    def zero_state():\n        # A fresh [h_0, c_0] pair of all-zero tensors.\n        return [torch.zeros(self.num_layers, 1, self.hidden_size, device=device),\n                torch.zeros(self.num_layers, 1, self.hidden_size, device=device)]\n\n    if self.bidirectional:\n        # Build the two directions separately so that they do not share the same list object.\n        return {\"forward\": zero_state(), \"backward\": zero_state()}\n    else:\n        return zero_state()\n\nAs can be seen in the above code, for the *Bidirectional LSTM*, we have **separate and independent**\nstates for the ``forward`` and ``backward`` directions.\n\n\n-----------------------------\nDecoder\n-----------------------------\n\nFor the decoder, the final hidden state of the encoder (or the concatenation of the final hidden states if the encoder is a bi-LSTM)\nis called the ``context vector``. This context vector, generated by the encoder, will\nbe used as the initial hidden state of the decoder. Decoding is as follows:\n\n    1. At each step, an input token and a hidden state are fed to the decoder.\n\n        * The initial input token is ``\u003cSOS\u003e``.\n        * The first hidden state is the context vector generated by the encoder (the encoder's last hidden state).\n\n    2. The first output should be the first word of the output sequence, and so on.\n    3. The output token generation ends when ``\u003cEOS\u003e`` is generated or when the predefined max_length of the output sentence is reached.\n\nAfter the first decoder step, the input for each following step is the previous word prediction of the RNN.\nSo the output generation relies on the network's own sequence prediction. When using ``teacher_forcing``, the input is instead the actual\ntargeted output word. 
It provides better guidance for the training, but it is inconsistent with the evaluation stage, as\ntargeted outputs do not exist there! To handle this issue, new approaches have been proposed [lamb2016professor]_.\n\nThe decoder will generally be initialized as below:\n\n.. code-block:: python\n\n    def __init__(self, hidden_size, output_size, batch_size, num_layers=1):\n        super(DecoderRNN, self).__init__()\n        self.batch_size = batch_size\n        self.num_layers = num_layers\n        self.hidden_size = hidden_size\n        self.embedding = nn.Embedding(output_size, hidden_size)\n        # Use num_layers here so the LSTM matches the state shape built in initHidden.\n        self.lstm = nn.LSTM(input_size=hidden_size, hidden_size=hidden_size, num_layers=num_layers)\n        self.out = nn.Linear(hidden_size, output_size)\n\n    def forward(self, input, hidden):\n        # Embed a single token and reshape to (seq_len=1, batch=1, hidden_size).\n        output = self.embedding(input).view(1, 1, -1)\n        output, (h_n, c_n) = self.lstm(output, hidden)\n        # Project the LSTM output to scores over the output vocabulary.\n        output = self.out(output[0])\n        return output, (h_n, c_n)\n\n    def initHidden(self):\n        \"\"\"\n        The specific type of the hidden layer for the RNN type that is used (LSTM).\n        :return: All zero hidden state.\n        \"\"\"\n        return [torch.zeros(self.num_layers, 1, self.hidden_size, device=device),\n                torch.zeros(self.num_layers, 1, self.hidden_size, device=device)]\n\n-------------------------------\nEncoder-Decoder Bridge\n-------------------------------\n\nThe context vector, generated by the encoder, will be used as the initial hidden state of the decoder.\nIn case their *dimensions do not match*, a ``linear layer`` should be employed to transform the context vector\nto a suitable input (shape-wise) for the decoder state (including the memory (Cn) and hidden (hn) states).\nThe shapes mismatch under the following conditions:\n\n    1. The hidden sizes of encoder and decoder are the same BUT we have a bidirectional LSTM as the Encoder.\n    2. 
The hidden sizes of encoder and decoder are NOT the same.\n    3. Possibly other cases as well.\n\n\nThe linear layer will be defined as below:\n\n.. code-block:: python\n\n    def __init__(self, bidirectional, hidden_size_encoder, hidden_size_decoder):\n        super(Linear, self).__init__()\n        self.bidirectional = bidirectional\n        # 2 directions for a bidirectional encoder, 1 otherwise.\n        num_directions = int(bidirectional) + 1\n        self.linear_connection_op = nn.Linear(num_directions * hidden_size_encoder, hidden_size_decoder)\n        # True when the encoder state already has the shape the decoder expects.\n        self.connection_possibility_status = num_directions * hidden_size_encoder == hidden_size_decoder\n\n    def forward(self, input):\n\n        if self.connection_possibility_status:\n            # Shapes match: pass the state through unchanged.\n            return input\n        else:\n            return self.linear_connection_op(input)\n\n============\nDataset\n============\n\n**NOTE:** The dataset object is heavily inspired by the official PyTorch tutorial: [`TRANSLATION WITH A SEQUENCE TO SEQUENCE NETWORK AND ATTENTION \u003chttps://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial\u003e`_].\nThe dataset is prepared using the ``data_loader.py`` script.\n\nAs the first step, we have to define ``word indexing`` for further processing. ``word2index`` is the dictionary that\nmaps a word to its associated index and ``index2word`` does the reverse:\n\n.. 
code-block:: python\n\n  SOS_token = 1\n  EOS_token = 2\n\n  class Lang:\n    def __init__(self, name):\n        self.name = name\n        self.word2index = {}\n        self.word2count = {}\n        self.index2word = {0: \"\u003cpad\u003e\", SOS_token: \"SOS\", EOS_token: \"EOS\"}\n        self.n_words = 3  # Count \u003cpad\u003e, SOS and EOS\n\n    def addSentence(self, sentence):\n        for word in sentence.split(' '):\n            self.addWord(word)\n\n    def addWord(self, word):\n        if word not in self.word2index:\n            self.word2index[word] = self.n_words\n            self.word2count[word] = 1\n            self.index2word[self.n_words] = word\n            self.n_words += 1\n        else:\n            self.word2count[word] += 1\n\nUnlike the [`PyTorch tutorial \u003chttps://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html/\u003e`_], we started\nthe indexing from ``1`` with ``SOS_token = 1`` to keep ``zero reserved`` (it is used for the ``\u003cpad\u003e`` token)!\n\nFinally, we define a dataset class to handle the processing:\n\n.. 
code-block:: python\n\n  class Dataset():\n      \"\"\"dataset object\"\"\"\n\n      def __init__(self, phase, num_embeddings=None, max_input_length=None, transform=None, auto_encoder=False):\n          \"\"\"\n          The initialization of the dataset object.\n          :param phase: train/test.\n          :param num_embeddings: The embedding dimensionality.\n          :param max_input_length: The maximum enforced length of the sentences.\n          :param transform: Post processing if necessary.\n          :param auto_encoder: If we are training an autoencoder or not.\n          \"\"\"\n          if auto_encoder:\n              lang_in = 'eng'\n              lang_out = 'eng'\n          else:\n              lang_in = 'eng'\n              lang_out = 'fra'\n          # Skip and eliminate the sentences with a length larger than max_input_length!\n          input_lang, output_lang, pairs = prepareData(lang_in, lang_out, max_input_length, auto_encoder=auto_encoder, reverse=True)\n          print(random.choice(pairs))\n\n          # Randomize list. Note: the train/test split below is only consistent across\n          # separate Dataset instances if the random seed is fixed beforehand.\n          random.shuffle(pairs)\n\n          # 80/20 train/test split\n          if phase == 'train':\n              selected_pairs = pairs[0:int(0.8 * len(pairs))]\n          else:\n              selected_pairs = pairs[int(0.8 * len(pairs)):]\n\n          # Getting the tensors\n          selected_pairs_tensors = [tensorsFromPair(selected_pairs[i], input_lang, output_lang, max_input_length)\n                       for i in range(len(selected_pairs))]\n\n          self.transform = transform\n          self.num_embeddings = num_embeddings\n          self.max_input_length = max_input_length\n          self.data = selected_pairs_tensors\n          self.input_lang = input_lang\n          self.output_lang = output_lang\n\n====================\nTraining/Evaluation\n====================\n\nThe training/evaluation of this model is deliberately done in a not very optimized way!! The reasons are as follows:\n\n  1. 
I followed the principle of ``running with one click`` that I personally have for all my open source projects.\n  The principle says: \"Everyone must be able to run everything by one click!\". So you see pretty much everything in one\n  Python file!\n\n  2. Instead of using ready-to-use RNN objects which process mini-batches of data, we input the sequence word-by-word to help\n  the readers get a better sense of what is happening behind the scenes of the seq-to-seq modeling scheme.\n\n  3. For the evaluation, we simply generate the outputs of\n  the system based on the built model to see if the model is good enough!\n\n\nFor mini-batch optimization, we input batches of sequences. There is a very important note for the batch feeding. Before\ninputting each batch element, the ``encoder hidden states`` must be reset. Otherwise, the system may assume the next sequence in a batch follows\nthe previously processed sequence. It can be seen in the following Python script:\n\n\n.. code-block:: python\n\n  for step_idx in range(args.batch_size):\n      # reset the LSTM hidden state. Must be done before you run a new sequence. Otherwise the LSTM will treat\n      # the new input sequence as a continuation of the previous sequence.\n      encoder_hidden = encoder.initHidden()\n      # Select one sequence of the batch and drop its zero (padding) entries.\n      input_tensor_step = input_tensor[:, step_idx][input_tensor[:, step_idx] != 0]\n      input_length = input_tensor_step.size(0)\n\n\n====================\nResults\n====================\n\nSome sample results for autoencoder training are as follows:\n\n.. 
code-block:: console\n\n    Input:  you re very generous  EOS\n    Output:  you re very generous  EOS\n    Predicted Output:  you re very generous  \u003cEOS\u003e\n\n    Input:  i m worried about the future  EOS\n    Output:  i m worried about the future  EOS\n    Predicted Output:  i m worried about the about  \u003cEOS\u003e\n\n    Input:  we re anxious  EOS\n    Output:  we re anxious  EOS\n    Predicted Output:  we re anxious  \u003cEOS\u003e\n\n    Input:  she is more wise than clever  EOS\n    Output:  she is more wise than clever  EOS\n    Predicted Output:  she is nothing than a than  \u003cEOS\u003e\n\n    Input:  i m glad i invited you  EOS\n    Output:  i m glad i invited you  EOS\n    Predicted Output:  i m glad i invited you  \u003cEOS\u003e\n\n**********************\nRecommended Readings\n**********************\n\n* `Sequence to Sequence Learning with Neural Networks \u003chttps://arxiv.org/abs/1409.3215\u003e`_ - Original Seq2Seq Paper\n* `Neural Machine Translation by Jointly Learning to Align and Translate \u003chttps://arxiv.org/abs/1409.0473\u003e`_ - Sequence to Sequence with Attention\n* `Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation \u003chttps://arxiv.org/abs/1406.1078\u003e`_\n\n\n***************\nReferences\n***************\n.. [jurafsky2000speech] Jurafsky, D., 2000. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition.\n.. [goldberg2017neural] Goldberg, Yoav. \"Neural network methods for natural language processing.\" Synthesis Lectures on Human Language Technologies 10.1 (2017): 1-309.\n.. [lamb2016professor] Lamb, A.M., Goyal, A.G.A.P., Zhang, Y., Zhang, S., Courville, A.C. and Bengio, Y., 2016. Professor forcing: A new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems (pp. 4601-4609).\n.. [bahdanau2014neural] Bahdanau, D., Cho, K. and Bengio, Y., 2014. 
Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fastorfi%2Fsequence-to-sequence-from-scratch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fastorfi%2Fsequence-to-sequence-from-scratch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fastorfi%2Fsequence-to-sequence-from-scratch/lists"}