{"id":14992125,"url":"https://github.com/ToluClassics/candle-tutorial","last_synced_at":"2025-09-25T14:30:50.959Z","repository":{"id":200348182,"uuid":"704568640","full_name":"ToluClassics/candle-tutorial","owner":"ToluClassics","description":"Tutorial for Porting PyTorch Transformer Models to Candle (Rust)","archived":false,"fork":false,"pushed_at":"2024-07-22T13:07:21.000Z","size":131,"stargazers_count":238,"open_issues_count":2,"forks_count":13,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-09-25T16:09:29.399Z","etag":null,"topics":["candle","pytorch","rust","rust-lang"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ToluClassics.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-13T14:37:04.000Z","updated_at":"2024-09-24T07:07:58.000Z","dependencies_parsed_at":"2024-03-30T15:31:55.231Z","dependency_job_id":null,"html_url":"https://github.com/ToluClassics/candle-tutorial","commit_stats":null,"previous_names":["toluclassics/candle-tutorial"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ToluClassics%2Fcandle-tutorial","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ToluClassics%2Fcandle-tutorial/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ToluClassics%2Fcandle-tutorial/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ToluClassics%2Fcandle-tutorial/manifests","owner_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ToluClassics","download_url":"https://codeload.github.com/ToluClassics/candle-tutorial/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":234200170,"owners_count":18795139,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["candle","pytorch","rust","rust-lang"],"created_at":"2024-09-24T15:00:45.186Z","updated_at":"2025-09-25T14:30:50.599Z","avatar_url":"https://github.com/ToluClassics.png","language":"Rust","funding_links":[],"categories":["Machine Learning","Rust"],"sub_categories":[],"readme":"# Candle Tutorial - Convert Pytorch Models to Candle\n\n[Candle](https://github.com/huggingface/candle) is an ML framework written in rust that takes advantage of the speed and memory safety Rust provides for writing machine workloads. It can be used as a drop in replacement for ML frameworks like PyTorch, it also has [python bindings](https://github.com/huggingface/candle/tree/main/candle-pyo3) so you can use it from python...\n\nThis repo provides some guide for converting pytorch models from the transformers library to Candle by directly translating the pytorch code to Candle ...\n\n❗️❗️: To make the code easily understandable, I have annotated each line of the Rust/Candle code with the equivalent PyTorch code.\nTutorial Structure:\n- [Getting Started](#getting-started)\n    - [0. Important things to note](#0-important-things-to-note)\n    - [1. Start a new rust project](#1-start-a-new-rust-project)\n    - [2. 
Install Candle \u0026 Other Packages](#2-install-candle--other-packages)\n\n- [Parallels between Pytorch and Candle](#3-parallels-between-pytorch-and-candle)\n    - [Tensors](#tensors)\n    - [Tensor Operations](#tensor-operations)\n- [Translating a PyTorch Transformer Model into Candle](#3-translating-a-pytorch-transformer-model-into-candle)\n    - [RoBERTa](#31-roberta)\n        - [a. Writing Building Blocks](#a-writing-building-blocks)\n        - [b. Roberta Config](#b-roberta-config)\n        - [c. RobertaEmbeddings](#c-robertaembeddings)\n        - [d. RobertaSelfAttention](#d-robertaselfattention)\n        - [e. RobertaSelfOutput](#e-robertaselfoutput)\n        - [f. RobertaIntermediate](#f-robertaintermediate)\n        - [g. RobertaOutput](#g-robertaoutput)\n        - [h. RobertaLayer](#h-robertalayer)\n        - [i. RobertaEncoder](#i-robertaencoder)\n        - [j. RobertaModel](#j-robertamodel)\n- [Debugging/Testing the model](#debugging-the-model)\n\n## Getting Started:\n\n### 0. Important things to note\n\n- When porting an already trained checkpoint to Candle, much of the PyTorch code is not relevant, because it mostly handles different training scenarios. It helps to know which functions can be bypassed when the conversion effort is mostly geared towards loading an already trained model.\n\n- Python built-in methods: Python has built-in methods like `__call__`, which lets an instance of a class be called like a function, and `__init__`, which initializes a class. In Rust, we have to explicitly define methods like `Class::new()` to initialize a struct and `Class::forward()` to perform a forward pass. 
This is going to be a recurring theme in most of the code shown below.\n\n- It is important to write [unit tests](tests/test_roberta.rs) after writing most or every module to ensure that input and output shapes in Candle are consistent with the same module in PyTorch.\n\n- In PyTorch, we can initialize module weights by creating a class method `_init_weights`, but in Candle it becomes a design decision: you can initialize a tensor using the shape of your weights/bias (e.g. ) and hold it in a `VarBuilder`, which is then used to initialize the tensors in each module.\n\n\n### 1. Start a new rust project\nThe command below will create a new Rust project called `candle-roberta` in the current directory with a `Cargo.toml` file and a `src` directory with a `main.rs` file in it.\n\n```bash\n$ cargo new candle-roberta\n```\n\n\n### 2. Install Candle \u0026 Other Packages\n\nYou can follow the instructions [here](https://huggingface.github.io/candle/guide/installation.html) to install Candle, or you can use the commands below to install Candle directly from GitHub.\n\nFor this tutorial, we will be using the `candle-core` and `candle-nn` crates.\n`candle-core` provides the core functionality of the Candle framework. 
It provides an implementation of the basic blocks for building neural networks and also integrations with different backends like CUDA, MKL, CPU etc., while `candle-nn` provides a high level API for building neural networks.\n\n```bash\ncargo add --git https://github.com/huggingface/candle.git candle-core  # install candle-core\ncargo add --git https://github.com/huggingface/candle.git candle-nn # install candle-nn\n```\n\nOther crates we will need for this tutorial are:\n- `anyhow` for error handling ==\u003e `cargo add anyhow`\n- `serde` for serialization (with the `derive` feature, which `#[derive(Deserialize)]` needs) ==\u003e `cargo add serde --features derive`\n- `serde_json` for json serialization ==\u003e `cargo add serde_json`\n- `hf-hub` for integrating with the huggingface hub ==\u003e `cargo add hf-hub`\n- `tokenizers` for tokenizing text ==\u003e `cargo add tokenizers`\n\n## 3. Parallels between Pytorch and Candle\n\nTo convert a PyTorch model to Candle, it is important to understand the parallels between the two frameworks.\n- Candle is a Rust framework, so it is statically typed, while PyTorch is a Python framework, so it is dynamically typed. 
This means that you need to be explicit about the types of your variables in Candle, while in PyTorch you don't.\n\n### Tensors\n\nThe examples shown below can be found [here]();\n\n- Initializing a Tensor: Tensors can be directly created from an array in both frameworks\n\n    - Pytorch: in pytorch the data type is automatically inferred from the data;\n\n        ```python\n        import torch\n        from typing import List\n        \n        data: List = [1, 2, 3]\n        tensor = torch.tensor(data)\n        print(tensor)\n\n        nested_data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]\n        nested_tensor = torch.tensor(nested_data)\n        print(nested_tensor)\n        ```\n    - Candle: in candle, the data type needs to be explicitly specified;\n\n        ```rust\n        use candle_core::{DType, Device, Tensor};\n        use anyhow::Result;\n\n        let data: [u32; 3] = [1u32, 2, 3];\n        let tensor = Tensor::new(\u0026data, \u0026Device::Cpu)?;\n        println!(\"tensor: {:?}\", tensor.to_vec1::\u003cu32\u003e()?);\n\n        let nested_data: [[u32; 3]; 3] = [[1u32, 2, 3], [4, 5, 6], [7, 8, 9]];\n        let nested_tensor = Tensor::new(\u0026nested_data, \u0026Device::Cpu)?;\n        println!(\"nested_tensor: {:?}\", nested_tensor.to_vec2::\u003cu32\u003e()?);\n        ```\n\n- Creating a tensor from another tensor\n    \n    - Pytorch: in pytorch, the new tensor keeps the shape and data type of the source tensor unless they are overridden;\n\n        ```python\n        zero_tensor = torch.zeros_like(tensor)\n        ones_tensor = torch.ones_like(tensor)\n        # rand_like needs a floating point dtype, so override the integer dtype of `tensor`\n        random_tensor = torch.rand_like(tensor, dtype=torch.float)\n        ```\n\n    - Candle: in candle, the new tensor also keeps the data type of the source tensor, but the type must be stated explicitly when reading the values back;\n\n        ```rust\n        let data: [u32; 3] = [1u32, 2, 3];\n        let tensor = Tensor::new(\u0026data, \u0026Device::Cpu)?;\n\n        let zero_tensor = tensor.zeros_like()?;\n        println!(\"zero_tensor: {:?}\", 
zero_tensor.to_vec1::\u003cu32\u003e()?);\n\n        let ones_tensor = tensor.ones_like()?;\n        println!(\"ones_tensor: {:?}\", ones_tensor.to_vec1::\u003cu32\u003e()?);\n\n        // random sampling needs a floating point tensor, so convert from u32 to f32 first\n        let random_tensor = tensor.to_dtype(DType::F32)?.rand_like(0.0, 1.0)?;\n        println!(\"random_tensor: {:?}\", random_tensor.to_vec1::\u003cf32\u003e()?);\n        ```\n\n- Checking tensor dimensions:\n    - PyTorch\n        ```python\n        print(tensor.shape)\n        print(tensor.size())\n        ```\n    - Candle\n        ```rust\n        // dims() returns the shape as a slice and works for any number of dimensions\n        println!(\"tensor shape: {:?}\", tensor.shape().dims()); \n        // dims2() returns a Result that is Ok only for a 2 dimensional tensor\n        println!(\"tensor shape: {:?}\", tensor.shape().dims2()); \n        // dims3() returns a Result that is Ok only for a 3 dimensional tensor\n        println!(\"tensor shape: {:?}\", tensor.shape().dims3()); \n        ```\n\n###  Tensor Operations: \n\nPerforming tensor operations is pretty similar across both frameworks.\nSome examples can be found here: [Candle CheatSheet](https://github.com/huggingface/candle/blob/main/README.md#how-to-use)\n\n\n\n\n## 3. Translating a PyTorch Transformer Model into Candle\n\nHere's the fun part! In this section we are going to take a look at translating models from the transformers library to Candle. We will be using the [RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html) and [XLM-Roberta](https://huggingface.co/docs/transformers/model_doc/xlm-roberta) models for this tutorial.\n\nWe will translate the [Pytorch Source Code](https://github.com/huggingface/transformers/blob/main/src/transformers/models/roberta/modeling_roberta.py) into Candle code, then load the pretrained checkpoint into Rust and compare the output from both frameworks.\n\nNote ❗️❗️: To make the code easily understandable, I have annotated each line of the Rust/Candle code with the equivalent PyTorch code.\n\n### 3.1. RoBERTa\n\nRoBERTa is a variant of the BERT model. 
Although both models have different pretraining approaches, structurally both models are very similar and the major difference between both models is that in the RoBERTa layer, Position numbers begin at padding_idx+1,  While in BERT, Position numbers begin at 0.\n\nFollowing the transformers PyTorch implementation,  RoBERTa Model can be divided into the 2 main parts (embeddings and encoder):\n\n```\nRobertaModel(\n  (embeddings): RobertaEmbeddings(\n    (word_embeddings): Embedding(50265, 768, padding_idx=1)\n    (position_embeddings): Embedding(514, 768, padding_idx=1)\n    (token_type_embeddings): Embedding(1, 768)\n    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n    (dropout): Dropout(p=0.1, inplace=False)\n  )\n  (encoder): RobertaEncoder(\n    (layer): ModuleList(\n      (0-11): 12 x RobertaLayer(\n        (attention): RobertaAttention(\n          (self): RobertaSelfAttention(\n            (query): Linear(in_features=768, out_features=768, bias=True)\n            (key): Linear(in_features=768, out_features=768, bias=True)\n            (value): Linear(in_features=768, out_features=768, bias=True)\n            (dropout): Dropout(p=0.1, inplace=False)\n          )\n          (output): RobertaSelfOutput(\n            (dense): Linear(in_features=768, out_features=768, bias=True)\n            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n            (dropout): Dropout(p=0.1, inplace=False)\n          )\n        )\n        (intermediate): RobertaIntermediate(\n          (dense): Linear(in_features=768, out_features=3072, bias=True)\n          (intermediate_act_fn): GELUActivation()\n        )\n        (output): RobertaOutput(\n          (dense): Linear(in_features=3072, out_features=768, bias=True)\n          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n          (dropout): Dropout(p=0.1, inplace=False)\n        )\n      )\n    )\n  )\n)\n```\n\n- \u003cstrong\u003eRoberta Config\u003c/strong\u003e: 
Holds the model configuration\n- \u003cstrong\u003eRoberta Model (RobertaModel)\u003c/strong\u003e : This is the main model class that contains the embedding and the encoder module.\n    - Embedding  (RobertaEmbeddings)\n    - Encoder (RobertaEncoder)\n\n- \u003cstrong\u003eEmbedding (RobertaEmbeddings)\u003c/strong\u003e: The Embedding module is a combination of the following:\n    - Word Embedding   --\u003e PyTorch Embedding Module\n    - Position Embedding   --\u003e PyTorch Embedding Module\n    - Token Type Embedding   --\u003e PyTorch Embedding Module\n    - Layer Norm \n\n- \u003cstrong\u003eEncoder  (RobertaEncoder)\u003c/strong\u003e: The Encoder is made up of a number of RobertaLayers stacked on one another:\n    - x number of RobertaLayers: This is a PyTorch ModuleList of RobertaLayer\n\n- \u003cstrong\u003eRoberta Layer (RobertaLayer)\u003c/strong\u003e: The RobertaLayer is made up of the following modules:\n    - \u003cstrong\u003eAttention Block (RobertaAttention)\u003c/strong\u003e -\u003e PyTorch Module (made up of Self Attention Layer and Self Attention Output Layer)\n        - \u003cstrong\u003eSelf Attention Layer (RobertaSelfAttention)\u003c/strong\u003e\n        - \u003cstrong\u003eSelf Attention Output Layer (RobertaSelfOutput)\u003c/strong\u003e\n\n    - \u003cstrong\u003eIntermediate Layer (RobertaIntermediate)\u003c/strong\u003e --\u003e PyTorch Linear Module followed by the GELU activation\n    - \u003cstrong\u003eOutput Layer (RobertaOutput)\u003c/strong\u003e --\u003e PyTorch Linear Module followed by Layer Norm and Dropout\n\nListed above are the main components of the model. 
Other building blocks for implementing the model include:\n\n- \u003cstrong\u003eLayer Norm\u003c/strong\u003e --\u003e PyTorch LayerNorm Module\n- \u003cstrong\u003eDropout\u003c/strong\u003e --\u003e PyTorch Dropout Module\n- \u003cstrong\u003eActivation\u003c/strong\u003e --\u003e PyTorch Activation Function\n\n\n### Translating Pytorch Modules into Candle\n\n#### Import necessary Modules:\n\nImport the necessary modules from candle and other crates:\n\n- DType: This is an enum that represents the data type of a tensor.\n- Device: This is an enum that represents the device a tensor is stored on.\n- Result: This is a type alias for `std::result::Result\u003cT, candle_core::Error\u003e` for error handling\n- Tensor: This is a struct that represents a tensor.\n\n- Embedding: This is a prebuilt struct that represents an embedding layer similar to `nn.Embedding`.\n- Linear: This is a prebuilt struct that represents a linear layer similar to `nn.Linear`.\n- Module: This is a trait that represents a neural network module similar to `nn.Module` in PyTorch.\n- VarBuilder: A builder that retrieves named tensors (for example from a checkpoint file) to initialize module weights, playing a role similar to `nn.Parameter` in PyTorch.\n\n    \n    ```rust\n    use candle_core::{DType, Device, Result, Tensor}; \n    use candle_nn::{Embedding, Linear, Module, VarBuilder};\n    use serde::Deserialize;\n    ```\n\n### a. Writing Building Blocks:\n\n- Linear/Embedding: These are helper functions for loading the weights of a linear/embedding layer using `VarBuilder` from a checkpoint file. 
We create these 2 helper functions because we will use them multiple times:\n\n    ```rust\n    // Load an embedding table of shape (vocab_size, hidden_size) stored under the name \"weight\"\n    fn embedding(vocab_size: usize, hidden_size: usize, vb: VarBuilder) -\u003e Result\u003cEmbedding\u003e {\n        let embeddings = vb.get((vocab_size, hidden_size), \"weight\")?;\n        Ok(Embedding::new(embeddings, hidden_size))\n    }\n\n    // Load a linear layer; like nn.Linear, the weight has shape (out_features, in_features)\n    fn linear(size1: usize, size2: usize, vb: VarBuilder) -\u003e Result\u003cLinear\u003e {\n        let weight = vb.get((size2, size1), \"weight\")?;\n        let bias = vb.get(size2, \"bias\")?;\n        Ok(Linear::new(weight, Some(bias)))\n    }\n    ```\n\n    Both of these functions already exist in `candle_nn` and can be imported with `candle_nn::{embedding,linear}`\n\n- Layer Norm (https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html): Used to normalize a tensor over a given axis. It is used in the embedding layer and the encoder layer. A good explanation of layer normalization can be [found here](https://www.pinecone.io/learn/batch-layer-normalization/#What-is-Layer-Normalization). 
This is required because we need to implement the low-level layer norm module in Candle.\n\n    ![image info](./assets/layer_norm.png)\n    *Layer Normalization from https://www.pinecone.io/learn/batch-layer-normalization/#What-is-Layer-Normalization*\n\n\n    - PyTorch: In PyTorch, we can use LayerNorm by calling it as a module\n\n        ```python\n        from torch import nn\n\n        LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        ```\n\n    - Candle: In candle we can implement the layer normalization using the equation above or import it directly from `candle_nn` with `candle_nn::{LayerNorm,layer_norm}` Steps:\n        - Since normalization is done over the last axis which is the hidden size, we can use the `sum_keepdim` method to sum over the last axis and divide by dimension size to get `mean_x`.\n        - For each element in the tensor, we subtract the mean and square the result and divide by hidden dimension to get `norm_x`.\n        - To get the normalized input, we subtract the mean from the input and divide by the square root of `norm_x + eps`.\n        - To get the final output, we multiply the normalized input by the weight of the normalization layer and add the bias.\n\n        ```rust\n        pub struct LayerNorm {\n            weight: Tensor, // Weight vector of the LayerNorm Layer\n            bias: Tensor, // Bias vector of the LayerNorm Layer\n            eps: f64, // Epsilon value for numerical stability\n        }\n\n        impl LayerNorm {\n            // Constructor for LayerNorm \n            pub fn new(weight: Tensor, bias: Tensor, eps: f64) -\u003e Self {\n                Self { weight, bias, eps }\n            }\n\n            pub fn forward(\u0026self, x: \u0026Tensor) -\u003e Result\u003cTensor\u003e {\n                let x_dtype = x.dtype(); // Get the data type of the input tensor\n                let internal_dtype = match x_dtype {\n                    DType::F16 | DType::BF16 =\u003e 
DType::F32,\n                    d =\u003e d,\n                };\n                let (_bsize, _seq_len, hidden_size) = x.dims3()?; // Get the dimensions of the input tensor\n                let x = x.to_dtype(internal_dtype)?; \n                let mean_x = (x.sum_keepdim(2)? / hidden_size as f64)?; // Sum over the hidden axis and divide by the hidden size to get the mean\n                let x = x.broadcast_sub(\u0026mean_x)?; // Subtract the mean from the input tensor\n                let norm_x = (x.sqr()?.sum_keepdim(2)? / hidden_size as f64)?; // Mean of the squared deviations, i.e. the variance over the hidden axis\n                let x_normed = x.broadcast_div(\u0026(norm_x + self.eps)?.sqrt()?)?; // Get the normalized input\n                let x = x_normed\n                    .to_dtype(x_dtype)?\n                    .broadcast_mul(\u0026self.weight)?\n                    .broadcast_add(\u0026self.bias)?;\n                Ok(x)\n            }\n        }\n        ```\n\n        This struct can be used as follows:\n\n        ```rust\n        // weight and bias hold one value per hidden unit (hidden_size = 2 here)\n        let w_gen = Tensor::new(\u0026[3f32, 1.], \u0026Device::Cpu)?;\n        let b_gen = Tensor::new(\u0026[-2f32, -2.], \u0026Device::Cpu)?;\n\n        // initialize a layer norm layer\n        let layer_norm = LayerNorm::new(w_gen, b_gen, 1e-5);\n\n        // forward() calls dims3(), so the input must be a float tensor of shape (batch, seq_len, hidden_size)\n        let data: [[[f32; 2]; 2]; 1] = [[[1f32, 2.], [3., 4.]]];\n        let input_tensor = Tensor::new(\u0026data, \u0026Device::Cpu)?;\n        let normalized_tensor = layer_norm.forward(\u0026input_tensor)?;\n        ```\n\n- Dropout: Randomly zero out different parts of the input tensor using a probability value. 
This is only used during training. Since we are translating a pretrained model, we can just write a struct that returns the input tensor unchanged.\n\n    - PyTorch: In PyTorch, we can use Dropout by calling it as a module\n\n        ```python\n        import torch\n        from torch import nn\n\n        dropout = nn.Dropout(p=0.1)\n        input = torch.randn(2)\n        output = dropout(input)\n        ```\n    - Candle: In candle we can implement the dropout layer by just returning the input tensor\n\n        ```rust\n        struct Dropout {\n            #[allow(dead_code)]\n            pr: f64,\n        }\n\n        impl Dropout {\n            fn new(pr: f64) -\u003e Self {\n                Self { pr }\n            }\n\n            // Inference only: return the input unchanged\n            fn forward(\u0026self, x: \u0026Tensor) -\u003e Result\u003cTensor\u003e {\n                Ok(x.clone())\n            }\n        }\n        ```\n\n        This struct can be used as follows:\n\n        ```rust\n        let dropout = Dropout::new(0.1);\n\n        let data: [u32; 3] = [1u32, 2, 3];\n        let input_tensor = Tensor::new(\u0026data, \u0026Device::Cpu)?;\n        let dropout_tensor = dropout.forward(\u0026input_tensor)?;\n        ```\n\n- Activation: RoBERTa uses a GELU activation function. We can implement the GELU using a similar approach as dropout above with no input params. 
Candle tensors have a built-in `gelu` method to perform this operation\n\n    - PyTorch: In PyTorch, we can use GELU by calling it as a module\n\n        ```python\n        import torch\n        from torch import nn\n\n        activation = nn.GELU()\n        input = torch.randn(2)\n        output = activation(input)\n        ```\n\n    - Candle: In candle we can implement the activation layer by calling `gelu` on the input tensor\n\n        ```rust\n        struct Activation {}\n\n        impl Activation {\n            fn new() -\u003e Self {\n                Self {}\n            }\n\n            fn forward(\u0026self, x: \u0026Tensor) -\u003e Result\u003cTensor\u003e {\n                // gelu() uses the tanh approximation; gelu_erf() is the erf-based variant\n                x.gelu()\n            }\n        }\n        ```\n\n        This struct can be used as follows:\n\n        ```rust\n        let activation = Activation::new();\n\n        // GELU is only defined for floating point tensors\n        let data: [f32; 3] = [1f32, 2., 3.];\n        let input_tensor = Tensor::new(\u0026data, \u0026Device::Cpu)?;\n        let activation_tensor = activation.forward(\u0026input_tensor)?;\n        ```\n\n\n### b. Roberta Config:\n\nUp next is the Roberta Config. This is a struct that holds the configuration of the model. It is similar to the [RobertaConfig](https://github.com/huggingface/transformers/blob/e1cec43415e72c9853288d4e9325b734d36dd617/src/transformers/models/roberta/configuration_roberta.py#L37) in the transformers library. For this struct, we will initialize the default values for the config (we implement the `Default` trait for the `RobertaConfig` struct) and then use the serde crate to deserialize the config from a json file. 
Alternatively, we can create a `RobertaConfig::new()` method for creating a new instance of RobertaConfig\n\n```rust\n// `Deserialize` (from serde) lets the config be read from a json file\n#[derive(Debug, Clone, PartialEq, Deserialize)]\npub struct RobertaConfig {\n    vocab_size: usize,\n    hidden_size: usize,\n    num_hidden_layers: usize,\n    num_attention_heads: usize,\n    intermediate_size: usize,\n    hidden_act: String,\n    hidden_dropout_prob: f64,\n    max_position_embeddings: usize,\n    type_vocab_size: usize,\n    initializer_range: f64,\n    layer_norm_eps: f64,\n    pad_token_id: usize,\n    bos_token_id: usize,\n    eos_token_id: usize,\n    position_embedding_type: String,\n    use_cache: bool,\n    classifier_dropout: Option\u003cf64\u003e,\n    model_type: Option\u003cString\u003e,\n}\n\nimpl Default for RobertaConfig {\n    fn default() -\u003e Self {\n        Self {\n            vocab_size: 50265,\n            hidden_size: 768,\n            num_hidden_layers: 12,\n            num_attention_heads: 12,\n            intermediate_size: 3072,\n            hidden_act: \"gelu\".to_string(),\n            hidden_dropout_prob: 0.1,\n            max_position_embeddings: 512,\n            type_vocab_size: 2,\n            initializer_range: 0.02,\n            layer_norm_eps: 1e-12,\n            pad_token_id: 1,\n            bos_token_id: 0,\n            eos_token_id: 2,\n            // the field is a String, so the default is the string \"absolute\"\n            position_embedding_type: \"absolute\".to_string(),\n            use_cache: true,\n            classifier_dropout: None,\n            model_type: Some(\"roberta\".to_string()),\n        }\n    }\n}\n```\n\n### c. RobertaEmbeddings:\n[HuggingFace PyTorch Implementation](https://github.com/huggingface/transformers/blob/e1cec43415e72c9853288d4e9325b734d36dd617/src/transformers/models/roberta/modeling_roberta.py#L65)\n\nIn the `__init__` function of the embedding class, we have 3 embedding layers for processing word_embeddings, position_embeddings and token_type_ids. 
Similar to the PyTorch implementation, there are two important class methods that we need to implement.\n\n- [create_position_ids_from_input_embeds](https://github.com/huggingface/transformers/blob/46092f763d26eb938a937c2a9cc69ce1cb6c44c2/src/transformers/models/roberta/modeling_roberta.py#L136): A function to generate position ids from embeddings. I have included the pytorch equivalent of each line as a comment.\n\n    ```rust\n    pub fn create_position_ids_from_input_embeds(\u0026self, input_embeds: \u0026Tensor) -\u003e Result\u003cTensor\u003e {\n        // input_shape = inputs_embeds.size()\n        // In candle, we use dims3() for getting the size of a 3 dimensional tensor\n        let input_shape = input_embeds.dims3()?;\n        // sequence_length = input_shape[1]\n        let seq_length = input_shape.1;\n\n        // position_ids = torch.arange( self.padding_idx + 1, sequence_length + self.padding_idx + 1, \\\n        // dtype=torch.long, device=inputs_embeds.device)\n        let mut position_ids = Tensor::arange(\n            self.padding_idx + 1,\n            seq_length as u32 + self.padding_idx + 1,\n            \u0026Device::Cpu,\n        )?;\n\n        // return position_ids.unsqueeze(0).expand(input_shape)\n        position_ids = position_ids\n            .unsqueeze(0)?\n            .expand((input_shape.0, input_shape.1))?;\n        Ok(position_ids)\n    }\n    ```\n- [create_position_ids_from_input_ids](https://github.com/huggingface/transformers/blob/46092f763d26eb938a937c2a9cc69ce1cb6c44c2/src/transformers/models/roberta/modeling_roberta.py#L1558): A function to generate position_ids from input_ids. 
\n\n    ```rust\n    pub fn create_position_ids_from_input_ids(input_ids: \u0026Tensor, padding_idx: u32, past_key_values_length: u8) -\u003e Result\u003cTensor\u003e {\n        // mask = input_ids.ne(padding_idx).int()\n        let mask = input_ids.ne(padding_idx)?; \n        // incremental_indices = (torch.cumsum(mask, dim=1).type_as(mask) + past_key_values_length) * mask\n        // `cumsum_2d` is a cumulative sum helper that is not shown in this section\n        let incremental_indices = cumsum_2d(\u0026mask, 0, input_ids.device())?; \n\n        // incremental_indices.long() + padding_idx\n        let incremental_indices = incremental_indices.broadcast_add(\u0026Tensor::new(\u0026[past_key_values_length], input_ids.device())?)?; \n\n        Ok(incremental_indices)\n    }\n    ```\n\n- [Embedding Layer] : The embedding layer is made up of 3 embedding layers for processing word_embeddings, position_embeddings and token_type_ids. The output of the embedding layer is the sum of the word_embeddings, position_embeddings and token_type_embeddings. The output is then passed through a layer norm and dropout layer. 
A link to the pytorch implementation is shown above.\n\n    ```rust\n    pub struct RobertaEmbeddings {\n        word_embeddings: Embedding,\n        position_embeddings: Option\u003cEmbedding\u003e,\n        token_type_embeddings: Embedding,\n        layer_norm: LayerNorm,\n        dropout: Dropout,\n        pub padding_idx: u32,\n    }\n\n    impl RobertaEmbeddings {\n        pub fn load(vb: VarBuilder, config: \u0026RobertaConfig) -\u003e Result\u003cSelf\u003e {\n\n            // nn.Embedding(config.vocab_size, config.hidden_size)\n            let word_embeddings = embedding(\n                config.vocab_size,\n                config.hidden_size,\n                vb.pp(\"word_embeddings\"),\n            )?;\n\n            // nn.Embedding(config.max_position_embeddings, config.hidden_size)\n            let position_embeddings = embedding(\n                config.max_position_embeddings,\n                config.hidden_size,\n                vb.pp(\"position_embeddings\"),\n            )?;\n\n            // nn.Embedding(config.type_vocab_size, config.hidden_size)\n            let token_type_embeddings = embedding(\n                config.type_vocab_size,\n                config.hidden_size,\n                vb.pp(\"token_type_embeddings\"),\n            )?;\n\n            // nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n            let layer_norm = layer_norm(\n                config.hidden_size,\n                config.layer_norm_eps,\n                vb.pp(\"LayerNorm\"),\n            )?;\n\n            // nn.Dropout(config.hidden_dropout_prob)\n            let dropout = Dropout::new(config.hidden_dropout_prob);\n            \n            let padding_idx = config.pad_token_id as u32;\n\n            Ok(Self {\n                word_embeddings,\n                position_embeddings: Some(position_embeddings),\n                token_type_embeddings,\n                layer_norm,\n                dropout,\n                padding_idx,\n            })\n  
      }\n\n        pub fn forward(\u0026self, input_ids: \u0026Tensor, token_type_ids: \u0026Tensor, position_ids: Option\u003c\u0026Tensor\u003e, inputs_embeds: Option\u003c\u0026Tensor\u003e) -\u003e Result\u003cTensor\u003e {\n\n            let position_ids = match position_ids {\n                Some(ids) =\u003e ids.to_owned(),\n                None =\u003e {\n                    if Option::is_some(\u0026inputs_embeds){\n                        // self.create_position_ids_from_inputs_embeds(inputs_embeds)\n                        let position_ids = self.create_position_ids_from_input_embeds(inputs_embeds.unwrap())?; //\n                        position_ids\n                    } else {\n                        // create_position_ids_from_input_ids(input_ids, self.padding_idx, past_key_values_length)\n                        let position_ids = create_position_ids_from_input_ids(input_ids, self.padding_idx, 1)?; \n                        position_ids\n                    } \n                }\n            };\n\n\n            let inputs_embeds : Tensor = match inputs_embeds {\n                Some(embeds) =\u003e embeds.to_owned(),\n                None =\u003e {\n                    // self.word_embeddings(input_ids)\n                    let embeds = self.word_embeddings.forward(input_ids)?; \n                    embeds\n                }\n            };\n\n            // self.token_type_embeddings(token_type_ids)\n            let token_type_embeddings = self.token_type_embeddings.forward(token_type_ids)?; \n            // inputs_embeds + token_type_embeddings\n            let mut embeddings = (inputs_embeds + token_type_embeddings)?; \n\n            if let Some(position_embeddings) = \u0026self.position_embeddings {\n                // embeddings + self.position_embeddings(position_ids)\n                embeddings = embeddings.broadcast_add(\u0026position_embeddings.forward(\u0026position_ids)?)? 
\n            }\n\n            // self.LayerNorm(embeddings)\n            let embeddings = self.layer_norm.forward(\u0026embeddings)?; \n            // self.dropout(embeddings)\n            let embeddings = self.dropout.forward(\u0026embeddings)?; \n\n            Ok(embeddings)\n            \n        }\n    }\n    ```\n\n### d. RobertaSelfAttention:\n[HuggingFace PyTorch Implementation](https://github.com/huggingface/transformers/blob/e1cec43415e72c9853288d4e9325b734d36dd617/src/transformers/models/roberta/modeling_roberta.py#L155). The self-attention layer is made up of three linear layers that project the hidden states into the query, key and value. The attention scores are the dot product of the query and the key, scaled by the square root of the attention head size. The scores are passed through a softmax and a dropout layer, and the resulting probabilities are multiplied by the value.\n\n```rust\nstruct RobertaSelfAttention {\n    query: Linear,\n    key: Linear,\n    value: Linear,\n    dropout: Dropout,\n    num_attention_heads: usize,\n    attention_head_size: usize,\n}\n\nimpl RobertaSelfAttention {\n    fn load(vb: VarBuilder, config: \u0026RobertaConfig) -\u003e Result\u003cSelf\u003e {\n        // config.hidden_size / config.num_attention_heads\n        let attention_head_size = config.hidden_size / config.num_attention_heads;\n        // self.num_attention_heads * self.attention_head_size\n        let all_head_size = config.num_attention_heads * attention_head_size; \n        // nn.Dropout(config.attention_probs_dropout_prob); note: this port passes config.hidden_dropout_prob\n        let dropout = Dropout::new(config.hidden_dropout_prob); \n        let hidden_size = config.hidden_size;\n\n        // nn.Linear(config.hidden_size, self.all_head_size)\n        let query = linear(hidden_size, all_head_size, vb.pp(\"query\"))?; \n        // nn.Linear(config.hidden_size, self.all_head_size)\n        let value = linear(hidden_size, all_head_size, vb.pp(\"value\"))?; \n        // nn.Linear(config.hidden_size, self.all_head_size)\n        let key = linear(hidden_size, 
all_head_size, vb.pp(\"key\"))?; \n        Ok(Self {\n            query,\n            key,\n            value,\n            dropout,\n            num_attention_heads: config.num_attention_heads,\n            attention_head_size,\n        })\n    }\n\n    fn transpose_for_scores(\u0026self, xs: \u0026Tensor) -\u003e Result\u003cTensor\u003e {\n        \n        // x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)\n        let mut new_x_shape = xs.dims().to_vec();\n        new_x_shape.pop();\n        new_x_shape.push(self.num_attention_heads);\n        new_x_shape.push(self.attention_head_size);\n\n        //  x = x.view(new_x_shape) || x.permute(0, 2, 1, 3)\n        let xs = xs.reshape(new_x_shape.as_slice())?.transpose(1, 2)?;\n        xs.contiguous()\n    }\n\n    fn forward(\u0026self, hidden_states: \u0026Tensor) -\u003e Result\u003cTensor\u003e {\n        // self.query(hidden_states)\n        let query_layer = self.query.forward(hidden_states)?;\n        // self.key(hidden_states) \n        let key_layer = self.key.forward(hidden_states)?; \n        // self.value(hidden_states)\n        let value_layer = self.value.forward(hidden_states)?; \n\n        // self.transpose_for_scores(query_layer)\n        let query_layer = self.transpose_for_scores(\u0026query_layer)?;\n        // self.transpose_for_scores(key_layer) \n        let key_layer = self.transpose_for_scores(\u0026key_layer)?;\n        // self.transpose_for_scores(value_layer)\n        let value_layer = self.transpose_for_scores(\u0026value_layer)?; \n\n        // attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))\n        let attention_scores = query_layer.matmul(\u0026key_layer.t()?)?;\n        // attention_scores / math.sqrt(self.attention_head_size)\n        let attention_scores = (attention_scores / (self.attention_head_size as f64).sqrt())?; \n        // attention_probs = nn.functional.softmax(attention_scores, dim=-1)\n        let attention_probs = 
{candle_nn::ops::softmax(\u0026attention_scores, candle_core::D::Minus1)?}; \n        // attention_probs = self.dropout(attention_probs)\n        let attention_probs = self.dropout.forward(\u0026attention_probs)?; \n\n        // torch.matmul(attention_probs, value_layer)\n        let context_layer = attention_probs.matmul(\u0026value_layer)?;\n        // context_layer = context_layer.permute(0, 2, 1, 3).contiguous()\n        let context_layer = context_layer.transpose(1, 2)?.contiguous()?; \n\n        // new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)\n        // context_layer = context_layer.view(new_context_layer_shape)\n        let context_layer = context_layer.flatten_from(candle_core::D::Minus2)?; // \n        Ok(context_layer)\n    }\n}\n```\n\n### e. RobertaSelfOutput:\n[HuggingFace PyTorch Implementation](https://github.com/huggingface/transformers/blob/e1cec43415e72c9853288d4e9325b734d36dd617/src/transformers/models/roberta/modeling_roberta.py#L290). 
The output of the Self Attention Layer is passed through the Self Output layer which is made up of a linear layer, layer norm and dropout layer.\n\n```rust\nstruct RobertaSelfOutput {\n    dense: Linear,\n    layer_norm: LayerNorm,\n    dropout: Dropout,\n}\n\nimpl RobertaSelfOutput {\n    fn load(vb: VarBuilder, config: \u0026RobertaConfig) -\u003e Result\u003cSelf\u003e {\n        // nn.Linear(config.hidden_size, config.hidden_size)\n        let dense = linear(config.hidden_size, config.hidden_size, vb.pp(\"dense\"))?; \n        //  nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        let layer_norm = layer_norm(\n            config.hidden_size,\n            config.layer_norm_eps,\n            vb.pp(\"LayerNorm\"),\n        )?;\n\n        // nn.Dropout(config.hidden_dropout_prob)\n        let dropout = Dropout::new(config.hidden_dropout_prob); \n        Ok(Self {\n            dense,\n            layer_norm,\n            dropout,\n        })\n    }\n\n    fn forward(\u0026self, hidden_states: \u0026Tensor, input_tensor: \u0026Tensor) -\u003e Result\u003cTensor\u003e {\n        // self.dense(hidden_states)\n        let hidden_states = self.dense.forward(hidden_states)?;\n        // self.dropout(hidden_states)\n        let hidden_states = self.dropout.forward(\u0026hidden_states)?;\n        // self.LayerNorm(hidden_states + input_tensor)\n        self.layer_norm.forward(\u0026(hidden_states + input_tensor)?) \n    }\n}\n```\n\n### f. RobertaAttention:\n[HuggingFace PyTorch Implementation](https://github.com/huggingface/transformers/blob/e1cec43415e72c9853288d4e9325b734d36dd617/src/transformers/models/roberta/modeling_roberta.py#L305). The Roberta Attention Layer is made up of the Self Attention Layer and the Self Output Layer implemented earlier. 
The output of the Self Attention Layer is passed through the Self Output Layer.\n\n```rust\nstruct RobertaAttention {\n    self_attention: RobertaSelfAttention, \n    self_output: RobertaSelfOutput,\n}\n\nimpl RobertaAttention {\n    fn load(vb: VarBuilder, config: \u0026RobertaConfig) -\u003e Result\u003cSelf\u003e {\n        // RobertaSelfAttention(config, position_embedding_type=position_embedding_type)\n        let self_attention = RobertaSelfAttention::load(vb.pp(\"self\"), config)?;\n        // RobertaSelfOutput(config) \n        let self_output = RobertaSelfOutput::load(vb.pp(\"output\"), config)?; \n\n        Ok(Self {\n            self_attention,\n            self_output,\n        })\n    }\n\n    fn forward(\u0026self, hidden_states: \u0026Tensor) -\u003e Result\u003cTensor\u003e {\n        //self_outputs = self.self(hidden_states)\n        let self_outputs = self.self_attention.forward(hidden_states)?; \n        // attention_output = self.output(self_outputs[0], hidden_states)\n        let attention_output = self.self_output.forward(\u0026self_outputs, hidden_states)?; \n\n        Ok(attention_output)\n    }\n}\n```\n\n### g. RobertaIntermediate\n[HuggingFace PyTorch Implementation](https://github.com/huggingface/transformers/blob/e1cec43415e72c9853288d4e9325b734d36dd617/src/transformers/models/roberta/modeling_roberta.py#L355). The intermediate layer is made up of a linear layer and an activation function. Here we use the GELU activation function. 
This layer, combined with the Attention Layer and an Output layer, makes up each layer of the Encoder.\n\n```rust\nstruct RobertaIntermediate {\n    dense: Linear,\n    // same type that Activation::new() returns in load()\n    intermediate_act: Activation,\n}\n\nimpl RobertaIntermediate {\n    fn load(vb: VarBuilder, config: \u0026RobertaConfig) -\u003e Result\u003cSelf\u003e {\n        // nn.Linear(config.hidden_size, config.intermediate_size)\n        let dense = linear(config.hidden_size, config.intermediate_size, vb.pp(\"dense\"))?; \n        Ok(Self {\n            dense,\n            intermediate_act: Activation::new(),\n        })\n    }\n\n    fn forward(\u0026self, hidden_states: \u0026Tensor) -\u003e Result\u003cTensor\u003e {\n        // self.dense(hidden_states)\n        let hidden_states = self.dense.forward(hidden_states)?; \n        // self.intermediate_act_fn(hidden_states)\n        let ys = self.intermediate_act.forward(\u0026hidden_states)?; \n        Ok(ys)\n    }\n}\n```\n\n### h. RobertaOutput\n[HuggingFace PyTorch Implementation](https://github.com/huggingface/transformers/blob/e1cec43415e72c9853288d4e9325b734d36dd617/src/transformers/models/roberta/modeling_roberta.py#L371). The output layer is made up of a linear layer, layer norm and dropout layer. 
This layer, combined with the Attention Layer and an Intermediate layer, makes up each layer of the Encoder.\n\n```rust\nstruct RobertaOutput {\n    dense: Linear,\n    layer_norm: LayerNorm,\n    dropout: Dropout,\n}\n\nimpl RobertaOutput {\n    fn load(vb: VarBuilder, config: \u0026RobertaConfig) -\u003e Result\u003cSelf\u003e {\n        // nn.Linear(config.intermediate_size, config.hidden_size)\n        let dense = linear(config.intermediate_size, config.hidden_size, vb.pp(\"dense\"))?;\n        // nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)\n        let layer_norm = layer_norm(\n            config.hidden_size,\n            config.layer_norm_eps,\n            vb.pp(\"LayerNorm\"),\n        )?; \n        // nn.Dropout(config.hidden_dropout_prob)\n        let dropout = Dropout::new(config.hidden_dropout_prob);\n        Ok(Self {\n            dense,\n            layer_norm,\n            dropout,\n        })\n    }\n\n    fn forward(\u0026self, hidden_states: \u0026Tensor, input_tensor: \u0026Tensor) -\u003e Result\u003cTensor\u003e {\n        // self.dense(hidden_states)\n        let hidden_states = self.dense.forward(hidden_states)?;\n        // self.dropout(hidden_states)\n        let hidden_states = self.dropout.forward(\u0026hidden_states)?;\n        // self.LayerNorm(hidden_states + input_tensor)\n        self.layer_norm.forward(\u0026(hidden_states + input_tensor)?) \n    }\n}\n```\n\n### i. RobertaLayer\n[HuggingFace PyTorch Implementation](https://github.com/huggingface/transformers/blob/e1cec43415e72c9853288d4e9325b734d36dd617/src/transformers/models/roberta/modeling_roberta.py#L386). Unlike the PyTorch code, this port does not implement cross-attention. As mentioned in the previous sections, the RobertaLayer is made up of an Attention Layer, an Intermediate Layer and an Output Layer. 
The attention output is passed through the Intermediate Layer, and the Output Layer then combines the intermediate output with the attention output through a residual connection.\n\n```rust\nstruct RobertaLayer {\n    attention: RobertaAttention,\n    intermediate: RobertaIntermediate,\n    output: RobertaOutput,\n}\n\nimpl RobertaLayer {\n    fn load(vb: VarBuilder, config: \u0026RobertaConfig) -\u003e Result\u003cSelf\u003e {\n        // RobertaAttention(config)\n        let attention = RobertaAttention::load(vb.pp(\"attention\"), config)?;\n        // RobertaIntermediate(config)\n        let intermediate = RobertaIntermediate::load(vb.pp(\"intermediate\"), config)?; \n        // RobertaOutput(config)\n        let output = RobertaOutput::load(vb.pp(\"output\"), config)?; \n        Ok(Self {\n            attention,\n            intermediate,\n            output,\n        })\n    }\n\n    fn forward(\u0026self, hidden_states: \u0026Tensor) -\u003e Result\u003cTensor\u003e {\n        // self.attention(hidden_states)\n        let attention_output = self.attention.forward(hidden_states)?; \n\n        //  self.intermediate(attention_output)\n        let intermediate_output = self.intermediate.forward(\u0026attention_output)?; \n        // self.output(intermediate_output, attention_output)\n        let layer_output = self\n            .output\n            .forward(\u0026intermediate_output, \u0026attention_output)?; \n        Ok(layer_output)\n    }\n}\n```\n\n### j. RobertaEncoder\n[HuggingFace PyTorch Implementation](https://github.com/huggingface/transformers/blob/e1cec43415e72c9853288d4e9325b734d36dd617/src/transformers/models/roberta/modeling_roberta.py#L473). The Encoder is made up of a stack of RobertaLayers. 
The output of the Encoder is the output of the last RobertaLayer.\n\n```rust\nstruct RobertaEncoder {\n    // one RobertaLayer per hidden layer in the config\n    layers: Vec\u003cRobertaLayer\u003e,\n}\n\nimpl RobertaEncoder {\n    fn load(vb: VarBuilder, config: \u0026RobertaConfig) -\u003e Result\u003cSelf\u003e {\n        // nn.ModuleList([RobertaLayer(config) for _ in range(config.num_hidden_layers)])\n        let layers = (0..config.num_hidden_layers)\n            .map(|index| RobertaLayer::load(vb.pp(\u0026format!(\"layer.{index}\")), config))\n            .collect::\u003cResult\u003cVec\u003c_\u003e\u003e\u003e()?; \n        Ok(RobertaEncoder { layers })\n    }\n\n    fn forward(\u0026self, hidden_states: \u0026Tensor) -\u003e Result\u003cTensor\u003e {\n        let mut hidden_states = hidden_states.clone();\n\n        //for i, layer_module in enumerate(self.layer):\n        //  layer_outputs = layer_module(hidden_states)\n\n        for layer in self.layers.iter() {\n            hidden_states = layer.forward(\u0026hidden_states)?\n        }\n        Ok(hidden_states)\n    }\n}\n```\n\n### k. RobertaModel\n[HuggingFace PyTorch Implementation](https://github.com/huggingface/transformers/blob/e1cec43415e72c9853288d4e9325b734d36dd617/src/transformers/models/roberta/modeling_roberta.py#L691). Voila! We have implemented all the components of the Roberta Model. The Roberta Model is made up of an Embedding Layer and an Encoder. 
The output of the Roberta Model is the output of the Encoder.\n\n```rust\npub struct RobertaModel {\n    embeddings: RobertaEmbeddings,\n    encoder: RobertaEncoder,\n    pub device: Device,\n}\n\nimpl RobertaModel {\n    pub fn load(vb: VarBuilder, config: \u0026RobertaConfig) -\u003e Result\u003cSelf\u003e {\n        let (embeddings, encoder) = match (\n            RobertaEmbeddings::load(vb.pp(\"embeddings\"), config), // RobertaEmbeddings(config)\n            RobertaEncoder::load(vb.pp(\"encoder\"), config), // RobertaEncoder(config)\n        ) {\n            (Ok(embeddings), Ok(encoder)) =\u003e (embeddings, encoder),\n            (Err(err), _) | (_, Err(err)) =\u003e {\n                if let Some(model_type) = \u0026config.model_type {\n                    if let (Ok(embeddings), Ok(encoder)) = (\n                        RobertaEmbeddings::load(vb.pp(\u0026format!(\"{model_type}.embeddings\")), config),\n                        RobertaEncoder::load(vb.pp(\u0026format!(\"{model_type}.encoder\")), config),\n                    ) {\n                        (embeddings, encoder)\n                    } else {\n                        return Err(err);\n                    }\n                } else {\n                    return Err(err);\n                }\n            }\n        };\n        Ok(Self {\n            embeddings,\n            encoder,\n            device: vb.device().clone(),\n        })\n    }\n\n    pub fn forward(\u0026self, input_ids: \u0026Tensor, token_type_ids: \u0026Tensor) -\u003e Result\u003cTensor\u003e {\n        // self.embedding(input_ids=input_ids)\n        let embedding_output = self.embeddings.forward(input_ids, token_type_ids, None, None)?;\n         // self.encoder(embedding_output )\n        let sequence_output = self.encoder.forward(\u0026embedding_output)?;\n        Ok(sequence_output)\n    }\n\n}\n```\n\n\n### Debugging the Model\n\n#### Unit Tests for Different Components\nIt is important to write unit tests for the different 
components of the model. This is to ensure that the model is working as expected. Unit tests sometime appear to be time-consuming but they can be very important in the long run. Here are some unit tests I wrote during the porting process:\n\n```rust\n// Regression_test = https://github.com/huggingface/transformers/blob/21dc5859421cf0d7d82d374b10f533611745a8c5/tests/models/xlm_roberta_xl/test_modeling_xlm_roberta_xl.py#L496\n#[test]\nfn test_create_position_ids_from_input_embeds() -\u003e Result\u003c()\u003e {\n\n    let config = RobertaConfig::default();\n    let vb = VarBuilder::zeros(DType::F32, \u0026Device::Cpu);\n    let embeddings_module = RobertaEmbeddings::load(vb, \u0026config).unwrap();\n\n    let input_embeds = Tensor::randn(0f32, 1f32, (2, 4, 30), \u0026Device::Cpu).unwrap();\n    let position_ids = embeddings_module.create_position_ids_from_input_embeds(\u0026input_embeds);\n\n    let expected_tensor: \u0026[[u32; 4]; 2] = \u0026[\n        [0 + embeddings_module.padding_idx + 1, 1 + embeddings_module.padding_idx + 1, 2 + embeddings_module.padding_idx + 1, 3 + embeddings_module.padding_idx + 1,],\n        [0 + embeddings_module.padding_idx + 1, 1 + embeddings_module.padding_idx + 1, 2 + embeddings_module.padding_idx + 1, 3 + embeddings_module.padding_idx + 1,]\n    ];\n\n    assert_eq!(position_ids.unwrap().to_vec2::\u003cu32\u003e()?, expected_tensor);\n\n    Ok(())\n\n}\n```\n\n- Testing the Model :: [Full Test Code](tests/test_roberta.rs)\n    ```rust\n    // https://github.com/huggingface/transformers/blob/e1cec43415e72c9853288d4e9325b734d36dd617/tests/models/roberta/test_modeling_roberta.py#L548\n    #[test]\n    fn test_modeling_roberta_base () -\u003e Result\u003c()\u003e {\n        // model = RobertaModel.from_pretrained(\"roberta-base\")\n        let (model, _) =  build_roberta_model_and_tokenizer(\"roberta-base\", false).unwrap();\n\n        // input_ids = torch.tensor([[0, 31414, 232, 328, 740, 1140, 12695, 69, 46078, 1588, 2]])\n        let 
input_ids = \u0026[[0u32, 31414, 232, 328, 740, 1140, 12695, 69, 46078, 1588, 2]];\n        let input_ids = Tensor::new(input_ids, \u0026model.device).unwrap();\n\n        let token_ids = input_ids.zeros_like().unwrap();\n        let output = model.forward(\u0026input_ids, \u0026token_ids)?;\n\n        let expected_shape = [1, 11, 768];\n        assert_eq!(output.shape().dims(), \u0026expected_shape);\n\n        // expected_slice = torch.tensor([[[-0.0231, 0.0782, 0.0074], [-0.1854, 0.0540, -0.0175], [0.0548, 0.0799, 0.1687]]])\n        let expected_output = [[-0.0231, 0.0782, 0.0074], [-0.1854, 0.0540, -0.0175], [0.0548, 0.0799, 0.1687]];\n\n        // self.assertTrue(torch.allclose(output[:, :3, :3], expected_slice, atol=1e-4))\n        let output = output.squeeze(0)?;\n        let output = output.to_vec2::\u003cf32\u003e()?;\n        let output: Vec\u003cVec\u003cf32\u003e\u003e = output.iter().take(3).map(|nested_vec| nested_vec.iter().take(3).map(|\u0026x| round_to_decimal_places(x, 4)).collect()).collect();\n        assert_eq!(output, expected_output);\n\n        Ok(())\n\n    }\n    ```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FToluClassics%2Fcandle-tutorial","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FToluClassics%2Fcandle-tutorial","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FToluClassics%2Fcandle-tutorial/lists"}