{"id":22202467,"url":"https://github.com/mytechnotalent/knet","last_synced_at":"2026-04-15T15:36:02.644Z","repository":{"id":264354784,"uuid":"893128686","full_name":"mytechnotalent/KNET","owner":"mytechnotalent","description":"KNET is an educational Retrieval-Augmented Generation (RAG) system built from scratch, designed to empower STEM learners by combining transformers and intelligent retrieval to simulate human reasoning and deliver precise, context-aware answers to foster curiosity and understanding!","archived":false,"fork":false,"pushed_at":"2024-11-23T16:19:10.000Z","size":9584,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-29T07:45:22.729Z","etag":null,"topics":["ai","chain-of-thought","chain-of-thought-reasoning","context-aware","data-science","machine-learning","machine-learning-models","neural-network","neural-networks","pytorch","rag","retrieval-augmented-generation","stem"],"latest_commit_sha":null,"homepage":"","language":"Jupyter 
Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mytechnotalent.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-23T15:59:20.000Z","updated_at":"2024-11-23T16:25:10.000Z","dependencies_parsed_at":"2024-11-23T17:31:58.974Z","dependency_job_id":null,"html_url":"https://github.com/mytechnotalent/KNET","commit_stats":null,"previous_names":["mytechnotalent/knet"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mytechnotalent/KNET","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mytechnotalent%2FKNET","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mytechnotalent%2FKNET/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mytechnotalent%2FKNET/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mytechnotalent%2FKNET/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mytechnotalent","download_url":"https://codeload.github.com/mytechnotalent/KNET/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mytechnotalent%2FKNET/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265205775,"owners_count":23727511,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposi
tories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","chain-of-thought","chain-of-thought-reasoning","context-aware","data-science","machine-learning","machine-learning-models","neural-network","neural-networks","pytorch","rag","retrieval-augmented-generation","stem"],"created_at":"2024-12-02T16:25:31.084Z","updated_at":"2026-04-15T15:36:02.584Z","avatar_url":"https://github.com/mytechnotalent.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"```python\nfrom IPython.display import Image\n```\n\n\n```python\nImage(filename = 'KNET.jpeg')\n```\n\n\n\n\n    \n![jpeg](README_files/README_1_0.jpg)\n    \n\n\n\n# KNET\n\n#### KNET is an educational Retrieval-Augmented Generation (RAG) system built from scratch, designed to empower STEM learners by combining transformers and intelligent retrieval to simulate human reasoning and deliver precise, context-aware answers to foster curiosity and understanding!\n\n#### Author: [Kevin Thomas](mailto:ket189@pitt.edu)\n\n## Imports\n\n\n```python\nimport math\nimport torch\nimport torch.nn as nn\n```\n\n## 1. Define the Positional Encoding Module\n\n### Transformers rely on positional encoding to retain the order of the sequence data.\n\n\n```python\nclass PositionalEncoding(nn.Module):\n    \"\"\"\n    Implements positional encoding for transformer models.\n\n    This module injects information about the relative or absolute position\n    of tokens in the sequence. 
the positional encodings have the same dimension\n    as the embeddings so that the two can be summed.\n\n    Attributes:\n        dropout (nn.Dropout): dropout layer.\n        pe (torch.Tensor): positional encoding matrix.\n    \"\"\"\n\n    def __init__(self, d_model, dropout=0.1, max_len=5000):\n        \"\"\"\n        Initializes the PositionalEncoding module.\n\n        Args:\n            d_model (int): the dimensionality of embeddings.\n            dropout (float): dropout rate. default is 0.1.\n            max_len (int): maximum length of input sequences. default is 5000.\n        \"\"\"\n        super(PositionalEncoding, self).__init__()\n        self.dropout = nn.Dropout(p=dropout)\n\n        # create positional encoding matrix\n        pe = torch.zeros(max_len, d_model)\n        position = torch.arange(0, max_len).unsqueeze(1).float()\n        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))\n        pe[:, 0::2] = torch.sin(position * div_term)  # even indices\n        pe[:, 1::2] = torch.cos(position * div_term)  # odd indices\n        self.pe = pe.unsqueeze(1)  # shape: (max_len, 1, d_model)\n\n    def forward(self, x):\n        \"\"\"\n        Adds positional encoding to the input tensor.\n\n        Args:\n            x (torch.Tensor): input tensor of shape (seq_len, batch_size, d_model).\n\n        Returns:\n            torch.Tensor: tensor with positional encoding added.\n        \"\"\"\n        x = x + self.pe[: x.size(0)]\n        return self.dropout(x)\n```\n\n## 2. 
Define the Transformer Model\n\n### This model includes an embedding layer, positional encoding, and the transformer itself.\n\n\n```python\nclass TransformerModel(nn.Module):\n    \"\"\"\n    Defines a transformer-based sequence-to-sequence model.\n\n    This model uses an encoder-decoder architecture with positional encoding\n    and transformer layers to process sequences.\n\n    Attributes:\n        embedding (nn.Embedding): embedding layer for tokens.\n        pos_encoder (PositionalEncoding): positional encoding module.\n        transformer (nn.Transformer): transformer module.\n        fc_out (nn.Linear): linear layer for output generation.\n        d_model (int): dimensionality of the model's embeddings.\n    \"\"\"\n\n    def __init__(self, vocab_size, d_model=128, nhead=4, num_layers=2, dropout=0.1):\n        \"\"\"\n        Initializes the TransformerModel.\n\n        Args:\n            vocab_size (int): size of the vocabulary.\n            d_model (int): dimensionality of embeddings. default is 128.\n            nhead (int): number of attention heads. default is 4.\n            num_layers (int): number of encoder and decoder layers. default is 2.\n            dropout (float): dropout rate. 
default is 0.1.\n        \"\"\"\n        super(TransformerModel, self).__init__()\n        self.embedding = nn.Embedding(vocab_size, d_model)\n        self.pos_encoder = PositionalEncoding(d_model, dropout)\n        self.transformer = nn.Transformer(\n            d_model=d_model,\n            nhead=nhead,\n            num_encoder_layers=num_layers,\n            num_decoder_layers=num_layers,\n            dropout=dropout,\n        )\n        self.fc_out = nn.Linear(d_model, vocab_size)\n        self.d_model = d_model\n\n    def forward(self, src, tgt, src_padding_mask=None, tgt_padding_mask=None, tgt_mask=None):\n        \"\"\"\n        Performs a forward pass through the transformer model.\n\n        Args:\n            src (torch.Tensor): source tensor of shape (seq_len_src, batch_size).\n            tgt (torch.Tensor): target tensor of shape (seq_len_tgt, batch_size).\n            src_padding_mask (torch.Tensor): source padding mask.\n            tgt_padding_mask (torch.Tensor): target padding mask.\n            tgt_mask (torch.Tensor): target mask to prevent attention to future tokens.\n\n        Returns:\n            torch.Tensor: output tensor of shape (seq_len_tgt, batch_size, vocab_size).\n        \"\"\"\n        src = self.embedding(src) * math.sqrt(self.d_model)\n        src = self.pos_encoder(src)\n        tgt = self.embedding(tgt) * math.sqrt(self.d_model)\n        tgt = self.pos_encoder(tgt)\n\n        output = self.transformer(\n            src,\n            tgt,\n            tgt_mask=tgt_mask,\n            src_key_padding_mask=src_padding_mask,\n            tgt_key_padding_mask=tgt_padding_mask,\n        )\n        output = self.fc_out(output)\n        return output\n```\n\n## 3. 
Prepare the Corpus and Vocabulary\n\n### We create a sample corpus and build a vocabulary from it.\n\n\n```python\ndef tokenize(text):\n    \"\"\"\n    Tokenizes a given text into individual words.\n\n    This function converts the input text to lowercase\n    and splits it into words based on spaces.\n\n    Args:\n        text (str): the input string to be tokenized.\n\n    Returns:\n        list: a list of words (tokens) from the input text.\n    \"\"\"\n    return text.lower().split()\n\n\n# define the corpus with query and document labels\ncorpus_with_labels = [\n    (\"What is reverse engineering?\", \"Reverse engineering is the process of analyzing a system to identify its components and their interrelationships.\"),\n    (\"What is disassembly?\", \"Disassembly is the process of converting machine code into human-readable assembly code.\"),\n    (\"What is static analysis?\", \"Static analysis is the examination of software without executing it to understand its structure and behavior.\"),\n    (\"What is dynamic analysis?\", \"Dynamic analysis is the process of analyzing software while it is running to study its behavior.\"),\n    (\"What is a debugger?\", \"A debugger is a tool used to test and debug programs by allowing inspection of memory, registers, and instructions during execution.\"),\n    (\"What is IDA Pro?\", \"IDA Pro is a popular reverse engineering tool that provides a disassembler and decompiler for analyzing binary files.\"),\n    (\"What is Ghidra?\", \"Ghidra is an open-source reverse engineering tool developed by the NSA, featuring a powerful decompiler.\"),\n    (\"What is a control flow graph?\", \"A control flow graph is a representation of all possible paths that can be taken through a program during its execution.\"),\n    (\"What is malware analysis?\", \"Malware analysis is the process of studying malicious software to understand its behavior and mitigate threats.\"),\n    (\"What is obfuscation?\", \"Obfuscation is a technique 
used to make code harder to understand, often to protect intellectual property or evade detection.\"),\n]\n\n\ndef build_vocab(corpus_with_labels):\n    \"\"\"\n    Builds a vocabulary from the corpus by tokenizing all queries and documents.\n\n    Args:\n        corpus_with_labels (list of tuples): list of query-document pairs.\n\n    Returns:\n        dict: word-to-index mapping.\n        dict: index-to-word mapping.\n    \"\"\"\n    vocab = set()\n    for query, doc in corpus_with_labels:\n        vocab.update(tokenize(query))\n        vocab.update(tokenize(doc))\n\n    vocab = sorted(vocab)\n    word2idx = {word: idx + 1 for idx, word in enumerate(vocab)}\n    word2idx['\u003cunk\u003e'] = 0  # unknown token\n    word2idx['\u003cpad\u003e'] = len(word2idx)  # padding token\n    word2idx['\u003csos\u003e'] = len(word2idx)  # start of sequence\n    word2idx['\u003ceos\u003e'] = len(word2idx)  # end of sequence\n    idx2word = {idx: word for word, idx in word2idx.items()}\n    return word2idx, idx2word\n\n\n# build vocabulary\nword2idx, idx2word = build_vocab(corpus_with_labels)\nvocab_size = len(word2idx)\n```\n\n## 4. 
Vectorize the Documents\n\n### Convert each document into a vector using the vocabulary.\n\n\n```python\ndef vectorize(text, word2idx, vocab_size):\n    \"\"\"\n    Converts a text document into a vector representation.\n\n    This function tokenizes the input text and creates a vector\n    where each index corresponds to a word in the vocabulary,\n    and the value at that index is the frequency of the word in the text.\n\n    Args:\n        text (str): the input document to be vectorized.\n        word2idx (dict): mapping of words to indices in the vocabulary.\n        vocab_size (int): size of the vocabulary.\n\n    Returns:\n        torch.Tensor: a tensor representing the frequency of words in the document.\n    \"\"\"\n    vec = torch.zeros(vocab_size)\n    tokens = tokenize(text)\n    for token in tokens:\n        idx = word2idx.get(token, word2idx['\u003cunk\u003e'])\n        vec[idx] += 1\n    return vec\n\n\n# create document vectors for the retriever\ndoc_vectors = [vectorize(doc, word2idx, vocab_size) for _, doc in corpus_with_labels]\n```\n\n## 5. Implement the Retriever\n\n### Retrieve the top k documents most similar to the query.\n\n\n```python\ndef retrieve(query, doc_vectors, corpus, word2idx, vocab_size, top_k=2):\n    \"\"\"\n    Retrieves the top-k most similar documents to the query based on cosine similarity.\n\n    Args:\n        query (str): the input query text.\n        doc_vectors (list of torch.Tensor): a list of document vectors representing the corpus.\n        corpus (list of tuples): the corpus of query-document pairs.\n        word2idx (dict): mapping of words to indices in the vocabulary.\n        vocab_size (int): size of the vocabulary.\n        top_k (int): the number of most similar documents to retrieve. 
default is 2.\n\n    Returns:\n        list of str: the top-k most similar documents from the corpus.\n    \"\"\"\n    query_vec = vectorize(query, word2idx, vocab_size)\n    similarities = []\n    for idx, doc_vec in enumerate(doc_vectors):\n        sim = torch.dot(query_vec, doc_vec) / (\n            torch.norm(query_vec) * torch.norm(doc_vec) + 1e-8\n        )\n        similarities.append((sim.item(), idx))\n    similarities.sort(reverse=True)\n    top_docs = [corpus[idx][1] for (_, idx) in similarities[:top_k]]\n    return top_docs\n```\n\n## 6. Simulate Chain of Thought\n\n### Create a function to simulate the reasoning process.\n\n\n```python\ndef generate_answer_with_chain_of_thought(query, retrieved_docs):\n    \"\"\"\n    Simulates a reasoning process to generate an answer using a chain of thought.\n\n    This function prints the logical steps involved in reasoning through\n    the query and the retrieved documents, then formulates and outputs a simulated answer.\n\n    Args:\n        query (str): the input query text.\n        retrieved_docs (list of str): a list of documents retrieved for the query.\n\n    Returns:\n        None\n    \"\"\"\n    print('Chain of Thought:')\n    reasoning_steps = [\n        f\"1. the query is: '{query}'\",\n        '2. retrieved the following documents:',\n    ]\n    for idx, doc in enumerate(retrieved_docs):\n        reasoning_steps.append(f'   {idx + 1}. {doc}')\n    reasoning_steps.append('3. analyzing the retrieved documents to answer the query.')\n    reasoning_steps.append('4. formulating the answer based on the information.')\n    for step in reasoning_steps:\n        print(step)\n```\n\n## 7. 
Initialize the Transformer Model\n\n### Instantiate the transformer model with the vocabulary size.\n\n\n```python\n# initialize the transformer model with updated parameters\ntransformer_model = TransformerModel(vocab_size=vocab_size, d_model=128, nhead=4, num_layers=2, dropout=0.1)\n```\n\n    /opt/anaconda3/envs/prod/lib/python3.12/site-packages/torch/nn/modules/transformer.py:379: UserWarning: enable_nested_tensor is True, but self.use_nested_tensor is False because encoder_layer.self_attn.batch_first was not True(use batch_first for better inference performance)\n      warnings.warn(\n\n\n## 8. Create the RAG System\n\n### Combine the retriever and the generator (transformer model).\n\n\n```python\ndef rag_system(query, transformer_model, retriever, doc_vectors, corpus, word2idx, idx2word, vocab_size):\n    \"\"\"\n    Implements a basic Retrieval-Augmented Generation (RAG) system.\n\n    This function retrieves relevant documents for a query using the retriever,\n    prepares the input for the transformer model by combining the query and retrieved documents,\n    and generates an output using the transformer model.\n\n    Args:\n        query (str): the input query text.\n        transformer_model (nn.Module): the transformer model for generating output.\n        retriever (function): the document retrieval function.\n        doc_vectors (list of torch.Tensor): a list of document vectors representing the corpus.\n        corpus (list of tuples): the corpus of query-document pairs.\n        word2idx (dict): mapping of words to indices in the vocabulary.\n        idx2word (dict): mapping of indices to words in the vocabulary.\n        vocab_size (int): size of the vocabulary.\n\n    Returns:\n        None\n    \"\"\"\n    # retrieve relevant documents\n    retrieved_docs = retriever(query, doc_vectors, corpus, word2idx, vocab_size)\n\n    # prepare input by concatenating the query and retrieved documents\n    input_text = query + ' ' + ' 
'.join(retrieved_docs)\n    print('Input to the transformer model:')\n    print(input_text)\n\n    # tokenize and convert input to indices\n    tokens = tokenize(input_text)\n    input_indices = [word2idx.get(token, word2idx['\u003cunk\u003e']) for token in tokens]\n    input_tensor = torch.tensor(input_indices).unsqueeze(1)  # add batch dimension\n\n    # initialize decoder input with \u003csos\u003e\n    tgt_input = torch.tensor([word2idx['\u003csos\u003e']]).unsqueeze(1)\n\n    # generate output tokens iteratively\n    transformer_model.eval()\n    with torch.no_grad():\n        for _ in range(50):  # maximum answer length\n            tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_input.size(0)).to(input_tensor.device)\n            src_padding_mask = (input_tensor.squeeze(1) == word2idx['\u003cpad\u003e']).unsqueeze(0)\n            tgt_padding_mask = (tgt_input.squeeze(1) == word2idx['\u003cpad\u003e']).unsqueeze(0)\n\n            output = transformer_model(\n                input_tensor,\n                tgt_input,\n                src_padding_mask=src_padding_mask,\n                tgt_padding_mask=tgt_padding_mask,\n                tgt_mask=tgt_mask,\n            )\n            next_token = output.argmax(dim=-1)[-1, :].item()\n            tgt_input = torch.cat([tgt_input, torch.tensor([[next_token]])], dim=0)\n            if next_token == word2idx['\u003ceos\u003e']:\n                break\n\n    # decode the output indices to words\n    output_indices = tgt_input.squeeze().tolist()[1:]  # exclude the first \u003csos\u003e token\n    answer = ' '.join([idx2word.get(idx, '\u003cunk\u003e') for idx in output_indices])\n\n    # simulate reasoning using the chain of thought process\n    generate_answer_with_chain_of_thought(query, retrieved_docs)\n\n    print('\\nAnswer:')\n    print(answer)\n```\n\n## 9. 
Test the RAG System\n\n### Run the system with a sample query.\n\n\n```python\n# test the RAG system with an example query\nquery = \"What is reverse engineering?\"\nrag_system(query, transformer_model, retrieve, doc_vectors, corpus_with_labels, word2idx, idx2word, vocab_size)\n```\n\n    Input to the transformer model:\n    What is reverse engineering? Ghidra is an open-source reverse engineering tool developed by the NSA, featuring a powerful decompiler. Reverse engineering is the process of analyzing a system to identify its components and their interrelationships.\n    Chain of Thought:\n    1. the query is: 'What is reverse engineering?'\n    2. retrieved the following documents:\n       1. Ghidra is an open-source reverse engineering tool developed by the NSA, featuring a powerful decompiler.\n       2. Reverse engineering is the process of analyzing a system to identify its components and their interrelationships.\n    3. analyzing the retrieved documents to answer the query.\n    4. formulating the answer based on the information.\n    \n    Answer:\n    provides tool behavior execution. study study study study study study study study study study study study study study study study study study study study study study study study study study study study study study study study study study study study study study study study study study study study study study\n\n\n    /opt/anaconda3/envs/prod/lib/python3.12/site-packages/torch/nn/functional.py:5849: UserWarning: Support for mismatched key_padding_mask and attn_mask is deprecated. Use same type for both instead.\n      warnings.warn(\n\n\n## 10. 
Train and Test the RAG System\n\n### Train the model, then run the system again with the same sample query.\n\n\n```python\nimport torch.optim as optim\n```\n\n\n```python\nfrom sklearn.model_selection import train_test_split\n```\n\n\n```python\n# split the data into training and validation sets\ntrain_pairs, val_pairs = train_test_split(corpus_with_labels, test_size=0.2, random_state=42)\n\n\ndef prepare_data(data, word2idx):\n    \"\"\"\n    Prepares tokenized input and output tensors for training or validation.\n\n    This function converts query-document pairs into indexed tensors\n    using the vocabulary mapping.\n\n    Args:\n        data (list of tuples): list of query-document pairs.\n        word2idx (dict): mapping of words to indices in the vocabulary.\n\n    Returns:\n        list of tuples: tokenized input-output tensor pairs.\n    \"\"\"\n    processed_data = []\n    for query, doc in data:\n        input_indices = [word2idx.get(token, word2idx['\u003cunk\u003e']) for token in tokenize(query)]\n        # add \u003csos\u003e and \u003ceos\u003e tokens to the target\n        output_indices = (\n            [word2idx['\u003csos\u003e']]\n            + [word2idx.get(token, word2idx['\u003cunk\u003e']) for token in tokenize(doc)]\n            + [word2idx['\u003ceos\u003e']]\n        )\n        input_tensor = torch.tensor(input_indices)\n        output_tensor = torch.tensor(output_indices)\n        processed_data.append((input_tensor, output_tensor))\n    return processed_data\n\n\n# prepare the training and validation data\ntrain_data = prepare_data(train_pairs, word2idx)\nval_data = prepare_data(val_pairs, word2idx)\n\n# implement batch training\nfrom torch.utils.data import DataLoader\n\n\ndef collate_fn(batch):\n    \"\"\"\n    Collate function to prepare batches with padding.\n\n    Args:\n        batch (list of tuples): list of (src, tgt) tuples.\n\n    Returns:\n        src_batch (torch.Tensor): Padded source sequences.\n        tgt_batch (torch.Tensor): Padded target 
sequences.\n    \"\"\"\n    src_batch, tgt_batch = zip(*batch)\n    src_padded = nn.utils.rnn.pad_sequence(src_batch, padding_value=word2idx['\u003cpad\u003e'])\n    tgt_padded = nn.utils.rnn.pad_sequence(tgt_batch, padding_value=word2idx['\u003cpad\u003e'])\n    return src_padded, tgt_padded\n\n\n# set batch size\nbatch_size = 2\n\n# create DataLoaders\ntrain_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)\nval_loader = DataLoader(val_data, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)\n\n# define loss and optimizer with reduced learning rate\ncriterion = nn.CrossEntropyLoss(ignore_index=word2idx['\u003cpad\u003e'])\noptimizer = optim.Adam(transformer_model.parameters(), lr=0.001)\n\n# increase dropout rate in the model (assuming you can modify model initialization)\ntransformer_model.dropout = 0.3\n\n# adjust learning rate scheduler and early stopping patience\nscheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=3, verbose=True)\n\n\ndef train_model(model, train_loader, val_loader, vocab_size, epochs=1000):\n    \"\"\"\n    Trains the transformer model on the training set and validates it.\n\n    This function performs multiple epochs of training and calculates\n    validation loss after each epoch.\n\n    Args:\n        model (nn.Module): the transformer model to be trained.\n        train_loader (DataLoader): DataLoader for training data.\n        val_loader (DataLoader): DataLoader for validation data.\n        vocab_size (int): size of the vocabulary.\n        epochs (int): number of epochs for training.\n\n    Returns:\n        None\n    \"\"\"\n    best_val_loss = float('inf')\n    patience = 100  # increase patience to allow more epochs before early stopping\n    epochs_no_improve = 0\n    for epoch in range(epochs):\n        model.train()\n        total_loss = 0\n        for src_batch, tgt_batch in train_loader:\n            optimizer.zero_grad()\n\n        
    tgt_input = tgt_batch[:-1, :]  # Exclude last token for input\n            tgt_output = tgt_batch[1:, :]  # Exclude first token for output\n\n            # Generate masks\n            tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_input.size(0)).to(src_batch.device)\n            src_padding_mask = (src_batch == word2idx['\u003cpad\u003e']).transpose(0, 1)\n            tgt_padding_mask = (tgt_input == word2idx['\u003cpad\u003e']).transpose(0, 1)\n\n            output = model(\n                src_batch,\n                tgt_input,\n                src_padding_mask=src_padding_mask,\n                tgt_padding_mask=tgt_padding_mask,\n                tgt_mask=tgt_mask,\n            )\n            loss = criterion(output.view(-1, vocab_size), tgt_output.reshape(-1))\n            loss.backward()\n            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping\n            optimizer.step()\n            total_loss += loss.item()\n        avg_loss = total_loss / len(train_loader)\n\n        # validation step\n        model.eval()\n        val_loss = 0\n        with torch.no_grad():\n            for src_batch, tgt_batch in val_loader:\n                tgt_input = tgt_batch[:-1, :]\n                tgt_output = tgt_batch[1:, :]\n\n                tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_input.size(0)).to(src_batch.device)\n                src_padding_mask = (src_batch == word2idx['\u003cpad\u003e']).transpose(0, 1)\n                tgt_padding_mask = (tgt_input == word2idx['\u003cpad\u003e']).transpose(0, 1)\n\n                output = model(\n                    src_batch,\n                    tgt_input,\n                    src_padding_mask=src_padding_mask,\n                    tgt_padding_mask=tgt_padding_mask,\n                    tgt_mask=tgt_mask,\n                )\n                loss = criterion(output.view(-1, vocab_size), tgt_output.reshape(-1))\n                val_loss += loss.item()\n  
      avg_val_loss = val_loss / len(val_loader)\n        print(f'Epoch {epoch + 1}/{epochs}, Training Loss: {avg_loss:.4f}, Validation Loss: {avg_val_loss:.4f}')\n\n        # adjust learning rate based on validation loss\n        scheduler.step(avg_val_loss)\n\n        # check for early stopping\n        if avg_val_loss \u003c best_val_loss:\n            best_val_loss = avg_val_loss\n            epochs_no_improve = 0\n            torch.save(model.state_dict(), 'best_model.pt')  # save the best model\n        else:\n            epochs_no_improve += 1\n            if epochs_no_improve \u003e= patience:\n                print(f'Early stopping after {epoch + 1} epochs')\n                model.load_state_dict(torch.load('best_model.pt', weights_only=True))  # load the best model\n                break\n\n\n# train the model\ntrain_model(transformer_model, train_loader, val_loader, vocab_size, epochs=1000)\n\n# test the RAG system again\nprint(\"\\nRe-testing the RAG system with trained model:\")\nquery = \"What is reverse engineering?\"\nrag_system(query, transformer_model, retrieve, doc_vectors, corpus_with_labels, word2idx, idx2word, vocab_size)\n```\n\n    Epoch 1/1000, Training Loss: 1.1782, Validation Loss: 4.0728\n    Epoch 2/1000, Training Loss: 0.9192, Validation Loss: 3.9542\n    Epoch 3/1000, Training Loss: 0.8057, Validation Loss: 4.0870\n    Epoch 4/1000, Training Loss: 0.6739, Validation Loss: 4.1276\n    Epoch 5/1000, Training Loss: 0.5242, Validation Loss: 4.1934\n    Epoch 6/1000, Training Loss: 0.4641, Validation Loss: 4.2709\n    Epoch 7/1000, Training Loss: 0.4112, Validation Loss: 4.4382\n    Epoch 8/1000, Training Loss: 0.3263, Validation Loss: 4.3522\n    Epoch 9/1000, Training Loss: 0.2877, Validation Loss: 4.2833\n    Epoch 10/1000, Training Loss: 0.2524, Validation Loss: 4.3287\n    Epoch 11/1000, Training Loss: 0.2341, Validation Loss: 4.4280\n    Epoch 12/1000, Training Loss: 0.2051, Validation Loss: 4.5004\n    Epoch 13/1000, Training Loss: 
0.1790, Validation Loss: 4.5016\n    Epoch 14/1000, Training Loss: 0.1630, Validation Loss: 4.4548\n    Epoch 15/1000, Training Loss: 0.1694, Validation Loss: 4.4290\n    Epoch 16/1000, Training Loss: 0.1529, Validation Loss: 4.4136\n    Epoch 17/1000, Training Loss: 0.1539, Validation Loss: 4.4288\n    Epoch 18/1000, Training Loss: 0.1380, Validation Loss: 4.4644\n    Epoch 19/1000, Training Loss: 0.1416, Validation Loss: 4.4763\n    Epoch 20/1000, Training Loss: 0.1313, Validation Loss: 4.4871\n    Epoch 21/1000, Training Loss: 0.1449, Validation Loss: 4.5029\n    Epoch 22/1000, Training Loss: 0.1263, Validation Loss: 4.5159\n    Epoch 23/1000, Training Loss: 0.1318, Validation Loss: 4.5258\n    Epoch 24/1000, Training Loss: 0.1360, Validation Loss: 4.5295\n    Epoch 25/1000, Training Loss: 0.1249, Validation Loss: 4.5281\n    Epoch 26/1000, Training Loss: 0.1133, Validation Loss: 4.5269\n    Epoch 27/1000, Training Loss: 0.1089, Validation Loss: 4.5249\n    Epoch 28/1000, Training Loss: 0.1204, Validation Loss: 4.5245\n    Epoch 29/1000, Training Loss: 0.1097, Validation Loss: 4.5249\n    Epoch 30/1000, Training Loss: 0.1081, Validation Loss: 4.5284\n    Epoch 31/1000, Training Loss: 0.1164, Validation Loss: 4.5314\n    Epoch 32/1000, Training Loss: 0.1179, Validation Loss: 4.5347\n    Epoch 33/1000, Training Loss: 0.1102, Validation Loss: 4.5373\n    Epoch 34/1000, Training Loss: 0.1226, Validation Loss: 4.5389\n    Epoch 35/1000, Training Loss: 0.1185, Validation Loss: 4.5399\n    Epoch 36/1000, Training Loss: 0.1061, Validation Loss: 4.5410\n    Epoch 37/1000, Training Loss: 0.1043, Validation Loss: 4.5426\n    Epoch 38/1000, Training Loss: 0.1174, Validation Loss: 4.5438\n    Epoch 39/1000, Training Loss: 0.1062, Validation Loss: 4.5447\n    Epoch 40/1000, Training Loss: 0.1018, Validation Loss: 4.5454\n    Epoch 41/1000, Training Loss: 0.1060, Validation Loss: 4.5461\n    Epoch 42/1000, Training Loss: 0.1026, Validation Loss: 4.5464\n    Epoch 43/1000, 
Training Loss: 0.1085, Validation Loss: 4.5464\n    Epoch 44/1000, Training Loss: 0.1022, Validation Loss: 4.5464\n    Epoch 45/1000, Training Loss: 0.1081, Validation Loss: 4.5464\n    Epoch 46/1000, Training Loss: 0.1130, Validation Loss: 4.5464\n    Epoch 47/1000, Training Loss: 0.1049, Validation Loss: 4.5465\n    Epoch 48/1000, Training Loss: 0.1106, Validation Loss: 4.5465\n    Epoch 49/1000, Training Loss: 0.1089, Validation Loss: 4.5465\n    Epoch 50/1000, Training Loss: 0.1093, Validation Loss: 4.5466\n    Epoch 51/1000, Training Loss: 0.1023, Validation Loss: 4.5466\n    Epoch 52/1000, Training Loss: 0.1131, Validation Loss: 4.5465\n    Epoch 53/1000, Training Loss: 0.1230, Validation Loss: 4.5465\n    Epoch 54/1000, Training Loss: 0.1135, Validation Loss: 4.5464\n    Epoch 55/1000, Training Loss: 0.1072, Validation Loss: 4.5464\n    Epoch 56/1000, Training Loss: 0.1164, Validation Loss: 4.5464\n    Epoch 57/1000, Training Loss: 0.1175, Validation Loss: 4.5465\n    Epoch 58/1000, Training Loss: 0.1066, Validation Loss: 4.5466\n    Epoch 59/1000, Training Loss: 0.1086, Validation Loss: 4.5466\n    Epoch 60/1000, Training Loss: 0.1089, Validation Loss: 4.5466\n    Epoch 61/1000, Training Loss: 0.1026, Validation Loss: 4.5466\n    Epoch 62/1000, Training Loss: 0.1066, Validation Loss: 4.5467\n    Epoch 63/1000, Training Loss: 0.1072, Validation Loss: 4.5467\n    Epoch 64/1000, Training Loss: 0.1030, Validation Loss: 4.5467\n    Epoch 65/1000, Training Loss: 0.1197, Validation Loss: 4.5467\n    Epoch 66/1000, Training Loss: 0.0998, Validation Loss: 4.5467\n    Epoch 67/1000, Training Loss: 0.1141, Validation Loss: 4.5467\n    Epoch 68/1000, Training Loss: 0.1155, Validation Loss: 4.5467\n    Epoch 69/1000, Training Loss: 0.1175, Validation Loss: 4.5467\n    Epoch 70/1000, Training Loss: 0.1138, Validation Loss: 4.5467\n    Epoch 71/1000, Training Loss: 0.1127, Validation Loss: 4.5467\n    Epoch 72/1000, Training Loss: 0.1137, Validation Loss: 4.5467\n    
Epoch 73/1000, Training Loss: 0.1152, Validation Loss: 4.5467\n    Epoch 74/1000, Training Loss: 0.1058, Validation Loss: 4.5467\n    Epoch 75/1000, Training Loss: 0.1071, Validation Loss: 4.5467\n    Epoch 76/1000, Training Loss: 0.1063, Validation Loss: 4.5467\n    Epoch 77/1000, Training Loss: 0.1093, Validation Loss: 4.5467\n    Epoch 78/1000, Training Loss: 0.1082, Validation Loss: 4.5467\n    Epoch 79/1000, Training Loss: 0.1061, Validation Loss: 4.5467\n    Epoch 80/1000, Training Loss: 0.1159, Validation Loss: 4.5467\n    Epoch 81/1000, Training Loss: 0.1031, Validation Loss: 4.5467\n    Epoch 82/1000, Training Loss: 0.1186, Validation Loss: 4.5467\n    Epoch 83/1000, Training Loss: 0.1028, Validation Loss: 4.5467\n    Epoch 84/1000, Training Loss: 0.1189, Validation Loss: 4.5467\n    Epoch 85/1000, Training Loss: 0.1181, Validation Loss: 4.5467\n    Epoch 86/1000, Training Loss: 0.1109, Validation Loss: 4.5467\n    Epoch 87/1000, Training Loss: 0.1093, Validation Loss: 4.5467\n    Epoch 88/1000, Training Loss: 0.1102, Validation Loss: 4.5467\n    Epoch 89/1000, Training Loss: 0.1140, Validation Loss: 4.5467\n    Epoch 90/1000, Training Loss: 0.1169, Validation Loss: 4.5467\n    Epoch 91/1000, Training Loss: 0.1058, Validation Loss: 4.5467\n    Epoch 92/1000, Training Loss: 0.1096, Validation Loss: 4.5467\n    Epoch 93/1000, Training Loss: 0.1181, Validation Loss: 4.5467\n    Epoch 94/1000, Training Loss: 0.1099, Validation Loss: 4.5467\n    Epoch 95/1000, Training Loss: 0.1133, Validation Loss: 4.5467\n    Epoch 96/1000, Training Loss: 0.1054, Validation Loss: 4.5467\n    Epoch 97/1000, Training Loss: 0.1197, Validation Loss: 4.5468\n    Epoch 98/1000, Training Loss: 0.1232, Validation Loss: 4.5468\n    Epoch 99/1000, Training Loss: 0.1164, Validation Loss: 4.5468\n    Epoch 100/1000, Training Loss: 0.1107, Validation Loss: 4.5468\n    Epoch 101/1000, Training Loss: 0.1075, Validation Loss: 4.5468\n    Epoch 102/1000, Training Loss: 0.1197, Validation 
Loss: 4.5468\n    Early stopping after 102 epochs\n    \n    Re-testing the RAG system with trained model:\n    Input to the transformer model:\n    What is reverse engineering? Ghidra is an open-source reverse engineering tool developed by the NSA, featuring a powerful decompiler. Reverse engineering is the process of analyzing a system to identify its components and their interrelationships.\n    Chain of Thought:\n    1. the query is: 'What is reverse engineering?'\n    2. retrieved the following documents:\n       1. Ghidra is an open-source reverse engineering tool developed by the NSA, featuring a powerful decompiler.\n       2. Reverse engineering is the process of analyzing a system to identify its components and their interrelationships.\n    3. analyzing the retrieved documents to answer the query.\n    4. formulating the answer based on the information.\n    \n    Answer:\n    reverse engineering is the process of analyzing a system to identify its components and their interrelationships. \u003ceos\u003e\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmytechnotalent%2Fknet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmytechnotalent%2Fknet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmytechnotalent%2Fknet/lists"}