{"id":26016629,"url":"https://github.com/taabishhh/llm_training","last_synced_at":"2026-05-06T19:37:18.222Z","repository":{"id":280793470,"uuid":"882078739","full_name":"taabishhh/LLM_Training","owner":"taabishhh","description":"This project implements a distributed pipeline for NLP model training using Apache Spark and DeepLearning4J (DL4J). The methodology utilizes a sliding window approach for data preparation, positional embeddings for token encoding, and Word2Vec model training with parallel processing. The model and training process is designed for scalability and op","archived":false,"fork":false,"pushed_at":"2024-11-17T06:41:31.000Z","size":17778,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-05T10:50:08.408Z","etag":null,"topics":["apache-spark","deeplearning4j","dl4j","llm","llm-training","logback-classic","mapreduce-scala","scalatest","sliding-window","tensorflow","word2vec"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/taabishhh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-01T20:50:30.000Z","updated_at":"2025-03-03T18:35:17.000Z","dependencies_parsed_at":"2025-03-05T11:01:34.954Z","dependency_job_id":null,"html_url":"https://github.com/taabishhh/LLM_Training","commit_stats":null,"previous_names":["taabishhh/llm_training"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/taabi
shhh%2FLLM_Training","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/taabishhh%2FLLM_Training/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/taabishhh%2FLLM_Training/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/taabishhh%2FLLM_Training/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/taabishhh","download_url":"https://codeload.github.com/taabishhh/LLM_Training/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242145752,"owners_count":20079200,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","deeplearning4j","dl4j","llm","llm-training","logback-classic","mapreduce-scala","scalatest","sliding-window","tensorflow","word2vec"],"created_at":"2025-03-06T04:22:27.554Z","updated_at":"2026-05-06T19:37:18.217Z","avatar_url":"https://github.com/taabishhh.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CS 441-Homework2 (UIC): NLP Training Pipeline with Apache Spark and DL4J\n\n**Author:** Taabish Sutriwala  \n**UIN:** 673379837  \n**Email:** tsutr@uic.edu  \n\n## Project Overview\nThis project implements a distributed pipeline for NLP model training using Apache Spark and DeepLearning4J (DL4J). The methodology utilizes a sliding window approach for data preparation, positional embeddings for token encoding, and Word2Vec model training with parallel processing. 
The model and training process are designed for scalability and optimized for large datasets.\n\n### Methodology\n1. **Data Preprocessing**:  \n   - The dataset is loaded as CSV, with each row containing a token and its embeddings.  \n   - Text is split into sentences, which are further segmented into words. Tokens and embeddings are grouped to form structured sentences.\n\n2. **Sliding Window with Positional Embeddings**:  \n   - A sliding window approach is applied to sentences, creating fixed-size windows for each sequence.  \n   - Positional embeddings are added to account for token positions within the window, enhancing sequence understanding.\n\n3. **Model Training (Word2Vec)**:  \n   - The model trains using the sliding window embeddings as inputs and associated next tokens as target outputs.\n   - Apache Spark enables distributed training, leveraging DL4J for efficient neural network operations.\n\n4. **Performance Monitoring and Metrics**:  \n   - During training, statistics like accuracy, loss, and runtime are logged for analysis.  \n   - Additional metrics like convergence rate and model size provide insights into training effectiveness and efficiency.\n\n### Partitioning\nData is partitioned by sentences, with each partition consisting of a series of tokens and corresponding embeddings. Each sliding window operation extracts a subset of tokens and their embeddings as the training input, with the next token in the sequence as the prediction target.\n\n### Input and Output\n- **Input**: A CSV file of token embeddings with the header `token,embedding_dim_0,embedding_dim_1,...,embedding_dim_n` and one row per token, for example `the,0.009552239,0.08198426,...,-0.32604042`.\n\n- **Output**: A CSV file (`sliding_window_data.csv`) containing structured sliding window data with the columns `inputWindowTokens,inputEmbeddings,targetToken,targetEmbedding`.\n\n\n### Installation\n\n1. **Clone the Repository**:\n   ```bash\n   git clone \u003crepository-url\u003e\n   cd Exercises441\n   ```\n2. 
**Install Dependencies**: Ensure SBT is installed. SBT resolves dependencies during the build.\n\n3. **Configure Paths**: Update the input and output file paths in ConfigLoader.\n\n4. **Build the Project**:\n```\nsbt clean compile\nsbt assembly\n```\n\n### Running the Project\nTo execute the program:\n```\nsbt \"run \u003cinputPath\u003e \u003coutputPath\u003e\"\n```\nThe application executes the following steps:\n\n1. SlidingWindowSpark: Generates sliding window data with positional embeddings.\n2. TrainingWithSlidingWindowSpark: Utilizes sliding window data for model training.\n\n### Dependencies\n```\nThisBuild / version := \"0.1.0-SNAPSHOT\"\nThisBuild / scalaVersion := \"2.12.13\"\n\nlazy val root = (project in file(\".\"))\n  .settings(\n    name := \"Exercises441\"\n  )\n\n// Hadoop, Spark, DL4J, TensorFlow, CSV Handling, Logging\nlibraryDependencies ++= Seq(\n  \"org.apache.hadoop\" % \"hadoop-common\" % \"3.3.6\",\n  \"org.apache.hadoop\" % \"hadoop-mapreduce-client-core\" % \"3.3.6\",\n  \"org.apache.spark\" %% \"spark-core\" % \"3.5.3\",\n  \"org.deeplearning4j\" % \"deeplearning4j-core\" % \"1.0.0-M2.1\",\n  \"org.deeplearning4j\" %% \"dl4j-spark\" % \"1.0.0-M2.1\",\n  \"org.tensorflow\" % \"tensorflow\" % \"1.15.0\",\n  \"org.apache.commons\" % \"commons-csv\" % \"1.9.0\",\n  \"ch.qos.logback\" % \"logback-classic\" % \"1.5.6\",\n  \"org.scalatest\" %% \"scalatest\" % \"3.2.19\" % \"test\"\n)\n```\n### Performance Metrics Collection\nDuring training, the following statistics are logged for analysis:\n\n- **Training Accuracy and Loss**: Measures model convergence over epochs.\n- **Runtime Performance**: Captures model execution time for different stages.\n- **Model Size and Parameters**: Tracks model size and parameter count for resource evaluation.\n- **Memory and CPU Utilization**: Monitors system resource usage to assess load balancing and scalability.\n\n### Repository Structure\n\n\u003cimg width=\"394\" alt=\"Screenshot 2024-11-05 at 11 35 22 AM\" 
src=\"https://github.com/user-attachments/assets/f88ee70a-439b-4d4d-9adf-70c23bc8778b\"\u003e\n\n### Link to Video Demonstration\n[Link to Video](https://youtu.be/O8bY4Lq_f7E)\n\n### Notes\nEnsure Apache Hadoop, Spark, and DL4J libraries are configured and accessible. Update ConfigLoader for dataset paths, and monitor log outputs for metric collection.\n\nFor additional information, contact: Taabish Sutriwala at tsutr@uic.edu.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftaabishhh%2Fllm_training","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftaabishhh%2Fllm_training","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftaabishhh%2Fllm_training/lists"}