{"id":44052309,"url":"https://github.com/pulp-platform/pulp-trainlib","last_synced_at":"2026-02-07T23:36:01.695Z","repository":{"id":49498246,"uuid":"493144380","full_name":"pulp-platform/pulp-trainlib","owner":"pulp-platform","description":"Floating-Point Optimized On-Device Learning Library for the PULP Platform.","archived":false,"fork":false,"pushed_at":"2025-12-05T10:23:16.000Z","size":31393,"stargazers_count":37,"open_issues_count":6,"forks_count":18,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-12-08T19:49:18.523Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pulp-platform.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2022-05-17T07:34:48.000Z","updated_at":"2025-12-05T10:23:12.000Z","dependencies_parsed_at":"2024-04-15T13:03:17.593Z","dependency_job_id":"33d2c662-3c88-4ff7-ae8c-7ab2ab3071dd","html_url":"https://github.com/pulp-platform/pulp-trainlib","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/pulp-platform/pulp-trainlib","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pulp-platform%2Fpulp-trainlib","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pulp-platform%2Fpulp-trainlib/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pulp-platform%2Fpulp-trainlib/releases
","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pulp-platform%2Fpulp-trainlib/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pulp-platform","download_url":"https://codeload.github.com/pulp-platform/pulp-trainlib/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pulp-platform%2Fpulp-trainlib/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29212754,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-07T23:14:30.912Z","status":"ssl_error","status_checked_at":"2026-02-07T23:14:17.253Z","response_time":63,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-02-07T23:36:01.127Z","updated_at":"2026-02-07T23:36:01.689Z","avatar_url":"https://github.com/pulp-platform.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PULP-TrainLib\n\nPULP-TrainLib is the first Deep Neural Network (DNN) training library for multi-core RISC-V MCUs (as PULP), enabling On-Device Learning on ultra-low-power devices of this class.\n\nPULP-TrainLib features a variety of training primitives and support functions to enable backpropagation-based training on multicore MCUs. 
More in depth:\n\n- A set of performance-tunable DNN layer primitives for training, based on Matrix Multiplication (MM). \n- Commonly used loss functions, like MSE and CrossEntropy. \n- SGD-based optimizers.\n- Activation (ReLU, etc) and support functions.\n\n[PULP-TrainLib](./lib/) is fully released as open-source under [Apache License Version 2.0](./LICENSE).\n\nTo ease the deployment of DNN training tasks on MCUs, PULP-TrainLib is equipped with additional tools:\n\n- [TrainLib_Deployer](./tools/TrainLib_Deployer/), an automated code-generation tool to generate the C code to validate and train a user-specified DNN model on a PULP architecture. \n- [AutoTuner](./tools/AutoTuner/), a pre-deployment tool to select the fastest configuration of each layer of a DNN model, according to the shapes of the layer, the training step and the tiling strategy.\n\nIf you use any part of PULP-TrainLib , please cite:\n```\n@InProceedings{10.1007/978-3-031-15074-6_13,\nauthor=\"Nadalini, Davide\nand Rusci, Manuele\nand Tagliavini, Giuseppe\nand Ravaglia, Leonardo\nand Benini, Luca\nand Conti, Francesco\",\neditor=\"Orailoglu, Alex\nand Reichenbach, Marc\nand Jung, Matthias\",\ntitle=\"PULP-TrainLib: Enabling On-Device Training for RISC-V Multi-core MCUs Through Performance-Driven Autotuning\",\nbooktitle=\"Embedded Computer Systems: Architectures, Modeling, and Simulation\",\nyear=\"2022\",\npublisher=\"Springer International Publishing\",\naddress=\"Cham\",\npages=\"200--216\",\nabstract=\"An open challenge in making Internet-of-Things sensor nodes ``smart'' and self-adaptive is to enable on-chip Deep Neural Network (DNN) training on Ultra-Low-Power (ULP) microcontroller units (MCUs). To this aim, we present a framework, based on PULP-TrainLib, to deploy DNN training tasks on RISC-V-based Parallel-ULP (PULP) MCUs. PULP-TrainLib is a library of parallel software DNN primitives enabling the execution of forward and backward steps on PULP MCUs. 
To optimize PULP-TrainLib's kernels, we propose a strategy to automatically select and configure (autotune) the fastest among a set of tiling options and optimized floating-point matrix multiplication kernels, according to the tensor shapes of every DNN layer. Results on an 8-core RISC-V MCU show that our auto-tuned primitives improve MAC/clk by up to 2.4{\\texttimes} compared to ``one-size-fits-all'' matrix multiplication, achieving up to 4.39 MAC/clk - 36.6{\\texttimes} better than a commercial STM32L4 MCU executing the same DNN layer training workload. Furthermore, our strategy proves to be 30.7{\\texttimes} faster than AIfES, a state-of-the-art training library for MCUs, while training a complete TinyML model.\",\nisbn=\"978-3-031-15074-6\"\n}\n```\n\nThis repository is released under the [Apache License Version 2.0](./LICENSE).\n\n## PULP-TrainLib's training library\n\nPULP-TrainLib is the first open-source training library for RISC-V-based multicore MCUs, including a set of performance-tunable DNN layer primitives to enable DNN training on ultra-low-power devices. The training flow of PULP-TrainLib's primitives follows the canonical backpropagation algorithm and currently uses a streaming approach (batch size = 1). First, the Forward (FW) step computes the prediction for a given input. Then, the Backward step computes, layer by layer, the gradient of the loss function with respect to the weights (WG-BW) and the gradient of the loss function with respect to the input (IG-BW). The structure of the training flow of a single layer (e.g. a Fully-Connected layer) is depicted as follows:\n\n![PULP-TrainLib's Primitives](./assets/img/pulp-trainlib-primitives.png)\n\nNote that every training step for most of the layers is implemented as a Matrix Multiplication (MM) between tensor data. 
For example, for Conv2D and Fully-Connected layers, the structure and sizes of the involved matrices can be represented as follows:\n\n![MM-based training primitives](./assets/img/pulp-trainlib-mm-flow.png)\n\nConvolutions are implemented as Image-to-Column (or Image-to-Row) pre-processed data + MM. The sizes of the tensors are denoted as (CI, HI, WI) for the input feature map, (CO, CI, Hk, Wk) for the weights, and (CO, HO, WO) for the output feature map. To tune the performance of the training primitives, specific optimizations can be selected case-by-case for the MM algorithm. \n\n\n## The TrainLib_Deployer\n\nThe development of C code for running On-Device Learning can be a time-consuming process. To make deployment easier, PULP-TrainLib provides TrainLib_Deployer, a code generation tool which creates all the necessary files to run DNN validation and training on a [PULP](https://pulp-platform.org/)-based MCU. To minimize the memory occupation, the TrainLib_Deployer adopts a data-reuse approach to store tensors in C arrays. The flow of the TrainLib_Deployer is illustrated as follows:\n\n![TrainLib_Deployer](./assets/img/trainlib-deployer.png)\n\nThe input arguments of the TrainLib_Deployer are the architecture of the model to be trained on an MCU and the setup (memory and number of cores) of the target device. The tool assumes that the On-Device Learning routine runs on an MCU equipped with N parallel cores for computation. While running, the tool checks whether the model fits in the available memory. As output, the tool generates a project folder containing the code to run a verification task of the target DNN model on the target device (a PyTorch Golden Model, or GM, the C code, and a Makefile). 
To select the best optimization for a given training step and tile size, PULP-TrainLib provides the AutoTuner, which exhaustively searches for the fastest kernel among the [library of available optimized MM kernels](lib/include/pulp_matmul_fp32.h). AutoTuner's flow can be represented as follows:\n\n![AutoTuner](./assets/img/autotuner.png)\n\nGiven the properties of the target device and the layer and training-step information for a generic layer (e.g. 8 cores, 64kB, Conv2D, Forward), AutoTuner exhaustively searches for the fastest tile shape that fits the specified memory amount and for the MM kernel that minimizes the latency on the given tile shape. For further information, readers may refer to \"PULP-TrainLib: Enabling On-Device Training for RISC-V Multi-Core MCUs through Performance-Driven Autotuning\" [SAMOS Pre-Print Version](https://www.samos-conference.com/Resources_Samos_Websites/Proceedings_Repository_SAMOS/2022/Papers/Paper_14.pdf).\n\n\n\n\n# Repository overview\n\nPULP-TrainLib's library files are located under the `lib/` folder ([lib's README](lib/README.md)).\n\nThe `tests/` folder provides useful tests to try out and verify PULP-TrainLib's layers and functions (tests are performed with respect to a PyTorch Golden Model).\nEach test can be customized according to the user specifications and profiles the execution of the layer's primitives with PULP's performance counters.\nFor further information, please refer to the [test's README](tests/README.md).\n\nThe `tools/` folder contains useful tools which ease the usage of PULP-TrainLib, such as the TrainLib_Deployer and AutoTuner. For further information, please refer to [tools' README](tools/README.md).\n\nThe `assets/` folder contains useful support files for PULP-TrainLib. Inside [CI_test_suite](assets/CI_test_suite/), users can find a testing environment that can be used to verify PULP-TrainLib's primitives for Continuous Integration (TO BE COMPLETED). 
\n\n\n# Tutorials\n\nTo learn how to generate code with the TrainLib_Deployer and to find more details about the optimizations used in this library, a [tutorial repository](https://github.com/dnadalini/PULP-TrainLib-Tutorial) is available online. This repository contains tutorials and a guide to easily install a conda environment with all the necessary requirements to run PULP-TrainLib.\n\n\n\n# Installation and requirements\n\n## PULP-SDK\n\nPULP-TrainLib requires [PULP-SDK](https://github.com/pulp-platform/pulp-sdk) and the [RISC-V GNU GCC TOOLCHAIN](https://github.com/pulp-platform/pulp-riscv-gnu-toolchain) to be used and compiled.\nPlease refer to the links to correctly set up your working environment.\n\n## Python - PyTorch requirements\n\nTo successfully run the tests, Python (\u003e= 3.6) is needed, together with PyTorch (\u003e= 1.9.0). To install the dependencies (with CPU only), run:\n\n```\npython -m pip install argparse \npython -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu\npython -m pip install torchsummary\n```\n\nIf you require the GPU (CUDA \u003e= 10.2) version for your applications, instead run:\n\n```\npython -m pip install argparse \npython -m pip install torch torchvision torchaudio\npython -m pip install torchsummary\n```\n\nThe tests have been verified using torch version \"1.9.0+cpu\".\n\n\n## PULP-TrainLib\n\nTo get started with PULP-TrainLib, just clone this repository on your local PC.\n\nBefore compiling any project, source `pulp-sdk/configs/pulp_open.sh` from the terminal from which you intend to compile your project. 
\nThe `configs/` folder is located inside the path to your pulp-sdk directory.\n\nWhen generating a DNN for PULP with the TrainLib_Deployer, make sure to launch the Python task from a terminal in which you did not source `pulp_open.sh`.\n\n\n# Testing and verification\n\nTo add new functionality, users can follow the naming convention of PULP-TrainLib and provide [primitives](lib/) and a related test inside the `tests/` folder. To integrate new features, we recommend extending the [continuous integration test suite](assets/CI_test_suite/test_suite.py) to functionally verify the primitives before integration.\n\n\n\n# Branches\n\nPULP-TrainLib's repository is organized with these branches:\n- `main`: main branch, targeting PULP architectures.\n- `trainlib-tutorial`: branch reserved for tutorial purposes (see [https://github.com/dnadalini/PULP-TrainLib-Tutorial](https://github.com/dnadalini/PULP-TrainLib-Tutorial)).\n- `pulp-trainlib-paper`: branch to reproduce the results provided in the paper [\"PULP-TrainLib: Enabling On-Device Training for RISC-V Multi-Core MCUs through Performance-Driven Autotuning\"](https://www.samos-conference.com/Resources_Samos_Websites/Proceedings_Repository_SAMOS/2022/Papers/Paper_14.pdf).\n- `pulp-trainlib-stm32`: this is a PULP-TrainLib port compatible with STM32 and other MCUs (FP32 format only).\n\n\n\n# Available features status log\n\n\u003e Note: checked items are complete; unchecked items are ongoing or buggy\n\nPULP-TrainLib:\n\n- [X] Forward passes for DepthWise, PointWise Convolutions and Conv2D, Fully-Connected, Transposed Convolution 2D (FP32, FP16)\n- [X] Weight gradients for DepthWise, PointWise Convolutions and Conv2D, Fully-Connected, Transposed Convolution 2D (FP32, FP16)\n- [X] Input gradients for DepthWise, PointWise Convolutions and Conv2D, Fully-Connected, Transposed Convolution 2D (FP32, FP16)\n- [X] CWH data layout for DepthWise, PointWise and 2D Convolutions, Transposed Convolution 2D (FP32, FP16)\n- [X] HWC 
data layout for PointWise Convolution (FP32, FP16) and 2D Convolutions (FP32, FP16)\n- [X] Stride and Padding (only naive 2D Convolutions, without im2col+mm optimization)\n- [X] ReLU, Leaky ReLU, Sigmoid activation functions (FP32, FP16)\n- [X] Gradient Descent optimizer (FP32, FP16) with weight decay\n- [X] L1Loss, MSE Loss, berHu Loss (FP32, FP16)\n- [ ] CrossEntropyLoss (FP32, FP16)\n- [X] Max and Average Pooling (FP32, FP16)\n- [X] RNN training primitives (FP32)\n- [X] Multihead Self Attention training primitives (FP32)\n- [X] Residual connection (FP32, FP16)\n- [X] InstanceNorm (FP32, FP16)\n- [X] Biases for Conv2D (FP32, FP16) \n- [ ] Biases for Fully-Connected, Weight and Input grad steps (forward bugged) (FP32, FP16)\n- [ ] Padding operators for DepthWise and 2D Convolution (im2col + mm)\n- [ ] HWC data layout management for DepthWise Convolution (FP32, FP16)\n- [ ] Stride operators for 2D Convolutions and DepthWise (im2col + mm)\n- [ ] RNN training primitives (FP16)\n- [ ] Multihead Self Attention training primitives (FP16)\n- [ ] Biases for DepthWise and PointWise Convolutions (FP32, FP16)\n- [ ] Sparse Update (layer-wise) in TrainLib_Deployer\n- [ ] Partial Im2Col / Im2Row for Conv2D (FP32, FP16)\n\nTrainLib_Deployer:\n\n- [X] No Buffer and Single Buffer mode, supporting layer-wise execution (tiling not supported)\n- [X] Conv2D, PointWise, DepthWise Convolutions, Fully-Connected support (FP32, FP16)\n- [X] Average and Max Pooling (FP32, FP16)\n- [X] ReLU, LeakyReLU, Sigmoid Activations (FP32, FP16)\n- [X] InstanceNorm (FP32, FP16)\n- [X] Residual Connections (FP32, FP16, only no buffer mode)\n- [ ] Residual Connections (FP32, FP16, single buffer mode)\n- [X] SGD Optimizer (FP32, FP16)\n- [ ] FP32-FP16 Layer-Wise Mixed Precision Mode\n- [X] Layer-Wise Sparse Update\n- [X] CHW Data Layout\n- [ ] HWC Data Layout\n- [X] Online Learning (batch size = 1)\n- [ ] Mini-Batch Learning (batch size \u003e 1)\n\n# Known bugs / issues (open for contributions)\n\n- 
AutoTuner working with \"NUM_TILING_SOLUTIONS = 1\"\n- Sporadic bugs in \"mm_u2\" in FP32 (mostly on leftovers)\n- Performance bugs in im2col/im2row with DMA loading (performance tends to be lower than im2col/im2row with cores)\n- Missing integration for RNN / MHSA in TrainLib_Deployer\n- FP32 MHSA primitives (Input Grad)\n- Missing integration of sigmoid function in TrainLib_Deployer\n- Performance of the FP16 sigmoid may need to be optimized with an FP16 exponential (e.g., https://github.com/0xBYTESHIFT/fp16/blob/master/include/half/half.hpp)\n\nTrainLib_Deployer:\n- Training does not converge in DNNs generated with TrainLib_Deployer if the last layer is not updated \n- With no single/double buffering, not updating a PW layer in a sparse update results in wrong backward computation\n\n\n# Contributors\n\n## Currently active\n\n- Davide Nadalini (d.nadalini@unibo.it, davide.nadalini@polito.it)\n- Alberto Dequino (alberto.dequino@unibo.it, alberto.dequino@polito.it)\n- Manuele Rusci (manuele.rusci@kuleuven.be)\n- Francesco Conti (f.conti@unibo.it)\n- Cristian Cioflan (cioflanc@iis.ee.ethz.ch)\n- Luca Bompani (luca.bompani5@unibo.it)\n- Lan Mei (lanmei@student.ethz.ch)\n- Calin Diaconu (calin.diaconu@studio.unibo.it)\n\n## Past Contributors\n\n- Giacomo Saporetti (giacomo.saporetti@studio.unibo.it)\n- Francesco Conoscenti (francesco.conoscenti@studio.unibo.it)\n- Leonardo Ravaglia (leonardo.ravaglia2@unibo.it)\n\n\n# References\n\n\u003e D. Nadalini, M. Rusci, L. Benini, and F. Conti, \"Reduced Precision Floating-Point Optimization for Deep Neural Network On-Device Learning on MicroControllers\" [ArXiv Pre-Print](https://arxiv.org/abs/2305.19167)\n\u003e \n\u003e D. Nadalini, M. Rusci, G. Tagliavini, L. Ravaglia, L. Benini, and F. 
Conti, \"PULP-TrainLib: Enabling On-Device Training for RISC-V Multi-Core MCUs through Performance-Driven Autotuning\" [SAMOS Pre-Print Version](https://www.samos-conference.com/Resources_Samos_Websites/Proceedings_Repository_SAMOS/2022/Papers/Paper_14.pdf), [Springer Published Version](https://link.springer.com/chapter/10.1007/978-3-031-15074-6_13)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpulp-platform%2Fpulp-trainlib","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpulp-platform%2Fpulp-trainlib","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpulp-platform%2Fpulp-trainlib/lists"}