{"id":14964551,"url":"https://github.com/vitoplantamura/onnxstream","last_synced_at":"2025-05-14T11:10:02.031Z","repository":{"id":181771175,"uuid":"666197652","full_name":"vitoplantamura/OnnxStream","owner":"vitoplantamura","description":"Lightweight inference library for ONNX files, written in C++. It can run Stable Diffusion XL 1.0 on a RPI Zero 2 (or in 298MB of RAM) but also Mistral 7B on desktops and servers. ARM, x86, WASM, RISC-V supported. Accelerated by XNNPACK.","archived":false,"fork":false,"pushed_at":"2025-03-29T09:51:04.000Z","size":34903,"stargazers_count":1927,"open_issues_count":56,"forks_count":89,"subscribers_count":29,"default_branch":"master","last_synced_at":"2025-04-03T21:51:15.386Z","etag":null,"topics":["llama","machine-learning","mistral","onnx","raspberry-pi","stable-diffusion","tinyml","wasm","webassembly","yolov8"],"latest_commit_sha":null,"homepage":"https://yolo.vitoplantamura.com/","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vitoplantamura.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-07-14T00:21:50.000Z","updated_at":"2025-04-01T11:31:45.000Z","dependencies_parsed_at":"2023-12-14T20:41:31.011Z","dependency_job_id":"1db472fd-0ff1-4eac-ba42-991ff9db7d77","html_url":"https://github.com/vitoplantamura/OnnxStream","commit_stats":{"total_commits":58,"total_committers":4,"mean_commits":14.5,"dds":"0.12068965517241381","last_synced_commit":"bea90d59b59a1f1d2e83b52bacfed6c881e71256"},"previous_names":["vitoplantamura/onnxstream"],"tags_count":2,"template":false,"template_full_n
ame":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vitoplantamura%2FOnnxStream","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vitoplantamura%2FOnnxStream/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vitoplantamura%2FOnnxStream/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vitoplantamura%2FOnnxStream/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vitoplantamura","download_url":"https://codeload.github.com/vitoplantamura/OnnxStream/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248339691,"owners_count":21087302,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llama","machine-learning","mistral","onnx","raspberry-pi","stable-diffusion","tinyml","wasm","webassembly","yolov8"],"created_at":"2024-09-24T13:33:22.234Z","updated_at":"2025-05-14T11:10:02.018Z","avatar_url":"https://github.com/vitoplantamura.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"#### News 📣\n\n- April 15, 2025: Added WebAssembly demo of **OpenAI's Whisper** [here](https://github.com/vitoplantamura/OnnxStream/tree/master/examples/Whisper_wasm) (running in the browser).\n- September 19, 2024: Added WebAssembly support for the library! 
Demo of the **YOLOv8** object detection model [here](https://yolo.vitoplantamura.com/) (running in the browser).\n- January 14, 2024: Added LLM chat application (**TinyLlama 1.1B and Mistral 7B**) with initial GPU support! More info [here](https://github.com/vitoplantamura/OnnxStream/blob/master/assets/LLM.md).\n- December 14, 2023: Added support for **Stable Diffusion XL Turbo 1.0**! (thanks to @AeroX2)\n- October 3, 2023: Added support for **Stable Diffusion XL 1.0 Base**!\n\n#### Index 👇\n\n- [Introduction](https://github.com/vitoplantamura/OnnxStream#onnxstream)\n- **[Stable Diffusion 1.5](https://github.com/vitoplantamura/OnnxStream#stable-diffusion-15)**\n- **[Stable Diffusion XL 1.0 Base](https://github.com/vitoplantamura/OnnxStream#stable-diffusion-xl-10-base)**\n- **[Stable Diffusion XL Turbo 1.0](https://github.com/vitoplantamura/OnnxStream#stable-diffusion-xl-turbo-10)**\n- **[TinyLlama 1.1B and Mistral 7B](https://github.com/vitoplantamura/OnnxStream/blob/master/assets/LLM.md)**\n- **[YOLOv8](https://yolo.vitoplantamura.com/)** (running in the browser)\n- **[OpenAI's Whisper](https://github.com/vitoplantamura/OnnxStream/tree/master/examples/Whisper_wasm)** (running in the browser)\n- [Features of OnnxStream](https://github.com/vitoplantamura/OnnxStream#features-of-onnxstream)\n- [Performance](https://github.com/vitoplantamura/OnnxStream#performance)\n- [Attention Slicing and Quantization](https://github.com/vitoplantamura/OnnxStream#attention-slicing-and-quantization)\n- [How OnnxStream Works](https://github.com/vitoplantamura/OnnxStream#how-onnxstream-works)\n- **[How to Build](https://github.com/vitoplantamura/OnnxStream#how-to-build-the-stable-diffusion-example-on-linuxmacwindowstermuxfreebsd)** (Linux/Mac/Windows/Termux/FreeBSD)\n- [How to Convert SD 1.5 Model](https://github.com/vitoplantamura/OnnxStream#how-to-convert-and-run-a-custom-stable-diffusion-15-model-with-onnxstream-by-gaelicthunder)\n- [Related 
Projects](https://github.com/vitoplantamura/OnnxStream#related-projects)\n- [Credits](https://github.com/vitoplantamura/OnnxStream#credits)\n\n# OnnxStream\n\nThe challenge is to run [Stable Diffusion](https://github.com/CompVis/stable-diffusion) 1.5, which includes a large transformer model with almost 1 billion parameters, on a [Raspberry Pi Zero 2](https://www.raspberrypi.com/products/raspberry-pi-zero-2-w/), which is a microcomputer with 512MB of RAM, without adding more swap space and without offloading intermediate results to disk. The recommended minimum RAM/VRAM for Stable Diffusion 1.5 is typically 8GB.\n\nGenerally, major machine learning frameworks and libraries focus on minimizing inference latency and/or maximizing throughput, both at the cost of RAM usage. So I decided to write a super small and hackable inference library specifically focused on minimizing memory consumption: OnnxStream.\n\nOnnxStream is based on the idea of decoupling the inference engine from the component responsible for providing the model weights, which is a class derived from `WeightsProvider`. A `WeightsProvider` specialization can implement any type of loading, caching and prefetching of the model parameters. For example, a custom `WeightsProvider` can decide to download its data from an HTTP server directly, without loading or writing anything to disk (hence the word \"Stream\" in \"OnnxStream\"). Three default `WeightsProviders` are available: `DiskNoCache`, `DiskPrefetch` and `Ram`.\n\n**OnnxStream can consume up to 55x less memory than OnnxRuntime with only a 50% to 200% increase in latency** (on CPU, with a good SSD, with reference to SD 1.5's UNET; see the Performance section below).\n\n# Stable Diffusion 1.5\n\nThese images were generated by the Stable Diffusion example implementation included in this repo, using OnnxStream, at different precisions of the VAE decoder. 
The VAE decoder is the only model of Stable Diffusion 1.5 that could not fit into the RAM of the Raspberry Pi Zero 2 in single or half precision. This is caused by the presence of residual connections and very big tensors and convolutions in the model. The only solution was static quantization (8 bit). The third image was generated by my RPI Zero 2 in about ~~3 hours~~ 1.5 hours (using the MAX_SPEED option when compiling). The first image was generated on my PC using the same latents generated by the RPI Zero 2, for comparison:\n\nVAE decoder in W16A16 precision:\n\n![W16A16 VAE Decoder](https://raw.githubusercontent.com/vitoplantamura/OnnxStream/master/assets/output_W16A16.png)\n\nVAE decoder in W8A32 precision:\n\n![W8A32 VAE Decoder](https://raw.githubusercontent.com/vitoplantamura/OnnxStream/master/assets/output_W8A32.png)\n\nVAE decoder in W8A8 precision, generated by my RPI Zero 2 in about ~~3 hours~~ 1.5 hours (using the MAX_SPEED option when compiling):\n\n![W8A8 VAE Decoder](https://raw.githubusercontent.com/vitoplantamura/OnnxStream/master/assets/output_W8A8.png)\n\n# Stable Diffusion XL 1.0 (base)\n\nThe OnnxStream Stable Diffusion example implementation now supports SDXL 1.0 (without the Refiner). The ONNX files were exported from the SDXL 1.0 implementation of the Hugging Face's [Diffusers](https://github.com/huggingface/diffusers) library (version 0.19.3).\n\nSDXL 1.0 is significantly more computationally expensive than SD 1.5. The most significant difference is the ability to generate 1024x1024 images instead of 512x512. To give you an idea, generating a 10-steps image with HF's Diffusers takes 26 minutes on my 12-core PC with 32GB of RAM. The minimum recommended VRAM for SDXL is typically 12GB.\n\n**OnnxStream can run SDXL 1.0 in less than 300MB of RAM and therefore is able to run it comfortably on a RPI Zero 2**, without adding more swap space and without writing anything to disk during inference. 
Generating a 10-steps image takes about 11 hours on my RPI Zero 2.\n\n#### SDXL Specific Optimizations\n\nThe same set of optimizations for SD 1.5 has been used for SDXL 1.0, but with the following differences.\n\nAs for the UNET model, in order to make it run in less than 300MB of RAM on the RPI Zero 2, UINT8 dynamic quantization is used, but limited to a specific subset of large intermediate tensors.\n\nThe situation for the VAE decoder is more complex than for SD 1.5. SDXL 1.0's VAE decoder is 4x the size of SD 1.5's, and consumes 4.4GB of RAM when run with OnnxStream in FP32 precision.\n\nIn the case of SD 1.5 the VAE decoder is statically quantized (UINT8 precision) and this is enough to reduce RAM consumption to 260MB. Instead, the SDXL 1.0's VAE decoder overflows when run with FP16 arithmetic and the numerical ranges of its activations are too large to get good quality images with UINT8 quantization.\n\nSo we are stuck with a model that consumes 4.4GB of RAM, which cannot be run in FP16 precision and which cannot be quantized in UINT8 precision. (NOTE: there is at least [one solution](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix) to the FP16 problem, but I have not investigated further since even running the VAE decoder in FP16 precision, the total memory consumed would be divided by 2, so the model would ultimately consume 2.2GB instead of 4.4GB, which is still way too much for the RPI Zero 2)\n\nThe inspiration for the solution came from the implementation of the VAE decoder of the Hugging Face's [Diffusers](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/autoencoder_kl.py) library, i.e. using tiled decoding. The final result is absolutely indistinguishable from an image decoded by the full decoder and in this way it is possible to reduce RAM memory consumption from 4.4GB to 298MB!\n\nThe idea is simple. The result of the diffusion process is a tensor with shape (1,4,128,128). 
The idea is to split this tensor into 5x5 (therefore 25) overlapping tensors with shape (1,4,32,32) and to decode these tensors separately. Each of these tensors is overlapped by 25% on the tile to its left and the one above. The decoding result is a tensor with shape (1,3,256,256) which is then appropriately blended into the final image.\n\nFor example, this is an image generated by the tiled decoder with blending manually turned off in the code. **You can clearly see the tiles in the image:**\n\n![SDXL Output with Tiles](https://raw.githubusercontent.com/vitoplantamura/OnnxStream/master/assets/sdxl_tiles.png)\n\nWhile this is the same image with blending turned on. **This is the final result:**\n\n![SDXL Output without Tiles](https://raw.githubusercontent.com/vitoplantamura/OnnxStream/master/assets/sdxl_without_tiles.png)\n\nThis is another image generated by my RPI Zero 2 in about 11 hours: (10 steps, Euler Ancestral)\n\n![SDXL Output generated by RPI Zero 2](https://raw.githubusercontent.com/vitoplantamura/OnnxStream/master/assets/sdxl_out_1.png)\n\n# Stable Diffusion XL Turbo 1.0\n\nSupport for SDXL Turbo was contributed by the kind [@AeroX2](https://github.com/AeroX2).\n\nThe main difference between SDXL and SDXL Turbo is that the Turbo version generates 512x512 images instead of 1024x1024, but with a much lower number of steps. It is possible to get good quality images even with just one step!\n\nNo additional optimizations compared to SDXL 1.0 were required to run SDXL Turbo on the RPI Zero 2. SDXL and SDXL Turbo share the same text encoder and VAE decoder: tiled decoding is required to keep memory consumption under 300MB.\n\nThis image was generated by my Raspberry PI Zero 2 in **29 minutes** (**1 step**):\n\n![sdxlturbo steps 1](https://raw.githubusercontent.com/vitoplantamura/OnnxStream/master/assets/sdxlturbo_steps1.png)\n\nThis image is an example of **3 step** generation, and took **50 minutes** on my RPI Zero 2. 
The quality is the same as the 1 step generated image:\n\n![sdxlturbo steps 3](https://raw.githubusercontent.com/vitoplantamura/OnnxStream/master/assets/sdxlturbo_steps3.png)\n\nA comparison between SDXL 1.0 and SDXL Turbo run in OnnxStream with respect to the number of steps is available in the [model card](https://huggingface.co/AeroX2/stable-diffusion-xl-turbo-1.0-onnxstream) of the model (by @AeroX2).\n\n# Features of OnnxStream\n\n- Inference engine decoupled from the `WeightsProvider`\n- `WeightsProvider` can be `DiskNoCache`, `DiskPrefetch`, `Ram` or custom\n- Attention slicing\n- Dynamic quantization (8 bit unsigned, asymmetric, percentile)\n- Static quantization (W8A8 unsigned, asymmetric, percentile)\n- Easy calibration of a quantized model\n- FP16 support (with or without FP16 arithmetic)\n- 40 ONNX operators implemented (the most common)\n- Operations executed sequentially but ~~all~~ most operators are multithreaded\n- Single implementation file + header file\n- XNNPACK calls wrapped in the `XnnPack` class (for future replacement)\n- Initial GPU support with cuBLAS (only FP16 and FP32 and only for the [LLM app](https://github.com/vitoplantamura/OnnxStream/blob/master/assets/LLM.md))\n- WebAssembly builds available (with multithreading and SIMD support; Demo [here](https://yolo.vitoplantamura.com/))\n\nOnnxStream depends on [XNNPACK](https://github.com/google/XNNPACK) for some (accelerated) primitives: MatMul, Convolution, element-wise Add/Sub/Mul/Div, Sigmoid, Softmax, MaxPool and Transpose.\n\n# Performance\n\nStable Diffusion 1.5 consists of three models: **a text encoder** (672 operations and 123 million parameters), the **UNET model** (2050 operations and 854 million parameters) and the **VAE decoder** (276 operations and 49 million parameters). Assuming that the batch size is equal to 1, a full image generation with 10 steps, which yields good results (with the Euler Ancestral scheduler), requires 2 runs of the text encoder, 20 (i.e. 
2*10) runs of the UNET model and 1 run of the VAE decoder.\n\nThis table shows the various inference times of the three models of Stable Diffusion 1.5, together with the memory consumption (i.e. the `Peak Working Set Size` in Windows or the `Maximum Resident Set Size` in Linux).\n\n| Model / Library             |       1st run        |       2nd run        |       3rd run        |\n| --------------------------- | :------------------: | :------------------: | :------------------: |\n| FP16 UNET / OnnxStream      | 0.133 GB - 18.2 secs | 0.133 GB - 18.7 secs | 0.133 GB - 19.8 secs |\n| FP16 UNET / OnnxRuntime     | 5.085 GB - 12.8 secs | 7.353 GB - 7.28 secs | 7.353 GB - 7.96 secs |\n| FP32 Text Enc / OnnxStream  | 0.147 GB - 1.26 secs | 0.147 GB - 1.19 secs | 0.147 GB - 1.19 secs |\n| FP32 Text Enc / OnnxRuntime | 0.641 GB - 1.02 secs | 0.641 GB - 0.06 secs | 0.641 GB - 0.07 secs |\n| FP32 VAE Dec / OnnxStream   | 1.004 GB - 20.9 secs | 1.004 GB - 20.6 secs | 1.004 GB - 21.2 secs |\n| FP32 VAE Dec / OnnxRuntime  | 1.330 GB - 11.2 secs | 2.026 GB - 10.1 secs | 2.026 GB - 11.1 secs |\n\nIn the case of the UNET model (when run in FP16 precision, with FP16 arithmetic enabled in OnnxStream), OnnxStream can consume even 55x less memory than OnnxRuntime with a 50% to 200% increase in latency.\n\nNotes:\n\n* The first run for OnnxRuntime is a warm up inference, since its `InferenceSession` is created before the first run and reused for all the subsequent runs. 
No such thing as a warm up exists for OnnxStream since it is purely eager by design (however subsequent runs can benefit from the caching of the weights files by the OS).\n* At the moment OnnxStream doesn't support inputs with a batch size != 1, unlike OnnxRuntime, which can greatly speed up the whole diffusion process using a batch size = 2 when running the UNET model.\n* In my tests, changing OnnxRuntime's `SessionOptions` (like `EnableCpuMemArena` and `ExecutionMode`) produces no significant difference in the results.\n* Performance of OnnxRuntime is very similar to that of NCNN (the other framework I evaluated), both in terms of memory consumption and inference time. I'll include NCNN benchmarks in the future, if useful.\n* Tests were run on my development machine: Windows Server 2019, 16GB RAM, 8750H cpu (AVX2), 970 EVO Plus SSD, 8 virtual cores on VMWare.\n\n# Attention Slicing and Quantization\n\nThe use of \"attention slicing\" when running the UNET model and the use of W8A8 quantization for the VAE decoder were crucial in reducing memory consumption to a level that allowed execution on a RPI Zero 2.\n\nWhile there is a lot of information on the internet about quantizing neural networks, little can be found about \"attention slicing\". The idea is simple: the goal is to avoid materializing the full `Q @ K^T` matrix when calculating the scaled dot-product attention of the various multi-head attentions in the UNET model. With an attention head count of 8 in the UNET model, `Q` has a shape of (8,4096,40), while `K^T` has a shape of (8,40,4096): so the result of the first MatMul has a final shape of (8,4096,4096), which is a 512MB tensor (in FP32 precision):\n\n![Attention Slicing](https://raw.githubusercontent.com/vitoplantamura/OnnxStream/master/assets/attention_mem_consumpt.png)\n\nThe solution is to split `Q` vertically and then to proceed with the attention operations normally on each chunk of `Q`. 
`Q_sliced` has a shape of (1,x,40), where x is 4096 (in this case) divided by `onnxstream::Model::m_attention_fused_ops_parts` (which has a default value of 2, but can be customized). This simple trick lowers the overall memory consumption of the UNET model from 1.1GB to 300MB (when the model is run in FP32 precision). A possible alternative, certainly more efficient, would be to use FlashAttention; however, it would require writing a custom kernel for each supported architecture (AVX, NEON, etc.), bypassing XnnPack in our case.\n\n# How OnnxStream works\n\nThis code can run a model defined in `path_to_model_folder/model.txt` (all the model operations are defined in the `model.txt` text file; OnnxStream expects to find all the weights files in that same folder, as a series of `.bin` files):\n\n``` cpp\n#include \"onnxstream.h\"\n\nusing namespace onnxstream;\n\nint main()\n{\n    Model model;\n\n    //\n    // Optional parameters that can be set on the Model object:\n    //\n    // model.set_weights_provider( ... ); // specifies a different weights provider (default is DiskPrefetchWeightsProvider)\n    // model.read_range_data( ... ); // reads a range data file (which contains the clipping ranges of the activations for a quantized model)\n    // model.write_range_data( ... ); // writes a range data file (useful after calibration)\n    // model.m_range_data_calibrate = true; // calibrates the model\n    // model.m_use_fp16_arithmetic = true; // uses FP16 arithmetic during inference (useful if weights are in FP16 precision)\n    // model.m_use_uint8_arithmetic = true; // uses UINT8 arithmetic during inference\n    // model.m_use_uint8_qdq = true; // uses UINT8 dynamic quantization (can reduce memory consumption of some models)\n    // model.m_fuse_ops_in_attention = true; // enables attention slicing\n    // model.m_attention_fused_ops_parts = ... 
; // see the \"Attention Slicing\" section above\n    //\n\n    model.read_file(\"path_to_model_folder/model.txt\");\n\n    tensor_vector\u003cfloat\u003e data;\n    \n    ... // fill the tensor_vector with the tensor data. \"tensor_vector\" is just an alias to a std::vector with a custom allocator.\n\n    Tensor t;\n    t.m_name = \"input\";\n    t.m_shape = { 1, 4, 64, 64 };\n    t.set_vector(std::move(data));\n    model.push_tensor(std::move(t));\n\n    model.run();\n    \n    auto\u0026 result = model.m_data[0].get_vector\u003cfloat\u003e();\n    \n    ... // process the result: \"result\" is a reference to the first result of the inference (a tensor_vector\u003cfloat\u003e as well).\n\n    return 0;\n}\n```\n\nThe `model.txt` file contains all the model operations in ASCII format, as exported from the original ONNX file. Each line corresponds to an operation: for example this line represents a convolution in a quantized model:\n\n```\nConv_4:Conv*input:input_2E_1(1,4,64,64);post_5F_quant_5F_conv_2E_weight_nchw.bin(uint8[0.0035054587850383684,134]:4,4,1,1);post_5F_quant_5F_conv_2E_bias.bin(float32:4)*output:input(1,4,64,64)*dilations:1,1;group:1;kernel_shape:1,1;pads:0,0,0,0;strides:1,1\n```\n\nIn order to export the `model.txt` file and its weights (as a series of `.bin` files) from an ONNX file for use in OnnxStream, a notebook (with a single cell) is provided (`onnx2txt.ipynb`).\n\nSome things must be considered when exporting a Pytorch `nn.Module` (in our case) to ONNX for use in OnnxStream:\n\n1. When calling `torch.onnx.export`, `dynamic_axes` should be left empty, since OnnxStream doesn't support inputs with a dynamic shape.\n2. 
It is strongly recommended to run the excellent [ONNX Simplifier](https://github.com/daquexian/onnx-simplifier) on the exported ONNX file before its conversion to a `model.txt` file.\n\n# How to Build the Stable Diffusion example on Linux/Mac/Windows/Termux/FreeBSD\n\n- **Windows only**: start the following command prompt: `Visual Studio Tools` \u003e `x64 Native Tools Command Prompt`.\n- **Mac only**: make sure to install cmake: `brew install cmake`.\n\n\u003cdetails\u003e\n\u003csummary\u003e--\u003e FreeBSD only \u003c--\u003c/summary\u003e\n\n**ADVANCED!**\n\nXNNPACK does not support building on FreeBSD at the time of writing. However it is possible to build it on FreeBSD with small changes to its CMake files.\n\nThe incompatibility concerns these two main points:\n\n1) The two variables **CMAKE_SYSTEM_NAME** and **CMAKE_SYSTEM_PROCESSOR**.\n\n2) The [cpuinfo](https://github.com/pytorch/cpuinfo) dependency: FreeBSD support in this project was added recently, so we need to instruct XNNPACK to download a newer version.\n\nFor example these are the changes to successfully build commit **1c8ee1b68f3a3e0847ec3c53c186c5909fa3fbd3** of XNNPACK on FreeBSD:\n\n```patch\ndiff --git a/CMakeLists.txt b/CMakeLists.txt\nindex d33268bd9..4efd58b86 100644\n--- a/CMakeLists.txt\n+++ b/CMakeLists.txt\n@@ -88,7 +88,7 @@ ELSEIF(CMAKE_GENERATOR MATCHES \"^Visual Studio \" AND CMAKE_GENERATOR_PLATFORM)\n   ENDIF()\n ELSEIF(CMAKE_SYSTEM_PROCESSOR MATCHES \"^i[3-7]86$\")\n   SET(XNNPACK_TARGET_PROCESSOR \"x86\")\n-ELSEIF(CMAKE_SYSTEM_PROCESSOR STREQUAL \"AMD64\")\n+ELSEIF(CMAKE_SYSTEM_PROCESSOR STREQUAL \"AMD64\" OR CMAKE_SYSTEM_PROCESSOR STREQUAL \"amd64\")\n   SET(XNNPACK_TARGET_PROCESSOR \"x86_64\")\n ELSEIF(CMAKE_SYSTEM_PROCESSOR MATCHES \"^armv[5-8]\")\n   SET(XNNPACK_TARGET_PROCESSOR \"arm\")\n@@ -249,7 +249,7 @@ ENDIF()\n # ---[ Build flags\n IF(NOT CMAKE_SYSTEM_NAME)\n   MESSAGE(FATAL_ERROR \"CMAKE_SYSTEM_NAME not defined\")\n-ELSEIF(NOT CMAKE_SYSTEM_NAME MATCHES 
\"^(Android|Darwin|iOS|Linux|Windows|CYGWIN|MSYS|QURT)$\")\n+ELSEIF(NOT CMAKE_SYSTEM_NAME MATCHES \"^(Android|Darwin|iOS|Linux|Windows|CYGWIN|MSYS|QURT|FreeBSD)$\")\n   MESSAGE(FATAL_ERROR \"Unrecognized CMAKE_SYSTEM_NAME value \\\"${CMAKE_SYSTEM_NAME}\\\"\")\n ENDIF()\n IF(CMAKE_SYSTEM_NAME MATCHES \"Windows\")\ndiff --git a/cmake/DownloadCpuinfo.cmake b/cmake/DownloadCpuinfo.cmake\nindex 01e4b9806..4dfff8f6f 100644\n--- a/cmake/DownloadCpuinfo.cmake\n+++ b/cmake/DownloadCpuinfo.cmake\n@@ -17,8 +17,8 @@ ENDIF()\n \n INCLUDE(ExternalProject)\n ExternalProject_Add(cpuinfo\n-  URL https://github.com/pytorch/cpuinfo/archive/3c8b1533ac03dd6531ab6e7b9245d488f13a82a5.zip\n-  URL_HASH SHA256=5d7f00693e97bd7525753de94be63f99b0490ae6855df168f5a6b2cfc452e49e\n+  URL https://github.com/pytorch/cpuinfo/archive/cebb0933058d7f181c979afd50601dc311e1bf8c.zip\n+  URL_HASH SHA256=52e0ffd7998d8cb3a927d8a6e1145763744d866d2be09c4eccea27fc157b6bb0\n   SOURCE_DIR \"${CMAKE_BINARY_DIR}/cpuinfo-source\"\n   BINARY_DIR \"${CMAKE_BINARY_DIR}/cpuinfo\"\n   CONFIGURE_COMMAND \"\"\n```\n\u003c/details\u003e\n\nFirst you need to build [XNNPACK](https://github.com/google/XNNPACK).\n\nSince the function prototypes of XnnPack can change at any time, I've included a `git checkout` ​​that ensures correct compilation of OnnxStream with a compatible version of XnnPack at the time of writing:\n\n```\ngit clone https://github.com/google/XNNPACK.git\ncd XNNPACK\ngit checkout 1c8ee1b68f3a3e0847ec3c53c186c5909fa3fbd3\nmkdir build\ncd build\ncmake -DXNNPACK_BUILD_TESTS=OFF -DXNNPACK_BUILD_BENCHMARKS=OFF ..\ncmake --build . 
--config Release\n```\n\nThen you can build the Stable Diffusion example.\n\n`\u003cDIRECTORY_WHERE_XNNPACK_WAS_CLONED\u003e` is, for example, `/home/vito/Desktop/XNNPACK` or `C:\\Projects\\SD\\XNNPACK` (on Windows):\n\n```\ngit clone https://github.com/vitoplantamura/OnnxStream.git\ncd OnnxStream\ncd src\nmkdir build\ncd build\ncmake -DMAX_SPEED=ON -DOS_LLM=OFF -DOS_CUDA=OFF -DXNNPACK_DIR=\u003cDIRECTORY_WHERE_XNNPACK_WAS_CLONED\u003e ..\ncmake --build . --config Release\n```\n\n**Important:** the MAX_SPEED option increases performance by about 10% on Windows, but by more than 50% on the Raspberry Pi. This option consumes much more memory at build time and the produced executable may not work (as was the case with Termux in my tests). So in case of problems, the first thing to try is setting MAX_SPEED to OFF.\n\nNow you can run the Stable Diffusion example.\n\n\u003cdetails\u003e\n\u003csummary\u003eThe most recent version of the application downloads the weights of the selected model automatically on the first run. 
Click here for how to download the weights manually.\u003c/summary\u003e\n\nIn the case of **Stable Diffusion 1.5**, the weights can be downloaded here (about 2GB).\n\n```\ngit lfs install\ngit clone --depth=1 https://huggingface.co/vitoplantamura/stable-diffusion-1.5-onnxstream\n```\n\nIn the case of **Stable Diffusion XL 1.0 Base**, the weights can be downloaded here (about 8GB):\n\n```\ngit lfs install\ngit clone --depth=1 https://huggingface.co/vitoplantamura/stable-diffusion-xl-base-1.0-onnxstream\n```\n\nIn the case of **Stable Diffusion XL Turbo 1.0**, the weights can be downloaded here (about 8GB):\n\n```\ngit lfs install\ngit clone --depth=1 https://huggingface.co/vitoplantamura/stable-diffusion-xl-turbo-1.0-anyshape-onnxstream\n```\n\n\u003c/details\u003e\n\nThese are the command line options of the Stable Diffusion example:\n\n```\n--xl                Runs Stable Diffusion XL 1.0 instead of Stable Diffusion 1.5.\n--turbo             Runs Stable Diffusion Turbo 1.0 instead of Stable Diffusion 1.5.\n--models-path       Sets the folder containing the Stable Diffusion models.\n--ops-printf        During inference, writes the current operation to stdout.\n--output            Sets the output PNG file.\n--decode-latents    Skips the diffusion, and decodes the specified latents file.\n--prompt            Sets the positive prompt.\n--neg-prompt        Sets the negative prompt.\n--steps             Sets the number of diffusion steps.\n--seed              Sets the seed.\n--save-latents      After the diffusion, saves the latents in the specified file.\n--decoder-calibrate (ONLY SD 1.5) Calibrates the quantized version of the VAE decoder.\n--not-tiled         (ONLY SDXL 1.0 and TURBO) Don't use the tiled VAE decoder.\n--res               (ONLY TURBO) Sets the output PNG file resolution. 
Default is \"512x512\".\n--ram               Uses the RAM WeightsProvider (Experimental).\n--download          A[uto] / F[orce] / N[ever] (re)download current model.\n--curl-parallel     Sets the number of parallel downloads with CURL. Default is 16.\n--rpi               A[uto] / F[orce] / N[ot] configure the models to run on a Raspberry Pi.\n--rpi-lowmem        Configures the models to run on a Raspberry Pi Zero 2.\n--threads           Sets the number of threads, negative values use (cores - N) threads.\n--preview-steps     Save every diffusion step in low resolution.\n--preview-steps-x8  Magnify previews to full resolution.\n--decode-steps      Decode and save every diffusion step in full resolution.\n--embed-parameters  Store parameters of generation (e. g. model path) in image comments.\n```\n\nOptions you're probably interested in: `--xl`, `--turbo`, `--prompt`, `--steps`, `--rpi`.\n\n# How to Convert and Run a Custom Stable Diffusion 1.5 Model with OnnxStream (by @GaelicThunder)\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand\u003c/summary\u003e\n\nThis guide aims to assist you in converting a custom Stable Diffusion model for use with OnnxStream. Whether you're starting from `.safetensors` or `.onnx`, this guide has you covered.\n\n### Prerequisites\n\n- Python 3.x\n- ONNX\n- ONNX Simplifier\n- Linux environment (tested on Ubuntu, Windows WSL also works)\n- Swap space (amount varies depending on your approach)\n\n### Why Specific Steps?\n\n#### Understanding Einsum and Other Operations\n\nAUTO1111's Stable Diffusion implementation uses operations like Einsum, which are not supported by OnnxStream (yet). 
Hence, it's advised to use the Hugging Face implementation, which is more compatible.\n\n### Optional: Converting .safetensors to ONNX\n\nIf you're starting with a `.safetensors` file, you can convert it to `.onnx` using the tool available at [this GitHub repository](https://github.com/AUTOMATIC1111/stable-diffusion-webui-tensorrt).\n\nHowever, it is recommended to follow the approach in section \"Option A\" below.\n\n### Exporting Your Model\n\n#### Option A: Exporting from Hugging Face (Recommended)\n\n```python\nfrom diffusers import StableDiffusionPipeline\nimport torch\n\npipe = StableDiffusionPipeline.from_single_file(\"https://huggingface.co/YourUsername/YourModel/blob/main/Model.safetensors\")\n\ndummy_input = (torch.randn(1, 4, 64, 64), torch.randn(1), torch.randn(1, 77, 768))\ninput_names = [\"sample\", \"timestep\", \"encoder_hidden_states\"]\noutput_names = [\"out_sample\"]\n\ntorch.onnx.export(pipe.unet, dummy_input, \"/path/to/save/unet_temp.onnx\", verbose=False, input_names=input_names, output_names=output_names, opset_version=14, do_constant_folding=True, export_params=True)\n```\n\n#### Option B: Manually Fixing Input Shapes\n\n```bash\npython -m onnxruntime.tools.make_dynamic_shape_fixed --input_name sample --input_shape 1,4,64,64 model.onnx model_fixed1.onnx\npython -m onnxruntime.tools.make_dynamic_shape_fixed --input_name timestep --input_shape 1 model_fixed1.onnx model_fixed2.onnx\npython -m onnxruntime.tools.make_dynamic_shape_fixed --input_name encoder_hidden_states --input_shape 1,77,768 model_fixed2.onnx model_fixed3.onnx\n```\n\nNote by Vito: This can be achieved simply by following the approach outlined in \"Option A\" above, which remains the recommended approach. Making the input shapes fixed might be useful if your starting point is already an ONNX file.\n\n### Running ONNX Simplifier\n\n```bash\npython -m onnx_simplifier model_fixed3.onnx model_simplified.onnx\n```\n\nWith big models you might run into problems with the simplifier. 
This tool can sometimes help in that case: https://github.com/luchangli03/onnxsim_large_model\n\n**Note**: \n- If you exported your model from Hugging Face, you'll need around 100GB of swap space. \n- If you manually fixed the input shapes, 16GB of RAM should suffice.\n- The process may take some time; please be patient.\n\n### Final Steps and Running the Model\n\nOnce you have the final model from `onnx2txt`, move it into the `unet_fp16` folder of the standard SD 1.5 model, which can be found in the Windows release of OnnxStream.\n\nThe command to run the model might look like this:\n\n```bash\n./sd --models-path ./Converted/ --prompt \"space landscape\" --steps 28 --rpi\n```\n\n### Note on the \"Shape\" Operator\n\nIf you see the \"Shape\" operator in the output of Onnx Simplifier or in `onnx2txt.ipynb`, it indicates that Onnx Simplifier may not be functioning as expected. This issue is often not caused by Onnx Simplifier itself but rather by Onnx's Shape Inference.\n\n#### Alternative Solution\n\nIn such cases, you have the alternative to re-export the model by modifying the parameters of `torch.onnx.export`. Locate this file on your computer:\n\n[export_onnx.py from GitHub](https://github.com/AUTOMATIC1111/stable-diffusion-webui-tensorrt/blob/master/export_onnx.py)\n\nAnd make sure to:\n- Set `opset_version` to 14\n- Remove `dynamic_axes`\n\nAfter making these changes, you can rerun Onnx Simplifier and `onnx2txt`.\n\nNote by Vito: This solution, although working, generates ONNX files with Einsum operations. When OnnxStream supports the Einsum operator, this solution will become the recommended one.\n\n### Conclusion\n\nThis guide is designed to be a comprehensive resource for those looking to run a custom Stable Diffusion 1.5 model with OnnxStream. 
Additional contributions are welcome!\n\n\u003c/details\u003e\n\n# Related Projects\n\n- [OnnxStreamGui](https://github.com/ThomAce/OnnxStreamGui) by @ThomAce: a web and desktop user interface for OnnxStream.\n- [Auto epaper art](https://github.com/rvdveen/epaper-slow-generative-art) by @rvdveen: a self-contained image generation picture frame showing news.\n- [PaperPiAI](https://github.com/dylski/PaperPiAI) by @dylski: Raspberry Pi Zero powered AI-generated e-ink picture frame.\n\n# Credits\n\n- The Stable Diffusion 1.5 implementation in `sd.cpp` is based on [this project](https://github.com/fengwang/Stable-Diffusion-NCNN), which in turn is based on [this project](https://github.com/EdVince/Stable-Diffusion-NCNN) by @EdVince. The original code was modified in order to use OnnxStream instead of NCNN.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvitoplantamura%2Fonnxstream","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvitoplantamura%2Fonnxstream","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvitoplantamura%2Fonnxstream/lists"}