https://github.com/intel/intel-extension-for-transformers

⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡
https://github.com/intel/intel-extension-for-transformers
4-bits autoround chatbot chatpdf gaudi3 habana intel-optimized-llamacpp large-language-model llm-cpu llm-inference neural-chat neural-chat-7b rag retrieval speculative-decoding streamingllm
Last synced: 4 months ago
JSON representation
⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡
Host: GitHub
URL: https://github.com/intel/intel-extension-for-transformers
Owner: intel
License: apache-2.0
Archived: true
Created: 2022-11-11T05:32:27.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-10-08T21:09:46.000Z (8 months ago)
Last Synced: 2024-10-29T15:34:42.493Z (8 months ago)
Topics: 4-bits, autoround, chatbot, chatpdf, gaudi3, habana, intel-optimized-llamacpp, large-language-model, llm-cpu, llm-inference, neural-chat, neural-chat-7b, rag, retrieval, speculative-decoding, streamingllm
Language: Python
Homepage:
Size: 585 MB
Stars: 2,133
Watchers: 28
Forks: 211
Open Issues: 56
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: docs/code_of_conduct.md
- Codeowners: .github/CODEOWNERS
- Security: SECURITY.md
Awesome Lists containing this project

awesome-oneapi - intel-extension-for-transformers - Intel Extension for Transformers is a toolkit designed to efficiently accelerate transformer-based models on Intel platforms, optimized for 4th gen Intel Xeon Scalable Processor (codename Sapphire Rapids). (Table of Contents / AI - Frameworks and Toolkits)
StarryDivineSky - intel/intel-extension-for-transformers
awesome-production-machine-learning - Intel® Extension for Transformers - extension-for-transformers.svg?style=social) - An Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere. (Deployment and Serving)
README

        


  

Intel® Extension for Transformers

===========================

An Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere


[![](https://dcbadge.vercel.app/api/server/Wxk3J3ZJkU?compact=true&style=flat-square)](https://discord.gg/Wxk3J3ZJkU)

[![Release Notes](https://img.shields.io/github/v/release/intel/intel-extension-for-transformers)](https://github.com/intel/intel-extension-for-transformers/releases)

[🏭Architecture](./docs/architecture.md)   |   [💬NeuralChat](./intel_extension_for_transformers/neural_chat)   |   [😃Inference on CPU](https://github.com/intel/neural-speed/tree/main)   |   [😃Inference  on GPU](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/weightonlyquant.md#examples-for-gpu)   |   [💻Examples](./docs/examples.md)   |   [📖Documentations](https://intel.github.io/intel-extension-for-transformers/latest/docs/Welcome.html)



## 🚀Latest News

* [2024/06] Support Qwen2, please find the details in [Blog](https://medium.com/intel-analytics-software/accelerating-qwen2-models-with-intel-extension-for-transformers-99403de82f68)

* [2024/04] Support the launch of **[Meta Llama 3](https://llama.meta.com/llama3/)**, the next generation of Llama models. Check out [Accelerate Meta* Llama 3 with Intel AI Solutions](https://www.intel.com/content/www/us/en/developer/articles/technical/accelerate-meta-llama3-with-intel-ai-solutions.html).

* [2024/04] Demonstrated the chatbot in 4th, 5th, and 6th Gen Xeon Scalable Processors in [**Intel Vision Pat's Keynote**](https://youtu.be/QB7FoIpx8os?t=2280).

* [2024/04] Supported **INT4 inference on Intel Meteor Lake**.

* [2024/04] Achieved a 1.8x performance improvement in GPT-J inference on the 5th Gen Xeon MLPerf v4.0 submission compared to v3.1. [News](https://www.intel.com/content/www/us/en/newsroom/news/new-gaudi-2-xeon-performance-ai-inference.html#gs.71ti1m), [Results](https://mlcommons.org/2024/03/mlperf-inference-v4/).

* [2024/01] Supported **INT4 inference on Intel GPUs** including Intel Data Center GPU Max Series (e.g., PVC) and Intel Arc A-Series (e.g., ARC). Check out the [examples](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/weightonlyquant.md#examples-for-gpu) and [scripts](https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/text-generation/quantization/run_generation_gpu_woq.py).

* [2024/01] Demonstrated **Intel Hybrid Copilot** in **CES 2024 Great Minds** Session "[Bringing the Limitless Potential of AI Everywhere](https://youtu.be/70J3uO3eLZA?t=1348)".

* [2023/12] Supported **QLoRA on CPUs** to make fine-tuning on client CPU possible. Check out the [blog](https://medium.com/@NeuralCompressor/creating-your-own-llms-on-your-laptop-a08cc4f7c91b) and [readme](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/qloracpu.md) for more details.

* [2023/11] Released **top-1 7B-sized LLM** [**NeuralChat-v3-1**](https://huggingface.co/Intel/neural-chat-7b-v3-1) and [DPO dataset](https://huggingface.co/datasets/Intel/orca_dpo_pairs). Check out the [nice video](https://www.youtube.com/watch?v=bWhZ1u_1rlc) published by [WorldofAI](https://www.youtube.com/@intheworldofai).

* [2023/11] Published a **4-bit chatbot demo** (based on NeuralChat) available on [Intel Hugging Face Space](https://huggingface.co/spaces/Intel/NeuralChat-ICX-INT4). Welcome to have a try! To setup the demo locally, please follow the [instructions](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/docs/notebooks/setup_text_chatbot_service_on_spr.ipynb).

---



## 🏃Installation

### Quick Install from Pypi

```bash

pip install intel-extension-for-transformers

```

> For system requirements and other installation tips, please refer to [Installation Guide](./docs/installation.md)

## 🌟Introduction

Intel® Extension for Transformers is an innovative toolkit designed to accelerate GenAI/LLM everywhere with the optimal performance of Transformer-based models on various Intel platforms, including Intel Gaudi2, Intel CPU, and Intel GPU. The toolkit provides the below key features and examples:

*  Seamless user experience of model compressions on Transformer-based models by extending [Hugging Face transformers](https://github.com/huggingface/transformers) APIs and leveraging [Intel® Neural Compressor](https://github.com/intel/neural-compressor)

*  Advanced software optimizations and unique compression-aware runtime (released with NeurIPS 2022's paper [Fast Distilbert on CPUs](https://arxiv.org/abs/2211.07715) and [QuaLA-MiniLM: a Quantized Length Adaptive MiniLM](https://arxiv.org/abs/2210.17114), and NeurIPS 2021's paper [Prune Once for All: Sparse Pre-Trained Language Models](https://arxiv.org/abs/2111.05754))

*  Optimized Transformer-based model packages such as [Stable Diffusion](examples/huggingface/pytorch/text-to-image/deployment/stable_diffusion), [GPT-J-6B](examples/huggingface/pytorch/text-generation/deployment), [GPT-NEOX](examples/huggingface/pytorch/language-modeling/quantization#2-validated-model-list), [BLOOM-176B](examples/huggingface/pytorch/language-modeling/inference#BLOOM-176B), [T5](examples/huggingface/pytorch/summarization/quantization#2-validated-model-list), [Flan-T5](examples/huggingface/pytorch/summarization/quantization#2-validated-model-list), and end-to-end workflows such as [SetFit-based text classification](docs/tutorials/pytorch/text-classification/SetFit_model_compression_AGNews.ipynb) and [document level sentiment analysis (DLSA)](workflows/dlsa) 

*  [NeuralChat](intel_extension_for_transformers/neural_chat), a customizable chatbot framework to create your own chatbot within minutes by leveraging a rich set of [plugins](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/docs/advanced_features.md) such as [Knowledge Retrieval](./intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/README.md), [Speech Interaction](./intel_extension_for_transformers/neural_chat/pipeline/plugins/audio/README.md), [Query Caching](./intel_extension_for_transformers/neural_chat/pipeline/plugins/caching/README.md), and [Security Guardrail](./intel_extension_for_transformers/neural_chat/pipeline/plugins/security/README.md). This framework supports Intel Gaudi2/CPU/GPU.

*  [Inference](https://github.com/intel/neural-speed/tree/main) of Large Language Model (LLM) in pure C/C++ with weight-only quantization kernels for Intel CPU and Intel GPU (TBD), supporting [GPT-NEOX](https://github.com/intel/neural-speed/tree/main/neural_speed/models/gptneox), [LLAMA](https://github.com/intel/neural-speed/tree/main/neural_speed/models/llama), [MPT](https://github.com/intel/neural-speed/tree/main/neural_speed/models/mpt), [FALCON](https://github.com/intel/neural-speed/tree/main/neural_speed/models/falcon), [BLOOM-7B](https://github.com/intel/neural-speed/tree/main/neural_speed/models/bloom), [OPT](https://github.com/intel/neural-speed/tree/main/neural_speed/models/opt), [ChatGLM2-6B](https://github.com/intel/neural-speed/tree/main/neural_speed/models/chatglm), [GPT-J-6B](https://github.com/intel/neural-speed/tree/main/neural_speed/models/gptj), and [Dolly-v2-3B](https://github.com/intel/neural-speed/tree/main/neural_speed/models/gptneox). Support AMX, VNNI, AVX512F and AVX2 instruction set. We've boosted the performance of Intel CPUs, with a particular focus on the 4th generation Intel Xeon Scalable processor, codenamed [Sapphire Rapids](https://www.intel.com/content/www/us/en/products/docs/processors/xeon-accelerated/4th-gen-xeon-scalable-processors.html).

## 🔓Validated Hardware

	

		

			Hardware

			Fine-Tuning

			Inference

		

		

			Full

			PEFT

			8-bit

			4-bit

		

		

			Intel Gaudi2

			✔

			✔

			WIP (FP8)

			-

		

		

			Intel Xeon Scalable Processors

			✔

			✔

			✔ (INT8, FP8)

			✔ (INT4, FP4, NF4)

		

		

			Intel Xeon CPU Max Series

			✔

			✔

			✔ (INT8, FP8)

			✔ (INT4, FP4, NF4)

		

		

			Intel Data Center GPU Max Series

			WIP 

			WIP 

			WIP (INT8)

			✔ (INT4)

		

		

			Intel Arc A-Series

			-

			-

			WIP (INT8)

			✔ (INT4)

		

		

			Intel Core Processors

			-

			✔

			✔ (INT8, FP8)

			✔ (INT4, FP4, NF4)

		

	

> In the table above, "-" means not applicable or not started yet.

## 🔓Validated Software

	

		

			Software

			Fine-Tuning

			Inference

		

		

			Full

			PEFT

			8-bit

			4-bit

		

		

			PyTorch

			2.0.1+cpu, 2.0.1a0 (gpu)

			2.0.1+cpu, 2.0.1a0 (gpu)

			2.1.0+cpu, 2.0.1a0 (gpu)

			2.1.0+cpu, 2.0.1a0 (gpu)

		

		

			Intel® Extension for PyTorch

			2.1.0+cpu, 2.0.110+xpu

			2.1.0+cpu, 2.0.110+xpu

			2.1.0+cpu, 2.0.110+xpu

			2.1.0+cpu, 2.0.110+xpu

		

		

			Transformers

			4.35.2(CPU), 4.31.0 (Intel GPU)

			4.35.2(CPU), 4.31.0 (Intel GPU)

			4.35.2(CPU), 4.31.0 (Intel GPU)

			4.35.2(CPU), 4.31.0 (Intel GPU)

		

		

			Synapse AI

			1.13.0

			1.13.0

			1.13.0

			1.13.0

		

		

			Gaudi2 driver

			1.13.0-ee32e42

			1.13.0-ee32e42

			1.13.0-ee32e42

			1.13.0-ee32e42

		

                

                        intel-level-zero-gpu

                        1.3.26918.50-736~22.04 

                        1.3.26918.50-736~22.04 

                        1.3.26918.50-736~22.04 

                        1.3.26918.50-736~22.04 

                

	

> Please refer to the detailed requirements in [CPU](intel_extension_for_transformers/neural_chat/requirements_cpu.txt), [Gaudi2](intel_extension_for_transformers/neural_chat/requirements_hpu.txt), [Intel GPU](intel_extension_for_transformers/neural_chat/requirements_xpu.txt).

## 🔓Validated OS

Ubuntu 20.04/22.04, Centos 8.

## 🌱Getting Started

### Chatbot

Below is the sample code to create your chatbot. See more [examples](intel_extension_for_transformers/neural_chat/docs/full_notebooks.md).

#### Serving (OpenAI-compatible RESTful APIs)

NeuralChat provides OpenAI-compatible RESTful APIs for chat, so you can use NeuralChat as a drop-in replacement for OpenAI APIs.

You can start NeuralChat server either using the Shell command or Python code.

```shell

# Shell Command

neuralchat_server start --config_file ./server/config/neuralchat.yaml

```

```python

# Python Code

from intel_extension_for_transformers.neural_chat import NeuralChatServerExecutor

server_executor = NeuralChatServerExecutor()

server_executor(config_file="./server/config/neuralchat.yaml", log_file="./neuralchat.log")

```

NeuralChat service can be accessible through [OpenAI client library](https://github.com/openai/openai-python), `curl` commands, and `requests` library. See more in [NeuralChat](intel_extension_for_transformers/neural_chat/README.md).

#### Offline

```python

from intel_extension_for_transformers.neural_chat import build_chatbot

chatbot = build_chatbot()

response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")

```

### Transformers-based extension APIs

Below is the sample code to use the extended Transformers APIs. See more [examples](https://github.com/intel/neural-speed/tree/main).

#### INT4 Inference (CPU)

We encourage you to install [NeuralSpeed](https://github.com/intel/neural-speed) to get the latest features (e.g., GGUF support) of LLM low-bit inference on CPUs. You may also want to use v1.3 without NeuralSpeed by following the [document](https://github.com/intel/intel-extension-for-transformers/tree/v1.3/intel_extension_for_transformers/llm/runtime/graph/README.md)

```python

from transformers import AutoTokenizer

from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"     

prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

inputs = tokenizer(prompt, return_tensors="pt").input_ids

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

outputs = model.generate(inputs)

```

You can also load GGUF format model from Huggingface, we only support Q4_0/Q5_0/Q8_0 gguf format for now.

```python

from transformers import AutoTokenizer

from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Specify the GGUF repo on the Hugginface

model_name = "TheBloke/Llama-2-7B-Chat-GGUF"

# Download the the specific gguf model file from the above repo

gguf_file = "llama-2-7b-chat.Q4_0.gguf"

# make sure you are granted to access this model on the Huggingface.

tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"

prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)

inputs = tokenizer(prompt, return_tensors="pt").input_ids

model = AutoModelForCausalLM.from_pretrained(model_name, gguf_file = gguf_file)

outputs = model.generate(inputs)

```

You can also load PyTorch Model from Modelscope

>**Note**:require modelscope

```python

from transformers import TextStreamer

from modelscope import AutoTokenizer

from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "qwen/Qwen-7B"     # Modelscope model_id or local model

prompt = "Once upon a time, there existed a little girl,"

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, model_hub="modelscope")

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

inputs = tokenizer(prompt, return_tensors="pt").input_ids

streamer = TextStreamer(tokenizer)

outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

```

You can also load the low-bit model quantized by GPTQ/AWQ/RTN/AutoRound algorithm.

```python

from transformers import AutoTokenizer

from intel_extension_for_transformers.transformers import AutoModelForCausalLM, GPTQConfig

# Hugging Face GPTQ/AWQ model or use local quantize model

model_name = "MODEL_NAME_OR_PATH"

prompt = "Once upon a time, a little girl"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

inputs = tokenizer(prompt, return_tensors="pt").input_ids

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

outputs = model.generate(inputs)

```

#### INT4 Inference (GPU)

```python

import intel_extension_for_pytorch as ipex

from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM

from transformers import AutoTokenizer

import torch

device_map = "xpu"

model_name ="Qwen/Qwen-7B"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

prompt = "Once upon a time, there existed a little girl,"

inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device_map)

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True,

                                              device_map=device_map, load_in_4bit=True)

model = ipex.optimize_transformers(model, inplace=True, dtype=torch.float16, quantization_config=True, device=device_map)

output = model.generate(inputs)

```

> Note: Please refer to the [example](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/weightonlyquant.md#examples-for-gpu) and [script](https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/text-generation/quantization/run_generation_gpu_woq.py) for more details.

### Langchain-based extension APIs

Below is the sample code to use the extended Langchain APIs. See more [examples](intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/README.md).

```python

from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

from langchain.chains import RetrievalQA

from langchain_core.vectorstores import VectorStoreRetriever

from intel_extension_for_transformers.langchain.vectorstores import Chroma

retriever = VectorStoreRetriever(vectorstore=Chroma(...))

retrievalQA = RetrievalQA.from_llm(llm=HuggingFacePipeline(...), retriever=retriever)

```

## 🎯Validated  Models

You can access the validated models, accuracy and performance from [Release data](./docs/release_data.md) or [Medium blog](https://medium.com/@NeuralCompressor/llm-performance-of-intel-extension-for-transformers-f7d061556176).

## 📖Documentation

  

    OVERVIEW

  

  

    NeuralChat

    Neural Speed

  

  

    NEURALCHAT

  

  

    Chatbot on Intel CPU

    Chatbot on Intel GPU

    Chatbot on Gaudi

  

  

    Chatbot on Client

    More Notebooks

  

  

    NEURAL SPEED

  

 

    Neural Speed

    Streaming LLM

    Low Precision Kernels

    Tensor Parallelism

  

  

    LLM COMPRESSION

  

  

    SmoothQuant (INT8)

    Weight-only Quantization (INT4/FP4/NF4/INT8)

    QLoRA on CPU

  

  

    GENERAL COMPRESSION

  

  

    Quantization

    Pruning

    Distillation

    Orchestration

  

  

    Data Augmentation

    Export

    Metrics

    Objectives

  

  

    Pipeline

    Length Adaptive

    Early Exit

  

  

    TUTORIALS & RESULTS

  

  

    Tutorials

    LLM List

    General Model List

    Model Performance

  

## 🙌Demo

* LLM Infinite Inference (up to 4M tokens)

https://github.com/intel/intel-extension-for-transformers/assets/109187816/1698dcda-c9ec-4f44-b159-f4e9d67ab15b

* LLM QLoRA on Client CPU

https://github.com/intel/intel-extension-for-transformers/assets/88082706/9d9bdb7e-65db-47bb-bbed-d23b151e8b31

## 📃Selected Publications/Events

* Blog published on Huggingface: [Building Cost-Efficient Enterprise RAG applications with Intel Gaudi 2 and Intel Xeon](https://huggingface.co/blog/cost-efficient-rag-applications-with-intel) (May 2024)

* Blog published on Intel Developer News: [Efficient Natural Language Embedding Models with Intel® Extension for Transformers](https://www.intel.com/content/www/us/en/developer/articles/technical/efficient-natural-language-embedding-models.html) (May 2024)

* Blog published on Techcrunch: [Intel and others commit to building open generative AI tools for the enterprise](https://techcrunch.com/2024/04/16/intel-and-others-commit-to-building-open-generative-ai-tools-for-the-enterprise) (Apr 2024)

* Video on YouTube: [Intel Vision Keynotes 2024](https://www.youtube.com/watch?v=QB7FoIpx8os&t=2280s) (Apr 2024)

* Blog published on Vectara: [Do Smaller Models Hallucinate More?](https://vectara.com/blog/do-smaller-models-hallucinate-more) (Apr 2024)

* Blog of Intel Developer News: [Use the neural-chat-7b Model for Advanced Fraud Detection: An AI-Driven Approach in Cybersecurity](https://www.intel.com/content/www/us/en/developer/articles/technical/bilics-approach-cybersecurity-using-neuralchat-7b.html) (March 2024)

* CES 2024: [CES 2024 Great Minds Keynote: Bringing the Limitless Potential of AI Everywhere: Intel Hybrid Copilot demo](https://youtu.be/70J3uO3eLZA?t=1348) (Jan 2024)

* Blog published on Medium: [Connect an AI agent with your API: Intel Neural-Chat 7b LLM can replace Open AI Function Calling](https://medium.com/11tensors/connect-an-ai-agent-with-your-api-intel-neural-chat-7b-llm-can-replace-open-ai-function-calling-242d771e7c79) (Dec 2023)

* NeurIPS'2023 on Efficient Natural Language and Speech Processing: [Efficient LLM Inference on CPUs](https://arxiv.org/abs/2311.00502) (Nov 2023)

* Blog published on Hugging Face: [Intel Neural-Chat 7b: Fine-Tuning on Gaudi2 for Top LLM Performance](https://huggingface.co/blog/Andyrasika/neural-chat-intel) (Nov 2023)

* Blog published on VMware: [AI without GPUs: A Technical Brief for VMware Private AI with Intel](https://core.vmware.com/resource/ai-without-gpus-technical-brief-vmware-private-ai-intel#section6) (Nov 2023)

  

> View [Full Publication List](./docs/publication.md)

## Additional Content

* [Release Information](./docs/release.md)

* [Contribution Guidelines](./docs/contributions.md)

* [Legal Information](./docs/legal.md)

* [Security Policy](SECURITY.md)

* [Apache License](./LICENSE)

## Acknowledgements

* Excellent open-source projects: [bitsandbytes](https://github.com/TimDettmers/bitsandbytes), [FastChat](https://github.com/lm-sys/FastChat), [fastRAG](https://github.com/IntelLabs/fastRAG), [ggml](https://github.com/ggerganov/ggml), [gptq](https://github.com/IST-DASLab/gptq), [llama.cpp](https://github.com/ggerganov/llama.cpp), [lm-evauation-harness](https://github.com/EleutherAI/lm-evaluation-harness), [peft](https://github.com/huggingface/peft), [trl](https://github.com/huggingface/trl), [streamingllm](https://github.com/mit-han-lab/streaming-llm) and many others.

* Thanks to all the [contributors](./docs/contributors.md).

## 💁Collaborations

Welcome to raise any interesting ideas on model compression techniques and LLM-based chatbot development! Feel free to reach [us](mailto:[email protected]), and we look forward to our collaborations on Intel Extension for Transformers!
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/intel/intel-extension-for-transformers

Awesome Lists containing this project

README

An Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere