Projects in Awesome Lists tagged with llm-inference

https://github.com/nomic-ai/gpt4all

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.

Last synced: 24 Sep 2025

https://github.com/ray-project/ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

data-science deep-learning deployment distributed hyperparameter-optimization hyperparameter-search large-language-models llm llm-inference llm-serving machine-learning optimization parallel python pytorch ray reinforcement-learning rllib serving tensorflow

Last synced: 09 Sep 2025

https://microsoft.github.io/autogen/

A programming framework for agentic AI 🤖

agent-based-framework agent-oriented-programming agentic agentic-agi chat chat-application chatbot chatgpt gpt gpt-35-turbo gpt-4 llm-agent llm-framework llm-inference llmops

Last synced: 08 May 2025

https://github.com/gitleaks/gitleaks

Find secrets with Gitleaks 🔑

ai-powered ci-cd cicd cli data-loss-prevention devsecops dlp git gitleaks go golang hacktoberfest llm llm-inference llm-training open-source secret security security-tools

Last synced: 15 Dec 2025

https://github.com/liguodongiot/llm-action

本项目旨在分享大模型相关技术原理以及实战经验（大模型工程化、大模型应用落地）

llm llm-inference llm-serving llm-training llmops

Last synced: 15 May 2025

https://github.com/lightning-ai/litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.

ai artificial-intelligence deep-learning large-language-models llm llm-inference llms

Last synced: 12 May 2025

https://github.com/Lightning-AI/litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.

ai artificial-intelligence deep-learning large-language-models llm llm-inference llms

Last synced: 26 Mar 2025

https://github.com/bentoml/openllm

Run any open-source LLMs, such as DeepSeek and Llama, as OpenAI compatible API endpoint in the cloud.

bentoml fine-tuning llama llama2 llama3-1 llama3-2 llama3-2-vision llm llm-inference llm-ops llm-serving llmops mistral mlops model-inference open-source-llm openllm vicuna

Last synced: 23 Oct 2025

https://github.com/mistralai/mistral-inference

Official inference library for Mistral models

llm llm-inference mistralai

Last synced: 12 May 2025

https://github.com/mistralai/mistral-inference?tab=readme-ov-file

Official inference library for Mistral models

llm llm-inference mistralai

Last synced: 16 Mar 2025

https://github.com/bentoml/OpenLLM

Run any open-source LLMs, such as Llama 3.1, Gemma, as OpenAI compatible API endpoint in the cloud.

bentoml fine-tuning llama llama2 llama3-1 llama3-2 llama3-2-vision llm llm-inference llm-ops llm-serving llmops mistral mlops model-inference open-source-llm openllm vicuna

Last synced: 14 Mar 2025

https://github.com/openvinotoolkit/openvino

OpenVINO™ is an open source toolkit for optimizing and deploying AI inference

ai computer-vision deep-learning deploy-ai diffusion-models generative-ai good-first-issue inference llm-inference natural-language-processing nlp openvino optimize-ai performance-boost recommendation-system speech-recognition stable-diffusion transformers yolo

Last synced: 12 May 2025

https://github.com/sjtu-ipads/powerinfer

High-speed Large Language Model Serving for Local Deployment

large-language-models llama llm llm-inference local-inference

Last synced: 12 May 2025

https://github.com/SJTU-IPADS/PowerInfer

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs

bamboo-7b falcon large-language-models llama llm llm-inference local-inference

Last synced: 18 Mar 2025

https://github.com/bentoml/bentoml

The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!

ai-inference deep-learning generative-ai inference-platform llm llm-inference llm-serving llmops machine-learning ml-engineering mlops model-inference-service model-serving multimodal python

Last synced: 12 May 2025

https://github.com/bentoml/BentoML

The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and much more!

ai-inference deep-learning generative-ai inference-platform llm llm-inference llm-serving llmops machine-learning ml-engineering mlops model-inference-service model-serving multimodal python

Last synced: 12 Mar 2025

https://github.com/internlm/lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

codellama cuda-kernels deepspeed fastertransformer internlm llama llama2 llama3 llm llm-inference turbomind

Last synced: 24 Dec 2025

https://github.com/InternLM/lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

codellama cuda-kernels deepspeed fastertransformer internlm llama llama2 llama3 llm llm-inference turbomind

Last synced: 20 Mar 2025

https://github.com/superduper-io/superduper

Superduper: End-to-end framework for building custom AI applications and agents.

ai chatbot data database distributed-ml inference llm-inference llm-serving llmops ml mlops mongodb pretrained-models python pytorch rag semantic-search torch transformers vector-search

Last synced: 14 May 2025

https://github.com/kserve/kserve

Standardized Serverless ML Inference Platform on Kubernetes

artificial-intelligence genai hacktoberfest istio k8s knative kserve kubeflow kubernetes llm-inference machine-learning mlops model-interpretability model-serving pytorch service-mesh sklearn tensorflow xgboost

Last synced: 13 May 2025

https://github.com/deftruth/awesome-llm-inference

📖A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism, etc. 🎉🎉

awesome-llm deepseek deepseek-r1 deepseek-v3 flash-attention flash-attention-3 flash-mla llm-inference minimax-01 mla paged-attention tensorrt-llm vllm

Last synced: 04 Apr 2025

https://github.com/fellouai/eko

Eko (Eko Keeps Operating) - Build Production-ready Agentic Workflow with Natural Language - eko.fellou.ai

agent agentic-ai agentic-ai-development agentic-framework agentic-workflow agents ai-agents browser-automation browseruse chain-of-thought computer-automation computeruse genai llm-agents llm-inference llmapi natural-language-inference prompt-engineering rag workflow

Last synced: 29 Dec 2025

https://github.com/neuralmagic/deepsparse

Sparsity-aware deep learning inference runtime for CPUs

computer-vision cpus deepsparse inference llm-inference machinelearning nlp object-detection onnx performance pretrained-models pruning quantization sparsification

Last synced: 14 May 2025

https://github.com/nvidia/generativeaiexamples

Generative AI reference workflows optimized for accelerated infrastructure and microservice architecture.

gpu-acceleration large-language-models llm llm-inference microservice nemo rag retrieval-augmented-generation tensorrt triton-inference-server

Last synced: 13 May 2025

https://github.com/predibase/lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

fine-tuning gpt llama llm llm-inference llm-serving llmops lora model-serving pytorch transformers

Last synced: 12 May 2025

https://github.com/NVIDIA/GenerativeAIExamples

Generative AI reference workflows optimized for accelerated infrastructure and microservice architecture.

gpu-acceleration large-language-models llm llm-inference microservice nemo rag retrieval-augmented-generation tensorrt triton-inference-server

Last synced: 28 Mar 2025

https://github.com/flashinfer-ai/flashinfer

FlashInfer: Kernel Library for LLM Serving

attention cuda gpu jit large-large-models llm-inference nvidia pytorch

Last synced: 05 Jan 2026

https://github.com/gpustack/gpustack

Manage GPU clusters for running AI models

ascend cuda deepseek distributed distributed-inference genai ggml inference llama llamacpp llm llm-inference llm-serving maas metal mindie openai qwen rocm vllm

Last synced: 23 Oct 2025

https://github.com/databricks/dbrx

Code examples and resources for DBRX, a large language model developed by Databricks

databricks gen-ai generative-ai llm llm-inference llm-training mosaic-ai

Last synced: 25 Oct 2025

https://github.com/fasterdecoding/medusa

Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads

llm llm-inference

Last synced: 14 May 2025

https://github.com/FasterDecoding/Medusa

Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads

llm llm-inference

Last synced: 09 May 2025

https://github.com/codelion/optillm

Optimizing inference proxy for LLMs

agent agentic-ai agentic-framework agentic-workflow agents api-gateway chain-of-thought genai large-language-models llm llm-inference llmapi mixture-of-experts moa monte-carlo-tree-search openai openai-api optimization prompt-engineering proxy-server

Last synced: 10 Jun 2025

https://github.com/codelion/openevolve

Open-source implementation of AlphaEvolve

alpha-evolve alphacode alphaevolve coding-agent deepmind deepmind-lab discovery distributed-evolutionary-algorithms evolutionary-algorithms evolutionary-computation genetic-algorithm genetic-algorithms iterative-methods iterative-refinement llm-engineering llm-ensemble llm-inference openevolve optimize

Last synced: 10 Jun 2025

https://github.com/intel/intel-extension-for-transformers

⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡

4-bits autoround chatbot chatpdf gaudi3 habana intel-optimized-llamacpp large-language-model llm-cpu llm-inference neural-chat neural-chat-7b rag retrieval speculative-decoding streamingllm

Last synced: 24 Feb 2025

https://github.com/microsoft/aici

AICI: Prompts as (Wasm) Programs

ai inference language-model llm llm-framework llm-inference llm-serving llmops model-serving rust transformer wasm wasmtime

Last synced: 14 May 2025

https://github.com/b4rtaz/distributed-llama

Connect home devices into a powerful cluster to accelerate LLM inference. More devices means faster inference.

distributed-computing distributed-llm llama2 llama3 llm llm-inference llms neural-network open-llm

Last synced: 13 Apr 2025

https://github.com/liltom-eth/llama2-webui

Run any Llama 2 locally with gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). Use `llama2-wrapper` as your local llama2 backend for Generative Agents/Apps.

llama-2 llama2 llm llm-inference

Last synced: 14 May 2025

https://github.com/ray-project/ray-llm

RayLLM - LLMs on Ray

distributed-systems large-language-models llm llm-inference llm-serving llmops ray serving transformers

Last synced: 25 Feb 2025

https://github.com/ray-project/aviary

RayLLM - LLMs on Ray

distributed-systems large-language-models llm llm-inference llm-serving llmops ray serving transformers

Last synced: 05 Mar 2025

https://github.com/sauravpanda/browserai

Run local LLMs like llama, deepseek-distill, kokoro and more inside your browser

agents ai llama llm llm-inference local localllm tts webgpu

Last synced: 14 May 2025

https://github.com/PySpur-Dev/PySpur

Graph-Based Editor for LLM Workflows

agent agents ai javascript llm llm-inference openai python react workflow

Last synced: 14 Sep 2025

https://github.com/cloud-code-ai/browserai

Run local LLMs like llama, deepseek-distill, kokoro and more inside your browser

agents ai llama llm llm-inference local localllm tts webgpu

Last synced: 14 Apr 2025

https://github.com/PySpur-Dev/pyspur

Graph-Based Editor for LLM Workflows

agent agents ai javascript llm llm-inference openai python react workflow

Last synced: 07 Sep 2025

https://github.com/lean-dojo/LeanCopilot

LLMs as Copilots for Theorem Proving in Lean

formal-mathematics lean lean4 llm-inference machine-learning theorem-proving

Last synced: 09 Jul 2025

https://github.com/zhihu/zhilight

A highly optimized LLM inference acceleration engine for Llama and its variants.

cuda deepseek-r1 gpt inference-engine llama llm llm-inference llm-serving model-serving pytorch

Last synced: 15 May 2025

https://github.com/harleyszhang/llm_note

LLM notes, including model inference, transformer model structure, and llm framework code analysis notes.

cuda-programming kv-cache llm llm-inference transformer-models triton-kernels vllm

Last synced: 23 Aug 2025

https://github.com/SafeAILab/EAGLE

Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24)

large-language-models llm-inference speculative-decoding

Last synced: 20 Mar 2025

https://github.com/beam-cloud/beta9

Run serverless GPU workloads with fast cold starts on bare-metal servers, anywhere in the world

autoscaler cloudrun cuda developer-productivity distributed-computing faas fine-tuning functions-as-a-service generative-ai gpu large-language-models llm llm-inference ml-platform paas self-hosted serverless serverless-containers

Last synced: 14 Apr 2025

https://github.com/katanemo/archgw

Arch is an intelligent prompt gateway. Engineered with (fast) LLMs for the secure handling, robust observability, and seamless integration of prompts with your APIs - outside business logic. Built by the core contributors of Envoy proxy, on Envoy.

ai-gateway envoy envoyproxy gateway generative-ai llm-gateway llm-inference llm-routing llmops llms openai prompt proxy proxy-server routing

Last synced: 21 Oct 2025

https://github.com/inspector-apm/neuron-ai

The PHP Agent Development Kit - powered by Inspector.dev

agent agentic-ai agentic-framework agents ai llm llm-inference llms php vector-database

Last synced: 21 Jun 2025

https://github.com/stoyan-stoyanov/llmflows

LLMFlows - Simple, Explicit and Transparent LLM Apps

ai chatgpt gpt-4 llm llm-inference llmops llms machine-learning openai prompt-engineering python question-answering vector-database

Last synced: 14 May 2025

https://github.com/mukel/llama3.java

Practical Llama 3 inference in Java

chatgpt genai gguf huggingface java llama llama3 llamacpp llm llm-inference llms openai simd transformers

Last synced: 15 May 2025

https://github.com/eastriverlee/llm.swift

LLM.swift is a simple and readable library that allows you to interact with large language models locally with ease for macOS, iOS, watchOS, tvOS, and visionOS.

gguf ios llm llm-inference macos swift tvos visionos watchos

Last synced: 15 May 2025

https://github.com/run-ai/genv

GPU environment and cluster management with LLM support

bash container-runtime containers data-science deep-learning docker gpu gpus jupyter-notebook jupyterlab-extension k8s kubernetes llm-inference llms nvidia-gpu ollama ray vscode vscode-extension zsh

Last synced: 16 May 2025

https://github.com/foldl/chatllm.cpp

Pure C++ implementation of several models for real-time chatting on your computer (CPU & GPU)

llm llm-inference

Last synced: 15 May 2025

https://github.com/rohan-paul/LLM-FineTuning-Large-Language-Models

LLM (Large Language Model) FineTuning

gpt-3 gpt3-turbo large-language-models llama2 llm llm-finetuning llm-inference llm-serving llm-training mistral-7b open-source-llm pytorch

Last synced: 16 Oct 2025

https://github.com/zeux/calm

CUDA/Metal accelerated language model inference

cuda llm-inference ml

Last synced: 10 Apr 2025

https://github.com/eastriverlee/LLM.swift

LLM.swift is a simple and readable library that allows you to interact with large language models locally with ease for macOS, iOS, watchOS, tvOS, and visionOS.

gguf ios llm llm-inference macos swift tvos visionos watchos

Last synced: 11 Mar 2025

https://github.com/rohan-paul/llm-finetuning-large-language-models

LLM (Large Language Model) FineTuning

gpt-3 gpt3-turbo large-language-models llama2 llm llm-finetuning llm-inference llm-serving llm-training mistral-7b open-source-llm pytorch

Last synced: 04 Apr 2025

https://github.com/nano-collective/nanocoder

A beautiful local-first coding agent running in your terminal - built by the community for the community ⚒

ai ai-agents ai-coding coding-agents llm llm-inference ollama openai openrouter

Last synced: 05 Jan 2026

https://github.com/stanford-mast/blast

Browser-LLM Auto-Scaling Technology

ai-agents browser-automation llm-inference python

Last synced: 11 May 2025

https://github.com/michael-a-kuykendall/shimmy

⚡ Python-free Rust inference server — OpenAI-API compatible. GGUF + SafeTensors, hot model swap, auto-discovery, single binary. FREE now, FREE forever.

api-server command-line-tool developer-tools gguf huggingface huggingface-models huggingface-transformers inference-server llama llamacpp llm-inference local-ai lora machine-learning ollama-api openai-compatible rust rust-crate transformers

Last synced: 13 Sep 2025

https://github.com/feifeibear/long-context-attention

USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference

attention-is-all-you-need deepspeed-ulysses llm-inference llm-training pytorch ring-attention

Last synced: 14 May 2025

https://github.com/Nano-Collective/nanocoder

A beautiful local-first coding agent running in your terminal - built by the community for the community ⚒

ai ai-agents ai-coding coding-agents llm llm-inference ollama openai openrouter

Last synced: 23 Sep 2025

https://github.com/hpcaitech/swiftinfer

Efficient AI Inference & Serving

artificial-intelligence deep-learning gpt inference llama llama2 llm-inference llm-serving

Last synced: 05 Apr 2025

https://github.com/tilmangriesel/chipper

✨ AI interface for tinkerers (Ollama, Haystack RAG, Python)

agent agentic-ai deepseek deepseek-chat deepseek-r1 embedding hugging-face huggingface llama3 llm llm-inference ollama ollama-api ollama-client ollama-gui phi4 rag retrival-augmented-generation

Last synced: 16 May 2025

https://github.com/kenza-ai/sagify

LLMs and Machine Learning done easily

ai-gateway anthropic cohere generative-ai langchain langchain-python large-language-model large-language-models llm llm-inference llmops open-source-llm openai sagemaker

Last synced: 16 May 2025

https://github.com/flagai-open/aquila2

The official repo of Aquila2 series proposed by BAAI, including pretrained & chat large language models.

llm llm-inference llm-training

Last synced: 15 May 2025

https://github.com/vectorch-ai/ScaleLLM

A high-performance inference system for large language models, designed for production environments.

cuda efficiency gpu inference llama llama3 llm llm-inference model performance production serving speculative transformer

Last synced: 09 May 2025

https://github.com/vectorch-ai/scalellm

A high-performance inference system for large language models, designed for production environments.

cuda efficiency gpu inference llama llama3 llm llm-inference model performance production serving speculative transformer

Last synced: 14 Apr 2025

https://github.com/rizerphe/local-llm-function-calling

A tool for generating function arguments and choosing what function to call with local LLMs

chatgpt-functions huggingface-transformers json-schema llm llm-inference openai-function-call openai-functions

Last synced: 26 Oct 2025

https://github.com/FlagAI-Open/Aquila2

The official repo of Aquila2 series proposed by BAAI, including pretrained & chat large language models.

llm llm-inference llm-training

Last synced: 08 Apr 2025

https://github.com/preternatural-explore/mlx-swift-chat

A multi-platform SwiftUI frontend for running local LLMs with Apple's MLX framework.

ios llm-inference macos mlx mlx-swift swiftui

Last synced: 10 Apr 2025

https://github.com/jax-ml/scaling-book

Home for "How To Scale Your Model", a short blog-style textbook about scaling LLMs on TPUs

jax llm-inference llms roofline tpus

Last synced: 18 Jun 2025

https://github.com/felladrin/minisearch

Minimalist web-searching platform with an AI assistant that runs directly from your browser. Uses WebLLM, Wllama and SearXNG. Demo: https://felladrin-minisearch.hf.space

ai ai-search-engine artificial-intelligence generative-ai gpu-accelerated information-retrieval llm llm-inference metasearch metasearch-engine perplexity perplexity-ai question-answering rag retrieval-augmented-generation searxng web-llm web-search webapp wllama

Last synced: 12 Apr 2025

https://github.com/EulerSearch/embedding_studio

Embedding Studio is a framework which allows you transform your Vector Database into a feature-rich Search Engine.

embeddings embeddings-similarity fine-tuning llm-inference query-parser search-algorithm search-engine search-query-parser semantic-similarity unstructured-data unstructured-search vector-database

Last synced: 06 Aug 2025

https://github.com/felladrin/MiniSearch

Minimalist web-searching platform with an AI assistant that runs directly from your browser. Uses WebLLM, Wllama and SearXNG. Demo: https://felladrin-minisearch.hf.space

ai ai-search-engine artificial-intelligence generative-ai gpu-accelerated information-retrieval llm llm-inference metasearch metasearch-engine perplexity perplexity-ai question-answering rag retrieval-augmented-generation searxng web-llm web-search webapp wllama

Last synced: 24 Mar 2025

https://github.com/nvidia/star-attention

Efficient LLM Inference over Long Sequences

attention-mechanism large-language-models llm-inference

Last synced: 16 May 2025

https://github.com/microsoft/sarathi-serve

A low-latency & high-throughput serving engine for LLMs

llama llm-inference pytorch transformer

Last synced: 16 May 2025

https://github.com/intel/neural-speed

An innovative library for efficient LLM inference via low-bit quantization

cpu fp4 fp8 gaudi2 gpu int1 int2 int3 int4 int5 int6 int7 int8 llamacpp llm-fine-tuning llm-inference low-bit mxformat nf4 sparsity

Last synced: 25 Oct 2025

https://github.com/zjhellofss/KuiperLLama

校招、秋招、春招、实习好项目，带你从零动手实现支持LLama2/3和Qwen2.5的大模型推理框架。

cpp cuda inference-engine llama2 llama3 llm llm-inference qwen qwen2

Last synced: 08 Sep 2025

https://github.com/devflowinc/uzi

CLI for running large numbers of coding agents in parallel with git worktrees

agentic-ai ai codegen go golang llm llm-inference parallelization

Last synced: 24 Jun 2025

https://github.com/zjhellofss/kuiperllama

校招、秋招、春招、实习好项目，带你从零动手实现支持LLama2/3和Qwen2.5的大模型推理框架。

cpp cuda inference-engine llama2 llama3 llm llm-inference qwen qwen2

Last synced: 16 May 2025

https://github.com/ai-hypercomputer/jetstream

JetStream is a throughput and memory optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs welcome).

gemma gpt gpu inference jax large-language-models llama llama2 llm llm-inference llmops mlops model-serving pytorch tpu transformer

Last synced: 23 Oct 2025

https://github.com/ugorsahin/TalkingHeads

A library to communicate with ChatGPT, Claude, Copilot, Gemini, HuggingChat, and Pi

browser-automation chatgpt chatgpt-api claude copilot free gemini huggingchat llm-inference python selenium undetected-chromedriver

Last synced: 11 Apr 2025

https://github.com/alipay/painlessinferenceacceleration

Accelerate inference without tears

llm-inference

Last synced: 16 May 2025

https://github.com/ugorsahin/talkingheads

A library to communicate with ChatGPT, Claude, Copilot, Gemini, HuggingChat, and Pi

browser-automation chatgpt chatgpt-api claude copilot free gemini huggingchat llm-inference python selenium undetected-chromedriver

Last synced: 15 May 2025

https://github.com/unifyai/unify

Notion for AI Observability 📊

ai claude gpt gpt-4 llama2 llm llm-inference llms mixtral openai python

Last synced: 15 May 2025

https://github.com/AI-Hypercomputer/JetStream

JetStream is a throughput and memory optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs welcome).

gemma gpt gpu inference jax large-language-models llama llama2 llm llm-inference llmops mlops model-serving pytorch tpu transformer

Last synced: 31 Mar 2025

https://github.com/morpheuslord/hackbot

AI-powered cybersecurity chatbot designed to provide helpful and accurate answers to your cybersecurity-related queries and also do code analysis and scan analysis.

ai automation chatbot cli-chat-app cybersecurity cybersecurity-education cybersecurity-tools llama-api llama2 llama2-7b llamacpp llm-inference runpod

Last synced: 08 Oct 2025

https://github.com/ray-project/ray-educational-materials

This is suite of the hands-on training materials that shows how to scale CV, NLP, time-series forecasting workloads with Ray.

deep-learning distributed-machine-learning generative-ai llm llm-inference llm-serving ray ray-data ray-distributed ray-serve ray-train ray-tune

Last synced: 08 May 2025

https://github.com/andrewkchan/deepseek.cpp

CPU inference for the DeepSeek family of large language models in pure C++

cpp deepseek llama llm llm-inference machine-learning transformers

Last synced: 16 May 2025

https://github.com/andrewkchan/yalm

Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O

cpp cuda inference-engine llama llamacpp llm llm-inference machine-learning mistral

Last synced: 12 Apr 2025

https://github.com/armbues/SiLLM

SiLLM simplifies the process of training and running Large Language Models (LLMs) on Apple Silicon by leveraging the MLX framework.

apple-silicon dpo large-language-models llm llm-inference llm-training lora mlx

Last synced: 18 Jul 2025

https://github.com/structuredllm/syncode

Efficient and general syntactical decoding for Large Language Models

grammar large-language-models llm llm-inference parser

Last synced: 11 May 2025

https://github.com/JinjieNi/MixEval

The official evaluation suite and dynamic data release for MixEval.

benchmark benchmark-mixture benchmarking-framework benchmarking-suite evaluation evaluation-framework foundation-models large-language-model large-language-models large-multimodal-models llm-evaluation llm-evaluation-framework llm-inference mixeval

Last synced: 14 Sep 2025

https://github.com/modelscope/dash-infer

DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including CUDA, x86 and ARMv9.

cpu cuda guided-decoding llm llm-inference native-engine

Last synced: 12 Apr 2025

https://github.com/inferflow/inferflow

Inferflow is an efficient and highly configurable inference engine for large language models (LLMs).

baichuan2 bloom deepseek falcon gemma internlm llama2 llamacpp llm-inference m2m100 minicpm mistral mixtral mixture-of-experts model-quantization moe multi-gpu-inference phi-2 qwen

Last synced: 07 Apr 2025

https://github.com/expectedparrot/edsl

Design, conduct and analyze results of AI-powered surveys and experiments. Simulate social science and market research with large numbers of AI agents and LLMs.

anthropic data-labeling deepinfra domain-specific-language experiments llama2 llm llm-agent llm-framework llm-inference market-research mixtral open-source openai python social-science surveys synthetic-data

Last synced: 15 May 2025

https://github.com/picovoice/picollm

On-device LLM Inference Powered by X-Bit Quantization

compression efficient-inference gemma generative-ai language-model language-models large-language-model llama llama2 llama3 llm llm-inference llms mistral mixtral model-compression natural-language-processing quantization self-hosted

Last synced: 23 Oct 2025