Projects in Awesome Lists tagged with llm-inference
A curated list of projects in awesome lists tagged with llm-inference.
https://github.com/nomic-ai/gpt4all
GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
Last synced: 24 Sep 2025
https://github.com/ray-project/ray
Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
data-science deep-learning deployment distributed hyperparameter-optimization hyperparameter-search large-language-models llm llm-inference llm-serving machine-learning optimization parallel python pytorch ray reinforcement-learning rllib serving tensorflow
Last synced: 09 Sep 2025
https://microsoft.github.io/autogen/
A programming framework for agentic AI 🤖
agent-based-framework agent-oriented-programming agentic agentic-agi chat chat-application chatbot chatgpt gpt gpt-35-turbo gpt-4 llm-agent llm-framework llm-inference llmops
Last synced: 08 May 2025
https://github.com/gitleaks/gitleaks
Find secrets with Gitleaks 🔑
ai-powered ci-cd cicd cli data-loss-prevention devsecops dlp git gitleaks go golang hacktoberfest llm llm-inference llm-training open-source secret security security-tools
Last synced: 15 Dec 2025
https://github.com/liguodongiot/llm-action
This project shares the technical principles behind large language models along with hands-on experience (LLM engineering and production deployment).
llm llm-inference llm-serving llm-training llmops
Last synced: 15 May 2025
https://github.com/lightning-ai/litgpt
20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
ai artificial-intelligence deep-learning large-language-models llm llm-inference llms
Last synced: 12 May 2025
https://github.com/bentoml/openllm
Run any open-source LLMs, such as DeepSeek and Llama, as OpenAI-compatible API endpoints in the cloud.
bentoml fine-tuning llama llama2 llama3-1 llama3-2 llama3-2-vision llm llm-inference llm-ops llm-serving llmops mistral mlops model-inference open-source-llm openllm vicuna
Last synced: 23 Oct 2025
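OpenLLM, like several other servers in this list (Shimmy, GPUStack), exposes an OpenAI-compatible API, so any client that can POST a chat-completions payload works against it. A minimal sketch using only the Python standard library; the base URL, port, and model name are placeholders, not values prescribed by any of these projects:

```python
import json
import urllib.request

def chat_completion_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending this with urllib.request.urlopen(req) would hit a locally running server.
req = chat_completion_request("http://localhost:3000", "my-local-model", "Hello!")
```

Because the wire format is shared, swapping one inference server for another usually only means changing the base URL.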
https://github.com/mistralai/mistral-inference
Official inference library for Mistral models
Last synced: 12 May 2025
https://github.com/openvinotoolkit/openvino
OpenVINO™ is an open source toolkit for optimizing and deploying AI inference
ai computer-vision deep-learning deploy-ai diffusion-models generative-ai good-first-issue inference llm-inference natural-language-processing nlp openvino optimize-ai performance-boost recommendation-system speech-recognition stable-diffusion transformers yolo
Last synced: 12 May 2025
https://github.com/sjtu-ipads/powerinfer
High-speed Large Language Model Serving for Local Deployment
large-language-models llama llm llm-inference local-inference
Last synced: 12 May 2025
https://github.com/bentoml/bentoml
The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!
ai-inference deep-learning generative-ai inference-platform llm llm-inference llm-serving llmops machine-learning ml-engineering mlops model-inference-service model-serving multimodal python
Last synced: 12 May 2025
https://github.com/internlm/lmdeploy
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
codellama cuda-kernels deepspeed fastertransformer internlm llama llama2 llama3 llm llm-inference turbomind
Last synced: 24 Dec 2025
https://github.com/superduper-io/superduper
Superduper: End-to-end framework for building custom AI applications and agents.
ai chatbot data database distributed-ml inference llm-inference llm-serving llmops ml mlops mongodb pretrained-models python pytorch rag semantic-search torch transformers vector-search
Last synced: 14 May 2025
https://github.com/kserve/kserve
Standardized Serverless ML Inference Platform on Kubernetes
artificial-intelligence genai hacktoberfest istio k8s knative kserve kubeflow kubernetes llm-inference machine-learning mlops model-interpretability model-serving pytorch service-mesh sklearn tensorflow xgboost
Last synced: 13 May 2025
https://github.com/deftruth/awesome-llm-inference
📖A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism, etc. 🎉🎉
awesome-llm deepseek deepseek-r1 deepseek-v3 flash-attention flash-attention-3 flash-mla llm-inference minimax-01 mla paged-attention tensorrt-llm vllm
Last synced: 04 Apr 2025
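Several entries in this list reference PagedAttention, which stores the KV cache in fixed-size physical blocks indexed through a per-sequence block table rather than one contiguous buffer per sequence, so memory is allocated on demand and freed blocks can be reused across requests. A toy pure-Python sketch of just the bookkeeping (block size and pool size are arbitrary illustrative values):

```python
BLOCK_SIZE = 4  # tokens per KV-cache block (toy value)

class PagedKVCache:
    """Toy allocator: maps each sequence's logical token positions to
    physical blocks drawn from a shared free pool."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids

    def append_token(self, seq_id: int, position: int) -> tuple[int, int]:
        """Return (physical_block, offset) where this token's KV entry lives."""
        table = self.block_tables.setdefault(seq_id, [])
        if position // BLOCK_SIZE >= len(table):      # current block full: grab a new one
            table.append(self.free_blocks.pop())
        return table[position // BLOCK_SIZE], position % BLOCK_SIZE

cache = PagedKVCache(num_blocks=8)
slots = [cache.append_token(seq_id=0, position=p) for p in range(6)]
# 6 tokens with block size 4 occupy exactly 2 physical blocks
```

The real kernels (vLLM, FlashInfer) do the attention math over these scattered blocks on the GPU; the sketch only shows why no sequence needs a pre-reserved maximum-length buffer.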
https://github.com/fellouai/eko
Eko (Eko Keeps Operating) - Build Production-ready Agentic Workflow with Natural Language - eko.fellou.ai
agent agentic-ai agentic-ai-development agentic-framework agentic-workflow agents ai-agents browser-automation browseruse chain-of-thought computer-automation computeruse genai llm-agents llm-inference llmapi natural-language-inference prompt-engineering rag workflow
Last synced: 29 Dec 2025
https://github.com/neuralmagic/deepsparse
Sparsity-aware deep learning inference runtime for CPUs
computer-vision cpus deepsparse inference llm-inference machinelearning nlp object-detection onnx performance pretrained-models pruning quantization sparsification
Last synced: 14 May 2025
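DeepSparse exploits weight sparsity: weights pruned to zero can be skipped entirely at inference time instead of being multiplied. The idea in miniature, as a hypothetical pure-Python sketch (real runtimes use compressed sparse formats and vectorized kernels, not per-element checks):

```python
def sparse_dot(weights: list[float], activations: list[float]) -> float:
    """Dot product that skips zero weights - the core saving in sparse inference."""
    return sum(w * a for w, a in zip(weights, activations) if w != 0.0)

# A 75%-sparse weight row touches only a quarter of the activations.
row = [0.0, 2.0, 0.0, 0.0, -1.0, 0.0, 0.0, 0.5]
acts = [1.0] * 8
result = sparse_dot(row, acts)  # 2.0 - 1.0 + 0.5 = 1.5
```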
https://github.com/nvidia/generativeaiexamples
Generative AI reference workflows optimized for accelerated infrastructure and microservice architecture.
gpu-acceleration large-language-models llm llm-inference microservice nemo rag retrieval-augmented-generation tensorrt triton-inference-server
Last synced: 13 May 2025
https://github.com/predibase/lorax
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
fine-tuning gpt llama llm llm-inference llm-serving llmops lora model-serving pytorch transformers
Last synced: 12 May 2025
https://github.com/flashinfer-ai/flashinfer
FlashInfer: Kernel Library for LLM Serving
attention cuda gpu jit large-large-models llm-inference nvidia pytorch
Last synced: 05 Jan 2026
https://github.com/gpustack/gpustack
Manage GPU clusters for running AI models
ascend cuda deepseek distributed distributed-inference genai ggml inference llama llamacpp llm llm-inference llm-serving maas metal mindie openai qwen rocm vllm
Last synced: 23 Oct 2025
https://github.com/databricks/dbrx
Code examples and resources for DBRX, a large language model developed by Databricks
databricks gen-ai generative-ai llm llm-inference llm-training mosaic-ai
Last synced: 25 Oct 2025
https://github.com/fasterdecoding/medusa
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
Last synced: 14 May 2025
https://github.com/codelion/optillm
Optimizing inference proxy for LLMs
agent agentic-ai agentic-framework agentic-workflow agents api-gateway chain-of-thought genai large-language-models llm llm-inference llmapi mixture-of-experts moa monte-carlo-tree-search openai openai-api optimization prompt-engineering proxy-server
Last synced: 10 Jun 2025
https://github.com/codelion/openevolve
Open-source implementation of AlphaEvolve
alpha-evolve alphacode alphaevolve coding-agent deepmind deepmind-lab discovery distributed-evolutionary-algorithms evolutionary-algorithms evolutionary-computation genetic-algorithm genetic-algorithms iterative-methods iterative-refinement llm-engineering llm-ensemble llm-inference openevolve optimize
Last synced: 10 Jun 2025
https://github.com/intel/intel-extension-for-transformers
⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡
4-bits autoround chatbot chatpdf gaudi3 habana intel-optimized-llamacpp large-language-model llm-cpu llm-inference neural-chat neural-chat-7b rag retrieval speculative-decoding streamingllm
Last synced: 24 Feb 2025
https://github.com/microsoft/aici
AICI: Prompts as (Wasm) Programs
ai inference language-model llm llm-framework llm-inference llm-serving llmops model-serving rust transformer wasm wasmtime
Last synced: 14 May 2025
https://github.com/b4rtaz/distributed-llama
Connect home devices into a powerful cluster to accelerate LLM inference. More devices means faster inference.
distributed-computing distributed-llm llama2 llama3 llm llm-inference llms neural-network open-llm
Last synced: 13 Apr 2025
https://github.com/liltom-eth/llama2-webui
Run any Llama 2 locally with gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). Use `llama2-wrapper` as your local llama2 backend for Generative Agents/Apps.
llama-2 llama2 llm llm-inference
Last synced: 14 May 2025
https://github.com/ray-project/ray-llm
RayLLM - LLMs on Ray
distributed-systems large-language-models llm llm-inference llm-serving llmops ray serving transformers
Last synced: 25 Feb 2025
https://github.com/sauravpanda/browserai
Run local LLMs like llama, deepseek-distill, kokoro and more inside your browser
agents ai llama llm llm-inference local localllm tts webgpu
Last synced: 14 May 2025
https://github.com/PySpur-Dev/PySpur
Graph-Based Editor for LLM Workflows
agent agents ai javascript llm llm-inference openai python react workflow
Last synced: 14 Sep 2025
https://github.com/lean-dojo/LeanCopilot
LLMs as Copilots for Theorem Proving in Lean
formal-mathematics lean lean4 llm-inference machine-learning theorem-proving
Last synced: 09 Jul 2025
https://github.com/zhihu/zhilight
A highly optimized LLM inference acceleration engine for Llama and its variants.
cuda deepseek-r1 gpt inference-engine llama llm llm-inference llm-serving model-serving pytorch
Last synced: 15 May 2025
https://github.com/harleyszhang/llm_note
LLM notes, including model inference, transformer model structure, and llm framework code analysis notes.
cuda-programming kv-cache llm llm-inference transformer-models triton-kernels vllm
Last synced: 23 Aug 2025
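The KV-cache material referenced above covers the standard decoding optimization: keys and values for past tokens are computed once and reused, so each new step projects only the newest token. A toy counter-based sketch of the saving (the "projection" is a stand-in counter, not real attention math):

```python
class KVCache:
    """Counts projection work to show why caching matters during decoding."""

    def __init__(self):
        self.keys: list[int] = []
        self.projections = 0

    def _project(self, token: int) -> int:
        self.projections += 1           # stand-in for the per-token K/V matmul
        return token * 2

    def step(self, tokens: list[int], use_cache: bool) -> None:
        """Process one decoding step over the full sequence `tokens`."""
        if use_cache:
            self.keys.append(self._project(tokens[-1]))     # only the new token
        else:
            self.keys = [self._project(t) for t in tokens]  # recompute everything

def decode(n_steps: int, use_cache: bool) -> int:
    cache = KVCache()
    seq = [0]
    for t in range(1, n_steps + 1):
        seq.append(t)
        cache.step(seq, use_cache)
    return cache.projections

work_cached, work_uncached = decode(10, True), decode(10, False)
# caching does O(n) total projections instead of O(n^2)
```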
https://github.com/SafeAILab/EAGLE
Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24)
large-language-models llm-inference speculative-decoding
Last synced: 20 Mar 2025
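EAGLE and Medusa (above) both build on speculative decoding: a cheap draft proposes several tokens, the target model verifies them in one pass, and the longest agreeing prefix is accepted. A toy sketch where both "models" are deterministic placeholder functions, not real LLMs:

```python
def draft_model(prefix: list[int], k: int) -> list[int]:
    """Toy draft: guesses the next k tokens as prefix[-1]+1, +2, ..."""
    return [prefix[-1] + i + 1 for i in range(k)]

def target_model(prefix: list[int]) -> int:
    """Toy target: the 'true' next token; diverges from the draft at multiples of 3."""
    nxt = prefix[-1] + 1
    return nxt if nxt % 3 != 0 else nxt + 10

def speculative_step(prefix: list[int], k: int = 4) -> list[int]:
    """Accept drafted tokens until the first mismatch, where the target's
    own token is substituted; one step can emit several tokens."""
    drafted = draft_model(prefix, k)
    accepted: list[int] = []
    for tok in drafted:
        true_tok = target_model(prefix + accepted)
        if tok != true_tok:
            accepted.append(true_tok)   # target's correction replaces the miss
            break
        accepted.append(tok)
    return prefix + accepted

out = speculative_step([0], k=4)  # three tokens emitted in a single step
```

Real implementations sample rather than compare greedily, but the accept-prefix-then-correct structure is the same; Medusa's contribution is generating the draft from extra decoding heads on the target model itself.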
https://github.com/beam-cloud/beta9
Run serverless GPU workloads with fast cold starts on bare-metal servers, anywhere in the world
autoscaler cloudrun cuda developer-productivity distributed-computing faas fine-tuning functions-as-a-service generative-ai gpu large-language-models llm llm-inference ml-platform paas self-hosted serverless serverless-containers
Last synced: 14 Apr 2025
https://github.com/katanemo/archgw
Arch is an intelligent prompt gateway. Engineered with (fast) LLMs for the secure handling, robust observability, and seamless integration of prompts with your APIs - outside business logic. Built by the core contributors of Envoy proxy, on Envoy.
ai-gateway envoy envoyproxy gateway generative-ai llm-gateway llm-inference llm-routing llmops llms openai prompt proxy proxy-server routing
Last synced: 21 Oct 2025
https://github.com/inspector-apm/neuron-ai
The PHP Agent Development Kit - powered by Inspector.dev
agent agentic-ai agentic-framework agents ai llm llm-inference llms php vector-database
Last synced: 21 Jun 2025
https://github.com/stoyan-stoyanov/llmflows
LLMFlows - Simple, Explicit and Transparent LLM Apps
ai chatgpt gpt-4 llm llm-inference llmops llms machine-learning openai prompt-engineering python question-answering vector-database
Last synced: 14 May 2025
https://github.com/mukel/llama3.java
Practical Llama 3 inference in Java
chatgpt genai gguf huggingface java llama llama3 llamacpp llm llm-inference llms openai simd transformers
Last synced: 15 May 2025
https://github.com/eastriverlee/llm.swift
LLM.swift is a simple and readable library that allows you to interact with large language models locally with ease for macOS, iOS, watchOS, tvOS, and visionOS.
gguf ios llm llm-inference macos swift tvos visionos watchos
Last synced: 15 May 2025
https://github.com/run-ai/genv
GPU environment and cluster management with LLM support
bash container-runtime containers data-science deep-learning docker gpu gpus jupyter-notebook jupyterlab-extension k8s kubernetes llm-inference llms nvidia-gpu ollama ray vscode vscode-extension zsh
Last synced: 16 May 2025
https://github.com/foldl/chatllm.cpp
Pure C++ implementation of several models for real-time chatting on your computer (CPU & GPU)
Last synced: 15 May 2025
https://github.com/rohan-paul/LLM-FineTuning-Large-Language-Models
LLM (Large Language Model) FineTuning
gpt-3 gpt3-turbo large-language-models llama2 llm llm-finetuning llm-inference llm-serving llm-training mistral-7b open-source-llm pytorch
Last synced: 16 Oct 2025
https://github.com/zeux/calm
CUDA/Metal accelerated language model inference
Last synced: 10 Apr 2025
https://github.com/nano-collective/nanocoder
A beautiful local-first coding agent running in your terminal - built by the community for the community ⚒
ai ai-agents ai-coding coding-agents llm llm-inference ollama openai openrouter
Last synced: 05 Jan 2026
https://github.com/stanford-mast/blast
Browser-LLM Auto-Scaling Technology
ai-agents browser-automation llm-inference python
Last synced: 11 May 2025
https://github.com/michael-a-kuykendall/shimmy
⚡ Python-free Rust inference server — OpenAI-API compatible. GGUF + SafeTensors, hot model swap, auto-discovery, single binary. FREE now, FREE forever.
api-server command-line-tool developer-tools gguf huggingface huggingface-models huggingface-transformers inference-server llama llamacpp llm-inference local-ai lora machine-learning ollama-api openai-compatible rust rust-crate transformers
Last synced: 13 Sep 2025
https://github.com/feifeibear/long-context-attention
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference
attention-is-all-you-need deepspeed-ulysses llm-inference llm-training pytorch ring-attention
Last synced: 14 May 2025
https://github.com/hpcaitech/swiftinfer
Efficient AI Inference & Serving
artificial-intelligence deep-learning gpt inference llama llama2 llm-inference llm-serving
Last synced: 05 Apr 2025
https://github.com/tilmangriesel/chipper
✨ AI interface for tinkerers (Ollama, Haystack RAG, Python)
agent agentic-ai deepseek deepseek-chat deepseek-r1 embedding hugging-face huggingface llama3 llm llm-inference ollama ollama-api ollama-client ollama-gui phi4 rag retrival-augmented-generation
Last synced: 16 May 2025
https://github.com/kenza-ai/sagify
LLMs and Machine Learning done easily
ai-gateway anthropic cohere generative-ai langchain langchain-python large-language-model large-language-models llm llm-inference llmops open-source-llm openai sagemaker
Last synced: 16 May 2025
https://github.com/flagai-open/aquila2
The official repo of Aquila2 series proposed by BAAI, including pretrained & chat large language models.
llm llm-inference llm-training
Last synced: 15 May 2025
https://github.com/vectorch-ai/ScaleLLM
A high-performance inference system for large language models, designed for production environments.
cuda efficiency gpu inference llama llama3 llm llm-inference model performance production serving speculative transformer
Last synced: 09 May 2025
https://github.com/rizerphe/local-llm-function-calling
A tool for generating function arguments and choosing what function to call with local LLMs
chatgpt-functions huggingface-transformers json-schema llm llm-inference openai-function-call openai-functions
Last synced: 26 Oct 2025
https://github.com/preternatural-explore/mlx-swift-chat
A multi-platform SwiftUI frontend for running local LLMs with Apple's MLX framework.
ios llm-inference macos mlx mlx-swift swiftui
Last synced: 10 Apr 2025
https://github.com/jax-ml/scaling-book
Home for "How To Scale Your Model", a short blog-style textbook about scaling LLMs on TPUs
jax llm-inference llms roofline tpus
Last synced: 18 Jun 2025
https://github.com/felladrin/minisearch
Minimalist web-searching platform with an AI assistant that runs directly from your browser. Uses WebLLM, Wllama and SearXNG. Demo: https://felladrin-minisearch.hf.space
ai ai-search-engine artificial-intelligence generative-ai gpu-accelerated information-retrieval llm llm-inference metasearch metasearch-engine perplexity perplexity-ai question-answering rag retrieval-augmented-generation searxng web-llm web-search webapp wllama
Last synced: 12 Apr 2025
https://github.com/EulerSearch/embedding_studio
Embedding Studio is a framework that lets you transform your Vector Database into a feature-rich Search Engine.
embeddings embeddings-similarity fine-tuning llm-inference query-parser search-algorithm search-engine search-query-parser semantic-similarity unstructured-data unstructured-search vector-database
Last synced: 06 Aug 2025
https://github.com/nvidia/star-attention
Efficient LLM Inference over Long Sequences
attention-mechanism large-language-models llm-inference
Last synced: 16 May 2025
https://github.com/microsoft/sarathi-serve
A low-latency & high-throughput serving engine for LLMs
llama llm-inference pytorch transformer
Last synced: 16 May 2025
https://github.com/zjhellofss/KuiperLLama
A great project for campus recruiting and internship preparation: build an LLM inference framework from scratch that supports Llama 2/3 and Qwen 2.5.
cpp cuda inference-engine llama2 llama3 llm llm-inference qwen qwen2
Last synced: 08 Sep 2025
https://github.com/devflowinc/uzi
CLI for running large numbers of coding agents in parallel with git worktrees
agentic-ai ai codegen go golang llm llm-inference parallelization
Last synced: 24 Jun 2025
https://github.com/ai-hypercomputer/jetstream
JetStream is a throughput and memory optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs welcome).
gemma gpt gpu inference jax large-language-models llama llama2 llm llm-inference llmops mlops model-serving pytorch tpu transformer
Last synced: 23 Oct 2025
https://github.com/ugorsahin/TalkingHeads
A library to communicate with ChatGPT, Claude, Copilot, Gemini, HuggingChat, and Pi
browser-automation chatgpt chatgpt-api claude copilot free gemini huggingchat llm-inference python selenium undetected-chromedriver
Last synced: 11 Apr 2025
https://github.com/alipay/painlessinferenceacceleration
Accelerate inference without tears
Last synced: 16 May 2025
https://github.com/morpheuslord/hackbot
AI-powered cybersecurity chatbot designed to provide helpful and accurate answers to your cybersecurity-related queries and also do code analysis and scan analysis.
ai automation chatbot cli-chat-app cybersecurity cybersecurity-education cybersecurity-tools llama-api llama2 llama2-7b llamacpp llm-inference runpod
Last synced: 08 Oct 2025
https://github.com/ray-project/ray-educational-materials
This is suite of the hands-on training materials that shows how to scale CV, NLP, time-series forecasting workloads with Ray.
deep-learning distributed-machine-learning generative-ai llm llm-inference llm-serving ray ray-data ray-distributed ray-serve ray-train ray-tune
Last synced: 08 May 2025
https://github.com/andrewkchan/deepseek.cpp
CPU inference for the DeepSeek family of large language models in pure C++
cpp deepseek llama llm llm-inference machine-learning transformers
Last synced: 16 May 2025
https://github.com/andrewkchan/yalm
Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O
cpp cuda inference-engine llama llamacpp llm llm-inference machine-learning mistral
Last synced: 12 Apr 2025
https://github.com/armbues/SiLLM
SiLLM simplifies the process of training and running Large Language Models (LLMs) on Apple Silicon by leveraging the MLX framework.
apple-silicon dpo large-language-models llm llm-inference llm-training lora mlx
Last synced: 18 Jul 2025
https://github.com/structuredllm/syncode
Efficient and general syntactical decoding for Large Language Models
grammar large-language-models llm llm-inference parser
Last synced: 11 May 2025
https://github.com/JinjieNi/MixEval
The official evaluation suite and dynamic data release for MixEval.
benchmark benchmark-mixture benchmarking-framework benchmarking-suite evaluation evaluation-framework foundation-models large-language-model large-language-models large-multimodal-models llm-evaluation llm-evaluation-framework llm-inference mixeval
Last synced: 14 Sep 2025
https://github.com/modelscope/dash-infer
DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including CUDA, x86 and ARMv9.
cpu cuda guided-decoding llm llm-inference native-engine
Last synced: 12 Apr 2025
https://github.com/inferflow/inferflow
Inferflow is an efficient and highly configurable inference engine for large language models (LLMs).
baichuan2 bloom deepseek falcon gemma internlm llama2 llamacpp llm-inference m2m100 minicpm mistral mixtral mixture-of-experts model-quantization moe multi-gpu-inference phi-2 qwen
Last synced: 07 Apr 2025
https://github.com/expectedparrot/edsl
Design, conduct and analyze results of AI-powered surveys and experiments. Simulate social science and market research with large numbers of AI agents and LLMs.
anthropic data-labeling deepinfra domain-specific-language experiments llama2 llm llm-agent llm-framework llm-inference market-research mixtral open-source openai python social-science surveys synthetic-data
Last synced: 15 May 2025
https://github.com/picovoice/picollm
On-device LLM Inference Powered by X-Bit Quantization
compression efficient-inference gemma generative-ai language-model language-models large-language-model llama llama2 llama3 llm llm-inference llms mistral mixtral model-compression natural-language-processing quantization self-hosted
Last synced: 23 Oct 2025
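picoLLM's "X-bit quantization" compresses model weights to few-bit integers. The round-trip at the heart of any such scheme, sketched for symmetric 4-bit quantization in pure Python (per-tensor scaling only; real schemes use per-group scales, and picoLLM's exact method is its own):

```python
def quantize_4bit(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric 4-bit quantization: map floats to integers in [-7, 7]."""
    scale = max(abs(w) for w in weights) / 7.0
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

w = [0.31, -0.7, 0.06, 0.44]
q, scale = quantize_4bit(w)
restored = dequantize(q, scale)
# every restored weight is within half a quantization step of the original
max_err = max(abs(a - b) for a, b in zip(w, restored))
```

Each weight now needs 4 bits plus a shared scale instead of 32, which is why quantized models fit on phones and browsers at some cost in precision.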