Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with llm-inference

A curated list of projects in awesome lists tagged with llm-inference.

https://github.com/nomic-ai/gpt4all

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.

ai-chat llm-inference

Last synced: 13 Jan 2025

https://github.com/liguodongiot/llm-action

This project shares the technical principles behind large language models along with hands-on experience (LLM engineering and real-world LLM application deployment).

llm llm-inference llm-serving llm-training llmops

Last synced: 04 Dec 2024

https://github.com/lightning-ai/litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.

ai artificial-intelligence deep-learning large-language-models llm llm-inference llms

Last synced: 13 Jan 2025

https://github.com/bentoml/openllm

Run any open-source LLM, such as Llama 3.1 or Gemma, as an OpenAI-compatible API endpoint in the cloud.

bentoml fine-tuning llama llama2 llama3-1 llama3-2 llama3-2-vision llm llm-inference llm-ops llm-serving llmops mistral mlops model-inference open-source-llm openllm vicuna

Last synced: 20 Jan 2025

https://github.com/mistralai/mistral-inference

Official inference library for Mistral models

llm llm-inference mistralai

Last synced: 13 Jan 2025

https://github.com/sjtu-ipads/powerinfer

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs

bamboo-7b falcon large-language-models llama llm llm-inference local-inference

Last synced: 20 Jan 2025

https://github.com/bentoml/bentoml

The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and much more!

ai-inference deep-learning generative-ai inference-platform llm llm-inference llm-serving llmops machine-learning ml-engineering mlops model-inference-service model-serving multimodal python

Last synced: 13 Jan 2025

https://github.com/internlm/lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

codellama cuda-kernels deepspeed fastertransformer internlm llama llama2 llama3 llm llm-inference turbomind

Last synced: 20 Jan 2025

https://github.com/superduper-io/superduper

Superduper: Build end-to-end AI applications and agent workflows on your existing data infrastructure and preferred tools - without migrating your data.

ai chatbot data database distributed-ml inference llm-inference llm-serving llmops ml mlops mongodb pretrained-models python pytorch rag semantic-search torch transformers vector-search

Last synced: 14 Jan 2025

https://github.com/databricks/dbrx

Code examples and resources for DBRX, a large language model developed by Databricks

databricks gen-ai generative-ai llm llm-inference llm-training mosaic-ai

Last synced: 17 Jan 2025

https://github.com/fasterdecoding/medusa

Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads

llm llm-inference

Last synced: 15 Jan 2025

https://github.com/NVIDIA/GenerativeAIExamples

Generative AI reference workflows optimized for accelerated infrastructure and microservice architecture.

gpu-acceleration large-language-models llm llm-inference microservice nemo rag retrieval-augmented-generation tensorrt triton-inference-server

Last synced: 31 Oct 2024

https://github.com/predibase/lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

fine-tuning gpt llama llm llm-inference llm-serving llmops lora model-serving pytorch transformers

Last synced: 20 Jan 2025

https://github.com/liltom-eth/llama2-webui

Run any Llama 2 locally with gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). Use `llama2-wrapper` as your local llama2 backend for Generative Agents/Apps.

llama-2 llama2 llm llm-inference

Last synced: 16 Jan 2025

https://github.com/intel/intel-extension-for-transformers

⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡

4-bits chatbot chatpdf cpu gaudi2 gpu habana intel-optimized-llamacpp large-language-model llm-cpu llm-inference neural-chat neural-chat-7b neurips2023 rag retrieval speculative-decoding streamingllm xeon

Last synced: 15 Oct 2024

https://github.com/dstackai/dstack

dstack is a lightweight, open-source alternative to Kubernetes & Slurm, simplifying AI container orchestration with multi-cloud & on-prem support. It natively supports NVIDIA GPUs, AMD GPUs, and TPUs.

aws azure cloud fine-tuning gcp gpu k8s kubernetes llm-inference llm-training llmops llms machine-learning orchestration python training

Last synced: 15 Jan 2025

https://github.com/b4rtaz/distributed-llama

Tensor parallelism is all you need. Run LLMs on an AI cluster at home using any device. Distribute the workload, divide RAM usage, and increase inference speed.

distributed-computing distributed-llm llama2 llama3 llm llm-inference llms neural-network open-llm

Last synced: 16 Jan 2025

https://github.com/flashinfer-ai/flashinfer

FlashInfer: Kernel Library for LLM Serving

cuda flash-attention gpu jit large-large-models llm-inference pytorch

Last synced: 16 Jan 2025

https://github.com/PySpur-Dev/PySpur

Graph-Based Editor for LLM Workflows

agent agents ai javascript llm llm-inference openai python react workflow

Last synced: 08 Jan 2025

https://github.com/lean-dojo/LeanCopilot

LLMs as Copilots for Theorem Proving in Lean

formal-mathematics lean lean4 llm-inference machine-learning theorem-proving

Last synced: 20 Nov 2024

https://github.com/SafeAILab/EAGLE

Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24)

large-language-models llm-inference speculative-decoding

Last synced: 28 Oct 2024

https://github.com/katanemo/archgw

Arch is an intelligent prompt gateway. Engineered with (fast) LLMs for the secure handling, robust observability, and seamless integration of prompts with your APIs - outside business logic. Built by the core contributors of Envoy proxy, on Envoy.

ai-gateway envoy envoyproxy gateway generative-ai llm-gateway llm-inference llm-routing llmops llms openai prompt proxy proxy-server routing

Last synced: 22 Nov 2024

https://github.com/flagai-open/aquila2

The official repo of Aquila2 series proposed by BAAI, including pretrained & chat large language models.

llm llm-inference llm-training

Last synced: 18 Jan 2025

https://github.com/feifeibear/long-context-attention

USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference

attention-is-all-you-need deepspeed-ulysses llm-inference llm-training pytorch ring-attention

Last synced: 18 Jan 2025

https://github.com/vectorch-ai/ScaleLLM

A high-performance inference system for large language models, designed for production environments.

cuda efficiency gpu inference llama llama3 llm llm-inference model performance production serving speculative transformer

Last synced: 16 Nov 2024

https://github.com/preternatural-explore/mlx-swift-chat

A multi-platform SwiftUI frontend for running local LLMs with Apple's MLX framework.

ios llm-inference macos mlx mlx-swift swiftui

Last synced: 07 Nov 2024

https://github.com/rizerphe/local-llm-function-calling

A tool for generating function arguments and choosing what function to call with local LLMs

chatgpt-functions huggingface-transformers json-schema llm llm-inference openai-function-call openai-functions

Last synced: 18 Jan 2025

https://github.com/nvidia/star-attention

Efficient LLM Inference over Long Sequences

attention-mechanism large-language-models llm-inference

Last synced: 15 Jan 2025

https://github.com/eastriverlee/LLM.swift

LLM.swift is a simple and readable library that allows you to interact with large language models locally with ease for macOS, iOS, watchOS, tvOS, and visionOS.

gguf ios llm llm-inference macos swift tvos visionos watchos

Last synced: 23 Oct 2024

https://github.com/felladrin/minisearch

Minimalist web-searching platform with an AI assistant that runs directly from your browser. Uses WebLLM, Wllama and SearXNG. Demo: https://felladrin-minisearch.hf.space

ai ai-search-engine artificial-intelligence generative-ai gpu-accelerated information-retrieval llm llm-inference metasearch metasearch-engine perplexity perplexity-ai question-answering rag retrieval-augmented-generation searxng web-llm web-search webapp wllama

Last synced: 15 Jan 2025

https://github.com/ray-project/ray-educational-materials

This is suite of the hands-on training materials that shows how to scale CV, NLP, time-series forecasting workloads with Ray.

deep-learning distributed-machine-learning generative-ai llm llm-inference llm-serving ray ray-data ray-distributed ray-serve ray-train ray-tune

Last synced: 15 Nov 2024

https://github.com/ugorsahin/talkingheads

A library to communicate with ChatGPT, Claude, Copilot, Gemini, HuggingChat, and Pi

browser-automation chatgpt chatgpt-api claude copilot free gemini huggingchat llm-inference python selenium undetected-chromedriver

Last synced: 18 Jan 2025

https://github.com/morpheuslord/hackbot

AI-powered cybersecurity chatbot designed to provide helpful and accurate answers to your cybersecurity-related queries and also do code analysis and scan analysis.

ai automation chatbot cli-chat-app cybersecurity cybersecurity-education cybersecurity-tools llama-api llama2 llama2-7b llamacpp llm-inference runpod

Last synced: 20 Jan 2025

https://github.com/harleyszhang/llm_note

LLM notes, including model inference, transformer model structure, and llm framework code analysis notes

cuda-programming kv-cache llm llm-inference transformer-models triton-kernels vllm

Last synced: 21 Dec 2024

https://github.com/zjhellofss/kuiperllama

A great project for campus recruiting and internship preparation: build an LLM inference framework from scratch that supports LLaMA 2/3 and Qwen2.5.

cpp cuda inference-engine llama2 llama3 llm llm-inference qwen qwen2

Last synced: 13 Jan 2025

https://github.com/bytedance/abq-llm

An acceleration library that supports arbitrary bit-width combinatorial quantization operations

cuda llm-inference mlsys quantized-networks research

Last synced: 14 Jan 2025

https://github.com/inferflow/inferflow

Inferflow is an efficient and highly configurable inference engine for large language models (LLMs).

baichuan2 bloom deepseek falcon gemma internlm llama2 llamacpp llm-inference m2m100 minicpm mistral mixtral mixture-of-experts model-quantization moe multi-gpu-inference phi-2 qwen

Last synced: 14 Jan 2025

https://github.com/ai-hypercomputer/jetstream

JetStream is a throughput and memory optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs welcome).

gemma gpt gpu inference jax large-language-models llama llama2 llm llm-inference llmops mlops model-serving pytorch tpu transformer

Last synced: 20 Jan 2025

https://github.com/arc53/llm-price-compass

This project collects GPU benchmarks from various cloud providers and compares them to fixed per token costs. Use our tool for efficient LLM GPU selections and cost-effective AI models. LLM provider price comparison, gpu benchmarks to price per token calculation, gpu benchmark table

benchmark gpu hacktoberfest inference-comparison llm llm-comparison llm-inference llm-price

Last synced: 18 Jan 2025

https://github.com/expectedparrot/edsl

Design, conduct and analyze results of AI-powered surveys and experiments. Simulate social science and market research with large numbers of AI agents and LLMs.

anthropic data-labeling deepinfra domain-specific-language experiments llama2 llm llm-agent llm-framework llm-inference market-research mixtral open-source openai python social-science surveys synthetic-data

Last synced: 18 Jan 2025

https://github.com/C0deMunk33/bespoke_automata

Bespoke Automata is a GUI and deployment pipeline for making complex AI agents locally and offline

agents ai automation chatbots developer-tools llm-inference

Last synced: 06 Jan 2025

https://github.com/andrewkchan/yalm

Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O

cpp cuda inference-engine llama llamacpp llm llm-inference machine-learning mistral

Last synced: 18 Jan 2025

https://github.com/unifyai/unify

Build Your AI Workflow in Seconds ⚡

ai claude gpt gpt-4 llama2 llm llm-inference llms mixtral openai python

Last synced: 18 Jan 2025

https://github.com/uiuc-focal-lab/syncode

Efficient and general syntactical decoding for Large Language Models

large-language-models llm llm-inference parser

Last synced: 17 Nov 2024

https://github.com/fasterdecoding/rest

REST: Retrieval-Based Speculative Decoding, NAACL 2024

llm-inference retrieval speculative-decoding

Last synced: 18 Jan 2025

https://github.com/intel/neural-speed

An innovative library for efficient LLM inference via low-bit quantization

cpu fp4 fp8 gaudi2 gpu int4 int8 llamacpp llm-fine-tuning llm-inference low-bit mxformat nf4 sparsity

Last synced: 10 Oct 2024

https://github.com/efeslab/fiddler

Fast Inference of MoE Models with CPU-GPU Orchestration

llm llm-inference local-inference mixtral-8x7b mixture-of-experts

Last synced: 18 Jan 2025

https://github.com/MorpheusAIs/Morpheus

Morpheus - A Network For Powering Smart Agents - Compute + Code + Capital + Community

agents ai compute ethereum llm-inference llms smart-agents smart-contracts

Last synced: 13 Nov 2024

https://github.com/1b5d/llm-api

Run any Large Language Model behind a unified API

chatgpt gptq huggingface langchain llama llamacpp llm llm-inference machine-learning python

Last synced: 08 Jan 2025

https://github.com/cgbur/llama2.zig

Inference Llama 2 in one file of pure Zig

llama llama2 llm llm-inference simd zig ziglang

Last synced: 13 Jan 2025

https://github.com/Infini-AI-Lab/TriForce

[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

acceleration efficiency inference llm llm-inference long-context speculative-decoding

Last synced: 19 Nov 2024

https://github.com/armbues/SiLLM

SiLLM simplifies the process of training and running Large Language Models (LLMs) on Apple Silicon by leveraging the MLX framework.

apple-silicon dpo large-language-models llm llm-inference llm-training lora mlx

Last synced: 25 Nov 2024

https://github.com/bytedance/shadowkv

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

cpu-offload high-throughput llm-inference long-context low-rank research sparse-attention

Last synced: 14 Jan 2025

https://github.com/modelscope/dash-infer

DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including CUDA, x86 and ARMv9.

cpu cuda guided-decoding llm llm-inference native-engine

Last synced: 12 Jan 2025

https://github.com/chenhunghan/ialacol

🪶 Lightweight OpenAI drop-in replacement for Kubernetes

ai cloudnative cuda ggml gptq gpu helm kubernetes langchain llamacpp llm llm-inference llm-serving openai python

Last synced: 20 Jan 2025

https://github.com/vemonet/libre-chat

🦙 Free and Open Source Large Language Model (LLM) chatbot web UI and API. Self-hosted, offline capable and easy to setup. Powered by LangChain.

chatbot chatgpt langchain large-language-models llama2 llm llm-inference mixtral no-code open-source openapi question-answering self-hosted

Last synced: 17 Jan 2025

https://github.com/ictnlp/truthx

Code for ACL 2024 paper "TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space"

baichuan chatglm chatgpt explainable-ai gpt-4 hallucination hallucinations language-model llama llama2 llama3 llm llm-inference llms mistral representation safety truthfulness

Last synced: 22 Dec 2024

https://github.com/jieyuz2/ecoassistant

EcoAssistant: using LLM assistant more affordably and accurately

chatbot gpt large-language-models llm-inference nlp

Last synced: 29 Dec 2024

https://github.com/aniketmaurya/llm-inference

Large Language Model (LLM) Inference API and Chatbot

chatbot langchain llama llm llm-inference mistral

Last synced: 27 Sep 2024

https://github.com/genai-impact/ecologits

🌱 EcoLogits tracks the energy consumption and environmental footprint of using generative AI models through APIs.

genai generative-ai green-ai green-software llm llm-inference python sustainability sustainable-ai

Last synced: 15 Jan 2025

https://github.com/ooridata/toolio

GenAI & agent toolkit for Apple Silicon Mac, implementing JSON schema-steered structured output (3SO) and tool-calling in Python. For more on 3SO: https://huggingface.co/blog/ucheog/llm-power-steering

agentic ai apple-silicon client-server genai json-schema llm llm-inference mac mlx tool-calling tools

Last synced: 18 Jan 2025

https://github.com/KVignesh122/AssetNewsSentimentAnalyzer

A sentiment analyzer package for financial assets and securities utilizing GPT models.

commodity-trading financial-analysis forex-trading google-search-api investment-analysis llm-inference news-api sentiment-analysis

Last synced: 02 Nov 2024