https://github.com/xxrjun/local-inference
🐑 Run LLM inference locally for various downstream applications.
https://github.com/xxrjun/local-inference
chat-tts fish-speech grafana immersive-translate llm-inference ollama prometheus tts vllm
Last synced: 9 months ago
JSON representation
🐑 Run LLM inference locally for various downstream applications.
- Host: GitHub
- URL: https://github.com/xxrjun/local-inference
- Owner: xxrjun
- Created: 2025-02-02T21:54:56.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-04-09T15:28:45.000Z (9 months ago)
- Last Synced: 2025-04-09T16:42:44.520Z (9 months ago)
- Topics: chat-tts, fish-speech, grafana, immersive-translate, llm-inference, ollama, prometheus, tts, vllm
- Language: Shell
- Homepage:
- Size: 2.53 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Local Inference
- [Local Inference](#local-inference)
- [Tools](#tools)
- [Backend](#backend)
- [Frontend](#frontend)
- [Monitor](#monitor)
- [Others](#others)
- [Open Source Model Collections](#open-source-model-collections)
- [Environment Setup](#environment-setup)
- [Example Usage](#example-usage)
- [Ollama](#ollama)
- [vLLM](#vllm)
- [Open Web UI](#open-web-ui)
- [Test OpenAI Compatible API](#test-openai-compatible-api)
- [Downstream Applications](#downstream-applications)
- [Immersive Translate 沉浸式翻譯](#immersive-translate-沉浸式翻譯)
- [TTS (Text-to-Speech)](#tts-text-to-speech)
> [!NOTE]
> The commands in this guide are intended for **Ubuntu Linux**. If you are using a different platform (e.g., Windows or macOS), please refer to the official documentation of the tool for platform-specific instructions.
## Tools
### Backend
> Further Reading: [Ollama vs. vLLM: Choosing the Best Tool for AI Model Workflows](https://collabnix.com/ollama-vs-vllm-choosing-the-best-tool-for-ai-model-workflows/)
- [Ollama](https://ollama.com/) [[Docs](https://github.com/ollama/ollama/tree/main/docs)]
- [Keep Model Alive in GPU Memory](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-keep-a-model-loaded-in-memory-or-make-it-unload-immediately)
- [Handle Concurrent Requests](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-does-ollama-handle-concurrent-requests)
- [vLLM](https://github.com/vllm-project/vllm) [[Docs](https://docs.vllm.ai/en/latest/index.html)]
- Monitor [Prometheus and Grafana](https://docs.vllm.ai/en/stable/getting_started/examples/prometheus_grafana.html)
### Frontend
- [Open Web UI](https://github.com/open-webui/open-webui)
- [Environment Variable Configuration](https://docs.openwebui.com/getting-started/env-configuration/)
### Monitor
> [!TIP]
> Check out [prometheus_grafana](./prometheus_grafana) for more details.
- [Grafana](https://github.com/grafana/grafana) [[Docs](https://grafana.com/docs/grafana/latest/)]
- [Prometheus](https://github.com/prometheus/prometheus) [[Docs](https://prometheus.io/docs/introduction/overview/)]
### Others
- [OpenAI & other LLM API Pricing Calculator](https://yourgpt.ai/tools/openai-and-other-llm-api-pricing-calculator) - Calculate the cost of using OpenAI and other Large Language Models (LLMs) APIs
## Open Source Model Collections
- [DeepSeek | Hugging Face](https://huggingface.co/deepseek-ai)
- [Taiwan LLM | Hugging Face](https://huggingface.co/taiwan-llm)
## Environment Setup
First, clone the repository:
```bash
git clone --recurse-submodules https://github.com/xxrjun/local-inference.git
```
Then, create a new [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html) environment and install the required dependencies:
```bash
conda env -n local-inference python=3.12
conda activate local-inference
# Install Python dependencies
pip install -r requirements.txt
# Install Ollama on Linux
curl -fsSL https://ollama.com/install.sh | sh
```
## Example Usage
It is recommended to use [tmux](https://github.com/tmux/tmux) to manage multiple sessions.
### Ollama
```bash
tmux new -s ollama-serve
./examples/ollama_serve.sh
```
```bash
tmux new -s ollama-run
./examples/ollama_run.sh
```
### vLLM
```bash
tmux new -s vllm-serve
./examples/vllm_serve.sh
```
### Open Web UI
```bash
tmux new -s open-webui
./examples/open_webui.sh
```
### Test OpenAI Compatible API
Copy `.env.example` to `.env`:
```bash
cp .env.example .env
```
Edit `.env` with the correct values, then run the test script:
```bash
python scripts/test_openai_client.py
```
If the API is working correctly, the output should resemble the following:
```bash
ChatCompletionMessage(content='Hello! How can I help you today? If you have any questions or need assistance, feel free to ask.', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], reasoning_content=None)
```
## Downstream Applications
### [Immersive Translate 沉浸式翻譯](https://immersivetranslate.com/)
Refer to [My Immersive Translate Setup Guide](./applications/immersive_translate/) or [Offical Docs](https://immersivetranslate.com/docs/services/ai/),

### TTS (Text-to-Speech)
What is TTS?
- [Text-to-Speech | Huggingface](https://huggingface.co/tasks/text-to-speech)
- [Text-to-speech | NVIDIA](https://www.nvidia.com/en-us/glossary/text-to-speech/)
Refer to the [My TTS Setup Guide](./applications/tts/README.md) for more details.
