{"id":26343664,"url":"https://github.com/tommathewxc/lidia","last_synced_at":"2026-04-28T17:32:55.011Z","repository":{"id":282511804,"uuid":"948833014","full_name":"tommathewXC/lidia","owner":"tommathewXC","description":"A fully customizable, super light-weight, cross-platform GenAI based Personal Assistant that can be run locally on your private hardware!","archived":false,"fork":false,"pushed_at":"2025-03-15T04:35:15.000Z","size":22,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-15T05:20:53.644Z","etag":null,"topics":["deep-neural-networks","genai","huggingface","llm","local-genai","local-llm","local-llm-integration","ocr-recognition","ollama","ollama-python","personal-assistant","speech-to-text","text-to-speech","vllm"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tommathewXC.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-15T03:58:35.000Z","updated_at":"2025-03-15T04:35:18.000Z","dependencies_parsed_at":"2025-03-15T05:34:24.848Z","dependency_job_id":null,"html_url":"https://github.com/tommathewXC/lidia","commit_stats":null,"previous_names":["tommathewxc/lidia"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tommathewXC%2Flidia","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tommathewXC%2Flidia/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories
/tommathewXC%2Flidia/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tommathewXC%2Flidia/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tommathewXC","download_url":"https://codeload.github.com/tommathewXC/lidia/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243826788,"owners_count":20354222,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-neural-networks","genai","huggingface","llm","local-genai","local-llm","local-llm-integration","ocr-recognition","ollama","ollama-python","personal-assistant","speech-to-text","text-to-speech","vllm"],"created_at":"2025-03-16T05:17:41.548Z","updated_at":"2026-04-28T17:32:54.976Z","avatar_url":"https://github.com/tommathewXC.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Lidia\n\n![License](https://img.shields.io/badge/license-MIT-blue.svg)\n![Python](https://img.shields.io/badge/python-3.9%2B-brightgreen)\n\nLidia is a voice-enabled AI assistant framework that integrates audio processing, computer vision, and language models into a cohesive multimodal experience. 
It can understand spoken commands, visually perceive the desktop environment, and respond with natural-sounding speech.\n\n## Table of Contents\n\n- [Features](#-features)\n- [Requirements](#-requirements)\n- [Quick Start](#-quick-start)\n- [Usage](#-usage)\n  - [Basic Usage](#basic-usage)\n  - [LLM Options](#llm-options)\n  - [TTS Options](#tts-options)\n- [Configuration](#-configuration)\n- [Architecture](#-architecture)\n- [How It Works](#-how-it-works)\n- [Demo](#-demo)\n- [Custom Tools and Actions](#-custom-tools-and-actions)\n  - [Understanding the Tool System](#understanding-the-tool-system)\n  - [Existing Tools](#existing-tools)\n  - [Creating a Custom Tool](#creating-a-custom-tool)\n  - [Beyond LangChain](#beyond-langchain)\n- [Model Repository](#-model-repository)\n  - [Current Implementation](#current-implementation)\n  - [Future Extensions](#future-extensions)\n- [Adding New Models](#-adding-new-models)\n  - [Speech Recognition Models](#speech-recognition-models)\n  - [Text-to-Speech Models](#text-to-speech-models)\n  - [LLM Models](#llm-models)\n  - [OCR Models](#ocr-models)\n  - [Image Captioning Models](#image-captioning-models)\n  - [After Adding New Models](#after-adding-new-models)\n  - [Tips for Model Selection](#tips-for-model-selection)\n- [Known Limitations](#-known-limitations)\n- [Contributing](#-contributing)\n- [License](#-license)\n- [Acknowledgments](#-acknowledgments)\n\n## 🌟 Features\n\n- **Voice Interaction**: Speech-to-text and text-to-speech capabilities for natural conversations\n- **Computer Vision**: Real-time desktop monitoring with OCR and image captioning\n- **Multiple LLM Backends**: Support for local models, Ollama, and OpenAI\n- **API Extensions**: Modular design with APIs for screenshot capture and datetime\n- **Orchestration**: Seamless coordination between all components\n- **Custom Tools**: Extensible architecture for adding new capabilities\n\n## 📋 Requirements\n\n- Python 3.9+\n- Required packages listed in 
`requirements.txt`\n- For local LLM mode: 16GB+ RAM recommended\n- For speech synthesis: Audio output device\n- For speech recognition: Microphone\n\n## 🚀 Quick Start\n\n1. Clone the repository:\n```bash\ngit clone https://github.com/tommathewXC/lidia.git\ncd lidia\n```\n\n2. Create and activate a virtual environment:\n```bash\npython -m venv venv\nsource venv/bin/activate  # On Windows: venv\\Scripts\\activate\n```\n\n3. Install dependencies:\n```bash\npip install -r requirements.txt\n```\n\n4. Download the required models:\n```bash\npython install_models.py\n```\n\n5. Run Lidia with default settings:\n```bash\npython main.py\n```\n\n## 💻 Usage\n\n### Basic Usage\n\n```bash\n# Make sure your virtual environment is activated\nsource venv/bin/activate  # On Windows: venv\\Scripts\\activate\n\n# Run with default settings\npython main.py\n```\n\n### LLM Options\n\n```bash\n# Use OpenAI's API (requires API key in ~/.accesstokens/openai-api)\npython main.py --llm_mode openai\n\n# Use Ollama (must have Ollama installed with models available)\npython main.py --llm_mode ollama\n\n# Use a local model (requires downloaded models)\npython main.py --llm_mode local\n```\n\n### TTS Options\n\n```bash\n# Use system TTS (faster, less resource-intensive)\npython main.py --tts_mode system\n\n# Use ML-based TTS (higher quality, more resource-intensive)\npython main.py --tts_mode ml\n```\n\n## 🔧 Configuration\n\nConfiguration options can be found in `lidia/config/config.py`. 
Key settings include:\n\n- **LLM Backend**: Choose between local models, Ollama, or OpenAI\n- **Model Paths**: Customize paths to speech recognition, LLM, and TTS models\n- **Voice Settings**: Adjust voice parameters and assistant name\n- **Screenshot Settings**: Configure monitor index and capture interval\n- **Audio Settings**: Adjust silence timeout and chunk duration\n\n## 📊 Architecture\n\nLidia is organized into several modular components:\n\n- **Audio Module**: Handles speech-to-text and text-to-speech\n  - `audiostreamer.py`: Manages audio input and transcription\n  - `audiosynthesizer.py`: Converts text to speech\n\n- **Vision Module**: Processes visual information\n  - `screenshot.py`: Captures desktop screenshots\n  - `ocr_processor.py`: Extracts text from images\n\n- **LLM Module**: Manages language model interactions\n  - `llmagent.py`: Interfaces with various LLM backends\n\n- **APIs**: Exposes functionality to the LLM\n  - `screenshot_api.py`: Provides screen analysis capabilities\n  - `datetime_api.py`: Offers date and time information\n\n- **Orchestrator**: Coordinates all components for a seamless experience\n\n## 📚 How It Works\n\n1. **Audio Streaming**: The system continuously listens for user input using the microphone\n2. **Speech Recognition**: User speech is transcribed to text via Whisper\n3. **LLM Processing**: Text is sent to the configured LLM with context and tools\n4. **Tool Integration**: The LLM can use tools like screenshot analysis when needed\n5. **Response Generation**: Responses are generated and streamed from the LLM\n6. 
**Speech Synthesis**: Text responses are converted to speech using the configured TTS system\n\n## 🎬 Demo\n\n[![Lidia Demo](https://img.youtube.com/vi/KN5Jbkyp0z4/maxresdefault.jpg)](https://www.youtube.com/watch?v=KN5Jbkyp0z4)\n*Click the image above to watch the demo video on YouTube*\n\n### Demo Environment\n\nThe demonstration showcases Lidia running on a MacBook Pro, highlighting:\n\n- **Natural Voice Interaction**: Lidia responds to spoken queries with synthesized speech\n- **Screen Analysis**: The assistant can capture and analyze screen content in real-time\n- **Multi-modal Capabilities**: Integration of vision, audio, and language understanding\n- **Tool Usage**: Examples of the assistant utilizing datetime and screenshot tools\n\nThe demo illustrates how Lidia functions in a real-world environment, using the system's voice synthesis capabilities and showing the responsive nature of the assistant running entirely on local hardware.\n\n## 🧰 Custom Tools and Actions\n\nLidia can be extended with custom tools and actions that allow the assistant to perform specific tasks beyond conversation. The current implementation uses LangChain for its tool framework, but this is designed to be configurable.\n\n### Understanding the Tool System\n\nTools in Lidia are Python functions that:\n1. Accept structured inputs\n2. Perform actions (API calls, data processing, etc.)\n3. Return results that the LLM can incorporate into responses\n\nLangChain provides the scaffolding to register these tools and make them accessible to the LLM.\n\n### Existing Tools\n\nLidia comes with two built-in tools:\n\n1. **DateTime API** (`datetime_api.py`): Provides the current date and time\n2. **Screenshot API** (`screenshot_api.py`): Captures and analyzes screen content\n\n### Creating a Custom Tool\n\nHere's how to create your own custom tool:\n\n1. 
**Create a new API file** in the `lidia/apis/` directory:\n\n```python\n# lidia/apis/weather_api.py\n\"\"\"API for retrieving weather information.\"\"\"\nfrom logging import getLogger\nimport requests\n\nlogger = getLogger(__name__)\n\ndef get_weather(location: str = \"New York\"):\n    \"\"\"Gets the current weather for a location.\"\"\"\n    try:\n        # Replace with your actual weather API call\n        logger.info(f\"Getting weather for {location}\")\n        return f\"Current weather in {location}: Sunny, 22°C\"\n    except Exception as e:\n        logger.error(f\"Error getting weather: {e}\")\n        return f\"Error retrieving weather: {e}\"\n```\n\n2. **Register your tool in `llmagent.py`**:\n\n```python\n# Add the import\nfrom lidia.apis.weather_api import get_weather\n\n# In the _setup_tools method\nclass WeatherInput(BaseModel):\n    \"\"\"Input schema for weather tool.\"\"\"\n    location: str = Field(default=\"New York\", description=\"City or location name\")\n\n# Add to the tools list\ntools = [\n    # Existing tools...\n    StructuredTool(\n        name=\"get_weather\",\n        func=get_weather,\n        description=\"Gets the current weather for a location.\",\n        args_schema=WeatherInput,\n        return_direct=True\n    )\n]\n```\n\n3. **Update the system message** to inform the LLM about the new tool:\n\n```python\nsystem_message = SystemMessage(content=\"\"\"You are an AI assistant that can perceive the environment through vision and interact through speech.\nWhen a user asks about time or date, ALWAYS use the get_current_datetime tool.\nWhen asked to look at or analyze the screen, ALWAYS use the take_screenshot tool.\nWhen asked about weather, ALWAYS use the get_weather tool.\n\nTools available: {tools}\n\nFormat your responses in a natural, conversational way.\"\"\")\n```\n\n### Beyond LangChain\n\nWhile Lidia currently uses LangChain for tools integration, the architecture is designed to be framework-agnostic. 
Future versions may include:\n\n- Support for alternative tool frameworks\n- Direct LLM tool integration without middleware\n- Tool discovery and registration systems\n- Tool version management\n\n## 🧠 Model Repository\n\nLidia currently uses HuggingFace as its primary model repository. Models are downloaded during the installation process and stored locally for optimal performance.\n\n### Current Implementation\n\nThe system downloads models from HuggingFace for:\n- Speech recognition (Whisper)\n- Text-to-speech (SpeechT5)\n- LLM functionality (when using local mode)\n- OCR processing (TrOCR)\n- Image captioning (BLIP)\n\n### Future Extensions\n\nThe architecture is designed to be model-repository agnostic. Future versions will:\n- Support multiple model repositories beyond HuggingFace\n- Allow easy switching between different model sources\n- Enable custom model integrations from any open-source repository\n- Support model mixing from different sources\n\n## 🧩 Adding New Models\n\nLidia's modular design makes it easy to extend with new models. Below are step-by-step guides for adding models for each component.\n\n### Speech Recognition Models\n\n1. **Identify a suitable Whisper model variant** on HuggingFace (e.g., `openai/whisper-tiny`, `openai/whisper-small`, `openai/whisper-medium`)\n\n2. **Update the model installation script** at `install_models.py`:\n   ```python\n   MODELS = {\n       \"audio\": {\n           \"whisper-base\": \"openai/whisper-base\",\n           \"whisper-medium\": \"openai/whisper-medium\",  # Add your new model here\n           \"speecht5_tts\": \"microsoft/speecht5_tts\",\n           \"speecht5_hifigan\": \"microsoft/speecht5_hifigan\"\n       },\n       # ...\n   }\n   ```\n\n3. 
**Update the configuration** in `lidia/config/config.py` to use your new model:\n   ```python\n   global_config = {\n       \"speech_to_text\": \"models/audio/whisper-medium\",  # Change to your new model path\n       # ...\n   }\n   ```\n\n### Text-to-Speech Models\n\n1. **Find compatible TTS models** on HuggingFace (SpeechT5 compatible models like `microsoft/speecht5_tts`)\n\n2. **Add the model to the installation script**:\n   ```python\n   MODELS = {\n       \"audio\": {\n           # ...\n           \"speecht5_tts\": \"microsoft/speecht5_tts\",\n           \"speecht5_tts_new\": \"path/to/new/tts/model\",  # Add your new model here\n           \"speecht5_hifigan\": \"microsoft/speecht5_hifigan\"\n       },\n       # ...\n   }\n   ```\n\n3. **Update the configuration** to use your new TTS model:\n   ```python\n   global_config = {\n       # ...\n       \"text_to_speech\": {\n           \"model\": \"models/audio/speecht5_tts_new\",  # Point to your new model\n           \"vocoder\": \"models/audio/speecht5_hifigan\"\n       },\n       # ...\n   }\n   ```\n\n### LLM Models\n\n1. **Select a compatible LLM** from HuggingFace (e.g., models like LLaMA, Mistral, DeepSeek)\n\n2. **Add the model to the installation script**:\n   ```python\n   MODELS = {\n       # ...\n       \"llm\": {\n           \"DeepSeek-R1-Distill-Llama-70B\": \"deepseek-ai/DeepSeek-R1-Distill-Llama-70B\",\n           \"Mistral-7B\": \"mistralai/Mistral-7B-v0.1\"  # Add your new model here\n       },\n       # ...\n   }\n   ```\n\n3. **Update the configuration** to use your model:\n   ```python\n   global_config = {\n       # ...\n       \"llm\": \"models/llm/Mistral-7B\",  # Point to your new model\n       # ...\n   }\n   ```\n\n4. **For Ollama models**, update the Ollama model name:\n   ```python\n   global_config = {\n       # ...\n       \"ollama_model\": \"mistral:7b\",  # Update with equivalent Ollama model\n       # ...\n   }\n   ```\n\n### OCR Models\n\n1. 
**Find a TrOCR compatible model** on HuggingFace (e.g., `microsoft/trocr-base-handwritten`, `microsoft/trocr-large-printed`)\n\n2. **Add the model to the installation script**:\n   ```python\n   MODELS = {\n       # ...\n       \"ocr\": {\n           \"trocr\": global_config[\"ocr\"][\"model\"],\n           \"trocr_handwritten\": \"microsoft/trocr-base-handwritten\"  # Add your new model here\n       },\n       # ...\n   }\n   ```\n\n3. **Update the configuration** to use your new OCR model:\n   ```python\n   global_config = {\n       # ...\n       \"ocr\": {\n           \"model\": \"models/ocr/trocr_handwritten\"  # Point to your new model\n       },\n       # ...\n   }\n   ```\n\n### Image Captioning Models\n\n1. **Select an image captioning model** from HuggingFace (e.g., `Salesforce/blip-image-captioning-large`)\n\n2. **Add the model to the installation script**:\n   ```python\n   MODELS = {\n       # ...\n       \"image\": {\n           \"blip\": global_config[\"image_captioning\"][\"model\"],\n           \"blip_large\": \"Salesforce/blip-image-captioning-large\"  # Add your new model here\n       },\n       # ...\n   }\n   ```\n\n3. **Update the configuration** to use your new captioning model:\n   ```python\n   global_config = {\n       # ...\n       \"image_captioning\": {\n           \"model\": \"models/image/blip_large\"  # Point to your new model\n       },\n       # ...\n   }\n   ```\n\n### After Adding New Models\n\n1. **Run the model installation script** to download new models:\n   ```bash\n   source venv/bin/activate  # Activate your virtual environment\n   python install_models.py\n   ```\n\n2. **Test the new model** by running Lidia:\n   ```bash\n   python main.py\n   ```\n\n3. **Consider model compatibility**: Always check the model's documentation for compatibility with Lidia's architecture. 
Some models may require additional transformations or preprocessing steps.\n\n### Tips for Model Selection\n\n- **Balance size and performance**: Larger models generally perform better but require more resources\n- **Check hardware requirements**: Some models need significant GPU memory\n- **Consider inference speed**: Slower models may impact real-time interaction\n- **Verify model license**: Ensure the model license is compatible with your use case\n\n## ⚠️ Known Limitations\n\n- Resource usage can be high when using ML-based TTS and local LLMs\n- Limited multi-language support in the current version\n- OCR may struggle with complex visual content\n\n## 🤝 Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\n## 📜 License\n\n[MIT License](LICENSE)\n\n## 🙏 Acknowledgments\n\n- HuggingFace for transformer models\n- OpenAI for Whisper and GPT\n- Microsoft for SpeechT5 and TrOCR\n- Ollama for local LLM integration\n- LangChain for the tools framework","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftommathewxc%2Flidia","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftommathewxc%2Flidia","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftommathewxc%2Flidia/lists"}