{"id":48808141,"url":"https://github.com/alexbuildstech/assistivetech","last_synced_at":"2026-04-14T06:10:50.212Z","repository":{"id":335688683,"uuid":"1111756127","full_name":"alexbuildstech/assistivetech","owner":"alexbuildstech","description":"AI-Powered Assistive Navigation System with spatial memory, 3D audio guidance, and self-learning capabilities for visually impaired users","archived":false,"fork":false,"pushed_at":"2026-04-02T00:36:50.000Z","size":6078,"stargazers_count":3,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-02T11:37:34.878Z","etag":null,"topics":["accessibility","ai","assistive-technology","blind","computer-vision","gemini-ai","machine-learning","open-source","python","spatial-audio","visually-impaired"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alexbuildstech.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-07T15:23:28.000Z","updated_at":"2026-02-21T11:17:10.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/alexbuildstech/assistivetech","commit_stats":null,"previous_names":["alexbuildstech/assistivetech"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/alexbuildstech/assistivetech","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexbuildstech%2Fassistivetech","tags_url":"https:
//repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexbuildstech%2Fassistivetech/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexbuildstech%2Fassistivetech/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexbuildstech%2Fassistivetech/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alexbuildstech","download_url":"https://codeload.github.com/alexbuildstech/assistivetech/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexbuildstech%2Fassistivetech/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31784337,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-14T02:24:21.117Z","status":"ssl_error","status_checked_at":"2026-04-14T02:24:20.627Z","response_time":153,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["accessibility","ai","assistive-technology","blind","computer-vision","gemini-ai","machine-learning","open-source","python","spatial-audio","visually-impaired"],"created_at":"2026-04-14T06:10:47.194Z","updated_at":"2026-04-14T06:10:50.207Z","avatar_url":"https://github.com/alexbuildstech.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Investigation into Persistent Spatial Memory for Assistive Vision\n\n\u003e **An experimental framework exploring 
the trade-offs between local heuristic state management and cloud-based Vision-Language Models (VLMs).**\n\n**Notice:** This is a research prototype and technical exploration. It is not a consumer-ready tool. The project investigates the integration of VLM-based object detection, persistent local state, and spatial audio to assist with indoor environmental awareness.\n\nThe central hypothesis is that a **locally persistent object history** can reduce redundant VLM queries in static indoor environments without increasing the latency of navigation-relevant object retrieval. This prototype serves as an environment for testing this hypothesis.\n\n---\n\n## 🤝 Partners \u0026 Acknowledgments\n\nThis research is made possible through the support of industry partners providing the core infrastructure for this project and for the related Nova humanoid framework:\n\n- **[Radxa](https://radxa.com)**: Provided the **ROCK 5C** high-performance SBC, which serves as the primary compute node for vision processing and testing in this project.\n- **[DFRobot](https://www.dfrobot.com)**: Provided the **DFRobot Mega 2560** for testing, as well as various actuators for other related research projects.\n- **[Polymaker](https://polymaker.com)**: Provided advanced filaments for testing physical versions of this technology and for the Nova humanoid framework.\n\n---\n\n## Quick Start\n\n### 1. Install Dependencies\n```bash\n# System dependencies (Ubuntu/Debian)\nsudo apt update \u0026\u0026 sudo apt install python3 python3-pip mpv\n\n# Python libraries\n# opencv-contrib-python already bundles the main OpenCV modules plus the\n# contrib tracking module (CSRT); do not install it alongside opencv-python.\npip install google-generativeai opencv-contrib-python \\\n            sounddevice scipy groq edge-tts pydub pynput \\\n            --break-system-packages\n```\n\n### 2. Configure API Keys\nCopy the template and add your API keys:\n```bash\ncp .env.example .env\nnano .env  # Add your GOOGLE_API_KEY and GROQ_API_KEY\n```\n\n### 3.
 Execution\n```bash\n# The app loads GOOGLE_API_KEY and GROQ_API_KEY from .env automatically\npython3 main_enhanced.py\n```\n\nFor terminal-only or headless runs:\n```bash\nNOVA_HEADLESS=1 python3 main_enhanced.py\n```\n\n### 4. Default Runtime Behavior\n- The core camera → detection/tracking → audio guidance pipeline is enabled.\n- Voice commands are enabled when a valid Groq API key is configured.\n- The following features remain available but are **optional and off by default** for stability:\n  - Hardware serial integration\n  - Persistent learning / recall\n  - HRTF / room reverb path\n  - Free-form chat persona\n- If the Groq or Gemini credentials are invalid, the app disables the affected features instead of crashing.\n\n### 5. Keyboard Controls\n- **F**: Trigger VLM-based object detection (single frame)\n- **C**: Start voice command recording\n- **S**: Stop voice recording and process the command\n- **M**: Cycle through experimental operating modes\n- **Q**: Quit\n\nIn terminal-only headless mode, use **Ctrl+C** to exit.\n\n---\n\n## Technical Objectives \u0026 Current State\n\nThis framework implements:\n- **Heuristic-Guided Object Retrieval**: Uses VLM detections to populate a local state. (Functional; accuracy constrained by model selection and environmental lighting.)\n- **Persistent Object History**: Logs object metadata (label, normalized coordinates, timestamp) to a local SQLite store for natural-language recall. (Stable core; natural-language parsing is heuristic-based.)\n- **Spatial Audio Guidance**: A 3D audio engine for direction-finding. (Implemented using HRTF-inspired filters; effectiveness is subjective and lacks formal psychoacoustic validation.)\n- **Redundant Query Suppression**: A caching mechanism designed to minimize API calls for known static objects.
 (Currently implements a simple temporal/spatial overlap check.)\n\n---\n\n## Known Constraints \u0026 Limitations\n\n- **Tracker Drift**: The local CSRT tracker is susceptible to occlusion and rapid viewpoint changes. No global re-localization is currently implemented.\n- **NLP Brittleness**: Command parsing relies on keyword matching and simple LLM prompting; it does not yet handle complex, multi-step spatial reasoning.\n- **Latency Bottlenecks**: Round-trip time for cloud VLMs introduces a non-trivial delay (typically 1.5–3 s) between an environment change and the corresponding system update.\n- **Coordinate Drift**: There is no SLAM or odometry integration. Stored object positions are relative to the camera frame at the time of detection, so they become less accurate as the user moves.\n- **Cloud Dependency**: Core vision and voice features depend on valid Gemini and Groq API credentials. Invalid keys no longer crash the app, but they disable major functionality.\n- **Headless Usage**: GUI rendering is optional (`NOVA_HEADLESS=1`), but interactive visual control works best when a display is available.\n\n---\n\n## Technological Curiosity: The Origin of the Approach\n\nThis project originated from a technical curiosity regarding the \"statelessness\" of most consumer assistive vision tools. While commercial systems are excellent at identifying *what* is in front of the user *right now*, they often lack the temporal consistency required to answer questions about the past (e.g., *\"Where did I put my phone two minutes ago?\"*).\n\nThe development process prioritized exploring the limits of low-cost hardware (SBCs) paired with high-performance cloud APIs. Early experiments focused on audio ergonomics: moving away from harsh pink noise toward adaptive, frequency-modulated \"pings\" that encode distance and importance.
 This project is an ongoing attempt to bridge the gap between real-time tracking and long-term environmental memory.\n\n---\n\n## System Architecture\n\nThe framework is designed as a modular pipeline where data flows from environmental perception to spatial indexing and finally to audio-spatial rendering.\n\n- **Sense Phase**: Captures video frames and multiplexes them between the VLM (for semantic identification) and the CSRT tracker (for frame-to-frame continuity).\n- **Index Phase**: Interacts with the local SQLite store to reconcile new detections with historical data, applying temporal decay to stale entries.\n- **Render Phase**: Transforms object coordinates into HRTF-modulated audio signals, producing the directional cues provided to the user.\n\n```mermaid\ngraph TD\n    A[User] --\u003e|Voice/Keyboard| B[Command Processor]\n    B --\u003e C[Vision Module]\n    B --\u003e D[State Management Module]\n    B --\u003e E[Audio Module]\n    \n    C --\u003e|Detections| F[Object Manager]\n    D --\u003e|Persistent State| F\n    F --\u003e|Spatial Coordinates| E\n    E --\u003e|Spatial Audio| A\n    \n    C --\u003e|VLM API| G[(Cloud Backend)]\n    D --\u003e|Local Storage| H[(SQLite DB)]\n    \n    style A fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff\n    style B fill:#2196F3,stroke:#333,stroke-width:2px,color:#fff\n    style F fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff\n```\n\n### Technical Specifications\n\n#### Hardware Environment\n\n| Component | Minimum | Recommended |\n|-----------|---------|-------------|\n| **Compute** | Linux-based x64 system | Radxa Rock 5C (ARM SBC) |\n| **Optics** | USB Webcam (640x480) | 720p+ USB Camera |\n| **Output** | Basic Speakers | Low-latency Stereo Headphones |\n| **Input** | Built-in Microphone | Directional External Mic |\n\n\u003e [!TIP]\n\u003e The system includes ARM-specific optimizations for compute-limited environments.\n\n#### Software Stack\n- **Vision VLM**: Google Gemini (General Robotics
 variant)\n- **Tracking**: OpenCV (CSRT implementation)\n- **STT**: Groq (Whisper-based)\n- **TTS**: `edge-tts` (Microsoft Edge online text-to-speech voices)\n- **Persistence**: SQLite3\n- **Audio Processing**: `sounddevice` + `scipy`\n- **Language**: Python 3.8+\n\n### Heuristic State Management: Under the Hood\n\n#### The Problem\nStateless assistive systems lose all environmental context the moment an object leaves the camera's field of view, forcing repetitive and costly re-scanning.\n\n#### The Implementation\nThis prototype explores **Persistent Object History** to maintain an internal representation of the environment.\n\n1.  **Observation**: Detections are serialized with a label, bounding box, timestamp, and perceptual hash (used for deduplication).\n2.  **Indexing**: Data is stored in a queryable SQLite database.\n3.  **Recall**: Natural-language queries are mapped to database lookups that return the object's most recent known location.\n4.  **Decay Heuristics**: Simple rules merge duplicate sightings and give recent sightings priority over older history.\n\n### Project Structure\n\nThe codebase is organized into discrete functional modules to facilitate experimentation:\n\n- `main_enhanced.py`: Main execution loop and event handling.\n- `vision_module.py`: Interface for VLM detection and classical tracking.\n- `learning_module.py`: Logic for SQLite persistence and heuristic decay.\n- `audio_module_multi.py`: 3D audio synthesis and HRTF filtering.\n- `object_manager.py`: Coordinates tracking of multiple object identities.\n- `config.py`: Centralized configuration and API management.\n\n---\n\n## Future Development Roadmap\n\nThis roadmap outlines planned features and long-term research trajectories.\n\n### Current Technical Tracks\n- **Hardware Integration**: ESP32 wireless connectivity and haptic feedback research.\n- **Multimodal Feedback**: Integrating small OLED status displays and battery telemetry.\n- **Edge Processing**: Researching offline modes using local Whisper variants and TinyML.\n\n### Research
 Questions\n- How can coordinate frame consistency be maintained in the absence of a global SLAM system?\n- What are the minimal semantic markers required for a VLM to reconstruct a scene graph from disjoint frames?\n\n---\n\n## Research Context \u0026 Trade-offs\n\nThis project occupies a niche between high-cost commercial assistive devices and generic mobile object-recognition apps.\n\n- **Open Source Transparency**: Unlike closed-source commercial tools, all heuristics and data-handling practices are fully transparent and auditable.\n- **Local Sovereignty**: Prioritizes local processing for spatial indexing and audio rendering, and uses cloud services only for semantic vision queries, speech recognition, and speech synthesis.\n- **Experimental Interfaces**: Explores non-standard audio-spatial metaphors that are often too niche for broad commercial products.\n\n---\n\n## Resource Utilization\n\n### API Dependency Notes\n- **Google Gemini API**: Used for sparse, high-context queries.\n- **Groq Whisper**: High-speed, low-latency speech-to-text.\n- **Edge-TTS**: Cost-effective, natural-sounding voice synthesis.\n\n### Hardware Reference\nA functional prototype can be assembled for approximately **$50–$150**, significantly lower than the entry point for dedicated assistive hardware (e.g., OrCam). This cost reduction comes from shifting complex processing to cloud VLMs and using off-the-shelf Linux hardware.\n\n---\n\n## Citation \u0026 Acknowledgments\n\nIf you use this framework in research, please cite it as an experimental prototype for spatial state management.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falexbuildstech%2Fassistivetech","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falexbuildstech%2Fassistivetech","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falexbuildstech%2Fassistivetech/lists"}