https://github.com/sakhileln/multimodal-agent

A beginner-friendly project to build a simple multimodal AI agent. 🦾
https://github.com/sakhileln/multimodal-agent

agent agents ai keras nlp nlp-machine-learning numpy opencv pillow spacy-nlp tensorflow

Last synced: 7 months ago
JSON representation

A beginner-friendly project to build a simple multimodal AI agent. 🦾

Host: GitHub
URL: https://github.com/sakhileln/multimodal-agent
Owner: sakhileln
License: gpl-3.0
Created: 2025-03-16T16:53:00.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-03-16T22:20:52.000Z (11 months ago)
Last Synced: 2025-03-16T22:27:27.401Z (11 months ago)
Topics: agent, agents, ai, keras, nlp, nlp-machine-learning, numpy, opencv, pillow, spacy-nlp, tensorflow
Language: Python
Homepage:
Size: 22.5 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 4
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

Awesome Lists containing this project

README

# Simple Multimodal Agent
A beginner-friendly project to build a simple multimodal AI agent that processes text and image inputs and generates basic text outputs. The goal is to learn about AI concepts like natural language processing (NLP) and computer vision while keeping the implementation lightweight and executable on a standard personal computer.

## Project Overview
This agent accepts:
- **Text Input**: A prompt or question (e.g., "What is in this image?").
- **Image Input**: A local image file (e.g., JPG or PNG).
- **Output**: A text response combining insights from both inputs (e.g., "The image shows a dog.").

The project uses Python and open-source libraries to process inputs locally without requiring advanced hardware or external APIs.

## Features
- Processes text using lightweight NLP tools (e.g., spaCy).
- Analyzes images using a pre-trained model (e.g., MobileNet).
- Runs on a CPU-based system with minimal dependencies.
- Simple command-line interface for interaction.

## Limitations
- **Accuracy**: The agent uses `MobileNet`, a small model, which may not always accurately classify complex or ambiguous images. This is a trade-off for simplicity and CPU compatibility.

## Prerequisites
- Python 3.8 or higher
- A standard personal computer (no GPU required)

## Installation
1. Clone the repository:
```bash
git clone https://github.com/sakhileln/multimodal-agent.git
cd multimodal-agent
```
2. Create a virtual environment (optional but recommended):
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```

3. Install dependencies:
```bash
pip install -r requirements.txt
```

## Dependencies
- `opencv-python`: For image processing.
- `tensorflow` or `keras`: For pre-trained image models (e.g., `MobileNet`).
- `spacy`: For text processing (run `python -m spacy download en_core_web_sm` after installation).
- `numpy`: For general array operations.
- `pillow`: For image file handling.

#### Install them with:
```bash
pip install opencv-python tensorflow spacy numpy pillow
python -m spacy download en_core_web_sm
```

## Usage
- Run the agent with a text prompt and an image file:
```bash
python multimodal_agent.py --text "What is in this image?" --image "path/to/dog.jpg"
```
- Example output:
```bash
The image shows a dog.
```

## Project Structure
- `multimodal_agent.py`: Main script implementing the agent.
- `requirements.txt`: List of dependencies.
- `README.md`: This file.
- `sample_images/`: Directory for test images (e.g., dog.jpg).
- 'notes.md`: Summary of key takeways.

## Development
This project is organized into sprints with GitHub Issues. See the [Issues tab](https://github.com/sakhileln/multimodal-agent/issues) for tasks and progress.

## Contributing

Contributions are welcome! If you'd like to contribute. See the [CONTRIBUTING](CONTRIBUTING.md) file for details.
1. Fork the repository.
2. Create a new branch for your feature/bug fix:
```bash
git checkout -b feature/YourFeature
```
3. Make your changes and test thoroughly.
4. Submit a pull request explaining your changes.

## License
This project is licensed under the GPL v3.0 License. See the [LICENSE](LICENSE) file for details.

## Acknowledgments

- [LangChain](https://www.langchain.com/): For providing robust tools to handle language model operations.
- [Hugging Face](https://huggingface.co/): For providing versatile and high-quality machine learning models.
- [GitHub](https://github.com): For offering a robust platform for collaboration and version control.
- [TensorFlow](https://www.tensorflow.org/): A software library for machine learning and artificial intelligence.
- [OpenCV](https://opencv.org/): For computer vision library.
- [spaCy](https://spacy.io/): For natural language processing.
- [NumPy](https://numpy.org/): A library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
- [pillow](https://pypi.org/project/pillow/): The Python Imaging Library adds image processing capabilities to your Python interpreter.
- [pip](https://pypi.org/project/pip/): A dependency management tool.

## Contact

- Sakhile III
- [LinkedIn Profile](https://www.linkedin.com/in/sakhile-ndlazi)
- [GitHub Profile](https://github.com/sakhileln)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/sakhileln/multimodal-agent

Awesome Lists containing this project

README