{"id":31386925,"url":"https://github.com/tarekabouzeid/data-lab-playground","last_synced_at":"2025-09-28T20:55:36.144Z","repository":{"id":315947467,"uuid":"1025166457","full_name":"tarekabouzeid/data-lab-playground","owner":"tarekabouzeid","description":"A simple Docker based playground that brings together popular open source data and AI tools to help others get started with data lakehouse architecture and GenAI development on their local machines.","archived":false,"fork":false,"pushed_at":"2025-09-21T19:14:40.000Z","size":48,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-21T20:39:28.832Z","etag":null,"topics":["genai","jupyter-notebook","lakehouse-platform","minio","ollama","qdrant-vector-database","rag","spark","trino"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tarekabouzeid.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-07-23T20:41:43.000Z","updated_at":"2025-09-21T19:14:43.000Z","dependencies_parsed_at":"2025-09-21T20:39:32.465Z","dependency_job_id":"1720c97b-9dc8-4bbd-b8fc-ccf8d333d8a2","html_url":"https://github.com/tarekabouzeid/data-lab-playground","commit_stats":null,"previous_names":["tarekabouzeid/data-lab-playground"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/tarekabouzeid/data-lab-playground","repository_url":"https://repos.
ecosyste.ms/api/v1/hosts/GitHub/repositories/tarekabouzeid%2Fdata-lab-playground","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tarekabouzeid%2Fdata-lab-playground/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tarekabouzeid%2Fdata-lab-playground/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tarekabouzeid%2Fdata-lab-playground/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tarekabouzeid","download_url":"https://codeload.github.com/tarekabouzeid/data-lab-playground/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tarekabouzeid%2Fdata-lab-playground/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":277429564,"owners_count":25816452,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-28T02:00:08.834Z","response_time":79,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["genai","jupyter-notebook","lakehouse-platform","minio","ollama","qdrant-vector-database","rag","spark","trino"],"created_at":"2025-09-28T20:55:34.328Z","updated_at":"2025-09-28T20:55:36.136Z","avatar_url":"https://github.com/tarekabouzeid.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DataLab Playground\n\nA simple Docker-based environment for exploring 
data analytics and AI tools. Includes basic data processing, storage, and LLM capabilities - all containerized for easy experimentation.\n\n## Architecture Overview\n\n```\n┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐\n│   Jupyter   │  │   Phoenix   │  │   Ollama    │  │    Trino    │\n│  Notebook   │  │AI Observ.   │  │ LLM Server  │  │   Engine    │\n└─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘\n       │                │                │               │\n       └────────────────┼────────────────┼───────────────┘\n                        │                │\n              ┌─────────────┐    ┌─────────────┐\n              │    Spark    │    │    Hive     │\n              │   Cluster   │    │ Metastore   │\n              └─────────────┘    └─────────────┘\n                        │                │\n                        └────────────────┘\n                                │\n                      ┌─────────────┐    ┌─────────────┐\n                      │   MinIO     │    │   Qdrant    │\n                      │  (S3 API)   │    │ Vector DB   │\n                      └─────────────┘    └─────────────┘\n```\n\n\n### Data Processing \u0026 Storage\n- **MinIO**: S3-compatible storage for files\n- **Apache Spark**: Basic data processing capabilities  \n- **Hive Metastore**: Simple metadata management\n- **Trino**: SQL query interface\n\n### AI \u0026 ML Tools\n- **Ollama**: Local LLM server (gemma3:4b model)\n- **Phoenix**: Basic AI monitoring \n- **Qdrant**: Vector database for AI experiments\n- **Jupyter**: Notebook environment with common libraries\n\n### Infrastructure\n- **PostgreSQL**: Database backend \n- **NVIDIA Docker**: GPU support for AI tools\n\n\n## 🚀 Quick Start\n\n### Prerequisites\n- Docker with Docker Compose\n- **NVIDIA GPU (Required)**:\n  - **NVIDIA GPU with 4GB+ VRAM** for gemma3:4b (current default)\n  - **Lower VRAM options available**: gemma3:1b\n  - NVIDIA Container Runtime pre-configured for Docker 
GPU access\n  - Platform is optimized for GPU acceleration and requires NVIDIA hardware\n- **Minimum System Requirements**:\n  - 8GB+ RAM recommended (12GB+ for optimal performance)\n  - 50GB+ free disk space for models and data\n\n### One-Command Setup\n```bash\n# This script will:\n# 1. Build all custom Docker images (if not already built)\n# 2. Start all services\n# 3. Setup MinIO storage\n# 4. Pull the gemma3:4b LLM model\n./start-platform.sh\n```\n\n### Manual Setup (Alternative)\n```bash\n# 1. Build custom Docker images\ndocker build -t datalab-playground/jupyter ./jupyter\ndocker build -t datalab-playground/spark ./spark\ndocker build -t datalab-playground/trino ./trino\ndocker build -t datalab-playground/hive-metastore ./hive-metastore\n\n# 2. Start all services\ndocker-compose up -d\n\n# 3. Pull LLM model\ndocker exec ollama ollama pull gemma3:4b\n```\n\n## 🎯 Getting Started\n\nSimple steps to explore the tools:\n\n1. **Start**: Run `./start-platform.sh` \n2. **Open Jupyter**: Go to http://localhost:8888 (password: 123456)\n3. **Try the demos**: Open `data_lab_playground.ipynb` for basic examples\n4. **Experiment with RAG**: Try `rag_vector_demo.ipynb` for vector database examples\n5. 
**Explore UIs**: Check out Qdrant dashboard, Phoenix monitoring, etc.\n\n## 🌐 Service Access Points\n\n| Service | URL | Credentials | Description |\n|---------|-----|-------------|-------------|\n| **Jupyter Notebook** | http://localhost:8888 | password: 123456 | Interactive AI/ML environment |\n| **Phoenix AI Observability** | http://localhost:6006 | None | AI model monitoring \u0026 traces |\n| **Ollama LLM API** | http://localhost:11434 | None | Local LLM inference endpoint |\n| **Qdrant Vector Database** | http://localhost:6333 | None | Vector storage \u0026 similarity search |\n| **Qdrant Web Dashboard** | http://localhost:6333/dashboard | None | Vector database management UI |\n| **Trino Web UI** | http://localhost:8080 | None | SQL query interface |\n| **Spark Master UI** | http://localhost:8081 | None | Spark cluster monitoring |\n| **MinIO Console** | http://localhost:9001 | minioadmin/minioadmin123 | S3 storage management |\n\n## 🤖 AI Tools\n\n### LLM Server (Ollama)\n- **Default Model**: gemma3:4b (~4GB VRAM)\n- **Any Ollama Model**: Different sizes available for various GPU capabilities from [Ollama Library](https://ollama.com/library)\n- **Requirements**: NVIDIA GPU with Docker runtime\n- **API**: `http://localhost:11434`\n\n### Vector Database (Qdrant)  \n- **Purpose**: Store embeddings for RAG experiments\n- **Web UI**: `http://localhost:6333/dashboard`\n- **API**: `http://localhost:6333`\n\n### Monitoring (Phoenix)\n- **Purpose**: Basic AI operation tracing\n- **Web UI**: `http://localhost:6006`\n\n### Notebook Environment (Jupyter)\n- **GenAI Environment**: Pre-installed packages\n- **Core Libraries**: pandas, numpy, matplotlib, seaborn, plotly, scikit-learn\n- **AI/ML Stack**: LangChain ecosystem, transformers, torch, sentence-transformers\n- **Vector \u0026 Database**: qdrant-client, trino, sqlalchemy, boto3, s3fs\n- **Document Processing**: pypdf2, python-docx, beautifulsoup4, tiktoken\n- **Observability**: arize-phoenix, opentelemetry, 
openinference instrumentation\n- **Development Tools**: ipywidgets, tqdm, rich, typer\n- **Default Kernel**: GenAI Analytics (Python 3.12)\n- **Access**: `http://localhost:8888` (password: 123456)\n\n## 📊 Usage Examples\n\n### Using Ollama LLM in Jupyter\n```python\nimport requests\nimport json\n\n# Chat with the local LLM\ndef chat_with_ollama(prompt, model=\"gemma3:4b\"):\n    response = requests.post('http://ollama:11434/api/generate',\n                           json={\n                               \"model\": model,\n                               \"prompt\": prompt,\n                               \"stream\": False\n                           })\n    return response.json()['response']\n\n# Example usage\nresult = chat_with_ollama(\"Explain data analytics in simple terms\")\nprint(result)\n```\n\n### Building RAG Systems with Qdrant\n```python\nfrom qdrant_client import QdrantClient\nimport ollama\n\n# Connect to vector database\nqdrant = QdrantClient(host=\"qdrant\", port=6333)\n\n# Generate embeddings using Ollama\ndef get_embedding(text):\n    response = ollama.embeddings(\n        model=\"nomic-embed-text\",\n        prompt=text\n    )\n    return response[\"embedding\"]\n\n# Store document chunks in vector database\ndef store_document(text_chunks, collection_name=\"knowledge_base\"):\n    for i, chunk in enumerate(text_chunks):\n        embedding = get_embedding(chunk)\n        qdrant.upsert(\n            collection_name=collection_name,\n            points=[{\n                \"id\": i,\n                \"vector\": embedding,\n                \"payload\": {\"text\": chunk}\n            }]\n        )\n\n# Retrieve relevant context for questions\ndef rag_query(question, collection_name=\"knowledge_base\"):\n    question_embedding = get_embedding(question)\n    results = qdrant.search(\n        collection_name=collection_name,\n        query_vector=question_embedding,\n        limit=3\n    )\n    context = \"\\n\".join([hit.payload[\"text\"] for hit in 
results])\n    \n    # Use context with LLM\n    prompt = f\"Context: {context}\\n\\nQuestion: {question}\\nAnswer:\"\n    return chat_with_ollama(prompt)\n```\n\n### Phoenix AI Observability\n```python\nimport phoenix as px\nfrom openinference.instrumentation.langchain import LangChainInstrumentor\nfrom phoenix.otel import register\n\n# Configure Phoenix tracing (in GenAI DEV kernel)\ntracer_provider = register(\n    project_name=\"data-analytics\",\n    endpoint=\"http://phoenix:4317\",\n    auto_instrument=True,\n)\n\n```\n\n### Spark with S3 Integration\n```python\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession.builder \\\n    .appName(\"DataLab-Playground\") \\\n    .master(\"spark://spark-master:7077\") \\\n    .config(\"spark.hadoop.fs.s3a.endpoint\", \"http://minio:9000\") \\\n    .config(\"spark.hadoop.fs.s3a.access.key\", \"minioadmin\") \\\n    .config(\"spark.hadoop.fs.s3a.secret.key\", \"minioadmin123\") \\\n    .getOrCreate()\n\n# Process data and prepare for AI workloads\ndf = spark.read.parquet(\"s3a://warehouse/data/\")\ndf.write.mode(\"overwrite\").parquet(\"s3a://warehouse/processed/ai_training_data\")\n```\n\n## 🛠️ Basic Configuration\n\n### Default Settings\n- **MinIO**: minioadmin/minioadmin123\n- **Spark**: spark://spark-master:7077  \n- **Ollama Models**: Stored in persistent volume\n\n## 🔧 Platform Management\n\n### Service Health Monitoring\n```bash\n# Check all services status\ndocker-compose ps\n\n# View specific service logs\ndocker logs phoenix\ndocker logs ollama\n\n# Restart services\ndocker-compose restart ollama phoenix\n```\n\n### Available Models\n```bash\n# List locally installed Ollama models\ndocker exec ollama ollama list\n\n# Pull additional LLM models\ndocker exec ollama ollama pull llama3.2\n\n# Pull additional embedding models (the Ollama library name is all-minilm)\ndocker exec ollama ollama pull all-minilm\n```\n\n## 🚦 Startup Order\n\nServices start automatically in the right order:\n1. Storage \u0026 databases (PostgreSQL, MinIO)\n2. 
Data processing (Spark, Trino, Hive)  \n3. AI services (Ollama, Phoenix)\n4. Jupyter notebooks\n\n## 🐛 Common Issues\n\n### GPU Problems\n```bash\n# Check if GPU is detected\ndocker exec ollama nvidia-smi\n\n# Verify Docker GPU support  \ndocker info | grep nvidia\n```\n\n### Service Problems\n```bash\n# Check if services are running\ndocker-compose ps\n\n# View service logs\ndocker logs ollama\ndocker logs phoenix\n```\n\n\n## 🤝 Contributing\n\nFound a bug or have an idea? Feel free to:\n- Open GitHub issues for problems or suggestions\n- Submit pull requests for improvements  \n- Add example notebooks or documentation\n\n## Acknowledgments\n\n**Built with AI Assistance**\n\nThis project was developed with GitHub Copilot (powered by Claude Sonnet 4), demonstrating the power of human-AI partnership in creating comprehensive data platforms.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftarekabouzeid%2Fdata-lab-playground","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftarekabouzeid%2Fdata-lab-playground","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftarekabouzeid%2Fdata-lab-playground/lists"}