{"id":48986239,"url":"https://github.com/dataspoclab/dataspoc-lens","last_synced_at":"2026-04-20T10:00:41.688Z","repository":{"id":351622496,"uuid":"1190981585","full_name":"dataspoclab/dataspoc-lens","owner":"dataspoclab","description":"Virtual warehouse — SQL + Jupyter + AI over cloud Parquet via DuckDB","archived":false,"fork":false,"pushed_at":"2026-04-15T19:26:12.000Z","size":78,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-15T20:33:28.999Z","etag":null,"topics":["cli","data","data-engineering","data-lake","duckdb","etl","parquet","python","singer","sql"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/dataspoc-lens/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dataspoclab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-24T20:06:57.000Z","updated_at":"2026-04-15T19:26:16.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/dataspoclab/dataspoc-lens","commit_stats":null,"previous_names":["dataspoclab/dataspoc-lens"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/dataspoclab/dataspoc-lens","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataspoclab%2Fdataspoc-lens","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataspoclab%2Fdataspoc-lens/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataspoclab%2Fdataspoc-lens/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataspoclab%2Fdataspoc-lens/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dataspoclab","download_url":"https://codeload.github.com/dataspoclab/dataspoc-lens/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataspoclab%2Fdataspoc-lens/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32042293,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-20T00:18:06.643Z","status":"online","status_checked_at":"2026-04-20T02:00:06.527Z","response_time":94,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","data","data-engineering","data-lake","duckdb","etl","parquet","python","singer","sql"],"created_at":"2026-04-18T13:00:12.481Z","updated_at":"2026-04-20T10:00:41.682Z","avatar_url":"https://github.com/dataspoclab.png","language":"Python","funding_links":[],"categories":["Tools Powered by DuckDB"],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003eDataSpoc Lens\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/dataspoclab/dataspoc-lens/actions\"\u003e\u003cimg src=\"https://img.shields.io/github/actions/workflow/status/dataspoclab/dataspoc-lens/ci.yml?branch=main\u0026style=flat-square\u0026label=CI\" alt=\"CI\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://pypi.org/project/dataspoc-lens/\"\u003e\u003cimg src=\"https://img.shields.io/pypi/v/dataspoc-lens?style=flat-square\" alt=\"PyPI\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/dataspoclab/dataspoc-lens/blob/main/LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/license-Apache%202.0-blue?style=flat-square\" alt=\"License\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://pypi.org/project/dataspoc-lens/\"\u003e\u003cimg src=\"https://img.shields.io/pypi/pyversions/dataspoc-lens?style=flat-square\" alt=\"Python 3.10+\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\u003cem\u003eSQL over cloud Parquet. Query your data lake from the terminal.\u003c/em\u003e\u003c/p\u003e\n\n## Why Lens?\n\nData teams store Parquet in S3, GCS, or Azure but still spin up heavy warehouses just to run SQL. **DataSpoc Lens** mounts cloud buckets as DuckDB views and gives you an interactive shell, notebooks, AI-powered queries, and local caching -- all from a single CLI. No servers, no infrastructure, no data copying.\n\n## Installation\n\n```bash\npip install dataspoc-lens\n```\n\nCloud and feature extras:\n\n```bash\npip install dataspoc-lens[s3]       # AWS S3\npip install dataspoc-lens[gcs]      # Google Cloud Storage\npip install dataspoc-lens[azure]    # Azure Blob Storage\npip install dataspoc-lens[jupyter]  # JupyterLab integration\npip install dataspoc-lens[ai]       # AI natural language queries\npip install dataspoc-lens[all]      # Everything\n```\n\n## Quick Start\n\n### 1. Initialize and register a bucket\n\n```bash\ndataspoc-lens init\ndataspoc-lens add-bucket s3://my-data-lake\n```\n\nLens discovers tables automatically -- first from Pipe's `.dataspoc/manifest.json`, then by scanning for `*.parquet` files.\n\n### 2. Explore the catalog\n\n```bash\ndataspoc-lens catalog\ndataspoc-lens catalog --detail orders\n```\n\n### 3. Query with SQL\n\n```bash\ndataspoc-lens query \"SELECT * FROM orders LIMIT 10\"\ndataspoc-lens query \"SELECT status, COUNT(*) FROM orders GROUP BY status\"\n```\n\n### 4. Launch the interactive shell\n\n```bash\ndataspoc-lens shell\n```\n\n```\nlens\u003e SELECT customer_id, SUM(total) FROM orders GROUP BY 1 ORDER BY 2 DESC LIMIT 10;\nlens\u003e .tables\nlens\u003e .schema orders\nlens\u003e .export csv /tmp/orders.csv\nlens\u003e .quit\n```\n\n### 5. Configure AI and ask questions\n\nBefore using `ask`, configure an LLM provider:\n\n**Option A -- Local AI (free, no API key):**\n\n```bash\ndataspoc-lens setup-ai\n```\n\n**Option B -- Cloud provider:**\n\n```bash\n# Anthropic (default)\nexport DATASPOC_LLM_API_KEY=sk-ant-...\n\n# OpenAI\nexport DATASPOC_LLM_PROVIDER=openai\nexport DATASPOC_LLM_API_KEY=sk-...\n```\n\nThen ask questions in natural language:\n\n```bash\ndataspoc-lens ask \"how many orders were placed yesterday?\"\ndataspoc-lens ask \"top 10 customers by revenue this month\"\ndataspoc-lens ask --debug \"average order value by month\"\n```\n\nLens sends your table schemas and sample data to the LLM, receives SQL, executes it, and prints the results. Use `--debug` to see the full prompt sent to the LLM.\n\n### 6. Export results\n\nAdd `--export` to any `query` or `ask` command. Format is detected from the file extension:\n\n```bash\ndataspoc-lens query \"SELECT * FROM orders\" --export orders.csv\ndataspoc-lens query \"SELECT * FROM users\" --export users.parquet\ndataspoc-lens ask \"monthly revenue\" --export revenue.json\n```\n\n## Features\n\n### Interactive Shell\n\nSQL REPL with syntax highlighting, autocomplete, and history. Dot commands: `.tables`, `.schema \u003ctable\u003e`, `.buckets`, `.cache \u003ctable\u003e`, `.export \u003cformat\u003e \u003cpath\u003e`, `.help`, `.quit`.\n\n### Notebook\n\nLaunch JupyterLab or Marimo with all tables pre-mounted:\n\n```bash\npip install dataspoc-lens[jupyter]\ndataspoc-lens notebook\n\npip install dataspoc-lens[marimo]\ndataspoc-lens notebook --marimo\n```\n\n### SQL Transforms\n\nNumbered `.sql` files in `~/.dataspoc-lens/transforms/` that run in order:\n\n```bash\ndataspoc-lens transform list\ndataspoc-lens transform run\n```\n\n### Cache\n\nCopy tables locally for offline work and reduced egress costs:\n\n```bash\ndataspoc-lens cache orders              # Cache a table\ndataspoc-lens cache --list              # Check status (fresh/stale)\ndataspoc-lens cache orders --refresh    # Re-download\ndataspoc-lens cache --clear             # Clear all\n```\n\nFreshness: compares your cache timestamp against the manifest's `last_extraction`.\n\n## Commands\n\n```bash\ndataspoc-lens init                          # Initialize configuration\ndataspoc-lens add-bucket \u003curi\u003e              # Register a bucket\ndataspoc-lens catalog                       # List all tables\ndataspoc-lens catalog --detail \u003ctable\u003e      # Show table schema\ndataspoc-lens query \"\u003csql\u003e\"                 # Execute SQL query\ndataspoc-lens query \"\u003csql\u003e\" --export f.csv  # Execute and export\ndataspoc-lens shell                         # Interactive SQL shell\ndataspoc-lens ask \"\u003cquestion\u003e\"              # Natural language query\ndataspoc-lens ask \"\u003cquestion\u003e\" --debug      # Show LLM prompt\ndataspoc-lens setup-ai                      # Install local AI (Ollama)\ndataspoc-lens notebook                      # Launch JupyterLab\ndataspoc-lens notebook --marimo             # Launch Marimo\ndataspoc-lens transform list                # List transform files\ndataspoc-lens transform run                 # Run all transforms\ndataspoc-lens cache \u003ctable\u003e                 # Cache a table locally\ndataspoc-lens cache --list                  # List cached tables\ndataspoc-lens cache --clear                 # Clear cache\ndataspoc-lens ml activate [key]             # Activate DataSpoc ML\ndataspoc-lens ml train --target col --from tbl  # Train a model\ndataspoc-lens ml predict --model m --from tbl   # Generate predictions\ndataspoc-lens ml models                     # List trained models\ndataspoc-lens --version                     # Show version\n```\n\n## Part of the DataSpoc Platform\n\n| Product | Role |\n|---------|------|\n| **[DataSpoc Pipe](https://github.com/dataspoclab/dataspoc-pipe)** | Ingestion: Singer taps to Parquet in cloud buckets |\n| **[DataSpoc Lens](https://github.com/dataspoclab/dataspoc-lens)** (this) | Virtual warehouse: SQL + Jupyter + AI over your data lake |\n| **DataSpoc ML** | AutoML: train and deploy models from your lake |\n\nPipe writes. Lens reads. ML learns.\n\n## Community\n\n- **GitHub Issues** -- [Report bugs or request features](https://github.com/dataspoclab/dataspoc-lens/issues)\n- **Contributing** -- PRs welcome. Run `pytest tests/ -v` before submitting.\n\n## License\n\n[Apache-2.0](LICENSE) -- free to use, modify, and distribute.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdataspoclab%2Fdataspoc-lens","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdataspoclab%2Fdataspoc-lens","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdataspoclab%2Fdataspoc-lens/lists"}