https://github.com/dataspoclab/dataspoc-lens
Virtual warehouse — SQL + Jupyter + AI over cloud Parquet via DuckDB
cli data data-engineering data-lake duckdb etl parquet python singer sql
- Host: GitHub
- URL: https://github.com/dataspoclab/dataspoc-lens
- Owner: dataspoclab
- License: apache-2.0
- Created: 2026-03-24T20:06:57.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-04-15T19:26:12.000Z (about 2 months ago)
- Last Synced: 2026-04-15T20:33:28.999Z (about 2 months ago)
- Topics: cli, data, data-engineering, data-lake, duckdb, etl, parquet, python, singer, sql
- Language: Python
- Homepage: https://pypi.org/project/dataspoc-lens/
- Size: 76.2 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Codeowners: CODEOWNERS
- Security: SECURITY.md
Awesome Lists containing this project
- awesome-duckdb - DataSpoc Lens - Virtual warehouse over cloud Parquet. SQL shell, Jupyter/Marimo notebooks, AI natural language queries, and local cache — all powered by DuckDB. (Tools Powered by DuckDB)
README
# DataSpoc Lens
SQL over cloud Parquet. Query your data lake from the terminal.
## Why Lens?
Data teams store Parquet in S3, GCS, or Azure but still spin up heavy warehouses just to run SQL. **DataSpoc Lens** mounts cloud buckets as DuckDB views and gives you an interactive shell, notebooks, AI-powered queries, and local caching -- all from a single CLI. No servers, no infrastructure, no data copying.
## Installation
```bash
pip install dataspoc-lens
```
Cloud and feature extras:
```bash
# Quote the extras so shells such as zsh do not treat the brackets as a glob.
pip install "dataspoc-lens[s3]"       # AWS S3
pip install "dataspoc-lens[gcs]"      # Google Cloud Storage
pip install "dataspoc-lens[azure]"    # Azure Blob Storage
pip install "dataspoc-lens[jupyter]"  # JupyterLab integration
pip install "dataspoc-lens[ai]"       # AI natural language queries
pip install "dataspoc-lens[all]"      # Everything
```
## Quick Start
### 1. Initialize and register a bucket
```bash
dataspoc-lens init
dataspoc-lens add-bucket s3://my-data-lake
```
Lens discovers tables automatically -- first from DataSpoc Pipe's `.dataspoc/manifest.json`, then by scanning for `*.parquet` files.
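The discovery order can be pictured with a small sketch. This is not Lens's actual code: it runs against a local directory instead of a bucket, the manifest layout (`{"tables": {...}}`) is an assumption, and `discover_tables` is a hypothetical helper. It reads "first ..., then ..." as "manifest entries win, the scan fills in the rest", which is only one possible interpretation.

```python
import json
import tempfile
from pathlib import Path


def discover_tables(root: Path) -> dict[str, str]:
    """Return {table_name: source}, letting manifest entries take priority."""
    tables: dict[str, str] = {}
    manifest = root / ".dataspoc" / "manifest.json"
    if manifest.is_file():
        # Assumed layout: {"tables": {"<name>": {...}}}; the real schema may differ.
        data = json.loads(manifest.read_text())
        for name in data.get("tables", {}):
            tables[name] = "manifest"
    # Second pass: treat each directory holding Parquet files as one table,
    # without overwriting anything the manifest already declared.
    for path in sorted(root.rglob("*.parquet")):
        tables.setdefault(path.parent.name, "scan")
    return tables


# Build a throwaway directory that mimics a bucket with two tables.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / ".dataspoc").mkdir()
    (root / ".dataspoc" / "manifest.json").write_text(
        json.dumps({"tables": {"orders": {}}})
    )
    for table in ("orders", "events"):
        (root / table).mkdir()
        (root / table / "part-0.parquet").touch()
    found = discover_tables(root)
    print(found)
```

Here `orders` is reported as coming from the manifest and `events` from the scan.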
### 2. Explore the catalog
```bash
dataspoc-lens catalog
dataspoc-lens catalog --detail orders
```
### 3. Query with SQL
```bash
dataspoc-lens query "SELECT * FROM orders LIMIT 10"
dataspoc-lens query "SELECT status, COUNT(*) FROM orders GROUP BY status"
```
### 4. Launch the interactive shell
```bash
dataspoc-lens shell
```
```
lens> SELECT customer_id, SUM(total) FROM orders GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
lens> .tables
lens> .schema orders
lens> .export csv /tmp/orders.csv
lens> .quit
```
### 5. Configure AI and ask questions
Before using `ask`, configure an LLM provider:
**Option A -- Local AI (free, no API key):**
```bash
dataspoc-lens setup-ai
```
**Option B -- Cloud provider:**
```bash
# Anthropic (default)
export DATASPOC_LLM_API_KEY=sk-ant-...
# OpenAI
export DATASPOC_LLM_PROVIDER=openai
export DATASPOC_LLM_API_KEY=sk-...
```
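As a rough illustration of how these two variables might be resolved, the sketch below reads them with Anthropic as the fallback provider. `llm_settings` is a hypothetical helper, and the identifier `"anthropic"` is an assumption: the README only spells out the value `openai`.

```python
import os
from typing import Mapping, Optional, Tuple


def llm_settings(env: Optional[Mapping[str, str]] = None) -> Tuple[str, Optional[str]]:
    """Read the provider and API key, defaulting the provider to Anthropic."""
    env = os.environ if env is None else env
    # "anthropic" as the default's identifier is an assumption; only "openai"
    # appears as an explicit value in the README.
    provider = env.get("DATASPOC_LLM_PROVIDER", "anthropic")
    api_key = env.get("DATASPOC_LLM_API_KEY")
    return provider, api_key


# Only the key set: falls back to the default provider.
print(llm_settings({"DATASPOC_LLM_API_KEY": "sk-ant-example"}))
# Provider overridden explicitly.
print(llm_settings({"DATASPOC_LLM_PROVIDER": "openai", "DATASPOC_LLM_API_KEY": "sk-example"}))
```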
Then ask questions in natural language:
```bash
dataspoc-lens ask "how many orders were placed yesterday?"
dataspoc-lens ask "top 10 customers by revenue this month"
dataspoc-lens ask --debug "average order value by month"
```
Lens sends your table schemas and sample data to the LLM, receives SQL, executes it, and prints the results. Use `--debug` to see the full prompt sent to the LLM.
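The schema-plus-samples round trip can be sketched in a few lines. Everything here is illustrative: `sqlite3` stands in for DuckDB so the example runs with the standard library alone, `fake_llm` replaces a real provider call, and the prompt wording is invented rather than taken from Lens.

```python
import sqlite3


def build_prompt(conn: sqlite3.Connection, table: str, question: str, samples: int = 2) -> str:
    """Assemble a prompt from a table's columns, a few sample rows, and the question."""
    cursor = conn.execute(f'SELECT * FROM "{table}" LIMIT {int(samples)}')
    columns = [col[0] for col in cursor.description]
    rows = cursor.fetchall()
    return (
        f"Table {table} has columns: {', '.join(columns)}\n"
        f"Sample rows: {rows}\n"
        f"Write one SQL query that answers: {question}"
    )


def fake_llm(prompt: str) -> str:
    """Stand-in for a real provider call; always returns the same query."""
    return "SELECT status, COUNT(*) FROM orders GROUP BY status ORDER BY status"


# In-memory table playing the role of a mounted Parquet view.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, "paid"), (2, "paid"), (3, "open")]
)

prompt = build_prompt(conn, "orders", "how many orders per status?")
sql = fake_llm(prompt)                  # the model's reply is treated as SQL
result = conn.execute(sql).fetchall()   # and executed against the tables
print(prompt)
print(result)
```

Because sample rows are part of the prompt, this flow shows why `--debug` is useful for checking exactly what leaves the machine.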
### 6. Export results
Add `--export` to any `query` or `ask` command. Format is detected from the file extension:
```bash
dataspoc-lens query "SELECT * FROM orders" --export orders.csv
dataspoc-lens query "SELECT * FROM users" --export users.parquet
dataspoc-lens ask "monthly revenue" --export revenue.json
```
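Extension-based format detection might look like the sketch below. `detect_export_format` and the mapping are hypothetical and cover only the three extensions shown in the examples above; whether Lens accepts other formats is not stated.

```python
from pathlib import Path

# Only the formats that appear in the README examples; others are not assumed.
EXPORT_FORMATS = {".csv": "csv", ".parquet": "parquet", ".json": "json"}


def detect_export_format(path: str) -> str:
    """Map an output path to an export format based on its file extension."""
    suffix = Path(path).suffix.lower()
    try:
        return EXPORT_FORMATS[suffix]
    except KeyError:
        raise ValueError(f"unsupported export extension: {suffix!r}") from None


for target in ("orders.csv", "users.parquet", "revenue.json"):
    print(target, "->", detect_export_format(target))
```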
## Features
### Interactive Shell
SQL REPL with syntax highlighting, autocomplete, and history. Dot commands: `.tables`, `.schema`, `.buckets`, `.cache`, `.export`, `.help`, `.quit`.
### Notebook
Launch JupyterLab or Marimo with all tables pre-mounted:
```bash
# JupyterLab
pip install "dataspoc-lens[jupyter]"
dataspoc-lens notebook

# Marimo
pip install "dataspoc-lens[marimo]"
dataspoc-lens notebook --marimo
```
### SQL Transforms
Transforms are numbered `.sql` files in `~/.dataspoc-lens/transforms/` that run in order:
```bash
dataspoc-lens transform list
dataspoc-lens transform run
```
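How "in order" is resolved is not spelled out, so the sketch below only illustrates one plausible rule: sort by the leading number rather than alphabetically. The file names and `ordered_transforms` are hypothetical. Zero-padded prefixes such as `01_`, `02_` sidestep the difference between numeric and alphabetical sorting.

```python
import re


def ordered_transforms(names: list[str]) -> list[str]:
    """Sort .sql file names by their leading number, ignoring other files."""
    numbered = []
    for name in names:
        match = re.match(r"(\d+)", name)
        if name.endswith(".sql") and match:
            numbered.append((int(match.group(1)), name))
    # Tuples sort by number first, then by name for ties.
    return [name for _, name in sorted(numbered)]


files = ["10_marts.sql", "2_staging.sql", "1_sources.sql", "notes.txt"]
print(ordered_transforms(files))
```

A plain alphabetical sort would place `10_marts.sql` before `2_staging.sql`; the numeric key avoids that.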
### Cache
Copy tables locally for offline work and reduced egress costs:
```bash
dataspoc-lens cache orders # Cache a table
dataspoc-lens cache --list # Check status (fresh/stale)
dataspoc-lens cache orders --refresh # Re-download
dataspoc-lens cache --clear # Clear all
```
Freshness is determined by comparing the cache timestamp against the manifest's `last_extraction`.
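The comparison can be sketched as follows. `cache_status` is a hypothetical helper, the timestamps are made up, and treating an equal timestamp as fresh is an assumption about a boundary case the README does not describe.

```python
from datetime import datetime, timezone


def cache_status(cached_at: datetime, last_extraction: datetime) -> str:
    """Return "fresh" if the cache is at least as new as the last extraction."""
    return "fresh" if cached_at >= last_extraction else "stale"


# Example manifest value; real manifests may store this in another format.
extraction = datetime(2026, 4, 10, 12, 0, tzinfo=timezone.utc)

# Cached after the extraction: nothing newer exists upstream.
print(cache_status(datetime(2026, 4, 11, 8, 0, tzinfo=timezone.utc), extraction))
# Cached before the extraction: the bucket holds newer data.
print(cache_status(datetime(2026, 4, 9, 8, 0, tzinfo=timezone.utc), extraction))
```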
## Commands
```bash
dataspoc-lens init                              # Initialize configuration
dataspoc-lens add-bucket <bucket-uri>           # Register a bucket
dataspoc-lens catalog                           # List all tables
dataspoc-lens catalog --detail <table>          # Show table schema
dataspoc-lens query "<sql>"                     # Execute SQL query
dataspoc-lens query "<sql>" --export f.csv      # Execute and export
dataspoc-lens shell                             # Interactive SQL shell
dataspoc-lens ask "<question>"                  # Natural language query
dataspoc-lens ask "<question>" --debug          # Show LLM prompt
dataspoc-lens setup-ai                          # Install local AI (Ollama)
dataspoc-lens notebook                          # Launch JupyterLab
dataspoc-lens notebook --marimo                 # Launch Marimo
dataspoc-lens transform list                    # List transform files
dataspoc-lens transform run                     # Run all transforms
dataspoc-lens cache <table>                     # Cache a table locally
dataspoc-lens cache --list                      # List cached tables
dataspoc-lens cache --clear                     # Clear cache
dataspoc-lens ml activate [key]                 # Activate DataSpoc ML
dataspoc-lens ml train --target col --from tbl  # Train a model
dataspoc-lens ml predict --model m --from tbl   # Generate predictions
dataspoc-lens ml models                         # List trained models
dataspoc-lens --version                         # Show version
```
## Part of the DataSpoc Platform
| Product | Role |
|---------|------|
| **[DataSpoc Pipe](https://github.com/dataspoclab/dataspoc-pipe)** | Ingestion: Singer taps to Parquet in cloud buckets |
| **[DataSpoc Lens](https://github.com/dataspoclab/dataspoc-lens)** (this) | Virtual warehouse: SQL + Jupyter + AI over your data lake |
| **DataSpoc ML** | AutoML: train and deploy models from your lake |
Pipe writes. Lens reads. ML learns.
## Community
- **GitHub Issues** -- [Report bugs or request features](https://github.com/dataspoclab/dataspoc-lens/issues)
- **Contributing** -- PRs welcome. Run `pytest tests/ -v` before submitting.
## License
[Apache-2.0](LICENSE) -- free to use, modify, and distribute.