https://github.com/cpgrant/wsl-ml-lakehouse-stack
WSL-friendly local lakehouse stack: Spark, Ray, Airflow, Kafka, MinIO, dbt, Beam, Terraform + smoketests
- Host: GitHub
- URL: https://github.com/cpgrant/wsl-ml-lakehouse-stack
- Owner: cpgrant
- License: other
- Created: 2025-08-26T13:12:56.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2025-08-26T15:20:25.000Z (about 2 months ago)
- Last Synced: 2025-08-26T17:58:13.946Z (about 2 months ago)
- Topics: airflow, dbt, docker, docker-compose, kafka, lakehouse, minio, ray, spark, terraform, wsl
- Language: Shell
- Size: 41 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
---
# WSL ML Lakehouse Stack

A one-command **local ML lakehouse** running entirely on Docker + WSL2.

Includes: **MinIO (S3), Spark 3.5, Ray 2.x, Kafka, Airflow 2.9, dbt Core, Apache Beam, Terraform, a Jupyter PySpark notebook, and a simple async web crawler** for demo data ingestion.

> ✅ Tested on Windows 11 + WSL2 (Ubuntu 22.04, Docker Desktop WSL backend).
> Default ports: MinIO `9000/9001`, Ray `8265`, Spark `7077/8080/8081`, Airflow `8085`, Kafka `29092`, Jupyter `8888`.

---

## Quickstart

```bash
# Clone and prepare
git clone https://github.com/cpgrant/wsl-ml-lakehouse-stack
cd wsl-ml-lakehouse-stack
cp .env.example .env   # edit values as needed

# Start everything
docker compose pull
docker compose up -d

# Run smoke tests
./smoketests_all.sh
```

You should see ✅ for all components.
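
If a component does not come up, it can help to check which of the default host ports listed at the top of this README are accepting connections. The following is a minimal sketch, not part of the repository: `check_port` is a hypothetical helper, and it assumes `bash` (for the `/dev/tcp` pseudo-device) and coreutils `timeout` are available, as they are on a stock Ubuntu WSL install.

```bash
#!/usr/bin/env bash
# Sketch: report which of the README's default host ports accept TCP connections.
# check_port is a hypothetical helper, not a script shipped with the repository.

check_port() {
  local host="$1" port="$2"
  # bash's /dev/tcp pseudo-device opens a TCP connection; timeout bounds the wait.
  timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Service names and ports copied from the "Default ports" line of the README.
ports=(
  "MinIO:9000" "MinIO:9001" "Ray:8265"
  "Spark:7077" "Spark:8080" "Spark:8081"
  "Airflow:8085" "Kafka:29092" "Jupyter:8888"
)

for entry in "${ports[@]}"; do
  name="${entry%%:*}"   # text before the colon
  port="${entry##*:}"   # text after the colon
  if check_port localhost "$port"; then
    echo "open    ${name} ${port}"
  else
    echo "closed  ${name} ${port}"
  fi
done
```

An open port only shows that something is listening; the smoke tests remain the real check that each service works.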
---
## Smoke Tests
Run individually:
```bash
./smoketest_01_containers.sh
./smoketest_02_ray.sh
./smoketest_03_spark.sh
./smoketest_04_spark_minio.sh
./smoketest_05_kafka.sh
./smoketest_06_airflow.sh
./smoketest_07_dbt.sh
./smoketest_08_beam.sh
./smoketest_09_terraform.sh
./smoketest_10_delta_minio.sh
./smoketest_11_crawler.sh
```

The **crawler smoketest** fetches a few pages from
[`https://quotes.toscrape.com`](https://quotes.toscrape.com) and writes JSON lines to MinIO at:

```
s3://crawl/raw///smoketest.jsonl
```

---
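
The numbered smoke tests listed in this section can also be chained with a pass/fail summary. The repository already ships `smoketests_all.sh`, whose contents are not shown here, so the sketch below only illustrates one way such a runner might look. It assumes each `smoketest_NN_*.sh` script exits non-zero on failure; `run_smoketests` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch of a runner for the numbered smoke tests.
# Assumption: each smoketest_NN_*.sh exits non-zero on failure.
# run_smoketests is a hypothetical helper, not a script from the repository.

run_smoketests() {
  local dir="${1:-.}" failed=0 script
  # Zero-padded prefixes (01..11) make the glob expand in execution order.
  for script in "$dir"/smoketest_*.sh; do
    [ -e "$script" ] || continue   # the glob matched nothing
    if bash "$script"; then
      echo "PASS ${script##*/}"
    else
      echo "FAIL ${script##*/}"
      failed=$((failed + 1))
    fi
  done
  echo "failed: ${failed}"
  # Return success only if every script passed.
  [ "$failed" -eq 0 ]
}
```

Calling `run_smoketests .` from the repository root would run every numbered script even after a failure, which makes it easier to see at a glance which components are unhealthy.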
## Documentation
* Full install and troubleshooting guide: [docs/install_notes.md](./docs/install_notes.md)
* Day-2 ops, GPU support, and CI sanity checks are also covered there.

---
## Security Notes
* **Change default credentials** (`minioadmin`, `admin/admin`) before using outside local dev.
* Avoid exposing host ports on shared networks without TLS and auth.

---
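
One way to act on the first note above is to generate random secrets and write them into `.env` before starting the stack. The sketch below is illustrative only: `random_secret` and `set_env_var` are hypothetical helpers, and the variable name in the usage comment (`MINIO_ROOT_PASSWORD`, MinIO's standard setting) is an assumption about this stack's `.env`, so check `.env.example` for the keys it actually uses. It relies on GNU `sed -i`, as found on Ubuntu under WSL.

```bash
#!/usr/bin/env bash
# Sketch: replace default credentials in a .env file with random values.
# random_secret and set_env_var are hypothetical helpers, not repository scripts.

random_secret() {
  # 16 random bytes rendered as 32 lowercase hex characters.
  od -An -N16 -tx1 /dev/urandom | tr -d ' \n'
}

set_env_var() {
  local file="$1" key="$2" value="$3"
  # The value must not contain '|' or '&' (hex secrets from random_secret are safe).
  if grep -q "^${key}=" "$file"; then
    sed -i "s|^${key}=.*|${key}=${value}|" "$file"   # update the existing key
  else
    echo "${key}=${value}" >> "$file"                # append a missing key
  fi
}

# Usage (assumed variable name; confirm against .env.example):
#   set_env_var .env MINIO_ROOT_PASSWORD "$(random_secret)"
```

Services generally read credentials at container start, so a changed `.env` takes effect only after the affected containers are recreated.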
## License
[MIT](./LICENSE)