https://github.com/cpgrant/wsl-ml-lakehouse-stack
WSL ML Lakehouse Stack: One-command local lakehouse (Ray, Spark, Kafka, MinIO, Airflow, dbt, Beam, Terraform, Jupyter). Includes a sample async web crawler that writes directly to MinIO/S3.
airflow dbt docker docker-compose kafka lakehouse minio ray spark terraform wsl
- Host: GitHub
- URL: https://github.com/cpgrant/wsl-ml-lakehouse-stack
- Owner: cpgrant
- License: other
- Created: 2025-08-26T13:12:56.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-08-28T07:25:42.000Z (10 months ago)
- Last Synced: 2025-09-04T06:15:47.635Z (10 months ago)
- Topics: airflow, dbt, docker, docker-compose, kafka, lakehouse, minio, ray, spark, terraform, wsl
- Language: Shell
- Homepage:
- Size: 78.1 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
---
# WSL ML Lakehouse Stack
A one-command **local ML lakehouse** running entirely on Docker + WSL2.
Includes: **MinIO (S3), Spark 3.5, Ray 2.x, Kafka, Airflow 2.9, dbt Core, Apache Beam, Terraform, a Jupyter PySpark notebook, and a simple async web crawler** for demo data ingestion.
> ✅ Tested on Windows 11 + WSL2 (Ubuntu 22.04, Docker Desktop WSL backend).
> Default ports: MinIO `9000/9001`, Ray `8265`, Spark `7077/8080/8081`, Airflow `8085`, Kafka `29092`, Jupyter `8888`.
---
## Quickstart
```bash
# Clone and prepare
git clone https://github.com/cpgrant/wsl-ml-lakehouse-stack wsl-ml-lakehouse-stack
cd wsl-ml-lakehouse-stack
cp .env.example .env # edit values as needed
# Start everything
docker compose pull
docker compose up -d
# Run smoke tests
./smoketests_all.sh
```
You should see ✅ for all components.
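
If a component does not come up, a quick port probe can narrow down which service is missing before digging into logs. The following is a minimal sketch, not part of the repository: it assumes the services are published on `localhost` at the default ports listed above, and the function names `port_open` and `check_ports` are hypothetical.

```bash
# Default host ports as listed in this README (name:port pairs).
SERVICES="minio:9000 minio:9001 ray:8265 spark:7077 spark:8080 spark:8081 airflow:8085 kafka:29092 jupyter:8888"

# Return 0 if a TCP connection to host:port succeeds within 2 seconds.
# Uses bash's /dev/tcp pseudo-device, so it needs bash and coreutils `timeout`.
port_open() {
  local host="$1" port="$2"
  timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Probe every entry in SERVICES; return non-zero if any port is closed.
check_ports() {
  local host="${1:-localhost}" failed=0 entry name port
  for entry in $SERVICES; do
    name="${entry%%:*}"
    port="${entry##*:}"
    if port_open "$host" "$port"; then
      echo "OK   ${name} (${port})"
    else
      echo "FAIL ${name} (${port})"
      failed=1
    fi
  done
  return "$failed"
}
```

After sourcing the snippet, `check_ports` probes `localhost`, and `check_ports 127.0.0.1` probes an explicit host. An open port only shows that something is listening; the smoke tests remain the real health check.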
---
## Smoke Tests
Run individually:
```bash
./smoketest_01_containers.sh
./smoketest_02_ray.sh
./smoketest_03_spark.sh
./smoketest_04_spark_minio.sh
./smoketest_05_kafka.sh
./smoketest_06_airflow.sh
./smoketest_07_dbt.sh
./smoketest_08_beam.sh
./smoketest_09_terraform.sh
./smoketest_10_delta_minio.sh
./smoketest_11_crawler.sh
```
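
When several components are failing at once, it can help to run every script without stopping at the first failure. This is a hedged sketch of one way to do that; `run_smoketests` is a hypothetical helper, not something the repository ships, and it assumes only the `smoketest_NN_*.sh` naming shown above.

```bash
# Run every smoketest_NN_*.sh in a directory, keep going on failure,
# and print a PASS/FAIL line per script plus a final summary.
run_smoketests() {
  local dir="${1:-.}" script passed=0 failed=0
  for script in "$dir"/smoketest_[0-9][0-9]_*.sh; do
    [ -e "$script" ] || continue   # the glob matched nothing
    # Script output is discarded here; rerun a failing script directly to see it.
    if bash "$script" >/dev/null 2>&1; then
      echo "PASS $(basename "$script")"
      passed=$((passed + 1))
    else
      echo "FAIL $(basename "$script")"
      failed=$((failed + 1))
    fi
  done
  echo "passed=${passed} failed=${failed}"
  [ "$failed" -eq 0 ]
}
```

Calling `run_smoketests .` from the repository root returns a non-zero status if any script fails, so it can also gate a CI step.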
The **crawler smoketest** fetches a few pages from
[`https://quotes.toscrape.com`](https://quotes.toscrape.com) and writes JSON lines to MinIO at:
```
s3://crawl/raw///smoketest.jsonl
```
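
To confirm the crawler actually wrote data, the objects can be listed and counted from the host. The sketch below rests on assumptions the README does not state: the AWS CLI is installed, MinIO's S3 API is reachable at `http://localhost:9000`, and `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` are exported with the MinIO credentials from `.env`. The function names are hypothetical.

```bash
# Assumed MinIO S3 endpoint; override by exporting MINIO_ENDPOINT.
MINIO_ENDPOINT="${MINIO_ENDPOINT:-http://localhost:9000}"

# List everything the crawler has written under the raw prefix.
list_crawl_output() {
  aws --endpoint-url "$MINIO_ENDPOINT" s3 ls "s3://crawl/raw/" --recursive
}

# Stream one object to stdout; $1 is a full s3:// URI taken from the listing.
fetch_object() {
  aws --endpoint-url "$MINIO_ENDPOINT" s3 cp "$1" -
}

# Count JSON Lines records on stdin: one non-blank line is one record.
count_jsonl_records() {
  awk 'NF { n++ } END { print n + 0 }'
}
```

A typical check pipes `fetch_object` with a URI copied from `list_crawl_output` into `count_jsonl_records`; a count of zero suggests the crawl ran but fetched nothing.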
---
## Documentation
* Full install and troubleshooting guide: [docs/install_notes.md](./docs/install_notes.md)
* Day-2 ops, GPU support, and CI sanity checks are also covered there.
---
## Security Notes
* **Change default credentials** (`minioadmin`, `admin/admin`) before using outside local dev.
* Avoid exposing host ports on shared networks without TLS and auth.
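
A simple guard against shipping the defaults is to scan `.env` for them. This sketch matches on values rather than key names, because the variable names in `.env.example` are not shown here; `find_default_creds` is a hypothetical helper.

```bash
# Print .env lines (with line numbers) whose value is still a known default:
# minioadmin or admin, optionally wrapped in double quotes.
# Follows grep's convention: returns 0 if any are found, 1 if none.
find_default_creds() {
  grep -nE '^[A-Za-z_][A-Za-z0-9_]*="?(minioadmin|admin)"?[[:space:]]*$' "${1:-.env}"
}
```

Running `find_default_creds .env` before exposing the stack prints the offending lines, if any. It only catches the two defaults named above, so it complements a manual review of `.env` rather than replacing one.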
---
## License
[MIT](./LICENSE)