https://github.com/markjacksonfishing/pipedreams

A play on pipelines, with a focus on making data accessible and insightful.
https://github.com/markjacksonfishing/pipedreams

backend data-engineering data-processing data-visualization deployment etl frontend machine-learning python streamlit

Last synced: 19 days ago
JSON representation

A play on pipelines, with a focus on making data accessible and insightful.

Host: GitHub
URL: https://github.com/markjacksonfishing/pipedreams
Owner: markjacksonfishing
License: mit
Created: 2024-10-31T16:50:46.000Z (6 months ago)
Default Branch: main
Last Pushed: 2024-11-18T15:23:04.000Z (5 months ago)
Last Synced: 2025-04-02T14:14:31.874Z (24 days ago)
Topics: backend, data-engineering, data-processing, data-visualization, deployment, etl, frontend, machine-learning, python, streamlit
Language: Python
Homepage: https://www.anuclei.com
Size: 7.92 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# PipeDreams - CSV Data Explorer

![PipeDreams Header](images/istockphoto-1502938892-612x612.jpg)

PipeDreams is a data exploration and visualization tool designed to be simple, flexible, and powerful. Upload any CSV file, perform basic ETL transformations, visualize your data, and gain insights with built-in machine learning—all from a user-friendly Streamlit interface. If no file is uploaded, a default dataset (`customers-100000.csv`) is used for demonstration.

## Features

- **CSV Upload**: Upload any CSV file for immediate analysis and visualization.
- **Default Dataset**: If no file is uploaded, the app loads a sample dataset (`customers-100000.csv`) located in the `data/` directory.
- **ETL Transformations**: Clean and transform data, remove missing values, and auto-convert data types.
- **Data Visualization**: Interactive charts (scatter, bar, line, histogram, and box plots) to gain insights from your data.
- **Clustering Analysis**: Use KMeans clustering to identify natural groupings within the data, helping to segment and classify.
- **Predictive Analysis**: A synthetic column (`Annual Purchase Amount`) is included for testing linear regression, allowing users to explore predictive analysis features.

## Getting Started with Docker

You can run PipeDreams using Docker to avoid setting up dependencies locally. The pre-built Docker image is available on [DockerHub](https://hub.docker.com/repository/docker/anuclei/pipedreams).

### Pulling the Docker Image

Pull the latest Docker image from DockerHub:

```bash
docker pull anuclei/pipedreams:latest
```

### Running the Docker Container

Run the application with Docker, exposing it on port 8501:

```bash
docker run -p 8501:8501 anuclei/pipedreams:latest
```

Once the container is running, open your browser and go to `http://localhost:8501` to access the application.

## Kubernetes Deployment

PipeDreams can also be deployed on a Kubernetes cluster. This deployment scenario uses Minikube for local Kubernetes clusters and includes configurations for high availability and autoscaling.

For detailed instructions and YAML configurations, refer to the [Kubernetes Deployment Guide](k8s/k8s.md) in the `k8s` directory.

## Manual Installation

If you prefer not to use Docker, you can set up the app manually.

### Prerequisites

- **Python** (version 3.6 or higher)

### Installation

1. **Clone the repository**:
```bash
git clone https://github.com/markjacksonfishing/pipedreams.git
cd pipedreams
```

2. **Set up a virtual environment**:
- **MacOS/Linux**:
```bash
python3 -m venv venv
source venv/bin/activate
```
- **Windows**:
```bash
python -m venv venv
.\venv\Scripts\activate
```

3. **Install dependencies**:
```bash
pip install -r requirements.txt
```

4. **Run the application**:
- **MacOS/Linux**:
```bash
source venv/bin/activate
streamlit run app.py
```
- **Windows**:
```bash
.\venv\Scripts\activate
streamlit run app.py
```

The application will open in your default web browser at `http://localhost:8501` and will look like this:
![PipeDreams Browser](images/running_broswer.jpeg)

5. **Deactivate the virtual environment** (when finished):
```bash
deactivate
```

## How to Use

1. Start the application by following the setup steps above (or run it via Docker).
2. **Upload a CSV file** using the file uploader in the app, or view the **default dataset** if no file is uploaded.
3. Explore the data with built-in ETL transformations and interactive visualizations.
4. Perform **clustering analysis** and **predictive analysis** on available data.

### Advanced Insights: Clustering and Predictive Analysis

- **Clustering Analysis**: Select features for clustering, and the app will automatically group data into clusters using KMeans. This can reveal natural groupings in the data, such as customer segments.
- **Predictive Analysis**: Select features and a target variable (e.g., the synthetic `Annual Purchase Amount`) for linear regression. The app will generate a prediction model, display a mean squared error metric, and show an interactive scatter plot comparing actual vs. predicted values.

### Default Dataset: `customers-100000.csv`

The default dataset, `customers-100000.csv`, is located in the `data/` directory. If no CSV file is uploaded, this dataset will automatically load, allowing users to test the ETL transformations, visualizations, clustering, and predictive analysis features without needing their own data file.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/markjacksonfishing/pipedreams

Awesome Lists containing this project

README