{"id":18183637,"url":"https://github.com/markjacksonfishing/pipedreams","last_synced_at":"2026-04-27T20:32:52.491Z","repository":{"id":260649881,"uuid":"881479860","full_name":"markjacksonfishing/pipedreams","owner":"markjacksonfishing","description":"A play on pipelines, with a focus on making data accessible and insightful.","archived":false,"fork":false,"pushed_at":"2024-11-18T15:23:04.000Z","size":8308,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-15T10:11:48.057Z","etag":null,"topics":["backend","data-engineering","data-processing","data-visualization","deployment","etl","frontend","machine-learning","python","streamlit"],"latest_commit_sha":null,"homepage":"https://www.anuclei.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/markjacksonfishing.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-31T16:50:46.000Z","updated_at":"2024-11-18T15:13:32.000Z","dependencies_parsed_at":"2024-11-01T16:20:55.622Z","dependency_job_id":"4f5001c6-4dc5-4de2-acbd-aed4e9262cbe","html_url":"https://github.com/markjacksonfishing/pipedreams","commit_stats":null,"previous_names":["markjacksonfishing/pipedreams"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/markjacksonfishing/pipedreams","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/markjacksonfishing%2Fpipedreams","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/markjacksonfishing%2Fpipedreams/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/markjacksonfishing%2Fpipedreams/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/markjacksonfishing%2Fpipedreams/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/markjacksonfishing","download_url":"https://codeload.github.com/markjacksonfishing/pipedreams/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/markjacksonfishing%2Fpipedreams/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32354567,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-27T20:07:02.737Z","status":"ssl_error","status_checked_at":"2026-04-27T20:07:00.910Z","response_time":128,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["backend","data-engineering","data-processing","data-visualization","deployment","etl","frontend","machine-learning","python","streamlit"],"created_at":"2024-11-02T20:03:41.676Z","updated_at":"2026-04-27T20:32:52.475Z","avatar_url":"https://github.com/markjacksonfishing.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PipeDreams - CSV Data Explorer\n\n![PipeDreams Header](images/istockphoto-1502938892-612x612.jpg)\n\nPipeDreams is a data exploration and visualization tool designed to be simple, flexible, and powerful. Upload any CSV file, perform basic ETL transformations, visualize your data, and gain insights with built-in machine learning—all from a user-friendly Streamlit interface. If no file is uploaded, a default dataset (`customers-100000.csv`) is used for demonstration.\n\n## Features\n\n- **CSV Upload**: Upload any CSV file for immediate analysis and visualization.\n- **Default Dataset**: If no file is uploaded, the app loads a sample dataset (`customers-100000.csv`) located in the `data/` directory.\n- **ETL Transformations**: Clean and transform data, remove missing values, and auto-convert data types.\n- **Data Visualization**: Interactive charts (scatter, bar, line, histogram, and box plots) to gain insights from your data.\n- **Clustering Analysis**: Use KMeans clustering to identify natural groupings within the data, helping to segment and classify.\n- **Predictive Analysis**: A synthetic column (`Annual Purchase Amount`) is included for testing linear regression, allowing users to explore predictive analysis features.\n\n## Getting Started with Docker\n\nYou can run PipeDreams using Docker to avoid setting up dependencies locally. The pre-built Docker image is available on [DockerHub](https://hub.docker.com/repository/docker/anuclei/pipedreams).\n\n### Pulling the Docker Image\n\nPull the latest Docker image from DockerHub:\n\n```bash\ndocker pull anuclei/pipedreams:latest\n```\n\n### Running the Docker Container\n\nRun the application with Docker, exposing it on port 8501:\n\n```bash\ndocker run -p 8501:8501 anuclei/pipedreams:latest\n```\n\nOnce the container is running, open your browser and go to `http://localhost:8501` to access the application.\n\n## Kubernetes Deployment\n\nPipeDreams can also be deployed on a Kubernetes cluster. This deployment scenario uses Minikube for local Kubernetes clusters and includes configurations for high availability and autoscaling.\n\nFor detailed instructions and YAML configurations, refer to the [Kubernetes Deployment Guide](k8s/k8s.md) in the `k8s` directory.\n\n## Manual Installation\n\nIf you prefer not to use Docker, you can set up the app manually.\n\n### Prerequisites\n\n- **Python** (version 3.6 or higher)\n\n### Installation\n\n1. **Clone the repository**:\n   ```bash\n   git clone https://github.com/markjacksonfishing/pipedreams.git\n   cd pipedreams\n   ```\n\n2. **Set up a virtual environment**:\n   - **MacOS/Linux**:\n     ```bash\n     python3 -m venv venv\n     source venv/bin/activate\n     ```\n   - **Windows**:\n     ```bash\n     python -m venv venv\n     .\\venv\\Scripts\\activate\n     ```\n\n3. **Install dependencies**:\n   ```bash\n   pip install -r requirements.txt\n   ```\n\n4. **Run the application**:\n   - **MacOS/Linux**:\n     ```bash\n     source venv/bin/activate\n     streamlit run app.py\n     ```\n   - **Windows**:\n     ```bash\n     .\\venv\\Scripts\\activate\n     streamlit run app.py\n     ```\n\n   The application will open in your default web browser at `http://localhost:8501` and will look like this:\n![PipeDreams Browser](images/running_broswer.jpeg)\n\n5. **Deactivate the virtual environment** (when finished):\n   ```bash\n   deactivate\n   ```\n\n## How to Use\n\n1. Start the application by following the setup steps above (or run it via Docker).\n2. **Upload a CSV file** using the file uploader in the app, or view the **default dataset** if no file is uploaded.\n3. Explore the data with built-in ETL transformations and interactive visualizations.\n4. Perform **clustering analysis** and **predictive analysis** on available data.\n\n### Advanced Insights: Clustering and Predictive Analysis\n\n- **Clustering Analysis**: Select features for clustering, and the app will automatically group data into clusters using KMeans. This can reveal natural groupings in the data, such as customer segments.\n- **Predictive Analysis**: Select features and a target variable (e.g., the synthetic `Annual Purchase Amount`) for linear regression. The app will generate a prediction model, display a mean squared error metric, and show an interactive scatter plot comparing actual vs. predicted values.\n\n### Default Dataset: `customers-100000.csv`\n\nThe default dataset, `customers-100000.csv`, is located in the `data/` directory. If no CSV file is uploaded, this dataset will automatically load, allowing users to test the ETL transformations, visualizations, clustering, and predictive analysis features without needing their own data file.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmarkjacksonfishing%2Fpipedreams","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmarkjacksonfishing%2Fpipedreams","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmarkjacksonfishing%2Fpipedreams/lists"}