{"id":13692152,"url":"https://github.com/VIDA-NYU/PipelineVis","last_synced_at":"2025-05-02T19:31:27.024Z","repository":{"id":42501598,"uuid":"212124377","full_name":"VIDA-NYU/PipelineVis","owner":"VIDA-NYU","description":"Pipeline Profiler is a tool for visualizing machine learning pipelines generated by AutoML tools.","archived":false,"fork":false,"pushed_at":"2023-09-13T21:10:19.000Z","size":3836,"stargazers_count":84,"open_issues_count":13,"forks_count":7,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-04-15T12:23:16.127Z","etag":null,"topics":["automl","jupyter","machine-learning","visualization"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/VIDA-NYU.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-10-01T14:57:18.000Z","updated_at":"2024-10-02T15:17:47.000Z","dependencies_parsed_at":"2024-01-06T14:09:44.833Z","dependency_job_id":"0d463d2d-65ee-4d06-b29a-d93aee8605d0","html_url":"https://github.com/VIDA-NYU/PipelineVis","commit_stats":{"total_commits":266,"total_committers":5,"mean_commits":53.2,"dds":0.05639097744360899,"last_synced_commit":"8e1126b534317401907ba7fe92c58152965d64ed"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VIDA-NYU%2FPipelineVis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VIDA-NYU%2FPipelineVis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VIDA-NYU%2FPipelineVis/releases","manife
sts_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VIDA-NYU%2FPipelineVis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/VIDA-NYU","download_url":"https://codeload.github.com/VIDA-NYU/PipelineVis/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252095225,"owners_count":21693877,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automl","jupyter","machine-learning","visualization"],"created_at":"2024-08-02T17:00:54.199Z","updated_at":"2025-05-02T19:31:26.365Z","avatar_url":"https://github.com/VIDA-NYU.png","language":"JavaScript","funding_links":[],"categories":["JavaScript","[↑](#contents) Data Analytics"],"sub_categories":[],"readme":"# PipelineProfiler\n\nAutoML Pipeline exploration tool compatible with Jupyter Notebooks. 
Supports the Auto-Sklearn, Alpha-AutoML, and D3M pipeline formats.\n\n[![arxiv badge](https://img.shields.io/badge/arXiv-2005.00160-red)](https://arxiv.org/abs/2005.00160)\n\n![System screen](https://github.com/VIDA-NYU/PipelineVis/raw/master/imgs/system.png)\n\n(Shift-click to select multiple pipelines.)\n\n**Paper**: [https://arxiv.org/abs/2005.00160](https://arxiv.org/abs/2005.00160)\n\n**Video**: [https://youtu.be/2WSYoaxLLJ8](https://youtu.be/2WSYoaxLLJ8)\n\n**Blog**: [Medium post](https://towardsdatascience.com/exploring-auto-sklearn-models-with-pipelineprofiler-5b2c54136044)\n\n## Demo\n\nLive demo (Google Colab):\n- [Heart Stat Log data](https://colab.research.google.com/drive/1k_h4HWUKsd83PmYMEBJ87UP2SSJQYw9A?usp=sharing)\n- [auto-sklearn classification](https://colab.research.google.com/drive/1_2FRIkHNFGOiIJt-n_3zuh8vpSMLhwzx?usp=sharing)\n\nIn a Jupyter Notebook:\n```Python\nimport PipelineProfiler\ndata = PipelineProfiler.get_heartstatlog_data()\nPipelineProfiler.plot_pipeline_matrix(data)\n```\n\nYou can also find multiple examples of PipelineProfiler in the [Alpha-AutoML repository](https://github.com/VIDA-NYU/alpha-automl/tree/devel/examples), an extensible AutoML system for multiple ML tasks.\n\n## Install\n\n### Option 1: Install via pip:\n~~~~\npip install pipelineprofiler\n~~~~\n\n### Option 2: Run the Docker image:\n~~~~\ndocker build -t pipelineprofiler .\ndocker run -p 9999:8888 pipelineprofiler\n~~~~\n\nThen copy the access token and log in to Jupyter in the browser at:\n~~~~\nlocalhost:9999\n~~~~\n\n## Data preprocessing\n\nPipelineProfiler reads data from the D3M Metalearning database. You can download this data from: https://metalearning.datadrivendiscovery.org/dumps/2020/03/04/metalearningdb_dump_20200304.tar.gz\n\nYou need to merge two files in order to explore the pipelines: pipelines.json and pipeline_runs.json. 
To do so, run:\n~~~~\npython -m PipelineProfiler.pipeline_merge [-n NUMBER_PIPELINES] pipeline_runs_file pipelines_file output_file\n~~~~\n\n## Pipeline exploration\n\n```Python\nimport PipelineProfiler\nimport json\n```\n\nIn a Jupyter notebook, load the output file:\n\n```Python\nwith open(\"output_file.json\", \"r\") as f:\n    pipelines = json.load(f)\n```\n\nand then plot it using:\n\n```Python\nPipelineProfiler.plot_pipeline_matrix(pipelines[:10])\n```\n\n## Data postprocessing\n\nYou might want to group pipelines by problem and select the top k pipelines from each team. To do so, use the following code:\n\n```Python\nfrom collections import defaultdict\n\ndef get_top_k_pipelines_team(pipelines, k):\n    # Group pipelines by the team (pipeline source) that produced them.\n    team_pipelines = defaultdict(list)\n    for pipeline in pipelines:\n        source = pipeline['pipeline_source']['name']\n        team_pipelines[source].append(pipeline)\n    # Keep the k pipelines with the highest normalized score for each team.\n    for team in team_pipelines.keys():\n        team_pipelines[team] = sorted(team_pipelines[team], key=lambda x: x['scores'][0]['normalized'], reverse=True)\n        team_pipelines[team] = team_pipelines[team][:k]\n    # Flatten the per-team lists back into a single list.\n    new_pipelines = []\n    for team in team_pipelines.keys():\n        new_pipelines.extend(team_pipelines[team])\n    return new_pipelines\n\ndef sort_pipeline_scores(pipelines):\n    # Sort by the first score's raw value, highest first.\n    return sorted(pipelines, key=lambda x: x['scores'][0]['value'], reverse=True)\n\n# Group all loaded pipelines by problem id.\npipelines_problem = {}\nfor pipeline in pipelines:\n    problem_id = pipeline['problem']['id']\n    if problem_id not in pipelines_problem:\n        pipelines_problem[problem_id] = []\n    pipelines_problem[problem_id].append(pipeline)\nfor problem in pipelines_problem.keys():\n    pipelines_problem[problem] = sort_pipeline_scores(get_top_k_pipelines_team(pipelines_problem[problem], 
k=100))\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FVIDA-NYU%2FPipelineVis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FVIDA-NYU%2FPipelineVis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FVIDA-NYU%2FPipelineVis/lists"}