{"id":19749464,"url":"https://github.com/mskcc/penguin","last_synced_at":"2026-02-08T09:32:05.976Z","repository":{"id":244461466,"uuid":"423628515","full_name":"mskcc/Penguin","owner":"mskcc","description":"ecDNA detection pipeline for MSK Sequencing Data","archived":false,"fork":false,"pushed_at":"2025-11-24T21:59:39.000Z","size":55224,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-11-28T09:14:41.716Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mskcc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-11-01T22:01:12.000Z","updated_at":"2025-11-24T21:59:46.000Z","dependencies_parsed_at":"2024-06-14T21:46:25.879Z","dependency_job_id":"afd95327-9b8d-4d12-b97f-b1aada08caf2","html_url":"https://github.com/mskcc/Penguin","commit_stats":null,"previous_names":["mskcc/ecdna-echo","mskcc/penguin"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mskcc/Penguin","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mskcc%2FPenguin","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mskcc%2FPenguin/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mskcc%2FPenguin/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mskcc%2FPenguin/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mskcc","download_url":"https://cod
eload.github.com/mskcc/Penguin/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mskcc%2FPenguin/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29226470,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-08T09:15:18.648Z","status":"ssl_error","status_checked_at":"2026-02-08T09:14:33.745Z","response_time":57,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-12T02:26:36.279Z","updated_at":"2026-02-08T09:32:05.970Z","avatar_url":"https://github.com/mskcc.png","language":"Jupyter Notebook","readme":"# PeNGUIN\nPredicting ecDNA Novelties in Genes Using IMPACT NGS Data\n\nA Pipeline to Analyze ecDNA in collaboration with BoundlessBio\n\n### Workflow Overview\n\nBelow is a high-level workflow diagram summarizing the steps in the PeNGUIN ecDNA pipeline:\n\n![PeNGUIN Workflow](Penguin_Workflow.png)\n\n### Project Locations on Juno\n\nFor users running this pipeline on Juno, the main directories are:\n\n- **Pipeline Directory (Penguin code and workflow):**  \n  `/juno/cmo/bergerlab/sumans/Project_ecDNA/Production/penguin`\n\n- **Project Resources and Legacy Code:**  \n  `/juno/cmo/bergerlab/sumans/Project_ecDNA/Production`\n\nThese locations contain the full workflow, reference files, utility scripts, and older versions of 
the pipeline.\n\n### Dependencies\n\nThe environment YAML file for the scripts may be found in ```envs/echo.yml```.\nThe environment YAML file for the analysis notebooks may be found in ```envs/ecDNA_analysis.yml```.\n\nYou can install all the dependencies for the scripts with:\n\n```\nconda env create --name ecDNA --file=envs/echo.yml\nconda activate ecDNA\npip install git+https://github.com/mskcc/facetsAPI#facetsAPI\n```\n\nYou can install all the dependencies for analysis with:\n\n```\nconda env create --name ecDNA_analysis --file=envs/ecDNA_analysis.yml\nconda activate ecDNA_analysis\n```\n\nNote: You may need to request permission for facetsAPI access. Please visit https://github.com/mskcc/facetsAPI and contact Adam Price if you need access.\n\n### Step 0: Prepare Inputs and Configure the Project\n\nFor this step, you only need two things:\n\n• A list of DMP sample IDs (one ID per line in a text file)  \n• A config file\n\nUse the global config file already provided in the parent directory:\n\n`penguin/global_config_bash.rc`\n\nOpen this file and edit only one field:\n\n- `projectName` – set this to whatever name you want for the run.\n\nOnce you set the projectName, all downstream outputs will automatically be created inside:\n\n`penguin/data/projects/[projectName]`\n\nNo other changes are required in the config file unless you want to customize paths later.\n\n### Step 1: Run the Parallelized ECHO Caller\n\n```\ncd scripts\nsh generateecDNAResults.sh $config_file $list_of_samples\n```\n\n### Step 2: Merge ECHO Results\n\n```\nsh merge_echo_results.sh $config_file\n```\n\n### Step 3: Run the Parallelized FACETS Caller\n\n```\nsh submit_facets_on_cluster.sh $config_file\n```\n\n### Step 4: Merge FACETS Results\n\n```\nsh merge_facets_results.sh ../global_config_bash.rc\n```\n\n### Step 5: Generate Final Report\n\n```\nsh generate_final_report.sh ../global_config_bash.rc\n```\n\n### Results\n\nThe final results for your run will be created automatically 
inside:\n\n`penguin/data/projects/[projectName]`\n\nIf you want to directly review the final merged reports, you can find them here:\n\n`penguin/data/projects/[projectName]/output/merged`\n\nAdditional useful folders include:\n\n- **Logs:**  \n  `penguin/data/projects/[projectName]/log`\n\n- **Flags:**  \n  `penguin/data/projects/[projectName]/flag`\n\n- **Manifest and stats:**  \n  `penguin/data/projects/[projectName]/manifest`\n\nEach run will populate these directories based on the projectName you set in the config file.\n\n### Visualization Notebooks\n\nThis pipeline offers several visualization notebooks in ```notebooks``` to jumpstart analysis.\n\n```echo_visualize.ipynb``` is for general visualizations: ecDNA prevalence across cancer types, genes that are commonly ecDNA-positive, and the effect of ecDNA on clinical factors.\n\n```diagnosis_km_curves.ipynb``` is for creating Kaplan-Meier (KM) curves using CDSI data: plot curves for each cancer type and analyze Cox models.\n\n```case_study.ipynb``` is for analyzing a single gene in a single cancer type: plot copy number and segment length, fit Cox models and KM curves for the specific gene, and analyze patient timelines.\n\n```treatment.ipynb``` is for analyzing a treatment in the context of a specific gene's amplification and ecDNA positivity: plot PFS and OS KM curves and analyze Cox models.\n\nEach notebook has a settings section that the user should edit before each run.\n\nTo run the notebooks on Juno, first switch to the analysis environment listed in Dependencies. Run ```jupyter lab``` in the ```notebooks``` folder. You should get a link like ```http://localhost:[NUM]/lab?token=[TOKEN]```; then, in a separate terminal, run ```ssh -N -L [NUM]:localhost:[NUM] [user]@terra```. 
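As a concrete sketch, the two halves of that setup look like the following (the port ```8888``` and user name ```jdoe``` are placeholders for whatever ```[NUM]``` and ```[user]``` are in your session):\n\n```\n# On Juno, inside the notebooks folder, with the ecDNA_analysis environment active:\njupyter lab\n\n# On your local machine, in a separate terminal, forward the same port:\nssh -N -L 8888:localhost:8888 jdoe@terra\n```\n\n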
Copy the link into a browser, and edit the settings in each notebook before running.\n\n### Helpful Links\n\n[cBioPortal API Information](https://docs.cbioportal.org/web-api-and-clients/)\n\n[About Data Access Tokens](https://docs.cbioportal.org/deployment/authorization-and-authentication/authenticating-users-via-tokens/)\n\n[FACETS API](https://github.com/mskcc/facetsAPI)\n\n[About Boundless Bio](https://boundlessbio.com/what-we-do/)\n\n### Troubleshooting\n\n- You can find log files in the log directory, by default ```[dataDir]/log/log_[projectName]```. In the main directory, ```call_submit_on_cluster...``` has information on the call to submit each ECHO job, and the ```echoCalls``` folder contains log files for each ECHO call. ```facets_multiple_call...``` has information on the call to submit each FACETS job, and the ```facetsCalls``` folder contains log files for each gene-level FACETS call. Each file name ends in a date timestamp to allow for troubleshooting across multiple runs.\n\n- To pull \u0026 build the Singularity images on HPC:\n    ```\n    export singularity_cache=$HOME/.singularity/cache\n\n    echo $singularity_cache\n\n    singularity build --docker-login ${singularity_cache}/boundlessbio-echo-preprocessor-release-v2.3.1.img docker://boundlessbio/echo-prep\n\n    singularity build --docker-login ${singularity_cache}/boundlessbio-ecs-v2.0.0.img docker://boundlessbio/ecs:release-v2.0.0\n    ```\n\n- To remove the ```chr``` prefix from one of the reference files:\n\n    ```\n    sed 's/^chr//' hg19-blacklist.v2.bed \u003e hg19-blacklist.v2_withoutPrefix.bed\n    ```","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmskcc%2Fpenguin","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmskcc%2Fpenguin","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmskcc%2Fpenguin/lists"}