# C-VIEW: COVID-19 VIral Epidemiology Workflow

This software implements a high-throughput data processing pipeline to identify and characterize SARS-CoV-2 variant sequences in specimens from COVID-19-positive hosts or environments. It is based on https://github.com/niemasd/SD-COVID-Sequencing and built for use with Amazon Web Services (AWS) EC2 machine instances and S3 data storage.

# Table of Contents
1. [Installing the Pipeline](#installing-the-pipeline)
2. [Creating a Cluster](#creating-a-cluster)
3. [Running the Pipeline](#running-the-pipeline)


## Installing the Pipeline

**Note: Usually it will NOT be necessary to install the pipeline from scratch.** The most current version of the pipeline is pre-installed on the so-labeled Amazon Web Services snapshot in region us-west-2 (Oregon), and this snapshot can be used directly to [create a cluster](#creating-a-cluster).

If a fresh installation *is* required, take the following steps:

1. On AWS, launch a new Ubuntu 20.04 instance
   1. Note that it MUST be version 20.04, not the latest available Ubuntu version (e.g.
22.04), because 20.04 is the latest version supported by AWS ParallelCluster
   2. Select type t2.medium
   3. Add a 35 GB root drive and a 300 GB EBS drive
   4. Set the security group to allow SSH via TCP on port 22 and all traffic via all protocols on all ports
2. `ssh` onto the new instance to set up the file system and mount
   1. Run `lsblk` to find the name of the 300 GB EBS drive. For the remainder of this section, assume `lsblk` shows that the name of the 300 GB volume is `xvdb`.
   2. Run `sudo mkfs -t xfs /dev/xvdb` to make a filesystem on the new drive
   3. Run `sudo mkdir /shared` to create a location for the installation
   4. Run `sudo mount /dev/xvdb /shared` to mount the 300 GB volume to the new location
   5. Run ``sudo chown `whoami` /shared`` to grant the current user permissions to the new location
3. Install anaconda and python
   1. Run `cd /shared`
   2. Run `wget https://repo.anaconda.com/archive/Anaconda3-2020.11-Linux-x86_64.sh`
   3. Run `bash Anaconda3-2020.11-Linux-x86_64.sh`
   4. Answer `yes` when asked to accept the license agreement
   5. Enter `/shared/workspace/software/anaconda3` when asked for the install location
   6. Answer `yes` when asked whether to have the installer run conda init
   7. Log out of the `ssh` session and then back in to allow the conda install to take effect
4. Install C-VIEW
   1. Run `cd /shared`
   2. Download `install.sh`
   3. Run `bash install.sh`
   4. Answer yes whenever asked for permission to proceed
5. On AWS, make a snapshot of the newly installed 300 GB volume

Note that the pipeline uses the following external software programs, which are installed via the `install.sh` script:

* [Minimap2 2.17-r941](https://github.com/lh3/minimap2/releases/tag/v2.17)
* [samtools 1.11](https://github.com/samtools/samtools/releases/tag/1.11)
* [Qualimap 2.2.2-dev](https://bitbucket.org/kokonech/qualimap/src/master/)
* [ivar 1.3.1](https://github.com/andersen-lab/ivar/releases/tag/v1.3.1)
* [Pangolin (variable version)](https://github.com/cov-lineages/pangolin)
* [ViralMSA 1.1.11](https://github.com/niemasd/ViralMSA/releases/tag/1.1.11)
* [q30 dev](https://github.com/artnasamran/q30)
* [samhead 1.0.0](https://github.com/niemasd/samhead/releases/tag/1.0.0)
* [pi_from_pileup 1.0.3](https://github.com/Niema-Docker/pi_from_pileup/releases/tag/1.0.3)
* git 2.7.4 or higher


## Creating a Cluster

The pipeline is designed to run on a version 3 or later AWS ParallelCluster. Begin by ensuring that ParallelCluster is installed on your local machine; if it is not, take these steps:

1. Set up a `conda` environment and install ParallelCluster
   1. Run `conda create --name parallelcluster3 python=3`
   2. Run `conda activate parallelcluster3`
   3. Run `python3 -m pip install --upgrade aws-parallelcluster`
2. In the `parallelcluster3` environment, install Node Version Manager and Node.js, which are (apparently) required by AWS Cloud Development Kit (CDK)
   1. Run `curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.38.0/install.sh | bash`
   2. Run `chmod ug+x ~/.nvm/nvm.sh`
   3. Run `source ~/.nvm/nvm.sh`
   4. Run `nvm install --lts`
   5. Check your install by running `node --version` and `pcluster version`

Next, ensure you have a pem file registered with AWS and that you have run `aws configure` locally to set up AWS command line access from your local machine.
Then prepare a cluster configuration yaml file using the below template:

```
Region: us-west-2
Image:
  Os: ubuntu2004
SharedStorage:
  - MountDir: /shared
    Name: custom
    StorageType: Ebs
    EbsSettings:
      Size: 300
      SnapshotId: <snapshot of current cview release, e.g. snap-09264bf02660b54ad>
HeadNode:
  InstanceType: t2.medium
  Networking:
    SubnetId: subnet-06ff527fa2d5827a3
# subnet-06ff527fa2d5827a3 is parallelcluster:public-subnet
  Ssh:
    KeyName: <name of your pem file without extension, e.g. my_key for a file named my_key.pem>
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: default-queue
      ComputeSettings:
        LocalStorage:
          RootVolume:
            Size: 500
      Networking:
        SubnetIds:
          - subnet-06ff527fa2d5827a3
# subnet-06ff527fa2d5827a3 is parallelcluster:public-subnet
      ComputeResources:
        - Name: default-resource
          MaxCount: 15
          InstanceType: r5d.24xlarge
```

To create a new cluster from the command line, run

```
pcluster create-cluster \
    --cluster-name <your-cluster-name> \
    --cluster-configuration <your-config-file-name>.yaml
```

(If you experience an error referencing Node.js, you may need to once again run `source ~/.nvm/nvm.sh` to ensure it is accessible from your shell.) The cluster creation progress can be monitored from the `CloudFormation`->`Stacks` section of the AWS Console.

Once the cluster is successfully created, log in to the head node.
To avoid having to use its public IPv4 DNS, one can run

`pcluster ssh --cluster-name <your_cluster_name> -i /path/to/keyfile.pem`

which fills in the cluster IP address and username automatically.

From the head node, run `aws configure` to set up the head node with credentials for accessing the necessary AWS S3 resources.


## Running the Pipeline

The pipeline is initiated on the head node of the cluster by calling the `run_cview.sh` script with an input csv file provided by the user, e.g.:

`bash /shared/workspace/software/cview/pipeline/run_cview.sh /shared/runfiles/cview_test_run.csv`

This file should have a header line in the following format:

`function,organization,seq_run,merge_lanes,primer_set,fq,read_cap,sample,timestamp,istest`

It must then contain one or more data lines, each of which will trigger a run of the specified part(s) of the pipeline on a specified sequencing run dataset.

The fields are:

|Field Name|Allowed Values|Description|
|----------|--------------|-----------|
|`function`|cumulative_pipeline, pipeline, variants_qc, variants, sample, qc, lineages, phylogeny, cumulative_lineages, or cumulative_phylogeny|Specifies the type of functionality that should be run. See details below.|
|`organization`|ucsd or helix|Specifies the organization from which all the samples in the current sequencing run are assumed to originate. Helix sequencing runs can be combined only with data from other helix sequencing runs at the lineage and/or alignment-building steps.|
|`seq_run`|a string such as "210409_A00953_0272_AH57WJDRXY"|Specifies the sequencing center's identifier of the sequencing run to be processed, if relevant to the function provided.|
|`merge_lanes`|true or false|Indicates whether the pipeline should attempt to merge sample read data across fastq files from multiple lanes.|
|`primer_set`|artic or swift_v2|Specifies the primer set to use in trimming per-sample sorted bam files.|
|`fq`|se or pe|Indicates whether the pipeline should be run on only R1 reads in the sequencing run or on R1 and R2 reads.|
|`read_cap`|a positive integer or all|Specifies the maximum number of mapped reads per sample that should be used in the per-sample variant-calling and consensus-sequence-building functionality.|
|`sample`|a string such as "SEARCH-10003__D101802__I22__210608_A00953_0321_BH7L5LDSX2__S470_L002"|Specifies, for the sample to be processed, the part of the read one file name coming before `_R1_001.fastq.gz`.|
|`timestamp`|a string such as "2021-07-09_22-44-27"|Specifies the timestamp associated with the particular processing run that should be used.|
|`istest`|true or false|Indicates whether data should be pulled from and written to the test S3 bucket (if true) or the production S3 bucket (if false).|

The functions supported by the pipeline are:

| Function | Description |
|----------|-------------|
| `cumulative_pipeline` | This is the primary usage. Runs variant calling, consensus sequence generation, and QC for a specified sequencing run, followed by lineage calling and alignment building on the cumulative set of all QC-passing consensus sequences ever processed by the pipeline |
| `pipeline` | Runs all pipeline functionality (including lineage calling and alignment building) for a specified sequencing run |
| `variants_qc` | Runs variant calling, consensus sequence generation, and QC for a specified sequencing run |
| `variants` | Runs variant calling and consensus sequence generation on all samples in the specified sequencing run |
| `sample` | Runs variant calling and consensus sequence generation on the specified sample in the specified sequencing run for the specified timestamp |
| `qc` | Runs QC on all outputs from the specified sequencing run processed under the specified timestamp |
| `lineages` | Runs lineage calling on all QC-passing consensus sequences in the specified sequencing run |
| `phylogeny` | Runs both lineage calling and alignment building on all QC-passing consensus sequences in the specified sequencing run |
| `cumulative_lineages` | Runs lineage calling on the cumulative set of all QC-passing consensus sequences ever processed by the pipeline |
| `cumulative_phylogeny` | Runs both lineage calling and alignment building on the cumulative set of all QC-passing consensus sequences ever processed by the pipeline |

For all functions except `sample`, some of the input fields are ignored, as shown in the table below:

| function | organization | seq_run | merge_lanes | primer_set | fq | read_cap | sample | timestamp | istest |
|----------|--------------|---------|-------------|------------|----|----------|--------|-----------|--------|
| cumulative_pipeline | ucsd or helix | e.g. 210409_A00953_0272_AH57WJDRXY | true or false | artic or swift_v2 | se or pe | all or positive integer | ignored | ignored | true or false |
| pipeline | ucsd or helix | e.g. 210409_A00953_0272_AH57WJDRXY | true or false | artic or swift_v2 | se or pe | all or positive integer | ignored | ignored | true or false |
| variants_qc | ucsd or helix | e.g. 210409_A00953_0272_AH57WJDRXY | true or false | artic or swift_v2 | se or pe | all or positive integer | ignored | ignored | true or false |
| variants | ucsd or helix | e.g. 210409_A00953_0272_AH57WJDRXY | true or false | artic or swift_v2 | se or pe | all or positive integer | ignored | ignored | true or false |
| sample | ucsd or helix | e.g. 210409_A00953_0272_AH57WJDRXY | true or false | artic or swift_v2 | se or pe | all or positive integer | e.g. SEARCH-17043__D101859__L01__210409_A00953_0272_AH57WJDRXY__S82_L001 | e.g. 2021-04-15_16-13-59 | true or false |
| qc | ucsd or helix | e.g. 210409_A00953_0272_AH57WJDRXY | ignored | ignored | se or pe | ignored | ignored | e.g. 2021-04-15_16-13-59 | true or false |
| lineages | ucsd or helix | e.g. 210409_A00953_0272_AH57WJDRXY | ignored | ignored | ignored | ignored | ignored | ignored | true or false |
| phylogeny | ucsd or helix | e.g. 210409_A00953_0272_AH57WJDRXY | ignored | ignored | ignored | ignored | ignored | ignored | true or false |
| cumulative_lineages | ucsd or helix | ignored | ignored | ignored | ignored | ignored | ignored | ignored | true or false |
| cumulative_phylogeny | ucsd or helix | ignored | ignored | ignored | ignored | ignored | ignored | ignored | true or false |

An example input file might look like:

```
function,organization,seq_run,merge_lanes,primer_set,fq,read_cap,sample,timestamp,istest
cumulative_pipeline,ucsd,210608_A00953_0321_BH7L5LDSX2,false,swift_v2,pe,2000000,NA,NA,false
```
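Because small formatting mistakes in the input csv can waste a cluster run, it can help to build the file programmatically. The sketch below is illustrative only (the helpers `sample_from_fastq` and `write_run_csv` are not part of C-VIEW); it writes a run file with the required header, checks the `function` value against the allowed list, and derives the `sample` field from an R1 fastq file name as described in the field table:

```python
import csv
import io

# Header required by run_cview.sh, per the field table above.
HEADER = ["function", "organization", "seq_run", "merge_lanes", "primer_set",
          "fq", "read_cap", "sample", "timestamp", "istest"]

# Allowed values for the `function` field, per the functions table above.
FUNCTIONS = {"cumulative_pipeline", "pipeline", "variants_qc", "variants",
             "sample", "qc", "lineages", "phylogeny",
             "cumulative_lineages", "cumulative_phylogeny"}

def sample_from_fastq(fastq_name: str) -> str:
    """Derive the `sample` field: the part of the R1 file name
    that comes before `_R1_001.fastq.gz`."""
    suffix = "_R1_001.fastq.gz"
    if not fastq_name.endswith(suffix):
        raise ValueError(f"not an R1 fastq name: {fastq_name}")
    return fastq_name[: -len(suffix)]

def write_run_csv(out, rows):
    """Write a C-VIEW input csv; each row is a dict keyed by HEADER names."""
    writer = csv.DictWriter(out, fieldnames=HEADER, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        if row["function"] not in FUNCTIONS:
            raise ValueError(f"unknown function: {row['function']}")
        writer.writerow(row)

# Reproduce the example input file from the section above.
buf = io.StringIO()
write_run_csv(buf, [{
    "function": "cumulative_pipeline", "organization": "ucsd",
    "seq_run": "210608_A00953_0321_BH7L5LDSX2", "merge_lanes": "false",
    "primer_set": "swift_v2", "fq": "pe", "read_cap": "2000000",
    "sample": "NA", "timestamp": "NA", "istest": "false",
}])
print(buf.getvalue())
```

Writing to a path such as `/shared/runfiles/cview_test_run.csv` (any path works) then gives a file that can be passed directly to `run_cview.sh`.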