https://github.com/soedinglab/binning_benchmarking


Host: GitHub
URL: https://github.com/soedinglab/binning_benchmarking
Owner: soedinglab
Created: 2024-08-07T14:38:14.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2025-07-31T16:08:35.000Z (10 months ago)
Last Synced: 2025-07-31T19:44:04.980Z (10 months ago)
Language: Jupyter Notebook
Size: 30 MB
Stars: 4
Watchers: 4
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


# binning_benchmarking

Binning benchmarking involves the following steps:

coassembly: `read correction -> assembly -> mapping -> generate_abundance_matrix -> binning -> assessment`

single-sample: `read correction -> single-sample_assembly -> single-sampleread_mapping -> generate_abundance_matrix -> binning -> assessment`

multi-sample: `read correction -> sample-wise_assembly -> pool_allsampleassembly_contigs -> mapping -> generate_abundance_matrix -> binning -> split_bins -> remove_redundantbins -> assessment`

## Read correction

Read correction is done with CoCo (https://github.com/soedinglab/CoCo). The correction is effective when k-mer counts are computed from reads pooled across all samples.

**Concatenate reads for CoCo correction**

`cat *_reads.fq > all_reads.fq`

**Compute k-mer counts using dsk tool** (https://github.com/GATB/dsk)

`dsk -file all_reads.fq -kmer-size 41` (output: all_reads.h5)

**Correct reads using CoCo**

`coco correction --reads all_reads.fq --counts all_reads.h5 --outdir allreads_coco_corrected` (output: all_reads.corr.reads.fq)

**Split reads by sample origin**

`splitreadsbysample ` (output: .fastq)

## Assembly
MEGAHIT must be installed (https://github.com/voutcn/megahit.git)

### pooled assembly
`megahit --12 *_reads.fastq -t 64 --presets meta-sensitive -o megahit_out`

### sample-wise assembly
`megahit --12 _reads.fastq -t 64 --presets meta-sensitive -o _megahit_out`

`cat *_megahit_out/final.contigs.fa > allcontigs_concatenatedallsamples.fa` (concatenates all sample assemblies into one master file; make sure that the headers of the sample-wise assemblies are prefixed with the sample id, separated by 'C')
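The header prefixing mentioned above can be sketched as follows. This is only an illustration, not the repository's own renaming step: it assumes a `S<sample_id>C` prefix (inferred from the `S1C141_284` example in the split-bins section) placed in front of the unchanged original contig id. The function name `prefix_fasta_headers` and the example records are hypothetical.

```python
def prefix_fasta_headers(lines, sample_id):
    """Yield FASTA lines with 'S<sample_id>C' prepended to each contig id.

    Assumption: the prefix is simply placed in front of the original
    header text; the repository may format the contig part differently.
    """
    for line in lines:
        if line.startswith(">"):
            # Keep the original id and description; only add the prefix.
            yield f">S{sample_id}C{line[1:]}"
        else:
            yield line


# Hypothetical MEGAHIT-style records for sample 1
records = [">k141_284 flag=1 multi=2.0000 len=312\n", "ACGTACGT\n"]
renamed = list(prefix_fasta_headers(records, sample_id=1))
```

Applying this to each sample-wise `final.contigs.fa` before concatenation keeps contig ids unique across samples and makes the sample of origin recoverable from the id alone.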

## Mapping
Strobealign is a fast and accurate aligner. We used it to obtain the abundance matrix. (https://github.com/ksahlin/strobealign.git)

`mkdir samfiles`

### pooled assembly

`strobealign -t 64 --aemb megahit_out/final.contigs.fa --eqx --interleaved .fastq > samfiles/abundances_.tsv`

`strobealign -t 64 megahit_out/final.contigs.fa --eqx --interleaved .fastq | samtools view -h -o samfiles/_strobealign.sam`

### concatenated sample-wise assembly
`strobealign -t 64 --aemb allcontigs_concatenatedallsamples.fa --eqx --interleaved ${sample_id}.fastq > samfiles/abundances_.tsv`

`strobealign -t 64 allcontigs_concatenatedallsamples.fa --eqx --interleaved ${sample_id}.fastq |samtools view -h -o samfiles/_strobealign.sam`

If you have used another aligner (e.g. bowtie2, bwa-mem), use our in-house `aligner2counts` script:

`samtools view samfiles/_strobealign.sam | aligner2counts samfiles --only-mapids`

## Generate abundance matrix
`python util/get_abundance_tsv.py -i -l -m `

`contig_length` is a tab-separated `.txt` file that contains contig ids and lengths (`contig_id\tlength`). This file can be generated using the `convertfasta_multi2single` executable (see README.md in `util/`).

`inputdir` is the directory that contains the sample-wise abundance files (`abundances_.tsv`).
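As a rough picture of what combining sample-wise abundance files into one matrix involves, here is a minimal sketch. It is not the implementation of `util/get_abundance_tsv.py`. It assumes each per-sample file has two tab-separated columns (contig id, abundance) and no header line, and that contigs missing from a file have zero abundance. The names `read_abundance` and `build_matrix` are hypothetical.

```python
import csv
import io


def read_abundance(handle):
    """Parse a two-column TSV (contig_id, abundance) into a dict."""
    return {
        row[0]: float(row[1])
        for row in csv.reader(handle, delimiter="\t")
        if row
    }


def build_matrix(contig_order, per_sample):
    """Return (sample names, rows), one row per contig in contig_order."""
    samples = sorted(per_sample)
    rows = []
    for contig in contig_order:
        # Assumption: contigs absent from a sample's file get zero abundance.
        rows.append([contig] + [per_sample[s].get(contig, 0.0) for s in samples])
    return samples, rows


# Hypothetical inputs standing in for two per-sample abundance files
s1 = read_abundance(io.StringIO("c1\t2.5\nc2\t0.0\n"))
s2 = read_abundance(io.StringIO("c1\t1.0\nc2\t4.0\n"))
samples, matrix = build_matrix(["c1", "c2"], {"s1": s1, "s2": s2})
```

Passing the contig order explicitly (for example, taken from the contig length file) keeps the rows of the matrix aligned with the assembly.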

### Sort alignment files
`samtools sort samfiles/_strobealign.sam -o samfiles/_strobealign_sorted.bam`

## Binning
Refer to benchmarking_scripts.ipynb. Ensure the order of contigs in the abundance matrix matches the assembly FASTA file.
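A quick way to check the contig order before binning is sketched below. This is an illustrative helper rather than part of the repository. It assumes the contig id is the first whitespace-delimited token of each FASTA header; `fasta_ids` and `first_mismatch` are hypothetical names.

```python
def fasta_ids(lines):
    """Return contig ids (first token of each header) in file order."""
    return [line[1:].split()[0] for line in lines if line.startswith(">")]


def first_mismatch(fasta_order, matrix_order):
    """Return the index of the first differing id, or None if the orders match."""
    for i, (a, b) in enumerate(zip(fasta_order, matrix_order)):
        if a != b:
            return i
    if len(fasta_order) != len(matrix_order):
        # One list is a strict prefix of the other.
        return min(len(fasta_order), len(matrix_order))
    return None


# Hypothetical assembly records and abundance-matrix row ids
fasta = [">c1 len=300\n", "ACGT\n", ">c2 len=500\n", "GGCC\n"]
ok = first_mismatch(fasta_ids(fasta), ["c1", "c2"])
swapped = first_mismatch(fasta_ids(fasta), ["c2", "c1"])
```

A return value of `None` means the two orders agree; any integer points at the first row that needs attention.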

## Split bins (multi-sample binning)
By default, most deep learning methods (McDevol, VAMB and GenomeFace) can split bins by sample id in multi-sample binning mode, but tools such as COMEBin and MetaBAT2 have no option for it. To perform the splitting, use our script in `util/`.

`python splitfasta_bysampleids.py --input_dir --output_dir --format `

This script assumes that sample id is located in-between `S` and `C` characters. For example, from a contig id `S1C141_284`, it will detect `1` as the sample id.
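The id convention described above can be illustrated with a short sketch. It is not the code of `splitfasta_bysampleids.py`. It assumes the contig id starts with `S` and that the sample id runs up to the first `C` that follows; `sample_id_of` and `group_by_sample` are hypothetical names.

```python
import re
from collections import defaultdict

# Assumption: id starts with 'S'; the sample id ends at the first 'C'.
SAMPLE_ID = re.compile(r"^S(.+?)C")


def sample_id_of(contig_id):
    """Extract the sample id located between the leading 'S' and the next 'C'."""
    match = SAMPLE_ID.match(contig_id)
    if match is None:
        raise ValueError(f"no sample id found in {contig_id!r}")
    return match.group(1)


def group_by_sample(contig_ids):
    """Group the contig ids of one bin by their sample of origin."""
    groups = defaultdict(list)
    for contig_id in contig_ids:
        groups[sample_id_of(contig_id)].append(contig_id)
    return dict(groups)


# Hypothetical bin containing contigs from two samples
split = group_by_sample(["S1C141_284", "S2C141_9", "S1C141_77"])
```

Because the match stops at the first `C`, a sample id that itself contains a `C` would be truncated under this assumption.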

## Remove redundancy (multi-sample binning)
For this benchmarking, we mapped bins to source genomes for AMBER assessment as described in README.md in `util/`. Alternatively, redundancy can be removed with a de-replication tool such as `dRep` (https://github.com/MrOlm/drep). We leave the choice to the user.

## Assessment
### CheckM2
CheckM2 is a neural network-based method that estimates bin completeness and contamination. (https://github.com/chklovski/CheckM2.git)

`checkm2 predict --input _results -o _results/checkm2_results --thread 24 -x fasta`

### AMBER
For the binning of contigs from gold-standard sets, we used AMBER assessment. (https://github.com/CAMI-challenge/AMBER.git)

`amber.py _cluster.tsv -g gsa_pooled_mapping_short.binning -o amber_results`

where the `gsa_pooled_mapping_short.binning` files for the marine, strain-madness and plant-associated datasets were obtained from the CAMI2 assessment study.

### CheckM
CheckM is used to validate MetaBAT2 and MetaWRAP bin_refinement results. (https://github.com/Ecogenomics/CheckM.git)

`checkm lineage_wf _results _results/checkm_results -x fasta -t 24`

## Reassembly (post-binning refinement)

**Extract reads for each bin**

For combined read fastq and mapfiles

`extractreads -f (binformat|fasta)`

For sample-wise processing

    for sample in samplelist;
    do
        extractreads fullpath/binfastafolder ${sample}_mapids ${sample}.fastq -f (binformat|fasta);
    done

`allsample_mapids` is a text file containing the mapped read_id and contig_id separated by a tab. This file can be generated for each sample as `_mapids` using the `aligner2counts` executable (see README.md in `util/`). Concatenate these sample-wise mapids files to obtain `allsample_mapids`. The `extractreads` run produces `.fastq`. The extracted `.fastq` file will contain reads from all samples.
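To make the role of the mapids file concrete, the sketch below shows how read ids could be assigned to bins from such a file. It is a conceptual illustration, not the logic of the `extractreads` executable. It assumes one `read_id<TAB>contig_id` pair per line and a contig-to-bin mapping supplied by the caller; `reads_per_bin` is a hypothetical name.

```python
from collections import defaultdict


def reads_per_bin(mapids_lines, contig_to_bin):
    """Collect the set of read ids mapped to the contigs of each bin."""
    bins = defaultdict(set)
    for line in mapids_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        read_id, contig_id = line.split("\t")
        bin_id = contig_to_bin.get(contig_id)
        # Reads mapped to unbinned contigs are skipped.
        if bin_id is not None:
            bins[bin_id].add(read_id)
    return dict(bins)


# Hypothetical mapids content and bin assignment
mapids = ["r1\tS1C141_284\n", "r2\tS2C141_9\n", "r3\tS1C141_5\n"]
assignment = {"S1C141_284": "bin0", "S2C141_9": "bin1"}
selected = reads_per_bin(mapids, assignment)
```

The resulting read id sets are what would be pulled from the FASTQ files to build the per-bin read file used for reassembly.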

`spades.py --12 .fastq --trusted-contigs .fasta --only-assembler --careful -o _assembly/ -t 12 -m 128`

`SPAdes` must be installed.

Refer to the `workflow_reassemble` workflow to run all of these steps in a single run.

## Plotting
Refer to plots.ipynb
