https://github.com/biogenies/ampbenchmark

Last synced: 11 months ago
JSON representation

Host: GitHub
URL: https://github.com/biogenies/ampbenchmark
Owner: BioGenies
Created: 2022-04-21T19:27:04.000Z (about 4 years ago)
Default Branch: main
Last Pushed: 2024-07-29T10:04:47.000Z (almost 2 years ago)
Last Synced: 2025-04-03T06:41:59.589Z (about 1 year ago)
Language: R
Homepage: http://biogenies.info/AMPBenchmark/
Size: 2.9 MB
Stars: 9
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.Rmd

Awesome Lists containing this project

README

          ---

output: github_document

---      

# AMPBenchmark

AMPBenchmark is a part of our initative for the improvement of benchmarking standards in the field of antimicrobial peptide (AMP) prediction.

## How to use the public data?

1. Download the benchmark sequence data: 

    - [Dropbox link](https://www.dropbox.com/scl/fi/6hxboi6xy1jm1q1ie6vyg/AMPBenchmark_public.fasta?rlkey=3egb368kyh347fdfamcfd75m0&st=ld02vyiv&dl=0).

    - [GitHub link](https://raw.githubusercontent.com/BioGenies/AMPBenchmark/main/data/AMPBenchmark_public.fasta?token=GHSAT0AAAAAABS4SIUMO3EI6JSQJJ2OC62WYUT5E6A).

2. Download the training sequence data for all methods and replications:

    - [Dropbox link](https://www.dropbox.com/scl/fo/f8kdfgoa8htsvpc79v0u2/ANOcYXz3fSRyE5kEumDDsVs?rlkey=a0su8jyn5nsjnzs2gkqya5n24&st=xd69dycx&dl=0).

3. Train your model using each of the training data set (class of a sequence is denoted by AMP=1 for AMPs and AMP=0 for negative samples, see [Sequence data](https://github.com/BioGenies/AMPBenchmark#sequence-data) section for details.)

4. Benchmark trained models against our data. Make sure to use a subset of sequences for appropriate replication (replication number is denoted by, e.g. rep=1, see [Sequence data](https://github.com/BioGenies/AMPBenchmark#sequence-data) section for details.)

5. Submit the results in the format described below to the [AMPBenchmark web server](http://biogenies.info/AMPBenchmark/).

### Data submission format

| ID                      | training_sampling |AMP_probability |

|-------------------------|-------------------|----------------|

| DBAASP_10018_AMP=1_rep1 | dbAMP             |0.97            |

| DBAASP_3217_AMP=1_rep1  | dbAMP             |0.61            |

| ...                     | ...               |...             |

 - **ID**: must contain the sequence ID, as provided in the FASTA headers of the input sequences. 

 - **training_sampling**: has to contain the type of negative sampling method used to train the model. Possible values are: *AMAP*, *AmpGram*, *ampir-mature*, *AMPlify*, *AMPScannerV2*, *CS-AMPPred*, *dbAMP*, *Gabere&Noble*, *iAMP-2L*, *Wang-et-al*, *Witten&Witten*. Remember that a proper benchmark requires you to train your model using every provided sampling method and evaluate it using all sampling methods using appropriate replication.

 - **AMP_probability**: has to be in the range between 0 and 1.

 

Example data for a random classifier can be downloaded from [Dropbox](https://www.dropbox.com/scl/fi/xqeqdsygkxjg5qt2b7ezg/sample_data.csv?rlkey=ql7gtoumuecwbg5tr0frl81bb&st=w7pdevvn&dl=0).

### Sequence data

The input data is hosted on [Dropbox](https://www.dropbox.com/scl/fi/6hxboi6xy1jm1q1ie6vyg/AMPBenchmark_public.fasta?rlkey=3egb368kyh347fdfamcfd75m0&st=wj8wc93f&dl=0) and [GitHub](https://raw.githubusercontent.com/BioGenies/AMPBenchmark/main/data/AMPBenchmark_public.fasta?token=GHSAT0AAAAAABS4SIUMO3EI6JSQJJ2OC62WYUT5E6A). Note that this single file contains data for all replications which should be used separately with appropriate replications of training sets. 

The training data sets are hosted on [Dropbox](https://www.dropbox.com/scl/fo/f8kdfgoa8htsvpc79v0u2/ANOcYXz3fSRyE5kEumDDsVs?rlkey=a0su8jyn5nsjnzs2gkqya5n24&st=vpcy0lyc&dl=0) and follow the same naming convention.  

There are two types of the input sequences:

 - positive sequence (e.g., **DBAASP_10718**\_*AMP=1*\_rep1): **IDinDBAASP**\_*class*\_replicateID.

 - negative sequences (e.g., **Seq1896_sampling\_method=Gabere&Noble**\_*AMP=0*\_rep4): **IDandSamplingMethod**\_*class*\_replicateID.

 

AMP sequences are derived from the [DBAASP database](https://dbaasp.org/).

md5 sum of the **AMPBenchmark_public.fasta**: 58f1424c057aaeb64bc632cad6038cad.

```{r echo = FALSE, results = 'asis'}

source("https://raw.githubusercontent.com/BioGenies/NegativeDatasets/main/docs/rmd_scripts.R")

cat(negative_sampling_citation())

```

```{r echo = FALSE, results = 'asis'}

cat(negative_sampling_links())

```

  

```{r echo = FALSE, results = 'asis'}

cat(negative_sampling_contact())

```

## Changelog

 - 2024/07/29: updated dropbox links.

 - 2023/01/11: fixed data processing.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/biogenies/ampbenchmark

Awesome Lists containing this project

README