https://github.com/LUMC/ribosome-profiling-analysis-framework

Ribosome profiling analysis framework as described in de Klerk E., Fokkema I.F.A.C et al (2015). Alignment, triplet periodicity analysis, detecting ORFs.

RIBOSOME PROFILING ANALYSIS FRAMEWORK; de Klerk E., Fokkema I.F.A.C et al (2015).
FOOTPRINT ALIGNMENT, TRIPLET PERIODICITY ANALYSIS, PEAK CALLING.
================================================================================

This work is licensed under the Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International License. To view a
copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/
or send a letter to:
Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

The latest version of the scripts and this README file are available at:
http://lumc.github.io/ribosome-profiling-analysis-framework/

================================================================================
WORKFLOW
================================================================================

This framework has been developed on Linux and tested on various Linux systems.
It may run on Windows, but this is not supported. Several provided examples for
running the scripts on a large number of files are written in Bash and will
therefore not run on Windows without additional software, like Cygwin.
This framework requires the installation of PHP-CLI (Command Line Interface),
available from http://php.net/.
This framework has been developed and tested using PHP 5.4 and up, but may work
with earlier versions.

OVERVIEW OF THE FRAMEWORK
--------------------------------------------------------------------------------
- Alignment of ribosome footprint reads
- Analysis of read length distribution of ribosome footprints
- Creating wiggle files with genomic coordinates from transcriptome alignment
- Merging wiggle files from transcriptome and genome alignments
- Retrieving coordinates relative to the annotated open reading frames (ORFs)
using Mutalyzer
- Analysis of triplet periodicity
- Merging biological replicates prior to peak calling (identification of
translation initiation sites (TISs))
- Peak calling: identification of primary ORFs, alternative ORFs and upstream
ORFs
- Merging result files
- Analysis of switches in TIS usage
- TIS location and motif analysis

SAMPLE NAMING
--------------------------------------------------------------------------------
Throughout the documentation of this framework, we provide example commands.
The examples provided here are from the study of de Klerk E., Fokkema I.F.A.C.
et al. (2015), in which two different treatments were performed (harringtonin
and cycloheximide) at two different timepoints, with three biological
replicates. Different treatments or timepoints are named A to Z, with replicates
named in numbers (1 to 9). We recommend using similar, or equal, sample naming.
This framework will read the sample names from the beginning of the file names,
if separated with an underscore or period.
Examples of possible FASTQ file names:
A1_pool7213_TATAT_L002_R1_001.fastq (sample name: A1)
D3.fastq (sample name: D3)
I2_ACACAC_L008_R1_001.fastq (sample name: I2)
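
As an illustration of the naming rule above, here is a minimal Python sketch
that takes the sample name from the start of a file name, up to the first
underscore or period. The function name sample_name is hypothetical and is not
part of the framework; the PHP scripts may implement this differently.

```python
import os
import re


def sample_name(path):
    """Return the leading part of a file name, up to the first '_' or '.'."""
    base = os.path.basename(path)
    return re.split(r"[_.]", base, maxsplit=1)[0]


if __name__ == "__main__":
    for name in (
        "A1_pool7213_TATAT_L002_R1_001.fastq",
        "D3.fastq",
        "I2_ACACAC_L008_R1_001.fastq",
    ):
        print(name, "->", sample_name(name))
```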

REFERENCE SEQUENCES
--------------------------------------------------------------------------------
This framework is based on the Mus musculus mm10 genome assembly but it can also
be adapted for the Homo sapiens hg19 and hg38 genome assemblies. Other
assemblies are currently not supported due to their absence in the Mutalyzer
tool, which is used to translate genomic coordinates into transcriptomic
coordinates and vice versa. See https://v2.mutalyzer.nl for more information.

The transcriptome reference sequence was downloaded from
ftp://ftp.ncbi.nlm.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.rna.fna.gz
Last modified date: 2013-05-08.

If mouse.rna.fna.gz is not available, you can download the separate files (named
mouse.1.rna.fna.gz, mouse.2.rna.fna.gz and so on) and combine them using bash
with:

for file in mouse.?.rna.fna.gz; do cat "$file" >> mouse.rna.fna.gz; done

The genome reference sequence was downloaded from
ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10/chromosomes/
Last modified date: 2012-02-09.

ALIGNMENT OF RIBOSOME FOOTPRINT READS
--------------------------------------------------------------------------------
To prevent loss of reads or misalignment of short reads (30 nt) adjacent to or
spanning an exon-exon junction, reads are first mapped to the transcriptome
reference, and unaligned reads are subsequently mapped to the genome reference.
For more information about the alignment method, please see de Klerk E., Fokkema
I.F.A.C. et al (2015).

1) The SAM file obtained from the transcriptome alignment reports positions of
mapped reads relative to the start of the transcript. To be able to merge the
results of the genome and the transcriptome alignment these positions need to be
converted to genomic positions. Transcriptome sequences do not always properly
align to the genome, for instance due to the presence of insertions, deletions,
chimeric alignments, or they align to random chromosomes. Because the
coordinates of the reads mapped to the transcriptome reference cannot be
correctly converted to genomic coordinates for transcripts that do not properly
align, first the transcriptome reference needs to be filtered for these
transcripts. For the remaining transcripts, the
mm10_transcript_positions_create.php script is used to create a file that
indicates the location of the exons of the transcripts on the genome.
For more information on this script and the alignment, please see the help file:
help/mm10_transcript_positions_create.txt.

Usage:
./mm10_transcript_positions_create.php TRANSCRIPTOME_ALIGNMENT_SAM_FILE

Produces:
- mm10_transcript_positions.txt
This file contains the genomic coordinates of all exons for transcripts that
align to the genome, excluding random chromosomes, without insertions or
deletions or chimeric alignment. These can therefore be used for the
transcriptome alignment.
- transcriptome_alignment_unsupported_transcripts.txt
Contains all unsupported transcripts and the reasons for rejection.
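
The coordinate conversion that the transcript positions file makes possible can
be sketched as follows. This is only an illustration under assumptions: the
exon layout (a list of 1-based, inclusive genomic start/end pairs in transcript
order) and the function name transcript_to_genome are hypothetical, and the
actual layout of mm10_transcript_positions.txt may differ.

```python
def transcript_to_genome(pos, exons, strand):
    """Map a 1-based transcript position to a 1-based genomic position.

    exons: list of (genomic_start, genomic_end) tuples, 1-based inclusive,
           listed in transcript order (so descending on the genome for a
           transcript on the '-' strand). This layout is an assumption.
    """
    remaining = pos
    for start, end in exons:
        length = end - start + 1
        if remaining <= length:
            if strand == "+":
                return start + remaining - 1
            # On the reverse strand the transcript runs from 'end' to 'start'.
            return end - remaining + 1
        remaining -= length
    raise ValueError("position lies beyond the last exon")


if __name__ == "__main__":
    forward_exons = [(100, 109), (200, 209)]
    print(transcript_to_genome(11, forward_exons, "+"))
```

Because the walk is exon by exon, a read that starts in one exon and ends in
the next is still placed correctly by its 5' end, which is the only coordinate
the later steps use.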

2) The unsupported transcripts are then removed from the transcriptome
reference, using the following bash code:

cut -f 1 transcriptome_alignment_unsupported_transcripts.txt | grep -v '^#' \
> transcriptome_alignment_unsupported_transcripts_list.txt
tr '\n' '+' < mouse.rna.fna | sed 's/+>/\n>/g' \
| grep -vFf transcriptome_alignment_unsupported_transcripts_list.txt \
| tr '+' '\n' > mouse.rna.fna.no_unsupported.fa

The resulting file (mouse.rna.fna.no_unsupported.fa) is then used to create a
Bowtie index file suitable for alignment using bowtie.
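
For readers who prefer not to use the tr/sed/grep pipeline, the same filtering
idea can be sketched in Python. This is an assumption-laden alternative, not
the method used by the authors: the function name filter_fasta is hypothetical,
and, like grep -F, it drops a record when its header line contains any of the
listed IDs as a substring.

```python
def filter_fasta(lines, unsupported_ids):
    """Yield FASTA lines, skipping records whose header contains a listed ID."""
    ids = [u for u in unsupported_ids if u]  # ignore empty list entries
    keep = True
    for line in lines:
        if line.startswith(">"):
            # Decide once per record, based on the header line only.
            keep = not any(u in line for u in ids)
        if keep:
            yield line


if __name__ == "__main__":
    fasta = [">NM_1.1 a\n", "ACGT\n", ">NR_2.1 b\n", "GGGG\n"]
    print("".join(filter_fasta(fasta, ["NR_2.1"])), end="")
```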

3) Before alignment, the adapter sequences need to be removed from the reads.
There are various tools available which do this, such as cutadapt.
Make sure that after this step, your files are named *.trunc.

4) The following set of commands runs Bowtie to align the reads to the
transcriptome reference, store the unaligned reads, and finally align the
unaligned reads to the genome reference.
For the alignments on the transcriptome, only the forward mapped reads are
selected, and then the SAM files are filtered, requiring a minimum of 25
nucleotides per read. SAM files obtained from the alignment on the genome are
also filtered, requiring a minimum of 25 nucleotides per read. SAM files from
the genome alignment are converted to BAM files, which are then sorted and
converted into mpileup files. The mpileup files are then converted to 5' end
wiggle files, one for each strand. For the creation of 5' end wiggle files from
the genome alignment, use the SageWiggle.py script included in this framework.

In the commands below, replace "mouse.rna.fa.no_unsupported.fa.bowtieindex" with
the correct location of the bowtie index file of your transcriptome reference
sequence file, filtered for unsupported transcripts (see step 2).
Replace "mm10.reference.fa.bowtieindex" with the correct location of the
bowtie index file of your genome reference sequence file.

for file in *.trunc; do bowtie -k 1 -m 20 -n 1 --best --strata -p 8 -S --chunkmbs 2000 --norc --un "${file}.transcriptome_unaligned" mouse.rna.fa.no_unsupported.fa.bowtieindex "$file" "${file}.transcriptome_aligned.sam" 2>> results.transcriptome_alignment.txt; done;
for file in *.transcriptome_unaligned; do bowtie -k 1 -m 2 -n 1 -p 8 -S --best --strata --chunkmbs 2000 mm10.reference.fa.bowtieindex "$file" "${file}.genome_aligned.sam" 2>> results.transcriptome_unaligned.genome_alignment_mm10.txt; done;
for file in *.transcriptome_unaligned.genome_aligned.sam; do echo $file; head -n 100 "$file" | grep "^@" > "${file}.M25.sam"; awk '!/^@/ && ($6 + 0 >= 25)' "$file" >> "${file}.M25.sam"; done;
for file in *.transcriptome_unaligned.genome_aligned.sam.M25.sam; do echo $file; samtools view -bS "$file" > "${file}.bam"; done;
for file in *.transcriptome_unaligned.genome_aligned.sam.M25.sam.bam; do echo $file; samtools sort "$file" "${file}.sorted"; done;
for file in *.transcriptome_unaligned.genome_aligned.sam.M25.sam.bam.sorted.bam; do echo $file; samtools index "$file"; done;
for file in *.transcriptome_unaligned.genome_aligned.sam.M25.sam.bam.sorted.bam; do echo $file; samtools mpileup -d 10000000 "$file" > "${file}.mpileup"; done;
for file in *.transcriptome_unaligned.genome_aligned.sam.M25.sam.bam.sorted.bam.mpileup; do echo $file; python sageWiggle.py -i "$file" -o "${file}.F.wig5" "${file}.R.wig5"; done;
for file in *.transcriptome_aligned.sam; do echo $file; awk '($2 == 0)' "$file" > "${file}.mappedforward.sam"; done;
for file in *.transcriptome_aligned.sam.mappedforward.sam; do echo $file; awk '($6 + 0 >= 25)' "$file" > "${file}.M25.sam"; done;
for file in *.transcriptome_aligned.sam.mappedforward.sam.M25.sam; do echo $file; cut -f 1,3,4 "$file" > "${file}.3col.sam"; done
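
The two filters applied above (forward-mapped reads only, and a minimum of 25
aligned nucleotides) can be sketched in Python as shown below. This is an
illustration, not a replacement for the commands above. It assumes ungapped
alignments whose CIGAR string has the form "<n>M", which is what Bowtie
reports; the function name keep_read is hypothetical.

```python
import re

# Ungapped alignment, e.g. "30M". Anything else (including "*") is rejected.
CIGAR_MATCH = re.compile(r"^(\d+)M$")


def keep_read(sam_line, min_length=25, forward_only=False):
    """Return True if a SAM alignment line passes the length/strand filter."""
    if sam_line.startswith("@"):
        return False  # header lines are handled separately
    fields = sam_line.rstrip("\n").split("\t")
    flag, cigar = int(fields[1]), fields[5]
    if forward_only and flag != 0:
        return False  # FLAG 0 means: mapped, forward strand
    match = CIGAR_MATCH.match(cigar)
    return bool(match) and int(match.group(1)) >= min_length


if __name__ == "__main__":
    line = "r1\t0\tNM_1\t10\t255\t30M\t*\t0\t0\tACGT\tIIII"
    print(keep_read(line, forward_only=True))
```

Comparing the match length as a number avoids the pitfalls of comparing CIGAR
strings as text, where for instance "100M" would sort before "25M".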

5) Multiple scripts have transcript names but need either strand or gene
information, or vice versa. To provide this, download a file from the UCSC table
browser, listing the transcripts, the strand and the genes.

To build this list:
- Open the UCSC table browser: http://genome.ucsc.edu/cgi-bin/hgTables
- Select genome, assembly, track: "RefSeq Genes".
- Table: "refGene", output format: "selected fields from primary and related
tables".
- Click "get output".
- On this next page, select "name", "strand" and "name2" in the field list, and
click "get output".
- Save in a file and sort. On Linux, this can be done with the sort command.
- In this README, this file is referred to as mm10_gene_list.txt.

ANALYSIS OF READ LENGTH DISTRIBUTION OF RIBOSOME FOOTPRINTS
--------------------------------------------------------------------------------
6) The SAM files generated by the alignment of the ribosome footprint reads can
be used to retrieve information on the read length distribution. A simple bash
script that calculates the read length distribution for reads aligning to the
transcriptome (NM and NR separately) and reads aligning to the genome, is
described in help/read_length_distribution.txt.
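
As a rough Python equivalent of this idea (not the bash script from the help
file), the read length distribution of one SAM file can be obtained by counting
the lengths of the sequences in column 10. The function name
read_length_distribution is hypothetical; unaligned reads (FLAG bit 4) are
skipped.

```python
from collections import Counter


def read_length_distribution(sam_lines):
    """Count aligned reads per read length, using the SEQ column (field 10)."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10 or int(fields[1]) & 4:
            continue  # malformed line or unaligned read
        counts[len(fields[9])] += 1
    return counts


if __name__ == "__main__":
    example = ["r1\t0\tNM_1\t5\t255\t4M\t*\t0\t0\tACGT\tIIII\n"]
    for length, n in sorted(read_length_distribution(example).items()):
        print(length, n, sep="\t")
```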

The read length distribution can also be calculated per gene. For that, use the
get_read_length_per_gene.php script. The script takes gene IDs in an input file
and determines the read length distribution for its transcripts per sample, from
both transcriptome and genome alignment SAM files. It requires the
mm10_gene_list.txt file to see which transcripts belong to which gene, the
mm10_transcript_positions.txt file to check the strand and the positions of the
transcripts, and the input SAM files.

Usage:
./get_read_length_per_gene.php FILE_WITH_GENES_TO_ANALYZE mm10_gene_list.txt \
mm10_transcript_positions.txt SAM_FILE [SAM_FILE [SAM_FILE [...]]]

Example:
./get_read_length_per_gene.php genes_to_analyze.txt mm10_gene_list.txt \
mm10_transcript_positions.txt *.M25.sam

Produces:
- For each transcript of the given genes, one
gene_transcript_read_length_distribution.txt file, with the read lengths in
the first column, and the number of reads with this length per sample in the
next columns.
- One summary file, ALL_ALL_read_length_distribution.txt, with all read lengths
summed up.

This README will continue describing the steps to follow for triplet periodicity
analysis and ORF analysis (peak calling), regardless of whether the read length
distribution has been analyzed or not.

GROUPING READS FROM SAM FILES OBTAINED FROM TRANSCRIPTOME ALIGNMENT
AND BASIC STATISTICS
--------------------------------------------------------------------------------
7) Mapped reads in the SAM files are first grouped together to generate a file
that sums the coverage per transcript per position.
For this, use the pack_sam_files.php file.

Usage:
./pack_sam_files.php SAM_FILE [SAM_FILE [SAM_FILE [...]]]

Example:
./pack_sam_files.php *.transcriptome_aligned.*3col.sam

Produces:
- For each given SAM file, a SAM.packed file.
Contains the transcript ID, the position relative to the start of the
transcript, and the total coverage for that position, i.e. the number of reads
aligned starting at this position.
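
The packing step amounts to counting reads per transcript and start position.
A minimal Python sketch, assuming tab-separated three-column input (read name,
transcript ID, position) as produced by the cut command in step 4; the function
name pack_reads is hypothetical and the real .packed layout may differ in
detail.

```python
from collections import Counter


def pack_reads(three_col_lines):
    """Return sorted (transcript, position, coverage) tuples."""
    counts = Counter()
    for line in three_col_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        _read, transcript, position = fields[:3]
        counts[(transcript, int(position))] += 1
    return sorted((t, p, n) for (t, p), n in counts.items())


if __name__ == "__main__":
    reads = ["r1\tNM_1\t10\n", "r2\tNM_1\t10\n", "r3\tNM_1\t13\n"]
    for row in pack_reads(reads):
        print(*row, sep="\t")
```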

8) The packed SAM files can be used to run some statistics, using
packed_sam2coverage_per_gene.php.

Usage:
./packed_sam2coverage_per_gene.php mm10_gene_list.txt \
PACKED_SAM_FILE [PACKED_SAM_FILE [...]]

Example:
./packed_sam2coverage_per_gene.php mm10_gene_list.txt *.packed

Produces:
- One file whose name depends on the given SAM file(s), ending in the suffix
.coverage_per_gene.txt, containing (per SAM file) which transcripts cannot
be matched to a gene, listing the top 5 unknown transcripts based on their
coverage, and a table containing coverage per gene per sample. The
genes are sorted on total coverage in all samples together.
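
The per-gene summary can be sketched as follows, for a single sample. This is
an illustration under assumptions: the input tuples, the function name
coverage_per_gene, and the requirement that transcript IDs in the packed data
match those in the gene list exactly are all hypothetical simplifications.

```python
from collections import defaultdict


def coverage_per_gene(packed_rows, gene_list_rows):
    """Sum coverage per gene and list the top 5 transcripts without a gene.

    packed_rows:    iterable of (transcript, position, coverage)
    gene_list_rows: iterable of (transcript, strand, gene)
    """
    gene_of = {transcript: gene for transcript, _strand, gene in gene_list_rows}
    per_gene = defaultdict(int)
    unknown = defaultdict(int)
    for transcript, _position, coverage in packed_rows:
        if transcript in gene_of:
            per_gene[gene_of[transcript]] += coverage
        else:
            unknown[transcript] += coverage
    # Highest total coverage first.
    ranked = sorted(per_gene.items(), key=lambda kv: kv[1], reverse=True)
    top_unknown = sorted(unknown.items(), key=lambda kv: kv[1], reverse=True)[:5]
    return ranked, top_unknown
```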

Other statistics can be run directly on the packed SAM files, for instance how
many reads are mapped to non-coding transcripts.

CREATING WIGGLE FILES WITH GENOMIC COORDINATES FROM TRANSCRIPTOME ALIGNMENT
--------------------------------------------------------------------------------
9) The packed SAM files are used to generate wiggle files. For this, use the
packed_sam2wiggle.php script, which needs the transcript positions file created
in 1).

Usage:
./packed_sam2wiggle.php mm10_transcript_positions.txt \
PACKED_SAM_FILE [PACKED_SAM_FILE [PACKED_SAM_FILE [...]]]

Example:
./packed_sam2wiggle.php mm10_transcript_positions.txt *.packed

Produces:
- 4 wiggle files per SAM file;
+ F unfiltered,
+ F filtered (NR and XR removed),
+ R unfiltered,
+ R filtered (NR and XR removed).

Unfiltered 5' wiggle files are used for visualization purposes only, whereas
filtered 5' wiggle files, containing positions of reads mapping only to coding
transcripts, are used for triplet periodicity analysis and peak calling.
Positions mapping to non-coding transcripts won't be processed with this
analysis framework.

MERGING WIGGLE FILES FROM TRANSCRIPTOME AND GENOME ALIGNMENTS
--------------------------------------------------------------------------------
10) The merging of the wiggle files from the alignment on the transcriptome and
the alignment on the genome is done using the "wiggelen" package written by
Martijn Vermaat and Jeroen Laros. The merge_wigglefiles.sh script is a wrapper
around the wiggelen package.
For detailed information, see the help file help/merge_wigglefiles.txt.

Usage:
./merge_wigglefiles.sh
(it should be run in the directory where all the wiggle files reside)

Produces:
- 4 wiggle files per sample, using the transcriptome and genome alignment wiggle
files for each created file:
+ F unfiltered,
+ F filtered (NR and XR removed),
+ R unfiltered,
+ R filtered (NR and XR removed).

Unfiltered 5' wiggle files are used for visualization purposes only, whereas
filtered 5' wiggle files, containing positions of reads mapping only to coding
transcripts, are used for triplet periodicity analysis and peak calling.
Positions mapping to non-coding transcripts won't be processed with this
analysis framework.

RETRIEVING COORDINATES RELATIVE TO THE ANNOTATED OPEN READING FRAMES (ORFs)
USING MUTALYZER
--------------------------------------------------------------------------------
11) To translate the genomic coordinates to coordinates relative to the
annotated Translation Initiation Site (TIS), we use the Mutalyzer service.
Mutalyzer's position converter has a batch-feature that allows for the upload of
large files containing genomic positions. Through email you will be notified when
your batch job is done, after which you can download the result file.
Wiggle files are converted to Mutalyzer position converter batch files using the
wig2batchfile.php script. Please note that since Mutalyzer is a service aimed
at genetic variation, all positions are converted into genetic variants.
Batch files are created for the filtered 5' wiggle files only.

Usage:
./wig2batchfile.php WIGGLE_FILE

Example:
for file in *.filtered.wig5; do ./wig2batchfile.php "$file"; done

Produces:
- One Mutalyzer batch input file.

12) The position converter batch files are manually loaded into Mutalyzer:
https://v2.mutalyzer.nl/batchPositionConverter

Do not run more than a few files at a time. It will considerably slow down the
process, and will increase the chance of failures, after which you might need to
re-upload your file. Wait until Mutalyzer reports that the file has been
uploaded before uploading another file.

The test installation is slightly faster but there is no guarantee that it is
always online.

13) When done, download the result files from Mutalyzer. If you're not sure
anymore which results file belongs to which batchfile, you can match these files
by using the match_mutalyzer_output_to_batchfile.sh script.

Usage:
./match_mutalyzer_output_to_batchfile.sh \
MUTALYZER_RESULTS_FILE [MUTALYZER_RESULTS_FILE [...]] \
BATCH_FILE [BATCH_FILE [...]]

Example:
./match_mutalyzer_output_to_batchfile.sh batch-job-* *mutalyzer_batchfile.txt

This script will rename the Mutalyzer results files according to the batch input
files.

ANALYSIS OF TRIPLET PERIODICITY
--------------------------------------------------------------------------------
14) The analyze_triplet_periodicity.php script reports the number of reads
relative to the annotated start codon. A standard feature of ribosome profiling
data is the presence of a major peak at position -12 relative to the annotated
start codon (harringtonin-treated samples), and a higher number of reads whose
5' ends map to the first nucleotide of a codon (cycloheximide-treated samples).
In the output file of the analyze_triplet_periodicity.php script, numbers in
positions %1, %2 and %3 represent the sum of reads whose 5' end mapped to the
first, second and third nucleotide of the annotated codons, respectively.

Usage:
./analyze_triplet_periodicity.php MUTALYZER_RESULTS_FILE WIGGLE_FILE \
mm10_gene_list.txt STRAND
(Valid values for STRAND: +, -, F and R)

Example:
./analyze_triplet_periodicity.php \
A1.merged_wiggle.F.filtered.wig5_mutalyzer_batchfile_results.txt \
A1.merged_wiggle.F.filtered.wig5 mm10_gene_list.txt F \
> A1.merged_wiggle.F.filtered.wig5.periodicity.txt

Produces:
- Direct output to the screen (which should then be saved to a file if needed)
with detailed statistics about the mapping of the positions to the
transcriptome by Mutalyzer, followed by the list of positions relative to the
TIS, and the total coverage for each position. Of the positions in the coding
region, only the first 6 are shown individually. All positions in the coding
region, including the positions up to 15 bases upstream of the coding region,
are grouped by their location in the annotated codons (first, second and third
base), and displayed as positions %1, %2 and %3 respectively, with the summed
coverages to assess the overall triplet periodicity.
Positions starting with an asterisk (*) are positions in the 3' UTR. The given
number is the relative distance to the stop codon: *1 is the first base after
the stop codon.
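
The grouping into %1, %2 and %3 can be sketched in Python as follows. This is
only an illustration: it assumes HGVS-style numbering as used by Mutalyzer
(position 1 is the A of the ATG, -1 is the base directly upstream, and there is
no position 0), and the function names codon_frame and frame_totals are
hypothetical. Positions in the 3' UTR (those starting with an asterisk) are not
handled here.

```python
def codon_frame(position):
    """Return 1, 2 or 3: the place of a position within its codon.

    Assumes HGVS-style numbering without a position 0.
    """
    if position == 0:
        raise ValueError("position 0 does not exist in this numbering")
    # Shift positive positions so that the numbering becomes continuous.
    offset = position - 1 if position > 0 else position
    return offset % 3 + 1


def frame_totals(coverage_by_position, upstream=15):
    """Sum coverage per codon position from -upstream onwards."""
    totals = {1: 0, 2: 0, 3: 0}
    for position, coverage in coverage_by_position.items():
        if position >= -upstream and position != 0:
            totals[codon_frame(position)] += coverage
    return totals


if __name__ == "__main__":
    print(frame_totals({-12: 50, 1: 10, 2: 3, 3: 2}))
```

With this numbering, position -12 falls on the first nucleotide of a codon,
which is consistent with the harringtonin peak described above.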

The script does the following things:
+ Read out the coverage information from the wiggle file.
+ Read the gene list to find out which genes (transcripts, actually) are on
which strand.
+ Read the Mutalyzer result file to see which locations map to which
transcripts, determining the position within the transcript.
* Filter low coverage positions (< 3).
* Mappings on non-coding transcripts are ignored.
* Mappings on transcripts on the other strand are ignored.
* Transcripts not in the gene list are ignored and reported.
* Filter positions without transcript mapping.
* Filter positions with only intronic mapping.
* Filter positions too far from translation start or stop sites (> 500bp).
* The remaining possible mappings are processed.
Please note that the "coding region" is defined as the region starting at
-15 nucleotides from the ATG to the end of the open reading frame.
Please note that this script does not see the difference between the UTR and
the intergenic sequence. Both are referred to as "UTR mappings".
- 5' UTR or 3' UTR mappings, while mappings to the coding region are also
available, are discarded.
- If there are 5' UTR and 3' UTR mappings, but none in the coding
region, we assume an intergenic situation and if one of the two
positions is clearly closer to the coding region than the
other (> 100bp closer), then the closest position is picked.
If not, the position is discarded, and reported to the user.
- After this, all mappings are counted by their coverage and
reported.

MERGING BIOLOGICAL REPLICATES PRIOR TO PEAK CALLING
(IDENTIFICATION OF TRANSLATION INITIATION SITES (TISs))
--------------------------------------------------------------------------------
15) To increase the statistical power and to lower background signal, merge the
biological replicates of the harringtonin-treated samples. Both the wiggle files
and the Mutalyzer result files need to be merged.
Examples, assuming A and C are harringtonin samples:

Merging the wiggle files:

for sample in A C;
do
for strand in F R;
do
wiggelen merge ${sample}?.merged_wiggle.${strand}.filtered.wig5 \
| sed 's/^\([0-9]\+\) \([0-9]\+\)/\1\t\2/' \
> merged_${sample}.merged_wiggle.${strand}.filtered.wig5;
rm ${sample}?.merged_wiggle.${strand}.filtered.wig5.idx;
done
done

Merging the Mutalyzer files:

for sample in A C;
do
for strand in F R;
do
head -n 1 ${sample}1.merged_wiggle.${strand}.filtered.wig5_mutalyzer_batchfile_results.txt \
> merged_${sample}.merged_wiggle.${strand}.filtered.wig5_mutalyzer_batchfile_results.txt
grep -h "^chr" ${sample}?.merged_wiggle.${strand}.filtered.wig5_mutalyzer_batchfile_results.txt \
| sort -g | uniq \
>> merged_${sample}.merged_wiggle.${strand}.filtered.wig5_mutalyzer_batchfile_results.txt
done
done

The variants in the merged Mutalyzer results file are ordered differently from
those in the original files, but this has no effect on the results of the
following analyses.
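
Conceptually, merging the replicate wiggle files means combining the coverage
of each position across replicates. The Python sketch below sums the values,
which is an assumption about the intended merge; it is not the wiggelen
implementation. It reads simple variableStep wiggle data, and the function
names parse_wiggle and sum_replicates are hypothetical.

```python
from collections import defaultdict


def parse_wiggle(lines):
    """Parse variableStep wiggle lines into {(chromosome, position): value}."""
    coverage = {}
    chrom = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        if line.startswith("variableStep"):
            # e.g. "variableStep chrom=chr1"
            chrom = dict(f.split("=", 1) for f in line.split()[1:])["chrom"]
            continue
        position, value = line.split()
        coverage[(chrom, int(position))] = float(value)
    return coverage


def sum_replicates(replicates):
    """Sum the coverage per position over several parsed wiggle files."""
    total = defaultdict(float)
    for coverage in replicates:
        for key, value in coverage.items():
            total[key] += value
    return dict(total)
```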

PEAK CALLING: IDENTIFICATION OF PRIMARY ORFs, ALTERNATIVE ORFs and UPSTREAM ORFs
--------------------------------------------------------------------------------
16) The find_ORFs.php script is based on a local dynamic peak calling algorithm,
which tries to identify peaks (TISs) relative to the coverage of the surrounding
region.
NOTE: Run this script on the merged replicates of harringtonin samples, only.

Usage:
./find_ORFs.php MUTALYZER_RESULTS WIGGLE_FILE mm10_gene_list.txt STRAND
(Valid values for STRAND: +, -, F and R)

Example:
./find_ORFs.php \
merged_A.merged_wiggle.F.filtered.wig5_mutalyzer_batchfile_results.txt \
merged_A.merged_wiggle.F.filtered.wig5 mm10_gene_list.txt F

Produces:
- *.ORF_analysis_results_stats.txt
Statistics on the ORF finding run; information on the loaded gene list file
and wiggle file, mentions the number of positions left out on each filtering
step, and mentions the number of genes left with found translation initiation
sites.
- *.ORF_analysis_results.txt
The results file shows, per gene, the number of positions found, the number
actually analyzed as candidate peaks, and the number of TISs found. These are
then reported with the genomic position, the coverage, and per transcript, the
position relative to the annotated TIS on the transcript.
- *.ORF_analysis_results_after_cutoff.txt
This file has the same format as the results file, but shows only TISs found
in the annotated coding region, at least 5000 bases from the annotated TIS.
(the value of 5000 can be configured in the script)
For more information on this cutoff, please see de Klerk E., Fokkema I.F.A.C.
et al (2015).

The script does the following things:
+ Read out the coverage information from the wiggle file
+ Read the gene list to find out which genes (transcripts, actually) are on
which strand
+ Read the Mutalyzer result file to see which locations map to which
transcripts, determining the position within the transcript
* Filter low coverage positions (< 3)
* Mappings on non-coding transcripts are ignored
* Mappings on transcripts on the other strand are ignored
* Transcripts not in the gene list are ignored and reported (so they could be
fixed == added to the gene info file)
* Filter positions without transcript mapping
* Filter positions with only intronic mapping
* Filter positions too far from translation start or stop sites (> 500bp)
* The remaining possible mappings are processed.
Please note that the "coding region" is defined as the region starting at
-15 nucleotides from the ATG to the end of the open reading frame.
Please note that this script does not see the difference between the UTR and
the intergenic sequence. Both are referred to as "UTR mappings".
- 3' UTR mappings, while mappings to the coding region or in the 5' UTR are
also available, are discarded.
- After this, all positions are stored, per gene, for the next step.
+ Peaks are searched for by walking through the positions, starting upstream.
Each position, with at least a coverage of 20, is analyzed.
(Please note, that 20 has been chosen for analysis of merged biological
replicates. Otherwise, 10 was used.)
* Its coverage should be higher than that of the positions located 3, 6, 9, 12
and 15 nucleotides upstream of it.
* Its coverage should be at least as high as that of the positions located 1
and 2 nucleotides downstream of it.
* It should show a triplet periodicity and a clear "harringtonin pattern":
- The first nucleotide of the analyzed codon should have a coverage of at
least 60% of the total coverage of that codon (configurable setting).
- The following 5 codons should also not have a higher maximum coverage than
this position's coverage.
- If any of the following 5 codons have a maximum coverage higher than 10%
of the coverage of the position we're analyzing, that codon must not show
a conflicting triplet periodicity pattern (see first two points above).
* If all these rules apply, the position is stored as a possible TIS.
+ Per gene, from all its found possible TISs, we will take the one with the
highest coverage as a reference, and discard any other candidate TISs that do
not have at least a coverage (on that position) of 10% of the reference
(highest candidate).
+ All remaining candidate TISs will be reported.
+ Genes that have no candidate peaks left, are ignored.
+ For all genes remaining, the positions of the candidate TISs are reported
relative to all known transcripts of the gene in question. The original number
of positions with high coverage (>10) is mentioned along with the number of
remaining candidate TISs.
Results are displayed as: genomic position, coverage, position on transcript.
The last column may be repeated, depending on the number of known transcripts.
+ Additionally, a separate file reports all candidate peaks found at a position
higher than 5Kb from a known TIS, so they can easily be checked more
thoroughly if they are false positives.
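
Part of the peak rules listed above can be sketched in Python as shown below.
This is a simplified illustration, not the find_ORFs.php implementation. It
assumes coverage is given per position along the transcript, with contiguous
integer coordinates that increase in the downstream direction. It leaves out
the last sub-rule (conflicting triplet periodicity in the following codons),
because its exact definition cannot be derived from this README. The function
names and the dictionary-based input are hypothetical.

```python
def is_candidate_tis(cov, p, min_coverage=20, first_base_fraction=0.6):
    """Apply a subset of the peak rules to position p.

    cov: {position: coverage}; missing positions count as 0.
    """
    c = cov.get(p, 0)
    if c < min_coverage:
        return False
    # Must be higher than the positions 3, 6, 9, 12 and 15 nt upstream.
    if any(cov.get(p - k, 0) >= c for k in (3, 6, 9, 12, 15)):
        return False
    # Must be at least as high as the positions 1 and 2 nt downstream.
    if any(cov.get(p + k, 0) > c for k in (1, 2)):
        return False
    # The first nucleotide must carry at least 60% of the codon's coverage.
    codon_total = c + cov.get(p + 1, 0) + cov.get(p + 2, 0)
    if c < first_base_fraction * codon_total:
        return False
    # None of the following 5 codons may have a higher maximum coverage.
    for codon in range(1, 6):
        start = p + 3 * codon
        if max(cov.get(start + i, 0) for i in range(3)) > c:
            return False
    return True


def filter_by_reference(candidates, fraction=0.1):
    """Keep candidates with at least `fraction` of the highest coverage."""
    if not candidates:
        return {}
    reference = max(candidates.values())
    return {p: c for p, c in candidates.items() if c >= fraction * reference}
```

The defaults mirror the values named above (coverage of 20 for merged
replicates, 60% for the first nucleotide, 10% of the reference peak).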

MERGING RESULT FILES
--------------------------------------------------------------------------------
17) To be able to statistically compare the possible switches in TIS usage, the
results of the ORF analyses need to be combined into one file, per strand (F and
R). The merge_ORF_files.php script merges the files into one summary file, but
does not take care of the strands, therefore make sure you don't mix forward and
reverse files with each other.

Usage:
./merge_ORF_files.php ORF_FILE [ORF_FILE [ORF_FILE [...]]]

Example:
./merge_ORF_files.php *.F.*.ORF_analysis_results.txt

Produces:
- One file whose name depends on the given input files, ending in the suffix
.merged_ORF_analyses.txt. The format is described below.

The script analyzes the file names to isolate the sample IDs. It assumes that
the sample IDs are the only differences between the input file names. The sample
IDs are also used in the headers of the output.

While grouping, it ignores the positions of the peaks on the transcripts. It
reports the chromosomal position, the coverage per sample, and the gene name.
Every peak that has been called in at least one sample, is displayed.

NOTE: If the original wiggle files of the individual biological replicates are
present in the same directory, these are all read out while merging the ORF
analysis result files, and the resulting file will report the individual
coverages of all biological replicates, per TIS peak.
If the wiggle files are not found, the coverages are taken from the ORF analysis
result files, which means when a coverage of 0 is reported, it simply means that
the position was not recognized as a TIS in that sample. The actual measured
coverage in that sample may not be 0.

Example output:
# Chromosome Position A1 A2 A3 C1 C2 C3 Gene
chr1 4857920 32 14 27 6 3 10 Tcea1
chr1 4857929 24 6 24 2 4 6 Tcea1
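
The two ideas described above, isolating the sample IDs from the file names and
grouping peaks per chromosomal position, can be sketched in Python. This is an
illustration only: the function names sample_ids and merge_peaks and the
dictionary-based input are hypothetical, and the sketch reports 0 for samples
in which a peak was not called, as happens when no wiggle files are available.

```python
import os


def sample_ids(file_names):
    """Strip the shared prefix and suffix; what remains is the sample ID."""
    prefix = os.path.commonprefix(file_names)
    suffix = os.path.commonprefix([name[::-1] for name in file_names])[::-1]
    return [name[len(prefix):len(name) - len(suffix)] for name in file_names]


def merge_peaks(peaks_per_sample):
    """Combine called peaks into one table.

    peaks_per_sample: {sample: {(chromosome, position, gene): coverage}}
    Returns (header, rows); rows are sorted by chromosome name (as text)
    and position.
    """
    samples = sorted(peaks_per_sample)
    keys = sorted({key for peaks in peaks_per_sample.values() for key in peaks})
    rows = [
        [chrom, pos]
        + [peaks_per_sample[s].get((chrom, pos, gene), 0) for s in samples]
        + [gene]
        for chrom, pos, gene in keys
    ]
    return ["Chromosome", "Position"] + samples + ["Gene"], rows
```

Note that stripping the shared prefix and suffix only works when the sample IDs
themselves have no leading or trailing characters in common.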

ANALYSIS OF SWITCHES IN TIS USAGE
--------------------------------------------------------------------------------
The output file of the merge_ORF_files.php script is used to test switches in
TIS usage. For this, the lme4.0 R package is used.

The input file for the following R code is a tab delimited file. The first line
contains a list of all sample names (in the below example, 6), separated by a
tab. All following lines contain as first column the gene symbols and then, per
sample, the actual individual coverages. The file is imported into R as a
dataframe, containing 6 columns and n rows (n=number of called peaks).

Example input for R:
A1 A2 A3 C1 C2 C3
Tcea1 32 14 27 6 3 10
Tcea1 24 6 24 2 4 6
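
Converting the merged ORF file into this input format can be sketched in
Python. This assumes the merged file is tab-delimited with the header shown in
the example output of step 17 (a first field "# Chromosome", then "Position",
the sample names, and "Gene"); the function name to_r_input is hypothetical.

```python
def to_r_input(merged_lines):
    """Turn merged ORF lines into lines for the R input file."""
    out = []
    for line in merged_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#"):
            # Header: keep the sample names only.
            out.append("\t".join(fields[2:-1]))
        elif len(fields) > 3:
            # Data: gene symbol first, then the coverage per sample.
            out.append("\t".join([fields[-1]] + fields[2:-1]))
    return out


if __name__ == "__main__":
    merged = [
        "# Chromosome\tPosition\tA1\tA2\tA3\tC1\tC2\tC3\tGene\n",
        "chr1\t4857920\t32\t14\t27\t6\t3\t10\tTcea1\n",
    ]
    print("\n".join(to_r_input(merged)))
```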

Below is the R code used for the analysis.

library(lme4.0)
load("TIS_analysis.Rdata")
data_samples<-data_small[,2:7]
Myotubes <- grepl("C", names(data_samples))
coeff_summary_fit_col3_col4 <- list()
for(i in 1: length(list_data.multinom))
{
print(i)
mydata <- list_data.multinom [[i]]
loc <- mydata[,1]
N <- nrow( mydata)
M <- 6
weight <- colSums(mydata[,2:7])
if ((!weight[1] && !weight[2] && !weight[3]) || (!weight[4] && !weight[5] && !weight[6])) {
print("No coverage available in at least one condition")
next
}
dat <- as.vector(as.matrix(mydata[,2:7]))
loc_all <- rep(loc, M)
Myotubes_all <- rep(Myotubes, each=N)
weight_all <- rep(weight, each=N)
subj_all <- factor(rep(colnames(mydata[,2:7]), each=N))
fit<- lmer( dat/weight_all ~ 0 + Myotubes_all*loc_all - Myotubes_all + (loc_all|subj_all), family=binomial(), weight=weight_all)
fit0 <- lmer( dat/weight_all ~ 0+loc_all + (loc_all|subj_all), family=binomial(), weight=weight_all)
coeff_summary_fit_col3_col4[[i]] <- coef(summary(fit))[(1+nrow(coef(summary(fit)))/2):nrow(coef(summary(fit))),3:4]
}
sink("coeff_summary_fit_col3_col4.csv")
coeff_summary_fit_col3_col4
sink()

TIS LOCATION AND MOTIF ANALYSIS
--------------------------------------------------------------------------------
18) The generate_stats_peaks_per_location.php script is used to (i) categorize
TISs based on their location relative to different gene regions, (ii) report
codon motifs, and (iii) report the sequence of potential uORFs (the sequence
between each TIS located in the 5'UTR and the annotated TIS). The script uses
the ORF analysis results files of the merged replicates. It takes all ORF
analysis results files, both those before the cutoff and those after, and sums
up the number of peaks found per location: 5'UTR, annotated TIS, coding, 3'UTR,
or multiple.
"Multiple" means that the TIS location maps to multiple transcripts, and that
the location differs between at least two of them. The exception is when one of
the locations that the TIS maps to is "annotated_TIS". In that case, it is
assumed that the transcripts with other positions are not the translated
transcripts.
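
The rule for combining per-transcript locations can be sketched in Bash as
follows. This is a minimal illustration of the rule as described above, not the
PHP implementation; the function name resolve_category is hypothetical, and the
category labels are borrowed from the motif analysis in step 19.

```bash
#!/usr/bin/env bash
# Hypothetical helper: collapse the locations of one TIS on all transcripts
# it maps to into a single category.
# Usage: resolve_category LOCATION [LOCATION ...]
resolve_category() {
    local loc first=$1 same=1
    for loc in "$@"; do
        # An annotated TIS on any transcript takes precedence.
        if [[ $loc == annotated_TIS ]]; then
            echo annotated_TIS
            return
        fi
        if [[ $loc != "$first" ]]; then
            same=0
        fi
    done
    if (( same )); then
        echo "$first"       # same location on every transcript
    else
        echo multiple       # locations differ between transcripts
    fi
}

resolve_category 5UTR 5UTR              # 5UTR
resolve_category 5UTR coding            # multiple
resolve_category coding annotated_TIS   # annotated_TIS
```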

Note that to determine the locations of the reads, the positions that are given
by the find_ORFs.php script have to be increased by 12 bases, since these are
the 5' ends of the reads. This means that positions -12 through -10 are counted
as annotated TIS. Lower numbers are counted as 5'UTR, while higher numbers
(> -10) are counted as coding. Positions starting with an asterisk (*) are
counted as 3'UTR.
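
The thresholds above can be written down as a small Bash function. This is a
sketch of the stated rule only; the function name classify_position is
hypothetical, and the input is assumed to be the position of the 5' end of the
read relative to the annotated TIS as reported by find_ORFs.php.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify a reported read 5' end position.
# Usage: classify_position POSITION
classify_position() {
    local pos=$1
    if [[ $pos == \** ]]; then
        echo 3UTR               # positions starting with an asterisk
    elif (( pos < -12 )); then
        echo 5UTR               # upstream of the annotated TIS
    elif (( pos <= -10 )); then
        echo annotated_TIS      # -12 through -10
    else
        echo coding             # higher than -10
    fi
}

for p in -30 -12 -11 -10 -9 45 '*7'; do
    printf '%s\t%s\n' "$p" "$(classify_position "$p")"
done
```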

This also means that read start positions at the end of the coding region, less
than 12 bases away from the 3'UTR, are counted as coding while they actually
represent a read that lies in the 3'UTR. This cannot be detected, however,
because we don't know the length of the coding region of the transcripts.

To be able to report the TIS codon motifs and the sequence between each 5'UTR
TIS and the annotated TIS, this script downloads the RefSeq sequences of the
transcripts that the read was aligned to, automatically from the NCBI website
using the URL format:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NM_007386.2.gb&rettype=gb

To prevent repeated downloads on successive runs of the script, it requires an
"NM cache" directory where all downloaded sequences are stored so that they can
be reused by successive runs. The name of the directory is currently configured
in the $_SETT settings array within the script. If the directory is not found or
not writable, the script will complain and refuse to run. When that happens,
check whether the directory can be made writable. Otherwise, create an empty
directory and set its name in the settings in the script source code.
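
The download-and-cache behaviour can be sketched in Bash as shown below. This is
a sketch under assumptions, not the PHP code of the script: the names
NM_CACHE_DIR, nm_url and fetch_refseq are hypothetical, as is the naming of the
cached files (ID plus ".gb"). Only the URL format is taken from this README.
curl is assumed to be installed.

```bash
#!/usr/bin/env bash
# Hypothetical default; the PHP script reads the directory from its own
# $_SETT settings array instead.
NM_CACHE_DIR=${NM_CACHE_DIR:-./NM_cache}

# Build the efetch URL for a RefSeq transcript ID such as NM_007386.2.
nm_url() {
    printf 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=%s.gb&rettype=gb\n' "$1"
}

# Print the path of the cached GenBank file, downloading it only if needed.
fetch_refseq() {
    local id=$1
    local file="$NM_CACHE_DIR/$id.gb"   # assumed cache file naming
    if [[ ! -d $NM_CACHE_DIR || ! -w $NM_CACHE_DIR ]]; then
        echo "Cache directory '$NM_CACHE_DIR' is missing or not writable." >&2
        return 1
    fi
    # Download only when there is no non-empty cached copy yet.
    if [[ ! -s $file ]]; then
        curl -fsS "$(nm_url "$id")" -o "$file" || { rm -f "$file"; return 1; }
    fi
    printf '%s\n' "$file"
}

# Demo without network access: only print the URL.
nm_url NM_007386.2
```

Calling fetch_refseq NM_007386.2 twice would contact NCBI only the first time;
the second call returns the cached file.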

Usage:
./generate_stats_peaks_per_location.php ORF_FILE [ORF_FILE [ORF_FILE [...]]]

Example:
./generate_stats_peaks_per_location.php *ORF_analysis_results*

Produces:
- *.ORF_analysis_results.stats_peaks_per_location.txt (one per sample)
Reports the total number of TISs found as well as the total coverage of found
TISs per gene region (5'UTR, annotated TIS, coding, 3'UTR, multiple). Both are
reported since a high number of identified TISs in a region does not
necessarily mean that the coverage in that region is high as well.
The statistics are reported separately for all peaks before the cutoff and all
peaks including the ones after the cutoff.
- *.ORF_analysis_results.peaks_classification.txt (one per sample)
Reports all found TISs before the set cutoff, sorted on category, strand and
genomic position. Other fields reported are the "real" genomic position of the
TIS, calculated by adding (or subtracting, for the negative strand) an offset
of 12 nucleotides to the 5' read genomic positions, the gene, the summed
coverage of the replicates in that position, the transcript RefSeq ID (only
one reported), the position on that transcript including the position with the
offset of 12 applied, optional status messages, and the sequence of the codon
at the reported TIS.
For TISs reported in the category "multiple", the information in the last four
columns is missing.
In case the RefSeq file has no CDS annotated, instead of the motif sequence
the error "could_not_parse_CDS" is given.
In case the RefSeq file has no 5'UTR annotated (CDS starts at position 1), or
in case the RefSeq file has not enough 5'UTR annotated to fetch the full motif
sequence (CDS starts at a position smaller than the distance of the TIS to the
annotated TIS), the status is set to "no_5UTR" or "unannotated_5UTR",
respectively, and a "slice" of the genomic sequence is downloaded. This slice
starts at the corrected genomic position and extends 75000 bases downstream.
From this file, the script attempts to find the correct mRNA and CDS tags and
takes the sequence until the annotated CDS, removing any annotated introns
before the CDS. When the mRNA tag cannot be found, the status also mentions the
code "no_mRNA_definition". The motif is still reported, but the output file
described below will not contain the sequence until the annotated TIS.
- *.ORF_analysis_results.peaks_classification_5UTR.txt (one per sample)
Reports all TISs found in the 5'UTR, sorted on strand and genomic position.
The fields reported are very similar to the fields mentioned in the previous
file, with the following exceptions: the motif field is missing, and the
following two fields are added: the DNA sequence starting at the TIS until the
annotated TIS, and the translation of this sequence to protein sequence. If a
found TIS can be mapped to different transcripts at different distances from
the annotated TIS, all these transcripts are reported on a separate line.
The protein sequence can contain a *, which indicates a stop codon. A ?
indicates an incomplete or unrecognizable codon. This should happen only if
the sequence is in a different frame compared with the annotated TIS; in this
case the DNA sequence given will end with an incomplete codon and the
translated sequence will end with a ?.
Any messages that might be reported in the status column, are explained above
in the description of the previous file.
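
The translation convention used in the 5'UTR file (* for a stop codon, ? for an
incomplete or unrecognizable codon) can be illustrated with the Bash sketch
below. It is not the code of the PHP script. It assumes the standard genetic
code, with the amino acids listed in T, C, A, G codon order, and it assumes that
translation continues past stop codons. The function name translate is
hypothetical. Bash 4 or later is needed for the upper-casing expansion.

```bash
#!/usr/bin/env bash
# Hypothetical helper: translate a DNA sequence codon by codon.
# Usage: translate SEQUENCE
translate() {
    local seq=${1^^}    # upper-case the input (Bash 4+)
    # Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
    local aas='FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
    local bases='TCAG' protein='' codon base rest i j idx
    for (( i = 0; i < ${#seq}; i += 3 )); do
        codon=${seq:i:3}
        # Incomplete codons and codons with other characters become '?'.
        if [[ ! $codon =~ ^[TCAG]{3}$ ]]; then
            protein+='?'
            continue
        fi
        idx=0
        for (( j = 0; j < 3; j++ )); do
            base=${codon:j:1}
            rest=${bases%%"$base"*}     # part of 'TCAG' before this base
            idx=$(( idx * 4 + ${#rest} ))
        done
        protein+=${aas:idx:1}
    done
    printf '%s\n' "$protein"
}

translate ATGGCCTAAGG   # MA*?
```

The trailing GG in the example is an incomplete codon, so the translated
sequence ends with a ?.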

The script checks the names of input files to automatically ignore its own
result files, the statistics files generated by the find_ORFs.php script and
merged ORF files generated by the merge_ORF_files.php script. Input file names
need to contain the strand information ('.F.' or '.R.') and must end with either
'ORF_analysis_results.txt' or 'ORF_analysis_results_after_cutoff.txt'. Files are
grouped by their file name's prefix (until the strand information). As such, per
sample the script can find four files: two for the forward strand and two for
the reverse strand, in each case one file with peaks before the cutoff and one
with peaks after the cutoff.
If files are passed for which the file name is not recognized, they are reported
and the script will stop processing.
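
The file name rules above can be sketched as a Bash regular expression. This is
an approximation of the described behaviour, not the check used in the PHP
script; the function name parse_orf_filename and the file names in the demo are
hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split an input file name into the sample prefix, the
# strand and the results type. Fails for unrecognized names.
# Usage: parse_orf_filename FILE
parse_orf_filename() {
    local name=${1##*/}     # strip any directory part
    local re='^(.*)\.([FR])\.(.*)ORF_analysis_results(_after_cutoff)?\.txt$'
    local type=ORF_analysis_results
    if [[ ! $name =~ $re ]]; then
        echo "Unrecognized file name: $name" >&2
        return 1
    fi
    if [[ -n ${BASH_REMATCH[4]} ]]; then
        type=ORF_analysis_results_after_cutoff
    fi
    # Prefix (up to the strand information), strand, results type.
    printf '%s\t%s\t%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "$type"
}

parse_orf_filename sample_A.F.ORF_analysis_results.txt
parse_orf_filename sample_A.R.ORF_analysis_results_after_cutoff.txt
```

Files that share the first output field would be grouped as one sample.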

19) The result files of the generate_stats_peaks_per_location.php script contain
the motif sequences of every found TIS. The code used to analyze the motif
sequences is shown below. If needed, change the first line to use the sample
names you wish to analyze.

for sample in A C;
do
    echo "Analyzing sample ${sample}:" > merged_${sample}.motif_analysis_results.txt;
    for category in 5UTR unannotated_5UTR annotated_TIS coding 3UTR multiple;
    do
        if [[ $category == '5UTR' ]];
        then
            TOTAL=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | grep -vE "(unannotated|no)_5UTR" | cut -f 11 | grep -v '^#' | grep -v '^$' | wc -l`;
            TOTAL_COVERAGE=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | grep -vE "(unannotated|no)_5UTR" | cut -f 6,11 | grep -E "^[0-9]+\s...$" | grep -v '^#' | cut -f 1 | paste -sd+ | bc`;
        elif [[ $category == 'unannotated_5UTR' ]];
        then
            TOTAL=`grep -E "(unannotated|no)_5UTR" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 11 | grep -v '^#' | grep -v '^$' | wc -l`;
            TOTAL_COVERAGE=`grep -E "(unannotated|no)_5UTR" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 6,11 | grep -E "^[0-9]+\s...$" | grep -v '^#' | cut -f 1 | paste -sd+ | bc`;
        else
            TOTAL=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 11 | grep -v '^#' | grep -v '^$' | wc -l`;
            TOTAL_COVERAGE=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 6,11 | grep -E "^[0-9]+\s...$" | grep -v '^#' | cut -f 1 | paste -sd+ | bc`;
        fi
        echo -e "${category}\tTotal TISs:\t${TOTAL}\tTotal coverage:\t${TOTAL_COVERAGE}";
        echo -e "Motif\tTISs\tCoverage\tPercTISs\tPercCoverage";
        if [[ $category == '5UTR' ]];
        then
            MOTIFS=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | grep -vE "(unannotated|no)_5UTR" | cut -f 11 | grep -v '^#' | grep -v '^$' | sort | uniq`;
        elif [[ $category == 'unannotated_5UTR' ]];
        then
            MOTIFS=`grep -E "(unannotated|no)_5UTR" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 11 | grep -v '^#' | grep -v '^$' | sort | uniq`;
        else
            MOTIFS=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 11 | grep -v '^#' | grep -v '^$' | sort | uniq`;
        fi
        for motif in $MOTIFS;
        do
            if [[ $category == '5UTR' ]];
            then
                TISs=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | grep -vE "(unannotated|no)_5UTR" | cut -f 11 | grep "^${motif}$" | wc -l`;
                COVERAGE=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | grep -vE "(unannotated|no)_5UTR" | cut -f 6,11 | grep -E "^[0-9]+\s${motif}$" | cut -f 1 | paste -sd+ | bc`;
            elif [[ $category == 'unannotated_5UTR' ]];
            then
                TISs=`grep -E "(unannotated|no)_5UTR" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 11 | grep "^${motif}$" | wc -l`;
                COVERAGE=`grep -E "(unannotated|no)_5UTR" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 6,11 | grep -E "^[0-9]+\s${motif}$" | cut -f 1 | paste -sd+ | bc`;
            else
                TISs=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 11 | grep "^${motif}$" | wc -l`;
                COVERAGE=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 6,11 | grep -E "^[0-9]+\s${motif}$" | cut -f 1 | paste -sd+ | bc`;
            fi
            PERC_TISs=`echo "scale=2; $TISs * 100 / $TOTAL" | bc`;
            PERC_COVERAGE=`echo "scale=2; $COVERAGE * 100 / $TOTAL_COVERAGE" | bc`;
            echo -e "${motif}\t${TISs}\t${COVERAGE}\t${PERC_TISs}%\t${PERC_COVERAGE}%";
        done
    done >> merged_${sample}.motif_analysis_results.txt;
done

Produces:
- Per sample, one file containing per TIS category (5UTR, unannotated_5UTR,
annotated_TIS, coding, 3UTR, multiple): the analyzed motif, the number of TISs
with this motif, the total coverage of TISs with this motif, the percentage of
TISs with this motif, and the percentage of reads on TISs with this motif.
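
For a single category, the per-motif numbers can also be computed in one pass
with awk. The Bash sketch below is an alternative illustration, not the code
used for the publication. Like the code above, it assumes that column 6 holds
the coverage and column 11 the motif. Unlike the code above, it applies the same
filter (numeric coverage, three-letter motif) to both totals, and it rounds the
percentages instead of truncating them. The function name motif_summary is
hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: summarize motifs for rows of one category.
# Reads tab-delimited peaks_classification rows on stdin and prints, per
# motif: motif, TISs, coverage, percentage of TISs, percentage of coverage.
motif_summary() {
    awk -F'\t' '
        /^#/ { next }                               # skip comment lines
        $6 ~ /^[0-9]+$/ && length($11) == 3 {
            tiss[$11]++                             # TISs with this motif
            cov[$11] += $6                          # coverage on this motif
            total++
            total_cov += $6
        }
        END {
            for (m in tiss)
                printf "%s\t%d\t%d\t%.2f%%\t%.2f%%\n", m, tiss[m], cov[m],
                    100 * tiss[m] / total,
                    (total_cov ? 100 * cov[m] / total_cov : 0)
        }' | sort
}

# Demo on two made-up rows (only columns 6 and 11 carry meaning here).
printf 'c1\tc2\tc3\tc4\tc5\t%s\tc7\tc8\tc9\tc10\t%s\n' 30 ATG 10 CTG |
    motif_summary
```

To use it on real data, select the rows of one category first, for example
with the same grep filters as in the code above, and pipe them into
motif_summary.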