{"id":20153557,"url":"https://github.com/seqan/anise_basil","last_synced_at":"2026-03-07T20:31:55.012Z","repository":{"id":27675077,"uuid":"31161223","full_name":"seqan/anise_basil","owner":"seqan","description":"Methods for the detection and assembly of novel sequence in high-throughput sequencing data","archived":false,"fork":false,"pushed_at":"2017-08-17T21:15:06.000Z","size":1060,"stargazers_count":6,"open_issues_count":3,"forks_count":0,"subscribers_count":5,"default_branch":"master","last_synced_at":"2026-02-11T02:26:36.749Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/seqan.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-02-22T11:41:02.000Z","updated_at":"2024-12-06T06:03:25.000Z","dependencies_parsed_at":"2022-09-03T03:42:22.718Z","dependency_job_id":null,"html_url":"https://github.com/seqan/anise_basil","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/seqan/anise_basil","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seqan%2Fanise_basil","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seqan%2Fanise_basil/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seqan%2Fanise_basil/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seqan%2Fanise_basil/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/seqan","download_url":"https://codeload.github.com/seqan/anise_basil/tar.gz/refs/heads/master","sbom_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seqan%2Fanise_basil/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30229743,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-07T19:01:10.287Z","status":"ssl_error","status_checked_at":"2026-03-07T18:59:58.103Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-13T23:19:43.197Z","updated_at":"2026-03-07T20:31:54.961Z","avatar_url":"https://github.com/seqan.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Build Status](https://travis-ci.org/seqan/anise_basil.svg?branch=master)](https://travis-ci.org/seqan/anise_basil)\n\n# Anise \u0026 Basil\n\nBASIL is a method to detect breakpoints for structural variants (including insertion breakpoints) from aligned paired HTS reads in BAM format.\nANISE is a method for the assembly of large insertions from paired reads in BAM format and a list of candidate insert breakpoints as generated by BASIL.\n\n## Quickstart\n\n### Obtaining and Compiling\n\nThe following instructions explain how to obtain, compile, and use ANISE on a Linux or Mac OS X system.\nYou will need to install a C++ compiler with sufficient C++11 support (a fairly recent Linux distribution or a copy of the Xcode developer tools should work) and [CMake].\nAlso, you have to install [Boost] (\u003e= 
1.51.0).\n\nFor obtaining the software, use the following instructions for getting ANISE/BASIL and SeqAn.\n\n```\n~ # git clone https://github.com/seqan/anise_basil.git\n~ # cd anise_basil\nanise_basil # git checkout master\nanise_basil # git submodule init\nanise_basil # git submodule update --recursive\n```\n\nThen, compile the program.\n\n```\nanise_basil # cd build\nbuild # cmake ..\nbuild # make -j 4 anise basil\n```\n\nYou can control the number of cores to use for compiling with the `-j` parameter to `make`, thus the code above gives an example for the compilation with 4 cores.\nYou can now have a look at the command line help for both programs:\n\n```\n# ./bin/basil -h\n# ./bin/anise -h\n```\n\n### A First Example\n\nThe directory examples contains a reference file `ref.fa` and two reads files `left.fq.gz` and `right.fq.gz`.\nThe reference has a length of 10kb and the reads are sequenced from a donor that has a 2kb insertion into the reference.\n\nThe first step is to map the reads to the reference, e.g. using BWA, and to get a sorted and indexed BAM file, e.g. 
using Samtools.\n\n```\n# bwa index ref.fa\n# bwa aln -f left.fq.gz.sai ref.fa left.fq.gz\n# bwa aln -f right.fq.gz.sai ref.fa right.fq.gz\n# bwa sampe ref.fa left.fq.gz.sai right.fq.gz.sai left.fq.gz right.fq.gz \\\n    | samtools view -Sb - | samtools sort - simulated\n# samtools index simulated.bam\n```\n\nNext, use BASIL to analyze the BAM file for tentative insertion sites.\n\n```\n# basil -ir ref.fa -im simulated.bam -ov basil.vcf\n...\n# cat basil.vcf\n##fileformat=VCFv4.1\n##source=BASIL\n##reference=ref.fa\n##INFO=\u003cID=IMPRECISE,Number=0,Type=Flag,Description=\"Imprecise structural va...\n##INFO=\u003cID=SVTYPE,Number=1,Type=String,Description=\"Type of structural varia...\n##INFO=\u003cID=OEA_ONLY,Number=0,Type=Flag,Description=\"Breakpoint support by OE...\n##ALT=\u003cID=INS,Description=\"Insertion of novel sequence\"\u003e\n##FORMAT=\u003cID=GSCORE,Number=1,Type=String,Description=\"Sum of Geometric score...\n##FORMAT=\u003cID=CLEFT,Number=1,Type=String,Description=\"Clipped alignments supp...\n##FORMAT=\u003cID=CRIGHT,Number=1,Type=String,Description=\"Clipped alignments sup...\n##FORMAT=\u003cID=OEALEFT,Number=1,Type=String,Description=\"One-end anchored alig...\n##FORMAT=\u003cID=OEARIGHT,Number=1,Type=String,Description=\"One-end anchored ali...\n##contig=\u003cID=1,length=10000\u003e\n#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  indi...\n1       5001    site_0  T       \u003cINS\u003e   .       PASS    IMPRECISE;SVTYPE=INS...\n```\n\nA short description of the fields in the VCF file is given below in the section File Formats.\n\nUsually, the next step is to filter the VCF to sites with a minimal support,\ne.g. 
for 30x coverage:\n\n```\n# filter_basil.py -i basil.vcf -o basil.filtered.vcf --min-oea-each-side 10\n```\n\nThis file now serves as the input to ANISE:\n\n```\n# anise -ir ref.fa -im simulated.bam -iv basil.filtered.vcf -of anise.fa\n...\n```\n\nThe result file now contains one assembled insert with some annotation in the FASTA meta line.\n\n```\n# cat anise.fa\n\u003esite_0_contig_0 REF=1 POS=5001 STEPS=6  ANCHORED_LEFT=yes ANCHORED_RIGHT=yes SPANNING=yes STOPPED=no_more_reads\nGGGCTTCGCCTAGGGTCTCGGGAGAAATCTAGGGACCCCAATCTATTAGACGAACACGTCCAGGGCATGG\nTCAGGTATACACCTTCCGACTAGACGTGTTCGAAGATTCGGGAAAATTACCTGAAGAGCCCCCGTAAGCC\nGTAGTAGAAGAGGACACTTCATTTAAACAATACCGAAAAAGTGTCTTGGCAGACCGTATCTTCACAGGGC\nCGAAGCACTTTTGGCAGGCTTATAAACGCCCAGAATGAAGCACTCGCCATAGGTGGAAACCTTTAAGCGA\nCGCGGGGTGTGTCGGCCCTATCCCTTGCGCTTACAGACTTTATTTCTTCGTGAGGGAGTTGACCCATGCA\n```\n\nAn easy to use tool for verifying this example is [EMBOSS Needle] which you can use to compare `anise.fa` with the actual insertion in `example/ins.fa`.\n\nNote that BASIL will in general detect all kinds of breakpoints, e.g. for inversions on real-world data.\nANISE will try to assemble around these breakpoints but the assembly processes from the left and right will (generally) not meet.\nA good heuristic is to only keep those ANISE contigs where the assembly process met.\n\nThis might exclude some sites where the assembly process broke, e.g. 
because of repeats, but it gives you a good start.\nYou can always look at the remaining ANISE contigs later.\n\nYou can use the script extract_spanning.awk to obtain the contigs marked as \"spanning\" by ANISE.\n\n```\n# extract_spanning anise.fa \u003e anise.filtered.fa\n```\n\nNote that ANISE also assembles sequence left and right of the insert.\nThis can be used to anchor the inserted sequence to the reference using BLAT.\nThe executables blat and pslPretty are available from the [UCSC download site].\n\nLet us now use BLAT to align the contig back to our reference.\n\n```\n# blat ref.fa anise.filtered.fa matches.psl\n```\n\nThe file matches.psl only contains one line.\nWe can now visualize this alignment using pslPretty:\n\n```\n\u003esite_0_contig_0:0+2592 of 2592 1:4716+5308 of 10000\nGGGCTTCGCCTAGGGTCTCGGGAGAAATCTAGGGACCCCAATCTATTAGACGAACACGTC\n||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||\nGGGCTTCGCCTAGGGTCTCGGGAGAAATCTAGGGACCCCAATCTATTAGACGAACACGTC\n\nCAGGGCATGGTCAGGTATACACCTTCCGACTAGACGTGTTCGAAGATTCGGGAAAATTAC\n||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||\nCAGGGCATGGTCAGGTATACACCTTCCGACTAGACGTGTTCGAAGATTCGGGAAAATTAC\n\nCTGAAGAGCCCCCGTAAGCCGTAGTAGAAGAGGACACTTCATTTAAACAATACCGAAAAA\n||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||\nCTGAAGAGCCCCCGTAAGCCGTAGTAGAAGAGGACACTTCATTTAAACAATACCGAAAAA\n\nGTGTCTTGGCAGACCGTATCTTCACAGGGCCGAAGCACTTTTGGCAGGCTTATAAACGCC\n||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||\nGTGTCTTGGCAGACCGTATCTTCACAGGGCCGAAGCACTTTTGGCAGGCTTATAAACGCC\n\nCAGAATGAAGCACTCGCCATAGGTGGAAACCTTTAAGCGACGCGGGGTGT...TCAGGTT\n||||||||||||||||||||||||||||||||||||||||||||               |\nCAGAATGAAGCACTCGCCATAGGTGGAAACCTTTAAGCGACGCG-----2000------T\n\nTGGGTCCGCGCAGCGCCAACGATTTCAACCGGGAGACGTTCGTTCATGATGAGAAGACGG\n||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||\nTGGGTCCGCGCAGCGCCAACGATTTCAACCGGGAGACGTTCGTTCATGATGAGAAGACGG\n\n...\n```\n\nWhen using ANISE on real data, it might be useful to 
extract the best match for each query from the BLAT file by its BLAT score.\nYou can do this using the script `best_blat.py`.\n`best_blat.py` computes the BLAT identity and score as computed on the BLAT web front-end; these values are not written out by command line blat.\n\nNote that the script prepends the BLAT identity, query coverage, target coverage, and BLAT score as the first four columns of the output TSV file.\n\n```\n# best_blat.py -b matches.psl | column -t\n#identity  query_coverage  target_coverage  blat_score  matches  mismatches...\n95.9       22.8            5.9              591         592      0         ...\n```\n\n## File Formats\n\n### BASIL VCF fields\n\nA typical line in the BASIL VCF output might look as follows.\n\n```\n1 5001 site_0 T \u003cINS\u003e   . PASS IMPRECISE;SVTYPE=INS GSCORE:CLEFT:CRIGHT:OEALEFT:OEARIGHT    46.4256:10:12:35:32\n```\n\nThe first eight columns are as usual in VCF files (ref name, 1-based position, site ID, reference base, abbreviation for long insertion, no assigned quality, passing all filters, imprecise insertion SV).\n\nThe ninth column contains the names of the score values given in the tenth column:\n\n* `GSCORE` Geometric mean of the sum of \"1 + $score\" for all of the following scores.\n* `CLEFT` Number of clipping signatures supporting the site from the left side.\n* `CRIGHT` Number of clipping signatures supporting the site from the right side.\n* `OEALEFT` Number of OEA alignments supporting the site from the left.\n* `OEARIGHT` Number of OEA alignments supporting the site from the right.\n\nGenerally, one should filter for a minimum support of OEA records on each side, e.g. 
a value of 10 makes sense for a 30x coverage and showed good results on simulated data.\n\nFor a ranking, GSCORE is a suitable measure, but we did not develop any statistical model for BASIL matches and it is a mean of pseudocounts only.\nIt carries no statistically precise meaning.\n\n### ANISE FASTA meta tags\n\nA typical FASTA line as written by ANISE looks as follows (we deliberately introduced a line break here for text wrapping).\n\n```\n\u003esite_0_contig_0 REF=1 POS=5001 STEPS=6  ANCHORED_LEFT=yes ANCHORED_RIGHT=yes SPANNING=yes STOPPED=no_more_reads\n```\n\nThis means that the contig with the name \"site_0_contig_0\" was generated for a site as called by BASIL on the reference named \"1\" at 1-based position 5,001 after 6 ANISE assembly steps.\nThe site is expected to be anchored on the left and right side, meaning that there should be at least two reads in the site that aligned on the forward strand in the input mapping and two reads that aligned on the reverse strand in the input mapping.\n\nAlso, ANISE could find a path from the left to the right end only using overlap and paired link information, i.e. the assembly did not stop without meeting.\nFurthermore, the assembly was stopped since there were not sufficient reads mapped on this contig to continue assembly.\nIf the value of STOPPED was \"too_many_reads\", too many reads (thousands!) mapped to the site, which is an indicator for the site containing low-complexity regions or highly repetitive sequence.\nANISE gives up in this case.\n\n## Frequently Asked Questions\n\n* Q: ANISE does not start at step 1 and stops without sensible results.\n* A: ANISE tries to continue where it left off earlier.  For this, it creates a directory with temporary files.  If the execution stops before the first assembly step, the directory might contain corrupted state.\n* Solution: Simply remove the temporary directory.  
If the output file name is `output.fa`, the temporary directory will be called `output.fa.tmp`.\n\n## References\n\n* Holtgrewe, M., Kuchenbecker, L., \u0026 Reinert, K. (2015).\n  [Methods for the Detection and Assembly of Novel Sequence in High-Throughput Sequencing Data].\n  *Bioinformatics*, btv051.\n\n## Contact\n\nFor questions, comments, or suggestions, please file a [GitHub issue] or send an email to [Manuel Holtgrewe].\n\n[Boost]: http://www.boost.org\n[CMake]: http://www.cmake.org\n[EMBOSS Needle]: https://www.ebi.ac.uk/Tools/psa/emboss_needle/nucleotide.html\n[GitHub issue]: https://github.com/seqan/anise_basil/issues\n[Lemon]: http://lemon.cs.elte.hu/trac/lemon\n[Manuel Holtgrewe]: mailto:manuel.holtgrewe@fu-berlin.de\n[Methods for the Detection and Assembly of Novel Sequence in High-Throughput Sequencing Data]: http://bioinformatics.oxfordjournals.org/content/early/2015/02/01/bioinformatics.btv051.short\n[UCSC download site]: http://hgdownload.soe.ucsc.edu/downloads.html#source_downloads\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fseqan%2Fanise_basil","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fseqan%2Fanise_basil","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fseqan%2Fanise_basil/lists"}