{"id":22632346,"url":"https://github.com/lmdu/pyfastx","last_synced_at":"2025-05-15T04:07:37.512Z","repository":{"id":39176233,"uuid":"176523301","full_name":"lmdu/pyfastx","owner":"lmdu","description":"a python package for fast random access to sequences from plain and gzipped FASTA/Q files","archived":false,"fork":false,"pushed_at":"2024-12-26T21:46:26.000Z","size":9868,"stargazers_count":278,"open_issues_count":27,"forks_count":23,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-04-15T01:58:38.696Z","etag":null,"topics":["assembly","bioinformatics","biology","dna","fasta","fastq","genome","python","sequence"],"latest_commit_sha":null,"homepage":"https://pyfastx.readthedocs.io","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lmdu.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-03-19T13:55:46.000Z","updated_at":"2025-04-12T12:22:53.000Z","dependencies_parsed_at":"2023-02-17T16:30:29.059Z","dependency_job_id":"c6483e08-d924-4af0-bfa8-648c57a83f7c","html_url":"https://github.com/lmdu/pyfastx","commit_stats":{"total_commits":595,"total_committers":8,"mean_commits":74.375,"dds":0.04537815126050415,"last_synced_commit":"ce4ffc8f2207d20f4c49ed5f6fa97c067a2a269b"},"previous_names":[],"tags_count":83,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lmdu%2Fpyfastx","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lmdu%2Fpyfastx/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s/lmdu%2Fpyfastx/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lmdu%2Fpyfastx/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lmdu","download_url":"https://codeload.github.com/lmdu/pyfastx/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254270646,"owners_count":22042859,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["assembly","bioinformatics","biology","dna","fasta","fastq","genome","python","sequence"],"created_at":"2024-12-09T02:17:10.055Z","updated_at":"2025-05-15T04:07:32.482Z","avatar_url":"https://github.com/lmdu.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"pyfastx\n#######\n\n.. image:: https://github.com/lmdu/pyfastx/actions/workflows/wheel.yml/badge.svg\n   :target: https://github.com/lmdu/pyfastx/actions/workflows/wheel.yml\n   :alt: Action\n\n.. image:: https://readthedocs.org/projects/pyfastx/badge/?version=latest\n   :target: https://pyfastx.readthedocs.io/en/latest/?badge=latest\n   :alt: Readthedocs\n\n.. image:: https://codecov.io/gh/lmdu/pyfastx/branch/master/graph/badge.svg\n   :target: https://codecov.io/gh/lmdu/pyfastx\n   :alt: Codecov\n\n.. image:: https://img.shields.io/pypi/v/pyfastx.svg\n   :target: https://pypi.org/project/pyfastx\n   :alt: PyPI\n\n.. image:: https://img.shields.io/pypi/wheel/pyfastx.svg\n   :target: https://pypi.org/project/pyfastx\n   :alt: Wheel\n\n.. 
image:: https://api.codacy.com/project/badge/Grade/80790fa30f444d9d9ece43689d512dae\n   :target: https://www.codacy.com/manual/lmdu/pyfastx?utm_source=github.com\u0026amp;utm_medium=referral\u0026amp;utm_content=lmdu/pyfastx\u0026amp;utm_campaign=Badge_Grade\n   :alt: Codacy\n\n.. image:: https://img.shields.io/pypi/implementation/pyfastx\n   :target: https://pypi.org/project/pyfastx\n   :alt: Language\n\n.. image:: https://img.shields.io/pypi/pyversions/pyfastx.svg\n   :target: https://pypi.org/project/pyfastx\n   :alt: Pyver\n\n.. image:: https://img.shields.io/pypi/dm/pyfastx\n   :target: https://pypi.org/project/pyfastx\n   :alt: Downloads\n\n.. image:: https://img.shields.io/pypi/l/pyfastx\n   :target: https://pypi.org/project/pyfastx\n   :alt: License\n\n.. image:: https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat\n   :target: http://bioconda.github.io/recipes/pyfastx/README.html\n   :alt: Bioconda\n\n**Citation:** \n`Lianming Du, Qin Liu, Zhenxin Fan, Jie Tang, Xiuyue Zhang, Megan Price, Bisong Yue, Kelei Zhao. Pyfastx: a robust Python package for fast random access to sequences from plain and gzipped FASTA/Q files. Briefings in Bioinformatics, 2021, 22(4):bbaa368 \u003chttps://doi.org/10.1093/bib/bbaa368\u003e`_.\n\n.. contents:: Table of Contents\n\nIntroduction\n============\n\nThe ``pyfastx`` is a lightweight Python C extension that enables users to randomly access to sequences from plain and **gzipped** FASTA/Q files. This module aims to provide simple APIs for users to extract seqeunce from FASTA and reads from FASTQ by identifier and index number. The ``pyfastx`` will build indexes stored in a sqlite3 database file for random access to avoid consuming excessive amount of memory. In addition, the ``pyfastx`` can parse standard (*sequence is spread into multiple lines with same length*) and nonstandard (*sequence is spread into one or more lines with different length*) FASTA format. 
This module used `kseq.h \u003chttps://github.com/attractivechaos/klib/blob/master/kseq.h\u003e`_ written by `@attractivechaos \u003chttps://github.com/attractivechaos\u003e`_ in `klib \u003chttps://github.com/attractivechaos/klib\u003e`_ project to parse plain FASTA/Q file and zran.c written by `@pauldmccarthy \u003chttps://github.com/pauldmccarthy\u003e`_ in project `indexed_gzip \u003chttps://github.com/pauldmccarthy/indexed_gzip\u003e`_ to index gzipped file for random access.\n\nThis project was heavily inspired by `@mdshw5 \u003chttps://github.com/mdshw5\u003e`_'s project `pyfaidx \u003chttps://github.com/mdshw5/pyfaidx\u003e`_ and `@brentp \u003chttps://github.com/brentp\u003e`_'s project `pyfasta \u003chttps://github.com/brentp/pyfasta\u003e`_.\n\nFeatures\n========\n\n- Single file for the Python extension\n- Lightweight, memory efficient for parsing FASTA/Q file\n- Fast random access to sequences from ``gzipped`` FASTA/Q file\n- Read sequences from FASTA file line by line\n- Calculate N50 and L50 of sequences in FASTA file\n- Calculate GC content and nucleotides composition\n- Extract reverse, complement and antisense sequences\n- Excellent compatibility, support for parsing nonstandard FASTA file\n- Support for FASTQ quality score conversion\n- Provide command line interface for splitting FASTA/Q file\n\nInstallation\n============\n\nCurrently, ``pyfastx`` supports Python 3.6, 3.7, 3.8, 3.9, 3.10, 3.11. Make sure you have installed both `pip \u003chttps://pip.pypa.io/en/stable/installing/\u003e`_ and Python before starting.\n\nYou can install ``pyfastx`` via the Python Package Index (PyPI)\n\n::\n\n    pip install pyfastx\n\nUpdate ``pyfastx`` module\n\n::\n\n\tpip install -U pyfastx\n\nFASTX\n=====\n\nNew in ``pyfastx`` 0.8.0.\n\nPyfastx provide a simple and fast python binding for kseq.h to iterate over sequences or reads in fasta/q file. 
The FASTX object will automatically detect the input sequence format (fasta or fastq) to return different tuple.\n\nFASTA sequences iteration\n-------------------------\n\nWhen iterating over sequences on FASTX object, a tuple ``(name, seq)`` will be returned.\n\n.. code:: python\n\n    \u003e\u003e\u003e fa = pyfastx.Fastx('tests/data/test.fa.gz')\n    \u003e\u003e\u003e for name,seq in fa:\n    \u003e\u003e\u003e     print(name)\n    \u003e\u003e\u003e     print(seq)\n\n    \u003e\u003e\u003e #always output uppercase sequence\n    \u003e\u003e\u003e for item in pyfastx.Fastx('tests/data/test.fa', uppercase=True):\n    \u003e\u003e\u003e     print(item)\n\n    \u003e\u003e\u003e #Manually specify sequence format\n    \u003e\u003e\u003e for item in pyfastx.Fastx('tests/data/test.fa', format=\"fasta\"):\n    \u003e\u003e\u003e     print(item)\n\nIf you want the sequence comment, you can set comment to True, New in ``pyfastx`` 0.9.0.\n\n.. code:: python\n\n    \u003e\u003e\u003e fa = pyfastx.Fastx('tests/data/test.fa.gz', comment=True)\n    \u003e\u003e\u003e for name,seq,comment in fa:\n    \u003e\u003e\u003e     print(name)\n    \u003e\u003e\u003e     print(seq)\n    \u003e\u003e\u003e     print(comment)\n\nThe comment is the content of header line after the first white space or tab character.\n\nFASTQ reads iteration\n---------------------\n\nWhen iterating over reads on FASTX object, a tuple ``(name, seq, qual)`` will be returned.\n\n.. code:: python\n\n    \u003e\u003e\u003e fq = pyfastx.Fastx('tests/data/test.fq.gz')\n    \u003e\u003e\u003e for name,seq,qual in fq:\n    \u003e\u003e\u003e     print(name)\n    \u003e\u003e\u003e     print(seq)\n    \u003e\u003e\u003e     print(qual)\n\nIf you want the read comment, you can set comment to True, New in ``pyfastx`` 0.9.0.\n\n.. 
code:: python\n\n    \u003e\u003e\u003e fq = pyfastx.Fastx('tests/data/test.fq.gz', comment=True)\n    \u003e\u003e\u003e for name,seq,qual,comment in fq:\n    \u003e\u003e\u003e     print(name)\n    \u003e\u003e\u003e     print(seq)\n    \u003e\u003e\u003e     print(qual)\n    \u003e\u003e\u003e     print(comment)\n\nThe comment is the content of header line after the first white space or tab character.\n\nFASTA\n=====\n\nRead FASTA file\n---------------\n\nRead plain or gzipped FASTA file and build index, support for random access to FASTA.\n\n.. code:: python\n\n    \u003e\u003e\u003e import pyfastx\n    \u003e\u003e\u003e fa = pyfastx.Fasta('test/data/test.fa.gz')\n    \u003e\u003e\u003e fa\n    \u003cFasta\u003e test/data/test.fa.gz contains 211 seqs\n\n.. note::\n    Building index may take some times. The time required to build index depends on the size of FASTA file. If index built, you can randomly access to any sequences in FASTA file. The index file can be reused to save time when you read seqeunces from FASTA file next time.\n\nFASTA records iteration\n-----------------------\n\nThe fastest way to iterate plain or gzipped FASTA file without building index, the iteration will return a tuple contains name and sequence.\n\n.. code:: python\n\n    \u003e\u003e\u003e import pyfastx\n    \u003e\u003e\u003e for name, seq in pyfastx.Fasta('test/data/test.fa.gz', build_index=False):\n    \u003e\u003e\u003e     print(name, seq)\n\nYou can also iterate sequence object from FASTA object like this:\n\n.. code:: python\n\n    \u003e\u003e\u003e import pyfastx\n    \u003e\u003e\u003e for seq in pyfastx.Fasta('test/data/test.fa.gz'):\n    \u003e\u003e\u003e     print(seq.name)\n    \u003e\u003e\u003e     print(seq.seq)\n    \u003e\u003e\u003e     print(seq.description)\n\nIteration with ``build_index=True`` (default) return sequence object which allows you to access attributions of sequence. New in pyfastx 0.6.3.\n\n\nGet FASTA information\n---------------------\n\n.. 
code:: python\n\n    \u003e\u003e\u003e # get sequence counts in FASTA\n    \u003e\u003e\u003e len(fa)\n    211\n\n    \u003e\u003e\u003e # get total sequence length of FASTA\n    \u003e\u003e\u003e fa.size\n    86262\n\n    \u003e\u003e\u003e # get GC content of DNA sequence of FASTA\n    \u003e\u003e\u003e fa.gc_content\n    43.529014587402344\n\n    \u003e\u003e\u003e # get GC skew of DNA sequences in FASTA\n    \u003e\u003e\u003e # New in pyfastx 0.3.8\n    \u003e\u003e\u003e fa.gc_skew\n    0.004287730902433395\n\n    \u003e\u003e\u003e # get composition of nucleotides in FASTA\n    \u003e\u003e\u003e fa.composition\n    {'A': 24534, 'C': 18694, 'G': 18855, 'T': 24179}\n\n    \u003e\u003e\u003e # get fasta type (DNA, RNA, or protein)\n    \u003e\u003e\u003e fa.type\n    'DNA'\n\n    \u003e\u003e\u003e # check fasta file is gzip compressed\n    \u003e\u003e\u003e fa.is_gzip\n    True\n\nGet longest and shortest sequence\n---------------------------------\n\nNew in ``pyfastx`` 0.3.0\n\n.. code:: python\n\n    \u003e\u003e\u003e # get longest sequence\n    \u003e\u003e\u003e s = fa.longest\n    \u003e\u003e\u003e s\n    \u003cSequence\u003e JZ822609.1 with length of 821\n\n    \u003e\u003e\u003e s.name\n    'JZ822609.1'\n\n    \u003e\u003e\u003e len(s)\n    821\n\n    \u003e\u003e\u003e # get shortest sequence\n    \u003e\u003e\u003e s = fa.shortest\n    \u003e\u003e\u003e s\n    \u003cSequence\u003e JZ822617.1 with length of 118\n\n    \u003e\u003e\u003e s.name\n    'JZ822617.1'\n\n    \u003e\u003e\u003e len(s)\n    118\n\nCalculate N50 and L50\n---------------------\n\nNew in ``pyfastx`` 0.3.0\n\nCalculate assembly N50 and L50, return (N50, L50), learn more about `N50,L50 \u003chttps://www.molecularecologist.com/2017/03/whats-n50/\u003e`_\n\n.. 
code:: python\n\n\t\u003e\u003e\u003e # get FASTA N50 and L50\n\t\u003e\u003e\u003e fa.nl(50)\n\t(516, 66)\n\n\t\u003e\u003e\u003e # get FASTA N90 and L90\n\t\u003e\u003e\u003e fa.nl(90)\n\t(231, 161)\n\n\t\u003e\u003e\u003e # get FASTA N75 and L75\n\t\u003e\u003e\u003e fa.nl(75)\n\t(365, 117)\n\nGet sequence mean and median length\n-----------------------------------\n\nNew in ``pyfastx`` 0.3.0\n\n.. code:: python\n\n\t\u003e\u003e\u003e # get sequence average length\n\t\u003e\u003e\u003e fa.mean\n\t408\n\n\t\u003e\u003e\u003e # get sequence median length\n\t\u003e\u003e\u003e fa.median\n\t430\n\nGet sequence counts\n-------------------\n\nNew in ``pyfastx`` 0.3.0\n\nGet counts of sequences whose length \u003e= specified length\n\n.. code:: python\n\n\t\u003e\u003e\u003e # get counts of sequences with length \u003e= 200 bp\n\t\u003e\u003e\u003e fa.count(200)\n\t173\n\n\t\u003e\u003e\u003e # get counts of sequences with length \u003e= 500 bp\n\t\u003e\u003e\u003e fa.count(500)\n\t70\n\nGet subsequences\n----------------\n\nSubsequences can be retrieved from FASTA file by using a list of [start, end] coordinates\n\n.. 
code:: python\n\n    \u003e\u003e\u003e # get subsequence with start and end position\n    \u003e\u003e\u003e interval = (1, 10)\n    \u003e\u003e\u003e fa.fetch('JZ822577.1', interval)\n    'CTCTAGAGAT'\n\n    \u003e\u003e\u003e # get subsequences with a list of start and end position\n    \u003e\u003e\u003e intervals = [(1, 10), (50, 60)]\n    \u003e\u003e\u003e fa.fetch('JZ822577.1', intervals)\n    'CTCTAGAGATTTTAGTTTGAC'\n\n    \u003e\u003e\u003e # get subsequences with reverse strand\n    \u003e\u003e\u003e fa.fetch('JZ822577.1', (1, 10), strand='-')\n    'ATCTCTAGAG'\n\nKey function\n------------\n\nNew in ``pyfastx`` 0.5.1\n\nSometimes your fasta will have a long header which contains multiple identifiers and description, for example, \"\u003eJZ822577.1 contig1 cDNA library of flower petals in tree peony by suppression subtractive hybridization Paeonia suffruticosa cDNA, mRNA sequence\". In this case, both \"JZ822577.1\" and \"contig1\" can be used as identifer. you can specify the key function to select one as identifier.\n\n.. code:: python\n\n\t\u003e\u003e\u003e #default use JZ822577.1 as identifier\n\t\u003e\u003e\u003e #specify key_func to select contig1 as identifer\n\t\u003e\u003e\u003e fa = pyfastx.Fasta('tests/data/test.fa.gz', key_func=lambda x: x.split()[1])\n\t\u003e\u003e\u003e fa\n\t\u003cFasta\u003e tests/data/test.fa.gz contains 211 seqs\n\nSequence\n========\n\nGet a sequence from FASTA\n-------------------------\n\n.. 
code:: python\n\n    \u003e\u003e\u003e # get sequence like a dictionary by identifier\n    \u003e\u003e\u003e s1 = fa['JZ822577.1']\n    \u003e\u003e\u003e s1\n    \u003cSequence\u003e JZ822577.1 with length of 333\n\n    \u003e\u003e\u003e # get sequence like a list by index\n    \u003e\u003e\u003e s2 = fa[2]\n    \u003e\u003e\u003e s2\n    \u003cSequence\u003e JZ822579.1 with length of 176\n\n    \u003e\u003e\u003e # get last sequence\n    \u003e\u003e\u003e s3 = fa[-1]\n    \u003e\u003e\u003e s3\n    \u003cSequence\u003e JZ840318.1 with length of 134\n\n    \u003e\u003e\u003e # check a sequence name weather in FASTA file\n    \u003e\u003e\u003e 'JZ822577.1' in fa\n    True\n\nGet sequence information\n------------------------\n\n.. code:: python\n\n    \u003e\u003e\u003e s = fa[-1]\n    \u003e\u003e\u003e s\n    \u003cSequence\u003e JZ840318.1 with length of 134\n\n    \u003e\u003e\u003e # get sequence order number in FASTA file\n    \u003e\u003e\u003e # New in pyfastx 0.3.7\n    \u003e\u003e\u003e s.id\n    211\n\n    \u003e\u003e\u003e # get sequence name\n    \u003e\u003e\u003e s.name\n    'JZ840318.1'\n\n    \u003e\u003e\u003e # get sequence description\n    \u003e\u003e\u003e # New in pyfastx 0.3.1\n    \u003e\u003e\u003e s.description\n    'R283 cDNA library of flower petals in tree peony by suppression subtractive hybridization Paeonia suffruticosa cDNA, mRNA sequence'\n\n    \u003e\u003e\u003e # get sequence string\n    \u003e\u003e\u003e s.seq\n    'ACTGGAGGTTCTTCTTCCTGTGGAAAGTAACTTGTTTTGCCTTCACCTGCCTGTTCTTCACATCAACCTTGTTCCCACACAAAACAATGGGAATGTTCTCACACACCCTGCAGAGATCACGATGCCATGTTGGT'\n\n    \u003e\u003e\u003e # get sequence raw string, New in pyfastx 0.6.3\n    \u003e\u003e\u003e print(s.raw)\n    \u003eJZ840318.1 R283 cDNA library of flower petals in tree peony by suppression subtractive hybridization Paeonia suffruticosa cDNA, mRNA sequence\n    ACTGGAGGTTCTTCTTCCTGTGGAAAGTAACTTGTTTTGCCTTCACCTGCCTGTTCTTCACATCAACCTT\n    
GTTCCCACACAAAACAATGGGAATGTTCTCACACACCCTGCAGAGATCACGATGCCATGTTGGT\n\n    \u003e\u003e\u003e # get sequence length\n    \u003e\u003e\u003e len(s)\n    134\n\n    \u003e\u003e\u003e # get GC content if dna sequence\n    \u003e\u003e\u003e s.gc_content\n    46.26865768432617\n\n    \u003e\u003e\u003e # get nucleotide composition if dna sequence\n    \u003e\u003e\u003e s.composition\n    {'A': 31, 'C': 37, 'G': 25, 'T': 41, 'N': 0}\n\nSequence slice\n--------------\n\nSequence object can be sliced like a python string\n\n.. code:: python\n\n    \u003e\u003e\u003e # get a sub seq from sequence\n    \u003e\u003e\u003e s = fa[-1]\n    \u003e\u003e\u003e ss = s[10:30]\n    \u003e\u003e\u003e ss\n    \u003cSequence\u003e JZ840318.1 from 11 to 30\n\n    \u003e\u003e\u003e ss.name\n    'JZ840318.1:11-30'\n\n    \u003e\u003e\u003e ss.seq\n    'CTTCTTCCTGTGGAAAGTAA'\n\n    \u003e\u003e\u003e ss = s[-10:]\n    \u003e\u003e\u003e ss\n    \u003cSequence\u003e JZ840318.1 from 125 to 134\n\n    \u003e\u003e\u003e ss.name\n    'JZ840318.1:125-134'\n\n    \u003e\u003e\u003e ss.seq\n    'CCATGTTGGT'\n\n\n.. note::\n\n\tSlicing start and end coordinates are 0-based. Currently, pyfastx does not support an optional third ``step`` or ``stride`` argument. For example ``ss[::-1]``\n\nReverse and complement sequence\n-------------------------------\n\n.. 
code:: python\n\n    \u003e\u003e\u003e # get sliced sequence\n    \u003e\u003e\u003e fa[0][10:20].seq\n    'GTCAATTTCC'\n\n    \u003e\u003e\u003e # get reverse of sliced sequence\n    \u003e\u003e\u003e fa[0][10:20].reverse\n    'CCTTTAACTG'\n\n    \u003e\u003e\u003e # get complement of sliced sequence\n    \u003e\u003e\u003e fa[0][10:20].complement\n    'CAGTTAAAGG'\n\n    \u003e\u003e\u003e # get reversed complement sequence, corresponding to sequence in antisense strand\n    \u003e\u003e\u003e fa[0][10:20].antisense\n    'GGAAATTGAC'\n\nRead sequence line by line\n--------------------------\n\nNew in ``pyfastx`` 0.3.0\n\nThe sequence object can be iterated line by line as they appear in FASTA file.\n\n.. code:: python\n\n\t\u003e\u003e\u003e for line in fa[0]:\n\t... \tprint(line)\n\t...\n\tCTCTAGAGATTACTTCTTCACATTCCAGATCACTCAGGCTCTTTGTCATTTTAGTTTGACTAGGATATCG\n\tAGTATTCAAGCTCATCGCTTTTGGTAATCTTTGCGGTGCATGCCTTTGCATGCTGTATTGCTGCTTCATC\n\tATCCCCTTTGACTTGTGTGGCGGTGGCAAGACATCCGAAGAGTTAAGCGATGCTTGTCTAGTCAATTTCC\n\tCCATGTACAGAATCATTGTTGTCAATTGGTTGTTTCCTTGATGGTGAAGGGGCTTCAATACATGAGTTCC\n\tAAACTAACATTTCTTGACTAACACTTGAGGAAGAAGGACAAGGGTCCCCATGT\n\n.. note::\n\n    Sliced sequence (e.g. fa[0][10:50]) cannot be read line by line\n\nSearch for subsequence\n----------------------\n\nNew in ``pyfastx`` 0.3.6\n\nSearch for subsequence from given sequence and get one-based start position of the first occurrence\n\n.. code:: python\n\n    \u003e\u003e\u003e # search subsequence in sense strand\n    \u003e\u003e\u003e fa[0].search('GCTTCAATACA')\n    262\n\n    \u003e\u003e\u003e # check subsequence weather in sequence\n    \u003e\u003e\u003e 'GCTTCAATACA' in fa[0]\n    True\n\n    \u003e\u003e\u003e # search subsequence in antisense strand\n    \u003e\u003e\u003e fa[0].search('CCTCAAGT', '-')\n    301\n\nFastaKeys\n=========\n\nNew in ``pyfastx`` 0.8.0. 
We have changed ``Identifier`` object to ``FastaKeys`` object.\n\nGet keys\n--------------\n\nGet all names of sequence as a list-like object.\n\n.. code:: python\n\n    \u003e\u003e\u003e ids = fa.keys()\n    \u003e\u003e\u003e ids\n    \u003cFastaKeys\u003e contains 211 keys\n\n    \u003e\u003e\u003e # get count of sequence\n    \u003e\u003e\u003e len(ids)\n    211\n\n    \u003e\u003e\u003e # get key by index\n    \u003e\u003e\u003e ids[0]\n    'JZ822577.1'\n\n    \u003e\u003e\u003e # check whether key is in fasta\n    \u003e\u003e\u003e 'JZ822577.1' in ids\n    True\n\n    \u003e\u003e\u003e # iterate over keys\n    \u003e\u003e\u003e for name in ids:\n    \u003e\u003e\u003e     print(name)\n\n    \u003e\u003e\u003e # convert to a list\n    \u003e\u003e\u003e list(ids)\n\nSort keys\n----------------\n\nSort keys by sequence id, name, or length for iteration\n\nNew in ``pyfastx`` 0.5.0\n\n.. code:: python\n\n    \u003e\u003e\u003e # sort keys by length with descending order\n    \u003e\u003e\u003e for name in ids.sort(by='length', reverse=True):\n    \u003e\u003e\u003e     print(name)\n\n    \u003e\u003e\u003e # sort keys by name with ascending order\n    \u003e\u003e\u003e for name in ids.sort(by='name'):\n    \u003e\u003e\u003e     print(name)\n\n    \u003e\u003e\u003e # sort keys by id with descending order\n    \u003e\u003e\u003e for name in ids.sort(by='id', reverse=True):\n    \u003e\u003e\u003e     print(name)\n\nFilter keys\n------------------\n\nFilter keys by sequence length and name\n\nNew in ``pyfastx`` 0.5.10\n\n.. 
code:: python\n\n    \u003e\u003e\u003e # get keys with length \u003e 600\n    \u003e\u003e\u003e ids.filter(ids \u003e 600)\n    \u003cFastaKeys\u003e contains 48 keys\n\n    \u003e\u003e\u003e # get keys with length \u003e= 500 and \u003c= 700\n    \u003e\u003e\u003e ids.filter(ids\u003e=500, ids\u003c=700)\n    \u003cFastaKeys\u003e contains 48 keys\n\n    \u003e\u003e\u003e # get keys with length \u003e 500 and \u003c 600\n    \u003e\u003e\u003e ids.filter(500\u003cids\u003c600)\n    \u003cFastaKeys\u003e contains 22 keys\n\n    \u003e\u003e\u003e # get keys contain JZ8226\n    \u003e\u003e\u003e ids.filter(ids % 'JZ8226')\n    \u003cFastaKeys\u003e contains 90 keys\n\n    \u003e\u003e\u003e # get keys contain JZ8226 with length \u003e 550\n    \u003e\u003e\u003e ids.filter(ids % 'JZ8226', ids\u003e550)\n    \u003cFastaKeys\u003e contains 17 keys\n\n    \u003e\u003e\u003e # clear sort order and filters\n    \u003e\u003e\u003e ids.reset()\n    \u003cFastaKeys\u003e contains 211 keys\n\n    \u003e\u003e\u003e # list a filtered result\n    \u003e\u003e\u003e ids.filter(ids % 'JZ8226', ids\u003e730)\n    \u003e\u003e\u003e list(ids)\n    ['JZ822609.1', 'JZ822650.1', 'JZ822664.1', 'JZ822699.1']\n\n    \u003e\u003e\u003e # list a filtered result with sort order\n    \u003e\u003e\u003e ids.filter(ids % 'JZ8226', ids\u003e730).sort('length', reverse=True)\n    \u003e\u003e\u003e list(ids)\n    ['JZ822609.1', 'JZ822699.1', 'JZ822664.1', 'JZ822650.1']\n\n    \u003e\u003e\u003e ids.filter(ids % 'JZ8226', ids\u003e730).sort('name', reverse=True)\n    \u003e\u003e\u003e list(ids)\n    ['JZ822699.1', 'JZ822664.1', 'JZ822650.1', 'JZ822609.1']\n\nFASTQ\n=====\n\nNew in ``pyfastx`` 0.4.0\n\nRead FASTQ file\n---------------\n\nRead plain or gzipped file and build index, support for random access to reads from FASTQ.\n\n.. 
code:: python\n\n    \u003e\u003e\u003e import pyfastx\n    \u003e\u003e\u003e fq = pyfastx.Fastq('tests/data/test.fq.gz')\n    \u003e\u003e\u003e fq\n    \u003cFastq\u003e tests/data/test.fq.gz contains 100 reads\n\nFASTQ records iteration\n-----------------------\n\nThe fastest way to parse plain or gzipped FASTQ file without building index, the iteration will return a tuple contains read name, seq and quality.\n\n.. code:: python\n\n    \u003e\u003e\u003e import pyfastx\n    \u003e\u003e\u003e for name,seq,qual in pyfastx.Fastq('tests/data/test.fq.gz', build_index=False):\n    \u003e\u003e\u003e     print(name)\n    \u003e\u003e\u003e     print(seq)\n    \u003e\u003e\u003e     print(qual)\n\nYou can also iterate read object from FASTQ object like this:\n\n.. code:: python\n\n    \u003e\u003e\u003e import pyfastx\n    \u003e\u003e\u003e for read in pyfastx.Fastq('test/data/test.fq.gz'):\n    \u003e\u003e\u003e     print(read.name)\n    \u003e\u003e\u003e     print(read.seq)\n    \u003e\u003e\u003e     print(read.qual)\n    \u003e\u003e\u003e     print(read.quali)\n\nIteration with ``build_index=True`` (default) return read object which allows you to access attribution of read. New in pyfastx 0.6.3.\n\n\nGet FASTQ information\n---------------------\n\n.. 
code:: python\n\n    \u003e\u003e\u003e # get read counts in FASTQ\n    \u003e\u003e\u003e len(fq)\n    800\n\n    \u003e\u003e\u003e # get total bases\n    \u003e\u003e\u003e fq.size\n    120000\n\n    \u003e\u003e\u003e # get GC content of FASTQ file\n    \u003e\u003e\u003e fq.gc_content\n    66.17471313476562\n\n    \u003e\u003e\u003e # get composition of bases in FASTQ\n    \u003e\u003e\u003e fq.composition\n    {'A': 20501, 'C': 39705, 'G': 39704, 'T': 20089, 'N': 1}\n\n    \u003e\u003e\u003e # New in pyfastx 0.6.10\n    \u003e\u003e\u003e # get average length of reads\n    \u003e\u003e\u003e fq.avglen\n    150.0\n\n    \u003e\u003e\u003e # get maximum length of reads\n    \u003e\u003e\u003e fq.maxlen\n    150\n\n    \u003e\u003e\u003e # get minimum length of reads\n    \u003e\u003e\u003e fq.minlen\n    150\n\n    \u003e\u003e\u003e # get maximum quality score\n    \u003e\u003e\u003e fq.maxqual\n    70\n\n    \u003e\u003e\u003e # get minimum quality score\n    \u003e\u003e\u003e fq.minqual\n    35\n\n    \u003e\u003e\u003e # get phred which affects the quality score conversion\n    \u003e\u003e\u003e fq.phred\n    33\n\n    \u003e\u003e\u003e # Guess fastq quality encoding system\n    \u003e\u003e\u003e # New in pyfastx 0.4.1\n    \u003e\u003e\u003e fq.encoding_type\n    ['Sanger Phred+33', 'Illumina 1.8+ Phred+33']\n\nRead\n====\n\nGet read from FASTQ\n-------------------\n\n.. 
code:: python\n\n    \u003e\u003e\u003e #get read like a dict by read name\n    \u003e\u003e\u003e r1 = fq['A00129:183:H77K2DMXX:1:1101:4752:1047']\n    \u003e\u003e\u003e r1\n    \u003cRead\u003e A00129:183:H77K2DMXX:1:1101:4752:1047 with length of 150\n\n    \u003e\u003e\u003e # get read like a list by index\n    \u003e\u003e\u003e r2 = fq[10]\n    \u003e\u003e\u003e r2\n    \u003cRead\u003e A00129:183:H77K2DMXX:1:1101:18041:1078 with length of 150\n\n    \u003e\u003e\u003e # get the last read\n    \u003e\u003e\u003e r3 = fq[-1]\n    \u003e\u003e\u003e r3\n    \u003cRead\u003e A00129:183:H77K2DMXX:1:1101:31575:4726 with length of 150\n\n    \u003e\u003e\u003e # check a read weather in FASTQ file\n    \u003e\u003e\u003e 'A00129:183:H77K2DMXX:1:1101:4752:1047' in fq\n    True\n\nGet read information\n--------------------\n\n.. code:: python\n\n    \u003e\u003e\u003e r = fq[-10]\n    \u003e\u003e\u003e r\n    \u003cRead\u003e A00129:183:H77K2DMXX:1:1101:1750:4711 with length of 150\n\n    \u003e\u003e\u003e # get read order number in FASTQ file\n    \u003e\u003e\u003e r.id\n    791\n\n    \u003e\u003e\u003e # get read name\n    \u003e\u003e\u003e r.name\n    'A00129:183:H77K2DMXX:1:1101:1750:4711'\n\n    \u003e\u003e\u003e # get read full header line, New in pyfastx 0.6.3\n    \u003e\u003e\u003e r.description\n    '@A00129:183:H77K2DMXX:1:1101:1750:4711 1:N:0:CAATGGAA+CGAGGCTG'\n\n    \u003e\u003e\u003e # get read length\n    \u003e\u003e\u003e len(r)\n    150\n\n    \u003e\u003e\u003e # get read sequence\n    \u003e\u003e\u003e r.seq\n    'CGAGGAAATCGACGTCACCGATCTGGAAGCCCTGCGCGCCCATCTCAACCAGAAATGGGGTGGCCAGCGCGGCAAGCTGACCCTGCTGCCGTTCCTGGTCCGCGCCATGGTCGTGGCGCTGCGCGACTTCCCGCAGTTGAACGCGCGCTA'\n\n    \u003e\u003e\u003e # get raw string of read, New in pyfastx 0.6.3\n    \u003e\u003e\u003e print(r.raw)\n    @A00129:183:H77K2DMXX:1:1101:1750:4711 1:N:0:CAATGGAA+CGAGGCTG\n    
CGAGGAAATCGACGTCACCGATCTGGAAGCCCTGCGCGCCCATCTCAACCAGAAATGGGGTGGCCAGCGCGGCAAGCTGACCCTGCTGCCGTTCCTGGTCCGCGCCATGGTCGTGGCGCTGCGCGACTTCCCGCAGTTGAACGCGCGCTA\n    +\n    FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFFF:\n\n    \u003e\u003e\u003e # get read quality ascii string\n    \u003e\u003e\u003e r.qual\n    'FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFFF:'\n\n    \u003e\u003e\u003e # get read quality integer value, ascii - 33 or 64\n    \u003e\u003e\u003e r.quali\n    [37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 11, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25]\n\n    \u003e\u003e\u003e # get read length\n    \u003e\u003e\u003e len(r)\n    150\n\nFastqKeys\n=========\n\nNew in ``pyfastx`` 0.8.0.\n\nGet fastq keys\n---------------\n\nGet all names of read as a list-like object.\n\n.. 
code:: python\n\n    \u003e\u003e\u003e ids = fq.keys()\n    \u003e\u003e\u003e ids\n    \u003cFastqKeys\u003e contains 800 keys\n\n    \u003e\u003e\u003e # get count of read\n    \u003e\u003e\u003e len(ids)\n    800\n\n    \u003e\u003e\u003e # get key by index\n    \u003e\u003e\u003e ids[0]\n    'A00129:183:H77K2DMXX:1:1101:6804:1031'\n\n    \u003e\u003e\u003e # check key whether in fasta\n    \u003e\u003e\u003e 'A00129:183:H77K2DMXX:1:1101:14416:1031' in ids\n    True\n\nCommand line interface\n======================\n\nNew in ``pyfastx`` 0.5.0\n\n.. code:: bash\n\n    $ pyfastx -h\n\n    usage: pyfastx COMMAND [OPTIONS]\n\n    A command line tool for FASTA/Q file manipulation\n\n    optional arguments:\n      -h, --help     show this help message and exit\n      -v, --version  show program's version number and exit\n\n    Commands:\n\n        index        build index for fasta/q file\n        stat         show detailed statistics information of fasta/q file\n        split        split fasta/q file into multiple files\n        fq2fa        convert fastq file to fasta file\n        subseq       get subsequences from fasta file by region\n        sample       randomly sample sequences from fasta or fastq file\n        extract      extract full sequences or reads from fasta/q file\n\nBuild index\n-----------\n\nNew in ``pyfastx`` 0.6.10\n\n.. code:: bash\n\n    $ pyfastx index -h\n\n    usage: pyfastx index [-h] [-f] fastx [fastx ...]\n\n    positional arguments:\n      fastx       fasta or fastq file, gzip support\n\n    optional arguments:\n      -h, --help  show this help message and exit\n      -f, --full  build full index, base composition will be calculated\n\nShow statistics information\n---------------------------\n\n.. 
code:: bash\n\n    $ pyfastx stat -h\n\n    usage: pyfastx info [-h] fastx\n\n    positional arguments:\n      fastx       input fasta or fastq file, gzip support\n\n    optional arguments:\n      -h, --help  show this help message and exit\n\nSplit FASTA/Q file\n------------------\n\n.. code:: bash\n\n    $ pyfastx split -h\n\n    usage: pyfastx split [-h] (-n int | -c int) [-o str] fastx\n\n    positional arguments:\n      fastx                 fasta or fastq file, gzip support\n\n    optional arguments:\n      -h, --help            show this help message and exit\n      -n int                split a fasta/q file into N new files with even size\n      -c int                split a fasta/q file into multiple files containing the same sequence counts\n      -o str, --out-dir str\n                            output directory, default is current folder\n\nConvert FASTQ to FASTA file\n---------------------------\n\n.. code:: bash\n\n    $ pyfastx fq2fa -h\n\n    usage: pyfastx fq2fa [-h] [-o str] fastx\n\n    positional arguments:\n      fastx                 fastq file, gzip support\n\n    optional arguments:\n      -h, --help            show this help message and exit\n      -o str, --out-file str\n                            output file, default: output to stdout\n\nGet subsequence with region\n---------------------------\n\n.. 
code:: bash\n\n    $ pyfastx subseq -h\n\n    usage: pyfastx subseq [-h] [-r str | -b str] [-o str] fastx [region [region ...]]\n\n    positional arguments:\n      fastx                 input fasta file, gzip support\n      region                format is chr:start-end, start and end position is 1-based, multiple names were separated by space\n\n    optional arguments:\n      -h, --help            show this help message and exit\n      -r str, --region-file str\n                            tab-delimited file, one region per line, both start and end position are 1-based\n      -b str, --bed-file str\n                            tab-delimited BED file, 0-based start position and 1-based end position\n      -o str, --out-file str\n                            output file, default: output to stdout\n\nSample sequences\n----------------\n\n.. code:: bash\n\n    $ pyfastx sample -h\n\n    usage: pyfastx sample [-h] (-n int | -p float) [-s int] [--sequential-read] [-o str] fastx\n\n    positional arguments:\n      fastx                 fasta or fastq file, gzip support\n\n    optional arguments:\n      -h, --help            show this help message and exit\n      -n int                number of sequences to be sampled\n      -p float              proportion of sequences to be sampled, 0~1\n      -s int, --seed int    random seed, default is the current system time\n      --sequential-read     start sequential reading, particularly suitable for sampling large numbers of sequences\n      -o str, --out-file str\n                            output file, default: output to stdout\n\nExtract sequences\n-----------------\n\nNew in ``pyfastx`` 0.6.10\n\n.. 
code:: bash\n\n    $ pyfastx extract -h\n\n    usage: pyfastx extract [-h] [-l str] [--reverse-complement] [--out-fasta] [-o str] [--sequential-read]\n                           fastx [name [name ...]]\n\n    positional arguments:\n      fastx                 fasta or fastq file, gzip support\n      name                  sequence name or read name, multiple names were separated by space\n\n    optional arguments:\n      -h, --help            show this help message and exit\n      -l str, --list-file str\n                            a file containing sequence or read names, one name per line\n      --reverse-complement  output reverse complement sequence\n      --out-fasta           output fasta format when extract reads from fastq, default output fastq format\n      -o str, --out-file str\n                            output file, default: output to stdout\n      --sequential-read     start sequential reading, particularly suitable for extracting large numbers of sequences\n\nDrawbacks\n=========\n\nIf you repeatedly check whether sequence names exist in a FASTA file by using the ``in`` operator on a FASTA object, for example:\n\n.. code:: python\n\n\t\u003e\u003e\u003e import pyfastx\n\t\u003e\u003e\u003e fa = pyfastx.Fasta('tests/data/test.fa.gz')\n\t\u003e\u003e\u003e # Suppose seqnames has 100000 names\n\t\u003e\u003e\u003e for seqname in seqnames:\n\t...     if seqname in fa:\n\t...         pass  # do something\n\nThis will take a long time to finish, because pyfastx does not load the index into memory: each ``in`` test runs an SQL existence query against the index database. A faster alternative is:\n\n.. 
code:: python\n\n\t\u003e\u003e\u003e import pyfastx\n\t\u003e\u003e\u003e fa = pyfastx.Fasta('tests/data/test.fa.gz')\n\t\u003e\u003e\u003e # load all sequence names into a set object\n\t\u003e\u003e\u003e all_names = set(fa.keys())\n\t\u003e\u003e\u003e for seqname in seqnames:\n\t...     if seqname in all_names:\n\t...         pass  # do something\n\nTesting\n=======\n\nThe ``pyfaidx`` module was used to test ``pyfastx``. First, make sure a suitable version of ``pyfastx`` is installed::\n\n    pip install pyfastx\n\nTo test ``pyfastx``, you should also install ``pyfaidx`` 0.5.8::\n\n    pip install pyfaidx==0.5.8\n\nThen, to run the tests::\n\n\t$ python setup.py test\n\nAcknowledgements\n================\n\n`kseq.h \u003chttps://github.com/attractivechaos/klib/blob/master/kseq.h\u003e`_ and `zlib \u003chttps://www.zlib.net/\u003e`_ were used to parse the FASTA format. `Sqlite3 \u003chttps://www.sqlite.org/index.html\u003e`_ was used to store the built indexes. The ability of ``pyfastx`` to randomly access sequences in gzipped FASTA files is mainly attributable to `indexed_gzip \u003chttps://github.com/pauldmccarthy/indexed_gzip\u003e`_.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flmdu%2Fpyfastx","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flmdu%2Fpyfastx","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flmdu%2Fpyfastx/lists"}