{"id":13704201,"url":"https://github.com/mdshw5/pyfaidx","last_synced_at":"2025-05-14T08:06:28.660Z","repository":{"id":2348326,"uuid":"12792173","full_name":"mdshw5/pyfaidx","owner":"mdshw5","description":"Efficient pythonic random access to fasta subsequences","archived":false,"fork":false,"pushed_at":"2025-05-05T19:07:22.000Z","size":12314,"stargazers_count":468,"open_issues_count":12,"forks_count":74,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-05-12T06:06:04.537Z","etag":null,"topics":["bgzf","bioinformatics","dna","fasta","genomics","indexing","protein","python","samtools"],"latest_commit_sha":null,"homepage":"https://pypi.python.org/pypi/pyfaidx","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mdshw5.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2013-09-12T19:02:32.000Z","updated_at":"2025-05-05T19:06:04.000Z","dependencies_parsed_at":"2023-07-05T17:15:44.614Z","dependency_job_id":"1de128ae-f7db-4656-b459-a64d03301413","html_url":"https://github.com/mdshw5/pyfaidx","commit_stats":{"total_commits":783,"total_committers":38,"mean_commits":"20.605263157894736","dds":0.1353767560664112,"last_synced_commit":"567f4e0038c38b16ba1633f0656aa202c149db74"},"previous_names":[],"tags_count":90,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mdshw5%2Fpyfaidx","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mdshw5%2Fpyfaidx/tags","releases_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/mdshw5%2Fpyfaidx/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mdshw5%2Fpyfaidx/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mdshw5","download_url":"https://codeload.github.com/mdshw5/pyfaidx/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254101616,"owners_count":22014909,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bgzf","bioinformatics","dna","fasta","genomics","indexing","protein","python","samtools"],"created_at":"2024-08-02T21:01:05.533Z","updated_at":"2025-05-14T08:06:23.631Z","avatar_url":"https://github.com/mdshw5.png","language":"Python","funding_links":[],"categories":["Next Generation Sequencing"],"sub_categories":["Python Modules"],"readme":"|CI| |Package| |PyPI| |Coverage| |Downloads|\n\nDescription\n-----------\n\nSamtools provides a function \"faidx\" (FAsta InDeX), which creates a\nsmall flat index file \".fai\" allowing for fast random access to any\nsubsequence in the indexed FASTA file, while loading a minimal amount of the\nfile in to memory. This python module implements pure Python classes for\nindexing, retrieval, and in-place modification of FASTA files using a samtools\ncompatible index. The pyfaidx module is API compatible with the `pygr`_ seqdb module.\nA command-line script \"`faidx`_\" is installed alongside the pyfaidx module, and\nfacilitates complex manipulation of FASTA files without any programming knowledge.\n\n.. 
_`pygr`: https://github.com/cjlee112/pygr\n\nIf you use pyfaidx in your publication, please cite:\n\n`Shirley MD`_, `Ma Z`_, `Pedersen B`_, `Wheelan S`_. `Efficient \"pythonic\" access to FASTA files using pyfaidx \u003chttps://dx.doi.org/10.7287/peerj.preprints.970v1\u003e`_. PeerJ PrePrints 3:e1196. 2015.\n\n.. _`Shirley MD`: http://github.com/mdshw5\n.. _`Ma Z`: http://github.com/azalea\n.. _`Pedersen B`: http://github.com/brentp\n.. _`Wheelan S`: http://github.com/swheelan\n\nInstallation\n------------\n\nThis package is tested under Linux and macOS using Python 3.7+, and is available from PyPI:\n\n::\n\n    pip install pyfaidx  # add --user if you don't have root\n\nor download a `release \u003chttps://github.com/mdshw5/pyfaidx/releases\u003e`_ and:\n\n::\n\n    pip install .\n\nIf using ``pip install --user`` make sure to add ``/home/$USER/.local/bin`` to your ``$PATH`` (on Linux) or ``/Users/$USER/Library/Python/{python version}/bin`` (on macOS) if you want to run the ``faidx`` script.\n\nPython 2.6 and 2.7 users may choose to use a package version from `v0.7.2 \u003chttps://github.com/mdshw5/pyfaidx/releases/tag/v0.7.2.2\u003e`_ or earlier.\n\nUsage\n-----\n\n.. code:: python\n\n    \u003e\u003e\u003e from pyfaidx import Fasta\n    \u003e\u003e\u003e genes = Fasta('tests/data/genes.fasta')\n    \u003e\u003e\u003e genes\n    Fasta(\"tests/data/genes.fasta\")  # set strict_bounds=True for bounds checking\n\nActs like a dictionary.\n\n.. 
code:: python\n\n    \u003e\u003e\u003e genes.keys()\n    ('AB821309.1', 'KF435150.1', 'KF435149.1', 'NR_104216.1', 'NR_104215.1', 'NR_104212.1', 'NM_001282545.1', 'NM_001282543.1', 'NM_000465.3', 'NM_001282549.1', 'NM_001282548.1', 'XM_005249645.1', 'XM_005249644.1', 'XM_005249643.1', 'XM_005249642.1', 'XM_005265508.1', 'XM_005265507.1', 'XR_241081.1', 'XR_241080.1', 'XR_241079.1')\n\n    \u003e\u003e\u003e genes['NM_001282543.1'][200:230]\n    \u003eNM_001282543.1:201-230\n    CTCGTTCCGCGCCCGCCATGGAACCGGATG\n\n    \u003e\u003e\u003e genes['NM_001282543.1'][200:230].seq\n    'CTCGTTCCGCGCCCGCCATGGAACCGGATG'\n\n    \u003e\u003e\u003e genes['NM_001282543.1'][200:230].name\n    'NM_001282543.1'\n\n    # Start attributes are 1-based\n    \u003e\u003e\u003e genes['NM_001282543.1'][200:230].start\n    201\n\n    # End attributes are 0-based\n    \u003e\u003e\u003e genes['NM_001282543.1'][200:230].end\n    230\n\n    \u003e\u003e\u003e genes['NM_001282543.1'][200:230].fancy_name\n    'NM_001282543.1:201-230'\n\n    \u003e\u003e\u003e len(genes['NM_001282543.1'])\n    5466\n\nNote that start and end coordinates of Sequence objects are [1, 0]. This can be changed to [0, 0] by passing ``one_based_attributes=False`` to ``Fasta`` or ``Faidx``. This argument only affects the ``Sequence .start/.end`` attributes, and has no effect on slicing coordinates.\n\nIndexes like a list:\n\n.. code:: python\n\n    \u003e\u003e\u003e genes[0][:50]\n    \u003eAB821309.1:1-50\n    ATGGTCAGCTGGGGTCGTTTCATCTGCCTGGTCGTGGTCACCATGGCAAC\n\nSlices just like a string:\n\n.. 
code:: python\n\n    \u003e\u003e\u003e genes['NM_001282543.1'][200:230][:10]\n    \u003eNM_001282543.1:201-210\n    CTCGTTCCGC\n\n    \u003e\u003e\u003e genes['NM_001282543.1'][200:230][::-1]\n    \u003eNM_001282543.1:230-201\n    GTAGGCCAAGGTACCGCCCGCGCCTTGCTC\n\n    \u003e\u003e\u003e genes['NM_001282543.1'][200:230][::3]\n    \u003eNM_001282543.1:201-230\n    CGCCCCTACA\n\n    \u003e\u003e\u003e genes['NM_001282543.1'][:]\n    \u003eNM_001282543.1:1-5466\n    CCCCGCCCCT........\n\n- Slicing start and end coordinates are 0-based, just like Python sequences.\n\nComplements and reverse complements just like DNA\n\n.. code:: python\n\n    \u003e\u003e\u003e genes['NM_001282543.1'][200:230].complement\n    \u003eNM_001282543.1 (complement):201-230\n    GAGCAAGGCGCGGGCGGTACCTTGGCCTAC\n\n    \u003e\u003e\u003e genes['NM_001282543.1'][200:230].reverse\n    \u003eNM_001282543.1:230-201\n    GTAGGCCAAGGTACCGCCCGCGCCTTGCTC\n\n    \u003e\u003e\u003e -genes['NM_001282543.1'][200:230]\n    \u003eNM_001282543.1 (complement):230-201\n    CATCCGGTTCCATGGCGGGCGCGGAACGAG\n\n``Fasta`` objects can also be accessed using method calls:\n\n.. code:: python\n\n    \u003e\u003e\u003e genes.get_seq('NM_001282543.1', 201, 210)\n    \u003eNM_001282543.1:201-210\n    CTCGTTCCGC\n\n    \u003e\u003e\u003e genes.get_seq('NM_001282543.1', 201, 210, rc=True)\n    \u003eNM_001282543.1 (complement):210-201\n    GCGGAACGAG\n\nSpliced sequences can be retrieved from a list of [start, end] coordinates:\n**TODO** update this section\n\n.. code:: python\n\n    # new in v0.5.1\n    segments = [[1, 10], [50, 70]]\n    \u003e\u003e\u003e genes.get_spliced_seq('NM_001282543.1', segments)\n    \u003egi|543583786|ref|NM_001282543.1|:1-70\n    CCCCGCCCCTGGTTTCGAGTCGCTGGCCTGC\n\n.. _keyfn:\n\nCustom key functions provide cleaner access:\n\n.. 
code:: python\n\n    \u003e\u003e\u003e from pyfaidx import Fasta\n    \u003e\u003e\u003e genes = Fasta('tests/data/genes.fasta', key_function = lambda x: x.split('.')[0])\n    \u003e\u003e\u003e genes.keys()\n    dict_keys(['NR_104212', 'NM_001282543', 'XM_005249644', 'XM_005249645', 'NR_104216', 'XM_005249643', 'NR_104215', 'KF435150', 'AB821309', 'NM_001282549', 'XR_241081', 'KF435149', 'XR_241079', 'NM_000465', 'XM_005265508', 'XR_241080', 'XM_005249642', 'NM_001282545', 'XM_005265507', 'NM_001282548'])\n    \u003e\u003e\u003e genes['NR_104212'][:10]\n    \u003eNR_104212:1-10\n    CCCCGCCCCT\n\nYou can specify a character to split names on, which will generate additional entries:\n\n.. code:: python\n\n    \u003e\u003e\u003e from pyfaidx import Fasta\n    \u003e\u003e\u003e genes = Fasta('tests/data/genes.fasta', split_char='.', duplicate_action=\"first\") # default duplicate_action=\"stop\"\n    \u003e\u003e\u003e genes.keys()\n    dict_keys(['.1', 'NR_104212', 'NM_001282543', 'XM_005249644', 'XM_005249645', 'NR_104216', 'XM_005249643', 'NR_104215', 'KF435150', 'AB821309', 'NM_001282549', 'XR_241081', 'KF435149', 'XR_241079', 'NM_000465', 'XM_005265508', 'XR_241080', 'XM_005249642', 'NM_001282545', 'XM_005265507', 'NM_001282548'])\n\nIf your `key_function` or `split_char` generates duplicate entries, you can choose what action to take:\n\n.. 
code:: python\n\n    # new in v0.4.9\n    \u003e\u003e\u003e genes = Fasta('tests/data/genes.fasta', split_char=\"|\", duplicate_action=\"longest\")\n    \u003e\u003e\u003e genes.keys()\n    dict_keys(['gi', '563317589', 'dbj', 'AB821309.1', '', '557361099', 'gb', 'KF435150.1', '557361097', 'KF435149.1', '543583796', 'ref', 'NR_104216.1', '543583795', 'NR_104215.1', '543583794', 'NR_104212.1', '543583788', 'NM_001282545.1', '543583786', 'NM_001282543.1', '543583785', 'NM_000465.3', '543583740', 'NM_001282549.1', '543583738', 'NM_001282548.1', '530384540', 'XM_005249645.1', '530384538', 'XM_005249644.1', '530384536', 'XM_005249643.1', '530384534', 'XM_005249642.1', '530373237','XM_005265508.1', '530373235', 'XM_005265507.1', '530364726', 'XR_241081.1', '530364725', 'XR_241080.1', '530364724', 'XR_241079.1'])\n\nFilter functions (returning True) limit the index:\n\n.. code:: python\n\n    # new in v0.3.8\n    \u003e\u003e\u003e from pyfaidx import Fasta\n    \u003e\u003e\u003e genes = Fasta('tests/data/genes.fasta', filt_function = lambda x: x[0] == 'N')\n    \u003e\u003e\u003e genes.keys()\n    dict_keys(['NR_104212', 'NM_001282543', 'NR_104216', 'NR_104215', 'NM_001282549', 'NM_000465', 'NM_001282545', 'NM_001282548'])\n    \u003e\u003e\u003e genes['XM_005249644']\n    KeyError: XM_005249644 not in tests/data/genes.fasta.\n\nOr just get a Python string:\n\n.. code:: python\n\n    \u003e\u003e\u003e from pyfaidx import Fasta\n    \u003e\u003e\u003e genes = Fasta('tests/data/genes.fasta', as_raw=True)\n    \u003e\u003e\u003e genes\n    Fasta(\"tests/data/genes.fasta\", as_raw=True)\n\n    \u003e\u003e\u003e genes['NM_001282543.1'][200:230]\n    CTCGTTCCGCGCCCGCCATGGAACCGGATG\n\nYou can make sure that you always receive an uppercase sequence, even if your fasta file has lower case\n\n.. 
code:: python\n\n    \u003e\u003e\u003e from pyfaidx import Fasta\n    \u003e\u003e\u003e reference = Fasta('tests/data/genes.fasta.lower', sequence_always_upper=True)\n    \u003e\u003e\u003e reference['gi|557361099|gb|KF435150.1|'][1:70]\n\n    \u003egi|557361099|gb|KF435150.1|:2-70\n    TGACATCATTTTCCACCTCTGCTCAGTGTTCAACATCTGACAGTGCTTGCAGGATCTCTCCTGGACAAA\n\n\nYou can also perform line-based iteration, receiving the sequence lines as they appear in the FASTA file:\n\n.. code:: python\n\n    \u003e\u003e\u003e from pyfaidx import Fasta\n    \u003e\u003e\u003e genes = Fasta('tests/data/genes.fasta')\n    \u003e\u003e\u003e for line in genes['NM_001282543.1']:\n    ...   print(line)\n    CCCCGCCCCTCTGGCGGCCCGCCGTCCCAGACGCGGGAAGAGCTTGGCCGGTTTCGAGTCGCTGGCCTGC\n    AGCTTCCCTGTGGTTTCCCGAGGCTTCCTTGCTTCCCGCTCTGCGAGGAGCCTTTCATCCGAAGGCGGGA\n    CGATGCCGGATAATCGGCAGCCGAGGAACCGGCAGCCGAGGATCCGCTCCGGGAACGAGCCTCGTTCCGC\n    ...\n\nSequence names are truncated on any whitespace. This is a limitation of the indexing strategy. However, full names can be recovered:\n\n.. code:: python\n\n    # new in v0.3.7\n    \u003e\u003e\u003e from pyfaidx import Fasta\n    \u003e\u003e\u003e genes = Fasta('tests/data/genes.fasta')\n    \u003e\u003e\u003e for record in genes:\n    ...   print(record.name)\n    ...   print(record.long_name)\n    ...\n    gi|563317589|dbj|AB821309.1|\n    gi|563317589|dbj|AB821309.1| Homo sapiens FGFR2-AHCYL1 mRNA for FGFR2-AHCYL1 fusion kinase protein, complete cds\n    gi|557361099|gb|KF435150.1|\n    gi|557361099|gb|KF435150.1| Homo sapiens MDM4 protein variant Y (MDM4) mRNA, complete cds, alternatively spliced\n    gi|557361097|gb|KF435149.1|\n    gi|557361097|gb|KF435149.1| Homo sapiens MDM4 protein variant G (MDM4) mRNA, complete cds\n    ...\n\n    # new in v0.4.9\n    \u003e\u003e\u003e from pyfaidx import Fasta\n    \u003e\u003e\u003e genes = Fasta('tests/data/genes.fasta', read_long_names=True)\n    \u003e\u003e\u003e for record in genes:\n    ...   
print(record.name)\n    ...\n    gi|563317589|dbj|AB821309.1| Homo sapiens FGFR2-AHCYL1 mRNA for FGFR2-AHCYL1 fusion kinase protein, complete cds\n    gi|557361099|gb|KF435150.1| Homo sapiens MDM4 protein variant Y (MDM4) mRNA, complete cds, alternatively spliced\n    gi|557361097|gb|KF435149.1| Homo sapiens MDM4 protein variant G (MDM4) mRNA, complete cds\n\nRecords can be accessed efficiently as numpy arrays:\n\n.. code:: python\n\n    # new in v0.5.4\n    \u003e\u003e\u003e from pyfaidx import Fasta\n    \u003e\u003e\u003e import numpy as np\n    \u003e\u003e\u003e genes = Fasta('tests/data/genes.fasta')\n    \u003e\u003e\u003e np.asarray(genes['NM_001282543.1'])\n    array(['C', 'C', 'C', ..., 'A', 'A', 'A'], dtype='|S1')\n\nSequence can be buffered in memory using a read-ahead buffer\nfor fast sequential access:\n\n.. code:: python\n\n    \u003e\u003e\u003e from timeit import timeit\n    \u003e\u003e\u003e fetch = \"genes['NM_001282543.1'][200:230]\"\n    \u003e\u003e\u003e read_ahead = \"import pyfaidx; genes = pyfaidx.Fasta('tests/data/genes.fasta', read_ahead=10000)\"\n    \u003e\u003e\u003e no_read_ahead = \"import pyfaidx; genes = pyfaidx.Fasta('tests/data/genes.fasta')\"\n    \u003e\u003e\u003e string_slicing = \"genes = {}; genes['NM_001282543.1'] = 'N'*10000\"\n\n    \u003e\u003e\u003e timeit(fetch, no_read_ahead, number=10000)\n    0.2204863309962093\n    \u003e\u003e\u003e timeit(fetch, read_ahead, number=10000)\n    0.1121859749982832\n    \u003e\u003e\u003e timeit(fetch, string_slicing, number=10000)\n    0.0033553699977346696\n\nRead-ahead buffering can reduce runtime by 1/2 for sequential accesses to buffered regions.\n\n.. role:: red\n\nIf you want to modify the contents of your FASTA file in-place, you can use the `mutable` argument.\nAny portion of the FastaRecord can be replaced with an equivalent-length string.\n:red:`Warning`: *This will change the contents of your file immediately and permanently:*\n\n.. 
code:: python\n\n    \u003e\u003e\u003e genes = Fasta('tests/data/genes.fasta', mutable=True)\n    \u003e\u003e\u003e type(genes['NM_001282543.1'])\n    \u003cclass 'pyfaidx.MutableFastaRecord'\u003e\n\n    \u003e\u003e\u003e genes['NM_001282543.1'][:10]\n    \u003eNM_001282543.1:1-10\n    CCCCGCCCCT\n    \u003e\u003e\u003e genes['NM_001282543.1'][:10] = 'NNNNNNNNNN'\n    \u003e\u003e\u003e genes['NM_001282543.1'][:15]\n    \u003eNM_001282543.1:1-15\n    NNNNNNNNNNCTGGC\n\nThe FastaVariant class provides a way to integrate single nucleotide variant calls to generate a consensus sequence.\n\n.. code:: python\n\n    # new in v0.4.0\n    \u003e\u003e\u003e consensus = FastaVariant('tests/data/chr22.fasta', 'tests/data/chr22.vcf.gz', het=True, hom=True)\n    RuntimeWarning: Using sample NA06984 genotypes.\n\n    \u003e\u003e\u003e consensus['22'].variant_sites\n    (16042793, 21833121, 29153196, 29187373, 29187448, 29194610, 29821295, 29821332, 29993842, 32330460, 32352284)\n\n    \u003e\u003e\u003e consensus['22'][16042790:16042800]\n    \u003e22:16042791-16042800\n    TCGTAGGACA\n\n    \u003e\u003e\u003e Fasta('tests/data/chr22.fasta')['22'][16042790:16042800]\n    \u003e22:16042791-16042800\n    TCATAGGACA\n\n    \u003e\u003e\u003e consensus = FastaVariant('tests/data/chr22.fasta', 'tests/data/chr22.vcf.gz', sample='NA06984', het=True, hom=True, call_filter='GT == \"0/1\"')\n    \u003e\u003e\u003e consensus['22'].variant_sites\n    (16042793, 29187373, 29187448, 29194610, 29821332)\n    \nYou can also specify paths using ``pathlib.Path`` objects.\n\n.. code:: python\n    \n    #new in v0.7.1\n    \u003e\u003e\u003e from pyfaidx import Fasta\n    \u003e\u003e\u003e from pathlib import Path\n    \u003e\u003e\u003e genes = Fasta(Path('tests/data/genes.fasta'))\n    \u003e\u003e\u003e genes\n    Fasta(\"tests/data/genes.fasta\")\n\nAccessing fasta files from `filesystem_spec \u003chttps://filesystem-spec.readthedocs.io\u003e`_ filesystems:\n\n.. 
code:: python\n\n    # new in v0.7.0\n    # pip install fsspec s3fs\n    \u003e\u003e\u003e import fsspec\n    \u003e\u003e\u003e from pyfaidx import Fasta\n    \u003e\u003e\u003e of = fsspec.open(\"s3://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta\", anon=True)\n    \u003e\u003e\u003e genes = Fasta(of)\n\n\n.. _faidx:\n\nIt also provides a command-line script:\n\ncli script: faidx\n~~~~~~~~~~~~~~~~~\n\n.. code:: bash\n\n    Fetch sequences from FASTA. If no regions are specified, all entries in the\n    input file are returned. Input FASTA file must be consistently line-wrapped,\n    and line wrapping of output is based on input line lengths.\n\n    positional arguments:\n      fasta                 FASTA file\n      regions               space separated regions of sequence to fetch e.g.\n                            chr1:1-1000\n\n    optional arguments:\n      -h, --help            show this help message and exit\n      -b BED, --bed BED     bed file of regions (zero-based start coordinate)\n      -o OUT, --out OUT     output file name (default: stdout)\n      -i {bed,chromsizes,nucleotide,transposed}, --transform {bed,chromsizes,nucleotide,transposed} transform the requested regions into another format. default: None\n      -c, --complement      complement the sequence. default: False\n      -r, --reverse         reverse the sequence. default: False\n      -a SIZE_RANGE, --size-range SIZE_RANGE\n                            selected sequences are in the size range [low, high]. example: 1,1000 default: None\n      -n, --no-names        omit sequence names from output. default: False\n      -f, --full-names      output full names including description. default: False\n      -x, --split-files     write each region to a separate file (names are derived from regions)\n      -l, --lazy            fill in --default-seq for missing ranges. 
default: False\n      -s DEFAULT_SEQ, --default-seq DEFAULT_SEQ\n                            default base for missing positions and masking. default: None\n      -d DELIMITER, --delimiter DELIMITER\n                            delimiter for splitting names to multiple values (duplicate names will be discarded). default: None\n      -e HEADER_FUNCTION, --header-function HEADER_FUNCTION\n                            python function to modify header lines e.g: \"lambda x: x.split(\"|\")[0]\". default: lambda x: x.split()[0]\n      -u {stop,first,last,longest,shortest}, --duplicates-action {stop,first,last,longest,shortest}\n                            entry to take when duplicate sequence names are encountered. default: stop\n      -g REGEX, --regex REGEX\n                            selected sequences are those matching regular expression. default: .*\n      -v, --invert-match    selected sequences are those not matching 'regions' argument. default: False\n      -m, --mask-with-default-seq\n                            mask the FASTA file using --default-seq default: False\n      -M, --mask-by-case    mask the FASTA file by changing to lowercase. default: False\n      -e HEADER_FUNCTION, --header-function HEADER_FUNCTION\n                            python function to modify header lines e.g: \"lambda x: x.split(\"|\")[0]\". default: None\n      --no-rebuild          do not rebuild the .fai index even if it is out of date. default: False\n      --version             print pyfaidx version number\n\nExamples:\n\n.. 
code:: bash\n\n    $ faidx -v tests/data/genes.fasta\n    ### Creates an .fai index, but suppresses sequence output using --invert-match ###\n\n    $ faidx tests/data/genes.fasta NM_001282543.1:201-210 NM_001282543.1:300-320\n    \u003eNM_001282543.1:201-210\n    CTCGTTCCGC\n    \u003eNM_001282543.1:300-320\n    GTAATTGTGTAAGTGACTGCA\n\n    $ faidx --full-names tests/data/genes.fasta NM_001282543.1:201-210\n    \u003eNM_001282543.1| Homo sapiens BRCA1 associated RING domain 1 (BARD1), transcript variant 2, mRNA\n    CTCGTTCCGC\n\n    $ faidx --no-names tests/data/genes.fasta NM_001282543.1:201-210 NM_001282543.1:300-320\n    CTCGTTCCGC\n    GTAATTGTGTAAGTGACTGCA\n\n    $ faidx --complement tests/data/genes.fasta NM_001282543.1:201-210\n    \u003eNM_001282543.1:201-210 (complement)\n    GAGCAAGGCG\n\n    $ faidx --reverse tests/data/genes.fasta NM_001282543.1:201-210\n    \u003eNM_001282543.1:210-201\n    CGCCTTGCTC\n\n    $ faidx --reverse --complement tests/data/genes.fasta NM_001282543.1:201-210\n    \u003eNM_001282543.1:210-201 (complement)\n    GCGGAACGAG\n\n    $ faidx tests/data/genes.fasta NM_001282543.1\n    \u003eNM_001282543.1:1-5466\n    CCCCGCCCCT........\n    ..................\n    ..................\n    ..................\n\n    $ faidx --regex \"^NM_00128254[35]\" genes.fasta\n    \u003eNM_001282543.1\n    ..................\n    ..................\n    ..................\n    \u003eNM_001282545.1\n    ..................\n    ..................\n    ..................\n\n    $ faidx --lazy tests/data/genes.fasta NM_001282543.1:5460-5480\n    \u003eNM_001282543.1:5460-5480\n    AAAAAAANNNNNNNNNNNNNN\n\n    $ faidx --lazy --default-seq='Q' tests/data/genes.fasta NM_001282543.1:5460-5480\n    \u003eNM_001282543.1:5460-5480\n    AAAAAAAQQQQQQQQQQQQQQ\n\n    $ faidx tests/data/genes.fasta --bed regions.bed\n    ...\n\n    $ faidx --transform chromsizes tests/data/genes.fasta\n    AB821309.1\t3510\n    KF435150.1\t481\n    KF435149.1\t642\n    
NR_104216.1\t4573\n    NR_104215.1\t5317\n    NR_104212.1\t5374\n    ...\n\n    $ faidx --transform bed tests/data/genes.fasta\n    AB821309.1\t1\t3510\n    KF435150.1\t1\t481\n    KF435149.1\t1\t642\n    NR_104216.1\t1\t4573\n    NR_104215.1\t1\t5317\n    NR_104212.1\t1\t5374\n    ...\n\n    $ faidx --transform nucleotide tests/data/genes.fasta\n    name\tstart\tend\tA\tT\tC\tG\tN\n    AB821309.1\t1\t3510\t955\t774\t837\t944\t0\n    KF435150.1\t1\t481\t149\t120\t103\t109\t0\n    KF435149.1\t1\t642\t201\t163\t129\t149\t0\n    NR_104216.1\t1\t4573\t1294\t1552\t828\t899\t0\n    NR_104215.1\t1\t5317\t1567\t1738\t968\t1044\t0\n    NR_104212.1\t1\t5374\t1581\t1756\t977\t1060\t0\n    ...\n\n    $ faidx --transform transposed tests/data/genes.fasta\n    AB821309.1\t1\t3510\tATGGTCAGCTGGGGTCGTTTCATC...\n    KF435150.1\t1\t481\tATGACATCATTTTCCACCTCTGCT...\n    KF435149.1\t1\t642\tATGACATCATTTTCCACCTCTGCT...\n    NR_104216.1\t1\t4573\tCCCCGCCCCTCTGGCGGCCCGCCG...\n    NR_104215.1\t1\t5317\tCCCCGCCCCTCTGGCGGCCCGCCG...\n    NR_104212.1\t1\t5374\tCCCCGCCCCTCTGGCGGCCCGCCG...\n    ...\n\n    $ faidx --split-files tests/data/genes.fasta\n    $ ls\n    AB821309.1.fasta\tNM_001282549.1.fasta\tXM_005249645.1.fasta\n    KF435149.1.fasta\tNR_104212.1.fasta\tXM_005265507.1.fasta\n    KF435150.1.fasta\tNR_104215.1.fasta\tXM_005265508.1.fasta\n    NM_000465.3.fasta\tNR_104216.1.fasta\tXR_241079.1.fasta\n    NM_001282543.1.fasta\tXM_005249642.1.fasta\tXR_241080.1.fasta\n    NM_001282545.1.fasta\tXM_005249643.1.fasta\tXR_241081.1.fasta\n    NM_001282548.1.fasta\tXM_005249644.1.fasta\n\n    $ faidx --delimiter='_' tests/data/genes.fasta 000465.3\n    \u003e000465.3\n    CCCCGCCCCTCTGGCGGCCCGCCGTCCCAGACGCGGGAAGAGCTTGGCCGGTTTCGAGTCGCTGGCCTGC\n    AGCTTCCCTGTGGTTTCCCGAGGCTTCCTTGCTTCCCGCTCTGCGAGGAGCCTTTCATCCGAAGGCGGGA\n    .......\n\n    $ faidx --size-range 5500,6000 -i chromsizes tests/data/genes.fasta\n    NM_000465.3\t5523\n\n    $ faidx -m --bed regions.bed tests/data/genes.fasta\n    
### Modifies tests/data/genes.fasta by masking regions using --default-seq character ###\n\n    $ faidx -M --bed regions.bed tests/data/genes.fasta\n    ### Modifies tests/data/genes.fasta by masking regions using lowercase characters ###\n\n    $ faidx -e \"lambda x: x.split('.')[0]\" tests/data/genes.fasta -i bed\n    AB821309\t1\t3510\n    KF435150\t1\t481\n    KF435149\t1\t642\n    NR_104216\t1\t4573\n    NR_104215\t1\t5317\n    .......\n\n\nThe syntax is similar to that of ``samtools faidx``.\n\n\nA lower-level Faidx class is also available:\n\n.. code:: python\n\n    \u003e\u003e\u003e from pyfaidx import Faidx\n    \u003e\u003e\u003e fa = Faidx('genes.fa')  # can return str with as_raw=True\n    \u003e\u003e\u003e fa.index\n    OrderedDict([('AB821309.1', IndexRecord(rlen=3510, offset=12, lenc=70, lenb=71)), ('KF435150.1', IndexRecord(rlen=481, offset=3585, lenc=70, lenb=71)),... ])\n\n    \u003e\u003e\u003e fa.index['AB821309.1'].rlen\n    3510\n\n    \u003e\u003e\u003e fa.fetch('AB821309.1', 1, 10)  # these are 1-based genomic coordinates\n    \u003eAB821309.1:1-10\n    ATGGTCAGCT\n\n\n-  If the FASTA file is not indexed, when ``Faidx`` is initialized the\n   ``build_index`` method will automatically run, and\n   the index will be written to \"filename.fa.fai\" with ``write_fai()``,\n   where \"filename.fa\" is the original FASTA file.\n-  Start and end coordinates are 1-based.\n\nSupport for compressed FASTA\n----------------------------\n\n``pyfaidx`` can create and read ``.fai`` indices for FASTA files that have\nbeen compressed using the `bgzip \u003chttps://www.htslib.org/doc/bgzip.html\u003e`_\ntool from `samtools \u003chttp://www.htslib.org/\u003e`_. ``bgzip`` writes compressed\ndata in a ``BGZF`` format. ``BGZF`` is ``gzip`` compatible, consisting of\nmultiple concatenated ``gzip`` blocks, each with an additional ``gzip``\nheader making it possible to build an index for rapid random access. I.e.,\nfiles compressed with ``bgzip`` are valid ``gzip`` and so can be read by\n``gunzip``.  
See `this description\n\u003chttp://pydoc.net/Python/biopython/1.66/Bio.bgzf/\u003e`_ for more details on\n``bgzip``.\n\nChangelog\n---------\n\nPlease see the `releases \u003chttps://github.com/mdshw5/pyfaidx/releases\u003e`_ for a\ncomprehensive list of version changes.\n\nKnown issues\n------------\n\nI try to fix as many bugs as possible, but most of this work is supported by a single developer. Please check the `known issues \u003chttps://github.com/mdshw5/pyfaidx/issues?utf8=✓\u0026q=is%3Aissue+is%3Aopen+label%3Aknown\u003e`_ for bugs relevant to your work. Pull requests are welcome.\n\n\nContributing\n------------\n\nCreate a new Pull Request with one feature. If you add a new feature, please\nalso create the relevant test.\n\nTo get the tests running on your machine:\n - Create a new virtualenv and install the `dev-requirements.txt`.\n \n      pip install -r dev-requirements.txt\n      \n - Download the test data by running:\n\n      python tests/data/download_gene_fasta.py\n\n - Run the tests with:\n\n      pytest\n\nAcknowledgements\n----------------\n\nThis project is freely licensed by the author, `Matthew\nShirley \u003chttp://mattshirley.com\u003e`_, and was completed under the\nmentorship and financial support of Drs. `Sarah\nWheelan \u003chttp://sjwheelan.som.jhmi.edu\u003e`_ and `Vasan\nYegnasubramanian \u003chttp://yegnalab.onc.jhmi.edu\u003e`_ at the Sidney Kimmel\nComprehensive Cancer Center in the Department of Oncology.\n\n.. |Travis| image:: https://travis-ci.com/mdshw5/pyfaidx.svg?branch=master\n    :target: https://travis-ci.com/mdshw5/pyfaidx\n    \n.. |CI| image:: https://github.com/mdshw5/pyfaidx/actions/workflows/main.yml/badge.svg?branch=master\n    :target: https://github.com/mdshw5/pyfaidx/actions/workflows/main.yml\n\n.. |PyPI| image:: https://img.shields.io/pypi/v/pyfaidx.svg?branch=master\n    :target: https://pypi.python.org/pypi/pyfaidx\n\n.. 
|Landscape| image:: https://landscape.io/github/mdshw5/pyfaidx/master/landscape.svg\n   :target: https://landscape.io/github/mdshw5/pyfaidx/master\n   :alt: Code Health\n\n.. |Coverage| image:: https://codecov.io/gh/mdshw5/pyfaidx/branch/master/graph/badge.svg\n   :target: https://codecov.io/gh/mdshw5/pyfaidx\n\n.. |Depsy| image:: http://depsy.org/api/package/pypi/pyfaidx/badge.svg\n   :target: http://depsy.org/package/python/pyfaidx\n\n.. |Appveyor| image:: https://ci.appveyor.com/api/projects/status/80ihlw30a003596w?svg=true\n   :target: https://ci.appveyor.com/project/mdshw5/pyfaidx\n   \n.. |Package| image:: https://github.com/mdshw5/pyfaidx/actions/workflows/pypi.yml/badge.svg\n   :target: https://github.com/mdshw5/pyfaidx/actions/workflows/pypi.yml\n   \n.. |Downloads| image:: https://img.shields.io/pypi/dm/pyfaidx.svg\n   :target: https://pypi.python.org/pypi/pyfaidx/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmdshw5%2Fpyfaidx","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmdshw5%2Fpyfaidx","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmdshw5%2Fpyfaidx/lists"}