{"id":22155937,"url":"https://github.com/citiususc/bigseqkit","last_synced_at":"2025-07-26T07:32:35.425Z","repository":{"id":40594969,"uuid":"507136548","full_name":"citiususc/BigSeqKit","owner":"citiususc","description":"BigSeqKit: a parallel Big Data toolkit to process FASTA and FASTQ files at scale","archived":false,"fork":false,"pushed_at":"2023-08-06T16:45:04.000Z","size":235,"stargazers_count":46,"open_issues_count":3,"forks_count":3,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-04-16T11:35:30.946Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/citiususc.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-06-24T20:29:38.000Z","updated_at":"2024-03-22T07:37:56.000Z","dependencies_parsed_at":"2023-02-08T09:46:25.983Z","dependency_job_id":null,"html_url":"https://github.com/citiususc/BigSeqKit","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/citiususc%2FBigSeqKit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/citiususc%2FBigSeqKit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/citiususc%2FBigSeqKit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/citiususc%2FBigSeqKit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/citiususc","download_url":"https://codeload.github.com/citiususc/BigSeqKit/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com
","kind":"github","repositories_count":227660763,"owners_count":17800418,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-02T02:33:40.486Z","updated_at":"2024-12-02T02:33:41.192Z","avatar_url":"https://github.com/citiususc.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":".. image:: ./logo.svg\n   :width: 400\n\n=========\nBigSeqKit\n=========\nThe Next Generation Sequencing (NGS) raw data are stored in FASTA and FASTQ text-based file formats. In this way, manipulating these files efficiently is essential to analyze and interpret data in any genomics pipeline. Common operations on FASTA/Q files include searching, filtering, sampling, deduplication and sorting, among others. We can find several tools in the literature for FASTA/Q file manipulation but none of them are well fitted for large files of tens of GB (likely TBs in the near future) since mostly they are based on sequential processing. The exception is `seqkit \u003chttps://github.com/shenwei356/seqkit\u003e`_ that allows some routines to use a few threads but, in any case, the scalability is very limited.\n\nTo deal with this issue, we introduce **BigSeqKit**, a parallel toolkit to manipulate FASTA/Q files at scale with speed and scalability at its core. *BigSeqKit* takes advantage of an HPC-Big Data framework (`IgnisHPC \u003chttps://ignishpc.readthedocs.io\u003e`_) to parallelize and optimize the commands included in *seqkit*. 
In this way, in most cases **it is from tens to hundreds of times faster than other state-of-the-art tools** such as *seqkit*, `samtools <https://www.htslib.org>`_ and `pyfastx <https://pyfastx.readthedocs.io/en/latest>`_. At the same time, our tool is easy to use and install on any kind of hardware platform (single server or cluster). Routines in *BigSeqKit* can be used as a bioinformatics library or from the command line.

To improve usability and facilitate adoption, *BigSeqKit* implements the same command interface as `SeqKit <https://bioinf.shenwei.me/seqkit/usage>`_.

If you use *BigSeqKit*, please cite the following publication:

César Piñeiro and Juan C. Pichel. BigSeqKit: a parallel Big Data toolkit to process FASTA and FASTQ files at scale, GigaScience, Vol. 12, 2023, https://doi.org/10.1093/gigascience/giad062

------------
User's Guide
------------

BigSeqKit from the command line (CLI)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

BigSeqKit (and IgnisHPC) can be executed in different environments. Here we focus on two common scenarios: running on a local computer, and a deployment on an HPC cluster that uses Slurm as workload manager. For more details, IgnisHPC provides online documentation at https://ignishpc.readthedocs.io.

First, install the ``ignis-deploy`` script using ``pip`` (required only the first time):

.. code-block:: sh

    pip install ignishpc

Local server (Docker)
^^^^^^^^^^^^^^^^^^^^^

To execute *BigSeqKit* on a local server, it is necessary to install Docker (please refer to the `Docker documentation <https://docs.docker.com/get-docker/>`_ for instructions).

Download the precompiled IgnisHPC image (required only the first time):

.. code-block:: sh

    docker pull ignishpc/full

Extract the ``ignis-submit`` script to use it without a container (required only the first time):

.. code-block:: sh

    docker run --rm -v $(pwd):/target ignishpc/submitter ignis-export /target

Set the following environment variables:

.. code-block:: sh

    # set current directory as job directory
    export IGNIS_DFS_ID=$(pwd)
    # set docker as scheduler
    export IGNIS_SCHEDULER_TYPE=docker
    # set where docker is available
    export IGNIS_SCHEDULER_URL=/var/run/docker.sock

Now it is only necessary to select a command or routine (see the complete list `here <https://bioinf.shenwei.me/seqkit/usage>`_) and pass its arguments through the command line following the syntax:

.. code-block:: sh

    ./ignis/bin/ignis-submit ignishpc/full bigseqkit <cmd> <arguments>

For example, the following command uses the routine *seq* to print the names of the sequences included in a FASTA file to an output file:

.. code-block:: sh

    ./ignis/bin/ignis-submit ignishpc/full bigseqkit seq -n -o names.txt input-file.fa

HPC Cluster (Slurm and Singularity)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We assume that Slurm and Singularity are installed on the cluster.

Create the Singularity image on your local server (required only the first time):

.. code-block:: sh

    ignis-deploy images singularity --host ignishpc/full ignis_full.sif

Extract ``ignis-slurm`` to use it without a container (required only the first time):

.. code-block:: sh

    docker run --rm -v $(pwd):/target ignishpc/slurm-submitter ignis-export /target

Move the Singularity image and the ``ignis/`` folder to the cluster.

On the cluster, set the following environment variable:

.. code-block:: sh

    # set current directory as job directory
    export IGNIS_DFS_ID=$(pwd)

Now it is only necessary to select a command or routine (see the complete list `here <https://bioinf.shenwei.me/seqkit/usage>`_) and pass its arguments through the command line following the syntax:

.. code-block:: sh

    ./ignis/bin/ignis-slurm HH:MM:SS ignis_full.sif bigseqkit <cmd> <arguments>

Note that, unlike ``ignis-submit``, the Slurm script requires an estimate of the execution time in the format HH:MM:SS.

For example, the following command uses the routine *seq* to print the names of the sequences included in a FASTA file to an output file:

.. code-block:: sh

    ./ignis/bin/ignis-slurm HH:MM:SS ignis_full.sif bigseqkit seq -n -o names.txt input-file.fa

Setting the number of computing nodes, cores and memory per node
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Users can also specify, through arguments, the number of instances (nodes), cores and memory (in GB) per node to be used in the execution. By default, these values are set to 1. For example, we can execute the previous command on a single server using 4 cores:

.. code-block:: sh

    ./ignis/bin/ignis-submit ignishpc/full -p ignis.executor.cores=4 bigseqkit seq -n -o names.txt input-file.fa


BigSeqKit as a library
~~~~~~~~~~~~~~~~~~~~~~

*BigSeqKit* can also be used as a bioinformatics library. It is worth noting that *BigSeqKit* is implemented in Go. However, thanks to the multi-language support provided by IgnisHPC, it is possible to call *BigSeqKit* routines from C/C++, Python, Java and Go applications without additional overhead. An example of Python code is shown below:

.. code-block:: python

    #!/bin/env python3

    import ignis
    import bigseqkit

    # Initialization of the framework
    ignis.Ignis.start()
    # Resources/configuration of the cluster
    prop = ignis.IProperties()
    prop["ignis.executor.image"] = "ignishpc/full"
    prop["ignis.executor.instances"] = "2"
    prop["ignis.executor.cores"] = "4"
    prop["ignis.executor.memory"] = "1GB"
    # Construction of the cluster
    cluster = ignis.ICluster(prop)
    # Initialization of a Go worker
    worker = ignis.IWorker(cluster, "go")
    # Sequence reading
    seqs = bigseqkit.readFASTA("file.fa", worker)
    # Obtain the sequence names
    names = bigseqkit.seq(seqs, name=True)
    # Save the result
    names.saveAsTextFile("names.txt")
    # Stop the framework
    ignis.Ignis.stop()

Instead of commands run from the terminal as in *SeqKit*, *BigSeqKit* utilities are functions that can be called from a driver code. Note that their names and arguments are exactly the same as those included in *SeqKit*, which can be found at https://bioinf.shenwei.me/seqkit/usage.

Functions in *BigSeqKit* do not use files as input; they use DataFrames instead, an abstract representation of parallel data used by IgnisHPC (similar to RDDs in Spark). Parameters are grouped in a data structure where each field corresponds to the long name of a parameter. Note that *BigSeqKit* functions can be chained (like system pipes using "|"), so the DataFrame generated by one function can be used as input to another. In this way, integrating *BigSeqKit* routines into more complex code is straightforward.

The code starts by initializing the IgnisHPC framework (line 7). Next, a cluster of containers is configured and built (lines 9 to 15). Multiple parameters can be used to configure the environment, such as the image, the number of containers, and the number of cores and memory per container. In this example, we use 2 nodes (instances) and 4 cores per node.
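
The executor properties are also the knob for scaling a job up without changing the pipeline code. A minimal sketch of an alternative configuration fragment (the property keys are the ones used in the example above; the values are purely illustrative):

.. code-block:: python

    # Illustrative fragment of the driver above: the same property keys,
    # sized for e.g. 16 nodes with 32 cores and 64 GB of memory each
    prop = ignis.IProperties()
    prop["ignis.executor.image"] = "ignishpc/full"
    prop["ignis.executor.instances"] = "16"
    prop["ignis.executor.cores"] = "32"
    prop["ignis.executor.memory"] = "64GB"
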
After configuring the IgnisHPC execution environment, the *BigSeqKit* code itself starts. First, we read the input file (line 19); there are different functions for reading FASTA and FASTQ files. All the input sequences are stored in a single data structure. The next stage prints the names of the sequences included in the FASTA file (line 21): the function takes as parameters the sequences and the options that specify its behavior. Finally, the names of the sequences are written to disk.

Local server (Docker)
^^^^^^^^^^^^^^^^^^^^^

Download the precompiled IgnisHPC image (only the first time):

.. code-block:: sh

    docker pull ignishpc/full

Extract ``ignis-submit`` for use without a container (only the first time):

.. code-block:: sh

    docker run --rm -v $(pwd):/target ignishpc/submitter ignis-export /target

Set the environment variables and submit the job:

.. code-block:: sh

    # set current directory as job directory
    export IGNIS_DFS_ID=$(pwd)
    # set docker as scheduler
    export IGNIS_SCHEDULER_TYPE=docker
    # set where docker is available
    export IGNIS_SCHEDULER_URL=/var/run/docker.sock

    # Submit the job
    ./ignis/bin/ignis-submit ignishpc/full ./example


HPC Cluster (Slurm and Singularity)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: sh

    # Create the Singularity image (only the first time)
    ignis-deploy images singularity --host ignishpc/full ignis_full.sif

    # Extract ignis-slurm for use without a container (only the first time)
    docker run --rm -v $(pwd):/target ignishpc/slurm-submitter ignis-export /target

    # Set current directory as job directory
    export IGNIS_DFS_ID=$(pwd)

    # Submit the job
    ./ignis/bin/ignis-slurm 0:10:00 ignis_full.sif ./example

As mentioned previously, unlike ``ignis-submit``, the Slurm script requires an estimate of the execution time in the format HH:MM:SS.

Compilation of Go user code
~~~~~~~~~~~~~~~~~~~~~~~~~~~

To compile user code implemented in Go instead of Python, execute the following command:

.. code-block:: sh

    docker run --rm -v <example-dir>:/src -w /src ignishpc/go-libs-compiler igo-bigseqkit-build

The Go toolchain *compiles folders* rather than individual files, so the example code should be stored inside ``<example-dir>``.

Installation from repository of BigSeqKit and IgnisHPC (optional)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Instead of using the preconfigured images uploaded to Docker Hub (x64 architecture), we can build our own locally. The only dependency of *BigSeqKit* is IgnisHPC, which in turn depends on Docker, so Docker must be installed on the local system (please refer to the `Docker documentation <https://docs.docker.com/get-docker/>`_ for instructions).

Next, install the ``ignis-deploy`` script using ``pip``:

.. code-block:: sh

    pip install ignishpc

IgnisHPC works inside containers, so it is necessary to build the required images with the commands shown below. IgnisHPC supports the C/C++, Python, Java and Go programming languages, but since the example above was implemented in Python, it is only necessary to build the *core-python* image.
Equivalent *core-java*, *core-cpp* and *core-go* images are also available.

.. code-block:: sh

    ignis-deploy images build --full --ignore submitter mesos nomad zookeeper --sources \
       https://github.com/ignishpc/dockerfiles.git \
       https://github.com/ignishpc/backend.git \
       https://github.com/ignishpc/core-python.git \
       https://github.com/citiususc/BigSeqKit.git

Note that an optional ``--platform`` parameter can be used to specify the target processor architecture. Currently, images can be built for *amd64* systems and for PowerPC-based systems (*ppc64le*) such as the Marconi100 supercomputer (CINECA, Italy). If this parameter is not specified, the target architecture is that of the machine where the command is executed.
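
For instance, a cross-build targeting PowerPC might look as follows. This is a hedged sketch: the exact placement of ``--platform`` is an assumption, so check the ``ignis-deploy`` help output for the precise syntax:

.. code-block:: sh

    # Assumed flag placement; builds the images for ppc64le instead of the host architecture
    ignis-deploy images build --full --platform ppc64le --ignore submitter mesos nomad zookeeper --sources \
       https://github.com/ignishpc/dockerfiles.git \
       https://github.com/ignishpc/backend.git \
       https://github.com/ignishpc/core-python.git \
       https://github.com/citiususc/BigSeqKit.git
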