{"id":46729837,"url":"https://github.com/Rbfinch/grepq","last_synced_at":"2026-03-23T16:00:59.860Z","repository":{"id":260864738,"uuid":"882566930","full_name":"Rbfinch/grepq","owner":"Rbfinch","description":"quickly filter fastq files by matching sequences to a set of regex patterns","archived":false,"fork":false,"pushed_at":"2025-12-14T00:47:12.000Z","size":5055,"stargazers_count":58,"open_issues_count":3,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-01-30T05:24:29.768Z","etag":null,"topics":["bioinformatics","fastq","grep","grep-like","grep-search","grepping","gzip","json","regex","sqlite","zstd"],"latest_commit_sha":null,"homepage":"https://github.com/Rbfinch/grepq","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Rbfinch.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":"codemeta.json","zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":"Rbfinch"}},"created_at":"2024-11-03T06:05:35.000Z","updated_at":"2026-01-15T19:57:44.000Z","dependencies_parsed_at":null,"dependency_job_id":"b4cb94a9-ded4-4ac9-a5a0-9676965291fe","html_url":"https://github.com/Rbfinch/grepq","commit_stats":null,"previous_names":["rbfinch/grepq"],"tags_count":56,"template":false,"template_full_name":null,"purl":"pkg:github/Rbfinch/grepq","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rbfinch%2Fgrepq","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rbfinch%2Fgrepq/tags","re
leases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rbfinch%2Fgrepq/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rbfinch%2Fgrepq/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Rbfinch","download_url":"https://codeload.github.com/Rbfinch/grepq/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rbfinch%2Fgrepq/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30863009,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-23T14:38:03.667Z","status":"ssl_error","status_checked_at":"2026-03-23T14:38:01.683Z","response_time":59,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","fastq","grep","grep-like","grep-search","grepping","gzip","json","regex","sqlite","zstd"],"created_at":"2026-03-09T15:00:23.180Z","updated_at":"2026-03-23T16:00:59.847Z","avatar_url":"https://github.com/Rbfinch.png","language":"Rust","readme":"\u003cimg src=\"src/grepq-icon.svg\" width=\"128\" /\u003e\n\n_Quickly filter FASTQ files_\n\n[![Crates.io](https://img.shields.io/crates/v/grepq.svg)](https://crates.io/crates/grepq)\n![Crates.io Total 
Downloads](https://img.shields.io/crates/d/grepq)\n[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)\n[![DOI](https://joss.theoj.org/papers/10.21105/joss.08048/status.svg)](https://doi.org/10.21105/joss.08048)\n\n**Table of Contents**\n\n- [Feature set](#feature-set)\n- [Features and performance in detail](#features-and-performance-in-detail)\n- [Usage](#usage)\n  - [Preparing pattern files](#preparing-pattern-files)\n- [Requirements](#requirements)\n- [Installation](#installation)\n- [Examples and tests](#examples-and-tests)\n- [Further testing](#further-testing)\n- [Citation](#citation)\n- [Update changes](#update-changes)\n- [Contributing and issue reporting](#contributing-and-issue-reporting)\n- [License](#license)\n\n## Feature set\n\n\u003e[!NOTE]\nThis README contains documentation for the latest version of `grepq`. If you are working through this documentation and the examples, please ensure that you are using the latest version. You can check the version by running `grepq -V`. 
For installation instructions, see the [Installation](#installation) section.\n\n- very fast and scales to large FASTQ files\n- IUPAC ambiguity code support\n- support for gzip and zstd compression\n- JSON support for pattern file input and `tune` and `summarise` command output, allowing named regex sets, named regex patterns, and named and unnamed variants\n- use **predicates** to filter on the header field (= record ID line) using a regex, minimum sequence length, and minimum average quality score (supports Phred+33 and Phred+64)\n- does not match false positives\n- output matched sequences to one of four formats\n- optionally output matched sequences to a **SQLite** database file, including GC content, tetranucleotide and canonical tetranucleotide frequencies, and regex pattern matches and their position(s) in each matched FASTQ sequence, allowing for further analysis\n- tune your pattern file and **enumerate named and unnamed variants** with the `tune` command (use the `summarise` command to process all FASTQ records)\n- **bucket matching sequences** to separate files named after each regexName with the `--bucket` flag, in any of the four output formats\n- supports inverted matching with the `inverted` command\n- plays nicely with your unix workflows\n- comprehensive help, examples and testing script\n- read the **JOSS** [paper](https://joss.theoj.org/papers/10.21105/joss.08048)\n\n## Features and performance in detail\n\n**1. Very fast and scales to large FASTQ files**\n\n| tool          | mean wall time (s) | S.D. 
wall time (s) | speedup (× grep) | speedup (× ripgrep) | speedup (× awk) |\n|---------------|--------------------|--------------------|------------------|---------------------|-----------------|\n| _grepq_       | 0.19               | 0.01               | 1796.76          | 18.62               | 863.52          |\n| _fqgrep_      | 0.34               | 0.01               | 1017.61          | 10.55               | 489.07          |\n| _ripgrep_     | 3.57               | 0.01               | 96.49            | 1.00                | 46.37           |\n| _seqkit grep_ | 2.89               | 0.01               | 119.33           | 1.24                | 57.35           |\n| _grep_        | 344.26             | 0.55               | 1.00             | 0.01                | 0.48            |\n| _awk_         | 165.45             | 1.59               | 2.08             | 0.02                | 1.00            |\n| _gawk_        | 287.66             | 1.68               | 1.20             | 0.01                | 0.58            |\n\n\u003cdetails\u003e\n  \u003csummary\u003eDetails\u003c/summary\u003e\n  \u003cp\u003e2022 model Mac Studio with 32GB RAM and Apple M1 max chip running macOS 15.0.1. The FASTQ file (SRX26365298.fastq) was 874MB in size and was stored on the internal SSD (APPLE SSD AP0512R). The pattern file contained 30 regex patterns (see `examples/16S-no-iupac.txt` for the patterns used). grepq v1.4.0, fqgrep v.1.02, ripgrep v14.1.1, seqkit grep v.2.9.0, grep 2.6.0-FreeBSD, awk v. 20200816, and gawk v.5.3.1. fqgrep and seqkit grep were run with default settings, ripgrep was run with -B 1 -A 2 --colors 'match:none' --no-line-number, and grep -B 1 -A 2 was run with --color=never. The tools were configured to output matching records in FASTQ format. The wall times, given in seconds, are the mean of 10 runs, and S.D. is the standard deviation of the wall times, also given in seconds.\u003c/p\u003e\n\u003c/details\u003e\n\n**2. 
Reads and writes regular or gzip or zstd-compressed FASTQ files**\n\nUse the `--best` option for best compression, or the `--fast` option for faster compression.\n\n| tool      | mean wall time (s) | S.D. wall time (s) | speedup (× ripgrep) |\n|-----------|--------------------|--------------------|---------------------|\n| _grepq_   | 1.71               | 0.00               | 2.10                |\n| _fqgrep_  | 1.83               | 0.01               | 1.95                |\n| _ripgrep_ | 3.58               | 0.01               | 1.00                |\n\n\u003cdetails\u003e\n  \u003csummary\u003eDetails\u003c/summary\u003e\n  \u003cp\u003eConditions and versions as above, but the FASTQ file was gzip-compressed. `grepq` was run with the `--read-gzip` option, `ripgrep` with the `-z` option, and `grep` with the `-Z` option. The wall times, given in seconds, are the mean of 10 runs, and S.D. is the standard deviation of the wall times, also given in seconds.\u003c/p\u003e\n\u003c/details\u003e\n\n**3. Predicates**\n\nPredicates can be used to filter on the header field (= record ID line) using a regex, minimum sequence length, and minimum average quality score (supports Phred+33 and Phred+64).\n\n\u003e[!NOTE]\nA regex supplied to filter on the header field (= record ID line) is first passed as a string to the regex engine, and then the regex engine is used to match the header field. Regex patterns to match the header field (= record ID line) must comply with the Rust regex library syntax (\u003chttps://docs.rs/regex/latest/regex/#syntax\u003e). If you get an error message, be sure to escape any special characters in the regex pattern.\n\nPredicates are specified in a JSON pattern file. For an example, see `16S-iupac-and-predicates.json` in the `examples` directory.\n\n**4. Does not match false positives**\n\n`grepq` will only match regex patterns to the sequence of a FASTQ record, which is the most common use case. 
By contrast, `ripgrep` and `grep` match the regex patterns against the entire FASTQ record, which includes the record ID, sequence, separator, and quality fields; this can lead to false positives and slow down the filtering process. When multiple regex patterns are provided, a matched sequence is one where _any_ of the regex patterns in the pattern file match the sequence of the FASTQ record.\n\n**5. Output matched sequences to one of four formats**\n\n- sequences only (default)\n- sequences and their corresponding record IDs (`-I` option)\n- FASTA format (`-F` option)\n- FASTQ format (`-R` option)\n\n\u003e[!NOTE]\nOther than when the `tune` or `summarise` command is run (see below), a FASTQ record is deemed to match (and hence provided in the output) when _any_ of the regex patterns in the pattern file match the sequence of the FASTQ record.\n\n**6. Optionally output matched sequences to a SQLite database file**\n\nOther than when the `inverted` command is given, output to a SQLite database is supported with the `writeSQL` option. The SQLite database will contain a table called `fastq_data` with the following fields: the fastq record (header, sequence and quality fields), length of the sequence (length), percent GC content (GC), percent GC content as an integer (GC_int), number of unique tetranucleotides in the sequence (nTN), number of unique canonical tetranucleotides in the sequence (nCTN), percent tetranucleotide frequency in the sequence (TNF), percent canonical tetranucleotide frequency in the sequence (CTNF), and a JSON array containing the matched regex patterns, the matches and their position(s) in the FASTQ sequence (variants). If the pattern file was given in JSON format and contained a non-null qualityEncoding field, then the average quality score for the sequence (average_quality) will also be written. 
The `--num-tetranucleotides` option can be used to limit the number of tetranucleotides written to the TNF and CTNF fields of the fastq_data SQLite table, these being the most or equal most frequent tetranucleotides and canonical tetranucleotides in the sequence of the matched FASTQ records. A summary of the invoked query (pattern and data files) is written to a second table called `query`.\n\nThe structure of the `fastq_data` table facilitates database indexing and provides a rich dataset to further query. Since all elements of each matched FASTQ record are also written, a FASTQ file can be reconstructed from the SQLite database (see `examples/export_fastq.sql` for an example of how to do this; and scripts `examples/summarise.sql` and `examples/variants-as-json-array.sql` could also come in handy).\n\n**7. Tune your pattern file and enumerate named and unnamed variants with the `tune` command**\n\nUse the `tune` or `summarise` command (`grepq tune -h` and `grepq summarise -h` for instructions) in a simple shell script to update the number and order of regex patterns in your pattern file according to their matched frequency, further targeting and speeding up the filtering process.\n\nSpecifying the `-c` option to the `tune` or `summarise` command will output the matched substrings and their frequencies, ranked from highest to lowest.\n\nWhen the patterns file is given in JSON format, then specifying the `-c`, `--names`, `--json-matches` and `--variants` options to the `tune` or `summarise` command will output the matched pattern variants and their corresponding counts in JSON format to a file called `matches.json`, allowing named regex sets, named regex patterns, and named and unnamed variants. 
See `examples/16S-iupac.json` for an example of a JSON pattern file and `examples/matches.json` for an example of the output of the `tune` or `summarise` command in JSON format.\n\n```bash\n# For each matched pattern in a search of no more than 20000 matches of a gzip-compressed FASTQ file, print the pattern and the number of matches to a JSON file called matches.json, and include the top three most frequent variants of each pattern, and their respective counts\n\ngrepq --read-gzip 16S-no-iupac.json SRX26365298.fastq.gz tune -n 20000 -c --names --json-matches --variants 3\n```\n\nAbridged output (see `examples/matches.json` for the full output):\n\n```json\n{\n    \"regexSet\": {\n        \"regex\": [\n            {\n                \"regexCount\": 2,\n                \"regexName\": \"Primer contig 06a\",\n                \"regexString\": \"[AG]AAT[AT]G[AG]CGGGG\",\n                \"variants\": [\n                    {\n                        \"count\": 1,\n                        \"variant\": \"GAATTGGCGGGG\",\n                        \"variantName\": \"06a-v3\"\n                    },\n                    {\n                        \"count\": 1,\n                        \"variant\": \"GAATTGACGGGG\",\n                        \"variantName\": \"06a-v1\"\n                    }\n                ]\n            },\n            // matches for other regular expressions...\n    ],\n    \"regexSetName\": \"conserved 16S rRNA regions\"\n  }\n}\n```\n\nTo output all variants of each pattern, use the `--all` argument, for example:\n\n```bash\n# For each matched pattern in a search of no more than 20000 matches of a gzip-compressed FASTQ file, print the pattern and the number of matches to a JSON file called matches.json, and include all variants of each pattern, and their respective counts. 
Note that the --variants argument is not given when --all is specified.\n\ngrepq --read-gzip 16S-no-iupac.json SRX26365298.fastq.gz tune -n 20000 -c --names --json-matches --all\n```\n\nYou could then use a tool like `jq` to parse the JSON output of the `tune` or `summarise` command, for example the following command will sort the output by the number of matches for each regex pattern, and then for each pattern, sort the variants by the number of matches:\n\n```bash\njq -r '\n    .regexSet.regex |\n    sort_by(-.regexCount)[] |\n    \"\\(.regexName): \\(.regexCount)\\n\" +\n    (\n      .variants |\n      sort_by(-.count)[] |\n      \"  \\(.variantName // \"unnamed\"): \\(.variant): \\(.count)\"\n    )\n  ' matches.json\n```\n\n\u003e[!NOTE]\nWhen the count option (-c) is given with the `tune` or `summarise` command, `grepq` will count the number of FASTQ records containing a sequence that is matched, for each matching regex in the pattern file. If, however, there are multiple occurrences of a given regex _within a FASTQ record sequence field_, `grepq` will count this as one match. To ensure all records are processed, use the `summarise` command instead of the `tune` command. When the count option (-c) is not given as part of the `tune` or `summarise` command, `grepq` provides the total number of matching FASTQ records for the set of regex patterns in the pattern file. Further, note that counts produced through independently matching regex patterns to the sequence of a FASTQ record inherently underestimate the true number of those patterns in the biological sample, since a regex pattern may span two reads (i.e., be truncated at either the beginning or end of a read). 
To illustrate, a regex pattern representing a 12-mer motif has a 5.5% chance of being truncated for a read length of 400 nucleotides (11/400 + 11/400 = 22/400 = 0.055 or 5.5%), assuming a uniform distribution of motif positions and reads are sampled randomly with respect to motifs (this calculation would need to be adjusted to the extent that motifs are not uniformly distributed and reads are not randomly sampled with respect to motifs).\n\n**8. Supports inverted matching with the `inverted` command**\n\nUse the `inverted` command to output sequences that do not match any of the regex patterns in your pattern file.\n\n**9. Plays nicely with your unix workflows**\n\nFor example, see `tune.sh` in the `examples` directory. This simple script will filter a FASTQ file using `grepq`, tune the pattern file on a user-specified number of total matches, and then filter the FASTQ file again using the tuned pattern file for a user-specified number of the most frequent regex pattern matches.\n\n## Usage\n\nGet instructions and examples using `grepq -h`, or `grepq tune -h`, `grepq summarise -h` and `grepq inverted -h` for more information on the `tune`, `summarise` and `inverted` commands, respectively. See the `examples` directory for examples of pattern files and FASTQ files, and the `cookbook.sh` and `cookbook.md` files for more examples. Finally, `help.md` contains a full dump of the help output, in markdown format.\n\n\u003e[!NOTE]\n`grepq` can output to several formats, including those that are gzip or zstd compressed. `grepq`, however, will only accept a FASTQ file or a compressed (gzip or zstd) FASTQ file as the sequence data file. If you get an error message, check that the input data file is a FASTQ file or a gzip or zstd compressed FASTQ file, and that you have specified the correct file format (--read-gzip or --read-zstd for FASTQ files compressed by gzip and zstd, respectively), and file path. 
Pattern files must contain one regex pattern per line or be provided in JSON format, and patterns are case-sensitive. You can supply an empty pattern file to count the total number of records in the FASTQ file. The regex patterns for matching FASTQ sequences should only include the DNA sequence characters (A, C, G, T), or IUPAC ambiguity codes (N, R, Y, etc.). See `16S-no-iupac.txt`, `16S-iupac.json`, `16S-no-iupac.json`, and `16S-iupac-and-predicates.json` in the `examples` directory for examples of valid pattern files. Regex patterns to match the header field (= record ID line) must comply with the Rust regex library syntax (\u003chttps://docs.rs/regex/latest/regex/#syntax\u003e). If you get an error message, be sure to escape any special characters in the regex pattern.\n\n### Preparing pattern files\n\nWhilst `grepq` can accept pattern files in plain text format (one regex pattern per line), it is recommended to use JSON format for more complex pattern files since JSON pattern files can contain named regex sets, named regex patterns, and named and unnamed variants. JSON can be a little verbose, so you may want to prepare your pattern file in YAML format (for example, see `16S-iupac.yaml` in the `examples` directory) and then convert it to JSON using a tool like `yq`. For example, to convert a YAML pattern file to JSON, use the following command:\n\n```bash\nyq eval '. | tojson' pattern-file.yaml \u003e pattern-file.json\n```\n\n`grepq` will validate the JSON pattern file before processing it, and will provide an error message if the JSON pattern file is not valid. However, if you wish to validate the JSON pattern file before running `grepq`, you can use a tool such as `ajv` and `grepq`'s JSON schema file (`grepq-schema.json`, located in the `examples` directory), for example:\n\n```bash\najv --strict=false -s grepq-schema.json -d pattern-file.json\n```\n\n## Requirements\n\n- `grepq` has been tested on Linux (x86-64 and ARM64) and macOS (ARM64). 
It might work on other platforms, but it has not been tested.\n- Ensure that Rust is installed on your system (\u003chttps://www.rust-lang.org/tools/install\u003e)\n- Ensure that the dependencies are installed on your system. If you are using a package manager, you can install them with the following commands:\n  - For Ubuntu/Debian: `sudo apt update \u0026\u0026 sudo apt install -y build-essential cmake libsqlite3-dev libzstd-dev sqlite3`\n  - For macOS: `brew install sqlite zstd`\n- If you are installing from `bioconda`, you will need to have conda or miniconda installed on your system. You can install conda or miniconda from \u003chttps://docs.conda.io/en/latest/miniconda.html\u003e or \u003chttps://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html\u003e.\n- If the build fails, make sure you have the latest version of the Rust compiler by running `rustup update`\n- To run the `test.sh` and `cookbook.sh` scripts in the `examples` directory, you will need `yq` (v4.44.6 or later), `gunzip` and version 4 or later of `bash`.\n\n## Installation\n\nFirst, install the dependencies described in the `Requirements` section, see above. 
Then, you can install `grepq` in one of the following ways:\n\n- From _crates.io_ (easiest method, but will not install the `examples` directory)\n  - `cargo install grepq`\n\n- From _source_ (will install the `examples` directory)\n  - Clone the repository and `cd` into the `grepq` directory\n  - Run `cargo build --release`\n  - Relative to the cloned parent directory, the executable will be located in `./target/release`\n  - Make sure the executable is in your `PATH` or use the full path to the executable\n\n- From _bioconda_ (assumes conda or miniconda is installed; will not install the `examples` directory)\n  - `conda init fish`  # Or conda init bash, or conda init zsh\n  - `conda create -n myenv`  # Create a new environment named \"myenv\"\n  - `conda activate myenv`  # Activate the new environment\n  - `conda config --add channels conda-forge` # Add conda-forge channel\n  - `conda config --prepend channels bioconda` # Add bioconda channel with higher priority\n  - `conda config --set channel_priority strict` # Set strict channel priority\n  - `conda install grepq` # Install grepq\n\n## Examples and tests\n\nGet instructions and examples using `grepq -h`, or `grepq tune -h`, `grepq summarise -h` and `grepq inverted -h` for more information on the `tune`, `summarise` and `inverted` commands, respectively. 
See the `examples` directory for examples of pattern files and FASTQ files, and the `cookbook.sh` and `cookbook.md` files for more examples.\n\n_File sizes of outfiles to verify `grepq` is working correctly, using the regex file `16S-no-iupac.txt` and the small fastq file `small.fastq`, both located in the `examples` directory:_\n\n```bash\ngrepq ./examples/16S-no-iupac.txt ./examples/small.fastq \u003e outfile.txt \n15953\n\ngrepq  ./examples/16S-no-iupac.txt ./examples/small.fastq inverted \u003e outfile.txt\n736547\n\ngrepq -I ./examples/16S-no-iupac.txt ./examples/small.fastq \u003e outfile.txt\n19515\n\ngrepq -I ./examples/16S-no-iupac.txt ./examples/small.fastq inverted \u003e outfile.txt \n901271\n\ngrepq -R ./examples/16S-no-iupac.txt ./examples/small.fastq \u003e outfile.txt\n35574\n\ngrepq -R ./examples/16S-no-iupac.txt ./examples/small.fastq inverted \u003e outfile.txt \n1642712\n```\n\nFor the curious-minded, note that the regex patterns in `16S-no-iupac.txt`, `16S-iupac.json`, `16S-no-iupac.json`, and `16S-iupac-and-predicates.json` are from Table 3 of Martinez-Porchas, Marcel, et al. \"How conserved are the conserved 16S-rRNA regions?.\" PeerJ 5 (2017): e3036.\n\nFor more examples, see the `examples` directory and the [cookbook](https://github.com/Rbfinch/grepq/blob/main/cookbook.md), available also as a shell script in the `examples` directory.\n\n**Test script**\n\nTo run the test script, you must have `yq` (v4.44.6 or later), `gunzip` and version 4 or later of `bash` installed on your system. Then follow all steps to install `grepq` _from source_ (refer to the instructions in the Installation section), `cd` into the `examples` directory and run the following command:\n\n```bash\n./test.sh commands-1.yaml; ./test.sh commands-2.yaml; ./test.sh commands-3.yaml; ./test.sh commands-4.yaml\n```\n\nIf all tests pass, there will be no orange (warning) text in the output, and no test will\nreport a failure. 
A summary of the number of passing and failing tests will be displayed at the end of the output. All tests should pass.\n\n_Example of failing test output:_\n\n\u003cspan style=\"color: rgb(255, 165, 0);\"\u003e\ntest-7 failed \u003cbr\u003e\nexpected: 54 counts \u003cbr\u003e\ngot: 53 counts \u003cbr\u003e\ncommand was: ../target/release/grepq -c 16S-no-iupac.txt small.fastq \u003cbr\u003e\n\u003c/span\u003e\n\u003cbr\u003e\n\nFurther, you can run the `cookbook.sh` script in the `examples` directory to test the cookbook examples, and you can use `predate` (\u003chttps://crates.io/crates/predate\u003e) if you prefer a Rust application to a shell script.\n\n**SARS-CoV-2 example**\n\nCount of the top five most frequently matched patterns found in SRX26602697.fastq using the pattern file SARS-CoV-2.txt (this pattern file contains 64 sequences of length 60 from Table II of this [preprint](https://doi.org/10.1101/2021.04.14.439840)):\n\n```bash\ntime grepq SARS-CoV-2.txt SRX26602697.fastq tune -n 10000 -c | head -5\nGTATGGAAAAGTTATGTGCATGTTGTAGACGGTTGTAATTCATCAACTTGTATGATGTGT: 1595\nCGGAACGTTCTGAAAAGAGCTATGAATTGCAGACACCTTTTGAAATTAAATTGGCAAAGA: 693\nTCCTTACTGCGCTTCGATTGTGTGCGTACTGCTGCAATATTGTTAACGTGAGTCTTGTAA: 356\nGCGCTTCGATTGTGTGCGTACTGCTGCAATATTGTTAACGTGAGTCTTGTAAAACCTTCT: 332\nCCGTAGCTGGTGTCTCTATCTGTAGTACTATGACCAATAGACAGTTTCATCAAAAATTAT: 209\n\n________________________________________________________\nExecuted in  218.80 millis    fish           external\n   usr time  188.97 millis    0.09 millis  188.88 millis\n   sys time   31.47 millis    4.98 millis   26.49 millis\n\n```\n\nObtain `SRX26602697.fastq` from the SRA using `fastq-dump --accession SRX26602697`.\n\n## Further testing\n\n`grepq` can be tested using tools that generate synthetic FASTQ files, such as `spikeq` (\u003chttps://crates.io/crates/spikeq\u003e)\n\nYou can verify that `grepq` has found the regex patterns by using tools such as `grep` and `ripgrep`, using their ability to color-match 
the regex patterns (this feature is not available in `grepq` as that would make the code more complicated; code maintainability is an objective of this project). Recall, however, that `grep` and `ripgrep` will match the regex patterns to the entire FASTQ record, which includes the record ID, sequence, separator, and quality fields, occasionally leading to false positives.\n\n## Citation\n\nIf you use `grepq` in your research, please cite as follows:\n\nCrosbie, N. D., (2025). grepq: A Rust application that quickly filters FASTQ files by matching sequences to a set of regular expressions. Journal of Open Source Software, 10(110), 8048, \u003chttps://doi.org/10.21105/joss.08048\u003e\n\n@article{Crosbie2025, doi = {10.21105/joss.08048}, url = {https://doi.org/10.21105/joss.08048}, year = {2025}, publisher = {The Open Journal}, volume = {10}, number = {110}, pages = {8048}, author = {Nicholas D. Crosbie}, title = {grepq: A Rust application that quickly filters FASTQ files by matching sequences to a set of regular expressions}, journal = {Journal of Open Source Software} }\n\n## Update changes\n\nsee [CHANGELOG](https://github.com/Rbfinch/grepq/blob/main/CHANGELOG.md)\n\n## Contributing and issue reporting\n\nsee [CONTRIBUTING](https://github.com/Rbfinch/grepq/blob/main/CONTRIBUTING.md)\n\n## License\n\nMIT\n","funding_links":["https://github.com/sponsors/Rbfinch"],"categories":["Data Processing"],"sub_categories":["Command Line Utilities"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FRbfinch%2Fgrepq","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FRbfinch%2Fgrepq","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FRbfinch%2Fgrepq/lists"}