{"id":32203007,"url":"https://github.com/systemsbioinformatics/parcr","last_synced_at":"2026-02-19T23:01:12.734Z","repository":{"id":215245417,"uuid":"736972435","full_name":"SystemsBioinformatics/parcr","owner":"SystemsBioinformatics","description":"Construct parser combinators in R","archived":false,"fork":false,"pushed_at":"2026-02-13T12:19:42.000Z","size":684,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2026-02-13T21:27:54.038Z","etag":null,"topics":["combinators","higher-order-functions","parser","parsing","r-package"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SystemsBioinformatics.png","metadata":{"files":{"readme":"README.Rmd","changelog":"NEWS.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2023-12-29T12:17:54.000Z","updated_at":"2026-02-13T12:19:47.000Z","dependencies_parsed_at":null,"dependency_job_id":"0bcecc2c-3138-4f9e-bca4-a9e44ddac837","html_url":"https://github.com/SystemsBioinformatics/parcr","commit_stats":null,"previous_names":["systemsbioinformatics/parcr"],"tags_count":21,"template":false,"template_full_name":null,"purl":"pkg:github/SystemsBioinformatics/parcr","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SystemsBioinformatics%2Fparcr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SystemsBioinformatics%2Fparcr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/SystemsBioinformatics%2Fparcr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SystemsBioinformatics%2Fparcr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SystemsBioinformatics","download_url":"https://codeload.github.com/SystemsBioinformatics/parcr/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SystemsBioinformatics%2Fparcr/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29636035,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-19T22:32:43.237Z","status":"ssl_error","status_checked_at":"2026-02-19T22:32:38.330Z","response_time":117,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["combinators","higher-order-functions","parser","parsing","r-package"],"created_at":"2025-10-22T04:25:45.777Z","updated_at":"2026-02-19T23:01:12.727Z","avatar_url":"https://github.com/SystemsBioinformatics.png","language":"HTML","readme":"---\noutput: github_document\n---\n\n\u003c!-- README.md is generated from README.Rmd. 
Edit README.Rmd --\u003e\n\n```{r, include = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\"\n)\nlibrary(parcr)\n```\n\n\u003c!-- badges: start --\u003e\n[![CRAN status](https://www.r-pkg.org/badges/version/parcr)](https://cran.r-project.org/package=parcr)\n[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)\n[![R-CMD-check](https://github.com/SystemsBioinformatics/parcr/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/SystemsBioinformatics/parcr/actions/workflows/R-CMD-check.yaml)\n\u003c!-- badges: end --\u003e\n\n## Construct parser combinator functions for parsing character vectors\n\nThis R package contains tools to construct parser combinator functions, higher \norder functions that parse input. The main goal of this package is to simplify\nthe creation of *transparent* parsers for structured text files generated by \nmachines like laboratory instruments. Such files consist of lines of text \norganized in higher-order structures like headers with metadata and blocks of \nmeasured values. To read these data into R you first need to create a parser\nthat processes these files and creates R-objects as output. 
The `parcr` package\nsimplifies the task of creating such parsers.\n\nThis package was inspired by the package \n[\"Ramble\"](https://github.com/NoRaincheck/Ramble) by Chapman Siu and co-workers \nand by the paper\n[\"Higher-order functions for parsing\"](https://doi.org/10.1017/S0956796800000411) \nby [Graham Hutton](https://orcid.org/0000-0001-9584-5150) (1992).\n\n## Installation\n\nInstall the stable version from CRAN:\n\n```\ninstall.packages(\"parcr\")\n```\n\nTo install the development version, including its vignette, run the following\ncommand (this assumes that the `remotes` package is installed):\n\n```\nremotes::install_github(\"SystemsBioinformatics/parcr\", build_vignettes=TRUE)\n```\n\n## Example application: a parser for *fasta* sequence files\n\nAs an example of a realistic application we write a parser for \nfasta-formatted files for nucleotide and protein sequences. We use a few \nsimplifying assumptions about this format for the sake of the example. Real \nfasta files are more complex than we pretend here.\n\n*Please note that more background about the functions that we use here is \navailable in the package documentation. Here we only present a summary.*\n\nA fasta file with mixed sequence types could look like the example below:\n\n```{r, echo=FALSE, comment = NA}\ndata(\"fastafile\")\ncat(paste0(fastafile, collapse=\"\\n\"))\n```\n\nSince fasta files are text files we could read such a file using `readLines()`\ninto a character vector. The package provides the data set `fastafile` which \ncontains that character vector.\n\n```{r, eval=FALSE}\ndata(\"fastafile\")\n```\n\nWe can distinguish the following higher order components in a fasta file:\n \n- A **fasta** file: consists of one or more **sequence blocks** until the \n  **end of the file**.\n- A **sequence block**: consists of a **header** and a \n  **nucleotide sequence** or a **protein sequence**. 
A sequence block could be\n  preceded by zero or more **empty lines**.\n- A **nucleotide sequence**: consists of one or more \n  **nucleotide sequence strings**.\n- A **protein sequence**: consists of one or more \n  **protein sequence strings**.\n- A **header** is a *string* that starts with a \"\u003e\" immediately followed by\n  a **title** without spaces.\n- A **nucleotide sequence string** is a *string* without spaces that consists\n  *entirely* of symbols from the set `{G,A,T,C}`.\n- A **protein sequence string** is a *string* without spaces that consists\n  *entirely* of symbols from the set `{A,R,N,D,B,C,E,Q,Z,G,H,I,L,K,M,F,P,S,T,W,Y,V}`.\n\nIt now becomes clear what we mean when we say that the package allows us\nto write *transparent* parsers: the description above of the structure of fasta\nfiles can be put straight into code for a `Fasta()` parser:\n\n```{r}\nFasta \u003c- function() {\n  one_or_more(SequenceBlock()) %then%\n    eof()\n}\n\nSequenceBlock \u003c- function() {\n  MaybeEmpty() %then% \n    Header() %then% \n    (NuclSequence() %or% ProtSequence()) %using%\n    function(x) list(x)\n}\n\nNuclSequence \u003c- function() {\n  one_or_more(NuclSequenceString()) %using% \n    function(x) list(type = \"Nucl\", sequence = paste(x, collapse=\"\"))\n}\n\nProtSequence \u003c- function() {\n  one_or_more(ProtSequenceString()) %using% \n    function(x) list(type = \"Prot\", sequence = paste(x, collapse=\"\"))\n}\n```\n\nFunctions like `one_or_more()`, `%then%`, `%or%`, `%using%`, `eof()` and\n`MaybeEmpty()` are defined in the package and are the basic parsers with\nwhich the package user can build complex parsers. The `%using%` operator uses\nthe function on its right-hand side to modify parser output on its left-hand \nside. 
Please see the vignette in the `parcr` package for an explanation of why\nthis is useful, or even necessary.\n\nNotice that the new parser functions that we define above are higher-order \nfunctions taking no input, hence the empty argument brackets `()` after their\nnames.\n\nNow we need to define the parsers `Header()`, `NuclSequenceString()`\nand `ProtSequenceString()` that actually recognize and process the header line \nstring and strings of nucleotide or protein sequences in the character vector \n`fastafile`. We use the function constructor `stringparser()` from the package \nto construct helper functions that recognize and capture the desired matches, \nand we use `match_s()` to create `parcr`-compliant parsers from these.\n\n```{r}\nHeader \u003c- function() {\n  match_s(stringparser(\"^\u003e(\\\\w+)\")) %using% \n    function(x) list(title = unlist(x))\n}\n\nNuclSequenceString \u003c- function() {\n  match_s(stringparser(\"^([GATC]+)$\"))\n}\n\nProtSequenceString \u003c- function() {\n  match_s(stringparser(\"^([ARNDBCEQZGHILKMFPSTWYV]+)$\"))\n}\n```\n\nNow we have all the elements that we need to apply the `Fasta()` parser.\n\n```{r}\nFasta()(fastafile)\n```\n\nThe output of the parser consists of two elements, `L` and `R`, where `L` \ncontains the parsed and processed part of the input and `R` the remaining \nunparsed part of the input. Since we explicitly required the parser to continue until the \nend of the file by using the `eof()` function in the definition of the `Fasta()` \nparser, the `R` element contains an empty list to signal that the parser was\nindeed at the end of the input. 
Please see the package documentation for more\nexamples and explanation.\n\nFinally, let's present the result of the parse more concisely using the names \nof the elements inside the `L` element:\n\n```{r}\nd \u003c- Fasta()(fastafile)[[\"L\"]]\ninvisible(lapply(d, function(x) {cat(x$type, x$title, x$sequence, \"\\n\")}))\n```\n\n## Getting useful error messages when parsing\n\nBasic error messaging is implemented in the function `reporter()`. You can wrap \na parser in the `reporter()` function to obtain an error message that reports\nthe line of the input in which the parser ultimately failed as well as some lines\naround it to provide context. Suppose we have the following badly formatted\nfasta file:\n\n```{r}\nbad_header \u003c- c(\n  \"*sequence_A\",\n  \"GGTAAGTCCTCTAGTACAAACACCCCCAAT\",\n  \"\u003esequence_B\",\n  \"ATTGTGATATAATTAAAATTATATTCATAT\"\n)\n```\n\nNote that the first header starts with `*` instead of a `\u003e`. Upgrading the \n`Fasta()` parser with the `reporter()` function to an *error reporting parser*\nyields a basic error message:\n\n```{r}\n#| eval: false\nreporter(Fasta())(bad_header)\n```\n```{r}\n#| echo: false\ntry(reporter(Fasta())(bad_header))\n```\n\n\nWe could, however, get better error messaging by upgrading the `Header()` parser \nto a named parser:\n\n```{r}\nHeader \u003c- function() {\n  named(\n    match_s(stringparser(\"^\u003e(\\\\w+)\")) %using% \n      function(x) list(title = unlist(x)),\n    \"FASTA header (\u003esequence_name)\"\n  )\n}\n```\n\nwhere the first argument to the `named()` function is a parser body and the \nsecond argument is a brief description of the parser. 
Now, the reporter yields \na more detailed message: \n\n```{r}\n#| eval: false\nreporter(Fasta())(bad_header)\n```\n\n\n```{r}\n#| echo: false\ntry(reporter(Fasta())(bad_header))\n```\n\nSuppose we have the following bad fasta file:\n\n```{r}\nmissing_sequence \u003c- c(\n  \"\u003esequence_A\",\n  \"\u003esequence_B\",\n  \"ATTGTGATATAATTAAAATTATATTCATAT\"\n)\n```\n\nUpgrading the `NuclSequence` and `ProtSequence` to named parsers yields a \nbetter error message:\n\n```{r}\nNuclSequence \u003c- function() {\n  named(\n    one_or_more(NuclSequenceString()) %using% \n      function(x) list(type = \"Nucl\", sequence = paste(x, collapse=\"\")),\n    \"Nucleotide_Sequence\"\n  )\n}\n\nProtSequence \u003c- function() {\n  named(\n    one_or_more(ProtSequenceString()) %using% \n      function(x) list(type = \"Prot\", sequence = paste(x, collapse=\"\")),\n    \"Protein_Sequence\"\n    \n  )\n}\n```\n\n```{r}\n#| eval: false\nreporter(Fasta())(missing_sequence)\n```\n\n```{r}\n#| echo: false\ntry(reporter(Fasta())(missing_sequence))\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsystemsbioinformatics%2Fparcr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsystemsbioinformatics%2Fparcr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsystemsbioinformatics%2Fparcr/lists"}