{"id":13707378,"url":"https://github.com/SystemsBioinformatics/parcr","last_synced_at":"2025-05-06T03:31:23.810Z","repository":{"id":215245417,"uuid":"736972435","full_name":"SystemsBioinformatics/parcr","owner":"SystemsBioinformatics","description":"Construct parser combinators in R","archived":false,"fork":false,"pushed_at":"2024-06-07T07:23:17.000Z","size":602,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-11-08T06:10:40.529Z","etag":null,"topics":["combinators","higher-order-functions","parser","parsing","r-package"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SystemsBioinformatics.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-12-29T12:17:54.000Z","updated_at":"2024-06-07T07:23:20.000Z","dependencies_parsed_at":null,"dependency_job_id":"0bcecc2c-3138-4f9e-bca4-a9e44ddac837","html_url":"https://github.com/SystemsBioinformatics/parcr","commit_stats":null,"previous_names":["systemsbioinformatics/parcr"],"tags_count":19,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SystemsBioinformatics%2Fparcr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SystemsBioinformatics%2Fparcr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SystemsBioinformatics%2Fparcr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SystemsBioinformatics%2Fparcr/manifests","owner_url":"ht
tps://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SystemsBioinformatics","download_url":"https://codeload.github.com/SystemsBioinformatics/parcr/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224484506,"owners_count":17318987,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["combinators","higher-order-functions","parser","parsing","r-package"],"created_at":"2024-08-02T22:01:29.473Z","updated_at":"2024-11-13T16:30:21.356Z","avatar_url":"https://github.com/SystemsBioinformatics.png","language":"HTML","readme":"---\noutput: github_document\n---\n\n\u003c!-- README.md is generated from README.Rmd. Edit README.Rmd --\u003e\n\n```{r, include = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\"\n)\nlibrary(parcr)\n```\n\n\u003c!-- badges: start --\u003e\n[![CRAN status](https://www.r-pkg.org/badges/version/parcr)](https://cran.r-project.org/package=parcr)\n[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)\n[![R-CMD-check](https://github.com/SystemsBioinformatics/parcr/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/SystemsBioinformatics/parcr/actions/workflows/R-CMD-check.yaml)\n\u003c!-- badges: end --\u003e\n\n## Construct parser combinator functions for parsing character vectors\n\nThis R package contains tools to construct parser combinator functions, higher \norder functions that parse input. 
The main goal of this package is to simplify\nthe creation of *transparent* parsers for structured text files generated by \nmachines like laboratory instruments. Such files consist of lines of text \norganized in higher-order structures like headers with metadata and blocks of \nmeasured values. To read these data into R you first need to create a parser\nthat processes these files and creates R-objects as output. The `parcr` package\nsimplifies the task of creating such parsers.\n\nThis package was inspired by the package \n[\"Ramble\"](https://github.com/8bit-pixies/Ramble) by Chapman Siu and co-workers \nand by the paper\n[\"Higher-order functions for parsing\"](https://doi.org/10.1017/S0956796800000411) \nby [Graham Hutton](https://orcid.org/0000-0001-9584-5150) (1992).\n\n## Installation\n\nInstall the stable version from CRAN:\n\n```\ninstall.packages(\"parcr\")\n```\n\nTo install the development version, including its vignette, run the following command:\n\n```\nremotes::install_github(\"SystemsBioinformatics/parcr\", build_vignettes=TRUE)\n```\n\n## Example application: a parser for *fasta* sequence files\n\nAs an example of a realistic application we write a parser for \nfasta-formatted files for nucleotide and protein sequences. We use a few \nsimplifying assumptions about this format for the sake of the example. Real \nfasta files are more complex than we pretend here.\n\n*Please note that more background about the functions that we use here is \navailable in the package documentation. Here we only present a summary.*\n\nA fasta file with mixed sequence types could look like the example below:\n\n```{r, echo=FALSE, comment = NA}\ndata(\"fastafile\")\ncat(paste0(fastafile, collapse=\"\\n\"))\n```\n\nSince fasta files are text files we could read such a file using `readLines()`\ninto a character vector. 
The package provides the data set `fastafile`, which \ncontains that character vector.\n\n```{r, eval=FALSE}\ndata(\"fastafile\")\n```\n\nWe can distinguish the following higher-order components in a fasta file:\n \n- A **fasta** file: consists of one or more **sequence blocks** until the \n  **end of the file**.\n- A **sequence block**: consists of a **header** and a \n  **nucleotide sequence** or a **protein sequence**. A sequence block could be\n  preceded by zero or more **empty lines**.\n- A **nucleotide sequence**: consists of one or more \n  **nucleotide sequence strings**.\n- A **protein sequence**: consists of one or more \n  **protein sequence strings**.\n- A **header** is a *string* that starts with a \"\u003e\" immediately followed by\n  a **title** without spaces.\n- A **nucleotide sequence string** is a *string* without spaces that consists\n  *entirely* of symbols from the set `{G,A,T,C}`.\n- A **protein sequence string** is a *string* without spaces that consists\n  *entirely* of symbols from the set `{A,R,N,D,B,C,E,Q,Z,G,H,I,L,K,M,F,P,S,T,W,Y,V}`.\n\nIt now becomes clear what we mean when we say that the package allows us\nto write *transparent* parsers: the description above of the structure of fasta\nfiles can be put straight into code for a `Fasta()` parser:\n\n```{r}\nFasta \u003c- function() {\n  one_or_more(SequenceBlock()) %then%\n    eof()\n}\n\nSequenceBlock \u003c- function() {\n  MaybeEmpty() %then% \n    Header() %then% \n    (NuclSequence() %or% ProtSequence()) %using%\n    function(x) list(x)\n}\n\nNuclSequence \u003c- function() {\n  one_or_more(NuclSequenceString()) %using% \n    function(x) list(type = \"Nucl\", sequence = paste(x, collapse=\"\"))\n}\n\nProtSequence \u003c- function() {\n  one_or_more(ProtSequenceString()) %using% \n    function(x) list(type = \"Prot\", sequence = paste(x, collapse=\"\"))\n}\n```\n\nFunctions like `one_or_more()`, `%then%`, `%or%`, `%using%`, `eof()` and\n`MaybeEmpty()` are defined in the package and 
are the basic parsers with\nwhich the package user can build complex parsers. The `%using%` operator uses\nthe function on its right-hand side to modify parser output on its left-hand \nside. Please see the vignette in the `parcr` package for more explanation of why\nthis is useful or even necessary.\n\nNotice that the new parser functions that we define above are higher-order \nfunctions taking no input, hence the empty argument brackets `()` behind their\nnames. Now we need to define the line-parsers `Header()`, `NuclSequenceString()`\nand `ProtSequenceString()` that recognize and process the header line and \nsingle lines of nucleotide or protein sequences in the character vector \n`fastafile`. We use functions from `stringr` to do this in a few helper \nfunctions, and we use `match_s()` to create `parcr` parsers from these.\n\n```{r}\n# returns the title after the \"\u003e\" in the sequence header\nparse_header \u003c- function(line) {\n  # Study stringr::str_match() to understand what we do here\n  m \u003c- stringr::str_match(line, \"^\u003e(\\\\w+)\")\n  if (is.na(m[1])) {\n    return(list()) # signal failure: no title found\n  } else {\n    return(m[2])\n  }\n}\n\n# returns a nucleotide sequence string\nparse_nucl_sequence_line \u003c- function(line) {\n  # The line must consist of GATC from the start (^) until the end ($)\n  m \u003c- stringr::str_match(line, \"^([GATC]+)$\")\n  if (is.na(m[1])) {\n    return(list()) # signal failure: not a valid nucleotide sequence string\n  } else {\n    return(m[2])\n  }\n}\n\n# returns a protein sequence string\nparse_prot_sequence_line \u003c- function(line) {\n  # The line must consist of ARNDBCEQZGHILKMFPSTWYV from the start (^) until the\n  # end ($)\n  m \u003c- stringr::str_match(line, \"^([ARNDBCEQZGHILKMFPSTWYV]+)$\")\n  if (is.na(m[1])) {\n    return(list()) # signal failure: not a valid protein sequence string\n  } else {\n    return(m[2])\n  }\n}\n```\n\nThen we define the line-parsers.\n\n```{r}\nHeader 
\u003c- function() {\n  match_s(parse_header) %using% \n    function(x) list(title = unlist(x))\n}\n\nNuclSequenceString \u003c- function() {\n  match_s(parse_nucl_sequence_line)\n}\n\nProtSequenceString \u003c- function() {\n  match_s(parse_prot_sequence_line)\n}\n```\n\nwhere `match_s()` is also a parser defined in `parcr`.\n\nNow we have all the elements that we need to apply the `Fasta()` parser.\n\n```{r}\nFasta()(fastafile)\n```\n\nThe output of the parser consists of two elements, `L` and `R`, where `L` \ncontains the parsed and processed part of the input and `R` the remaining \nun-parsed part of the input. Since we explicitly demanded to parse until the \nend of the file by the `eof()` function in the definition of the `Fasta()` \nparser, the `R` element contains an empty list to signal that the parser was\nindeed at the end of the input. Please see the package documentation for more\nexamples and explanation.\n\nFinally, let's present the result of the parse more concisely using the names \nof the elements inside the `L` element:\n\n```{r}\nd \u003c- Fasta()(fastafile)[[\"L\"]]\ninvisible(lapply(d, function(x) {cat(x$type, x$title, x$sequence, \"\\n\")}))\n```\n","funding_links":[],"categories":["HTML"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSystemsBioinformatics%2Fparcr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FSystemsBioinformatics%2Fparcr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSystemsBioinformatics%2Fparcr/lists"}