An open API service indexing awesome lists of open source software.

https://github.com/r-lib/xmlparsedata

R code parse data as an XML tree
https://github.com/r-lib/xmlparsedata

r xml

Last synced: about 1 year ago
JSON representation

R code parse data as an XML tree

Awesome Lists containing this project

README

          

---
output: github_document
---

```{r}
#| label: setup
#| echo: false
#| message: false
knitr::opts_chunk$set(
comment = "#>",
tidy = FALSE,
error = FALSE
)
```

# xmlparsedata

> Parse Data of R Code as an 'XML' Tree

[![R-CMD-check](https://github.com/r-lib/xmlparsedata/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/r-lib/xmlparsedata/actions/workflows/R-CMD-check.yaml)
[![](https://www.r-pkg.org/badges/version/xmlparsedata)](https://www.r-pkg.org/pkg/xmlparsedata)
[![CRAN RStudio mirror downloads](https://cranlogs.r-pkg.org/badges/xmlparsedata)](https://www.r-pkg.org/pkg/xmlparsedata)
[![Codecov test coverage](https://codecov.io/gh/r-lib/xmlparsedata/graph/badge.svg)](https://app.codecov.io/gh/r-lib/xmlparsedata)

Convert the output of 'utils::getParseData()' to an 'XML' tree, that is
searchable and easier to manipulate in general.

---

- [Installation](#installation)
- [Usage](#usage)
- [Introduction](#introduction)
- [`utils::getParseData()`](#utilsgetparsedata)
- [`xml_parse_data()`](#xml_parse_data)
- [Renaming some tokens](#renaming-some-tokens)
- [Search the parse tree with `xml2`](#search-the-parse-tree-with-xml2)
- [License](#license)

## Installation

Stable version:

```{r}
#| eval: false
install.packages("xmlparsedata")
```

Development version:

```{r}
#| eval: false
pak::pak("r-lib/zip")
```

## Usage

### Introduction

In recent R versions the parser can attach source code location
information to the parsed expressions. This information is often
useful for static analysis, e.g. code linting. It can be accessed
via the `utils::getParseData()` function.

`xmlparsedata` converts this information to an XML tree.
The R parser's token names are preserved in the XML as much as
possible, but some of them are not valid XML tag names, so they are
renamed, see below.

### `utils::getParseData()`

`utils::getParseData()` summarizes the parse information in a data
frame. The data frame has one row per expression tree node, and each
node points to its parent. Here is a small example:

```{r}
p <- parse(
text = "function(a = 1, b = 2) { \n a + b\n}\n",
keep.source = TRUE
)
getParseData(p)
```

### `xml_parse_data()`

`xmlparsedata::xml_parse_data()` converts the parse information to
an XML document. It works similarly to `getParseData()`. Specify the
`pretty = TRUE` option to pretty-indent the XML output. Note that this
has a small overhead, so if you are parsing large files, I suggest you
omit it.

```{r}
library(xmlparsedata)
xml <- xml_parse_data(p, pretty = TRUE)
cat(xml)
```

The top XML tag is ``, which is a list of
expressions, each expression is an `` tag. Each tag
has attributes that define the location: `line1`, `col1`,
`line2`, `col2`. These are from the `getParseData()`
data frame column names.

### Renaming some tokens

The R parser's token names are preserved in the XML as much as
possible, but some of them are not valid XML tag names, so they are
renamed, see the `xml_parse_token_map` vector for the mapping:

```{r}
xml_parse_token_map
```

### Search the parse tree with `xml2`

The `xml2` package can search XML documents using
[XPath](https://en.wikipedia.org/wiki/XPath) expressions. This is often
useful to search for specific code patterns.

As an example we search a source file from base R for `1:nrow()`
expressions, which are usually unsafe, as `nrow()` might be zero,
and then the expression is equivalent to `1:0`, i.e. `c(1, 0)`, which
is usually not the intended behavior.

We load and parse the file directly from the the R source code mirror
at https://github.com/wch/r-source:

```{r}
url <- paste0(
"https://raw.githubusercontent.com/wch/r-source/",
"4fc93819fc7401b8695ce57a948fe163d4188f47/src/library/tools/R/xgettext.R"
)
src <- readLines(url)
p <- parse(text = src, keep.source = TRUE)
```

and we convert it to an XML tree:

```{r}
library(xml2)
xml <- read_xml(xml_parse_data(p))
```

The `1:nrow()` expression corresponds to the following
tree in R:

```

+--
+-- NUM_CONST: 1
+-- ':'
+--
+--
+-- SYMBOL_FUNCTION_CALL nrow
+-- '('
+--
+-- ')'
```

```{r}
bad <- xml_parse_data(
parse(text = "1:nrow(expr)", keep.source = TRUE),
pretty = TRUE
)
cat(bad)
```

This translates to the following XPath expression (ignoring
the last tree tokens from the `length(expr)` expressions):

```{r}
xp <- paste0(
"//expr",
"[expr[NUM_CONST[text()='1']]]",
"[OP-COLON]",
"[expr[expr[SYMBOL_FUNCTION_CALL[text()='nrow']]]]"
)
```

We can search for this subtree with `xml2::xml_find_all()`:

```{r}
bad_nrow <- xml_find_all(xml, xp)
bad_nrow
```

There is only one hit, in line 334:

```{r}
cbind(332:336, src[332:336])
```

## Code of Conduct

Please note that the xmlparsedata project is released with a
[Contributor Code of Conduct](https://r-lib.github.io/xmlparsedata/CODE_OF_CONDUCT.html).
By contributing to this project, you agree to abide by its terms.

## License

MIT © Mango Solutions, RStudio