Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/hrbrmstr/jwatr

:card_index: Tools to Query and Create Web Archive Files Using the Java Web Archive Toolkit in R
https://github.com/hrbrmstr/jwatr

java r r-cyber rstats warc

Last synced: 23 days ago
JSON representation

:card_index: Tools to Query and Create Web Archive Files Using the Java Web Archive Toolkit in R

Awesome Lists containing this project

README

        

---
output: rmarkdown::github_document
---

`jwatr` : Tools to Query and Create Web Archive Files Using the Java Web Archive Toolkit

The Java Web Archive Toolkit ('JWAT')
is a library of Java objects and methods which enables reading, writing and validating web archive files.

WIP!!! Reading & writing need some optimization and edge case checking. There's also a chance I'll change the name to `warc` but some folks are using that package now and I dinna want to cause pain there yet.

The following functions are implemented:

**Reading**

- `read_warc`: Read a WARC file (compressed or uncompressed)
- `warc_stream_in`: Stream in records from a WARC file

**Writing**

- `warc_file`: Create a new WARC file
- `warc_write_warcinfo`: Write a 'warcinfo' record to a WARC File
- `warc_write_response`: Write simple `httr::GET` requests or full `httr` `response` objects to a WARC file
- `close_warc_file`: Close a WARC file

**`httr` Wrappers**

- `warc_GET`: WARC-ify an httr::GET request
- `warc_POST`: WARC-ify an httr::GET request

**Utility**

- `response_list_to_warc_file`: Turns a list of 'httr' 'response' objects into a WARC file
- `payload_content`: Helper function to convert WARC raw headers+payload into something useful
- `is_compressed`: Test if a raw vector is gzip compressed

NOTE: To read in typical (~800MB-1GB gzip'd WARC files) you should consider doing the following (in order) in your scripts:

```{r eval=FALSE}
options(java.parameters = "-Xmx2g")

library(rJava)
library(jwatjars)
library(jwatr)
```

That idiom generally provides enough heap space, but you may need to adjust the heap size if you've got larger payloads.

Alternatively, you can set the same option in your R startup scripts, but that will likely come back to bite you when moving workloads around.

### Installation

```{r eval=FALSE}
devtools::install_github("hrbrmstr/jwatr")
```

```{r message=FALSE, warning=FALSE, error=FALSE, include=FALSE}
options(width=120)
```

### Reading WARC Files

```{r message=FALSE, warning=FALSE, error=FALSE}
library(rJava)
library(jwatr)
library(magick)
library(tidyverse)

# current verison
packageVersion("jwatr")
```

```{r message=FALSE, warning=FALSE, error=FALSE}
# small, uncompressed WARC file
glimpse(read_warc(system.file("extdata/bbc.warc", package="jwatr")))

# larger example
xdf <- read_warc(system.file("extdata/sample.warc.gz", package="jwatr"),
warc_types = "response", include_payload = TRUE)

glimpse(xdf)

# get the payload content
payload_content(url = xdf$target_uri[279], ctype = xdf$http_protocol_content_type[279],
xdf$http_raw_headers[[279]], xdf$payload[[279]])

# or ingest the raw bits yourself
imgs <- filter(xdf, grepl("(png|gif|jpeg)$", http_protocol_content_type))

imgs

image_read(imgs$payload[[1]])
```

![](imgs/img1.jpeg)

### Writing WARC Files

```{r message=FALSE, warning=FALSE, error=FALSE}
library(jwatr)
library(httr)
library(magick)
library(tidyverse)

tf <- tempfile("test")
wf <- warc_file(tf)

warc_write_response(wf, "https://rud.is/b/")

# store a simple httr::GET request
warc_write_response(wf, GET("https://rud.is/b/"))

warc_write_response(wf, "https://www.rstudio.com/")
warc_write_response(wf, "https://www.r-project.org/")

# all valid content types work, like this PDF
warc_write_response(wf, "http://che.org.il/wp-content/uploads/2016/12/pdf-sample.pdf")

# complex API calls can be made and the results stored in the WARC file as well
# this API call returns a JSON object
POST(
url = "https://data.police.uk/api/crimes-street/all-crime",
query = list( lat = "52.629729", lng = "-1.131592", date = "2017-01")
) -> uk_res

warc_write_response(wf, uk_res)
warc_write_response(wf, "https://journal.r-project.org/RLogo.png")

close_warc_file(wf)

xdf <- read_warc(sprintf("%s.warc.gz", tf), include_payload = TRUE)

glimpse(xdf)

# decode the WARC stored JSON response from the UK Crimes API
glimpse(jsonlite::fromJSON(rawToChar(xdf[6,]$payload[[1]]), flatten=TRUE))

select(xdf, content_length, http_protocol_content_type)

image_read(xdf$payload[[5]])
```

![](imgs/img2.png)

```{r echo=FALSE}
unlink(tf)
```

### Streaming

The `warc_stream_in()` function provides a pure-R method for stream processing WARC
files through the use of an R callback handler. One way of using this is to build a
data frame. The following example builds a data frame of WARC `response` records. Space
is reserved for a 10,000-element list which will get truncated or expanded as necessary:

```{r message=FALSE, warning=FALSE, error=FALSE}
xdf <- list(10000)
xdf_i <- 0

myfun <- function(headers, payload, ...) {
headers <- setNames(headers, gsub("-", "_", names(headers)))
xdf_i <<- xdf_i + 1
headers$payload <- list(payload)
xdf[xdf_i] <<- list(headers)
}

(n <- warc_stream_in(
system.file("extdata/sample.warc.gz", package="jwatr"),
myfun,
warc_types = "response"
))

xdf <- bind_rows(xdf)

glimpse(xdf)

count(xdf, content_type)

cat(rawToChar(xdf$payload[[1]]))
```

### Test Results

```{r message=FALSE, warning=FALSE, error=FALSE}
library(jwatr)
library(testthat)

date()

test_dir("tests/")
```

### Code of Conduct

Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md). By participating in this project you agree to abide by its terms.