https://github.com/hrbrmstr/jwatr

:card_index: Tools to Query and Create Web Archive Files Using the Java Web Archive Toolkit in R
https://github.com/hrbrmstr/jwatr

java r r-cyber rstats warc

Last synced: 5 months ago
JSON representation

:card_index: Tools to Query and Create Web Archive Files Using the Java Web Archive Toolkit in R

Host: GitHub
URL: https://github.com/hrbrmstr/jwatr
Owner: hrbrmstr
License: apache-2.0
Created: 2017-08-18T18:10:09.000Z (almost 8 years ago)
Default Branch: master
Last Pushed: 2017-09-04T03:44:06.000Z (almost 8 years ago)
Last Synced: 2024-08-06T03:05:01.856Z (11 months ago)
Topics: java, r, r-cyber, rstats, warc
Language: R
Homepage:
Size: 38.6 MB
Stars: 7
Watchers: 4
Forks: 1
Open Issues: 2
Metadata Files:
- Readme: README.Rmd
- License: LICENSE

Awesome Lists containing this project

README

        ---

output: rmarkdown::github_document

---

`jwatr` : Tools to Query and Create Web Archive Files Using the Java Web Archive Toolkit

The Java Web Archive Toolkit ('JWAT') 

is a library of Java objects and methods which enables reading, writing and validating web archive files.

WIP!!! Reading & writing need some optimization and edge case checking. There's also a chance I'll change the name to `warc` but some folks are using that package now and I dinna want to cause pain there yet.

The following functions are implemented:

**Reading**

- `read_warc`:	Read a WARC file (compressed or uncompressed)

- `warc_stream_in`:	Stream in records from a WARC file

**Writing**

- `warc_file`:	Create a new WARC file

- `warc_write_warcinfo`:	Write a 'warcinfo' record to a WARC File

- `warc_write_response`:	Write simple `httr::GET` requests or full `httr` `response` objects to a WARC file

- `close_warc_file`:	Close a WARC file

**`httr` Wrappers**

- `warc_GET`:	WARC-ify an httr::GET request

- `warc_POST`:	WARC-ify an httr::GET request

**Utility**

- `response_list_to_warc_file`:	Turns a list of 'httr' 'response' objects into a WARC file

- `payload_content`:	Helper function to convert WARC raw headers+payload into something useful

- `is_compressed`:	Test if a raw vector is gzip compressed

NOTE: To read in typical (~800MB-1GB gzip'd WARC files) you should consider doing the following (in order) in your scripts:

```{r eval=FALSE}

options(java.parameters = "-Xmx2g")

library(rJava)

library(jwatjars)

library(jwatr)

```

That idiom generally provides enough heap space, but you may need to adjust the heap size if you've got larger payloads.

Alternatively, you can set the same option in your R startup scripts, but that will likely come back to bite you when moving workloads around.

### Installation

```{r eval=FALSE}

devtools::install_github("hrbrmstr/jwatr")

```

```{r message=FALSE, warning=FALSE, error=FALSE, include=FALSE}

options(width=120)

```

### Reading WARC Files

```{r message=FALSE, warning=FALSE, error=FALSE}

library(rJava)

library(jwatr)

library(magick)

library(tidyverse)

# current verison

packageVersion("jwatr")

```

```{r message=FALSE, warning=FALSE, error=FALSE}

# small, uncompressed WARC file

glimpse(read_warc(system.file("extdata/bbc.warc", package="jwatr")))

# larger example

xdf <- read_warc(system.file("extdata/sample.warc.gz", package="jwatr"),

                 warc_types = "response", include_payload = TRUE)

glimpse(xdf)

# get the payload content

payload_content(url = xdf$target_uri[279], ctype = xdf$http_protocol_content_type[279], 

                xdf$http_raw_headers[[279]], xdf$payload[[279]])

# or ingest the raw bits yourself

imgs <- filter(xdf, grepl("(png|gif|jpeg)$", http_protocol_content_type))

imgs

image_read(imgs$payload[[1]])

```

![](imgs/img1.jpeg)

### Writing WARC Files

```{r message=FALSE, warning=FALSE, error=FALSE}

library(jwatr)

library(httr)

library(magick)

library(tidyverse)

tf <- tempfile("test")

wf <- warc_file(tf)

warc_write_response(wf, "https://rud.is/b/")

# store a simple httr::GET request

warc_write_response(wf, GET("https://rud.is/b/"))

warc_write_response(wf, "https://www.rstudio.com/")

warc_write_response(wf, "https://www.r-project.org/")

# all valid content types work, like this PDF

warc_write_response(wf, "http://che.org.il/wp-content/uploads/2016/12/pdf-sample.pdf")

# complex API calls can be made and the results stored in the WARC file as well

# this API call returns a JSON object

POST(

 url = "https://data.police.uk/api/crimes-street/all-crime",

 query = list( lat = "52.629729", lng = "-1.131592", date = "2017-01")

) -> uk_res

warc_write_response(wf, uk_res)

warc_write_response(wf, "https://journal.r-project.org/RLogo.png")

close_warc_file(wf)

xdf <- read_warc(sprintf("%s.warc.gz", tf), include_payload = TRUE)

glimpse(xdf)

# decode the WARC stored JSON response from the UK Crimes API

glimpse(jsonlite::fromJSON(rawToChar(xdf[6,]$payload[[1]]), flatten=TRUE))

select(xdf, content_length, http_protocol_content_type)

image_read(xdf$payload[[5]])

```

![](imgs/img2.png)

```{r echo=FALSE}

unlink(tf)

```

### Streaming

The `warc_stream_in()` function provides a pure-R method for stream processing WARC

files through the use of an R callback handler. One way of using this is to build a

data frame. The following example builds a data frame of WARC `response` records. Space

is reserved for a 10,000-element list which will get truncated or expanded as necessary:

```{r message=FALSE, warning=FALSE, error=FALSE}

xdf <- list(10000)

xdf_i <- 0

myfun <- function(headers, payload, ...) {

  headers <- setNames(headers, gsub("-", "_", names(headers)))

  xdf_i <<- xdf_i + 1

  headers$payload <- list(payload)

  xdf[xdf_i] <<- list(headers)

}

(n <- warc_stream_in(

  system.file("extdata/sample.warc.gz", package="jwatr"),

  myfun,

  warc_types = "response"

))

xdf <- bind_rows(xdf)

glimpse(xdf)

count(xdf, content_type)

cat(rawToChar(xdf$payload[[1]]))

```

### Test Results

```{r message=FALSE, warning=FALSE, error=FALSE}

library(jwatr)

library(testthat)

date()

test_dir("tests/")

```

### Code of Conduct

Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md). By participating in this project you agree to abide by its terms.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/hrbrmstr/jwatr

Awesome Lists containing this project

README