Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/hrbrmstr/jwatr
:card_index: Tools to Query and Create Web Archive Files Using the Java Web Archive Toolkit in R
https://github.com/hrbrmstr/jwatr
java r r-cyber rstats warc
Last synced: 23 days ago
JSON representation
:card_index: Tools to Query and Create Web Archive Files Using the Java Web Archive Toolkit in R
- Host: GitHub
- URL: https://github.com/hrbrmstr/jwatr
- Owner: hrbrmstr
- License: apache-2.0
- Created: 2017-08-18T18:10:09.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2017-09-04T03:44:06.000Z (about 7 years ago)
- Last Synced: 2024-08-06T03:05:01.856Z (3 months ago)
- Topics: java, r, r-cyber, rstats, warc
- Language: R
- Homepage:
- Size: 38.6 MB
- Stars: 7
- Watchers: 4
- Forks: 1
- Open Issues: 2
-
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
Awesome Lists containing this project
README
---
output: rmarkdown::github_document
---`jwatr` : Tools to Query and Create Web Archive Files Using the Java Web Archive Toolkit
The Java Web Archive Toolkit ('JWAT')
is a library of Java objects and methods which enables reading, writing and validating web archive files.WIP!!! Reading & writing need some optimization and edge case checking. There's also a chance I'll change the name to `warc` but some folks are using that package now and I dinna want to cause pain there yet.
The following functions are implemented:
**Reading**
- `read_warc`: Read a WARC file (compressed or uncompressed)
- `warc_stream_in`: Stream in records from a WARC file**Writing**
- `warc_file`: Create a new WARC file
- `warc_write_warcinfo`: Write a 'warcinfo' record to a WARC File
- `warc_write_response`: Write simple `httr::GET` requests or full `httr` `response` objects to a WARC file
- `close_warc_file`: Close a WARC file**`httr` Wrappers**
- `warc_GET`: WARC-ify an httr::GET request
- `warc_POST`: WARC-ify an httr::GET request**Utility**
- `response_list_to_warc_file`: Turns a list of 'httr' 'response' objects into a WARC file
- `payload_content`: Helper function to convert WARC raw headers+payload into something useful
- `is_compressed`: Test if a raw vector is gzip compressedNOTE: To read in typical (~800MB-1GB gzip'd WARC files) you should consider doing the following (in order) in your scripts:
```{r eval=FALSE}
options(java.parameters = "-Xmx2g")library(rJava)
library(jwatjars)
library(jwatr)
```That idiom generally provides enough heap space, but you may need to adjust the heap size if you've got larger payloads.
Alternatively, you can set the same option in your R startup scripts, but that will likely come back to bite you when moving workloads around.
### Installation
```{r eval=FALSE}
devtools::install_github("hrbrmstr/jwatr")
``````{r message=FALSE, warning=FALSE, error=FALSE, include=FALSE}
options(width=120)
```### Reading WARC Files
```{r message=FALSE, warning=FALSE, error=FALSE}
library(rJava)
library(jwatr)
library(magick)
library(tidyverse)# current verison
packageVersion("jwatr")
``````{r message=FALSE, warning=FALSE, error=FALSE}
# small, uncompressed WARC file
glimpse(read_warc(system.file("extdata/bbc.warc", package="jwatr")))# larger example
xdf <- read_warc(system.file("extdata/sample.warc.gz", package="jwatr"),
warc_types = "response", include_payload = TRUE)glimpse(xdf)
# get the payload content
payload_content(url = xdf$target_uri[279], ctype = xdf$http_protocol_content_type[279],
xdf$http_raw_headers[[279]], xdf$payload[[279]])# or ingest the raw bits yourself
imgs <- filter(xdf, grepl("(png|gif|jpeg)$", http_protocol_content_type))imgs
image_read(imgs$payload[[1]])
```![](imgs/img1.jpeg)
### Writing WARC Files
```{r message=FALSE, warning=FALSE, error=FALSE}
library(jwatr)
library(httr)
library(magick)
library(tidyverse)tf <- tempfile("test")
wf <- warc_file(tf)warc_write_response(wf, "https://rud.is/b/")
# store a simple httr::GET request
warc_write_response(wf, GET("https://rud.is/b/"))warc_write_response(wf, "https://www.rstudio.com/")
warc_write_response(wf, "https://www.r-project.org/")# all valid content types work, like this PDF
warc_write_response(wf, "http://che.org.il/wp-content/uploads/2016/12/pdf-sample.pdf")# complex API calls can be made and the results stored in the WARC file as well
# this API call returns a JSON object
POST(
url = "https://data.police.uk/api/crimes-street/all-crime",
query = list( lat = "52.629729", lng = "-1.131592", date = "2017-01")
) -> uk_reswarc_write_response(wf, uk_res)
warc_write_response(wf, "https://journal.r-project.org/RLogo.png")close_warc_file(wf)
xdf <- read_warc(sprintf("%s.warc.gz", tf), include_payload = TRUE)
glimpse(xdf)
# decode the WARC stored JSON response from the UK Crimes API
glimpse(jsonlite::fromJSON(rawToChar(xdf[6,]$payload[[1]]), flatten=TRUE))select(xdf, content_length, http_protocol_content_type)
image_read(xdf$payload[[5]])
```![](imgs/img2.png)
```{r echo=FALSE}
unlink(tf)
```### Streaming
The `warc_stream_in()` function provides a pure-R method for stream processing WARC
files through the use of an R callback handler. One way of using this is to build a
data frame. The following example builds a data frame of WARC `response` records. Space
is reserved for a 10,000-element list which will get truncated or expanded as necessary:```{r message=FALSE, warning=FALSE, error=FALSE}
xdf <- list(10000)
xdf_i <- 0myfun <- function(headers, payload, ...) {
headers <- setNames(headers, gsub("-", "_", names(headers)))
xdf_i <<- xdf_i + 1
headers$payload <- list(payload)
xdf[xdf_i] <<- list(headers)
}(n <- warc_stream_in(
system.file("extdata/sample.warc.gz", package="jwatr"),
myfun,
warc_types = "response"
))xdf <- bind_rows(xdf)
glimpse(xdf)
count(xdf, content_type)
cat(rawToChar(xdf$payload[[1]]))
```### Test Results
```{r message=FALSE, warning=FALSE, error=FALSE}
library(jwatr)
library(testthat)date()
test_dir("tests/")
```### Code of Conduct
Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md). By participating in this project you agree to abide by its terms.