Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Load WARC files into Apache Spark with sparklyr
https://github.com/r-spark/sparkwarc
- Host: GitHub
- URL: https://github.com/r-spark/sparkwarc
- Owner: r-spark
- Created: 2017-01-09T04:37:09.000Z (almost 8 years ago)
- Default Branch: main
- Last Pushed: 2022-01-11T18:32:34.000Z (almost 3 years ago)
- Last Synced: 2024-11-09T21:20:20.195Z (about 1 month ago)
- Language: WebAssembly
- Homepage:
- Size: 511 KB
- Stars: 13
- Watchers: 5
- Forks: 4
- Open Issues: 1
Metadata Files:
- Readme: README.Rmd
Awesome Lists containing this project
README
---
title: "sparkwarc - WARC files in sparklyr"
output:
  github_document:
    fig_width: 9
    fig_height: 5
---

# Install

Install with:
```{r eval=FALSE}
devtools::install_github("javierluraschi/sparkwarc")
```

# Intro
The following example loads a very small subset of a WARC file from [Common Crawl](http://commoncrawl.org), a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public.
```{r message=FALSE}
library(sparkwarc)
library(sparklyr)
library(DBI)
library(dplyr)
```

```{r connect-1, max.print=10}
sc <- spark_connect(master = "local")
```

```{r load-sample}
spark_read_warc(sc, path = spark_warc_sample_path(), name = "WARC")
```

```{sql query-1, connection=sc, max.print=1}
SELECT count(value)
FROM WARC
WHERE length(regexp_extract(value, '<html', 0)) > 0
```

```{r functions-1}
# Keep the non-empty matches, then tally the 100 most frequent values
cc_regex <- function(ops) {
  ops %>%
    filter(regval != "") %>%
    group_by(regval) %>%
    summarize(count = n()) %>%
    arrange(desc(count)) %>%
    head(100)
}

# Extract the first capture group of `regex` from each WARC line and summarize it
cc_stats <- function(regex) {
  tbl(sc, "warc") %>%
    transmute(regval = regexp_extract(value, regex, 1)) %>%
    cc_regex()
}
```

```{r query-2}
cc_stats("http-equiv=\"Content-Language\" content=\"(.*)\"")
```

```{r query-3}
cc_stats("")
```

```{r query-5}
cc_stats(" ([a-zA-Z]{5,10}) ")
```

```{r query-6}
cc_stats("<meta .*keywords.*content=\"([^,\"]+).*")
```

```{r query-7}
cc_stats("<script .*src=\".*/([^/]+.js)\".*")
```

```{r disconnect-1}
spark_disconnect(sc)
```

# Querying 1GB
```{r download-1}
warc_big <- normalizePath("~/cc.warc.gz")          # name a 5GB warc file
if (!file.exists(warc_big))                        # if the file does not exist,
  download.file(                                   # download it by
    gsub("s3n://commoncrawl/",                     # mapping the S3 bucket url
         "https://commoncrawl.s3.amazonaws.com/",  # into a downloadable url
         sparkwarc::cc_warc(1)), warc_big)         # for the first archive file
```

```{r connect-2}
config <- spark_config()
config[["spark.memory.fraction"]] <- "0.9"
config[["spark.executor.memory"]] <- "10G"
config[["sparklyr.shell.driver-memory"]] <- "10G"sc <- spark_connect(master = "local", config = config)
```

```{r load-full}
spark_read_warc(
  sc,
  "warc",
  warc_big,
  repartition = 8)
```

```{sql query-8, connection=sc, max.print=1}
SELECT count(value)
FROM WARC
WHERE length(regexp_extract(value, '<([a-z]+)>', 0)) > 0
```

```{sql query-9, connection=sc, max.print=1}
SELECT count(value)
FROM WARC
WHERE length(regexp_extract(value, '<html', 0)) > 0
```

```{r query-10}
cc_stats("http-equiv=\"Content-Language\" content=\"([^\"]*)\"")
```

```{r query-11}
cc_stats("WARC-Target-URI: http://([^/]+)/.*")
```

```{r query-12}
cc_stats("<([a-zA-Z]+)>")
```

```{r query-13}
cc_stats("<meta .*keywords.*content=\"([a-zA-Z0-9]+).*")
```

```{r disconnect-2}
spark_disconnect(sc)
```

# Querying 1TB
By [running sparklyr in EMR](https://aws.amazon.com/blogs/big-data/running-sparklyr-rstudios-r-interface-to-spark-on-amazon-emr/), one can configure an EMR cluster and load about **5GB** of data using:
```{r eval=FALSE}
sc <- spark_connect(master = "yarn-client")
spark_read_warc(sc, "warc", cc_warc(1, 1))tbl(sc, "warc") %>% summarize(n = n())
spark_disconnect_all()
```

To read the first 200 files, or about **1TB** of data, first scale up the cluster and consider maximizing resource allocation with the following EMR config:
```
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
```

Then load the `[1, 200]` file range with:
```{r eval=FALSE}
sc <- spark_connect(master = "yarn-client")
spark_read_warc(sc, "warc", cc_warc(1, 200))tbl(sc, "warc") %>% summarize(n = n())
spark_disconnect_all()
```

To **query ~1PB** for the entire crawl, a custom script would be needed to load all the WARC files.
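A minimal sketch of such a script follows; it is not part of the original README. It assumes the crawl's total WARC file count is known (`total_files` below is a placeholder; the real number comes from the crawl's `warc.paths` listing) and loads the files in ranges of 200 via `cc_warc()`, registering each range as its own Spark table.
```{r eval=FALSE}
library(sparkwarc)
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn-client")

total_files <- 64000   # placeholder: take the real count from the crawl's warc.paths listing
range_size  <- 200     # load the crawl in ranges of 200 WARC files

for (start in seq(1, total_files, by = range_size)) {
  end <- min(start + range_size - 1, total_files)
  spark_read_warc(
    sc,
    name = sprintf("warc_%06d", start),  # one table per range, e.g. warc_000001
    path = cc_warc(start, end),
    repartition = 8 * range_size)
}

# Example: count the rows loaded for the first range
tbl(sc, "warc_000001") %>% summarize(n = n())

spark_disconnect_all()
```
Since ~1PB will not fit in executor memory, in practice each range would be filtered or aggregated and the intermediate result persisted before moving on to the next range.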