https://github.com/hrbrmstr/wayback

:rewind: Tools to Work with the Various Internet Archive Wayback Machine APIs
https://github.com/hrbrmstr/wayback

internet-archive memento r r-cyber rstats wayback wayback-machine web-scraping

Last synced: 9 months ago
JSON representation

:rewind: Tools to Work with the Various Internet Archive Wayback Machine APIs

Host: GitHub
URL: https://github.com/hrbrmstr/wayback
Owner: hrbrmstr
Created: 2017-02-26T22:37:10.000Z (almost 9 years ago)
Default Branch: master
Last Pushed: 2018-09-18T13:04:09.000Z (over 7 years ago)
Last Synced: 2025-03-01T06:11:09.590Z (10 months ago)
Topics: internet-archive, memento, r, r-cyber, rstats, wayback, wayback-machine, web-scraping
Language: R
Homepage: https://hrbrmstr.github.io/wayback/index.html
Size: 1.29 MB
Stars: 55
Watchers: 9
Forks: 8
Open Issues: 7
Metadata Files:
- Readme: README.Rmd

Awesome Lists containing this project

README

          ---

output: rmarkdown::github_document

---

[![Travis-CI Build Status](https://travis-ci.org/hrbrmstr/wayback.svg?branch=master)](https://travis-ci.org/hrbrmstr/wayback)

[![codecov](https://codecov.io/gh/hrbrmstr/wayback/branch/master/graph/badge.svg)](https://codecov.io/gh/hrbrmstr/wayback)

[![Appveyor Status](https://ci.appveyor.com/api/projects/status/w9rwdf8a16t0amht/branch/master?svg=true)](https://ci.appveyor.com/project/hrbrmstr/wayback/branch/master)

# wayback

Tools to Work with Internet Archive Wayback Machine APIs

## Description

The 'Internet Archive' provides access to millions of cached sites. Methods are provided to access these cached resources through the 'APIs' provided by the 'Internet Archive' and also content from 'MementoWeb'.

## What's Inside the Tin?

The following functions are implemented:

**Memento-ish API**:

- `archive_available`:	Does the Internet Archive have a URL cached?

- `cdx_basic_query`:	Perform a basic/limited Internet Archive CDX resource query for a URL

- `get_mementos`: Retrieve site mementos from the Internet Archive

- `get_timemap`:	Retrieve a timemap for a URL

- `read_memento`:	Read a resource directly from the Time Travel MementoWeb

- `is_memento`: Various memento-type testers (useful in `purrr` or `dplyr` contexts)

- `is_first_memento`: Various memento-type testers (useful in `purrr` or `dplyr` contexts)

- `is_next_memento`: Various memento-type testers (useful in `purrr` or `dplyr` contexts)

- `is_prev_memento`: Various memento-type testers (useful in `purrr` or `dplyr` contexts)

- `is_last_memento`: Various memento-type testers (useful in `purrr` or `dplyr` contexts)

- `is_original`: Various memento-type testers (useful in `purrr` or `dplyr` contexts)

- `is_timemap`: Various memento-type testers (useful in `purrr` or `dplyr` contexts)

- `is_timegate`: Various memento-type testers (useful in `purrr` or `dplyr` contexts)

**Scrape API**

- `ia_retrieve:`	Retrieve directory listings for Internet Archive objects by identifier

- `ia_scrape`:	Internet Archive Scraping API Access

- `ia_scrape_has_more`:	'ia_scrape()' Pagination Helpers

- `ia_scrape_next_page`:	Internet Archive Scraping API Access

## Installation

```{r eval=FALSE}

devtools::install_github("hrbrmstr/wayback")

```

```{r message=FALSE, warning=FALSE, error=FALSE, echo=FALSE}

options(width=120)

```

## Usage

```{r message=FALSE, warning=FALSE, error=FALSE}

library(wayback)

library(tidyverse)

# current verison

packageVersion("wayback")

```

### Memento-ish things

```{r avail, message=FALSE, warning=FALSE, error=FALSE}

archive_available("https://www.r-project.org/news.html")

```

```{r get_memento, message=FALSE, warning=FALSE, error=FALSE}

get_mementos("https://www.r-project.org/news.html")

```

```{r get_time, message=FALSE, warning=FALSE, error=FALSE}

get_timemap("https://www.r-project.org/news.html")

```

```{r basic_q, message=FALSE, warning=FALSE, error=FALSE}

cdx_basic_query("https://www.r-project.org/news.html", limit = 10) %>% 

  glimpse()

```

```{r read_mem, message=FALSE, warning=FALSE, error=FALSE}

mem <- read_memento("https://www.r-project.org/news.html")

res <- stringi::stri_split_lines(mem)[[1]]

cat(paste0(res[187:200], collaspe="\n"))

```

### Scrape API

```{r}

glimpse(

  ia_scrape("lemon curry")

)

```

```{r}

(nasa <- ia_scrape("collection:nasa", count=100L))

(item <- ia_retrieve(nasa$identifier[1]))

download.file(item$link[1], file.path("man/figures", item$file[1]))

```

![](man/figures/`r item$file[1]`)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/hrbrmstr/wayback

Awesome Lists containing this project

README