An open API service indexing awesome lists of open source software.

https://github.com/fangzhou-xie/Randomuseragent

Randomly Shuffle User-Agent Strings
https://github.com/fangzhou-xie/Randomuseragent

Last synced: 4 months ago
JSON representation

Randomly Shuffle User-Agent Strings

Awesome Lists containing this project

README

        

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = ">",
fig.path = "man/figures/README-",
out.width = "100%"
)
```

# Randomuseragent

[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/Randomuseragent)](https://CRAN.R-project.org/package=Randomuseragent)
[![CRAN_Downloads](http://cranlogs.r-pkg.org/badges/grand-total/Randomuseragent)](https://CRAN.R-project.org/package=Randomuseragent)
[![R-CMD-check](https://github.com/fangzhou-xie/Randomuseragent/workflows/R-CMD-check/badge.svg)](https://github.com/fangzhou-xie/Randomuseragent/actions)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

The goal of Randomuseragent is to have a easy access to different user-agent strings
by randomly sampling from a pool of real strings.

## Installation

You can install the released version of Randomuseragent from [CRAN](https://CRAN.R-project.org) with:

```{r, eval=FALSE}
install.packages("Randomuseragent")
```

The development version can be installed from [GitHub](https://github.com/) with:

```{r, eval=FALSE}
# install.packages("devtools")
devtools::install_github("fangzhou-xie/Randomuseragent")
```

## Example

This is a basic example to get random user-agent strings:

```{r example}
library(Randomuseragent)

random_useragent()

filter_useragent(min_obs = 50000, software_name = "Safari", operating_system_name = "Mac OS X")
```

Both function will accept the same set of arguments for filtering user-agent strings. Please refer
to documentation of either function for details.

## Advanced Example

Although calling `random_useragent()` is very convenient, but this may not be the best way
if you care about performance. `random_useragent()` essentially wraps up the `filter_useragent()`
function and return a random one from the pool.

```{r}
# call directly
random_useragent(min_obs = 50000, software_name = "Safari", operating_system_name = "Mac OS X")
```

However, if you need to generate **LOTS OF** them, i.e. calling `random_useragent()` repeatedly,
each time you call `random_useragent()` you need first to filter from all the strings that
this package provides, and then randomly draw one from the pool.
Hence, you are doing the subsetting each time you call the function.
This is very inefficient.

A better way would be to get the string pool directly from `filter_useragent()` and then sampling yourself.

```{r}
# first filter
uas <- filter_useragent(min_obs = 50000, software_name = "Safari", operating_system_name = "Mac OS X")
# then sample manually
sample(uas, 1)
```

To note this difference, we need to time the following code chunks.

```{r}
# first to call random_useragent() directly
system.time(lapply(1:5000, function(x){random_useragent()}))
```

```{r}
# second generate the character vector and sampling manually
system.time({
ua <- filter_useragent(min_obs = 5000, software_type = "browser", operating_system_name = "Windows")
lapply(1:5000, function(x) {sample(ua, 1)})
})
```

We run each method 5000 times to make a fair comparison between methods. You should immediately see
that the second method is more than 50 times faster than the first one!
That said, the first method only spends 0.2452 ms per call, on average,
which is pretty fast already.
The second method needs 4.4 ns per call. This is certainly faster, but for most use cases,
I don't think it worth going this far.

## Optional Parameters

You can type `?random_useragent` to see the documentation for the parameters.

1. `min_obs`: integer, threshold to filter number of times observed in the dataset. This is to keep the most frequently
used UAs while removing the less frequently used ones. Larger number of this argument will result in less
returned strings. Hence smaller set to be sampled from.
2. `software_name`: character vector, name of the software. For example, you can choose to only use
`software_name="Chrome"` or several platforms together `software_name = c("Safari", "Edge")`.
3. `software_type`: character vector, one or more of `"browser", "bot", "application"`.
For webscraping applications, you would most likely choose `software_type="browser"` to mimic
real browser behavior.
4. `operating_system_name`: character vector, system being operated. For example, use one or more of
`"Windows", "Linux", "Mac OS X", "macOS", etc`.
5. `layout_engine_name`: character vector, e.g. `"Gecko", "Blink", etc`.