https://github.com/fangzhou-xie/Randomuseragent

Randomly Shuffle User-Agent Strings
https://github.com/fangzhou-xie/Randomuseragent

Last synced: 7 months ago
JSON representation

Randomly Shuffle User-Agent Strings

Host: GitHub
URL: https://github.com/fangzhou-xie/Randomuseragent
Owner: fangzhou-xie
License: other
Created: 2021-06-15T20:35:34.000Z (about 4 years ago)
Default Branch: main
Last Pushed: 2021-11-19T20:12:52.000Z (over 3 years ago)
Last Synced: 2024-11-25T03:41:47.856Z (7 months ago)
Language: R
Homepage: https://fangzhou-xie.github.io/Randomuseragent/
Size: 478 KB
Stars: 3
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- License: LICENSE

Awesome Lists containing this project

jimsghstars - fangzhou-xie/Randomuseragent - Randomly Shuffle User-Agent Strings (R)

README

        ---

output: github_document

---

```{r, include = FALSE}

knitr::opts_chunk$set(

  collapse = TRUE,

  comment = ">",

  fig.path = "man/figures/README-",

  out.width = "100%"

)

```

# Randomuseragent

[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/Randomuseragent)](https://CRAN.R-project.org/package=Randomuseragent)

[![CRAN_Downloads](http://cranlogs.r-pkg.org/badges/grand-total/Randomuseragent)](https://CRAN.R-project.org/package=Randomuseragent)

[![R-CMD-check](https://github.com/fangzhou-xie/Randomuseragent/workflows/R-CMD-check/badge.svg)](https://github.com/fangzhou-xie/Randomuseragent/actions)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

The goal of Randomuseragent is to have a easy access to different user-agent strings

by randomly sampling from a pool of real strings.

## Installation

You can install the released version of Randomuseragent from [CRAN](https://CRAN.R-project.org) with:

```{r, eval=FALSE}

install.packages("Randomuseragent")

```

The development version can be installed from [GitHub](https://github.com/) with:

```{r, eval=FALSE}

# install.packages("devtools")

devtools::install_github("fangzhou-xie/Randomuseragent")

```

## Example

This is a basic example to get random user-agent strings:

```{r example}

library(Randomuseragent)

random_useragent()

filter_useragent(min_obs = 50000, software_name = "Safari", operating_system_name = "Mac OS X")

```

Both function will accept the same set of arguments for filtering user-agent strings. Please refer

to documentation of either function for details.

## Advanced Example

Although calling `random_useragent()` is very convenient, but this may not be the best way

if you care about performance. `random_useragent()` essentially wraps up the `filter_useragent()` 

function and return a random one from the pool.

```{r}

# call directly

random_useragent(min_obs = 50000, software_name = "Safari", operating_system_name = "Mac OS X")

```

However, if you need to generate **LOTS OF** them, i.e. calling `random_useragent()` repeatedly,

each time you call `random_useragent()` you need first to filter from all the strings that

this package provides, and then randomly draw one from the pool.

Hence, you are doing the subsetting each time you call the function.

This is very inefficient.

A better way would be to get the string pool directly from `filter_useragent()` and then sampling yourself.

```{r}

# first filter

uas <- filter_useragent(min_obs = 50000, software_name = "Safari", operating_system_name = "Mac OS X")

# then sample manually

sample(uas, 1)

```

To note this difference, we need to time the following code chunks.

```{r}

# first to call random_useragent() directly

system.time(lapply(1:5000, function(x){random_useragent()}))

```

```{r}

# second generate the character vector and sampling manually

system.time({

  ua <- filter_useragent(min_obs = 5000, software_type = "browser", operating_system_name = "Windows")

  lapply(1:5000, function(x) {sample(ua, 1)})

})

```

We run each method 5000 times to make a fair comparison between methods. You should immediately see

that the second method is more than 50 times faster than the first one!

That said, the first method only spends 0.2452 ms per call, on average,

which is pretty fast already.

The second method needs 4.4 ns per call. This is certainly faster, but for most use cases,

I don't think it worth going this far.

## Optional Parameters

You can type `?random_useragent` to see the documentation for the parameters.

1. `min_obs`: integer, threshold to filter number of times observed in the dataset. This is to keep the most frequently

used UAs while removing the less frequently used ones. Larger number of this argument will result in less

returned strings. Hence smaller set to be sampled from.

2. `software_name`: character vector, name of the software. For example, you can choose to only use 

`software_name="Chrome"` or several platforms together `software_name = c("Safari", "Edge")`.

3. `software_type`: character vector, one or more of `"browser", "bot", "application"`.

For webscraping applications, you would most likely choose `software_type="browser"` to mimic

real browser behavior.

4. `operating_system_name`: character vector, system being operated. For example, use one or more of

`"Windows", "Linux", "Mac OS X", "macOS", etc`.

5. `layout_engine_name`: character vector, e.g. `"Gecko", "Blink", etc`.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/fangzhou-xie/Randomuseragent

Awesome Lists containing this project

README