https://github.com/rstudio/mleap

R Interface to MLeap
https://github.com/rstudio/mleap
jvm mleap pipelines spark sparklyr
Last synced: 2 months ago
JSON representation
R Interface to MLeap
Host: GitHub
URL: https://github.com/rstudio/mleap
Owner: rstudio
License: apache-2.0
Created: 2018-03-15T14:18:41.000Z (over 7 years ago)
Default Branch: main
Last Pushed: 2022-10-22T17:20:13.000Z (over 2 years ago)
Last Synced: 2025-04-04T02:34:32.502Z (3 months ago)
Topics: jvm, mleap, pipelines, spark, sparklyr
Language: R
Homepage: http://spark.rstudio.com/guides/mleap/
Size: 677 KB
Stars: 24
Watchers: 6
Forks: 9
Open Issues: 9
Metadata Files:
- Readme: README.Rmd
- License: LICENSE.md
Awesome Lists containing this project

awesome-sparklyr - mleap: R Interface to MLeap
README

        ---

title: "R interface for MLeap"

output:

  github_document:

    fig_width: 7

    fig_height: 5

---

```{r setup, include = FALSE}

knitr::opts_chunk$set(

  collapse = TRUE,

  comment = "#>",

  out.width = "100%",

  fig.align='center',

  eval = TRUE

)

Sys.setenv("JAVA_HOME" = "/Library/Java/JavaVirtualMachines/jdk-18.0.1.1.jdk/Contents/Home")

rJava::.jinit() 

#library(mleap)

devtools::load_all()

library(tibble)

library(sparklyr)

Sys.setenv("JAVA_HOME" = "/usr/local/java")

toc <- function() {

  re <- readLines("README.Rmd")

  has_title <- as.logical(lapply(re, function(x) substr(x, 1, 2) == "##"))

  only_titles <- re[has_title]

  titles <- trimws(gsub("#", "", only_titles))

  links <- trimws(gsub("`", "", titles))

  links <- tolower(links)

  links <- trimws(gsub(" ", "-", links))

  links <- trimws(gsub(",", "", links))

  toc_list <- lapply(

    seq_along(titles),

    function(x) {

      pad <- ifelse(substr(only_titles[x], 1, 3) == "###", "    - ", "  - ")

      paste0(pad, "[", titles[x], "](#",links[x], ")")

    }

  )

  toc_full <- paste(toc_list, collapse = "\n") 

  cat(toc_full)

}

```

[![R-CMD-check](https://github.com/rstudio/mleap/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/rstudio/mleap/actions/workflows/R-CMD-check.yaml)

[![MLeap-Tests](https://github.com/rstudio/mleap/actions/workflows/mleap-tests.yaml/badge.svg)](https://github.com/rstudio/mleap/actions/workflows/mleap-tests.yaml)

[![Coverage status](https://codecov.io/gh/rstudio/mleap/branch/master/graph/badge.svg)](https://codecov.io/github/rstudio/mleap?branch=master) [![CRAN status](https://www.r-pkg.org/badges/version/mleap)](https://cran.r-project.org/package=mleap)

## What is MLeap?

[MLeap](https://github.com/combust/mleap) allows us to take Spark pipelines to production. The MLeap runtime can recreate most of Spark's feature transformers 

and model predictions. This allows for the ML Pipeline to be **deployed with no

Spark dependencies**. 

![Figure 1 - Train in Spark](man/readme/mleap-fit.png)

In practice, we can save the ML Pipeline Model (fitted model) as an MLeap bundle 

(see Figure 1). MLeap serializes the pipeline steps and model.  The resulting Zip 

file can  then be used in an external environment that has MLeap. Once the MLeap 

bundle is loaded in the new environment, new data can be passed to obtain 

predictions (see Figure 2).

![Figure 2 - Deploy with MLeap](man/readme/mleap-predict.png)

## The `mleap` package

The goal of the `mleap` package is twofold:

1. Convert an ML Pipeline Model created in `sparklyr`, into an MLeap bundle file

1. Load an MLeap bundle file into an R session, and then use the loaded bundle for

predictions

Additionally, the `mleap` package allows us to load an existing MLeap bundle into

a Spark session. This would typically be to re-train, or modify a previously 

created ML Pipeline Model. 

The primary functions in `mleap` are:

- `ml_write_to_bundle_transformed()` - Writes an MLeap bundle. It depends on data that has

been trained using the pipeline

- `mleap_load_bundle()` - Loads an MLeap bundle file into R 

- `mleap_transform()` - Runs the MLeap bundle steps against new data in R

Additional operational functions in `mleap` are:

- `ml_read_bundle()` - Loads an MLeap bundle file into Spark, via a `sparklyr`

session

- `ml_write_bundle()` - Writes an MLeap bundle. It depends on a sample of the 

training data to re-train the pipeline

![Figure 3 - mleap functions](man/readme/mleap-functions.png)

## Use Cases

Here are couple of use cases to consider using MLeap, with `mleap`:

- It opens the door to **collaborate with non-R, and even non-Spark, teams**. The resulting

MLeap bundle can be used as the integration for those teams to use the model in

other environments.

- **Deploy a Shiny app, or a `plumber` API, with no dependencies on Spark**. Using

`mleap`, the model can be loaded into the R environment, and then used for predictions

within the R artifact. 

## Getting started

In order for the R package to work, we will need a local installation of MLeap. 

Maven is required to install MLeap. `mleap` contains functions to take care of

that. 

### Steps

1. **Install `mleap`.** For the CRAN version use:

    ```r

    install.packages("mleap")

    ```

   

   For the development version, use:

    

    ```r

    devtools::install_github("rstudio/mleap")

    ```

2. **Install Maven.** If you already have Maven installed, you can 

let `mleap` know by setting an R option:

    ```r

    options(maven.home = "path/to/maven")`:

    ```

    

    If no installation of Maven exists, use:

    ```r

    mleap::install_maven()

    ```

3. **Install MLeap.** There are a couple of considerations regarding the version

of MLeap to install:

    - If using Spark, the version of MLeap to install and use will be that closest

    to the recommended one by the developers of MLeap. The `mleap_dep_versions_table()`

    contains the combinations of Spark and MLeap versions as reference. 

    

    - If not using Spark, meaning, that we are using `mleap` to load an existing

    bundle, then we would need to match the version of MLeap in which the 

    bundle was originally created.

    ```r

    mleap::install_mleap(version = "0.20.0")

    ```

## Example

For the example, we will use the *Fine Foods* example data. It contains reviews

of foods. We will use an ML Pipeline Model to predict if the verbiage in the

review can tell us if the customer thinks if the product is "great".

### Create the pipeline

1. We will use a local version of Spark, version 3.2:

    ```{r}

    library(sparklyr)

    library(modeldata)

    

    data("small_fine_foods")

    

    sc <- spark_connect(master = "local", version = "3.2")

    

    sff_training_data <- copy_to(sc, training_data)

    

    sff_testing_data <- copy_to(sc, testing_data)

    ```

1. We will create an ML Pipeline. We will index the outcome varaible (**score**),

and then use several text feature transformers to create the **features** column

which will be used as our predictor:

    ```{r}

    sff_pipeline <- ml_pipeline(sc) %>% 

      ft_string_indexer(

        input_col = "score",

        output_col = "label",

        handle_invalid = "keep",

        string_order_type = "alphabetDesc"

      ) %>% 

      ft_tokenizer(

        input_col = "review",

        output_col = "word_list"

      ) %>% 

      ft_stop_words_remover(

        input_col = "word_list", 

        output_col = "wo_stop_words"

        ) %>% 

      ft_hashing_tf(

        input_col = "wo_stop_words", 

        output_col = "hashed_features", 

        num_features = 4096,

        binary = TRUE

        ) %>%

      ft_normalizer(

        input_col = "hashed_features", 

        output_col = "features"

        ) %>% 

      ml_logistic_regression(elastic_net_param = 0.05, reg_param = 0.25)

    ```

1. An ML Pipeline Model is now created after running the training data through the

pipeline created in the previous step:

    ```{r}

    sff_pipeline_model <- ml_fit(sff_pipeline, sff_training_data)

    ```

1. Assuming we are happy with the results. We run the same pipeline using the

hold-out set (`sff_testing_data`).  The idea, is that we can use this last

transformed data set as a base for our MLeap bundle.

    ```{r}

    sff_test_predictions <- sff_pipeline_model %>% 

      ml_transform(sff_testing_data)

    ```

1. Using `ml_write_to_bundle_transformed()` from `mleap`, we save the new ML

Pipeline Model as an MLeap bundle. We also pass the transformed data set we

created with the hold-out test set. 

    ```{r}

    ml_write_to_bundle_transformed(

      x = sff_pipeline_model,  

      transformed_dataset = sff_test_predictions,  

      path = "sff.zip", 

      overwrite = TRUE

      )

    ```

1. We can now close the Spark connection

    ```{r}

    spark_disconnect(sc) 

    ```

### Loading an MLeap bundle to R (without Spark dependencies)

1. We can use the same bundle created in the previous section to load into R. 

Simply pass the path to the Zip file to `ml_load_bundle()`:

    ```{r}

    sff_mleap_model <- mleap_load_bundle("sff.zip")

    

    sff_mleap_model

    ```

1. We can use `mleap_model_schema()` to view more information about the contents

of the bundle:

    ```{r}

    mleap_model_schema(sff_mleap_model)

    ```

1. `mleap_transform()` can process the model and new data. Pass a `tibble` with

the expected **input** variables:

    ```{r}

    tibble(review = "worst bad thing I will never buy again", score = "") %>% 

      mleap_transform(sff_mleap_model, .) %>% 

      glimpse()

    ```

    

    ```{r}

    tibble(review = "I really loved the proudct best product", score = "") %>% 

      mleap_transform(sff_mleap_model, .) %>% 

      dplyr::glimpse()

    ```

## Known limitations

MLeap translates the feature transformer and models into its own code base.  Not

everything available in Spark is translated. 

This means two layering things:

1. No `dplyr` transformation is available. Only models and feature transformers are

available. In `sparklyr`, feature transformers are functions that start with `ft_`.

1. Not every Spark Feature Transformer and model are supported. Please refer to 

the MLeap documentation to see a concise view of what is available: [MLeap Supported Transformers & Models](https://combust.github.io/mleap-docs/core-concepts/transformers/support.html).

Most notably, the following three transformers are not supported:

- `ft_dplyr_transformer()`

- `ft_sql_transformer()`

- `ft_r_formula()`

There is a **workaround for `ft_r_formula()`**. It involves using the ML Pipeline

"way" of setting up the outcome and predictors.  For the predictors, use `ft_vector_assembler()` if the final stage of the predictors is not a single 

vectorized variable.  For outcomes, anything numeric 

works fine, but anything categorical will not. Use `ft_string_indexer()` on top 

of the outcome variable, before passing it to the modeling step (See the Example 

section).
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/rstudio/mleap

Awesome Lists containing this project

README