# OpenEmbedding

[![build status](https://github.com/4paradigm/openembedding/actions/workflows/build.yml/badge.svg)](https://github.com/4paradigm/openembedding/actions/workflows/build.yml)
[![docker pulls](https://img.shields.io/docker/pulls/4pdosc/openembedding.svg)](https://hub.docker.com/r/4pdosc/openembedding)
[![python version](https://img.shields.io/pypi/pyversions/openembedding.svg?style=plastic)](https://badge.fury.io/py/openembedding)
[![pypi package version](https://badge.fury.io/py/openembedding.svg)](https://badge.fury.io/py/openembedding)
[![downloads](https://pepy.tech/badge/openembedding)](https://pepy.tech/project/openembedding)

English version | [中文版](README_cn.md)

## About

**OpenEmbedding is an open-source framework for TensorFlow distributed training acceleration.**

Nowadays, many machine learning and deep learning applications are built on parameter servers, which efficiently store and update model weights. When a model has a large number of sparse features (e.g., Wide&Deep and DeepFM for CTR prediction), the number of weights easily runs into billions to trillions. In such cases, traditional synchronization solutions (such as the Allreduce-based solution adopted by Horovod) cannot achieve high performance because of the massive communication overhead introduced by the tremendous number of sparse features. To achieve efficiency for such sparse models, we developed OpenEmbedding, which enhances the parameter server specifically for sparse model training and inference.

## Highlights

Efficiency
- We propose an efficient customized sparse format to handle sparse features, together with fine-grained optimizations such as cache-conscious algorithms, asynchronous cache reads and writes, and lightweight locks to maximize parallelism. For sparse model training, OpenEmbedding achieves a speedup of 3-8x compared with the Allreduce-based solution on a single machine equipped with 8 GPUs.

Ease-of-use
- We have integrated OpenEmbedding into TensorFlow. Only three lines of code changes are required to use OpenEmbedding in TensorFlow for both training and inference.

Adaptability
- In addition to TensorFlow, it is straightforward to integrate OpenEmbedding into other popular frameworks. We demonstrate the integration with DeepCTR and Horovod in the examples.

## Benchmark

![benchmark](documents/images/benchmark.png)

For models that contain sparse features, it is difficult to achieve a speedup with the Allreduce-based framework Horovod alone. Using OpenEmbedding together with Horovod yields better acceleration: on a single machine with 8 GPUs the speedup ratio is 3 to 8 times, and many models achieve 3 to 7 times the performance of Horovod.

- [Benchmark](documents/en/benchmark.md)

## Install & Quick Start

You can install and run OpenEmbedding with the following steps. The examples show the whole process of training on the [criteo](https://labs.criteo.com/2014/09/kaggle-contest-dataset-now-available-academic-use/) dataset with OpenEmbedding and predicting with TensorFlow Serving.

### Docker

NVIDIA Docker is required to use GPUs inside the image. The OpenEmbedding image can be obtained from [Docker Hub](https://hub.docker.com/r/4pdosc/openembedding/tags).

```bash
# The script "criteo_deepctr_standalone.sh" will train and export the model to the path "tmp/criteo/1".
# It is okay to switch to:
# "criteo_deepctr_horovod.sh" (multi-GPU training with Horovod),
# "criteo_deepctr_mirrored.sh" (multi-GPU training with MirroredStrategy),
# "criteo_deepctr_mpi.sh" (multi-GPU training with MultiWorkerMirroredStrategy and MPI).
docker run --rm --gpus all -v /tmp/criteo:/openembedding/tmp/criteo \
4pdosc/openembedding:latest examples/run/criteo_deepctr_standalone.sh

# Start TensorFlow Serving to load the trained model.
docker run --name serving-example -td -p 8500:8500 -p 8501:8501 \
-v /tmp/criteo:/models/criteo -e MODEL_NAME=criteo tensorflow/serving:latest
# Wait for the model server to start.
sleep 5

# Send requests and get prediction results.
docker run --rm --network host 4pdosc/openembedding:latest examples/run/criteo_deepctr_restful.sh

# Clean up the serving container.
docker stop serving-example
docker rm serving-example
```

### Ubuntu

```bash
# Install the dependencies required by OpenEmbedding.
apt update && apt install -y gcc-7 g++-7 python3 libpython3-dev python3-pip
pip3 install --upgrade pip
pip3 install tensorflow==2.5.1
pip3 install openembedding

# Install the dependencies required by examples.
apt install -y git cmake mpich curl
HOROVOD_WITHOUT_MPI=1 pip3 install horovod
pip3 install deepctr pandas scikit-learn mpi4py

# Download the examples.
git clone https://github.com/4paradigm/OpenEmbedding.git
cd OpenEmbedding

# The script "criteo_deepctr_standalone.sh" will train and export the model to the path "tmp/criteo/1".
# It is okay to switch to:
# "criteo_deepctr_horovod.sh" (multi-GPU training with Horovod),
# "criteo_deepctr_mirrored.sh" (multi-GPU training with MirroredStrategy),
# "criteo_deepctr_mpi.sh" (multi-GPU training with MultiWorkerMirroredStrategy and MPI).
examples/run/criteo_deepctr_standalone.sh

# Start TensorFlow Serving to load the trained model.
docker run --name serving-example -td -p 8500:8500 -p 8501:8501 \
-v `pwd`/tmp/criteo:/models/criteo -e MODEL_NAME=criteo tensorflow/serving:latest
# Wait for the model server to start.
sleep 5

# Send requests and get prediction results.
examples/run/criteo_deepctr_restful.sh

# Clean up the serving container.
docker stop serving-example
docker rm serving-example
```

### CentOS

```bash
# Install the dependencies required by OpenEmbedding.
yum install -y centos-release-scl
yum install -y python3 python3-devel devtoolset-7
scl enable devtoolset-7 bash
pip3 install --upgrade pip
pip3 install tensorflow==2.5.1
pip3 install openembedding

# Install the dependencies required by examples.
yum install -y git cmake mpich curl
HOROVOD_WITHOUT_MPI=1 pip3 install horovod
pip3 install deepctr pandas scikit-learn mpi4py

# Download the examples.
git clone https://github.com/4paradigm/OpenEmbedding.git
cd OpenEmbedding

# The script "criteo_deepctr_standalone.sh" will train and export the model to the path "tmp/criteo/1".
# It is okay to switch to:
# "criteo_deepctr_horovod.sh" (multi-GPU training with Horovod),
# "criteo_deepctr_mirrored.sh" (multi-GPU training with MirroredStrategy),
# "criteo_deepctr_mpi.sh" (multi-GPU training with MultiWorkerMirroredStrategy and MPI).
examples/run/criteo_deepctr_standalone.sh

# Start TensorFlow Serving to load the trained model.
docker run --name serving-example -td -p 8500:8500 -p 8501:8501 \
-v `pwd`/tmp/criteo:/models/criteo -e MODEL_NAME=criteo tensorflow/serving:latest
# Wait for the model server to start.
sleep 5

# Send requests and get prediction results.
examples/run/criteo_deepctr_restful.sh

# Clean up the serving container.
docker stop serving-example
docker rm serving-example
```

### Note

The installation usually requires g++ 7 or higher, or a compiler compatible with `tf.version.COMPILER_VERSION`. The compiler can be specified by the environment variables `CC` and `CXX`. Currently, OpenEmbedding can only be installed on Linux.
```bash
CC=gcc CXX=g++ pip3 install openembedding
```

If TensorFlow is upgraded, you need to reinstall OpenEmbedding.
```bash
pip3 uninstall openembedding && pip3 install --no-cache-dir openembedding
```

## User Guide

A sample program for common usage is as follows.

Create `Model` and `Optimizer`.
```python
import tensorflow as tf
from deepctr.models import WDL
optimizer = tf.keras.optimizers.Adam()
model = WDL(feature_columns, feature_columns, task='binary')
```
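
In the snippet above, `feature_columns` is assumed to be defined elsewhere. For illustration, it might be built with DeepCTR roughly as follows (a sketch only; the feature names, vocabulary sizes, and embedding dimensions are placeholders, not values from this project):
```python
from deepctr.feature_column import SparseFeat, DenseFeat

# Placeholder definitions for a criteo-like dataset: 26 sparse (categorical)
# features C1..C26 and 13 dense (numerical) features I1..I13.
feature_columns = (
    [SparseFeat(f'C{i}', vocabulary_size=1000, embedding_dim=9)
     for i in range(1, 27)]
    + [DenseFeat(f'I{i}', 1) for i in range(1, 14)]
)
```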

Transform to distributed `Model` and distributed `Optimizer`. The `Embedding` layer will be stored on the parameter server.
```python
import horovod.tensorflow.keras as hvd
import openembedding.tensorflow as embed
hvd.init()

optimizer = embed.distributed_optimizer(optimizer)
optimizer = hvd.DistributedOptimizer(optimizer)

model = embed.distributed_model(model)
```
Here, `embed.distributed_optimizer` converts a TensorFlow optimizer into one that supports the parameter server, so that the parameters stored on the parameter server can be updated. `embed.distributed_model` replaces the `Embedding` layers in the model and overrides its methods to support saving and loading with parameter servers. `Embedding.call` pulls the parameters from the parameter server, and a registered backpropagation function pushes the gradients back to the parameter server.
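
Conceptually, the replaced layer behaves like the following sketch (a simplified illustration, not the actual OpenEmbedding implementation; `ps_client` and its `pull`/`push` methods are hypothetical names, and the real layer is backed by custom native ops):
```python
import tensorflow as tf

class PSEmbedding(tf.keras.layers.Layer):
    """Conceptual sketch of a parameter-server-backed embedding layer."""

    def __init__(self, ps_client, embedding_dim, **kwargs):
        super().__init__(**kwargs)
        self.ps_client = ps_client
        self.embedding_dim = embedding_dim

    def call(self, ids):
        @tf.custom_gradient
        def lookup(ids):
            # Forward pass: pull only the embedding rows used by this batch.
            rows = self.ps_client.pull(ids)
            def grad(dy):
                # Backward pass: push the sparse gradients to the parameter
                # server, where the distributed optimizer applies them.
                self.ps_client.push(ids, dy)
                return None  # integer ids carry no gradient
            return rows, grad
        return lookup(ids)
```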

Data parallelism by Horovod.
```python
model.compile(optimizer, "binary_crossentropy", metrics=['AUC'],
              experimental_run_tf_function=False)
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0),
             hvd.callbacks.MetricAverageCallback()]
model.fit(dataset, epochs=10, verbose=2, callbacks=callbacks)
```

Export as a stand-alone SavedModel so that it can be loaded by TensorFlow Serving.
```python
if hvd.rank() == 0:
    # Must specify include_optimizer=False explicitly.
    model.save_as_original_model('model_path', include_optimizer=False)
```
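
Once the SavedModel is being served (for example by the TensorFlow Serving container started in the Quick Start, with model name `criteo` and REST port 8501), it can be queried over REST. This is a hedged sketch: the input keys below are placeholders, since the actual keys depend on the exported serving signature; inspect it with `saved_model_cli show --dir tmp/criteo/1 --all` and adjust the payload accordingly.
```python
import requests

# Placeholder payload for a criteo-like signature; replace the feature
# names and values with those from your exported model.
payload = {"instances": [{"C1": [1], "I1": [0.5]}]}
response = requests.post("http://localhost:8501/v1/models/criteo:predict",
                         json=payload)
print(response.json())  # e.g. {"predictions": [[0.123]]}
```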

More examples:
- [Replace `Embedding` layer](examples/criteo_deepctr_hook.py)
- [Transform network model](examples/criteo_deepctr_network.py)
- [Custom subclass model](examples/criteo_lr_subclass.py)
- [With MirroredStrategy](examples/criteo_deepctr_network_mirrored.py)
- [With MultiWorkerMirroredStrategy and MPI](examples/criteo_deepctr_network_mirrored.py)

## Build

### Docker Build

```bash
docker build -t 4pdosc/openembedding-base:0.1.0 -f docker/Dockerfile.base .
docker build -t 4pdosc/openembedding:0.0.0-build -f docker/Dockerfile.build .
```

### Native Build

The compiler needs to be compatible with `tf.version.COMPILER_VERSION` (>= 7), and all [prpc](https://github.com/4paradigm/prpc) dependencies must be installed to `tools` or `/usr/local`. Then run `build.sh` to complete the compilation; it automatically installs prpc (pico-core) and the parameter server (pico-ps) into the `tools` directory.

```bash
git submodule update --init --checkout --recursive
pip3 install tensorflow
./build.sh clean && ./build.sh build
pip3 install ./build/openembedding-*.tar.gz
```

## Features

TensorFlow 2
- `dtype`: `float32`, `float64`.
- `tensorflow.keras.initializers`
  - `RandomNormal`, `RandomUniform`, `Constant`, `Zeros`, `Ones`.
  - The parameter `seed` is currently ignored.
- `tensorflow.keras.optimizers`
  - `Adadelta`, `Adagrad`, `Adam`, `Adamax`, `Ftrl`, `RMSprop`, `SGD`.
  - `decay` and `LearningRateSchedule` are not supported.
  - `Adam(amsgrad=True)` is not supported.
  - `RMSprop(centered=True)` is not supported.
  - The parameter server uses a sparse update method, which may cause different training results for optimizers with momentum (see the sketch after this list).
- `tensorflow.keras.layers.Embedding`
  - Supports an array for known `input_dim` and a hash table for unknown `input_dim` (2**63 range).
  - Can still be stored on workers and use the dense update method.
  - Should not use `embeddings_regularizer` or `embeddings_constraint`.
- `tensorflow.keras.Model`
  - Can be converted to a distributed `Model`; incompatible settings (such as `embeddings_constraint`) are automatically ignored or converted.
  - Distributed `save`, `save_weights`, `load_weights` and `ModelCheckpoint`.
  - The distributed `Model` can be saved as a stand-alone SavedModel, which can be loaded by TensorFlow Serving.
  - Training multiple distributed `Model`s in one task is not supported.
  - Can collaborate with Horovod. Training with `MirroredStrategy` or `MultiWorkerMirroredStrategy` is experimental.
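
To illustrate the momentum caveat above, here is a minimal NumPy sketch (illustrative only, not OpenEmbedding code): a dense update decays every row's momentum on every step, while a sparse update skips rows absent from the batch, so their accumulated momentum is never applied and the resulting weights can differ.
```python
import numpy as np

beta, lr = 0.9, 0.1
# Two gradient steps for a 2-row embedding table; row 1 receives no
# gradient in the second step (it is absent from that batch).
grad_steps = [np.array([1.0, 1.0]), np.array([1.0, 0.0])]

def sgd_momentum(sparse):
    w = np.zeros(2)  # weights
    m = np.zeros(2)  # momentum accumulator
    for g in grad_steps:
        if sparse:
            touched = g != 0                      # rows present in the batch
            m[touched] = beta * m[touched] + g[touched]
            w[touched] -= lr * m[touched]
        else:
            m = beta * m + g                      # dense: all rows, every step
            w -= lr * m
    return w

print(sgd_momentum(sparse=False))  # [-0.29 -0.19]
print(sgd_momentum(sparse=True))   # [-0.29 -0.1 ]  row 1 differs
```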

## TODO

- Improve performance
- Support PyTorch training
- Support `tf.feature_column.embedding_column`
- Approximate support for `embeddings_regularizer`, `LearningRateSchedule`, etc.
- Improve the support for `Initializer` and `Optimizer`
- Training multiple distributed `Model`s in one task
- Support ONNX

## Designs

- [Training](documents/en/training.md)
- [Serving](documents/en/serving.md)

## Authors

- Yiming Liu ([email protected])
- Yilin Wang ([email protected])
- Cheng Chen ([email protected])
- Guangchuan Shi ([email protected])
- Zhao Zheng ([email protected])

## Persistent Memory (PMem)

Currently, the interface for persistent memory is experimental.
PMem-based OpenEmbedding provides a lightweight checkpointing scheme with performance comparable to its DRAM version. For long-running deep learning recommendation model training, PMem-based OpenEmbedding provides not only an efficient but also a reliable training process.
- [PMem-based OpenEmbedding](documents/en/pmem.md)

## Publications

- [OpenEmbedding: A Distributed Parameter Server for Deep Learning Recommendation Models using Persistent Memory](documents/papers/openembedding_icde2023.pdf). Cheng Chen, Yilin Wang, Jun Yang, Yiming Liu, Mian Lu, Zhao Zheng, Bingsheng He, Weng-Fai Wong, Liang You, Penghao Sun, Yuping Zhao, Fenghua Hu, and Andy Rudoff. In 2023 IEEE 39th International Conference on Data Engineering (ICDE), 2023.