# OpenEmbedding

[![build status](https://github.com/4paradigm/openembedding/actions/workflows/build.yml/badge.svg)](https://github.com/4paradigm/openembedding/actions/workflows/build.yml)
[![docker pulls](https://img.shields.io/docker/pulls/4pdosc/openembedding.svg)](https://hub.docker.com/r/4pdosc/openembedding)
[![python version](https://img.shields.io/pypi/pyversions/openembedding.svg?style=plastic)](https://badge.fury.io/py/openembedding)
[![pypi package version](https://badge.fury.io/py/openembedding.svg)](https://badge.fury.io/py/openembedding)
[![downloads](https://pepy.tech/badge/openembedding)](https://pepy.tech/project/openembedding)

English version | [中文版](README_cn.md)

## About

**OpenEmbedding is an open-source framework for TensorFlow distributed training acceleration.**

Nowadays, many machine learning and deep learning applications are built on parameter servers, which efficiently store and update model weights. When a model has a large number of sparse features (e.g., Wide&Deep and DeepFM for CTR prediction), the number of weights easily runs into billions to trillions. In such cases, traditional synchronization solutions (such as the Allreduce-based solution adopted by Horovod) cannot achieve high performance because of the massive communication overhead introduced by the tremendous number of sparse features. To achieve efficiency for such sparse models, we developed OpenEmbedding, which enhances the parameter server specifically for sparse model training and inference.

## Highlights

Efficiency
- We propose an efficient customized sparse format to handle sparse features, together with fine-grained optimizations such as cache-conscious algorithms, asynchronous cache reads and writes, and lightweight locks to maximize parallelism. For sparse model training, OpenEmbedding achieves a speedup of 3-8x compared with the Allreduce-based solution on a single machine equipped with 8 GPUs.

Ease-of-use
- We have integrated OpenEmbedding into TensorFlow. Only three lines of code changes are required to use OpenEmbedding in TensorFlow for both training and inference.

Adaptability
- In addition to TensorFlow, it is straightforward to integrate OpenEmbedding into other popular frameworks. We demonstrate the integration with DeepCTR and Horovod in the examples.

## Benchmark

![benchmark](documents/images/benchmark.png)

For models that contain sparse features, it is difficult to achieve a speedup with the Allreduce-based framework Horovod alone. Using OpenEmbedding together with Horovod yields better acceleration: on a single machine with 8 GPUs the speedup ratio is 3 to 8 times, and many models achieve 3 to 7 times the performance of Horovod.

- [Benchmark](documents/en/benchmark.md)

## Install & Quick Start

You can install and run OpenEmbedding with the following steps. The examples show the whole process of training on the [criteo](https://labs.criteo.com/2014/09/kaggle-contest-dataset-now-available-academic-use/) dataset with OpenEmbedding and predicting with TensorFlow Serving.

### Docker

NVIDIA Docker is required to use GPUs inside the image. The OpenEmbedding image can be obtained from [Docker Hub](https://hub.docker.com/r/4pdosc/openembedding/tags).

```bash
# The script "criteo_deepctr_standalone.sh" will train and export the model to the path "tmp/criteo/1".
# It is okay to switch to:
# "criteo_deepctr_horovod.sh" (multi-GPU training with Horovod),
# "criteo_deepctr_mirrored.sh" (multi-GPU training with MirroredStrategy),
# "criteo_deepctr_mpi.sh" (multi-GPU training with MultiWorkerMirroredStrategy and MPI).
docker run --rm --gpus all -v /tmp/criteo:/openembedding/tmp/criteo \
4pdosc/openembedding:latest examples/run/criteo_deepctr_standalone.sh

# Start TensorFlow Serving to load the trained model.
docker run --name serving-example -td -p 8500:8500 -p 8501:8501 \
-v /tmp/criteo:/models/criteo -e MODEL_NAME=criteo tensorflow/serving:latest
# Wait for the model server to start.
sleep 5

# Send requests and get prediction results.
docker run --rm --network host 4pdosc/openembedding:latest examples/run/criteo_deepctr_restful.sh

# Clean up the serving container.
docker stop serving-example
docker rm serving-example
```

### Ubuntu

```bash
# Install the dependencies required by OpenEmbedding.
apt update && apt install -y gcc-7 g++-7 python3 libpython3-dev python3-pip
pip3 install --upgrade pip
pip3 install tensorflow==2.5.1
pip3 install openembedding

# Install the dependencies required by examples.
apt install -y git cmake mpich curl
HOROVOD_WITHOUT_MPI=1 pip3 install horovod
pip3 install deepctr pandas scikit-learn mpi4py

# Download the examples.
git clone https://github.com/4paradigm/OpenEmbedding.git
cd OpenEmbedding

# The script "criteo_deepctr_standalone.sh" will train and export the model to the path "tmp/criteo/1".
# It is okay to switch to:
# "criteo_deepctr_horovod.sh" (multi-GPU training with Horovod),
# "criteo_deepctr_mirrored.sh" (multi-GPU training with MirroredStrategy),
# "criteo_deepctr_mpi.sh" (multi-GPU training with MultiWorkerMirroredStrategy and MPI).
examples/run/criteo_deepctr_standalone.sh

# Start TensorFlow Serving to load the trained model.
docker run --name serving-example -td -p 8500:8500 -p 8501:8501 \
-v `pwd`/tmp/criteo:/models/criteo -e MODEL_NAME=criteo tensorflow/serving:latest
# Wait for the model server to start.
sleep 5

# Send requests and get prediction results.
examples/run/criteo_deepctr_restful.sh

# Clean up the serving container.
docker stop serving-example
docker rm serving-example
```

### CentOS

```bash
# Install the dependencies required by OpenEmbedding.
yum install -y centos-release-scl
yum install -y python3 python3-devel devtoolset-7
scl enable devtoolset-7 bash
pip3 install --upgrade pip
pip3 install tensorflow==2.5.1
pip3 install openembedding

# Install the dependencies required by examples.
yum install -y git cmake mpich curl
HOROVOD_WITHOUT_MPI=1 pip3 install horovod
pip3 install deepctr pandas scikit-learn mpi4py

# Download the examples.
git clone https://github.com/4paradigm/OpenEmbedding.git
cd OpenEmbedding

# The script "criteo_deepctr_standalone.sh" will train and export the model to the path "tmp/criteo/1".
# It is okay to switch to:
# "criteo_deepctr_horovod.sh" (multi-GPU training with Horovod),
# "criteo_deepctr_mirrored.sh" (multi-GPU training with MirroredStrategy),
# "criteo_deepctr_mpi.sh" (multi-GPU training with MultiWorkerMirroredStrategy and MPI).
examples/run/criteo_deepctr_standalone.sh

# Start TensorFlow Serving to load the trained model.
docker run --name serving-example -td -p 8500:8500 -p 8501:8501 \
-v `pwd`/tmp/criteo:/models/criteo -e MODEL_NAME=criteo tensorflow/serving:latest
# Wait for the model server to start.
sleep 5

# Send requests and get prediction results.
examples/run/criteo_deepctr_restful.sh

# Clean up the serving container.
docker stop serving-example
docker rm serving-example
```

### Note

The installation usually requires g++ 7 or higher, or a compiler compatible with `tf.version.COMPILER_VERSION`. The compiler can be specified by the environment variables `CC` and `CXX`. Currently, OpenEmbedding can only be installed on Linux.
```bash
CC=gcc CXX=g++ pip3 install openembedding
```

If TensorFlow is upgraded, you need to reinstall OpenEmbedding.
```bash
pip3 uninstall openembedding && pip3 install --no-cache-dir openembedding
```

## User Guide

A sample program for common usage is as follows.

Create `Model` and `Optimizer`.
```python
import tensorflow as tf
from deepctr.models import WDL
optimizer = tf.keras.optimizers.Adam()
model = WDL(feature_columns, feature_columns, task='binary')
```
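
In the snippet above, `feature_columns` is assumed to be defined elsewhere. For illustration, it might be built with DeepCTR roughly as follows (a sketch only; the feature names, vocabulary sizes, and embedding dimensions are placeholders, not values from this project):
```python
from deepctr.feature_column import SparseFeat, DenseFeat

# Placeholder definitions for a criteo-like dataset: 26 sparse (categorical)
# features C1..C26 and 13 dense (numerical) features I1..I13.
feature_columns = (
    [SparseFeat(f'C{i}', vocabulary_size=1000, embedding_dim=9)
     for i in range(1, 27)]
    + [DenseFeat(f'I{i}', 1) for i in range(1, 14)]
)
```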

Transform to distributed `Model` and distributed `Optimizer`. The `Embedding` layer will be stored on the parameter server.
```python
import horovod.tensorflow.keras as hvd
import openembedding.tensorflow as embed
hvd.init()

optimizer = embed.distributed_optimizer(optimizer)
optimizer = hvd.DistributedOptimizer(optimizer)

model = embed.distributed_model(model)
```
Here, `embed.distributed_optimizer` converts a TensorFlow optimizer into one that supports the parameter server, so that the parameters stored on the parameter server can be updated. `embed.distributed_model` replaces the `Embedding` layers in the model and overrides its methods to support saving and loading with parameter servers. `Embedding.call` pulls the parameters from the parameter server, and a registered backpropagation function pushes the gradients back to the parameter server.
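
Conceptually, the replaced layer behaves like the following sketch (a simplified illustration, not the actual OpenEmbedding implementation; `ps_client` and its `pull`/`push` methods are hypothetical names, and the real layer is backed by custom native ops):
```python
import tensorflow as tf

class PSEmbedding(tf.keras.layers.Layer):
    """Conceptual sketch of a parameter-server-backed embedding layer."""

    def __init__(self, ps_client, embedding_dim, **kwargs):
        super().__init__(**kwargs)
        self.ps_client = ps_client
        self.embedding_dim = embedding_dim

    def call(self, ids):
        @tf.custom_gradient
        def lookup(ids):
            # Forward pass: pull only the embedding rows used by this batch.
            rows = self.ps_client.pull(ids)
            def grad(dy):
                # Backward pass: push the sparse gradients to the parameter
                # server, where the distributed optimizer applies them.
                self.ps_client.push(ids, dy)
                return None  # integer ids carry no gradient
            return rows, grad
        return lookup(ids)
```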

Data parallelism by Horovod.
```python
model.compile(optimizer, "binary_crossentropy", metrics=['AUC'],
              experimental_run_tf_function=False)
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0),
             hvd.callbacks.MetricAverageCallback()]
model.fit(dataset, epochs=10, verbose=2, callbacks=callbacks)
```

Export as a stand-alone SavedModel so that it can be loaded by TensorFlow Serving.
```python
if hvd.rank() == 0:
    # Must specify include_optimizer=False explicitly.
    model.save_as_original_model('model_path', include_optimizer=False)
```
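
Once the SavedModel is being served (for example by the TensorFlow Serving container started in the Quick Start, with model name `criteo` and REST port 8501), it can be queried over REST. This is a hedged sketch: the input keys below are placeholders, since the actual keys depend on the exported serving signature; inspect it with `saved_model_cli show --dir tmp/criteo/1 --all` and adjust the payload accordingly.
```python
import requests

# Placeholder payload for a criteo-like signature; replace the feature
# names and values with those from your exported model.
payload = {"instances": [{"C1": [1], "I1": [0.5]}]}
response = requests.post("http://localhost:8501/v1/models/criteo:predict",
                         json=payload)
print(response.json())  # e.g. {"predictions": [[0.123]]}
```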

More examples:
- [Replace `Embedding` layer](examples/criteo_deepctr_hook.py)
- [Transform network model](examples/criteo_deepctr_network.py)
- [Custom subclass model](examples/criteo_lr_subclass.py)
- [With MirroredStrategy](examples/criteo_deepctr_network_mirrored.py)
- [With MultiWorkerMirroredStrategy and MPI](examples/criteo_deepctr_network_mirrored.py)

## Build

### Docker Build

```bash
docker build -t 4pdosc/openembedding-base:0.1.0 -f docker/Dockerfile.base .
docker build -t 4pdosc/openembedding:0.0.0-build -f docker/Dockerfile.build .
```

### Native Build

The compiler needs to be compatible with `tf.version.COMPILER_VERSION` (>= 7), and all [prpc](https://github.com/4paradigm/prpc) dependencies must be installed to `tools` or `/usr/local`. Then run `build.sh` to complete the compilation; it automatically installs prpc (pico-core) and the parameter server (pico-ps) into the `tools` directory.

```bash
git submodule update --init --checkout --recursive
pip3 install tensorflow
./build.sh clean && ./build.sh build
pip3 install ./build/openembedding-*.tar.gz
```

## Features

TensorFlow 2
- `dtype`: `float32`, `float64`.
- `tensorflow.keras.initializers`
  - `RandomNormal`, `RandomUniform`, `Constant`, `Zeros`, `Ones`.
  - The parameter `seed` is currently ignored.
- `tensorflow.keras.optimizers`
  - `Adadelta`, `Adagrad`, `Adam`, `Adamax`, `Ftrl`, `RMSprop`, `SGD`.
  - `decay` and `LearningRateSchedule` are not supported.
  - `Adam(amsgrad=True)` is not supported.
  - `RMSprop(centered=True)` is not supported.
  - The parameter server uses a sparse update method, which may cause different training results for optimizers with momentum (see the sketch after this list).
- `tensorflow.keras.layers.Embedding`
  - Supports an array for known `input_dim` and a hash table for unknown `input_dim` (2**63 range).
  - Can still be stored on workers and use the dense update method.
  - Should not use `embeddings_regularizer` or `embeddings_constraint`.
- `tensorflow.keras.Model`
  - Can be converted to a distributed `Model`; incompatible settings (such as `embeddings_constraint`) are automatically ignored or converted.
  - Distributed `save`, `save_weights`, `load_weights` and `ModelCheckpoint`.
  - The distributed `Model` can be saved as a stand-alone SavedModel, which can be loaded by TensorFlow Serving.
  - Training multiple distributed `Model`s in one task is not supported.
  - Can collaborate with Horovod. Training with `MirroredStrategy` or `MultiWorkerMirroredStrategy` is experimental.
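
To illustrate the momentum caveat above, here is a minimal NumPy sketch (illustrative only, not OpenEmbedding code): a dense update decays every row's momentum on every step, while a sparse update skips rows absent from the batch, so their accumulated momentum is never applied and the resulting weights can differ.
```python
import numpy as np

beta, lr = 0.9, 0.1
# Two gradient steps for a 2-row embedding table; row 1 receives no
# gradient in the second step (it is absent from that batch).
grad_steps = [np.array([1.0, 1.0]), np.array([1.0, 0.0])]

def sgd_momentum(sparse):
    w = np.zeros(2)  # weights
    m = np.zeros(2)  # momentum accumulator
    for g in grad_steps:
        if sparse:
            touched = g != 0                      # rows present in the batch
            m[touched] = beta * m[touched] + g[touched]
            w[touched] -= lr * m[touched]
        else:
            m = beta * m + g                      # dense: all rows, every step
            w -= lr * m
    return w

print(sgd_momentum(sparse=False))  # [-0.29 -0.19]
print(sgd_momentum(sparse=True))   # [-0.29 -0.1 ]  row 1 differs
```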

## TODO

- Improve performance
- Support PyTorch training
- Support `tf.feature_column.embedding_column`
- Approximate support for `embeddings_regularizer`, `LearningRateSchedule`, etc.
- Improve the support for `Initializer` and `Optimizer`
- Training multiple distributed `Model`s in one task
- Support ONNX

## Designs

- [Training](documents/en/training.md)
- [Serving](documents/en/serving.md)

## Authors

- Yiming Liu ([email protected])
- Yilin Wang ([email protected])
- Cheng Chen ([email protected])
- Guangchuan Shi ([email protected])
- Zhao Zheng ([email protected])

## Persistent Memory (PMem)

Currently, the interface for persistent memory is experimental.
PMem-based OpenEmbedding provides a lightweight checkpointing scheme with performance comparable to its DRAM version. For long-running deep learning recommendation model training, PMem-based OpenEmbedding provides not only an efficient but also a reliable training process.
- [PMem-based OpenEmbedding](documents/en/pmem.md)

## Publications

- [OpenEmbedding: A Distributed Parameter Server for Deep Learning Recommendation Models using Persistent Memory](documents/papers/openembedding_icde2023.pdf). Cheng Chen, Yilin Wang, Jun Yang, Yiming Liu, Mian Lu, Zhao Zheng, Bingsheng He, Weng-Fai Wong, Liang You, Penghao Sun, Yuping Zhao, Fenghua Hu, and Andy Rudoff. In 2023 IEEE 39th International Conference on Data Engineering (ICDE), 2023.