https://github.com/p4lang/p4app-switchml

Switch ML Application
collectives dpdk in-network-compute machine-learning p4 p4lang rdma tna tofino

Host: GitHub
URL: https://github.com/p4lang/p4app-switchml
Owner: p4lang
License: apache-2.0
Created: 2021-01-14T15:30:00.000Z (over 5 years ago)
Default Branch: main
Last Pushed: 2022-07-15T03:23:37.000Z (almost 4 years ago)
Last Synced: 2025-04-25T05:36:29.262Z (about 1 year ago)
Topics: collectives, dpdk, in-network-compute, machine-learning, p4, p4lang, rdma, tna, tofino
Language: C++
Homepage: https://switchml.readthedocs.io/
Size: 347 KB
Stars: 184
Watchers: 20
Forks: 52
Open Issues: 13
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

# SwitchML: Switch-Based Training Acceleration for Machine Learning

SwitchML accelerates the Allreduce communication primitive commonly used by distributed Machine Learning frameworks. It uses a programmable switch dataplane to perform in-network computation, reducing the volume of exchanged data by aggregating vectors (e.g., model updates) from multiple workers in the network. It provides an end-host library that can be integrated with ML frameworks to provide an efficient solution that speeds up training for a number of real-world benchmark models.
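To make the aggregation idea concrete, here is a minimal sketch of the Allreduce semantics in plain C++. It is not the SwitchML API: the function names `aggregate` and `allreduce_sum` and the use of 64-bit integers are assumptions made for illustration only. The `aggregate` function stands in for the aggregation point (the switch, in SwitchML's case), and every worker ends up with the same element-wise sum.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper for illustration; not part of the SwitchML library.
// Models the aggregation point: sums the workers' vectors element-wise.
std::vector<std::int64_t> aggregate(
    const std::vector<std::vector<std::int64_t>>& worker_updates) {
    if (worker_updates.empty()) return {};
    std::vector<std::int64_t> sum(worker_updates.front().size(), 0);
    for (const auto& update : worker_updates) {
        // All workers must contribute vectors of the same length.
        assert(update.size() == sum.size());
        for (std::size_t i = 0; i < sum.size(); ++i) sum[i] += update[i];
    }
    return sum;
}

// Allreduce (sum): after the call, every worker holds the aggregated vector.
void allreduce_sum(std::vector<std::vector<std::int64_t>>& workers) {
    const std::vector<std::int64_t> result = aggregate(workers);
    for (auto& w : workers) w = result;
}
```

With three workers holding `{1, 2, 3}`, `{10, 20, 30}` and `{100, 200, 300}`, each worker holds `{111, 222, 333}` after `allreduce_sum`. Each worker contributes its vector once and receives one aggregated vector back, which is where the reduction in exchanged data comes from.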

The switch hardware is programmed with a [P4 program](/dev_root/p4) for the [Tofino Native Architecture (TNA)](https://github.com/barefootnetworks/Open-Tofino) and managed at runtime through a [Python controller](/dev_root/controller) using BFRuntime. The [end-host library](/dev_root/client_lib) provides simple APIs to perform Allreduce operations using different transport protocols. Two transports are currently supported: UDP through DPDK, and RDMA UC. The library has already been integrated with ML frameworks as an [NCCL plugin](/dev_root/frameworks_integration/nccl_plugin).

## Getting started
To run SwitchML you need to:
- compile the P4 program and deploy it on the switch (see the [P4 code documentation](/dev_root/p4))
- run the Python controller (see the [controller documentation](/dev_root/controller))
- compile and run the end-host program using the end-host library (see the [library documentation](/dev_root/client_lib))

The [examples](/dev_root/examples) folder provides simple programs that show how to use the APIs.

## Repo organization
The SwitchML repository is organized as follows:

```
docs: project documentation
dev_root:
┣ p4: P4 code for TNA
┣ controller: switch controller program
┣ client_lib: end-host library
┣ examples: set of example programs
┣ benchmarks: programs used to test raw performance
┣ frameworks_integration: code to integrate with ML frameworks
┣ third_party: third party software
┣ protos: protobuf description for the interface between controller and end-host
┗ scripts: helper scripts
```

## Testing
The [benchmarks](/dev_root/benchmarks) folder contains a benchmark program that we used to measure SwitchML performance.
In our experiments (see the benchmark documentation for details) we observed a speedup of more than 2x over NCCL when using RDMA. Moreover, unlike ring Allreduce, SwitchML performance stays constant regardless of the number of workers.

![Benchmarks](/docs/img/benchmark.png)
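One way to reason about the scaling behavior is a simple textbook cost model, sketched below. It is an assumption for illustration, not the analysis or measurements behind the figure above, and all function names are hypothetical. In ring Allreduce over `n` workers, each worker sends `2(n-1)/n` times the vector size across `2(n-1)` sequential steps. With aggregation in the switch, each worker sends its vector once, independent of `n`.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative cost model only (assumed, not measured).

// Bytes each worker sends in ring Allreduce: 2 * (n - 1) / n * vector size.
double ring_bytes_sent_per_worker(double vector_bytes, std::size_t n) {
    assert(n >= 1);
    return 2.0 * static_cast<double>(n - 1) / static_cast<double>(n) * vector_bytes;
}

// Sequential communication steps in ring Allreduce
// (reduce-scatter followed by all-gather).
std::size_t ring_steps(std::size_t n) {
    assert(n >= 1);
    return 2 * (n - 1);
}

// Bytes each worker sends when a switch aggregates:
// the vector is sent once, whatever the number of workers.
double switch_bytes_sent_per_worker(double vector_bytes, std::size_t n) {
    assert(n >= 1);
    return vector_bytes;
}
```

Under this model, going from 4 to 8 workers raises the ring step count from 6 to 14, while the per-worker traffic with switch aggregation is unchanged. The model ignores packet loss, switch memory limits and protocol overhead.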

## Publication

> [Scaling Distributed Machine Learning with In-Network Aggregation
> A. Sapio, M. Canini, C.-Y. Ho, J. Nelson, P. Kalnis, C. Kim, A. Krishnamurthy, M. Moshref, D. R. K. Ports, P. Richtarik.
> In Proceedings of NSDI’21, Apr 2021.](https://www.usenix.org/conference/nsdi21/presentation/sapio)

## Contributing
This project welcomes contributions and suggestions.
To learn more about making a contribution to SwitchML, please see our [Contribution](/CONTRIBUTING.md) page.

## The Team
SwitchML is a project driven by the [P4.org](https://p4.org) community and is currently maintained by Amedeo Sapio, Omar Alama, Marco Canini, and Jacob Nelson.

## License
SwitchML is released under the Apache License 2.0, as found in the [LICENSE](/LICENSE) file.
