https://github.com/kleveross/ftlib
Fault-tolerant for DL frameworks
infrastructure machine-learning
- Host: GitHub
- URL: https://github.com/kleveross/ftlib
- Owner: kleveross
- License: apache-2.0
- Created: 2019-11-08T02:35:42.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2023-07-05T20:58:02.000Z (almost 3 years ago)
- Last Synced: 2025-03-28T17:57:28.078Z (about 1 year ago)
- Topics: infrastructure, machine-learning
- Language: Python
- Homepage:
- Size: 800 KB
- Stars: 70
- Watchers: 10
- Forks: 13
- Open Issues: 6
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# FTLib
FTLib (Fault-Tolerant Library) is a framework that keeps data-parallel distributed training running regardless of worker loss or join. It exposes collective communication APIs with fault-tolerance support by gluing a `consensus` to a `communication library`, both of which can be user-specified. Distributed training that uses FTLib can continue as long as at least one worker is alive, and new workers can join the training while it runs.
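To make the idea of pairing a consensus component with a communication library more concrete, here is a minimal, in-process sketch. It is not FTLib's API: `Membership`, `WorkerLost`, `allreduce_mean`, and `fault_tolerant_allreduce_mean` are hypothetical names invented for illustration, and the "workers" are plain dictionary entries rather than real processes. The sketch only assumes the behavior described above: when a worker is lost during a collective operation, the surviving workers agree on a new member list and retry, and a new worker can join later.

```python
from typing import Callable, Dict, List


class WorkerLost(Exception):
    """Raised by the toy communication layer when a peer is unreachable."""

    def __init__(self, worker_id: str) -> None:
        super().__init__(f"worker {worker_id} is unreachable")
        self.worker_id = worker_id


class Membership:
    """Hypothetical stand-in for a consensus component: the agreed set of live workers."""

    def __init__(self, workers: List[str]) -> None:
        self._alive = set(workers)

    def alive(self) -> List[str]:
        return sorted(self._alive)

    def remove(self, worker_id: str) -> None:
        self._alive.discard(worker_id)

    def add(self, worker_id: str) -> None:
        self._alive.add(worker_id)


def allreduce_mean(grads: Dict[str, float], members: List[str],
                   is_reachable: Callable[[str], bool]) -> float:
    """Hypothetical stand-in for a communication library: average over `members`."""
    for worker_id in members:
        if not is_reachable(worker_id):
            raise WorkerLost(worker_id)
    return sum(grads[w] for w in members) / len(members)


def fault_tolerant_allreduce_mean(grads: Dict[str, float], membership: Membership,
                                  is_reachable: Callable[[str], bool]) -> float:
    """Retry the collective on the surviving workers until it succeeds."""
    while True:
        members = membership.alive()
        if not members:
            raise RuntimeError("no live workers left")
        try:
            return allreduce_mean(grads, members, is_reachable)
        except WorkerLost as err:
            # Drop the lost worker from the agreed member list, then retry.
            membership.remove(err.worker_id)


# Three workers start; w2 fails during the first collective.
grads = {"w0": 1.0, "w1": 2.0, "w2": 6.0}
membership = Membership(["w0", "w1", "w2"])
failed = {"w2"}
after_loss = fault_tolerant_allreduce_mean(grads, membership, lambda w: w not in failed)

# A new worker w3 joins and takes part in the next collective.
membership.add("w3")
grads["w3"] = 3.0
after_join = fault_tolerant_allreduce_mean(grads, membership, lambda w: w not in failed)

print(after_loss, after_join, membership.alive())
```

In a real deployment the member list would be agreed on across processes by the chosen consensus implementation, and the averaging would be carried out by the chosen communication library. The retry loop is only one possible way to combine the two.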
## Status
Prototyping
## Design
* [Design docs](https://github.com/caicloud/ftlib/tree/master/docs/design)
## Development Guide
**TODO**
Please refer to the [design docs](https://github.com/caicloud/ftlib/tree/master/docs/design).
## See also
* [ElasticDL](https://github.com/sql-machine-learning/elasticdl/)
## Getting started
### Where to use FTLib
- Less reliable infrastructure/script
Distributed training jobs running on less reliable infrastructure carry more risk, because any worker or communication failure leads to the termination of the entire job.
- Dynamic workload system
A system may reduce the total workload of distributed training jobs to free up resources for jobs with higher priority. When no higher-priority jobs are waiting, the system can increase the workload again to avoid leaving resources idle.
### Requirements
The requirements for using `FTLib` differ with the choice of consensus and communication library. Please refer to the `requirements.txt` under each consensus and communication library (*not available yet, still on the to-do list*).
### Usage
Please refer to [`test`](./test) for details on how to use `FTLib` in distributed training.
### Layout
```
.
├── CHANGELOG.md
├── deploy
├── docs
│ ├── design
│ └── imgs
├── ftlib
│ ├── consensus
│ ├── commlib
│ ├── ftlib_status.py
│ ├── __init__.py
│ └── rank_assign_scheme.py
├── LICENSE
├── OWNERS
├── README.md
├── requirements.txt
├── ROADMAP
├── scripts
└── test
```
## License
FTLib is licensed under the [Apache license](LICENSE). Implementations of consensus and communication libraries may come with different licenses.