https://github.com/vlmarkov/fault-tolerance-library
MPI user-level checkpoint library
- Host: GitHub
- URL: https://github.com/vlmarkov/fault-tolerance-library
- Owner: vlmarkov
- Created: 2016-10-08T10:28:29.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2020-07-11T13:40:57.000Z (almost 5 years ago)
- Last Synced: 2024-07-08T21:49:18.126Z (11 months ago)
- Topics: checkpoint, delta-encoding, fault-tolerance, fault-tolerance-library, incremental, jacobi-iteration, laplace-equation, mpi, nbody-simulation, particles, recovery, redundancy, ulfm
- Language: C++
- Homepage:
- Size: 23.6 MB
- Stars: 5
- Watchers: 1
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
# MPI Fault Tolerance Library
## Feature list
+ **C/C++**
+ **MPI 3.0**
+ [**User-level checkpoint library**](https://github.com/54markov/Fault-Tolerance-Library/tree/master/user-level-checkpoint "link to source files")
+ [**ULFM**](http://fault-tolerance.org/category/ulfm/ "official site ULFM") (ver 1.0 support)
+ **GNU/Linux**
+ **Unit test (cxxtest framework)**

## Test Samples
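For orientation, the Jacobi method used by the Laplace-equation sample in this section can be sketched serially as below. This is a minimal illustration, not the repository's code: it uses no MPI and no checkpointing, and the names `Grid`, `jacobi_sweep`, and `solve_laplace` are hypothetical. It assumes a rectangular grid whose outer rows and columns hold fixed boundary values.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical type for illustration: a dense 2D grid of doubles.
using Grid = std::vector<std::vector<double>>;

// One Jacobi sweep: every interior point of new_grid becomes the average of
// its four neighbours in old_grid. Returns the largest absolute change.
double jacobi_sweep(const Grid& old_grid, Grid& new_grid) {
    assert(old_grid.size() == new_grid.size());
    double max_diff = 0.0;
    for (std::size_t i = 1; i + 1 < old_grid.size(); ++i) {
        for (std::size_t j = 1; j + 1 < old_grid[i].size(); ++j) {
            new_grid[i][j] = 0.25 * (old_grid[i - 1][j] + old_grid[i + 1][j] +
                                     old_grid[i][j - 1] + old_grid[i][j + 1]);
            max_diff = std::max(max_diff,
                                std::fabs(new_grid[i][j] - old_grid[i][j]));
        }
    }
    return max_diff;
}

// Repeat sweeps until the largest change drops below tol or max_iters is
// reached. The result is left in grid; the return value is the sweep count.
int solve_laplace(Grid& grid, double tol, int max_iters) {
    Grid next = grid;  // the copy carries the fixed boundary values
    int iters = 0;
    while (iters < max_iters) {
        const double diff = jacobi_sweep(grid, next);
        std::swap(grid, next);  // grid now holds the newest iterate
        ++iters;
        if (diff < tol) {
            break;
        }
    }
    return iters;
}
```

In a checkpointed run, the grid and the iteration counter would be the natural state to save between sweeps, since together they are enough to resume the loop.
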
+ **head-2d** - [Laplace equation](https://en.wikipedia.org/wiki/Laplace%27s_equation "wiki Laplace equation") solver by [Jacobi iteration method](https://en.wikipedia.org/wiki/Jacobi_method "wiki Jacobi iteration method")
+ **n-body** - an [n-body simulation](https://en.wikipedia.org/wiki/N-body_simulation "wiki N-body simulation") that approximates the motion of particles interacting with one another through physical forces.
+ **midpoint-rule**
+ **monte-carlo**
+ **nprimes**

## User-level checkpoint library
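One of the features in this section is incremental checkpointing by XOR delta encoding. The general idea can be sketched as follows. This is an assumption-laden illustration rather than the library's actual API: the names `Bytes`, `xor_delta`, and `apply_delta` are hypothetical, and both snapshots are assumed to have the same size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical type for illustration: a raw snapshot of application state.
using Bytes = std::vector<std::uint8_t>;

// Delta between two equally sized snapshots: delta[i] = base[i] XOR current[i].
// Bytes that did not change become 0, so a mostly unchanged snapshot yields
// long zero runs that a general-purpose compressor can shrink well.
Bytes xor_delta(const Bytes& base, const Bytes& current) {
    assert(base.size() == current.size());
    Bytes delta(base.size());
    for (std::size_t i = 0; i < base.size(); ++i) {
        delta[i] = static_cast<std::uint8_t>(base[i] ^ current[i]);
    }
    return delta;
}

// Recovery: XOR is its own inverse, so base XOR delta gives back the
// snapshot the delta was computed from.
Bytes apply_delta(const Bytes& base, const Bytes& delta) {
    return xor_delta(base, delta);
}
```

Under this scheme only the first checkpoint needs to be stored in full. Each later checkpoint can be stored as a delta against the previous snapshot, at the cost of having to replay the deltas in order during recovery.
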
+ **Rollback recovery** - checkpoint/restart based
+ **Failure detection** - ULFM based
+ **Snapshot creation** - hard drive based (in place/via NFS)
+ **Incremental checkpointing** - delta encoding based (XOR operation)
+ **Additional compression step** - [zlib](https://zlib.net/ "official site") based

## ULFM
+ **Survivability**
+ **Fault tolerance**
+ **Compute redundancy**

## WIP
+ **Implementing alternative fault-tolerance recovery methods**
+ **Expanding test sample base**
+ **Reducing overhead**
+ **Improving implementation**

## This project was implemented as part of my graduate thesis in the Computing Systems department of the Siberian State University of Telecommunications and Information Sciences.
+ **Graduate student: Vladislav Markov**
+ **Supervisor: Mikhail Kurnosov**