https://github.com/valay17/metadata-based-incremental-file-backup
A high-performance file synchronization system built in modern C++ that uses metadata caching, multithreading, and per-source serialized copying to perform incremental backups efficiently between local and external directories, minimizing redundant operations and optimizing disk I/O.
https://github.com/valay17/metadata-based-incremental-file-backup
backup-tool cache cli cpp20 incremental-backups linux metadata-driven multithreaded windows
Last synced: 10 months ago
JSON representation
A high-performance file synchronization system built in modern C++ that uses metadata caching, multithreading, and per-source serialized copying to perform incremental backups efficiently between local and external directories, minimizing redundant operations and optimizing disk I/O.
- Host: GitHub
- URL: https://github.com/valay17/metadata-based-incremental-file-backup
- Owner: Valay17
- Created: 2025-07-16T19:43:53.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-07-20T04:46:29.000Z (11 months ago)
- Last Synced: 2025-07-20T06:36:09.993Z (11 months ago)
- Topics: backup-tool, cache, cli, cpp20, incremental-backups, linux, metadata-driven, multithreaded, windows
- Homepage:
- Size: 36.1 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
Metadata Based Incremental File Backup
FileSync is a fast and highly configurable file synchronization utility for local and external backups, available on both Windows and Linux. It supports multiple source directories and uses a metadata-based incremental sync approach to avoid redundant copying. File changes are detected using a combination of file size and modification time, which are hashed and stored in a binary cache to ensure accurate and efficient updates across runs.
Built in modern C++, the tool leverages multithreaded scanning and hashing to accelerate sync decisions across large directories. A dedicated copy manager serializes I/O operations per source to prevent disk contention and ensure stable performance. FileSync includes robust failure tracking: unsuccessful copy operations are detected and logged, and metadata is updated post-copy to reflect actual copy status, allowing precise reporting and recovery on future syncs. This makes it particularly well-suited for repeated backups, local mirroring, and any scenario requiring reliable, high-throughput file synchronization with built-in integrity tracking.
**Note:** FileSync scans only the configured source directories during synchronization. It does not perform a full scan or analysis of the destination directory’s contents, although it does verify the existence of destination folders and files as needed during the copy process. Updated files are overwritten in the destination along with their metadata. Files deleted from the source can be removed from the destination after a configurable number of sync runs, enabling automatic cleanup of stale files. This behavior is controlled by a configurable flag in the config file, providing flexibility between non-destructive and fully synchronized modes.
#
### Features
- **Metadata-Based Incremental Sync**
Uses a combination of file size and modification time, which are hashed and stored in a binary cache to detect changed files across sync runs. This avoids unnecessary file copies and speeds up subsequent synchronizations.
- **Multi Source and Multi Destination Support**
Supports syncing from multiple user defined source directories to a single destination per run. To sync to multiple destinations, separate runs of the tool are performed, with each destination managed independently, including its own metadata cache and sync state. This design allows clean separation between backup targets (e.g., external drives, network shares, secondary partitions) without cross contamination of metadata.
- **Per Source Serialized Copying**
Files within each source are copied one at a time, in the order they were discovered. This approach minimizes disk thrashing and significantly improves performance on HDDs. For SSDs or high performance environments(NAS), this behavior can be optionally disabled to enable parallel file copying per source.
- **Post Copy Metadata Update**
Metadata is updated only after a file has been physically and verifiably copied, ensuring that the cache reflects the true state of the destination. This prevents corrupt or incomplete sync states from being recorded and maintains cache integrity across runs.
- **Failure Detection and Recovery Support**
Copy failures such as permission issues, disk errors, or missing files are recorded during the sync. These failures are surfaced to the user via logs and excluded from metadata updates, allowing the tool to automatically retry them in subsequent runs without reprocessing successful files.
- **Multithreaded Scanning and Hashing**
Directories are scanned and files are hashed in parallel using a custom thread pool. This significantly improves performance on large directories while preserving system responsiveness through controlled concurrency.
- **Config File Driven Operation**
All behavior of FileSync is controlled via a simple, text based configuration file. This file defines source/destination paths, exclusions, sync mode, logging, stale file handling, and other advanced flags. Users can fully customize how the sync operates by modifying the config file.
- **Thread Safe Copy Manager**
A dedicated I/O thread handles copy execution by dequeuing per source copy queues, ensuring one at a time copy per source for consistent disk behavior.
- **Full Overlap Between Sync and Copy Phases**
Sync decisions (scanning and hashing) for one source can run while another source's files are being copied, maximizing throughput.
- **Efficient Queue Synchronization**
Thread safe mechanisms using mutexes and condition variables coordinate sync threads and the global copy manager.
- **Safe Path Handling and Validation**
Only absolute paths are supported and resolved; invalid or inaccessible paths are detected and reported before syncing begins.
- **Duplicate, Nested Folder, and Symlink Handling**
Detects duplicate source directories and nested folder relationships to avoid redundant operations. Symbolic links are ignored to prevent circular references and unintended copies.
- **INFO / ERROR Log Levels**
Logs are categorized by severity, helping users trace sync events or failure points clearly.
- **Live Console Feedback**
Real time sync and copy updates are shown in the terminal.
- **Failure Mode**
Ensures safe, resumable syncing in case of critical failures such as disk full, I/O errors, or destination disconnection. Copy operations are verified per source before being marked complete, and any failed files are automatically retried on the next run without affecting already synced files, allowing users to fix the issue and resume syncing cleanly using the existing cache.
#
### Why Use This Tool?
- Backing up personal files and documents to any storage medium accessible via the local file system such external hard drives, network-mounted drives or USB drives.
- Syncing development projects and code repositories across multiple workstations or laptops.
- Maintaining up to date media libraries (photos, videos, music) without redundant copying.
- Running scheduled incremental backups with minimal impact on system resources.
#
### What Can I Back Up?
FileSync supports backing up a wide range of files and directories, including:
- Regular files of any type, such as documents, images, videos, and executables.
- Hidden files and folders (e.g., dotfiles and dotfolders on Linux like `.bashrc`, and hidden files/folders on Windows like `desktop.ini` or hidden system folders).
- Multiple source directories, allowing complex backup configurations.
- Entire folders, including nested subdirectories, while avoiding redundant scanning of nested or duplicate sources.
**Note:**
- Only absolute paths are supported for both sources and destinations.
- Symbolic links are ignored to prevent circular references and unintended copies.
- Tool assumes that no failure occurs during the first sync run. A successful initial sync ensures accurate metadata and smooth operation in future runs.
- Tool does not check for available disk space on the destination before copying. If the destination runs out of space during transfer, the sync will stop and mark pending files as incomplete. Simply rerun the program and it will automatically enter failure recovery mode to continue any pending or partially completed file transfers.
- If you have very large or deeply nested directory structures, it is recommended to split your sources into separate entries in the config file. This ensures quicker recovery from failures on a per source basis, enables better parallelism, improves isolation of errors, and minimizes the risk of a single failure affecting the entire source's sync process.
- The tool is designed to operate on any storage medium accessible via a standard file system path including local disks, external drives, mounted network shares (such as SMB/CIFS or NFS), iSCSI volumes, and FUSE mounted systems. Protocol based sources like FTP, SFTP, or HTTP are not natively supported unless mounted into the local file system using 3rd party tools.
#
### Installation
Prebuilt binaries are provided as `.zip` files for Windows and `.tar.gz` files for Linux in the Releases section. (Architecture `x64`)
#### Building from Source
Build using CMake:
```cmd
cmake ..
cmake --build . --config Release
```
#
### Configuration File Info
FileSync uses a simple text based configuration file to control its behavior. The config file supports defining:
- Source directories (absolute paths only) (can be files and/or directories, both supported)
- Destination directory (absolute path)
- File and folder exclusions by name (absolute path only)
- Sync mode for defining threadcount and resource utilization
- Maximum number of log files to retain
- Disk type for copy thrashing prevention
- Stale file removal of files no longer present in the source from the destination
- Number of runs before stale files are removed from stored cache and destination(if enabled)
**Note:** The order of entries in the config file does not matter. Sources, excludes, and options can appear in any sequence. Refer to Customization below for more info.
### Acceptable Values for Configuration Flags
```
Source = (Absolute Source Path of file or directory) [Local, UNC, POSIX and Mapped Paths]
Destination = (Absolute Destination Path)
Exclude = (Absolute Path of file or directory to be excluded)
Mode = (BG/Inter/GodSpeed)
DiskType = (SSD/HDD) (HDD for no copy thrashing [sequential writes per source], SSD for copy thrashing [parallel writes multiple sources])
StaleEntries = (integer value)
DeleteStaleFromDest = (YES/NO)
MaxLogFiles = (integer value)
```
#### Sample Configuration Files
Windows
```
Source = C:\Users\YourName\Documents
Source = C:\Users\YourName\Desktop\To-Do.txt
Source = D:\Projects
Destination = D:\Backup
Exclude = C:\Users\YourName\Documents\help.txt
Exclude = D:\Projects\Python
Mode = Inter
MaxLogFiles = 5
DiskType = HDD
DeleteStaleFromDest = NO
StaleEntries = 10
```
Linux
```
Source = /home/username/Documents
Source = /var/media
Source = /usr/include/zlib.h
Source = /mnt/c/users/YourName/Documents
Destination = /mnt/backup
Exclude = /mnt/c/users/YourName/Documents/help.txt
Mode = GodSpeed
MaxLogFiles = 10
DiskType = SSD
DeleteStaleFromDest = YES
StaleEntries = 2
```
**Source and Destination are Mandatory, Rest all are Optional**
#
### Usage
Simply run the FileSync executable. It automatically loads the configuration file named `Config.txt` located in the same directory as the binary.
```
FileSync.exe
```
```
./FileSync
```
#
### Customization Locations in Code
Below are the places where you can edit the following things:
- **Configuration File Location**
Modify the path and filename used for storing log files. Default is same directory as the binary and `Config.txt`.
ConfigGlobal.cpp `Line 21`
- **Sync Log File Location**
Modify the path and filename used for storing log files. Default is same directory as the binary and `Sync_Logs`.
ConfigGlobal.cpp `Line 22`
- **Metadata Cache File Location**
Modify the path and filename used for storing metadata cache files. Default is same directory as the binary and `Meta_Cache`.
ConfigGlobal.cpp `Line 23`
- **Sync Mode and Thread Count**
ConfigGlobal.cpp `Line 24`,`Line 25` - Modify the default value the tool runs in. Default is `BG` and `2`.
ConfigParser.cpp `Line 214` - Modify the number of threads defined for BG, Inter and GodSpeed. Defaults are 2, 4 and Hardware Max Supported Thread Count.
- **Disk Type Optimization**
Set the type of disk (HDD or SSD) to optimize copy behavior and avoid disk thrashing. Default is `HDD`.
ConfigGlobal.cpp `Line 26`
- **Stale File Removal Threshold**
Set how many consecutive sync runs a file must be missing from the source before it is considered stale [and deleted from the destination (if enabled)]. Default is `5`.
ConfigGlobal.cpp `Line 27`
- **Stale File Deletion from Destination**
Enables or disables removal of files from the destination if they’ve been missing from the source for a configurable number of syncs. Default is `NO`.
When DeleteStaleFromDest = YES, stale files (those no longer present in the source) are removed from the destination after a set number of runs (StaleEntries).
**Note:** This process does not delete empty directories that may remain after file deletion. This ensures directory structure integrity is preserved and user is expected to delete the directory.
ConfigGlobal.cpp `Line 28`
- **Max Log Files**
Sets the maximum number of log files to retain before older logs are purged. Helps manage disk space used by logs. Default is 10.
ConfigGlobal.cpp `Line 29`
- **Thread Count for Hasher**
Adjust the default number of threads used by the Hasher. Typically, this value is determined by the selected Mode, but you can modify it here if you want to specify a different number.
FileHasher.hpp `Line 15`
- **Flags for robocopy/dd commands**
Modify the default flags used by the robocopy/dd commands.
FileCopier.hpp `Line 77`,`Line 131` respectively
#
### License
You are free to use and modify this software for personal or internal purposes.
However, redistribution or public distribution of this software or any modified versions is **not permitted** without explicit permission.