Projects in Awesome Lists tagged with deduplication
A curated list of projects in awesome lists tagged with deduplication.
https://github.com/restic/restic
Fast, secure, efficient backup program
backup dedupe deduplication go restic secure-by-default
Last synced: 12 May 2025
https://github.com/kopia/kopia
Cross-platform backup tool for Windows, macOS & Linux with fast, incremental backups, client-side end-to-end encryption, compression and data deduplication. CLI and GUI included.
backup cloud deduplication encryption google-cloud-storage
Last synced: 12 May 2026
https://github.com/borgbackup/borg
Deduplicating archiver with compression and authenticated encryption.
backup borgbackup compression deduplication encryption python ssh
Last synced: 16 Mar 2026
https://github.com/prometheus/alertmanager
Prometheus Alertmanager
alertmanager deduplication email hacktoberfest monitoring notifications opsgenie pagerduty slack
Last synced: 12 May 2025
https://github.com/openvenues/libpostal
A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
address address-parser c deduping deduplication international machine-learning natural-language-processing nlp record-linkage
Last synced: 12 May 2025
https://github.com/rustic-rs/rustic
rustic - fast, encrypted, and deduplicated backups powered by Rust
backup deduplication encryption hacktoberfest restic rust
Last synced: 13 May 2025
https://github.com/mhx/dwarfs
A fast high compression read-only file system for Linux, Windows and macOS
archiving compression cpp deduplication dwarfs filesystem flac fuse fuse-filesystem linux lrzip lzma macfuse macos squashfs windows winfsp zpaq zstd
Last synced: 02 Apr 2026
https://github.com/borgmatic-collective/borgmatic
Simple, configuration-driven backup software for servers and workstations
apprise backup borg borgbackup btrfs deduplication healthchecks loki lvm mariadb mongodb mysql ntfy postgresql python servers sqlite upitme-kuma zabbix zfs
Last synced: 06 Feb 2026
https://github.com/sahib/rmlint
Extremely fast tool to remove duplicates and other lint from your filesystem
c deduplication duplicates fdupes filesystem lint python
Last synced: 14 May 2025
https://github.com/witten/borgmatic
Simple, configuration-driven backup software for servers and workstations
apprise backup borg borgbackup btrfs deduplication healthchecks loki lvm mariadb mongodb mysql ntfy postgresql python servers sqlite upitme-kuma zabbix zfs
Last synced: 02 May 2025
https://github.com/moj-analytical-services/splink
Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
data-matching data-science deduplicate-data deduplication duckdb em-algorithm entity-resolution fuzzy-matching record-linkage spark uk-gov-data-science
Last synced: 13 May 2025
https://github.com/cupcakearmy/autorestic
Config driven, easy backup cli for restic.
backup cli config config-driven deduplication incremental incremental-backup pruning restic
Last synced: 14 May 2025
https://github.com/zinggAI/zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
analytics cdp customer-data-platform data-science databricks dataengineering datalake dataquality dedupe deduplication entity-resolution fuzzy-matching fuzzymatch identity-resolution master-data-management masterdata mdm ml snowflake spark
Last synced: 16 Nov 2025
https://github.com/NVIDIA/NeMo-Curator
Scalable data pre processing and curation toolkit for LLMs
data data-curation data-prep data-preparation data-processing data-processing-pipelines data-quality datacuration datarecipes deduplication fast-data-processing fine-tuning large-language-models large-scale-data-processing llm llm-data-quality llmapps python semantic-deduplication
Last synced: 29 Jul 2025
https://github.com/zinggai/zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
analytics analytics-engineering data-science data-transformation data-transformations dataengineering datalake dataquality dedupe deduplication entity-resolution etl fuzzy-matching fuzzymatch identity identity-resolution masterdata ml modern-data-stack spark
Last synced: 14 May 2025
https://github.com/j535d165/recordlinkage
A powerful and modular toolkit for record linkage and duplicate detection in Python
data-matching dedupe deduplication entity-resolution machine-learning privacy python python-library record-linkage similarity string-distance utrecht-university
Last synced: 14 May 2025
https://github.com/J535D165/recordlinkage
A powerful and modular toolkit for record linkage and duplicate detection in Python
data-matching dedupe deduplication entity-resolution machine-learning privacy python python-library record-linkage similarity string-distance utrecht-university
Last synced: 26 Mar 2025
https://github.com/karanhudia/borg-ui
Replace complex Borg Backup terminal commands with a beautiful web UI. Create, schedule, and restore backups with just a few clicks.
automation back borg borg-backup borgbackup borgbase deduplication docker raspber sbc self-hosted webapp
Last synced: 04 Mar 2026
https://github.com/data-prep-kit/data-prep-kit
Open source project for data preparation for GenAI applications
code-quality data data-prep data-preparation data-preprocessing data-preprocessing-pipelines datacuration datarecipes deduplication finetuning large-language-models large-scale-data-processing llm llmapps malware python ray spark
Last synced: 11 Feb 2026
https://github.com/dpc/rdedup
Data deduplication engine, supporting optional compression and public key encryption.
backup data-deduplication deduplication encryption
Last synced: 15 May 2025
https://yomguithereal.github.io/talisman/
Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.
clustering deduplication fuzzy-matching information-retrieval machine-learning natural-language-processing record-linkage
Last synced: 15 Nov 2025
https://github.com/yomguithereal/talisman
Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.
clustering deduplication fuzzy-matching information-retrieval machine-learning natural-language-processing record-linkage
Last synced: 14 Apr 2025
https://github.com/Yomguithereal/talisman
Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.
clustering deduplication fuzzy-matching information-retrieval machine-learning natural-language-processing record-linkage
Last synced: 15 Mar 2025
https://github.com/NVIDIA-NeMo/Curator
Scalable data pre processing and curation toolkit for LLMs
data data-curation data-prep data-preparation data-processing data-processing-pipelines data-quality datacuration datarecipes deduplication fast-data-processing fine-tuning large-language-models large-scale-data-processing llm llm-data-quality llmapps python semantic-deduplication
Last synced: 20 Jul 2025
https://github.com/fcorbelli/zpaqfranz
Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix
backup compression deduplication solaris zpaq
Last synced: 12 Feb 2026
https://github.com/sreedevk/deduplicator
Filter, Sort & Delete Duplicate Files Recursively
deduplication duplicate-detection duplicate-files duplicatefilefinder filesystem rust
Last synced: 21 Jun 2025
https://github.com/netinvent/npbackup
A secure and efficient file backup solution that fits both system administrators (CLI) and end users (GUI)
backup cli compression deduplication gui healthcheck monitoring orchestrator prometheus-metrics restic vss
Last synced: 02 Apr 2026
https://github.com/cargo-limit/cargo-limit
Productivity improvements for Rust ecosystem: warnings are skipped until errors are fixed, LSP-independent Neovim integration, etc.
build cargo cargo-plugin cargo-wrapper cli crates deduplication filter limit neovim neovim-plugin nvim plugin productivity runner rust wrapper
Last synced: 05 Jan 2026
https://github.com/dm-vdo/kvdo
A kernel module which provides a pool of deduplicated and/or compressed block storage.
compression deduplication kernel-modules linux-kernel storage vdo
Last synced: 12 Apr 2025
https://github.com/Jaskey/RocketMQDedupListener
Idempotent, deduplicating message consumer for RocketMQ; supports MySQL or Redis as the idempotency table and works out of the box
deduplication rocketmq rocketmq-client
Last synced: 03 May 2025
https://github.com/opensanctions/nomenklatura
Framework and command-line tools for integrating FollowTheMoney data streams from multiple sources
data-integration deduplication record-link
Last synced: 17 Mar 2026
https://github.com/dm-vdo/vdo
Userspace tools for managing VDO volumes.
compression deduplication storage vdo
Last synced: 04 Apr 2025
https://laktak.github.io/chkbit/
Check your files for data corruption and run quick file deduplication
backup bitrot-detection btrfs cloud-backup data-degradation data-integrity dedup dedupe deduper deduplication disk-check storage-media
Last synced: 03 Apr 2026
https://github.com/007revad/synology_enable_deduplication
Enable deduplication with non-Synology SSDs and unsupported NAS models
deduplication diskstation dsm rackstation synology synology-disk-station synology-dsm synology-nas
Last synced: 05 Apr 2025
https://github.com/yornaath/batshit
A batch manager that will deduplicate and batch requests for a certain data type made within a window. Useful to batch requests made from multiple react components that use react-query
async batch-processing concurrency deduplication fetch react react-query tanstack typescript
Last synced: 04 Apr 2025
https://github.com/F483/dejavu
Quickly detect already witnessed data.
command-line command-line-tool deduplication duplicate-values duplicates go golang history memory probabilistic
Last synced: 30 Mar 2025
https://github.com/f483/dejavu
Quickly detect already witnessed data.
command-line command-line-tool deduplication duplicate-values duplicates go golang history memory probabilistic
Last synced: 20 Aug 2025
https://github.com/vintasoftware/entity-embed
PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.
approximate-nearest-neighbors data-matching deduplication deep-learning embeddings entity-matching entity-resolution python pytorch record-linkage representation-learning
Last synced: 08 Oct 2025
https://github.com/markusressel/py-image-dedup
CLI utility to find near duplicate images and remove all but the best copy.
dedup deduplication duplicate-detection duplicate-images find-duplicates hacktoberfest image-analysis image-comparison python python-3 python3
Last synced: 07 Apr 2025
https://github.com/siddhant-k-code/distill
Reliable LLM outputs start with clean context. Deterministic deduplication, compression, and caching for RAG pipelines.
ai-agents compression context-optimization deduplication deterministic developer-tools go golang llamaindex llm pinecone qdrant rag retrieval-augmented-generation vector-database
Last synced: 02 May 2026
https://github.com/deajan/backup-bench
Quick and dirty backup tool benchmark with reproducible results
backup benchmark benchmarking borgbackup bupstash compression deduplication duplicacy kopia restic
Last synced: 05 Apr 2025
https://github.com/nlfiedler/fastcdc-rs
FastCDC implementation in Rust
chunking-algorithm deduplication rust
Last synced: 04 Apr 2025
https://github.com/elemental-lf/benji
Benji Backup: A block based deduplicating backup software for Ceph RBD images, iSCSI targets, image files and block devices
b2 backup block-based ceph deduplication iscsi kubernetes lvm s3
Last synced: 06 Apr 2025
https://github.com/laktak/chkbit
Check your files for data corruption and run quick file deduplication
backup bitrot-detection btrfs cloud-backup data-degradation data-integrity dedup dedupe deduper deduplication disk-check storage-media
Last synced: 04 Apr 2025
https://github.com/zouzias/spark-lucenerdd
Spark RDD with Lucene's query and entity linkage capabilities
deduplication entity-linking hacktoberfest linkage lucene rdd record-linkage spark spatial-search
Last synced: 21 Jan 2026
https://github.com/opengene/gencore
Generate duplex/single consensus reads to reduce sequencing noise and remove duplications
bioinformatics consensus deduplication deep-sequencing duplex duplex-sequencing duplication ngs sequencing sequencing-error sequencing-noise somatic
Last synced: 20 Aug 2025
https://github.com/OpenGene/gencore
Generate duplex/single consensus reads to reduce sequencing noise and remove duplications
bioinformatics consensus deduplication deep-sequencing duplex duplex-sequencing duplication ngs sequencing sequencing-error sequencing-noise somatic
Last synced: 09 May 2025
https://github.com/usc-isi-i2/rltk
Record Linkage ToolKit (Find and link entities)
deduplication entity-resolution linkage record-linkage similarity similarity-metric string-similarity
Last synced: 06 Oct 2025
https://github.com/jvirkki/dupd
CLI utility to find duplicate files
c deduplication duplicate-files duplicatefilefinder duplicates fdupes
Last synced: 21 Mar 2025
https://github.com/tsileo/blobstash
Your personal database. Mirror of https://git.sr.ht/~tsileo/blobstash
backup blob-store blobstash content-addressed deduplication document-store go storage
Last synced: 17 Mar 2025
https://github.com/unreadablewxy/fs-curator
Automation for the serious data hoarder that wants to have their data and use it
deduplication directory-tree file-renamer file-sorting hard-links organizer
Last synced: 29 Jul 2025
https://github.com/lostatc/acid-store
[UNMAINTAINED] A transactional and deduplicating virtual file system
acid deduplication encryption filesystem fuse rclone redis rust s3 sftp sqlite storage
Last synced: 16 Jul 2025
https://github.com/AI-team-UoA/pyJedAI
An open-source library that leverages Python’s data science ecosystem to build powerful end-to-end Entity Resolution workflows.
data-disambigation data-matching deduplication duplicate-detection entity-matching entity-resolution fuzzy-matching link-discovery machine-learning python
Last synced: 01 Mar 2026
https://github.com/openvenues/lieu
Dedupe/batch geocode addresses and venues around the world with libpostal
address deduplication geocoding international venues
Last synced: 20 Aug 2025
https://github.com/fritshermans/deduplipy
Python package for deduplication/entity resolution using active learning
deduplication entity-resolution fuzzy-matching record-linkage
Last synced: 19 Feb 2026
https://github.com/ronomon/deduplication
Fast multi-threaded content-dependent chunking deduplication for Buffers in C++ with a reference implementation in JavaScript. Ships with extensive tests, a fuzz test and a benchmark.
chunking content-dependent deduplication nodejs
Last synced: 17 Aug 2025
https://github.com/daniel-liu-c0deb0t/umicollapse
Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.
data-structures deduplication fastq hamming java string-search string-similarity umis unique-molecular-identifiers
Last synced: 13 Apr 2025
https://github.com/hexhive/igor
cluster crash deduplication fuzzing grouping security similarity trace
Last synced: 03 May 2025
https://github.com/iscc/fastcdc-py
FastCDC implementation in Python https://pypi.org/project/fastcdc/
chunking chunking-algorithm content-dependent deduplication python
Last synced: 17 Feb 2026
https://github.com/PJDude/dude
Duplicates Detector is a cross-platform GUI utility for finding duplicate files, allowing you to delete or link them to save space. Duplicate files are displayed and processed on two synchronized panels for efficient and convenient operation.
cli deduplication duplicate duplicate-detection duplicate-files duplicates duplicates-removal easy easy-to-use easyui gui gui-application python python3 sha1 threads tkinter utility utility-application
Last synced: 06 Mar 2025
https://github.com/zen-logic/file-hunter
File Hunter — catalog, deduplicate, and consolidate your archive storage
deduplication duplicate-finder file-catalog file-manager homelab python self-hosted sqlite storage web-ui
Last synced: 24 Apr 2026
https://github.com/jRimbault/yadf
Yet Another Dupes Finder
dedupe deduplication dupes-finder duplicate-detection fdupes file-deduplication
Last synced: 06 Mar 2025
https://github.com/lobocv/simpleflow
Generic simple workflows and concurrency patterns
batching concurrency counter deduplication generics go golang timeseries worflows workerpool
Last synced: 23 Apr 2025
https://github.com/dssg/pgdedupe
A simple command line interface to the datamade/dedupe library.
data-cleaning database dedupe deduplication postgresql python record-linkage
Last synced: 21 Jan 2026
https://github.com/j535d165/recordlinkage-annotator
A browser user interface for manual labeling of record pairs.
annotation-tool data-matching deduplication entity-resolution labeling-tool machine-learning record-linkage
Last synced: 14 Jul 2025
https://github.com/OlivierBinette/er-evaluation
An End-to-End Evaluation Framework for Entity Resolution Systems
author-name-disambiguation data-science deduplication disambiguation duplicate-detection entity-resolution evaluation fuzzy-matching inventor-name-disambiguation matching ml-evaluation ml-testing record-linkage statistics
Last synced: 01 Mar 2026
https://github.com/dupgit/sauvegarde
Continuous data protection for GNU/Linux (cdpfgl).
backup continuous-data-protection deduplication gnu-linux rest-api stateless
Last synced: 17 Dec 2025
https://github.com/jchristn/watsondedupe
Self-contained C# library for data deduplication using Sqlite
chunk chunk-data chunk-key compress compression data-deduplication dedupe deduplication duplicate-data nuget sqlite-database storage
Last synced: 28 Feb 2026
https://github.com/maxkhim/laravel-storage-dedupler
Laravel Package Prevents File Duplication
deduplication file-storage laravel laravel-package
Last synced: 01 Mar 2026
https://github.com/ing-bank/spark-matcher
Record matching and entity resolution at scale in Spark
deduplication entity-resolution record-linkage spark
Last synced: 23 Jun 2025
https://github.com/noahgift/rdedupe
A Rust based deduplication tool
clap command-line deduplication filesystem multithreading rust rust-lang
Last synced: 05 Apr 2026
https://github.com/samber/go-singleflightx
🧬 x/sync/singleflight but with generics, batching, sharding and nullable result
cache channel concurrent deduplication generics go in-flight singleflight sync
Last synced: 22 Apr 2025
https://github.com/benzsevern/goldenmatch
Entity resolution toolkit — deduplicate, match, and create golden records. 27 MCP tools on Smithery. Zero-config. 97.2% F1.
a2a agent data-engineering data-quality dbt deduplication entity-resolution fellegi-sunter fuzzy-matching golden-record golden-suite llm mcp-server polars pprl privacy-preserving python record-linkage record-matching remote-mcp
Last synced: 13 May 2026
https://github.com/shivam5992/dupandas
📊 Python package for performing deduplication using flexible text matching and cleaning in pandas dataframe
deduplication flexible-matching pandas python text-cleaner
Last synced: 02 Jul 2025
https://github.com/sergey-dryabzhinsky/dedupsqlfs
Deduplicating filesystem via Python3, FUSE and SQLite
backup compression deduplication fuse python python3 storage
Last synced: 14 Oct 2025
https://github.com/inexplicablemagic/photodedupe
A utility for locating near duplicate photos irrespective of image resolution, compression settings or file format.
computer-vision computer-vision-tools deduplication duplicate-detection image-deduplication
Last synced: 11 Jan 2026
https://github.com/immobiliare/ufoid
Ultra Fast Optimized Image Deduplication.
automation computer-vision deduplication images immobiliare-labs python
Last synced: 23 Apr 2025
https://github.com/davidsvy/Neural-Scam-Artist
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
dataset deduplication fine-tuning fraud gpt2 huggingface lsh minhash nlp pytorch readability scam transformer web-scraping
Last synced: 13 Jul 2025
https://github.com/j535d165/febrl-fork-v0.4.2
Fork of the Freely Extensible Biomedical Record Linkage program
deduplication entity-resolution matching python-library record-linkage
Last synced: 15 Oct 2025
https://github.com/erofs/docs
EROFS documentation repo for https://erofs.docs.kernel.org
compression containers deduplication deflate documentation erofs filesystems linux-kernel lz4 lzma zstd
Last synced: 04 Apr 2026
https://github.com/bakdata/dedupe
Java DSL for (online) deduplication
data-cleaning data-cleansing deduplication duplicate-detection duplicate-removal
Last synced: 10 Apr 2025
https://github.com/indyjo/cafs
Content-Addressable File System (used by BitWrk)
chunking deduplication download http rolling-hash synchronization upload
Last synced: 27 Dec 2025
https://github.com/InexplicableMagic/photodedupe
A utility for locating near duplicate photos irrespective of image resolution, compression settings or file format.
computer-vision computer-vision-tools deduplication duplicate-detection image-deduplication
Last synced: 07 Apr 2025
https://github.com/vmchale/phash
Perceptual hashing command-line tool
command-line-tool deduplication duplication-detection duplication-finder haskell perceptual-hash phash
Last synced: 26 Apr 2025
https://github.com/lkarlslund/stringdedup
String deduplication package for Go
dedup deduplication golang string xxhash
Last synced: 22 Apr 2025
https://github.com/vyrti/quichash
Ultra fast hashing app for Linux, Mac, Windows, Freebsd
blake3 cross-platform deduplication freebsd hash linux macos md5 rust sha2 sha256 sha3 sha512 simd verification windows xxhash xxhash3
Last synced: 15 Jan 2026
https://github.com/ragibson/sms-mms-deduplication
Tool to remove duplicate text messages (SMS/MMS/RCS). RCS support is available for some clients.
deduplication mms rcs sms text-message
Last synced: 22 Apr 2025
https://github.com/nebucatnetzer/borg-qt
A Qt frontend for the command line software BorgBackup.
backup borg borgbackup borgbackup-gui deduplication gplv3 pyqt5 python3 qt5
Last synced: 03 Oct 2025
https://github.com/veqryn/slog-dedup
Golang structured logging (slog) deduplication and sorting for use with json logging
dedup deduplication golang golang-library json logging slog structured-logging
Last synced: 12 Aug 2025
https://github.com/marcnuth/deduplication
Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.
algorithms cv deduplication google imagehash shingling simhash
Last synced: 14 May 2025
https://github.com/ncn-foreigners/blocking
An R package for blocking records for record linkage / data deduplication based on approximate nearest neighbours algorithms.
annoy approximate-nearest-neighbor-search deduplication entity-resolution hnsw igraph record-linkage
Last synced: 21 Feb 2026
https://github.com/semanticarts/rdfhash
RDF Graph Compression Tool. Hash RDF subjects based on a checksum of their triples, effectively consolidating together subjects that contain identical definitions. Reduce time taken to mint URIs. Use Blank Nodes to your Advantage
compression deduplication hashing md5 rdf sha256
Last synced: 05 Feb 2026
https://github.com/nickcrews/mismo
The SQL/Ibis powered sklearn of record linkage
deduplication duckdb entity-resolution ibis python record-linkage sql
Last synced: 16 Mar 2026
https://github.com/deadsoul/dugu
Find, remove and avoid duplicates with dugu: The Duplicates Guru
deduplication dugu duplicate-detection duplicate-files duplicatefilefinder duplicates duplicates-guru python
Last synced: 05 Apr 2025
https://github.com/juntaki/bucketsync
S3 backed FUSE Filesystem written in Go with dedup and encryption.
deduplication filesystem fuse golang s3
Last synced: 14 Apr 2025
https://github.com/gamemann/linux-btrfs-lab
A small lab using Ubuntu 23.04 with the BTRFS file system to test deduplication feature.
23-04 btrfs dd deduplication disk disk-space documentation duperemove filesystem hard-drive kvm lab linux qemu save-space ssd ubuntu vm
Last synced: 26 Oct 2025