An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with deduplication

A curated list of projects in awesome lists tagged with deduplication.

https://github.com/restic/restic

Fast, secure, efficient backup program

backup dedupe deduplication go restic secure-by-default

Last synced: 12 May 2025

https://github.com/kopia/kopia

Cross-platform backup tool for Windows, macOS & Linux with fast, incremental backups, client-side end-to-end encryption, compression and data deduplication. CLI and GUI included.

backup cloud deduplication encryption google-cloud-storage

Last synced: 12 May 2026

https://github.com/borgbackup/borg

Deduplicating archiver with compression and authenticated encryption.

backup borgbackup compression deduplication encryption python ssh

Last synced: 16 Mar 2026

https://github.com/hsoft/dupeguru

Find duplicate files

deduplication python

Last synced: 18 Dec 2025

https://github.com/arsenetar/dupeguru

Find duplicate files

deduplication python

Last synced: 08 Apr 2025

https://github.com/openvenues/libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.

address address-parser c deduping deduplication international machine-learning natural-language-processing nlp record-linkage

Last synced: 12 May 2025

https://github.com/rustic-rs/rustic

rustic - fast, encrypted, and deduplicated backups powered by Rust

backup deduplication encryption hacktoberfest restic rust

Last synced: 13 May 2025

https://github.com/mhx/dwarfs

A fast high compression read-only file system for Linux, Windows and macOS

archiving compression cpp deduplication dwarfs filesystem flac fuse fuse-filesystem linux lrzip lzma macfuse macos squashfs windows winfsp zpaq zstd

Last synced: 02 Apr 2026

https://github.com/sahib/rmlint

Extremely fast tool to remove duplicates and other lint from your filesystem

c deduplication duplicates fdupes filesystem lint python

Last synced: 14 May 2025

https://github.com/witten/borgmatic

Simple, configuration-driven backup software for servers and workstations

apprise backup borg borgbackup btrfs deduplication healthchecks loki lvm mariadb mongodb mysql ntfy postgresql python servers sqlite upitme-kuma zabbix zfs

Last synced: 02 May 2025

https://github.com/moj-analytical-services/splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends

data-matching data-science deduplicate-data deduplication duckdb em-algorithm entity-resolution fuzzy-matching record-linkage spark uk-gov-data-science

Last synced: 13 May 2025

https://github.com/karanhudia/borg-ui

Replace complex Borg Backup terminal commands with a beautiful web UI. Create, schedule, and restore backups with just a few clicks.

automation back borg borg-backup borgbackup borgbase deduplication docker raspber sbc self-hosted webapp

Last synced: 04 Mar 2026

https://github.com/dpc/rdedup

Data deduplication engine, supporting optional compression and public key encryption.

backup data-deduplication deduplication encryption

Last synced: 15 May 2025

https://yomguithereal.github.io/talisman/

Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.

clustering deduplication fuzzy-matching information-retrieval machine-learning natural-language-processing record-linkage

Last synced: 15 Nov 2025

https://github.com/yomguithereal/talisman

Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.

clustering deduplication fuzzy-matching information-retrieval machine-learning natural-language-processing record-linkage

Last synced: 14 Apr 2025

https://github.com/Yomguithereal/talisman

Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.

clustering deduplication fuzzy-matching information-retrieval machine-learning natural-language-processing record-linkage

Last synced: 15 Mar 2025

https://github.com/fcorbelli/zpaqfranz

Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix

backup compression deduplication solaris zpaq

Last synced: 12 Feb 2026

https://github.com/sreedevk/deduplicator

Filter, Sort & Delete Duplicate Files Recursively

deduplication duplicate-detection duplicate-files duplicatefilefinder filesystem rust

Last synced: 21 Jun 2025

https://github.com/netinvent/npbackup

A secure and efficient file backup solution that fits both system administrators (CLI) and end users (GUI)

backup cli compression deduplication gui healthcheck monitoring orchestrator prometheus-metrics restic vss

Last synced: 02 Apr 2026

https://github.com/cargo-limit/cargo-limit

Productivity improvements for Rust ecosystem: warnings are skipped until errors are fixed, LSP-independent Neovim integration, etc.

build cargo cargo-plugin cargo-wrapper cli crates deduplication filter limit neovim neovim-plugin nvim plugin productivity runner rust wrapper

Last synced: 05 Jan 2026

https://github.com/dm-vdo/kvdo

A kernel module which provides a pool of deduplicated and/or compressed block storage.

compression deduplication kernel-modules linux-kernel storage vdo

Last synced: 12 Apr 2025

https://github.com/Jaskey/RocketMQDedupListener

Idempotent, deduplicating RocketMQ message consumer. Supports MySQL or Redis as the idempotency table and works out of the box.

deduplication rocketmq rocketmq-client

Last synced: 03 May 2025

https://github.com/opensanctions/nomenklatura

Framework and command-line tools for integrating FollowTheMoney data streams from multiple sources

data-integration deduplication record-link

Last synced: 17 Mar 2026

https://github.com/dm-vdo/vdo

Userspace tools for managing VDO volumes.

compression deduplication storage vdo

Last synced: 04 Apr 2025

https://laktak.github.io/chkbit/

Check your files for data corruption and run quick file deduplication

backup bitrot-detection btrfs cloud-backup data-degradation data-integrity dedup dedupe deduper deduplication disk-check storage-media

Last synced: 03 Apr 2026

https://github.com/007revad/synology_enable_deduplication

Enable deduplication with non-Synology SSDs and unsupported NAS models

deduplication diskstation dsm rackstation synology synology-disk-station synology-dsm synology-nas

Last synced: 05 Apr 2025

https://github.com/yornaath/batshit

A batch manager that will deduplicate and batch requests for a certain data type made within a window. Useful for batching requests made from multiple React components that use react-query.

async batch-processing concurrency deduplication fetch react react-query tanstack typescript

Last synced: 04 Apr 2025

https://github.com/kdeldycke/mail-deduplicate

📧 CLI to deduplicate mails from mail boxes.

babyl cleanup cli dedupe deduplication email mail mailbox maildir mbox mh mmdf python

Last synced: 13 Dec 2025

https://github.com/vintasoftware/entity-embed

PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.

approximate-nearest-neighbors data-matching deduplication deep-learning embeddings entity-matching entity-resolution python pytorch record-linkage representation-learning

Last synced: 08 Oct 2025

https://github.com/siddhant-k-code/distill

Reliable LLM outputs start with clean context. Deterministic deduplication, compression, and caching for RAG pipelines.

ai-agents compression context-optimization deduplication deterministic developer-tools go golang llamaindex llm pinecone qdrant rag retrieval-augmented-generation vector-database

Last synced: 02 May 2026

https://github.com/deajan/backup-bench

Quick and dirty backup tool benchmark with reproducible results

backup benchmark benchmarking borgbackup bupstash compression deduplication duplicacy kopia restic

Last synced: 05 Apr 2025

https://github.com/nlfiedler/fastcdc-rs

FastCDC implementation in Rust

chunking-algorithm deduplication rust

Last synced: 04 Apr 2025

https://github.com/elemental-lf/benji

Benji Backup: A block based deduplicating backup software for Ceph RBD images, iSCSI targets, image files and block devices

b2 backup block-based ceph deduplication iscsi kubernetes lvm s3

Last synced: 06 Apr 2025

https://github.com/laktak/chkbit

Check your files for data corruption and run quick file deduplication

backup bitrot-detection btrfs cloud-backup data-degradation data-integrity dedup dedupe deduper deduplication disk-check storage-media

Last synced: 04 Apr 2025

https://github.com/zouzias/spark-lucenerdd

Spark RDD with Lucene's query and entity linkage capabilities

deduplication entity-linking hacktoberfest linkage lucene rdd record-linkage spark spatial-search

Last synced: 21 Jan 2026

https://github.com/opengene/gencore

Generate duplex/single consensus reads to reduce sequencing noise and remove duplications

bioinformatics consensus deduplication deep-sequencing duplex duplex-sequencing duplication ngs sequencing sequencing-error sequencing-noise somatic

Last synced: 20 Aug 2025

https://github.com/OpenGene/gencore

Generate duplex/single consensus reads to reduce sequencing noise and remove duplications

bioinformatics consensus deduplication deep-sequencing duplex duplex-sequencing duplication ngs sequencing sequencing-error sequencing-noise somatic

Last synced: 09 May 2025

https://github.com/jvirkki/dupd

CLI utility to find duplicate files

c deduplication duplicate-files duplicatefilefinder duplicates fdupes

Last synced: 21 Mar 2025

https://github.com/tsileo/blobstash

Your personal database. Mirror of https://git.sr.ht/~tsileo/blobstash

backup blob-store blobstash content-addressed deduplication document-store go storage

Last synced: 17 Mar 2025

https://github.com/unreadablewxy/fs-curator

Automation for the serious data hoarder that wants to have their data and use it

deduplication directory-tree file-renamer file-sorting hard-links organizer

Last synced: 29 Jul 2025

https://github.com/lostatc/acid-store

[UNMAINTAINED] A transactional and deduplicating virtual file system

acid deduplication encryption filesystem fuse rclone redis rust s3 sftp sqlite storage

Last synced: 16 Jul 2025

https://github.com/AI-team-UoA/pyJedAI

An open-source library that leverages Python’s data science ecosystem to build powerful end-to-end Entity Resolution workflows.

data-disambigation data-matching deduplication duplicate-detection entity-matching entity-resolution fuzzy-matching link-discovery machine-learning python

Last synced: 01 Mar 2026

https://github.com/openvenues/lieu

Dedupe/batch geocode addresses and venues around the world with libpostal

address deduplication geocoding international venues

Last synced: 20 Aug 2025

https://github.com/fritshermans/deduplipy

Python package for deduplication/entity resolution using active learning

deduplication entity-resolution fuzzy-matching record-linkage

Last synced: 19 Feb 2026

https://github.com/ronomon/deduplication

Fast multi-threaded content-dependent chunking deduplication for Buffers in C++ with a reference implementation in JavaScript. Ships with extensive tests, a fuzz test and a benchmark.

chunking content-dependent deduplication nodejs

Last synced: 17 Aug 2025

https://github.com/daniel-liu-c0deb0t/umicollapse

Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.

data-structures deduplication fastq hamming java string-search string-similarity umis unique-molecular-identifiers

Last synced: 13 Apr 2025

https://github.com/iscc/fastcdc-py

FastCDC implementation in Python https://pypi.org/project/fastcdc/

chunking chunking-algorithm content-dependent deduplication python

Last synced: 17 Feb 2026

https://github.com/PJDude/dude

Duplicates Detector is a cross-platform GUI utility for finding duplicate files, allowing you to delete or link them to save space. Duplicate files are displayed and processed on two synchronized panels for efficient and convenient operation.

cli deduplication duplicate duplicate-detection duplicate-files duplicates duplicates-removal easy easy-to-use easyui gui gui-application python python3 sha1 threads tkinter utility utility-application

Last synced: 06 Mar 2025

https://github.com/zen-logic/file-hunter

File Hunter — catalog, deduplicate, and consolidate your archive storage

deduplication duplicate-finder file-catalog file-manager homelab python self-hosted sqlite storage web-ui

Last synced: 24 Apr 2026

https://github.com/lobocv/simpleflow

Generic simple workflows and concurrency patterns

batching concurrency counter deduplication generics go golang timeseries worflows workerpool

Last synced: 23 Apr 2025

https://github.com/dssg/pgdedupe

A simple command line interface to the datamade/dedupe library.

data-cleaning database dedupe deduplication postgresql python record-linkage

Last synced: 21 Jan 2026

https://github.com/dupgit/sauvegarde

Continuous data protection for GNU/Linux (cdpfgl).

backup continuous-data-protection deduplication gnu-linux rest-api stateless

Last synced: 17 Dec 2025

https://github.com/maxkhim/laravel-storage-dedupler

Laravel Package Prevents File Duplication

deduplication file-storage laravel laravel-package

Last synced: 01 Mar 2026

https://github.com/ing-bank/spark-matcher

Record matching and entity resolution at scale in Spark

deduplication entity-resolution record-linkage spark

Last synced: 23 Jun 2025

https://github.com/donatj/imgdedup

CLI tool for image duplicate detection

deduplication image

Last synced: 14 Apr 2025

https://github.com/samber/go-singleflightx

🧬 x/sync/singleflight but with generics, batching, sharding and nullable result

cache channel concurrent deduplication generics go in-flight singleflight sync

Last synced: 22 Apr 2025

https://github.com/benzsevern/goldenmatch

Entity resolution toolkit — deduplicate, match, and create golden records. 27 MCP tools on Smithery. Zero-config. 97.2% F1.

a2a agent data-engineering data-quality dbt deduplication entity-resolution fellegi-sunter fuzzy-matching golden-record golden-suite llm mcp-server polars pprl privacy-preserving python record-linkage record-matching remote-mcp

Last synced: 13 May 2026

https://github.com/shivam5992/dupandas

📊 Python package for performing deduplication using flexible text matching and cleaning in a pandas DataFrame

deduplication flexible-matching pandas python text-cleaner

Last synced: 02 Jul 2025

https://github.com/sergey-dryabzhinsky/dedupsqlfs

Deduplicating filesystem via Python3, FUSE and SQLite

backup compression deduplication fuse python python3 storage

Last synced: 14 Oct 2025

https://github.com/inexplicablemagic/photodedupe

A utility for locating near duplicate photos irrespective of image resolution, compression settings or file format.

computer-vision computer-vision-tools deduplication duplicate-detection image-deduplication

Last synced: 11 Jan 2026

https://github.com/immobiliare/ufoid

Ultra Fast Optimized Image Deduplication.

automation computer-vision deduplication images immobiliare-labs python

Last synced: 23 Apr 2025

https://github.com/davidsvy/Neural-Scam-Artist

Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.

dataset deduplication fine-tuning fraud gpt2 huggingface lsh minhash nlp pytorch readability scam transformer web-scraping

Last synced: 13 Jul 2025

https://github.com/j535d165/febrl-fork-v0.4.2

Fork of the Freely Extensible Biomedical Record Linkage program

deduplication entity-resolution matching python-library record-linkage

Last synced: 15 Oct 2025

https://github.com/erofs/docs

EROFS documentation repo for https://erofs.docs.kernel.org

compression containers deduplication deflate documentation erofs filesystems linux-kernel lz4 lzma zstd

Last synced: 04 Apr 2026

https://github.com/indyjo/cafs

Content-Addressable File System (used by BitWrk)

chunking deduplication download http rolling-hash synchronization upload

Last synced: 27 Dec 2025

https://github.com/InexplicableMagic/photodedupe

A utility for locating near duplicate photos irrespective of image resolution, compression settings or file format.

computer-vision computer-vision-tools deduplication duplicate-detection image-deduplication

Last synced: 07 Apr 2025

https://github.com/lkarlslund/stringdedup

String deduplication package for Go

dedup deduplication golang string xxhash

Last synced: 22 Apr 2025

https://github.com/ragibson/sms-mms-deduplication

Tool to remove duplicate text messages (SMS/MMS/RCS). RCS support is available for some clients.

deduplication mms rcs sms text-message

Last synced: 22 Apr 2025

https://github.com/nebucatnetzer/borg-qt

A Qt frontend for the command line software BorgBackup.

backup borg borgbackup borgbackup-gui deduplication gplv3 pyqt5 python3 qt5

Last synced: 03 Oct 2025

https://github.com/veqryn/slog-dedup

Golang structured logging (slog) deduplication and sorting for use with json logging

dedup deduplication golang golang-library json logging slog structured-logging

Last synced: 12 Aug 2025

https://github.com/marcnuth/deduplication

Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.

algorithms cv deduplication google imagehash shingling simhash

Last synced: 14 May 2025

https://github.com/ncn-foreigners/blocking

An R package for blocking records for record linkage / data deduplication based on approximate nearest neighbours algorithms.

annoy approximate-nearest-neighbor-search deduplication entity-resolution hnsw igraph record-linkage

Last synced: 21 Feb 2026

https://github.com/semanticarts/rdfhash

RDF Graph Compression Tool. Hash RDF subjects based on a checksum of their triples, effectively consolidating subjects that contain identical definitions. Reduce time taken to mint URIs. Use blank nodes to your advantage.

compression deduplication hashing md5 rdf sha256

Last synced: 05 Feb 2026

https://github.com/nickcrews/mismo

The SQL/Ibis powered sklearn of record linkage

deduplication duckdb entity-resolution ibis python record-linkage sql

Last synced: 16 Mar 2026

https://github.com/deadsoul/dugu

Find, remove and avoid duplicates with dugu: The Duplicates Guru

deduplication dugu duplicate-detection duplicate-files duplicatefilefinder duplicates duplicates-guru python

Last synced: 05 Apr 2025

https://github.com/juntaki/bucketsync

S3 backed FUSE Filesystem written in Go with dedup and encryption.

deduplication filesystem fuse golang s3

Last synced: 14 Apr 2025

https://github.com/gamemann/linux-btrfs-lab

A small lab using Ubuntu 23.04 with the BTRFS file system to test the deduplication feature.

23-04 btrfs dd deduplication disk disk-space documentation duperemove filesystem hard-drive kvm lab linux qemu save-space ssd ubuntu vm

Last synced: 26 Oct 2025