Projects in Awesome Lists tagged with deduplication
A curated list of projects in awesome lists tagged with deduplication.
https://github.com/restic/restic
Fast, secure, efficient backup program
backup dedupe deduplication go restic secure-by-default
Last synced: 23 Apr 2025
https://github.com/borgbackup/borg
Deduplicating archiver with compression and authenticated encryption.
backup borgbackup compression deduplication encryption python ssh
Last synced: 18 Apr 2025
https://github.com/kopia/kopia
Cross-platform backup tool for Windows, macOS & Linux with fast, incremental backups, client-side end-to-end encryption, compression and data deduplication. CLI and GUI included.
backup cloud deduplication encryption google-cloud-storage hacktoberfest
Last synced: 23 Apr 2025
https://github.com/prometheus/alertmanager
Prometheus Alertmanager
alertmanager deduplication email hacktoberfest monitoring notifications opsgenie pagerduty slack
Last synced: 23 Apr 2025
https://github.com/openvenues/libpostal
A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
address address-parser c deduping deduplication international machine-learning natural-language-processing nlp record-linkage
Last synced: 23 Apr 2025
https://github.com/rustic-rs/rustic
rustic - fast, encrypted, and deduplicated backups powered by Rust
backup deduplication encryption hacktoberfest restic rust
Last synced: 23 Apr 2025
https://github.com/mhx/dwarfs
A fast high compression read-only file system for Linux, Windows and macOS
archiving compression cpp deduplication dwarfs filesystem flac fuse fuse-filesystem gpl-license linux lrzip lzma macfuse macos squashfs windows winfsp zpaq zstd
Last synced: 10 Apr 2025
https://github.com/sahib/rmlint
Extremely fast tool to remove duplicates and other lint from your filesystem
c deduplication duplicates fdupes filesystem lint python
Last synced: 10 Apr 2025
https://github.com/borgmatic-collective/borgmatic
Simple, configuration-driven backup software for servers and workstations
apprise backup borg borgbackup btrfs deduplication healthchecks loki lvm mariadb mongodb mysql ntfy postgresql python servers sqlite upitme-kuma zabbix zfs
Last synced: 23 Apr 2025
https://github.com/moj-analytical-services/splink
Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
data-matching data-science deduplicate-data deduplication duckdb em-algorithm entity-resolution fuzzy-matching record-linkage spark uk-gov-data-science
Last synced: 23 Apr 2025
https://github.com/cupcakearmy/autorestic
Config-driven, easy backup CLI for restic.
backup cli config config-driven deduplication incremental incremental-backup pruning restic
Last synced: 11 Apr 2025
https://github.com/zinggai/zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
analytics analytics-engineering data-science data-transformation data-transformations dataengineering datalake dataquality dedupe deduplication entity-resolution etl fuzzy-matching fuzzymatch identity identity-resolution masterdata ml modern-data-stack spark
Last synced: 10 Apr 2025
https://github.com/j535d165/recordlinkage
A powerful and modular toolkit for record linkage and duplicate detection in Python
data-matching dedupe deduplication entity-resolution machine-learning privacy python python-library record-linkage similarity string-distance utrecht-university
Last synced: 13 Apr 2025
https://github.com/nvidia/nemo-curator
Scalable data preprocessing and curation toolkit for LLMs
data data-curation data-prep data-preparation data-processing data-processing-pipelines data-quality datacuration datarecipes deduplication fast-data-processing fine-tuning large-language-models large-scale-data-processing llm llm-data-quality llmapps python semantic-deduplication
Last synced: 13 Apr 2025
https://github.com/dpc/rdedup
Data deduplication engine, supporting optional compression and public key encryption.
backup data-deduplication deduplication encryption
Last synced: 14 Apr 2025
https://github.com/yomguithereal/talisman
Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.
clustering deduplication fuzzy-matching information-retrieval machine-learning natural-language-processing record-linkage
Last synced: 14 Apr 2025
https://github.com/IBM/data-prep-kit
Open source project for data preparation of LLM application builders
code-quality data data-prep data-preparation data-preprocessing data-preprocessing-pipelines datacuration datarecipes deduplication finetuning large-language-models large-scale-data-processing llm llmapps malware python ray spark
Last synced: 11 Jan 2025
https://github.com/sreedevk/deduplicator
Filter, Sort & Delete Duplicate Files Recursively
deduplication duplicate-detection duplicate-files duplicatefilefinder filesystem rust
Last synced: 07 Apr 2025
https://github.com/cargo-limit/cargo-limit
Productivity improvements for Rust ecosystem: warnings are skipped until errors are fixed, LSP-independent Neovim integration, etc.
build cargo cargo-plugin cargo-wrapper cli crates deduplication filter limit neovim neovim-plugin nvim plugin productivity runner rust wrapper
Last synced: 07 Apr 2025
https://github.com/dm-vdo/kvdo
A kernel module that provides a pool of deduplicated and/or compressed block storage.
compression deduplication kernel-modules linux-kernel storage vdo
Last synced: 12 Apr 2025
https://github.com/Jaskey/RocketMQDedupListener
Idempotent, deduplicating message consumer for RocketMQ. Supports MySQL or Redis as the idempotency table and works out of the box.
deduplication rocketmq rocketmq-client
Last synced: 12 Nov 2024
https://github.com/netinvent/npbackup
A secure and efficient file backup solution that fits both system administrators (CLI) and end users (GUI)
backup cli compression deduplication gui healthcheck monitoring orchestrator prometheus-metrics restic vss
Last synced: 09 Apr 2025
https://github.com/dm-vdo/vdo
Userspace tools for managing VDO volumes.
compression deduplication storage vdo
Last synced: 04 Apr 2025
https://github.com/opensanctions/nomenklatura
Framework and command-line tools for integrating FollowTheMoney data streams from multiple sources
data-integration deduplication record-link
Last synced: 03 Apr 2025
https://github.com/007revad/synology_enable_deduplication
Enable deduplication with non-Synology SSDs and unsupported NAS models
deduplication diskstation dsm rackstation synology synology-disk-station synology-dsm synology-nas
Last synced: 05 Apr 2025
https://github.com/yornaath/batshit
A batch manager that deduplicates and batches requests for a given data type made within a time window. Useful for batching requests made from multiple React components that use react-query.
async batch-processing concurrency deduplication fetch react react-query tanstack typescript
Last synced: 04 Apr 2025
https://github.com/F483/dejavu
Quickly detect already witnessed data.
command-line command-line-tool deduplication duplicate-values duplicates go golang history memory probabilistic
Last synced: 30 Mar 2025
https://github.com/vintasoftware/entity-embed
PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.
approximate-nearest-neighbors data-matching deduplication deep-learning embeddings entity-matching entity-resolution python pytorch record-linkage representation-learning
Last synced: 09 Apr 2025
https://github.com/markusressel/py-image-dedup
CLI utility to find near duplicate images and remove all but the best copy.
dedup deduplication duplicate-detection duplicate-images find-duplicates hacktoberfest image-analysis image-comparison python python-3 python3
Last synced: 07 Apr 2025
https://github.com/deajan/backup-bench
Quick and dirty backup tool benchmark with reproducible results
backup benchmark benchmarking borgbackup bupstash compression deduplication duplicacy kopia restic
Last synced: 05 Apr 2025
https://github.com/nlfiedler/fastcdc-rs
FastCDC implementation in Rust
chunking-algorithm deduplication rust
Last synced: 04 Apr 2025
https://github.com/elemental-lf/benji
Benji Backup: A block based deduplicating backup software for Ceph RBD images, iSCSI targets, image files and block devices
b2 backup block-based ceph deduplication iscsi kubernetes lvm s3
Last synced: 06 Apr 2025
https://github.com/laktak/chkbit
Check your files for data corruption and run quick file deduplication
backup bitrot-detection btrfs cloud-backup data-degradation data-integrity dedup dedupe deduper deduplication disk-check storage-media
Last synced: 04 Apr 2025
https://github.com/opengene/gencore
Generate duplex/single consensus reads to reduce sequencing noise and remove duplicates
bioinformatics consensus deduplication deep-sequencing duplex duplex-sequencing duplication ngs sequencing sequencing-error sequencing-noise somatic
Last synced: 10 Apr 2025
https://github.com/jvirkki/dupd
CLI utility to find duplicate files
c deduplication duplicate-files duplicatefilefinder duplicates fdupes
Last synced: 21 Mar 2025
https://github.com/tsileo/blobstash
Your personal database. Mirror of https://git.sr.ht/~tsileo/blobstash
backup blob-store blobstash content-addressed deduplication document-store go storage
Last synced: 17 Mar 2025
https://github.com/unreadablewxy/fs-curator
Automation for the serious data hoarder that wants to have their data and use it
deduplication directory-tree file-renamer file-sorting hard-links organizer
Last synced: 04 Dec 2024
https://github.com/lostatc/acid-store
[UNMAINTAINED] A transactional and deduplicating virtual file system
acid deduplication encryption filesystem fuse rclone redis rust s3 sftp sqlite storage
Last synced: 24 Nov 2024
https://github.com/openvenues/lieu
Dedupe/batch geocode addresses and venues around the world with libpostal
address deduplication geocoding international venues
Last synced: 19 Dec 2024
https://github.com/daniel-liu-c0deb0t/umicollapse
Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.
data-structures deduplication fastq hamming java string-search string-similarity umis unique-molecular-identifiers
Last synced: 13 Apr 2025
https://github.com/ronomon/deduplication
Fast multi-threaded content-dependent chunking deduplication for Buffers in C++ with a reference implementation in JavaScript. Ships with extensive tests, a fuzz test and a benchmark.
chunking content-dependent deduplication nodejs
Last synced: 17 Dec 2024
https://github.com/hexhive/igor
cluster crash deduplication fuzzing grouping security similarity trace
Last synced: 12 Nov 2024
https://github.com/PJDude/dude
Duplicates Detector is a cross-platform GUI utility for finding duplicate files, allowing you to delete or link them to save space. Duplicate files are displayed and processed on two synchronized panels for efficient and convenient operation.
cli deduplication duplicate duplicate-detection duplicate-files duplicates duplicates-removal easy easy-to-use easyui gui gui-application python python3 sha1 threads tkinter utility utility-application
Last synced: 06 Mar 2025
https://github.com/jRimbault/yadf
Yet Another Dupes Finder
dedupe deduplication dupes-finder duplicate-detection fdupes file-deduplication
Last synced: 06 Mar 2025
https://github.com/lobocv/simpleflow
Generic simple workflows and concurrency patterns
batching concurrency counter deduplication generics go golang timeseries worflows workerpool
Last synced: 23 Apr 2025
https://github.com/j535d165/recordlinkage-annotator
A browser user interface for manual labeling of record pairs.
annotation-tool data-matching deduplication entity-resolution labeling-tool machine-learning record-linkage
Last synced: 22 Nov 2024
https://github.com/jchristn/watsondedupe
Self-contained C# library for data deduplication using SQLite
chunk chunk-data chunk-key compress compression data-deduplication dedupe deduplication duplicate-data nuget sqlite-database storage
Last synced: 24 Apr 2025
https://github.com/ing-bank/spark-matcher
Record matching and entity resolution at scale in Spark
deduplication entity-resolution record-linkage spark
Last synced: 14 Apr 2025
https://github.com/noahgift/rdedupe
A Rust based deduplication tool
clap command-line deduplication filesystem multithreading rust rust-lang
Last synced: 23 Mar 2025
https://github.com/samber/go-singleflightx
🧬 x/sync/singleflight but with generics, batching, sharding and nullable result
cache channel concurrent deduplication generics go in-flight singleflight sync
Last synced: 22 Apr 2025
https://github.com/sergey-dryabzhinsky/dedupsqlfs
Deduplicating filesystem via Python3, FUSE and SQLite
backup compression deduplication fuse python python3 storage
Last synced: 31 Jan 2025
https://github.com/shivam5992/dupandas
Python package for performing deduplication using flexible text matching and cleaning in pandas DataFrames
deduplication flexible-matching pandas python text-cleaner
Last synced: 14 Apr 2025
https://github.com/immobiliare/ufoid
Ultra Fast Optimized Image Deduplication.
automation computer-vision deduplication images immobiliare-labs python
Last synced: 23 Apr 2025
https://github.com/davidsvy/Neural-Scam-Artist
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
dataset deduplication fine-tuning fraud gpt2 huggingface lsh minhash nlp pytorch readability scam transformer web-scraping
Last synced: 22 Nov 2024
https://github.com/j535d165/febrl-fork-v0.4.2
Fork of the Freely Extensible Biomedical Record Linkage program
deduplication entity-resolution matching python-library record-linkage
Last synced: 22 Nov 2024
https://github.com/bakdata/dedupe
Java DSL for (online) deduplication
data-cleaning data-cleansing deduplication duplicate-detection duplicate-removal
Last synced: 10 Apr 2025
https://github.com/InexplicableMagic/photodedupe
A utility for locating near duplicate photos irrespective of image resolution, compression settings or file format.
computer-vision computer-vision-tools deduplication duplicate-detection image-deduplication
Last synced: 07 Apr 2025
https://github.com/vmchale/phash
Perceptual hashing command-line tool
command-line-tool deduplication duplication-detection duplication-finder haskell perceptual-hash phash
Last synced: 26 Apr 2025
https://github.com/lkarlslund/stringdedup
String deduplication package for Go
dedup deduplication golang string xxhash
Last synced: 22 Apr 2025
https://github.com/nebucatnetzer/borg-qt
A Qt frontend for the command line software BorgBackup.
backup borg borgbackup borgbackup-gui deduplication gplv3 pyqt5 python3 qt5
Last synced: 22 Jan 2025
https://github.com/marcnuth/deduplication
Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.
algorithms cv deduplication google imagehash shingling simhash
Last synced: 19 Nov 2024
https://github.com/ragibson/sms-mms-deduplication
Tool to remove duplicate text messages (SMS/MMS/RCS). RCS support is available for some clients.
deduplication mms rcs sms text-message
Last synced: 22 Apr 2025
https://github.com/nickcrews/mismo
The SQL/Ibis powered sklearn of record linkage
deduplication duckdb entity-resolution ibis python record-linkage sql
Last synced: 18 Nov 2024
https://github.com/deadsoul/dugu
Find, remove and avoid duplicates with dugu: The Duplicates Guru
deduplication dugu duplicate-detection duplicate-files duplicatefilefinder duplicates duplicates-guru python
Last synced: 05 Apr 2025
https://github.com/juntaki/bucketsync
S3 backed FUSE Filesystem written in Go with dedup and encryption.
deduplication filesystem fuse golang s3
Last synced: 14 Apr 2025
https://github.com/junkurihara/rust-gd
An Implementation of Generalized Deduplication, written in Rust
deduplication error-correcting-codes generalized-deduplication hamming-codes reed-solomon-codes rust
Last synced: 10 Apr 2025
https://github.com/gamemann/linux-btrfs-lab
A small lab using Ubuntu 23.04 with the Btrfs file system to test its deduplication feature.
23-04 btrfs dd deduplication disk disk-space documentation duperemove filesystem hard-drive kvm lab linux qemu save-space ssd ubuntu vm
Last synced: 18 Mar 2025
https://github.com/glau-bd/duplicate-video-finder
A Python module to detect duplicate videos in a directory.
cleanup data-hoarder deduplication duplicate-detection python python-3 video-processing
Last synced: 21 Jan 2025
https://github.com/opengene/dedup
Deduplication for cfDNA sequencing data
bioinformatics ctdna deduplication liquid ngs
Last synced: 10 Apr 2025
https://github.com/infinisil/soph
Efficiently import pictures while handling duplicates gracefully
blockhash deduplication haskell perceptual-hashing pictures-organizer similarity-search
Last synced: 22 Mar 2025
https://github.com/gerald-lnj/duplicate-video-finder
A Python module to detect duplicate videos in a directory.
cleanup data-hoarder deduplication duplicate-detection python python-3 video-processing
Last synced: 20 Nov 2024
https://github.com/andrewdalpino/dataloader-php
A speed layer that enables query batching, de-duplication, and caching for efficient data fetching over any storage backend.
buffer cache dataloader deduplication graphql optimization php storage
Last synced: 10 Apr 2025
https://github.com/mk-fg/lafs-backup-tool
Tool to securely push incremental (think "rsync --link-dest") backups to tahoe-lafs
automation backup compression deduplication python tahoe-lafs twisted yaml
Last synced: 23 Apr 2025
https://github.com/dsacms/deduplifhir
Prototype for basic deduplication and aggregation of eCQM data
ai cmsoss-tier3 data-science deduplication electron government healthcare poetry python
Last synced: 13 Apr 2025
https://github.com/glehmann/hld
Hard Link Deduplicator
dedup deduplication hardlinks reflinks rust
Last synced: 16 Mar 2025
https://github.com/shaltielshmid/minhashsharp
A Robust Library in C# for Similarity Estimation
deduplication deduplication-filter lsh lsh-algorithm lsh-implementation minhash statistics
Last synced: 23 Apr 2025
https://github.com/checktor/face_amnesia
Face detection and retrieval in image and video files.
clustering deduplication face-detection face-recognition image-processing locality-sensitive-hashing nearest-neighbors video-processing
Last synced: 31 Mar 2025
https://github.com/arbal/brave-control
Control Brave Browser from the command line. List, close, deduplicate and bring focus to open tabs. Also includes Alfred workflow integration.
alfred alfred-workflow automation brave brave-browser browser cli command-line command-line-tool deduplication focus jxa tabs workflow
Last synced: 06 Apr 2025
https://github.com/yaroslaff/hashget
Deduplication/backup tool with extremely high 'compression' rate
archive backup compression deduplicate deduplication restic
Last synced: 13 Apr 2025
https://github.com/b0ch3nski/backup-toolkit
Collection of scripts for various backup scenarios.
backup bup compression deduplication logical-volumes lvm recovery restore snapshot
Last synced: 06 Apr 2025
https://github.com/pastelsky/throttle-queue
A promise based priority queue with task deduplication, concurrency control, serial resolution and aging
concurrency deduplication promises queue
Last synced: 11 Nov 2024
https://github.com/cybershadow/ripfs
Simple deduplicating userspace filesystem for recordings of Internet radio stations.
deduplication fuse-filesystem internet-radio
Last synced: 17 Mar 2025
https://github.com/samhirtarif/helper-methods-js
A repo that contains helper methods for common and not-so-common use cases
async dedupe deduplication deepcopy indexesof isasync
Last synced: 08 Mar 2025
https://github.com/dobraczka/klinker
🧱 blocking methods for entity resolution
blocking data-integration deduplication entity-alignment entity-resolution link-discovery record-linkage
Last synced: 19 Apr 2025
https://github.com/innovatrics/dedubcheck
dedubcheck - De-Duplicate Dependency Checker for Node.js monorepos
deduplication duplicates duplicity javascript nodejs nodejs-modules
Last synced: 13 Apr 2025
https://github.com/atomic-state/http-react
React hooks for data fetching
axios deduplication fetch fetch-api gql graphql hook hooks http javascript react react-hooks requests ssr suspense swr
Last synced: 14 Apr 2025
https://github.com/naiquevin/dupenukem
A command line file deduplication tool
Last synced: 11 Apr 2025
https://github.com/gblach/reflicate
Deduplicate data by creating reflinks between identical files.
btrfs deduplicate deduplication ocfs2 reflinks rust xfs
Last synced: 26 Mar 2025
https://github.com/brendon1555/panda-cx-deduplicator
A drop in replacement for the PandaCSS `cx` function with deduplication of atomic classes
classname css deduplication hacktoberfest pandacss styling
Last synced: 13 Feb 2025
https://github.com/fgregg/smered
Mirror of https://bitbucket.org/resteorts/smered
deduplication entity-resolution record-linkage
Last synced: 14 Apr 2025
https://github.com/aiursoftweb/nibot
A CLI tool that helps you de-duplicate images in a folder.
deduplication dotnet image-processing tool
Last synced: 13 Apr 2025
https://github.com/yybit/zchunk-rs
A pure Rust library for parsing and generating zchunk files
chunk compression deduplication sync
Last synced: 08 Apr 2025