An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with data-deduplication

A curated list of projects in awesome lists tagged with data-deduplication.

https://github.com/dpc/rdedup

Data deduplication engine, supporting optional compression and public key encryption.

backup data-deduplication deduplication encryption

Last synced: 15 May 2025

https://github.com/sail-sg/sailcraft

🚢 Data Toolkit for Sailor Language Models

data-cleaning data-deduplication

Last synced: 05 Oct 2025

https://github.com/zabuzard/fastcdc4j

Fast and efficient content-defined chunking for data deduplication. A Java implementation of FastCDC, provided as a library.

cdc chunking content-defined-chunking data-deduplication fastcdc java library

Last synced: 05 Mar 2026

https://github.com/gagan3012/polydedupe

PolyDeDupe: Multi-Lingual Data Deduplication

data-deduplication multilingual nlp

Last synced: 16 Mar 2025

https://github.com/fabriziosalmi/text-boundaries

A Python-based tool for preprocessing, cleaning, and analyzing text datasets, designed to filter, deduplicate, and sort data and to generate statistical insights.

data-automation data-deduplication data-preprocessing data-sorting data-statistics-generation data-validation dataset-boundaries dataset-cleaning machine-learning natural-language-processing text-data-analysis

Last synced: 07 Apr 2025

https://github.com/tracing-performance-labs/go-dedupe

Go library for deduplicating string data

data-deduplication go otel

Last synced: 10 Oct 2025

https://github.com/keerthanapalanikumar/data-cleaning-on-sql

This repository contains SQL scripts and documentation for cleaning and standardizing data in the NashvilleHousing table within the sqlproject2 database. The project aims to prepare the dataset for analysis by addressing inconsistencies, filling missing values, standardizing formats, and removing duplicates.

data-cleaning data-deduplication data-manipulation data-standardization database-management mssql ssms

Last synced: 27 Jan 2026