# Deduplicate

While working with real-life data I have often faced the issue of determining whether there are duplicates in the data. It is also quite common to ask whether new samples are actually new records or updates of records already in the dataset.

To tackle these issues I have created __deduplicate__, which is nothing more than a set of wrapper functions around [dplyr](https://github.com/hadley/dplyr)'s joins and [fuzzyjoin](https://github.com/dgrtwo/fuzzyjoin).

So far I have the following functions, sketched in the example after this list:

- __get_approx_dup_ids__: Get row numbers of duplicate records.
- __get_approx_dups__: Get all duplicates for manual inspection.
- __new_or_dup__: Determine whether a set of incoming records is new or duplicates of existing records.
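
A minimal usage sketch follows. The example data, the `id_cols` argument, and the exact signatures are assumptions based on the descriptions above; check the package documentation for the actual interface.

```r
library(deduplicate)

d <- data.frame(
  firstname = c("John", "Jon", "Mary"),
  lastname  = c("Smith", "Smith", "Jones"),
  city      = c("Bogota", "Bogota", "Lima")
)
id_cols <- c("firstname", "lastname", "city")

# Row numbers of approximate duplicates (signature assumed)
get_approx_dup_ids(d, id_cols)

# The candidate duplicate rows themselves, for manual inspection
get_approx_dups(d, id_cols)

# Flag incoming records as new or as duplicates of existing ones
new_records <- data.frame(firstname = "Jhon", lastname = "Smith", city = "Bogota")
new_or_dup(d, new_records, id_cols)
```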

All these functions use the naïve approach of building a single unique id out of multiple columns of the dataset, e.g. FIRSTNAME_LASTNAME_CITY.
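
To make the idea concrete, here is a sketch of that approach built directly on dplyr, tidyr, and fuzzyjoin. The `built_id` column name and the distance threshold are illustrative choices, not the package's internal code.

```r
library(dplyr)
library(tidyr)
library(fuzzyjoin)

d <- tibble(
  firstname = c("John", "Jon", "Mary"),
  lastname  = c("Smith", "Smith", "Jones"),
  city      = c("Bogota", "Bogota", "Lima")
)

# Build one id per row out of the id columns, e.g. JOHN_SMITH_BOGOTA
dids <- d %>%
  unite(built_id, firstname, lastname, city, sep = "_", remove = FALSE) %>%
  mutate(built_id = toupper(built_id), row = row_number())

# Fuzzy self-join on the built id: pairs within a small edit distance
# are candidate duplicates.
stringdist_inner_join(dids, dids, by = "built_id",
                      max_dist = 2, method = "osa") %>%
  filter(row.x < row.y)   # keep each candidate pair once, drop self-matches
```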

Planned features include:

- Use of per-column distance metrics rather than a single distance metric on the built ids (illustrated below).
- Automatic clustering of records.
- Automatic merging of records.
- Supervised merging via a Shiny app or an RStudio add-in.
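
None of this is implemented yet, but a purely illustrative sketch with the stringdist package shows why per-column distances are more flexible than one distance on the concatenated id: each column can then get its own threshold or weight.

```r
library(stringdist)

a <- c("John", "Smith", "Bogota")   # firstname, lastname, city of record 1
b <- c("Jon",  "Smyth", "Bogota")   # the same fields of record 2

# Naïve approach: a single edit distance on the concatenated id
stringdist(paste(a, collapse = "_"), paste(b, collapse = "_"), method = "osa")
#> [1] 2

# Per-column approach: one distance per field
stringdist(a, b, method = "osa")
#> [1] 1 1 0
```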

# Todos

- Fix the `custom_id` name conflict:

  ```r
  dids <- create_idcols(d, id_cols)
  add_approx_unique_id(dids, col = "custom_id")
  ```

- Support `exclusive_ids` for more than two ids.
- Use `mutate_all()` with `do()`.