Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/r13i/spark-record-deduplicating
Data cleansing problem statement: records in a dataset are often duplicated. How do we estimate the probability that two records are duplicates? [Work In Progress]
big-data deduplication record-linkage records-management scala spark
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/r13i/spark-record-deduplicating
- Owner: r13i
- License: MIT
- Created: 2018-11-01T23:19:20.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2018-11-20T13:11:13.000Z (about 6 years ago)
- Last Synced: 2024-04-25T07:14:43.844Z (9 months ago)
- Topics: big-data, deduplication, record-linkage, records-management, scala, spark
- Language: Scala
- Homepage:
- Size: 65.4 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# spark-record-deduplicating
Data cleansing problem statement: records in a dataset are often duplicated. How do we estimate the probability that two records are duplicates?

## Dataset
We'll be using a dataset from the UC Irvine Machine Learning Repository. From the book (see the References section):
> The data set we’ll analyze was curated from a record linkage study performed
> at a German hospital in 2010, and it contains several million pairs of patient
> records that were matched according to several different criteria, such as the patient’s
> name (first and last), address, and birthday. Each matching field was assigned a
> numerical score from 0.0 to 1.0 based on how similar the strings were, and the data
> was then hand-labeled to identify which pairs represented the same person and
> which did not. The underlying values of the fields that were used to create the data set
> were removed to protect the privacy of the patients. Numerical identifiers, the match
> scores for the fields, and the label for each pair (match versus nonmatch) were published
> for use in record linkage research.

This data set is available at this URL: https://bit.ly/1Aoywaq
To download it, run this command: `$ curl -L -o data.zip https://bit.ly/1Aoywaq`
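To get a feel for the data once it's extracted (see the Gathering the Data steps below), the block files can be read straight into a Spark DataFrame. This is a minimal sketch rather than the project's actual code: it assumes the `data/linkage` layout produced by the steps below, a header row in each CSV, and `?` as the missing-value marker (which is how this dataset encodes absent similarity scores).

```scala
import org.apache.spark.sql.SparkSession

// Minimal loader for the extracted block files (a sketch, not the project's code).
object LoadLinkageData {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-record-deduplicating")
      .master("local[*]") // local run for exploration
      .getOrCreate()

    // Each block_*.csv holds candidate record pairs: two numeric IDs, per-field
    // similarity scores in [0.0, 1.0], and an is_match label.
    val parsed = spark.read
      .option("header", "true")      // first line names the columns
      .option("nullValue", "?")      // '?' marks a missing similarity score
      .option("inferSchema", "true") // scores -> double, label -> boolean
      .csv("data/linkage")

    parsed.printSchema()
    parsed.groupBy("is_match").count().show() // matches vs. non-matches

    spark.stop()
  }
}
```

Letting Spark infer the schema keeps the similarity scores as doubles and the `is_match` label as a boolean, which is convenient for any downstream scoring step.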
## How To
#### Build
- `$ git clone https://github.com/redouane-dev/spark-record-deduplicating.git`
- `$ cd spark-record-deduplicating`
- `$ ./gradlew build`

#### Run with the Data
###### Gathering the Data
- `$ mkdir -p data/linkage`
- `$ curl -L -o data/data.zip https://bit.ly/1Aoywaq`
- `$ unzip -d ./data ./data/data.zip`
- `$ unzip -d ./data/linkage './data/block_*.zip'`
- `$ rm -v ./data/block_*.zip` (optionally remove the .zip files)

###### Run the project
- `$ ./gradlew run`

## Follow-up Improvements
- Find how to plot the ROC (Receiver Operating Characteristic) curve in Scala (a possible starting point is sketched after this list)
- Try new sets of features (consider low-value or deprecated features for the scoring function)
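For the ROC item above, Spark's MLlib can already compute the curve's points, even if the plotting itself needs an external tool. The sketch below is a hypothetical starting point, not the project's method: it assumes the DataFrame produced by the loading sketch earlier, sums a hand-picked subset of similarity columns into a score (the column subset is illustrative only), and feeds (score, label) pairs to `BinaryClassificationMetrics`.

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, lit}

// Print (FPR, TPR) points of the ROC curve for a simple summed-score model.
// `parsed` is the DataFrame read from data/linkage (see the loading sketch above).
def printRoc(parsed: DataFrame): Unit = {
  // Illustrative feature subset; missing scores count as 0.0.
  val features = Seq("cmp_lname_c1", "cmp_plz", "cmp_by", "cmp_bd", "cmp_bm")
  val scored = parsed.select(
    features.map(c => coalesce(col(c), lit(0.0))).reduce(_ + _).as("score"),
    col("is_match").cast("double").as("label")
  )

  val metrics = new BinaryClassificationMetrics(
    scored.rdd.map(row => (row.getDouble(0), row.getDouble(1)))
  )

  // Each point is (false positive rate, true positive rate) at one threshold;
  // the points can be exported and plotted with any charting tool.
  metrics.roc().collect().foreach { case (fpr, tpr) => println(f"$fpr%.4f\t$tpr%.4f") }
  println(s"Area under ROC = ${metrics.areaUnderROC()}")
}
```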
- "Advanced Analytics with Spark by Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills (O’Reilly). Copyright 2015 Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills, 978-1-491-91276-8."