Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/r13i/spark-record-deduplicating
Data cleansing problem statement: records in a dataset are often duplicated. How do we estimate the probability that two records are duplicates? [Work In Progress]
big-data deduplication record-linkage records-management scala spark
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/r13i/spark-record-deduplicating
- Owner: r13i
- License: MIT
- Created: 2018-11-01T23:19:20.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2018-11-20T13:11:13.000Z (about 6 years ago)
- Last Synced: 2024-04-25T07:14:43.844Z (9 months ago)
- Topics: big-data, deduplication, record-linkage, records-management, scala, spark
- Language: Scala
- Homepage:
- Size: 65.4 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# spark-record-deduplicating
Data cleansing problem statement: records in a dataset are often duplicated. How do we estimate the probability that two records are duplicates?

## Dataset
We'll be using a dataset from the UC Irvine Machine Learning Repository. From the book (see the References section):
> The data set we’ll analyze was curated from a record linkage study performed
> at a German hospital in 2010, and it contains several million pairs of patient
> records that were matched according to several different criteria, such as the patient’s
> name (first and last), address, and birthday. Each matching field was assigned a
> numerical score from 0.0 to 1.0 based on how similar the strings were, and the data
> was then hand-labeled to identify which pairs represented the same person and
> which did not. The underlying values of the fields that were used to create the data set
> were removed to protect the privacy of the patients. Numerical identifiers, the match
> scores for the fields, and the label for each pair (match versus nonmatch) were published
> for use in record linkage research.

This data set is available at this URL: https://bit.ly/1Aoywaq
To download it, run this command: `$ curl -L -o data.zip https://bit.ly/1Aoywaq`
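To get a feel for the data once it's extracted (see the Gathering the Data steps below), the block files can be read straight into a Spark DataFrame. This is a minimal sketch rather than the project's actual code: it assumes the `data/linkage` layout produced by the steps below, a header row in each CSV, and `?` as the missing-value marker (which is how this dataset encodes absent similarity scores).

```scala
import org.apache.spark.sql.SparkSession

// Minimal loader for the extracted block files (a sketch, not the project's code).
object LoadLinkageData {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-record-deduplicating")
      .master("local[*]") // local run for exploration
      .getOrCreate()

    // Each block_*.csv holds candidate record pairs: two numeric IDs, per-field
    // similarity scores in [0.0, 1.0], and an is_match label.
    val parsed = spark.read
      .option("header", "true")      // first line names the columns
      .option("nullValue", "?")      // '?' marks a missing similarity score
      .option("inferSchema", "true") // scores -> double, label -> boolean
      .csv("data/linkage")

    parsed.printSchema()
    parsed.groupBy("is_match").count().show() // matches vs. non-matches

    spark.stop()
  }
}
```

Letting Spark infer the schema keeps the similarity scores as doubles and the `is_match` label as a boolean, which is convenient for any downstream scoring step.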
## How To
#### Build
- `$ git clone https://github.com/redouane-dev/spark-record-deduplicating.git`
- `$ cd spark-record-deduplicating`
- `$ ./gradlew build`

#### Run with the Data
###### Gathering the Data
- `$ mkdir -p data/linkage`
- `$ curl -L -o data/data.zip https://bit.ly/1Aoywaq`
- `$ unzip -d ./data ./data/data.zip`
- `$ unzip -d ./data/linkage './data/block_*.zip'`
- `$ rm -v ./data/block_*.zip` (optionally remove the .zip files)

###### Run the project
- `$ ./gradlew run`

## Follow-up Improvements
- Find how to plot the ROC (Receiver Operating Characteristic) curve in Scala (a possible starting point is sketched after this list)
- Try new sets of features (consider low-value or deprecated features for the scoring function)
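For the ROC item above, Spark's MLlib can already compute the curve's points, even if the plotting itself needs an external tool. The sketch below is a hypothetical starting point, not the project's method: it assumes the DataFrame produced by the loading sketch earlier, sums a hand-picked subset of similarity columns into a score (the column subset is illustrative only), and feeds (score, label) pairs to `BinaryClassificationMetrics`.

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, lit}

// Print (FPR, TPR) points of the ROC curve for a simple summed-score model.
// `parsed` is the DataFrame read from data/linkage (see the loading sketch above).
def printRoc(parsed: DataFrame): Unit = {
  // Illustrative feature subset; missing scores count as 0.0.
  val features = Seq("cmp_lname_c1", "cmp_plz", "cmp_by", "cmp_bd", "cmp_bm")
  val scored = parsed.select(
    features.map(c => coalesce(col(c), lit(0.0))).reduce(_ + _).as("score"),
    col("is_match").cast("double").as("label")
  )

  val metrics = new BinaryClassificationMetrics(
    scored.rdd.map(row => (row.getDouble(0), row.getDouble(1)))
  )

  // Each point is (false positive rate, true positive rate) at one threshold;
  // the points can be exported and plotted with any charting tool.
  metrics.roc().collect().foreach { case (fpr, tpr) => println(f"$fpr%.4f\t$tpr%.4f") }
  println(s"Area under ROC = ${metrics.areaUnderROC()}")
}
```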
- "Advanced Analytics with Spark by Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills (O’Reilly). Copyright 2015 Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills, 978-1-491-91276-8."