https://github.com/robinl/data_linking_example


Host: GitHub
URL: https://github.com/robinl/data_linking_example
Owner: RobinL
Created: 2019-11-05T07:59:45.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2019-11-05T08:04:00.000Z (over 6 years ago)
Last Synced: 2025-12-25T14:56:33.064Z (6 months ago)
Language: Jupyter Notebook
Size: 469 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


# Sample AWS Glue script for data linking

A first attempt at a Glue job that performs data deduplication.

![image](dag.png?raw=true)

High-level overview of the job (to render the diagram, use the VS Code Markdown Preview Enhanced extension).

```mermaid

graph TD

A[Ground truth dataset] -->|from fake data| B

B[Train test split] --> C

C[Tokenise records] -->|Concat all columns we're using for matching and split into array of tokens| C2

C2[Compute lookup table containing relative frequency of each token] -->D

 D[Apply Blocking rules] --> |Apply series of OR rules like 'firstname' and 'surname' or 'firstname' and 'dob' to produc|E

E[Dataset of potentially matching pairs] --> F

F[Compute features] -->G

F -->H

G[Edit distance] -->J

H[Probability score] -->|Look up each matching token in the token frequency table and multiply together to produce score|J

J[Train logit model] --> K

K[Apply trained model to test data] --> L

L[Compute accuracy statistics on test data]

```
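
The first half of the diagram (tokenise, token frequency lookup, blocking, feature computation) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Glue job itself: the toy records, the column names, and every function name below are hypothetical, and the real job would express these steps as Spark transformations rather than Python loops. The logit model and accuracy steps are left out.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy records; the column names are assumptions for illustration.
records = [
    {"id": 1, "firstname": "john", "surname": "smith", "dob": "1980-01-01"},
    {"id": 2, "firstname": "john", "surname": "smyth", "dob": "1980-01-01"},
    {"id": 3, "firstname": "jane", "surname": "smith", "dob": "1975-06-30"},
    {"id": 4, "firstname": "zebedee", "surname": "jones", "dob": "1990-12-12"},
]
MATCH_COLUMNS = ["firstname", "surname", "dob"]

# Blocking rules are OR-ed together: a pair is kept if any one rule matches.
# Each rule is a tuple of columns that must all be equal.
BLOCKING_RULES = [("firstname", "surname"), ("firstname", "dob")]


def tokenise(record):
    """Concatenate the matching columns and split into a list of tokens."""
    return " ".join(str(record[c]) for c in MATCH_COLUMNS).split()


def token_frequencies(rows):
    """Relative frequency of each token across all records."""
    counts = Counter(t for r in rows for t in tokenise(r))
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}


def candidate_pairs(rows):
    """Pairs of records that satisfy at least one blocking rule."""
    return [
        (a, b)
        for a, b in combinations(rows, 2)
        if any(all(a[c] == b[c] for c in rule) for rule in BLOCKING_RULES)
    ]


def edit_distance(s, t):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def probability_score(a, b, freqs):
    """Product of the relative frequencies of the tokens both records share."""
    score = 1.0
    for t in set(tokenise(a)) & set(tokenise(b)):
        score *= freqs[t]
    return score


freqs = token_frequencies(records)
features = [
    {
        "ids": (a["id"], b["id"]),
        "surname_edit_distance": edit_distance(a["surname"], b["surname"]),
        "probability_score": probability_score(a, b, freqs),
    }
    for a, b in candidate_pairs(records)
]
```

Each entry in `features` is one candidate pair with the two features that the diagram feeds into the logit model. A lower probability score means the shared tokens are rarer, so agreement on them is less likely to be a coincidence.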

## Further details

A full step-by-step example, showing the output dataframe at each stage, is available in [this notebook](https://github.com/moj-analytical-services/data_linking_glue_job_test/blob/master/match/step_by_step_example.ipynb).
