https://github.com/lefteris-souflas/entity-resolution
Addressed Entity Resolution challenges. Tasks include schema-agnostic blocking, pairwise comparisons, Meta-Blocking graph construction, and Jaccard similarity computation. Deliverables include source code, reports, and reproducibility guidelines in Python
edge-pruning entity-resolution graph jaccard-similarity meta-blocking token-blocking
- Host: GitHub
- URL: https://github.com/lefteris-souflas/entity-resolution
- Owner: Lefteris-Souflas
- License: mit
- Created: 2024-04-09T15:44:19.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-04-18T20:11:26.000Z (over 1 year ago)
- Last Synced: 2025-01-12T20:33:35.921Z (9 months ago)
- Topics: edge-pruning, entity-resolution, graph, jaccard-similarity, meta-blocking, token-blocking
- Language: Jupyter Notebook
- Homepage:
- Size: 4.54 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Entity Resolution
Assignment 2 for the Advanced Data Engineering Course of AUEB's MSc in Business Analytics
## Introduction
This assignment tackles several Entity Resolution challenges using the provided `ER-Data.csv` file. Tasks A to D collectively aim to enhance data quality and accuracy through schema-agnostic blocking, pairwise comparisons, Meta-Blocking graph construction, and similarity computation.
## Task A [30 points]
- Implement the Token Blocking method as a schema-agnostic approach.
- Generate blocks in the form of K-V (Key-value) pairs.
- Use all attributes (except id) for creating blocks.
- Ensure accurate matching by transforming strings to lowercase during token creation and filtering out stop-words.
- Pretty-print the index for clear readability.
## Task B [25 points]
- Compute all possible comparisons to resolve duplicates within the created blocks.
- Print the final calculated number of comparisons.
## Task C [30 points]
- Create a Meta-Blocking graph of the block collection from Task A.
- Use the CBS (Common Blocks Scheme) weighting scheme, which weights each edge by the number of blocks its two entities share.
- Prune edges with weight < 2 to reduce unnecessary comparisons.
- Re-calculate the final number of comparisons after edge pruning.
## Task D [15 points]
- Develop a function to compute Jaccard similarity based on the 'title' attribute.
- The function takes two entities as input and computes their similarity.
- No actual comparisons using this function are required.
## Deliverables:
1. Source code with useful comments.
2. A **small report** for each task justifying the code and describing the methodology.
3. For Task C ONLY, a partially solved answer with proper justification will also be accepted.
4. Programming Languages: Python was used.
## Code Reproducibility
The steps below explain how to access, set up, and execute the code so that the results presented in this report can be reproduced.
### Accessing the Code
The complete code for Tasks A, B, C, and D is available in the `Code` and `Jupyter` folders under the repository root. Download the code files from there.
### Environment Setup
Depending on the specific tasks and functions, certain libraries and dependencies are required. Ensure that you have the necessary libraries installed.
### Executing the Code
Run the `ipynb` file in Jupyter Notebook, or run the `py` file in any Python environment. Open the respective code file for each task and follow the instructions in the comments.
### Task-Specific Instructions
For the assignment's tasks, refer to the corresponding sections in the Jupyter Notebook code or the exported PDF file (if unable to run the `ipynb` file) for an in-depth explanation of the code and the methodology used. This report presents only a summary justification of the methodology and code used. The code is designed to be modular and organized, making it straightforward to follow along and reproduce the results.

**Note:** Ensure that the `ER-Data.csv` file is placed in the `Data` directory before running the code.

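
As a quick sanity check for the note above, the following minimal sketch looks for `ER-Data.csv` under a `Data` folder and loads it with the standard library. The helper name `load_records`, the assumption that `Data` sits directly under the folder you pass in, the UTF-8 encoding, and the comma delimiter are all illustrative guesses rather than details taken from the repository's code.

```python
import csv
from pathlib import Path


def load_records(root="."):
    """Load Data/ER-Data.csv below `root` as a list of dicts (hypothetical helper)."""
    # Assumption: the Data directory sits directly under `root`.
    path = Path(root) / "Data" / "ER-Data.csv"
    if not path.is_file():
        raise FileNotFoundError(f"Expected the dataset at {path}")
    # Assumption: UTF-8 text with a comma delimiter; adjust if the file differs.
    with path.open(newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


# Example usage (run from the repository root):
# records = load_records(".")
# print(len(records), "records loaded")
```
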
By following these steps, readers can confidently reproduce the presented results and gain a deeper understanding of the methodologies applied in this study.
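
For orientation, Tasks A to D can be sketched end to end on a few toy records. This is a minimal illustration under stated assumptions, not the repository's actual implementation: the stop-word list, the regular-expression tokenizer, the choice to drop single-entity blocks, the reuse of the same tokenizer for Jaccard similarity, and all function names and sample records are hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations
from pprint import pprint

# Hypothetical stop-word list; the assignment does not specify which one to use.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def tokenize(text):
    """Lowercase the text, split on non-alphanumerics, and drop stop-words."""
    return {t for t in re.findall(r"[a-z0-9]+", str(text).lower()) if t not in STOP_WORDS}


def token_blocking(records, id_key="id"):
    """Task A: build a token -> set-of-entity-ids index over all attributes except the id."""
    blocks = defaultdict(set)
    for rec in records:
        for attr, value in rec.items():
            if attr == id_key or value is None:
                continue
            for token in tokenize(value):
                blocks[token].add(rec[id_key])
    # Assumption: blocks with a single entity are dropped, since they yield no comparison.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def count_comparisons(blocks):
    """Task B: sum n * (n - 1) / 2 over all blocks (redundant pairs are counted again)."""
    return sum(len(ids) * (len(ids) - 1) // 2 for ids in blocks.values())


def cbs_graph(blocks):
    """Task C: weight each entity pair by the number of blocks it shares (CBS)."""
    weights = defaultdict(int)
    for ids in blocks.values():
        for pair in combinations(sorted(ids), 2):
            weights[pair] += 1
    return dict(weights)


def prune_edges(weights, threshold=2):
    """Task C: discard edges with weight below the threshold (weight < 2 here)."""
    return {pair: w for pair, w in weights.items() if w >= threshold}


def jaccard_title(a, b, attr="title"):
    """Task D: |A & B| / |A | B| over the token sets of the two entities' titles."""
    ta, tb = tokenize(a.get(attr) or ""), tokenize(b.get(attr) or "")
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


# Toy records, invented for illustration only.
records = [
    {"id": 1, "title": "Entity Resolution in the Web of Data", "year": "2015"},
    {"id": 2, "title": "Entity resolution on the web of data", "year": "2015"},
    {"id": 3, "title": "Graph Data Management", "year": "2013"},
]

blocks = token_blocking(records)
pprint(blocks)  # pretty-printed block index
print("Comparisons before pruning:", count_comparisons(blocks))  # 7
pruned = prune_edges(cbs_graph(blocks))
print("Comparisons after pruning:", len(pruned))  # 1
print("Jaccard(1, 2):", jaccard_title(records[0], records[1]))  # 1.0
```

In this toy run, entities 1 and 2 share five blocks and survive pruning, while the pairs involving entity 3 share only the `data` block and are removed. After pruning, each remaining edge corresponds to exactly one comparison, so the comparison count is simply the number of surviving edges.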