Host: GitHub
URL: https://github.com/soroush-04/sql-data-validation
Owner: soroush-04
Created: 2023-10-25T07:52:25.000Z (about 1 year ago)
Default Branch: master
Last Pushed: 2023-10-25T13:07:14.000Z (about 1 year ago)
Last Synced: 2023-10-26T08:58:24.340Z (about 1 year ago)
Topics: sql, sqlite3

# Data Validation Sample - SQL
---

This folder contains three data spot checks of progressively increasing difficulty. Each check has its own folder, which contains its own readme and the data extracts that are the subject of the validation (plus files which might be useful for reproducing the source data).

Table of contents
=======

- [Level 1 - Score Alignment](#level1)
- [Level 2 - NR Check](#level2)
- [Level 3 - Questionnaires](#level3)

---

#### All the SQL files contain step-by-step comments explaining each specific approach.

## Level 1 - Score Alignment

This task deals with inaccurate values in the related score and response columns.
I used SQL queries to analyze the database and then adjusted the scores to their expected values. The main approach was to split the data by `question_id` and evaluate the questions one by one: each question has its own response type, so handling them separately makes error handling easier and more concise.
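
The per-question approach can be sketched roughly as follows. This is a minimal illustration rather than the actual queries in this repository: the `responses` table, its `id`, `response` and `score` columns, question `101` and its scoring rule are all hypothetical. Only `question_id` comes from the task itself.

```python
import sqlite3

# In-memory database with a hypothetical schema (not the real extract).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE responses (
    id INTEGER PRIMARY KEY,
    question_id INTEGER NOT NULL,
    response TEXT,
    score INTEGER
);
INSERT INTO responses (id, question_id, response, score) VALUES
    (1, 101, 'B', 1),
    (2, 101, 'C', 1),    -- wrong answer, but scored 1
    (3, 101, 'B', 0),    -- right answer, but scored 0
    (4, 102, 'yes', 1);  -- different question, different response type
""")

# Assumed rule for question 101 only: response 'B' earns 1, anything else 0.
# Every other question would get its own rule and its own pass.
EXPECTED_101 = "CASE WHEN response = 'B' THEN 1 ELSE 0 END"


def misaligned_ids(conn):
    """Return ids of question-101 rows whose score differs from the rule."""
    # IS NOT is a NULL-safe inequality in SQLite, so NULL scores are caught too.
    sql = (
        "SELECT id FROM responses "
        f"WHERE question_id = 101 AND score IS NOT ({EXPECTED_101}) "
        "ORDER BY id"
    )
    return [row[0] for row in conn.execute(sql)]


def fix_scores(conn):
    """Rewrite misaligned question-101 scores and return the rows changed."""
    sql = (
        f"UPDATE responses SET score = ({EXPECTED_101}) "
        f"WHERE question_id = 101 AND score IS NOT ({EXPECTED_101})"
    )
    cur = conn.execute(sql)
    conn.commit()
    return cur.rowcount


before = misaligned_ids(conn)  # rows that break the rule
fixed = fix_scores(conn)       # number of rows corrected
after = misaligned_ids(conn)   # should be empty once fixed
```

Restricting each pass to a single `question_id` keeps the scoring rule small and makes it easy to check one response type before moving to the next.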

## Level 2 - NR Check

I used SQL queries to analyze the database and then identified the records where `is_nr=1` does not agree with `none(response_raw[entryId].isResponded=true)`. The common pattern across all of these rows was that the response column held exactly the same JSON, which is what caused the issue.
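
A check of this kind might look like the sketch below. It assumes a hypothetical `responses` table in which `response_raw` is a JSON object keyed by entry id, each entry carrying an `isResponded` boolean. That shape is inferred from the expression above and is not confirmed by the extract. It also assumes SQLite's JSON functions (`json_each`, `json_extract`) are available in the local build.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE responses (
    id INTEGER PRIMARY KEY,
    is_nr INTEGER NOT NULL,
    response_raw TEXT NOT NULL
)
""")

# Hypothetical sample rows; the JSON layout is an assumption.
rows = [
    (1, 1, '{"e1": {"isResponded": false}}'),                             # consistent
    (2, 0, '{"e1": {"isResponded": true}}'),                              # consistent
    (3, 1, '{"e1": {"isResponded": true}, "e2": {"isResponded": false}}'),  # flagged NR but has a response
    (4, 0, '{"e1": {"isResponded": false}}'),                             # not flagged NR but nothing responded
]
conn.executemany("INSERT INTO responses VALUES (?, ?, ?)", rows)

# A row is discrepant when is_nr disagrees with "no entry has isResponded = true".
# NOT EXISTS evaluates to 1 or 0 in SQLite, so it can be compared to is_nr directly.
DISCREPANT_SQL = """
SELECT r.id
FROM responses AS r
WHERE r.is_nr <> (
    NOT EXISTS (
        SELECT 1
        FROM json_each(r.response_raw) AS e
        WHERE json_extract(e.value, '$.isResponded') = 1
    )
)
ORDER BY r.id
"""

discrepant_ids = [row[0] for row in conn.execute(DISCREPANT_SQL)]
```

The comparison works in both directions, so it reports rows marked as NR that contain a response as well as rows not marked as NR where no entry responded.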

## Level 3 - Questionnaires

The data analysis turned up specific evidence of inaccurate data for `Question 15` and `Question 17`. The main issue in the table was that two different `test_design_version_id` values produced duplicated output in the database. This duplication can be handled in several ways, such as:
- Data Deduplication: removing the duplicate entries and retaining only one of each. This will eliminate redundancy and keep our data cleaner.
- Data Versioning: If retaining both versions is important, we can consider creating a versioning system. We would add a field to indicate the version of each question, allowing us to distinguish between them.
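
Before choosing between these options, the duplicated questions can be located with a grouping query. The sketch below is only an illustration: the `questionnaire_items` table and its `question_label` column are hypothetical names, while `test_design_version_id` and the values `1695` and `1772` come from this task.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE questionnaire_items (
    id INTEGER PRIMARY KEY,
    question_label TEXT NOT NULL,
    test_design_version_id INTEGER NOT NULL
);
INSERT INTO questionnaire_items (question_label, test_design_version_id) VALUES
    ('Question 15', 1695),
    ('Question 15', 1772),
    ('Question 16', 1772),
    ('Question 17', 1695),
    ('Question 17', 1772);
""")

# List questions that appear under more than one test design version.
DUPLICATES_SQL = """
SELECT question_label,
       COUNT(DISTINCT test_design_version_id) AS version_count
FROM questionnaire_items
GROUP BY question_label
HAVING COUNT(DISTINCT test_design_version_id) > 1
ORDER BY question_label
"""

duplicated = conn.execute(DUPLICATES_SQL).fetchall()
```

With the sample rows above, only the questions stored under both versions are returned, and `Question 16` is left out because it has a single version.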

For this task, I used the data deduplication technique and kept `test_design_version_id = 1772`, which was the more complete version compared with the other duplicate, `test_design_version_id = 1695`.
This duplication issue can also cause serious problems for future use in our project, so it needs to be investigated and solved with these principal factors in mind:
- Data Consolidation: If the duplicates are the result of data import errors or other issues, we may need to correct the data import process to prevent such duplicates in the future.
- Data Integration: If the duplicates stem from data integration from multiple sources, we should create a clear integration strategy and perform data transformation to eliminate duplicates.
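
The deduplication step described above (keeping `test_design_version_id = 1772` and dropping `1695`) could be sketched as below. The `questionnaire_items` table and `question_label` column are hypothetical, and the guard that only deletes a `1695` row when a `1772` counterpart exists is an added safety assumption, not something stated in the task.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE questionnaire_items (
    id INTEGER PRIMARY KEY,
    question_label TEXT NOT NULL,
    test_design_version_id INTEGER NOT NULL
);
INSERT INTO questionnaire_items (question_label, test_design_version_id) VALUES
    ('Question 15', 1695),
    ('Question 15', 1772),
    ('Question 17', 1695),
    ('Question 17', 1772),
    ('Question 18', 1695);  -- no 1772 counterpart, so it must survive
""")

# Delete a 1695 row only when the same question also exists under 1772.
DEDUP_SQL = """
DELETE FROM questionnaire_items
WHERE test_design_version_id = 1695
  AND EXISTS (
      SELECT 1
      FROM questionnaire_items AS kept
      WHERE kept.question_label = questionnaire_items.question_label
        AND kept.test_design_version_id = 1772
  )
"""

deleted = conn.execute(DEDUP_SQL).rowcount
conn.commit()

remaining = conn.execute(
    "SELECT question_label, test_design_version_id "
    "FROM questionnaire_items ORDER BY question_label"
).fetchall()
```

Guarding the delete with `EXISTS` avoids losing a question that only ever existed under the older version.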