https://github.com/mele0/air2lung
SQL and Python pipeline for analyzing the impact of air pollution and lifestyle factors on lung disease using a synthetic UK cohort.
https://github.com/mele0/air2lung
air-pollution data-privacy encryption k-anonymity lung-health-severity mysql python sql
Last synced: 6 days ago
JSON representation
SQL and Python pipeline for analyzing the impact of air pollution and lifestyle factors on lung disease using a synthetic UK cohort.
- Host: GitHub
- URL: https://github.com/mele0/air2lung
- Owner: Mele0
- Created: 2025-07-01T10:31:29.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-07-01T11:35:13.000Z (12 months ago)
- Last Synced: 2025-07-01T11:39:36.135Z (12 months ago)
- Topics: air-pollution, data-privacy, encryption, k-anonymity, lung-health-severity, mysql, python, sql
- Language: Python
- Homepage:
- Size: 365 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Lung Disease Cohort Analysis
This project focuses on constructing, querying, analyzing, and securing a relational database for a simulated clinical study. The study investigates associations between air pollution, lifestyle factors, and lung disease among 1,000 UK participants. The project integrates SQL (MySQL), Python, data privacy techniques, and encryption.
## 🧪 Project Objectives
- Create a MySQL database and import cohort data from multiple CSV sources.
- Explore demographic and exposure variables using SQL queries.
- Assess air quality exposure and its relation to disease status.
- Implement privacy-preserving techniques including k-anonymity and encryption.
- Use Python to query the database and manipulate patient-level data.
## 📁 Dataset Description
The project uses a synthetic cohort study consisting of:
- **covars.csv**: Demographic and clinical data including sex, age at recruitment, disease status, smoking data, etc.
- **monitor.csv**: Environmental exposure data at monitoring sites (PM2.5, NO2).
- **customers.csv**: Insurance records including personal identifiers and lifestyle factors.
## ⚙️ Tools and Technologies
- **SQL**: MySQL 8.0, DBeaver
- **Python**: `pandas`, `sqlalchemy`, `mysql-connector-python`, `cryptography`
- **Environment**: Jupyter Notebook, VSCode
## 🔍 Key Features
### 1. Database Construction
MySQL scripts build a relational database `lung_disease_DB` with foreign key relations and correct datatypes, enabling clean integration of environmental and participant-level data.
### 2. Data Analysis (SQL)
Queries assess cohort characteristics:
- Age distribution
- Disease prevalence across regions
- Environmental exposure (PM2.5, NO2)
- Smoking intensity (Pack Years)
Views and updated schema elements (e.g., `pack_years`) were added to support repeated queries and visualization.
### 3. Python Integration
Python scripts:
- Establish secure connection to the MySQL database
- Extract and manipulate cohort subsets
- Validate SQL queries within a Python data science pipeline
### 4. Data Privacy & Anonymization
#### HIPAA-Informed Classification:
- **Sensitive Identifiers**: Name, phone number, bank details
- **Quasi-identifiers**: Age group, sex, ethnicity, education level, area
Data split into two linked CSVs:
- `Sensitive_information.csv`
- `Raw_information.csv`
A re-identification risk assessment showed **83 individuals** could be uniquely identified using just quasi-identifiers—highlighting the insufficiency of basic de-identification.
---
#### 🧠 Mondrian Method for K-Anonymity
To anonymize quasi-identifiers in our dataset, we used a custom binning and generalization strategy inspired by the **Mondrian multidimensional k-anonymity** algorithm.
The Mondrian method recursively partitions data into multidimensional regions until no further division is possible without violating the desired k-anonymity threshold. It balances privacy and data utility by minimizing information loss.
For more details, refer to the [original paper](https://pages.cs.wisc.edu/~lefevre/MultiDim.pdf).
Figure: A simplified illustration of Mondrian partitioning in two dimensions.
---
#### K-Anonymity Strategy:
Using the Mondrian-inspired approach, we achieved:
K-Anonymity Level
Samples Retained
≥1
1000
≥2
367
≥3
109
≥4
16
Final anonymized dataset: `Anonymized_information.csv`
### 5. Data Encryption
Used the `cryptography` Python package and Fernet symmetric encryption to securely encrypt anonymized datasets. Decryption requires the private key shared separately.
```python
from cryptography.fernet import Fernet
key = Fernet.generate_key()
f = Fernet(key)
# Encrypt
encrypted = f.encrypt(b"your_data_here")
# Decrypt
decrypted = f.decrypt(encrypted)