{"id":29311460,"url":"https://github.com/mele0/air2lung","last_synced_at":"2026-06-08T16:03:10.528Z","repository":{"id":302254089,"uuid":"1011782329","full_name":"Mele0/Air2Lung","owner":"Mele0","description":"SQL and Python pipeline for analyzing the impact of air pollution and lifestyle factors on lung disease using a synthetic UK cohort.","archived":false,"fork":false,"pushed_at":"2025-07-01T11:35:13.000Z","size":374,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-07-01T11:39:36.135Z","etag":null,"topics":["air-pollution","data-privacy","encryption","k-anonymity","lung-health-severity","mysql","python","sql"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Mele0.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-01T10:31:29.000Z","updated_at":"2025-07-01T11:35:16.000Z","dependencies_parsed_at":"2025-07-01T11:39:40.265Z","dependency_job_id":"9436d01f-b67f-466d-bb9d-cf2fa2a20129","html_url":"https://github.com/Mele0/Air2Lung","commit_stats":null,"previous_names":["mele0/lung-disease-cohort"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Mele0/Air2Lung","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mele0%2FAir2Lung","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mele0%2FAir2Lung/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mele0%2FAir2Lung/releases","manifests_ur
l":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mele0%2FAir2Lung/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Mele0","download_url":"https://codeload.github.com/Mele0/Air2Lung/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mele0%2FAir2Lung/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264040975,"owners_count":23548077,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["air-pollution","data-privacy","encryption","k-anonymity","lung-health-severity","mysql","python","sql"],"created_at":"2025-07-07T08:14:50.031Z","updated_at":"2026-06-08T16:03:10.481Z","avatar_url":"https://github.com/Mele0.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Lung Disease Cohort Analysis\n\nThis project focuses on constructing, querying, analyzing, and securing a relational database for a simulated clinical study. The study investigates associations between air pollution, lifestyle factors, and lung disease among 1,000 UK participants. 
The project integrates SQL (MySQL), Python, data privacy techniques, and encryption.\n\n## 🧪 Project Objectives\n\n- Create a MySQL database and import cohort data from multiple CSV sources.\n- Explore demographic and exposure variables using SQL queries.\n- Assess air quality exposure and its relation to disease status.\n- Implement privacy-preserving techniques including k-anonymity and encryption.\n- Use Python to query the database and manipulate patient-level data.\n\n## 📁 Dataset Description\n\nThe project uses a synthetic cohort study consisting of:\n\n- **covars.csv**: Demographic and clinical data including sex, age at recruitment, disease status, smoking data, etc.\n- **monitor.csv**: Environmental exposure data at monitoring sites (PM2.5, NO2).\n- **customers.csv**: Insurance records including personal identifiers and lifestyle factors.\n\n## ⚙️ Tools and Technologies\n\n- **SQL**: MySQL 8.0, DBeaver\n- **Python**: `pandas`, `sqlalchemy`, `mysql-connector-python`, `cryptography`\n- **Environment**: Jupyter Notebook, VSCode\n\n## 🔍 Key Features\n\n### 1. Database Construction\nMySQL scripts build a relational database `lung_disease_DB` with foreign key relations and correct datatypes, enabling clean integration of environmental and participant-level data.\n\n### 2. Data Analysis (SQL)\nQueries assess cohort characteristics:\n- Age distribution\n- Disease prevalence across regions\n- Environmental exposure (PM2.5, NO2)\n- Smoking intensity (Pack Years)\n\nViews and updated schema elements (e.g., `pack_years`) were added to support repeated queries and visualization.\n\n### 3. Python Integration\nPython scripts:\n- Establish secure connection to the MySQL database\n- Extract and manipulate cohort subsets\n- Validate SQL queries within a Python data science pipeline\n\n### 4. 
Data Privacy \u0026 Anonymization\n\n#### HIPAA-Informed Classification:\n- **Sensitive Identifiers**: Name, phone number, bank details\n- **Quasi-identifiers**: Age group, sex, ethnicity, education level, area\n\nData split into two linked CSVs:\n- `Sensitive_information.csv`\n- `Raw_information.csv`\n\nA re-identification risk assessment showed **83 individuals** could be uniquely identified using just quasi-identifiers—highlighting the insufficiency of basic de-identification.\n\n---\n\n#### 🧠 Mondrian Method for K-Anonymity\n\nTo anonymize quasi-identifiers in our dataset, we used a custom binning and generalization strategy inspired by the **Mondrian multidimensional k-anonymity** algorithm.\n\nThe Mondrian method recursively partitions data into multidimensional regions until no further division is possible without violating the desired k-anonymity threshold. It balances privacy and data utility by minimizing information loss.\n\nFor more details, refer to the [original paper](https://pages.cs.wisc.edu/~lefevre/MultiDim.pdf).\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"https://frontdesk.co.in/wp-content/uploads/2024/05/image-15.png\" width=\"400\"/\u003e\n  \u003cp\u003e\u003cem\u003eFigure: A simplified illustration of Mondrian partitioning in two dimensions.\u003c/em\u003e\u003c/p\u003e\n\u003c/div\u003e\n\n---\n\n#### K-Anonymity Strategy:\nUsing the Mondrian-inspired approach, we achieved:\n\n\u003cdiv align=\"center\"\u003e\n\n\u003ctable\u003e\n  \u003cthead\u003e\n    \u003ctr\u003e\n      \u003cth\u003eK-Anonymity Level\u003c/th\u003e\n      \u003cth\u003eSamples Retained\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003ctd\u003e≥1\u003c/td\u003e\n      \u003ctd\u003e1000\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd\u003e≥2\u003c/td\u003e\n      \u003ctd\u003e367\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      
\u003ctd\u003e≥3\u003c/td\u003e\n      \u003ctd\u003e109\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd\u003e≥4\u003c/td\u003e\n      \u003ctd\u003e16\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\n\u003c/div\u003e\n\n\nFinal anonymized dataset: `Anonymized_information.csv`\n\n### 5. Data Encryption\n\nThe anonymized datasets are encrypted with Fernet symmetric encryption from the `cryptography` Python package. Because Fernet is symmetric, decryption requires the same secret key used for encryption, which is shared separately.\n\n```python\nfrom cryptography.fernet import Fernet\n\n# Generate a new secret key; store it securely, since the same key is needed to decrypt\nkey = Fernet.generate_key()\nf = Fernet(key)\n\n# Encrypt (the input must be bytes)\nencrypted = f.encrypt(b\"your_data_here\")\n\n# Decrypt with the same key\ndecrypted = f.decrypt(encrypted)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmele0%2Fair2lung","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmele0%2Fair2lung","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmele0%2Fair2lung/lists"}