https://github.com/dkoh2018/sql_db
ETL pipeline for processing LinkedIn company data: extracts JSON, normalizes, and stores in MySQL
data-engineering data-pipeline etl json linkedin mysql sql
- Host: GitHub
- URL: https://github.com/dkoh2018/sql_db
- Owner: dkoh2018
- Created: 2025-03-03T05:18:25.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-18T06:20:49.000Z (over 1 year ago)
- Last Synced: 2025-05-05T21:25:09.804Z (about 1 year ago)
- Topics: data-engineering, data-pipeline, etl, json, linkedin, mysql, sql
- Language: Python
- Homepage:
- Size: 1.62 MB
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# LinkedIn Company Data ETL System



## Overview
This repository contains an ETL (Extract, Transform, Load) pipeline for processing LinkedIn company profile data from JSON files and storing it in a normalized MySQL database. The system covers conceptual, logical, and physical design aspects to ensure data consistency and referential integrity.
## Key Features
- **Data Modeling**
  - Conceptual, Logical, and Physical diagrams (Crow's Foot notation, 3NF design).
  - Focus on proper normalization principles to achieve Third Normal Form (3NF).
- **Staging**
  - Loads raw JSON data into a staging database (`staging`) using Pentaho Data Integration.
- **Transformation**
  - Normalizes data into intermediate tables, correcting inconsistencies and handling duplicates.
- **Loading**
  - Moves transformed data into the final `company` database with proper constraints and relationships.
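The transformation step can be pictured with a minimal Python sketch. It is not taken from `normalization_script.py`: the input field names (`id`, `name`, `specialties`) and the `normalize` helper are hypothetical, chosen only to show how repeated, inconsistently formatted values collapse into separate company, specialty, and junction rows.

```python
def normalize(records):
    """Split denormalized company records into three 3NF-style row sets."""
    companies = {}    # company id -> cleaned name (later duplicates overwrite earlier ones)
    specialties = {}  # lowercase specialty -> (surrogate id, display name)
    links = set()     # unique (company_id, specialty_id) pairs

    for rec in records:
        company_id = rec["id"]
        companies[company_id] = rec["name"].strip()
        for raw in rec.get("specialties", []):
            name = " ".join(raw.split())  # trim and collapse whitespace
            if not name:
                continue  # skip empty entries
            key = name.lower()  # treat "ETL" and "etl" as the same specialty
            if key not in specialties:
                specialties[key] = (len(specialties) + 1, name)
            links.add((company_id, specialties[key][0]))

    return {
        "companies": sorted(companies.items()),
        "specialties": sorted(specialties.values()),
        "company_specialties": sorted(links),
    }


# Hypothetical sample input, including a duplicate profile and messy values.
sample = [
    {"id": 1, "name": " Acme Corp ", "specialties": ["ETL", "Data  Modeling"]},
    {"id": 2, "name": "Globex", "specialties": ["etl", ""]},
    {"id": 1, "name": "Acme Corp", "specialties": ["ETL"]},
]
tables = normalize(sample)
```

Each specialty is stored once and referenced by id, which is the property the junction table in a 3NF design relies on.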
## Technology Stack
- **MySQL** for relational storage
- **Pentaho Data Integration (Spoon)** for ETL workflows
- **Python 3.x** for scripting and data manipulation
- **JSON** as source format
## Repository Structure
```
sql_db/
├── README.md                  # This documentation
├── instructions.md            # Setup and execution instructions
├── normalization_script.py    # Python data processing script
├── stage-json-to-table.ktr    # Pentaho transformation definition
├── staging.sql                # SQL schema and normalization queries
└── company-profile/           # Source JSON data files
    └── [company_*.json]       # Multiple company profile files
```
## Data Flow
1. **Extract**
   - JSON files from the `company-profile` directory are identified and read.
2. **Staging**
   - Using Pentaho Data Integration (Spoon), data is inserted into `staging` tables in MySQL.
3. **Transform**
   - **Primary Method:** Run the SQL statements in `staging.sql` for normalization:
     - First execute the table creation section to establish the schema
     - Then execute the remaining transformation queries in order
   - **Optional:** Execute the `normalization_script.py` only if additional custom logic is required.
4. **Load**
   - Final normalized data is moved into the `company` database with enforced primary/foreign keys.
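In this repository the extract and staging steps are handled by the Pentaho transformation (`stage-json-to-table.ktr`). For readers who want to see what those steps amount to, here is a rough Python equivalent. The helper names and the `source_file` / `raw_json` row keys are assumptions made for illustration, not the actual staging columns.

```python
import json
from pathlib import Path


def load_company_profiles(directory):
    """Yield (file name, parsed JSON) for every company_*.json file, in name order."""
    for path in sorted(Path(directory).glob("company_*.json")):
        with path.open(encoding="utf-8") as handle:
            yield path.name, json.load(handle)


def to_staging_rows(directory):
    """Build one staging row per file, keeping the raw document for later SQL steps."""
    return [
        {"source_file": name, "raw_json": json.dumps(doc, sort_keys=True)}
        for name, doc in load_company_profiles(directory)
    ]
```

Calling `to_staging_rows("company-profile")` from the repository root would return one dictionary per source file. Inserting those rows into MySQL is left out here because it depends on the connector and schema in use.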
## Getting Started
### Prerequisites
- MySQL Server 5.7+
- Pentaho Data Integration 8.0+
- Python 3.6+ (optional, only if using the Python script)
### Quick Setup
1. **Clone** this repository.
2. **Launch Pentaho** Spoon from the PDI installation directory (`./spoon.sh` on Linux/macOS, `Spoon.bat` on Windows).
3. **Configure** database connections in Pentaho.
4. **Run** the Pentaho transformation to load staging data.
5. **Normalize** the data:
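   - Run the SQL in `staging.sql`, which creates the tables first and then applies the remaining transformation queries in order (replace `<username>` with your MySQL user):
   ```bash
   mysql -u <username> -p < staging.sql
   ```
   - Note: The Python script is optional and only needed for specialized transformations.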
6. **Confirm** the final structure in the `company` database.
## Example Query
```sql
-- Retrieve companies with their specialties
SELECT c.name, GROUP_CONCAT(s.specialty_name) AS specialties
FROM companies c
JOIN company_specialties cs ON c.id = cs.company_id
JOIN specialties s ON cs.specialty_id = s.id
GROUP BY c.name;
```
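To try the shape of this query without a MySQL server, the following sketch runs it against an in-memory SQLite database from Python. SQLite is only a stand-in here (it also provides `GROUP_CONCAT`). The table and column names come from the query above, while the constraints and sample rows are made up for illustration and may differ from the real `company` schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE specialties (id INTEGER PRIMARY KEY, specialty_name TEXT NOT NULL UNIQUE);
CREATE TABLE company_specialties (
    company_id INTEGER NOT NULL REFERENCES companies(id),
    specialty_id INTEGER NOT NULL REFERENCES specialties(id),
    PRIMARY KEY (company_id, specialty_id)
);
-- Hypothetical sample data
INSERT INTO companies VALUES (1, 'Acme Corp'), (2, 'Globex');
INSERT INTO specialties VALUES (1, 'ETL'), (2, 'Data Modeling');
INSERT INTO company_specialties VALUES (1, 1), (1, 2), (2, 1);
""")

rows = conn.execute("""
    SELECT c.name, GROUP_CONCAT(s.specialty_name) AS specialties
    FROM companies c
    JOIN company_specialties cs ON c.id = cs.company_id
    JOIN specialties s ON cs.specialty_id = s.id
    GROUP BY c.name
""").fetchall()

# Concatenation order is not guaranteed, so compare specialties as sets.
result = {name: set(joined.split(",")) for name, joined in rows}
conn.close()
```

The junction table's composite primary key prevents the same company and specialty pair from being stored twice, so each specialty appears at most once per company in the concatenated result.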
## Additional Documentation
See [instructions.md](instructions.md) for more details on the SQL normalization process and how the ERD was developed to achieve proper 3NF design.