https://github.com/camilajaviera91/sql-mock-data
Generate a synthetic dataset with one million records of employee information from a fictional company, load it into a PostgreSQL database, create analytical reports using PySpark and large-scale data analysis techniques, and implement machine learning models to predict trends in hiring and layoffs on a monthly and yearly basis.
- Host: GitHub
- URL: https://github.com/camilajaviera91/sql-mock-data
- Owner: CamilaJaviera91
- Created: 2025-04-01T16:09:52.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-04-29T22:56:54.000Z (10 months ago)
- Last Synced: 2025-07-06T20:05:08.126Z (8 months ago)
- Topics: connection, faker, locale, logging, matplotlib, os, postgresql, psycopg2, pyspark, pyspark-sql, python, random, random-python, shutil, sparksession, sql, sys, unicode
- Language: Python
- Homepage:
- Size: 217 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# 🪪 Synthetic Employee Dataset: SQL, PySpark & ML Pipeline
## SQL Mock Data
- This project generates synthetic employee data using Python and Faker, stores it in a PostgreSQL database, and performs analytics and machine learning modeling using PySpark and Scikit-learn. It's designed for data engineering and data science practice, focusing on realistic HR-style datasets and workflows.
- **Key features include:**
- Synthetic data generation with customizable logic
- PostgreSQL integration
- PySpark data processing and transformations
- Predictive modeling for employee attrition
## 🚀 Getting Started
### 1. Clone the repository
```
git clone https://github.com/CamilaJaviera91/sql-mock-data.git
```
### 2. Navigate to the project folder
```
cd your/route/sql-mock-data
```
### 3. Create a file named **requirements.txt** with the following content:
```
pandas
numpy
faker
psycopg2-binary
pyspark
scikit-learn
matplotlib
seaborn
```
### 4. Create a virtual environment and install dependencies
```
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
### 5. Set up the PostgreSQL database
1. Create a new database called **employees**.
2. Apply the table definitions from **sql/schema.sql**.
3. Generate and insert the mock data (next step).
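The setup above can be sketched in SQL. This is a hedged outline only, since the actual definitions live in **sql/schema.sql**; column names follow the project's data dictionary, and the types are assumptions.

```sql
-- Create the database (run once, e.g. via psql).
CREATE DATABASE employees;

-- Table matching the data dictionary; sql/schema.sql is the source of truth.
CREATE TABLE employees (
    id               INTEGER PRIMARY KEY,
    name             TEXT,
    date_birth       DATE,
    department       TEXT,
    email            TEXT,
    phonenumber      TEXT,
    yearly_salary    INTEGER,
    city             TEXT,
    hire_date        DATE,
    termination_date DATE
);
```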
### 6. Generate and insert mock data into the database
```
python your/route/sql-mock-data/sql_mock_data.py
python your/route/sql-mock-data/insert.py
```
# 📚 Data Dictionary
| Column | Description | Type |
|------------------|-------------------------------------|---------|
| id | Unique identifier | Integer |
| name             | Full name of the employee             | Text    |
| date_birth       | Date of birth of the employee         | Date    |
| department       | Department where the employee works   | Text    |
| email            | Employee work email                   | Text    |
| phonenumber      | Work phone number of the employee     | Text    |
| yearly_salary    | Yearly salary in USD                  | Integer |
| city             | City where the employee lives         | Text    |
| hire_date        | Date when the employee was hired      | Date    |
| termination_date | Date when the employee was terminated | Date    |
# 📁 Project Structure
```
sql-mock-data/
├── data/
│ └── *.csv # Synthetic employee data files
├── images/
│ └── pic*.png # Visualizations and example outputs
├── python/
│ ├── sql_mock_data.py # Script to generate synthetic data
│ ├── insert.py # Script to insert data into PostgreSQL
│ ├── analysis.py # Data analysis using PySpark
│ ├── queries.py # SQL queries for data retrieval
│ ├── show_results.py # Visualization of query results
│ └── connection.py # Database connection setup
├── sql/
│ └── schema.sql # SQL schema definitions
├── .gitignore # Specifies files to ignore in Git
└── README.md # Project documentation
```
# 🔥 Introduction to PySpark
- **PySpark** is the Python API for Apache Spark, letting you drive Spark from Python.
## 🔑 Key Features:
1. **Distributed Computing:** Processes large datasets across a cluster of computers for scalability.
2. **In-Memory Processing:** Speeds up computation by reducing disk I/O.
3. **Lazy Evaluation:** Operations are only executed when an action is triggered, optimizing performance.
4. **Rich Libraries:**
- **Spark SQL:** Structured data processing (like SQL operations).
- **MLlib:** Machine learning library for scalable algorithms.
- **GraphX:** Graph processing (via RDD API).
- **Spark Streaming:** Real-time stream processing.
5. **Compatibility:** Works with Hadoop, HDFS, Hive, Cassandra, etc.
6. **Resilient Distributed Datasets (RDDs):** Low-level API for distributed data handling.
7. **DataFrames & Datasets:** High-level APIs for structured data with SQL-like operations.
## ✅ Pros — ❌ Cons
| Pros | Cons |
|-------------------------------------------------------|-------------------------------------------------|
| Handles massive datasets efficiently. | Can be memory-intensive. |
| Compatible with many tools (Hadoop, Cassandra, etc.). | Complex configuration for cluster environments. |
| Built-in libraries for SQL, Machine Learning. | |
## 🔧 Install PySpark
1. Install via pip
```
pip install pyspark
```
2. Verify installation
```
python3 -c "import pyspark; print(pyspark.__version__)"
```
---
# 🗃️ Introduction to SQL (Structured Query Language)
- **SQL** is how we read, write, and manage data stored in databases.
## 🔑 Key Features:
1. **Data Querying:** You can retrieve exactly the data you need using the SELECT statement.
```
SELECT * FROM employees WHERE department = 'HR';
```
2. **Data Manipulation:** SQL lets you insert, update, or delete records.
- INSERT
- UPDATE
- DELETE
3. **Data Definition:** You can create or change the structure of tables and databases.
- CREATE
- ALTER
- DROP
4. **Data Control:** SQL allows you to control access to the data.
- GRANT
- REVOKE
5. **Transaction Control:** Manage multiple steps as a single unit.
- BEGIN
- COMMIT
- ROLLBACK
6. **Filtering and Sorting:**
- WHERE
- ORDER BY
- GROUP BY
- HAVING
7. **Joins:** Combine data from multiple tables.
8. **Built-in Functions:** SQL includes powerful functions for calculations, text handling, dates, etc.
9. **Standardized Language:** SQL is used across most relational database systems (like PostgreSQL, MySQL, SQL Server, etc.), with only slight differences.
10. **Declarative Nature:** You tell SQL what you want, not how to do it. The database figures out the best way.
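Several of the features above can be combined in a single query against this project's employees table; a sketch in PostgreSQL syntax:

```sql
-- Average salary and headcount of active employees per department,
-- keeping only departments with more than 10 people.
SELECT department,
       ROUND(AVG(yearly_salary)) AS avg_salary,
       COUNT(*) AS headcount
FROM employees
WHERE termination_date IS NULL   -- filtering: active employees only
GROUP BY department              -- grouping
HAVING COUNT(*) > 10             -- filtering on the groups
ORDER BY avg_salary DESC;        -- sorting
```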
## ✅ Pros — ❌ Cons
| Pros | Cons |
|---------------------------------|--------------------------------|
| Easy to Learn and Use. | Not Ideal for Complex Logic. |
| Efficient Data Management. | Different Dialects. |
| Powerful Querying Capabilities. | Can Get Complicated. |
| Standardized Language. | Limited for Unstructured Data. |
| Scalable. | Performance Tuning Required. |
| Secure. | |
| Supports Transactions. | |
---
# 🐳 Introduction to Docker
- **Docker** is a tool that packages your app together with everything it needs, so it runs the same anywhere.
- It does this using containers: isolated environments that are far lighter than full virtual machines.
## 🔑 Key Features:
1. **Containers:** Run apps in isolated environments.
2. **Images:** Blueprints for containers (created using a Dockerfile).
3. **Portability:** Works the same on any system with Docker.
4. **Speed:** Starts apps quickly.
5. **Docker Hub:** A place to share and download app images.
## ✅ Pros — ❌ Cons
| Pros | Cons |
|-----------------------------------|-------------------------------------------------------|
| Works the same everywhere. | Takes some time to learn. |
| Fast and lightweight. | Not ideal for apps that need a full operating system. |
| Easy to share apps. | Security risks if not set up properly. |
| Good for automating deployments. | Managing data storage can be tricky. |
| Great for teams working together. | |
## 🔧 Install Docker on Fedora
1. Update the system:
```
sudo dnf update -y
```
2. Install necessary packages for using HTTPS repositories:
```
sudo dnf install dnf-plugins-core -y
```
3. Add the official Docker repository:
```
sudo dnf config-manager --add-repo https://download.docker.com/linux/fedora/docker-ce.repo
```
4. Install Docker Engine:
```
sudo dnf install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -y
```
5. Enable and start the Docker service:
```
sudo systemctl enable docker
sudo systemctl start docker
```
6. Verify that Docker is running:
```
sudo docker run hello-world
```
7. (Optional) Run Docker without sudo:
- If you want to use Docker without typing sudo every time:
```
sudo usermod -aG docker $USER
```
Then, log out and log back in (or reboot your system) for the change to take effect.
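The repository does not ship a Dockerfile, but as a hedged sketch the project could be containerized roughly like this (note that PySpark also needs a JVM, so the analysis scripts would additionally require a Java runtime in the image):

```dockerfile
FROM python:3.11-slim
WORKDIR /app
# Install dependencies first so this layer is cached between builds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Default command: generate the mock data.
CMD ["python", "python/sql_mock_data.py"]
```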
---
# 🛠️ Code Explanation
## 👩💻 Script 1: sql_mock_data.py — Generate Mock Data
### 🔧 Libraries that we are going to need:
| Library | Description |
|-----------|----------------------------------------------------------------|
| PySpark | Apache Spark Python API (for big data). |
| Faker | Fake data generator (used for names, etc.). |
| unidecode | Removes accents from characters (e.g., é → e). |
| random | For generating random numbers, probabilities, selections, etc. |
| os | For cross-platform file handling and directory management. |
| shutil | For managing file system operations in automation scripts. |
### 📖 Explanation of the Code:
- This script:
- Creates 1 million fake employee records.
- Each with realistic personal and job data.
- Saves them across 12 cleanly named CSV files.
- Makes sure names and phones are unique.
- Can be scaled easily or reused for testing, demos, or training.
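A minimal sketch of the generation step. The real script uses Faker and PySpark and writes a million rows across 12 files; here the standard library stands in, and the name pools, departments, and salary range are illustrative placeholders.

```python
# Simplified stand-in for sql_mock_data.py using only the standard library.
import csv
import random

FIRST_NAMES = ["Ana", "Luis", "Camila", "Jorge", "Elena", "Pedro"]   # placeholder pool
LAST_NAMES = ["Garcia", "Rojas", "Soto", "Munoz", "Vera", "Pinto"]   # placeholder pool
DEPARTMENTS = ["Sales", "IT", "HR", "Finance"]

def make_employee(emp_id: int) -> dict:
    """Build one fake employee record with columns from the data dictionary."""
    name = f"{random.choice(FIRST_NAMES)} {random.choice(LAST_NAMES)}"
    return {
        "id": emp_id,
        "name": name,
        "department": random.choice(DEPARTMENTS),
        "email": f"{name.lower().replace(' ', '.')}@company.com",
        "yearly_salary": random.randint(30_000, 120_000),
    }

def write_batch(path: str, start_id: int, count: int) -> None:
    """Write `count` records to one CSV file, mirroring the 12-file split."""
    rows = [make_employee(i) for i in range(start_id, start_id + count)]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```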
### ✅ Example Output:

---
## 👩💻 Script 2: edit_data.py — Edit Mock Data
### 🔧 Libraries that we are going to need:
| Library | Description |
|---------|------------------------------------------------------------------------|
| pandas | For working with CSVs and DataFrames. |
| os | For cross-platform file handling and directory management. |
| random  | For generating random numbers, shuffling data, and making random selections. |
### 📖 Explanation of the Code:
- This script:
- Reads all .csv files from a folder called data, and saves enriched versions to data_enriched.
- Reads a list of known female first names from a text file (female_names.txt) to help determine gender.
- Provides a list of 20 possible job titles for each department like Sales, IT, HR, etc., to assign randomly.
- For every CSV:
- Adds a status column (Active or Inactive depending on termination_date).
- Adds a gender column using the first name.
- Adds a job_title column based on the department.
- Writes the enriched data to a new CSV in the data_enriched folder and prints a confirmation.
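The enrichment rules above, sketched as pure functions. The real script applies them to every CSV with pandas; the female-name set and job-title map here are tiny illustrative stand-ins for female_names.txt and the 20-title lists per department.

```python
# Simplified stand-ins for the three derived columns in edit_data.py.
import random
from typing import Optional

FEMALE_NAMES = {"Ana", "Camila", "Elena", "Maria"}   # stand-in for female_names.txt
JOB_TITLES = {                                       # stand-in for the per-department lists
    "Sales": ["Account Executive", "Sales Analyst"],
    "IT": ["Data Engineer", "Backend Developer"],
}

def status(termination_date: Optional[str]) -> str:
    """Active when termination_date is empty, Inactive otherwise."""
    return "Inactive" if termination_date else "Active"

def gender(full_name: str) -> str:
    """Infer gender from the first name via the known-female list."""
    return "Female" if full_name.split()[0] in FEMALE_NAMES else "Male"

def job_title(department: str) -> str:
    """Pick a random title from the department's list, with a fallback."""
    return random.choice(JOB_TITLES.get(department, ["Specialist"]))
```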
### ✅ Example Output:

---
## 👩‍💻 Script 3: insert.py — Insert Data into PostgreSQL
### 🔧 Libraries that we are going to need:
| Library | Description |
|---------------|------------------------------------------------------------|
| pandas | For working with CSVs and DataFrames. |
| sqlalchemy | Python SQL toolkit and ORM. |
| psycopg2 | PostgreSQL driver required by SQLAlchemy. |
| python-dotenv | Loads environment variables from a `.env` file.            |
| glob | Standard library for file pattern matching. |
| os | For cross-platform file handling and directory management. |
### 📖 Explanation of the Code:
- This script:
- Finds all CSV files in the ./data/ folder using glob.
- Reads and combines all the CSVs into a single pandas DataFrame.
- Creates a connection to a PostgreSQL database using SQLAlchemy.
- Uploads the combined data to the employees table in the database.
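The find-and-combine step can be sketched with the standard library; the actual script uses pandas and SQLAlchemy's to_sql for the upload, which is left as a comment here since it needs a live database.

```python
# Simplified stand-in for the CSV-combining step of insert.py.
import csv
import glob
import os

def combine_csvs(folder: str) -> list:
    """Read all *.csv files in `folder` into one list of row dicts."""
    rows = []
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        with open(path, newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows

# With pandas + SQLAlchemy the same idea is roughly:
#   df = pd.concat(pd.read_csv(p) for p in glob.glob("./data/*.csv"))
#   df.to_sql("employees", engine, if_exists="append", index=False)
```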
### ✅ Example Output:
---
## 👩💻 Script 4: analysis.py — First analysis of the data
### 🔧 Libraries that we are going to need:
| Library | Description |
|--------------------|-------------------------------------------------------|
| PySpark | Apache Spark Python API (for big data). |
| matplotlib.pyplot | To create visualizations (histograms and bar charts). |
| logging | To track execution flow and info messages. |
### 📖 Explanation of the Code:
- This script:
- Reads multiple CSV files using PySpark and combines them into a single DataFrame.
- Calculates the age of each employee based on their date of birth and shows basic statistics.
- Generates age distribution plots using matplotlib (histogram + bar chart with labels).
- Performs department and city analysis, including counts and turnover (employees who left).
- Logs activity and minimizes Spark output verbosity for clarity.
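The age calculation can be sketched for a single record. The real script does this with PySpark column expressions over the full dataset; plain datetime shows the same rule.

```python
# Age rule: whole years elapsed, minus one if the birthday hasn't occurred yet.
from datetime import date

def age(date_birth: date, today: date) -> int:
    """Whole years between the dates, subtracting one before the birthday."""
    years = today.year - date_birth.year
    if (today.month, today.day) < (date_birth.month, date_birth.day):
        years -= 1
    return years
```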
### ✅ Example Output:





---
## 👩💻 Script 5: queries.py — Create SQL queries
### 🔧 Libraries that we are going to need:
| Library | Description |
|------------|-------------------------------------------------|
| psycopg2   | PostgreSQL database driver.                     |
| pandas | For working with CSVs and DataFrames. |
| connection | Custom local module to establish DB connection. |
| locale | Built-in module for localization/formatting. |
| sys | Built-in module to modify the system path. |
### 📖 Explanation of the Code:
- This script:
- Uses a custom connection() function to establish a PostgreSQL connection.
- Tries to set locale to Spanish (es_ES.UTF-8) for formatting purposes.
- Runs SQL queries using run_query(), returning results as a pandas DataFrame.
- Includes analysis functions (with more to add) by city, department, and age, calculating turnover rates and salaries for active employees.
- Executes all analyses and prints them when the script is run directly.
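The exact SQL lives in queries.py; as a hedged sketch, hired_and_terminated() might count events per year along these lines:

```sql
-- Hires per year (PostgreSQL). Termination counts work the same way
-- with termination_date in place of hire_date.
SELECT EXTRACT(YEAR FROM hire_date)::int AS year,
       COUNT(*) AS hired
FROM employees
GROUP BY year
ORDER BY year;
```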
### ✅ Example Output:
- **by_city()**

- **by_department()**

- **by_age()**

- **salary_by_city()**

- **salary_by_department()**

- **salary_by_age()**

- **hired_and_terminated()**

- **hired_and_terminated_department()**

---
## 👩💻 Script 6: show_results.py — Plot SQL queries
### 🔧 Libraries that we are going to need:
| Library | Description |
|-------------------|-------------------------------------------------------|
| matplotlib.pyplot | To create visualizations (histograms and bar charts). |
| seaborn | For making nice statistical plots easily. |
| queries           | Custom local module providing the SQL query functions. |
### 📖 Explanation of the Code:
- This script:
- Imports data from predefined SQL queries (like by_city, by_age, etc.) using custom functions.
- Creates charts with Seaborn and Matplotlib to visualize employee data.
- Plots bar charts for active employees and salaries by city and department.
- Plots a line chart showing turnover rate by age, with value labels.
- Plots a line chart showing yearly hires and terminations, including count labels.
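One of the bar charts can be sketched as below. The city names and counts are made up, since the real script pulls them from the queries module, and the seaborn styling is omitted for brevity.

```python
# Labeled bar chart of active employees per city, rendered off-screen.
import matplotlib
matplotlib.use("Agg")  # no display needed
import matplotlib.pyplot as plt

def plot_by_city(cities, counts, out_path="by_city.png"):
    """Save a bar chart with value labels, like the example outputs."""
    fig, ax = plt.subplots()
    bars = ax.bar(cities, counts)
    ax.bar_label(bars)                  # value labels on each bar
    ax.set_xlabel("City")
    ax.set_ylabel("Active employees")
    ax.set_title("Active employees by city")
    fig.savefig(out_path)
    plt.close(fig)
    return out_path
```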
### ✅ Example Output:
- **plot_by_city()**

- **plot_by_department()**

- **plot_by_age()**

- **plot_salary_by_city()**

- **plot_salary_by_department()**

- **plot_hired_and_terminated()**

---
## 👩💻 Script 7: prediction.py — Predict employees hired and terminated
### 🔧 Libraries that we are going to need:
| Library | Description |
|----------------------|-------------------------------------------------------------|
| sys | Built-in module to modify the system path. |
| connection | Custom local module to establish DB connection. |
| queries              | Custom local module providing the SQL query functions.      |
| psycopg2             | PostgreSQL database driver.                                 |
| pandas | For working with CSVs and DataFrames. |
| locale | Built-in module for localization/formatting. |
| matplotlib.pyplot | To create visualizations (histograms and bar charts). |
| numpy | For working with numerical data, especially arrays/matrices.|
| sklearn.linear_model | Linear regression models used to predict future values.     |
### 📖 Explanation of the Code:
- This script:
- Connects to a database and gets data about how many people were hired and fired each year.
- Learns the trend using machine learning (linear regression).
- Predicts how many people will be hired and fired in the next 3 years.
- Shows the results in a table.
- Draws a chart to compare real and predicted numbers.
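The trend step can be sketched without sklearn: the script fits sklearn.linear_model.LinearRegression on yearly counts, and an ordinary-least-squares line in plain Python captures the same idea. The numbers in the usage are illustrative, not the project's data.

```python
# Least-squares line fit and 3-year projection, mirroring the script's
# LinearRegression-on-yearly-counts approach.
def fit_line(years, counts):
    """Return slope and intercept of the least-squares line y = a*x + b."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(counts) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, counts)) / \
        sum((x - mean_x) ** 2 for x in years)
    b = mean_y - a * mean_x
    return a, b

def predict_next(years, counts, horizon=3):
    """Project the fitted line `horizon` years past the last observed year."""
    a, b = fit_line(years, counts)
    return {y: round(a * y + b) for y in range(years[-1] + 1, years[-1] + 1 + horizon)}
```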
### ✅ Example Output:


---
# 🔮 Future Enhancements
- [x] Add DBT models for transformation and documentation.
- [Link Repo](https://github.com/CamilaJaviera91/dbt-transformations-sql-mock-data)
- [x] Streamline data generation for large-scale datasets.
- [Link Repo](https://github.com/CamilaJaviera91/mock-data-factory)
- [ ] Add Airflow DAG for orchestration.
- [x] Deploy insights via Looker Studio or Power BI dashboard.
- [Link Looker Studio](https://lookerstudio.google.com/u/0/reporting/2f57d2bd-7afe-4c5b-8793-303f85687b22/page/tEnnC)