# 🪪 Synthetic Employee Dataset: SQL, PySpark & ML Pipeline
## SQL Mock Data

- This project generates synthetic employee data using Python and Faker, stores it in a PostgreSQL database, and performs analytics and machine learning modeling using PySpark and Scikit-learn. It's designed for data engineering and data science practice, focusing on realistic HR-style datasets and workflows.

- **Key features include:**
    - Synthetic data generation with customizable logic
    - PostgreSQL integration
    - PySpark data processing and transformations
    - Predictive modeling for employee attrition

## 🚀 Getting Started

### 1. Clone the repository
```
git clone https://github.com/CamilaJaviera91/sql-mock-data.git
```

### 2. Navigate into the project folder
```
cd your/route/sql-mock-data
```

### 3. Create a file named **requirements.txt** with the following content:

```
pandas
numpy
faker
psycopg2-binary
pyspark
scikit-learn
matplotlib
seaborn
```

### 4. Create a virtual environment and install dependencies
```
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

### 5. Set up the PostgreSQL database

1. Create a new database called **employees**.

2. Generate mock data.

3. Insert the mock data into the new schema.
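The mock rows generated in the next step follow the data dictionary further down; as a rough idea of what one record looks like, here is a minimal stdlib-only sketch (the project itself uses Faker; the helper name and value pools below are purely illustrative):

```python
import random
from datetime import date, timedelta

DEPARTMENTS = ["Sales", "IT", "HR", "Finance", "Marketing"]  # illustrative values

def mock_employee(emp_id: int) -> dict:
    """Build one synthetic employee row with the columns from the data dictionary."""
    first = random.choice(["Ana", "Luis", "Camila", "Jorge", "Elena"])
    last = random.choice(["Rojas", "Soto", "Perez", "Diaz", "Vega"])
    birth = date(1970, 1, 1) + timedelta(days=random.randint(0, 365 * 35))
    hired = date(2010, 1, 1) + timedelta(days=random.randint(0, 365 * 14))
    return {
        "id": emp_id,
        "name": f"{first} {last}",
        "date_birth": birth.isoformat(),
        "department": random.choice(DEPARTMENTS),
        "email": f"{first.lower()}.{last.lower()}{emp_id}@company.example",
        "phonenumber": f"+56 9 {random.randint(10000000, 99999999)}",
        "yearly_salary": random.randint(20000, 120000),
        "city": random.choice(["Santiago", "Valparaiso", "Concepcion"]),
        "hire_date": hired.isoformat(),
        "termination_date": None,  # most employees are still active
    }

rows = [mock_employee(i) for i in range(1, 6)]
print(rows[0]["name"], rows[0]["department"])
```

Scaling this idea to one million rows (and making names and phone numbers unique) is exactly what `sql_mock_data.py` does with Faker.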
### 6. Generate and insert mock data into the database
```
python python/sql_mock_data.py

python python/insert.py
```

# 📚 Data Dictionary

| Column           | Description                           | Type    |
|------------------|---------------------------------------|---------|
| id               | Unique identifier                     | Integer |
| name             | Full name of the employee             | Text    |
| date_birth       | Date of birth of the employee         | Date    |
| department       | Department where the employee works   | Text    |
| email            | Employee work email                   | Text    |
| phonenumber      | Work phone number of the employee     | Text    |
| yearly_salary    | Yearly salary in USD                  | Integer |
| city             | City where the employee lives         | Text    |
| hire_date        | Date when the employee was hired      | Date    |
| termination_date | Date when the employee was terminated | Date    |

# 📁 Project Structure

```
sql-mock-data/
├── data/
│   └── *.csv                  # Synthetic employee data files
├── images/
│   └── pic*.png               # Visualizations and example outputs
├── python/
│   ├── sql_mock_data.py       # Script to generate synthetic data
│   ├── insert.py              # Script to insert data into PostgreSQL
│   ├── analysis.py            # Data analysis using PySpark
│   ├── queries.py             # SQL queries for data retrieval
│   ├── show_results.py        # Visualization of query results
│   └── connection.py          # Database connection setup
├── sql/
│   └── schema.sql             # SQL schema definitions
├── .gitignore                 # Specifies files to ignore in Git
└── README.md                  # Project documentation
```

# 🔥 Introduction to PySpark
- **PySpark** is the Python API for Apache Spark, enabling the use of Spark from Python.
## 🔑 Key Features:

1. **Distributed Computing:** Processes large datasets across a cluster of computers for scalability.

2. **In-Memory Processing:** Speeds up computation by reducing disk I/O.

3. **Lazy Evaluation:** Operations are only executed when an action is triggered, optimizing performance.

4. **Rich Libraries:**
    - **Spark SQL:** Structured data processing (like SQL operations).
    - **MLlib:** Machine learning library for scalable algorithms.
    - **GraphX:** Graph processing (via the RDD API).
    - **Spark Streaming:** Real-time stream processing.

5. **Compatibility:** Works with Hadoop, HDFS, Hive, Cassandra, etc.

6. **Resilient Distributed Datasets (RDDs):** Low-level API for distributed data handling.

7. **DataFrames & Datasets:** High-level APIs for structured data with SQL-like operations.

## ✅ Pros — ❌ Cons

| Pros                                                  | Cons                                            |
|-------------------------------------------------------|-------------------------------------------------|
| Handles massive datasets efficiently.                 | Can be memory-intensive.                        |
| Compatible with many tools (Hadoop, Cassandra, etc.). | Complex configuration for cluster environments. |
| Built-in libraries for SQL and machine learning.      |                                                 |

## 🔧 Install PySpark

1. Install via pip

```
pip install pyspark
```

2. Verify the installation

```
python3 -c "import pyspark; print(pyspark.__version__)"
```

---

# 🗃️ Introduction to SQL (Structured Query Language)

- **SQL** is how we read, write, and manage data stored in databases.
## 🔑 Key Features:

1. **Data Querying:** You can retrieve exactly the data you need using the SELECT statement.
```
SELECT * FROM employees WHERE department = 'HR';
```

2. **Data Manipulation:** SQL lets you insert, update, or delete records.

    - INSERT
    - UPDATE
    - DELETE

3. **Data Definition:** You can create or change the structure of tables and databases.

    - CREATE
    - ALTER
    - DROP

4. **Data Control:** SQL allows you to control access to the data.

    - GRANT
    - REVOKE

5. **Transaction Control:** Manage multiple steps as a single unit.

    - BEGIN
    - COMMIT
    - ROLLBACK

6. **Filtering and Sorting:**

    - WHERE
    - ORDER BY
    - GROUP BY
    - HAVING

7. **Joins:** Combine data from multiple tables.

8. **Built-in Functions:** SQL includes powerful functions for calculations, text handling, dates, etc.

9. **Standardized Language:** SQL is used across most relational database systems (like PostgreSQL, MySQL, SQL Server, etc.), with only slight differences.

10. **Declarative Nature:** You tell SQL what you want, not how to do it. The database figures out the best way.

## ✅ Pros — ❌ Cons

| Pros                            | Cons                           |
|---------------------------------|--------------------------------|
| Easy to Learn and Use.          | Not Ideal for Complex Logic.   |
| Efficient Data Management.      | Different Dialects.            |
| Powerful Querying Capabilities. | Can Get Complicated.           |
| Standardized Language.          | Limited for Unstructured Data. |
| Scalable.                       | Performance Tuning Required.   |
| Secure.                         |                                |
| Supports Transactions.          |                                |
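To see transaction control in practice, here is a self-contained sketch using Python's built-in `sqlite3` module standing in for PostgreSQL (the table and values are illustrative; the BEGIN/COMMIT/ROLLBACK semantics carry over to Postgres via psycopg2):

```python
import sqlite3

# SQLite keeps the example self-contained; the transaction-control
# concepts from the feature list above apply to PostgreSQL the same way.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, department TEXT)")

try:
    with conn:  # opens a transaction; commits on success, rolls back on any error
        conn.execute("INSERT INTO employees (name, department) VALUES (?, ?)", ("Ana", "HR"))
        conn.execute("INSERT INTO employees (name, department) VALUES (?, ?)", ("Luis", "IT"))
except sqlite3.Error:
    pass  # the rollback already undid any partial work

count = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
print(count)  # 2: both inserts committed together as one unit
```

If either INSERT had failed, neither row would have been committed; that is the "multiple steps as a single unit" behavior.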
---

# 🐳 Introduction to Docker

- **Docker** is a tool that lets you package your app with everything it needs, so it runs the same anywhere.

- It does this using containers, which are like small, lightweight virtual machines.

## 🔑 Key Features:

1. **Containers:** Run apps in isolated environments.

2. **Images:** Blueprints for containers (created using a Dockerfile).

3. **Portability:** Works the same on any system with Docker.

4. **Speed:** Starts apps quickly.

5. **Docker Hub:** A place to share and download app images.

## ✅ Pros — ❌ Cons

| Pros                              | Cons                                                  |
|-----------------------------------|-------------------------------------------------------|
| Works the same everywhere.        | Takes some time to learn.                             |
| Fast and lightweight.             | Not ideal for apps that need a full operating system. |
| Easy to share apps.               | Security risks if not set up properly.                |
| Good for automating deployments.  | Managing data storage can be tricky.                  |
| Great for teams working together. |                                                       |

## 🔧 Install Docker on Fedora

1. Update the system:

```
sudo dnf update -y
```

2. Install the packages needed for HTTPS repositories:

```
sudo dnf install dnf-plugins-core -y
```

3. Add the official Docker repository:

```
sudo dnf config-manager --add-repo https://download.docker.com/linux/fedora/docker-ce.repo
```

4. Install Docker Engine:

```
sudo dnf install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -y
```

5. Enable and start the Docker service:

```
sudo systemctl enable docker
sudo systemctl start docker
```
6. Verify that Docker is running:

```
sudo docker run hello-world
```

7. (Optional) Run Docker without sudo:

- If you want to use Docker without typing sudo every time:

```
sudo usermod -aG docker $USER
```

Then log out and log back in (or reboot your system) for the change to take effect.

---

# 🛠️ Code Explanation

## 👩‍💻 Script 1: sql_mock_data.py — Generate Mock Data

### 🔧 Libraries that we are going to need:

| Library   | Description                                                    |
|-----------|----------------------------------------------------------------|
| PySpark   | Apache Spark Python API (for big data).                        |
| Faker     | Fake data generator (used for names, etc.).                    |
| unidecode | Removes accents from characters (e.g., é → e).                 |
| random    | For generating random numbers, probabilities, selections, etc. |
| os        | For cross-platform file handling and directory management.     |
| shutil    | For managing file system operations in automation scripts.     |

### 📖 Explanation of the Code:

- This script:

    - Creates 1 million fake employee records, each with realistic personal and job data.

    - Saves them across 12 cleanly named CSV files.

    - Ensures names and phone numbers are unique.

    - Can easily be scaled up or reused for testing, demos, or training.

### ✅ Example Output:

<img src="./images/pic2.png" alt="mock_data" width="500"/>

---

## 👩‍💻 Script 2: edit_data.py — Edit Mock Data

### 🔧 Libraries that we are going to need:

| Library | Description                                                                  |
|---------|------------------------------------------------------------------------------|
| pandas  | For working with CSVs and DataFrames.                                        |
| os      | For cross-platform file handling and directory management.                   |
| random  | For generating random numbers, shuffling data, and making random selections. |

### 📖 Explanation of the Code:

- This script:

    - Reads all .csv files from a folder called data, and saves enriched versions to data_enriched.

    - Reads a list of known female first names from a text file (female_names.txt) to help determine gender.

    - Provides a list of 20 possible job titles for each department (Sales, IT, HR, etc.) to assign randomly.

    - For every CSV:
        - Adds a status column (Active or Inactive depending on termination_date).
        - Adds a gender column using the first name.
        - Adds a job_title column based on the department.

    - Writes the enriched data to a new CSV in the data_enriched folder and prints a confirmation.

### ✅ Example Output:

<img src="./images/pic23.png" alt="mock_data" width="500"/>

---

## 👩‍💻 Script 3: insert.py — Insert data into PostgreSQL

### 🔧 Libraries that we are going to need:

| Library       | Description                                                |
|---------------|------------------------------------------------------------|
| pandas        | For working with CSVs and DataFrames.                      |
| sqlalchemy    | Python SQL toolkit and ORM.                                |
| psycopg2      | PostgreSQL driver required by SQLAlchemy.                  |
| python-dotenv | Loads environment variables from a `.env` file.            |
| glob          | Standard library for file pattern matching.                |
| os            | For cross-platform file handling and directory management. |
### 📖 Explanation of the Code:

- This script:

    - Finds all CSV files in the ./data/ folder using glob.

    - Reads and combines all the CSVs into a single pandas DataFrame.

    - Creates a connection to a PostgreSQL database using SQLAlchemy.

    - Uploads the combined data to the employees table in the database.

### ✅ Example Output:

<img src="./images/pic3.png" alt="mock_data" width="500"/>

---

## 👩‍💻 Script 4: analysis.py — First analysis of the data

### 🔧 Libraries that we are going to need:

| Library            | Description                                           |
|--------------------|-------------------------------------------------------|
| PySpark            | Apache Spark Python API (for big data).               |
| matplotlib.pyplot  | To create visualizations (histograms and bar charts). |
| logging            | To track execution flow and info messages.            |

### 📖 Explanation of the Code:

- This script:

    - Reads multiple CSV files using PySpark and combines them into a single DataFrame.

    - Calculates the age of each employee based on their date of birth and shows basic statistics.

    - Generates age distribution plots using matplotlib (histogram + bar chart with labels).

    - Performs department and city analysis, including counts and turnover (employees who left).

    - Logs activity and minimizes Spark output verbosity for clarity.

### ✅ Example Output:

<img src="./images/pic4.png" alt="mock_data" width="500"/>

<br>

<img src="./images/pic5.png" alt="mock_data" width="500"/>

<br>

<img src="./images/pic6.png" alt="mock_data" width="500"/>

<br>

<img src="./images/pic7.png" alt="mock_data" width="500"/>

<br>

<img src="./images/pic8.png" alt="mock_data" width="500"/>
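The age calculation this script performs comes down to simple date arithmetic; a stdlib-only sketch of the idea (the actual script expresses this with PySpark column operations):

```python
from datetime import date

def age_on(birth: date, today: date) -> int:
    """Whole years between birth and today, counting only birthdays that have passed."""
    years = today.year - birth.year
    # Subtract one if this year's birthday hasn't happened yet.
    if (today.month, today.day) < (birth.month, birth.day):
        years -= 1
    return years

print(age_on(date(1990, 6, 15), date(2025, 6, 14)))  # 34: birthday not yet reached
print(age_on(date(1990, 6, 15), date(2025, 6, 15)))  # 35
```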
---

## 👩‍💻 Script 5: queries.py — Create SQL queries

### 🔧 Libraries that we are going to need:

| Library    | Description                                     |
|------------|-------------------------------------------------|
| psycopg2   | PostgreSQL database driver.                     |
| pandas     | For working with CSVs and DataFrames.           |
| connection | Custom local module to establish DB connection. |
| locale     | Built-in module for localization/formatting.    |
| sys        | Built-in module to modify the system path.      |

### 📖 Explanation of the Code:

- This script:

    - Uses a custom connection() function to establish a PostgreSQL connection.

    - Tries to set the locale to Spanish (es_ES.UTF-8) for formatting purposes.

    - Runs SQL queries using run_query(), returning results as a pandas DataFrame.

    - Includes six analysis functions (with more to be added) grouped by city, department, and age, calculating turnover rates and salaries for active employees.

    - Executes all analyses and prints them when the script is run directly.

### ✅ Example Output:

- **by_city()**

<img src="./images/pic9.png" alt="mock_data" width="500"/>

- **by_department()**

<img src="./images/pic10.png" alt="mock_data" width="500"/>

- **by_age()**

<img src="./images/pic11.png" alt="mock_data" width="500"/>

- **salary_by_city()**

<img src="./images/pic12.png" alt="mock_data" width="500"/>

- **salary_by_department()**

<img src="./images/pic14.png" alt="mock_data" width="500"/>

- **salary_by_age()**

<img src="./images/pic13.png" alt="mock_data" width="500"/>

- **hired_and_terminated()**

<img src="./images/pic15.png" alt="mock_data" width="500"/>

- **hired_and_terminated_department()**

<img src="./images/pic16.png" alt="mock_data" width="500"/>
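A helper in the spirit of run_query() can be sketched with pandas; this version runs against SQLite so it is self-contained (the function body, table, and values are assumptions for illustration; the repo's implementation targets PostgreSQL through its connection() module):

```python
import sqlite3
import pandas as pd

def run_query(conn, sql: str, params: tuple = ()) -> pd.DataFrame:
    """Execute a SQL query and return the result as a pandas DataFrame."""
    return pd.read_sql_query(sql, conn, params=params)

# In-memory SQLite database standing in for the PostgreSQL 'employees' table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, city TEXT, yearly_salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ana", "Santiago", 52000), ("Luis", "Santiago", 61000), ("Elena", "Valparaiso", 58000)],
)

df = run_query(
    conn,
    "SELECT city, AVG(yearly_salary) AS avg_salary FROM employees GROUP BY city ORDER BY city",
)
print(df)
```

Returning a DataFrame from every query function is what lets show_results.py plot the results directly.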
---

## 👩‍💻 Script 6: show_results.py — Plot SQL queries

### 🔧 Libraries that we are going to need:

| Library           | Description                                             |
|-------------------|---------------------------------------------------------|
| matplotlib.pyplot | To create visualizations (histograms and bar charts).   |
| seaborn           | For making nice statistical plots easily.               |
| queries           | Custom local module containing the SQL query functions. |

### 📖 Explanation of the Code:

- This script:

    - Imports data from predefined SQL queries (like by_city, by_age, etc.) using custom functions.

    - Creates charts with Seaborn and Matplotlib to visualize employee data.

    - Plots bar charts for active employees and salaries by city and department.

    - Plots a line chart showing turnover rate by age, with value labels.

    - Plots a line chart showing yearly hires and terminations, including count labels.

### ✅ Example Output:

- **plot_by_city()**

<img src="./images/pic17.png" alt="mock_data" width="500"/>

- **plot_by_department()**

<img src="./images/pic18.png" alt="mock_data" width="500"/>

- **plot_by_age()**

<img src="./images/pic19.png" alt="mock_data" width="500"/>

- **plot_salary_by_city()**

<img src="./images/pic20.png" alt="mock_data" width="500"/>

- **plot_salary_by_department()**

<img src="./images/pic21.png" alt="mock_data" width="500"/>

- **plot_hired_and_terminated()**

<img src="./images/pic22.png" alt="mock_data" width="500"/>

---

## 👩‍💻 Script 7: prediction.py — Predict employees hired and terminated

### 🔧 Libraries that we are going to need:

| Library              | Description                                                  |
|----------------------|--------------------------------------------------------------|
| sys                  | Built-in module to modify the system path.                   |
| connection           | Custom local module to establish DB connection.              |
| queries              | Custom local module containing the SQL query functions.      |
| psycopg2             | PostgreSQL database driver.                                  |
| pandas               | For working with CSVs and DataFrames.                        |
| locale               | Built-in module for localization/formatting.                 |
| matplotlib.pyplot    | To create visualizations (histograms and bar charts).        |
| numpy                | For working with numerical data, especially arrays/matrices. |
| sklearn.linear_model | Linear regression models used to predict future values.      |

### 📖 Explanation of the Code:

- This script:

    - Connects to a database and gets data about how many people were hired and fired each year.

    - Learns the trend using machine learning (linear regression).

    - Predicts how many people will be hired and fired in the next 3 years.

    - Shows the results in a table.

    - Draws a chart to compare real and predicted numbers.

### ✅ Example Output:

<img src="./images/pic24.png" alt="mock_data" width="500"/>

<img src="./images/pic25.png" alt="mock_data" width="500"/>

---

# 🔮 Future Enhancements

- [x] Add DBT models for transformation and documentation.
    - [Link Repo](https://github.com/CamilaJaviera91/dbt-transformations-sql-mock-data)
- [x] Streamline data generation for large-scale datasets.
    - [Link Repo](https://github.com/CamilaJaviera91/mock-data-factory)
- [ ] Add Airflow DAG for orchestration.
- [x] Deploy insights via Looker Studio or Power BI dashboard.
    - [Link Looker Studio](https://lookerstudio.google.com/u/0/reporting/2f57d2bd-7afe-4c5b-8793-303f85687b22/page/tEnnC)