https://github.com/athul64/exploratory-data-analysis
To preprocess and analyze the given employee dataset, present the findings graphically, and derive meaningful insights to help better understand the company’s workforce.
colab-notebook data-analysis data-visualization matplotlib numpy pandas python seaborn statistical-analysis
- Host: GitHub
- URL: https://github.com/athul64/exploratory-data-analysis
- Owner: Athul64
- Created: 2024-12-30T14:41:23.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-31T10:20:17.000Z (over 1 year ago)
- Last Synced: 2025-11-08T14:03:36.281Z (5 months ago)
- Topics: colab-notebook, data-analysis, data-visualization, matplotlib, numpy, pandas, python, seaborn, statistical-analysis
- Language: Jupyter Notebook
- Homepage:
- Size: 376 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Employee Data Analysis Project
Welcome to the Employee Data Analysis Project! This project is a comprehensive exploration of a dataset from ABC Company, aimed at deriving valuable insights into the employee data through preprocessing, analysis, and visualization. Below, you'll find an overview of the project components, methodologies, and findings.
## Project Objective
To preprocess and analyze the given employee dataset, present the findings graphically, and derive meaningful insights to help better understand the company’s workforce.
## Dataset
The dataset contains **458 rows** and **9 columns** and includes information about employees across various teams and positions. The columns include:
- `Team`
- `Position`
- `Age`
- `Salary`
- `Height`
- `Name`
- `Number`
- `Weight`
- `College`
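
As a quick sanity check on the structure described above, a minimal sketch like the following could confirm that a loaded frame has the listed columns and report its null counts. The helper name `describe_dataset` and the two sample rows are hypothetical; in the project the frame would presumably come from `pd.read_csv("data.csv")`.

```python
import pandas as pd

# Column names as listed in this README.
EXPECTED_COLUMNS = [
    "Team", "Position", "Age", "Salary", "Height",
    "Name", "Number", "Weight", "College",
]


def describe_dataset(df: pd.DataFrame) -> dict:
    """Summarize shape, missing expected columns, and null counts per column."""
    return {
        "shape": df.shape,
        "missing_columns": [c for c in EXPECTED_COLUMNS if c not in df.columns],
        "null_counts": df.isnull().sum().to_dict(),
    }


# Hypothetical two-row sample; the values are made up for illustration only.
sample = pd.DataFrame({
    "Team": ["X", "Y"],
    "Position": ["PG", "C"],
    "Age": [24, 30],
    "Salary": [100.0, None],
    "Height": ["175", "06-Feb"],
    "Name": ["A", "B"],
    "Number": [1, 2],
    "Weight": [80, 95],
    "College": [None, "U1"],
})

report = describe_dataset(sample)
print(report["shape"], report["missing_columns"])
```

On the real file, `report["shape"]` would be expected to read `(458, 9)` if the description above holds.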
## Preprocessing Steps
1. **Handling Missing Data**:
   - Missing values in the `Salary` column were replaced with the **median salary**.
   - Missing values in the `College` column were replaced with the **most frequent value (mode)**.
2. **Data Correction**:
   - Replaced inconsistent values in the `Height` column with random values between **150 cm and 180 cm**, seeding the generator with `np.random.seed(42)` for reproducibility.
3. **Data Cleaning**:
   - Checked the dataset for duplicates and remaining null values after preprocessing.
4. **Export**:
   - Saved the cleaned dataset as `cleaned_data.csv` for further analysis.
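
The steps above can be sketched roughly as follows. This is an illustration, not the notebook's actual code: the function name `preprocess` and the sample rows are hypothetical, and the rule used here for an "inconsistent" height (any value that does not parse as a number) is an assumption, since the README does not define it.

```python
import numpy as np
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of the employee data (illustrative sketch)."""
    out = df.copy()

    # Salary: fill missing values with the column median.
    out["Salary"] = out["Salary"].fillna(out["Salary"].median())

    # College: fill missing values with the most frequent value (mode).
    out["College"] = out["College"].fillna(out["College"].mode().iloc[0])

    # Height: ASSUMPTION - "inconsistent" means "does not parse as a number".
    height = pd.to_numeric(out["Height"], errors="coerce")
    bad = height.isna()
    if bad.any():
        np.random.seed(42)  # seed named in the README, for reproducibility
        # randint's upper bound is exclusive, so 181 makes 180 reachable.
        height[bad] = np.random.randint(150, 181, size=int(bad.sum()))
    out["Height"] = height.astype(int)

    return out


# Hypothetical rows, made up for illustration only.
raw = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Salary": [100.0, np.nan, 300.0, 500.0],
    "College": ["U1", None, "U1", "U2"],
    "Height": ["175", "06-Feb", "160", "Jun-00"],
})

clean = preprocess(raw)

# Data cleaning check: count remaining nulls and duplicate rows.
print(clean.isnull().sum().sum(), clean.duplicated().sum())

# Export step from the README (uncomment to write the file):
# clean.to_csv("cleaned_data.csv", index=False)
```

The median is used for `Salary` rather than the mean because it is less sensitive to a few very high salaries, which is a common reason for that choice.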
## Analysis Tasks
1. **Distribution of Employees Across Teams**:
   - Calculated the percentage split of employees across teams.
   - Visualized the distribution using a **pie chart**.
2. **Employee Segregation by Position**:
   - Grouped employees based on their positions.
   - Visualized the counts using a **horizontal bar chart**.
3. **Predominant Age Group**:
   - Identified the most frequent age group among employees.
   - Presented the data using a **histogram**.
4. **Salary Expenditure Analysis**:
   - Determined which team and position had the highest total salary expenditure.
   - Visualized the data using a **stacked bar chart**.
5. **Correlation Between Age and Salary**:
   - Computed the correlation coefficient to identify relationships.
   - Represented the data using a **scatter plot**.
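
The computations behind these five tasks could look something like the sketch below. The sample rows are made up, the 5-year age bins are an assumption (the README does not say how age groups were defined), and the correlation shown is pandas' default Pearson coefficient, which may or may not match what the notebook uses.

```python
import pandas as pd

# Hypothetical rows standing in for cleaned_data.csv.
df = pd.DataFrame({
    "Team": ["X", "X", "X", "Y"],
    "Position": ["PG", "SG", "PG", "C"],
    "Age": [24, 27, 31, 27],
    "Salary": [100, 200, 300, 350],
})

# 1. Percentage split of employees across teams.
team_pct = df["Team"].value_counts(normalize=True) * 100

# 2. Number of employees in each position.
position_counts = df["Position"].value_counts()

# 3. Predominant age group. ASSUMPTION: 5-year bins, closed on the left.
age_groups = pd.cut(df["Age"], bins=[20, 25, 30, 35, 40], right=False)
top_age_group = age_groups.value_counts().idxmax()

# 4. Salary expenditure: total per team, and per (team, position) pair.
team_spend = df.groupby("Team")["Salary"].sum()
pair_spend = df.groupby(["Team", "Position"])["Salary"].sum()
top_team = team_spend.idxmax()
top_pair = pair_spend.idxmax()

# 5. Pearson correlation between age and salary.
age_salary_corr = df["Age"].corr(df["Salary"])

print(team_pct.to_dict(), top_age_group, top_team, top_pair)
```

Each result is a plain pandas object, so it can be passed straight to a plotting call for the chart named in the corresponding task.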
## Visualizations
The project includes the following visualizations:
1. **Pie Chart**: Employee distribution across teams.
2. **Bar Chart**: Number of employees in each position.
3. **Histogram**: Predominant age group.
4. **Stacked Bar Chart**: Salary expenditure by team and position.
5. **Scatter Plot**: Age vs. Salary correlation.
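
For readers who want to reproduce charts of these five kinds, here is a minimal sketch using pandas' built-in plotting on top of `matplotlib`. It is not the notebook's code: the project also lists `seaborn`, the sample rows and chart titles are hypothetical, and the 2x3 grid layout is just one convenient way to show everything in a single figure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows standing in for cleaned_data.csv.
df = pd.DataFrame({
    "Team": ["X", "X", "X", "Y"],
    "Position": ["PG", "SG", "PG", "C"],
    "Age": [24, 27, 31, 27],
    "Salary": [100, 200, 300, 350],
})

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Pie chart: share of employees per team.
df["Team"].value_counts().plot.pie(ax=axes[0, 0], autopct="%1.1f%%")
axes[0, 0].set_ylabel("")
axes[0, 0].set_title("Employees by team")

# Horizontal bar chart: employees per position.
df["Position"].value_counts().plot.barh(ax=axes[0, 1])
axes[0, 1].set_title("Employees by position")

# Histogram: age distribution.
df["Age"].plot.hist(ax=axes[0, 2], bins=5)
axes[0, 2].set_title("Age distribution")

# Stacked bar chart: total salary per team, split by position.
spend = df.pivot_table(index="Team", columns="Position",
                       values="Salary", aggfunc="sum", fill_value=0)
spend.plot.bar(stacked=True, ax=axes[1, 0])
axes[1, 0].set_title("Salary expenditure by team and position")

# Scatter plot: age against salary.
df.plot.scatter(x="Age", y="Salary", ax=axes[1, 1])
axes[1, 1].set_title("Age vs. Salary")

axes[1, 2].axis("off")  # sixth cell of the grid is unused
fig.tight_layout()
```

In a notebook the figure renders inline; calling `fig.savefig(...)` with a path of your choice would write it to disk instead.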
## Key Findings
- The team with the highest salary expenditure is **[Team Name]**, and the position contributing most to this expenditure is **[Position Name]**.
- The most predominant age group among employees is **[Age Group]**.
- There is a **[weak/moderate/strong] correlation** between age and salary, indicating **[specific insight, e.g., older employees tend to earn more/less].**
## How to Run the Project
1. Clone this repository:
```bash
git clone https://github.com/athul64/exploratory-data-analysis
```
2. Install the required Python libraries:
```bash
pip install numpy pandas matplotlib seaborn
```
3. Run the Jupyter Notebook:
```bash
jupyter notebook "Exploratory Data Analysis.ipynb"
```
## Files in the Repository
- `Exploratory Data Analysis.ipynb`: The main Jupyter Notebook containing code and analysis.
- `data.csv`: The original dataset.
- `cleaned_data.csv`: The preprocessed dataset.
- `README.md`: Project overview and instructions.
## Tools Used
- **Python Libraries**:
  - `numpy` for numerical operations and seeded random value generation
  - `pandas` for data loading, cleaning, and analysis
  - `matplotlib` and `seaborn` for visualizations
## Future Improvements
- Enhance visualizations by adding interactive plots using `plotly` or `dash`.
- Perform advanced statistical analysis to uncover deeper insights.
- Automate the preprocessing and analysis steps for scalability.
## License
This project is licensed under the [MIT License](LICENSE).
---
If you have any questions or feedback, feel free to raise an issue or reach out. Thank you for exploring this project!