https://github.com/mochsyahrizal/project_data

This is my repo for practicing python for data analytics

data-analytics jupyter-notebook pandas-library python visual-studio-code


# Project Data Practicing Basic and Capstone

## Project Background
### Basic Practices
In this part of the project, I practice basic Python and libraries such as pandas, Matplotlib, and seaborn. I also practice working in a new environment, Visual Studio Code, including setting up extensions and using version control with Git and GitHub.

### Capstone Project
Welcome to my analysis of the data job market, focusing on data analyst roles. This project was created out of a desire to navigate and understand the job market more effectively. It delves into the top-paying and in-demand skills to help find optimal job opportunities for data analysts.

The data is sourced from [Luke Barousse's Python Course](https://lukebarousse.com/python), which provides the foundation for my analysis and contains detailed information on job titles, salaries, locations, and essential skills. Through a series of Python scripts, I explore key questions such as the most demanded skills, salary trends, and the intersection of demand and salary in data analytics.

# The Questions

Below are the questions I want to answer in my project:

1. What are the skills most in demand for the top 3 most popular data roles?
2. How are in-demand skills trending for Data Analysts?
3. How well do jobs and skills pay for Data Analysts?
4. What are the optimal skills for data analysts to learn? (High Demand AND High Paying)

# Tools I Used

For my deep dive into the data analyst job market, I harnessed the power of several key tools:

- **Python:** The backbone of my analysis, allowing me to analyze the data and find critical insights. I also used the following Python libraries:
    - **Pandas Library:** Used to analyze the data.
    - **Matplotlib Library:** Used to visualize the data.
    - **Seaborn Library:** Helped me create more advanced visuals.
- **Jupyter Notebooks:** The tool I used to run my Python scripts, which let me easily include my notes and analysis.
- **Visual Studio Code:** My go-to for executing my Python scripts.
- **Git & GitHub:** Essential for version control and for sharing my Python code and analysis, which supports collaboration and project tracking.

# Data Preparation and Cleanup

This section outlines the steps taken to prepare the data for analysis, ensuring accuracy and usability.

## Import & Clean Up Data

I start by importing necessary libraries and loading the dataset, followed by initial data cleaning tasks to ensure data quality.

```python
# Importing Libraries
import ast
import pandas as pd
import seaborn as sns
from datasets import load_dataset
import matplotlib.pyplot as plt

# Loading Data
dataset = load_dataset('lukebarousse/data_jobs')
df = dataset['train'].to_pandas()

# Data Cleanup
df['job_posted_date'] = pd.to_datetime(df['job_posted_date'])
df['job_skills'] = df['job_skills'].apply(lambda x: ast.literal_eval(x) if pd.notna(x) else x)
```
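
To illustrate what the cleanup step does, here is a minimal sketch on a hypothetical two-row frame. The sample values are made up and not taken from the dataset; the sketch only assumes, as the cleanup code above implies, that each skill list arrives as a string.

```python
import ast
import pandas as pd

# Hypothetical sample mimicking the raw columns; values are illustrative only.
sample = pd.DataFrame({
    'job_posted_date': ['2023-01-15 08:30:00', '2023-06-02 12:00:00'],
    'job_skills': ["['sql', 'python']", None],
})

# Same cleanup as above: parse dates and turn list-like strings into real lists.
sample['job_posted_date'] = pd.to_datetime(sample['job_posted_date'])
sample['job_skills'] = sample['job_skills'].apply(
    lambda x: ast.literal_eval(x) if pd.notna(x) else x
)

# The first row now holds a Python list; the missing value is left untouched.
first_skills = sample.loc[0, 'job_skills']
print(type(first_skills).__name__, first_skills)
```

Parsing the strings into lists is what later makes it possible to count individual skills, for example with `DataFrame.explode`.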

## Filter US Jobs

To focus my analysis on the U.S. job market, I apply filters to the dataset, narrowing down to roles based in the United States.

```python
# Keep only postings located in the United States
df_US = df[df['job_country'] == 'United States']
```
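
The same filtering pattern can be extended to a single role. The sketch below uses a hypothetical three-row frame; only the column names (`job_country`, `job_title_short`) follow the dataset, and the variable names `sample_US` and `sample_DA_US` are chosen for illustration.

```python
import pandas as pd

# Hypothetical postings; only the column names follow the dataset.
sample = pd.DataFrame({
    'job_title_short': ['Data Analyst', 'Data Engineer', 'Data Analyst'],
    'job_country': ['United States', 'United States', 'Germany'],
})

# Boolean mask on country; .copy() avoids chained-assignment warnings later.
sample_US = sample[sample['job_country'] == 'United States'].copy()

# Combine conditions with & (each wrapped in parentheses) to narrow to one role.
sample_DA_US = sample[
    (sample['job_country'] == 'United States')
    & (sample['job_title_short'] == 'Data Analyst')
].copy()

print(len(sample_US), len(sample_DA_US))
```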

## Learning Focus
- Python
- Libraries
- Visual Studio Code
- Git & GitHub
- Data Cleaning, Analysis & Visualization

## The Analysis (Capstone Project)
### 1. Basic EDA
View my notebook with detailed steps here:
[1_EDA_Intro.ipynb](2_Capstone_Project/1_EDA_Intro.ipynb)

Results:
![2_Capstone_Project\images\data_analyst_job_per_location.png](https://github.com/MochSyahrizal/project_data/blob/main/2_Capstone_Project/images/data_analyst_job_per_location.png)

![2_Capstone_Project\images\eda2.png](https://github.com/MochSyahrizal/project_data/blob/main/2_Capstone_Project/images/eda2.png)

![2_Capstone_Project\images\eda3.png](https://github.com/MochSyahrizal/project_data/blob/main/2_Capstone_Project/eda3.png)

### 2. What are the most demanded skills for the top 3 most popular roles?
To find the most demanded skills for the top 3 most popular data roles, I first identified the most popular roles and then took the top 5 skills for each of them. The result highlights the most popular job titles and their top skills, showing which skills I should pay attention to depending on the role I'm targeting.

View my notebook with detailed steps here:
[2_Skill_Demand.ipynb](2_Capstone_Project/2_Skill_Demand.ipynb)
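
The counting step described above can be sketched roughly as follows. This is not necessarily how the notebook does it: the sample data is hypothetical, and the intermediate names `skill_count` and `jobs_total` are assumptions. Only `job_titles`, `df_skills_perc`, `job_title_short`, `job_skills`, and `skill_percentage` match names used in the plotting code.

```python
import pandas as pd

# Hypothetical postings with already-parsed skill lists.
sample_US = pd.DataFrame({
    'job_title_short': ['Data Analyst', 'Data Analyst', 'Data Engineer', 'Data Engineer'],
    'job_skills': [['sql', 'excel'], ['sql'], ['python', 'sql'], ['python']],
})

# One row per (posting, skill) pair.
df_exploded = sample_US.explode('job_skills')

# Number of postings mentioning each skill, per job title.
df_skills_count = (
    df_exploded.groupby(['job_title_short', 'job_skills'])
    .size()
    .reset_index(name='skill_count')
)

# Total postings per job title, counted before exploding.
df_job_totals = (
    sample_US.groupby('job_title_short').size().reset_index(name='jobs_total')
)

# Share of postings per title that mention each skill, most requested first.
df_skills_perc = df_skills_count.merge(df_job_totals, on='job_title_short')
df_skills_perc['skill_percentage'] = (
    100 * df_skills_perc['skill_count'] / df_skills_perc['jobs_total']
)
df_skills_perc = df_skills_perc.sort_values(
    ['job_title_short', 'skill_percentage'], ascending=[True, False]
)

# The most common job titles (up to three) to plot.
job_titles = sample_US['job_title_short'].value_counts().index[:3].tolist()
```

Sorting by percentage in descending order matters because the plotting step takes `.head(5)` for each role.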

**Visualize the Data**
```python
# job_titles and df_skills_perc are prepared earlier in the notebook
fig, ax = plt.subplots(len(job_titles), 1)

sns.set_theme(style='ticks')

for i, job_title in enumerate(job_titles):
    # Top 5 skills for this role (assumes rows are sorted by demand, descending)
    df_plot = df_skills_perc[df_skills_perc['job_title_short'] == job_title].head(5)
    sns.barplot(data=df_plot, x='skill_percentage', y='job_skills', ax=ax[i], palette='dark:b_r')
    ax[i].set_title(job_title)
    ax[i].set_ylabel('')
    ax[i].set_xlabel('')
    ax[i].set_xlim(0, 78)

    # Label each bar with its percentage
    for n, v in enumerate(df_plot['skill_percentage']):
        ax[i].text(v + 1, n, f'{v:.0f}%', va='center')

    # Show x-axis ticks only on the bottom subplot
    if i != len(job_titles) - 1:
        ax[i].set_xticks([])

fig.suptitle('Likelihood Request of Skills in Job Posting', fontsize=15)
fig.tight_layout(h_pad=0.5)
plt.show()
```
**RESULTS**

![2_Capstone_Project\images\skill_demand_all_data_roles.png](https://github.com/MochSyahrizal/project_data/blob/main/2_Capstone_Project/images/skill_demand_all_data_roles.png)

**Insights**
- Python is a versatile skill, in high demand across all three roles, but most prominently for Data Scientists (72%) and Data Engineers (65%).
- SQL is the most requested skill for Data Analysts and Data Scientists, appearing in over half of the job postings for both roles. For Data Engineers, Python is the most sought-after skill, appearing in 68% of job postings.
- Data Engineers require more specialized technical skills (AWS, Azure, Spark) than Data Analysts and Data Scientists, who are expected to be proficient in more general data management and analysis tools (Excel, Tableau).

### 3. How are in-demand skills trending for Data Analysts?

To find how skills are trending in 2023 for Data Analysts, I filtered data analyst positions and grouped the skills by the month of the job postings. This got me the top 5 skills of data analysts by month, showing how popular skills were throughout 2023.

View my notebook with detailed steps here:
[3_Skills_Trend.ipynb](2_Capstone_Project/3_Skills_Trend.ipynb)
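
A rough sketch of the monthly grouping described above, on hypothetical data. It uses `pd.crosstab` for the month-by-skill counts, which may differ from the approach in the notebook. The names `sample_DA_US`, `job_posted_month_no`, `df_counts`, and `monthly_totals` are assumptions; `df_DA_US_percent` matches the name used in the plotting code.

```python
import pandas as pd

# Hypothetical US data analyst postings with parsed dates and skill lists.
sample_DA_US = pd.DataFrame({
    'job_posted_date': pd.to_datetime(
        ['2023-01-05', '2023-01-20', '2023-02-10', '2023-02-11']
    ),
    'job_skills': [['sql', 'excel'], ['sql'], ['sql', 'python'], ['excel']],
})

# Month number of each posting.
sample_DA_US['job_posted_month_no'] = sample_DA_US['job_posted_date'].dt.month

# One row per (posting, skill), then a month x skill table of counts.
df_exploded = sample_DA_US.explode('job_skills')
df_counts = pd.crosstab(
    df_exploded['job_posted_month_no'], df_exploded['job_skills']
)

# Order columns by overall demand so the first columns are the top skills.
df_counts = df_counts[df_counts.sum().sort_values(ascending=False).index]

# Postings per month (counted before exploding) serve as the denominator.
monthly_totals = sample_DA_US.groupby('job_posted_month_no').size()

# Percentage of each month's postings that mention each skill.
df_DA_US_percent = df_counts.div(monthly_totals, axis=0) * 100
print(df_DA_US_percent)
```

Dividing by the number of postings in each month, rather than plotting raw counts, keeps months with more postings from dominating the trend.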

**Visualize the Data**

```python

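from matplotlib.ticker import PercentFormatter

# df_DA_US_percent is prepared earlier in the notebook
# (assumed layout: one row per month, one column per skill)
df_plot = df_DA_US_percent.iloc[:, :5]  # keep the first 5 skill columns
sns.lineplot(data=df_plot, dashes=False, legend='full', palette='tab10')

# Format the y-axis as whole-number percentages
plt.gca().yaxis.set_major_formatter(PercentFormatter(decimals=0))

plt.show()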

```

**Results**

![2_Capstone_Project\images\trending_top_skills_da.png](https://github.com/MochSyahrizal/project_data/blob/main/2_Capstone_Project/images/trending_top_skills_da.png)
*Line graph visualizing the trending top skills for data analysts in the US in 2023.*

**Insights**
- SQL remains the most consistently demanded skill throughout the year, although it shows a gradual decrease in demand.
- Excel experienced a significant increase in demand starting around September, surpassing both Python and Tableau by the end of the year.
- Both Python and Tableau show relatively stable demand throughout the year with some fluctuations but remain essential skills for data analysts. Power BI, while less demanded compared to the others, shows a slight upward trend towards the year's end.