{"id":22158115,"url":"https://github.com/windjammer6/9.-employee-exit-data-analysis-python","last_synced_at":"2025-03-24T14:48:45.139Z","repository":{"id":155247702,"uuid":"630836965","full_name":"WindJammer6/9.-Employee-Exit-Data-Analysis-Python","owner":"WindJammer6","description":"A personal project to analyse data from a Employee Exit survey from DETE and TAFE. Python libraries used: Numpy, Pandas, Matplotlib","archived":false,"fork":false,"pushed_at":"2023-08-27T18:00:19.000Z","size":236,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-29T19:49:07.046Z","etag":null,"topics":["data-analysis","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/WindJammer6.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-21T09:05:47.000Z","updated_at":"2023-06-24T05:28:26.000Z","dependencies_parsed_at":"2025-01-29T19:43:08.175Z","dependency_job_id":"dc27d7d3-d111-41c9-bdb7-77a2a2571772","html_url":"https://github.com/WindJammer6/9.-Employee-Exit-Data-Analysis-Python","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WindJammer6%2F9.-Employee-Exit-Data-Analysis-Python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WindJammer6%2F9.-Employee-Exit-Data-Analysis-Python/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WindJammer
6%2F9.-Employee-Exit-Data-Analysis-Python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WindJammer6%2F9.-Employee-Exit-Data-Analysis-Python/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/WindJammer6","download_url":"https://codeload.github.com/WindJammer6/9.-Employee-Exit-Data-Analysis-Python/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245294754,"owners_count":20591899,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","python"],"created_at":"2024-12-02T03:18:35.144Z","updated_at":"2025-03-24T14:48:45.131Z","avatar_url":"https://github.com/WindJammer6.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 9.-Employee-Exit-Data-Analysis-Python :office_worker::arrow_right::door:\nA personal project to analyse data from a Employee Exit survey from the Department of Education, Training and Employment (DETE) and the \nTechnical and Further Education (TAFE) institute in Queensland, Australia. Python Libraries used: Numpy, Pandas, Matplotlib\n\n## Thoughts on starting this project\nMy eighth programming project, in Python. \n\nAfter my previous programming project on data analysis (8. Star-Wars-Data-Analysis-Python), I wanted to further familiarise myself on the data analysis aspect of\nPython programming and its commonly used Python libraries. 
I spent significantly less time analysing this dataset and was able to complete more complex tasks, such as combining the DETE and TAFE datasets after cleaning them to create a new dataset for further data analysis. This was because I had more resources to refer to from both the\n7.-NumPy-Pandas-Matplotlib-Learning-and-Practice-Python and 8. Star-Wars-Data-Analysis-Python projects, and more tools under my belt for data analysis (more knowledge of different functions).\n\n\u003cbr\u003e\n\nFor the guided aspect of this Employee Exit Data Analysis Project, we will answer: **'Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?'** and\n**'Are younger employees resigning due to some kind of dissatisfaction? What about older employees?'**. For the non-guided aspect, we will answer **'How many people in each age group resigned due to some kind of dissatisfaction?'**\n\n\u003cbr\u003e\n\nComputer program used for coding: VS Code\n\n## Table of Contents:\n+ [Code Description](#codedescription)\n    1. [Sources and Context](#sourcesandcontext)\n    2. [Cleaning and Preparing Data](#cleaningandpreparingdata)\n        + [Task 1](#task1)\n        + [Task 2](#task2)\n        + [Task 3](#task3)\n        + [Task 4](#task4)\n        + [Task 5](#task5)\n        + [Task 6](#task6)\n        + [Task 7](#task7)\n        + [Task 8](#task8)\n        + [Task 9](#task9)\n    3. [Data Modelling and Analysis](#datamodellingandanalysis)\n        + [Task 10](#task10)\n    4. [Analysing Other Aspects of the Dataset](#analysingother)\n        + [Task 11](#task11)\n+ [Thoughts after the project](#thoughts)\n\n\u003cbr\u003e\n\n\u003cbr\u003e\n\n## Code Description \u003ca name = \"codedescription\"\u003e\u003c/a\u003e\n\n### 1. 
Sources and Context \u003ca name = \"sourcesandcontext\"\u003e\u003c/a\u003e\n\nThe Dataquest website provides some guidance and sets the tasks for analysing some of the data in the dataset. The 'Employeeexit_data_analysis_project' folder is organised by the tasks from the website. '1.Task1.py' to '9.Task9.py' cover Cleaning and Preparing the data while '10.Task10.py' covers Data Modelling and Analysis. '11.Task11.py' analyses another aspect of the data in the dataset.\n\nI have transferred the tasks from the website as instructions into my respective code files.\n\nI also referred to a sample attempt at the guided question from the Dataquest community discussion. Some of my code (labelled with the comment 'Copied from online') is from this sample (listed in the sources below).\n\nSource(s): https://app.dataquest.io/c/60/m/348/guided-project%3A-clean-and-analyze-employee-exit-surveys/1/introduction (Dataquest (main project guide)), https://community.dataquest.io/t/guided-project-clean-and-analyze-employee-exit-surveys-by-feelingcxld/568982 (Dataquest sample attempt from the community discussion page)\n\nDatasets analysed [here](https://github.com/plasmagirl/Clean-And-Analyze-Employee-Exit-Surveys/blob/master/dete_survey.csv) (DETE Survey) and [here](https://github.com/plasmagirl/Clean-And-Analyze-Employee-Exit-Surveys/blob/master/tafe_survey.csv) (TAFE Survey)\n\n\u003cbr\u003e\n\n\u003cbr\u003e\n\n### 2. 
Cleaning and Preparing Data \u003ca name = \"cleaningandpreparingdata\"\u003e\u003c/a\u003e\n\n#### _Task 1_ \u003ca name = \"task1\"\u003e\u003c/a\u003e\n```python\nimport pandas as pd\n\ndete_survey = pd.read_csv('dete_survey.csv')\ntafe_survey = pd.read_csv('tafe_survey.csv')\n```\nNot much for Task 1, just loading the two datasets into the file. (More explanation of the code is in the comments in the code.)\n\n```python\n#'.info()' prints a concise summary of a DataFrame.\n#This method prints information about a DataFrame including the index dtype and columns, \n#non-null values and memory usage. It prints the summary itself and returns None, so it is\n#called directly instead of being wrapped in 'print()'.\ndete_survey.info()\nprint(dete_survey.head(5))\ntafe_survey.info()\nprint(tafe_survey.head(5))\n```\nThis prints out the dete_survey and tafe_survey datasets for checking. I used the '.head(5)' function to print only the top 5 rows so as not to flood the output.\n\n'.info()' prints out a concise summary of a DataFrame containing the column titles, the number of non-null values per column, and the data type each column is holding.\n\n\u003cbr\u003e\n\n#### _Task 2_ \u003ca name = \"task2\"\u003e\u003c/a\u003e\n```python\nimport pandas as pd\n\n#The 'read_csv()' function actually accepts a lot of different parameters (see documentation), one\n#of which is 'na_values='\ndete_survey = pd.read_csv('dete_survey.csv', na_values='Not Stated')\ntafe_survey = pd.read_csv('tafe_survey.csv')\n\n#To check if 'Not Stated' inputs are now NaN\nprint(dete_survey)\n```\nFor Task 2, the task is to clean the datasets and remove any excess columns that we will not need in this data analysis project. The 'na_values='Not Stated'' parameter tells 'read_csv()' to read any 'Not Stated' entry in the dete_survey dataset as NaN. 
\n\n```python\n#Dropping the columns at positions 28 to 48 from the dete_survey dataset (the slice 28:49 excludes position 49)\ndete_survey_updated = dete_survey.drop(dete_survey.iloc[:, 28:49], axis=1)\nprint(dete_survey_updated)\n\n#Dropping the columns at positions 17 to 65 from the tafe_survey dataset (the slice 17:66 excludes position 66)\ntafe_survey_updated = tafe_survey.drop(tafe_survey.iloc[:, 17:66], axis=1)\nprint(tafe_survey_updated)\n```\nThese lines of code remove the indexed columns from each dataset. The indexing for the columns to be removed is provided in the Dataquest guide for this project.\n\n```python\n#'index=False' to prevent creating a new column of new indexes in the new csv file\ndete_survey_updated.to_csv('dete_survey2.csv', index=False)\ntafe_survey_updated.to_csv('tafe_survey2.csv', index=False)\n```\nI saved the edited datasets into new files so I can load them in the next code file, and so I can view each dataset as a whole, as VS Code doesn't allow me to see the full thing in its terminal. \n\nAs stated in the comment in the code, to prevent new indexes from being added into the dataset every time a new copy of the dataset is made, in this task and all the other tasks I included the 'index=False' parameter in the '.to_csv()' function.\n\n\u003cbr\u003e\n\n#### _Task 3_ \u003ca name = \"task3\"\u003e\u003c/a\u003e\n```python\nimport pandas as pd\n\ndete_survey_updated = pd.read_csv('dete_survey2.csv')\ntafe_survey_updated = pd.read_csv('tafe_survey2.csv')\n\n#Taking out the headers as an array in the dete_survey dataset, and changing all the headers to lowercase\ndete_columns = list(dete_survey_updated.columns)\ndete_columns_lower = []\nfor element in dete_columns:\n    dete_columns_lower.append(element.lower())\n\n#Changing all the headers' spaces into underscore\ndete_columns_lower_underscore = []\nfor element in dete_columns_lower:\n    dete_columns_lower_underscore.append(element.replace(' ', '_')) \n\n#Removing any spaces at the front and end of the headers\ndete_columns_lower_underscore_stripped = []\nfor element in 
dete_columns_lower_underscore:\n    dete_columns_lower_underscore_stripped.append(element.strip())\n\n```\nFor task 3, the task is to rename the column titles to something more understandable and standardised.\n\nFor the dete_survey dataset:\n\nI made a new list storing all the column titles and worked on this new list separately from the main dataset. I then pieced the updated column names back into the dataset later. \n\nTo rename the column titles in the new list, I first changed all the column titles to lowercase (in the first for loop), then replaced any spaces with underscores (in the second for loop) and lastly removed any spaces at the front and end (in the third for loop). Note that because the spaces are replaced before stripping, any leading or trailing space has already become an underscore by the time '.strip()' runs.\n\n```python\n#(Copied from online) To add the edited headers array into the dete_survey_updated dataset as top row \n#(below headers)\n\n#Adding a row assigning it with index of -1 and will appear at the bottom of the dataset\ndete_survey_updated.loc[-1] = dete_columns_lower_underscore_stripped\n#Shifting index causing the newly assigned row to be now of index 0\ndete_survey_updated.index = dete_survey_updated.index + 1\n#Re-sorting by index so the new row will appear at the top as it is index 0 now\ndete_survey_updated = dete_survey_updated.sort_index()\n\n#(Copied from online) To replace headers with top row as the new headers\n\n#Store the first row after the header in a variable\nnew_header = dete_survey_updated.iloc[0]\n#Take all the data less the first row after the header\ndete_survey_updated = dete_survey_updated[1:]\n#Set the first row as the new dete_survey_updated header\ndete_survey_updated.columns = new_header\n\nprint(dete_survey_updated.head(5))\n```\nTo add the new list of renamed column titles back into the existing dataset, I first attached the new list of column titles as the top row of the existing dataset (below the existing/old column titles) in the first chunk of code above (exact description of how each line works is in the code). 
Then, in the second chunk of code above, I stored that top row (the new list of column titles) in a variable, kept the rest of the dataset minus the top row, and declared the stored row as the headers of the dataset, hence replacing the existing/old column titles (exact description of how each line works is in the code).\n\n```python\n#Renaming the column names for the tafe_survey_updated dataset as requested in Dataquest\ntafe_survey_updated = tafe_survey_updated.rename(columns={'Record ID': 'id','CESSATION YEAR': 'cease_date','Reason for ceasing employment': 'separationtype', 'Gender. What is your Gender?': 'gender', 'CurrentAge. Current Age': 'age', 'Employment Type. Employment Type': 'employment_status', 'Classification. Classification': 'position', 'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service', 'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'})\n\nprint(tafe_survey_updated.head(5))\n```\nFor tafe_survey dataset:\n\nThe Dataquest guide provides the desired column titles for the tafe_survey, so I simply used the 'rename()' function, which replaces each old title with its new one using a dictionary input.\n\n\u003cbr\u003e\n\n#### _Task 4_ \u003ca name = \"task4\"\u003e\u003c/a\u003e\n```python\nimport pandas as pd\n\ndete_survey = pd.read_csv('dete_survey3.csv')\ntafe_survey = pd.read_csv('tafe_survey3.csv')\n\nprint(dete_survey['separationtype'].value_counts())\nprint(tafe_survey['separationtype'].value_counts())\n\n#To select all the rows with people that have indicated resignation (all 3 types) in the \n#'separationtype' column in dete_survey\ndete_resignations = dete_survey.loc[(dete_survey['separationtype'] == 'Resignation-Other reasons') | (dete_survey['separationtype'] == 'Resignation-Other employer') | (dete_survey['separationtype'] == 'Resignation-Move overseas/interstate')]\nprint(dete_resignations)\n\n#To 
select all the rows with people that have indicated resignation in the \n#'separationtype' column in tafe_survey\ntafe_resignations = tafe_survey.loc[tafe_survey['separationtype'] == 'Resignation']\nprint(tafe_resignations)\n```\nFor Task 4, the task is to keep only the rows/respondents that indicated they exited the organisation by resignation in both the dete_survey and tafe_survey datasets. \n\nFrom the '.value_counts()' function we can see there are 3 types of resignation indicated by respondents in the dete_survey ('Resignation-Other reasons', 'Resignation-Other employer' and 'Resignation-Move overseas/interstate'). We consider all of them as resignation, so the filter includes all 3.\n\nFrom the '.value_counts()' function we can see there is only 1 type of resignation indicated by respondents in the tafe_survey ('Resignation'), which is the only type of respondent we take into consideration.\n\n\u003cbr\u003e\n\n#### _Task 5_ \u003ca name = \"task5\"\u003e\u003c/a\u003e\n```python\nimport pandas as pd\n\ndete_survey = pd.read_csv('dete_survey4(resignations).csv')\ntafe_survey = pd.read_csv('tafe_survey4(resignations).csv')\n\n#The 'cease_date' column for dete_survey is currently an object datatype as seen using the 'info()' \n#function and not float so we'll need to convert it so it can be used more easily later on in data analysis\nprint(dete_survey.info())\n\n#The 'cease_date' column for tafe_survey is already a float datatype as seen using the 'info()' \n#function so there is no need for us to convert its data type\nprint(tafe_survey.info())\n\n#Need to add 'dropna=False' because the default 'dropna=True' leaves NaN inputs out of the counts, which we do\n#not want, as we want to consider them in the counts as well to see the full picture for both dete_survey\n#and tafe_survey. \n\n#We can see there is some issue with dete_survey's 'cease_date' column while tafe_survey's 'cease_date'\n#column has no issue. 
We need to tackle the column for dete_survey\nprint(dete_survey['cease_date'].value_counts(dropna=False))\nprint(tafe_survey['cease_date'].value_counts(dropna=False))\n```\nFor task 5, the task is to clean the 'cease_date' column in the dete_survey dataset so that it can be used later to calculate the duration of service for respondents in the dete_survey. Technically there is no need to check this for the tafe_survey dataset as it already has a duration of service column, but we checked its data type anyway.\n\nTo check the data type of the 'cease_date' column for both datasets, we used the '.info()' function.\n\nTo check the type of values under the 'cease_date' column for both datasets, we used the 'value_counts()' function. \n\nWe discovered that the 'cease_date' column in dete_survey has undesired inputs (strings) that we need to change to floats, while the 'cease_date' column in tafe_survey already has the desired data type (float) and no unusual inputs.\n\n```python\n#(Copied from online) Splits each string into a list using the slash as the delimiter, then selects \n#only the last element of that list with '.str[-1]'. The last element is always the year, whether \n#the list has one element (year only) or two elements (month and year)\ndete_survey['cease_date'] = dete_survey['cease_date'].str.split('/').str[-1]\n\n#By checking here we can see that the inputs with months are gone, which gives us a more accurate \n#count of how many people resigned from the dete_survey per year\nprint(dete_survey['cease_date'].value_counts(dropna=False))\n\n#Converting data type of the 'cease_date' column for the dete_survey from object to a float\ndete_survey['cease_date'] = dete_survey['cease_date'].astype(float)\n\n#To check if the data type for the 'cease_date' column for the dete_survey has indeed changed \nprint(dete_survey.info())\n\n\n#To print out 
final dete_survey's and tafe_survey's value counts\nprint(dete_survey['cease_date'].value_counts(dropna=False).sort_index(ascending=True))\nprint(tafe_survey['cease_date'].value_counts(dropna=False).sort_index(ascending=True))\n```\nFor dete_survey dataset:\n\nTo tackle the unusual inputs, it is covered by the comment in the code, using functions such as '.str.split()' and '.str[]' (vectorized string methods)\n\nTo tackle the wrong data type, we use the '.astype()' function to convert the data type from object to float.\n\n\u003cbr\u003e\n\n#### _Task 6_ \u003ca name = \"task6\"\u003e\u003c/a\u003e\n```python\nimport pandas as pd\n\n#We need an 'institute_service' column in dete_survey so we can compare with its corresponding column\n#in the tafe_survey dataset\ndete_survey= pd.read_csv('dete_survey5(resignations).csv')\n\n#Subtracting the dete_start_date from the cease_date and assign the result to a new column named \n#institute_service.\ndete_survey['institute_service'] = dete_survey['cease_date'] - dete_survey['dete_start_date']\n\nprint(dete_survey.head(5))\n```\nFor task 6, the task is to create the 'institute_service' column representing how long the respondant has been with the institute/organisation in the dete_survey dataset as it is missing in this dataset, which we need to compare to the corresponding column in the tafe_survey dataset.\n\nWe do this by subtracting the 'dete_start_date' inputs from the 'cease_date' inputs and putting the result in a newly created 'institute_service' column in the dete_survey dataset.\n\n\u003cbr\u003e\n\n#### _Task 7_ \u003ca name = \"task7\"\u003e\u003c/a\u003e\n```python\nimport pandas as pd\nimport numpy as np\n\ndete_survey= pd.read_csv('dete_survey6(resignations).csv')\ntafe_survey = pd.read_csv('tafe_survey5(resignations).csv')\n\ndef main():\n    #1. 
To check the value counts for some of the columns we deem that the employee is dissatisfied such as the\n    #   'job_dissatisfaction' and 'dissatisfaction_with_the_department' columns in the dete_survey dataset.\n    #   As we can see, it is already all in Boolean marking (True, False, NaN)\n    print(dete_survey['job_dissatisfaction'].value_counts(dropna=False))\n    print(dete_survey['dissatisfaction_with_the_department'].value_counts(dropna=False))\n\n    #To check the value counts for the 'Contributing Factors. Dissatisfaction' and 'Contributing Factors. \n    #Job Dissatisfaction' columns in the tafe_survey dataset.\n    #As we can see, they are not yet in Boolean hence we need to first change that\n    print(tafe_survey['Contributing Factors. Dissatisfaction'].value_counts(dropna=False))\n    print(tafe_survey['Contributing Factors. Job Dissatisfaction'].value_counts(dropna=False))\n\n    #2. Making the columns in tafe_survey to only contains True, False or NaN values\n\n    #Note: The usual brute force method\n    #tafe_survey.loc[tafe_survey['Contributing Factors. Dissatisfaction'] == 'Contributing Factors. Dissatisfaction ', 'Contributing Factors. Dissatisfaction'] = True\n    #tafe_survey.loc[tafe_survey['Contributing Factors. Dissatisfaction'] == '-', 'Contributing Factors. Dissatisfaction'] = False\n    #tafe_survey.loc[tafe_survey['Contributing Factors. Job Dissatisfaction'] == 'Job Dissatisfaction', 'Contributing Factors. Job Dissatisfaction'] = True\n    #tafe_survey.loc[tafe_survey['Contributing Factors. Job Dissatisfaction'] == '-', 'Contributing Factors. Job Dissatisfaction'] = False\n    \n\n    #'apply()', 'map()' and 'applymap()' functions  all works in the same way to applying a function to \n    #each element in a Data Structure. 
The difference between these functions is which type of \n    #Data Structure they work on.\n\n    #- 'apply()' works for both DataFrames and Series\n    #- 'map()' only works for Series\n    #- 'applymap()' only works for DataFrames\n    #(From pandas 2.1 onwards, 'DataFrame.map()' also exists and 'applymap()' is deprecated)\n\n    #In this case, we are dealing with a Series so 'applymap()' doesn't work, only 'apply()' and 'map()'\n    #work.\n    #Using the self made 'update_val' and 'apply()' function\n    tafe_survey['Contributing Factors. Dissatisfaction'] = tafe_survey['Contributing Factors. Dissatisfaction'].apply(update_val)\n    tafe_survey['Contributing Factors. Job Dissatisfaction'] = tafe_survey['Contributing Factors. Job Dissatisfaction'].apply(update_val)\n    print(tafe_survey['Contributing Factors. Dissatisfaction'].value_counts(dropna=False))\n    print(tafe_survey['Contributing Factors. Job Dissatisfaction'].value_counts(dropna=False))\n\n\n    #3. Using the '.any()' function to create a new column with input True if any of the columns (of the same row) \n    #   is True, or False if all the columns (of the same row) are False. (If a row is all NaN or False, I treated \n    #   it as False)\n    #(The opposite function of this is '.all()', where all columns need to be True for the output to be True\n    #and if any 1 is False, the resulting output will be False)\n    tafe_survey['dissatisfied'] = tafe_survey[tafe_survey.columns[10:12]].any(axis=1)\n    dete_survey['dissatisfied'] = dete_survey.iloc[: , [13,14,15,16,17,18,19,25,26]].any(axis=1)\n    print(dete_survey['dissatisfied'].value_counts(dropna=False))\n    print(tafe_survey['dissatisfied'].value_counts(dropna=False))\n\n    dete_survey.to_csv('dete_survey7(resignations).csv', index=False)\n    tafe_survey.to_csv('tafe_survey6(resignations).csv', index=False)\n```\nFor task 7, the task is to create a 'dissatisfied' column that shows if a respondent has resigned due to dissatisfaction (apart from any other reasons). 
The Dataquest guide has already identified some factors (as column titles) that we deem to show the respondent exited due to dissatisfaction. Hence, in both datasets, as long as the respondent declared at least one of these factors as a reason for exit, we treat them as having resigned due to dissatisfaction and mark it as True in the 'dissatisfied' column. If none of these factors is a reason for exit, we treat them as not having resigned due to dissatisfaction and mark it as False in the 'dissatisfied' column. If all of these factors are NaN, we treat it as False as well.\n\n(Regarding the technicalities of the code, I will leave it to the comments in the code to explain, as it is too much to type otherwise 😫)\n\n```python\n#Creating the update_val function\ndef update_val(val):\n    if pd.isnull(val):\n        return np.nan\n    elif val == '-':\n        return False\n    else:\n        return True\n\nmain()\n```\nThis is the function applied to every single element in the tafe_survey columns 'Contributing Factors. Dissatisfaction' and 'Contributing Factors. Job Dissatisfaction' using the '.apply()' function in the main code. Creating a self-made function and passing it to '.apply()' makes the code look neater compared to the brute force method (that I commented out in the main code).\n\nThere is no need to change the elements for these factors (columns) in the dete_survey dataset as they are already Boolean.\n\n\u003cbr\u003e\n\n#### _Task 8_ \u003ca name = \"task8\"\u003e\u003c/a\u003e\n```python\nimport pandas as pd\n\ndete_survey = pd.read_csv('dete_survey7(resignations).csv')\ntafe_survey = pd.read_csv('tafe_survey6(resignations).csv')\n\n#1. 
Adding a new 'institute' column to both datasets with all its values as 'DETE' and 'TAFE' respectively\ndete_survey['institute'] = 'DETE'\ntafe_survey['institute'] = 'TAFE'\n\n#To check the columns are added\nprint(dete_survey.head(5))\nprint(tafe_survey.head(5))\n\nprint(dete_survey.info())\nprint(tafe_survey.info())\n```\nFor task 8, the task is to create an 'institute' column in both the dete_survey and tafe_survey dataset and filling its values with 'DETE' and 'TAFE' respectively to indicate which respondant is from which organisation after we combined the 2 datasets. Then, remove any excess columns we do not need for this project's data analysis, ensuring the columns are the same for both datasets and also combining these 2 datasets.\n\nHere, we create the new 'institute' column in both datasets as well as filling said columns.\n\n```python\n#Removing excess columns that we won't need in our data analysis. \n#From both the dete_survey and tafe_survey datasets we will only keep the columns 'id', 'seperationtype', 'gender', 'age', \n#'institute_service', 'dissatisfied' and 'institute' (from earlier in this code)\ndete_survey_updated = dete_survey[['id', 'separationtype', 'gender', 'age', 'institute_service', 'dissatisfied', 'institute']]\ntafe_survey_updated = tafe_survey[['id', 'separationtype', 'gender', 'age', 'institute_service', 'dissatisfied', 'institute']]\n\n#To check the excess columns are removed\nprint(dete_survey_updated.head(5))\nprint(tafe_survey_updated.head(5))\n\nprint(dete_survey_updated.info())\nprint(tafe_survey_updated.info())\n```\nRemoving excess columns in both datasets we don't need for this project's data analysis and as a good habit, to check to ensure they are removed.\n\n```python\n#2. Combining the datasets (top and bottom)\ncombined_updated = pd.concat([dete_survey_updated, tafe_survey_updated])\nprint(combined_updated)\nprint(combined_updated.info())\n```\nCombining the 2 datasets with the 'pd.concat()' function\n\n```python\n#3. 
Using the 'dropna()' function to drop any rows that have at least 1 NaN value to eliminate any NaN values in our new combined_updated dataset\n\n#The 'inplace=True' parameter is important! By default 'inplace=False' and the 'dropna()' function will return a\n#new modified DataFrame instead of modifying the existing one. In order to modify the existing dataset with 'dropna()'\n#you need to set 'inplace=True'\ncombined_updated.dropna(axis=0, how='any', inplace=True)\nprint(combined_updated.info())\n```\nNow, we try to make the new combined dataset cleaner by removing any rows that contain any NaN values using the '.dropna()' function (take note of the 'inplace=True' parameter!). Data loss is not too bad looking at the '.info()' output.\n\n\u003cbr\u003e\n\n#### _Task 9_ \u003ca name = \"task9\"\u003e\u003c/a\u003e\n```python\nimport pandas as pd\n\ncombined_updated = pd.read_csv('combined_updated.csv')\n\ndef main():\n\n    #1. To view the types of values present in the 'institute_service' column\n    print(combined_updated['institute_service'].value_counts())\n    print(combined_updated.info())\n\n    #There are string values 'Less than 1 year' and 'More than 20 years' in the 'institute_service' column that \n    #we need to deal with. I decided to convert the 'Less than 1 year' values to just a value of 1 and 'More than \n    #20 years' value to a value of 20. They are stored as the strings '1' and '20' (not integers) because the \n    #'.str' methods used below return NaN for non-string values; the column is converted to float afterwards\n    combined_updated.loc[combined_updated['institute_service'] == 'Less than 1 year', 'institute_service'] = '1'\n    combined_updated.loc[combined_updated['institute_service'] == 'More than 20 years', 'institute_service'] = '20'\n\n    #Checking the values in the 'institute_service' column again\n    print(combined_updated['institute_service'].value_counts())\n\n    #There are string values '1-2', '3-4', '5-6' '7-10' and '11-20' in the 'institute_service' column that we also\n    #need to deal with. 
I decided to use vectorized string methods to split them into a list with '-' as the\n    #delimiter and take the first element in those lists as the new value\n    combined_updated['institute_service'] = combined_updated['institute_service'].str.split('-').str[0]\n\n    #Checking the values in the 'institute_service' column again\n    print(combined_updated['institute_service'].value_counts())\n```\nFor task 9, the task is to create a 'service_cat' column in the combined dete_survey and tafe_survey dataset indicating how long a respondent has been working in the organisation. In the Dataquest guide the service categories are defined as:\n\n-New: Less than 3 years at a company\n\n-Experienced: 3-6 years at a company\n\n-Established: 7-10 years at a company\n\n-Veteran: 11 or more years at a company\n\nTo do so, we refer to the 'institute_service' column of the combined dataset. Checking this column, there are unusual inputs in the tafe_survey part of the combined dataset, while the dete_survey part has no unusual inputs (because in Task 6 we already produced clean values when we created its 'institute_service' column).\n\nThere are unusual inputs (strings) such as 'Less than 1 year', 'More than 20 years', '1-2', and '11-20' that we dealt with separately in the code above (technicalities explained in comments in code).\n\n```python\n    #Using the Series.astype() method to change the type of the 'institute_service' column to 'float'\n    combined_updated['institute_service'] = combined_updated['institute_service'].astype(float)\n    \n    #Checking the values in the 'institute_service' column again as well as its data type\n    print(combined_updated['institute_service'].value_counts())\n    print(combined_updated.info())\n```\nConverting the data type of the 'institute_service' column from object to float after dealing with the unusual inputs.\n\n```python\n    #Applying the 'career_stage' 
function to the 'institute_service' column and adding the new labels into a new\n    #column 'service_cat'\n    combined_updated['service_cat'] = combined_updated['institute_service'].apply(career_stage)\n    print(combined_updated['service_cat'].value_counts())\n    combined_updated.info()\n```\nApplying the self-made function 'career_stage' through '.apply()' once again to assign each respondent a career stage according to the definitions in Dataquest's guide, creating a 'service_cat' column in the combined dataset in the process.\n\n```python\n#Creating the career_stage function\n#(a NaN value fails every comparison below, so missing values return None)\ndef career_stage(val):\n    if val \u003c 3:\n        return 'New'\n    elif val \u003e= 3 and val \u003c= 6:\n        return 'Experienced'\n    elif val \u003e 6 and val \u003c= 10:\n        return 'Established'\n    elif val \u003e 10:\n        return 'Veteran'\n\n\nmain()\n```\nThis is what the self-made function 'career_stage' looks like, quite similar to the self-made function 'update_val' in Task 7.\n\n\u003cbr\u003e\n\n\u003cbr\u003e\n\n### 3. Data Modelling and Analysis \u003ca name = \"datamodellingandanalysis\"\u003e\u003c/a\u003e\n\nThe cleaning and preparation done so far enables us to do some data analysis by looking at patterns in the graphs and to answer some questions about the Employee Exit surveys. In my code, I will be exploring questions like 'Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?' and 'Are younger employees resigning due to some kind of dissatisfaction? What about older employees?' (Task 10)\n\nFor visualisation, I drew a bar graph of the percentage of resigned employees who left due to dissatisfaction (vs other reasons for exit) for each defined career stage. 
(Task 10)\n\n#### _Task 10_ \u003ca name = \"task10\"\u003e\u003c/a\u003e\n```python\nimport matplotlib.pyplot as plt\nimport pandas as pd\n```\nImported the matplotlib library to draw graphs.\n\n```python\ncombined_updated = pd.read_csv('combined_updated2.csv')\n\n#1 and 2. As we can see from this code, we only have True or False values in the 'dissatisfied' column as we have already\n#   decided to drop all the rows with NaN values in any of the columns. We may have lost some data, but it is not\n#   a significant loss during this cleaning process\nprint(combined_updated['dissatisfied'].value_counts(dropna=False))\n\n#3 and 4. Using the '.pivot_table()' function to create a pivot table. It takes many parameters (see documentation).\n#   -\u003e The first parameter is the dataframe itself\n#   -\u003e 'index=' is the column whose values become the row headers\n#   -\u003e 'values=' is the column to aggregate\n#   -\u003e 'aggfunc=' (by default it is mean, so technically there is no need to pass it here, but I did so it's clear),\n#      represents what operator to use (see documentation for list of accepted operators). 
Since the values are all\n#      Boolean and Python treats True as 1 and False as 0, taking the mean gives\n#      us the fraction of respondents with True for dissatisfied, grouped by their service_cat\n#      shown in the row header of the pivot table\ndissatisfied_pivot_table = pd.pivot_table(combined_updated, index='service_cat', values='dissatisfied', aggfunc='mean')\nprint(dissatisfied_pivot_table)\n```\nFor Task 10, the task is to create a graph as a visual aid to help us answer some of the questions in this guided project's data analysis.\n\nWe first printed the '.value_counts()' of the 'dissatisfied' column of the combined dataset to get an idea of what kind of data we are working with, passing 'dropna=False' to show NaN values (if any).\n\nWe then made use of pandas' '.pivot_table()' function (quite a powerful tool that is new to me), which lets us apply aggregation operators such as 'count', 'mean', 'sum', etc. to a dataframe to reveal interesting statistics. In this case, we wish to see the percentage of people that have True in the 'dissatisfied' column for every career group in the combined dataset. Since Python treats the Boolean value True as 1 and False as 0, the mean of the 'dissatisfied' column for each career group gives the proportion (a fraction between 0 and 1) of people with True in that group.\n\n(Again, regarding technicalities of the code I will leave it to the comments in the code to explain😫)\n\n```python\n#5. 
Plotting the results in a bar graph\n#The labels are listed in the same (alphabetical) order as the pivot table's index\nxaxis = ['Established', 'Experienced', 'New', 'Veteran']\n#Using the new 'dissatisfied' column given to us in the pivot table as the y-axis data\nyaxis = dissatisfied_pivot_table['dissatisfied']\n\nplt.bar(xaxis, yaxis)\n\nplt.title('Bar Graph of Percentage of Resigned Dissatisfied Employees\\n(by Service Category)')\nplt.xlabel('Service Category')\nplt.ylabel('Percentage of Resigned Dissatisfied Employees\\n(vs other reasons for exiting the company)')\n\nplt.yticks([0,0.1,0.2,0.3,0.4,0.5,0.6])\n\nplt.savefig('bargraph(resigned_dissatisfied_employees_(by_service_category)).png', dpi=100)\n\nplt.show()\n```\n\n![My Image](bargraph(resigned_dissatisfied_employees_(by_service_category)).png)\n\nCode for drawing the graph (matplotlib stuff).\n\nFrom the graph, we can see that the career groups 'Established' (7-10 years of institute service) and 'Veteran' (11 or more years of institute service) have the highest percentages of resignation due to dissatisfaction, while the career group 'New' (less than 3 years of institute service) has the lowest.\n\n\u003cbr\u003e\n\n\u003cbr\u003e\n\n### 4. Analysing Other Aspects of the Dataset \u003ca name = \"analysingother\"\u003e\u003c/a\u003e\n\nTask 10 marked the end of the guided part of the data analysis of the Employee Exit survey. Dataquest recommends that we try to analyse other aspects of the dataset and look for interesting results, such as 'Decide how to handle the rest of the missing values. Then, aggregate the data according to the service_cat column again. How many people in each career stage resigned due to some kind of dissatisfaction?', 'Clean the age column. How many people in each age group resigned due to some kind of dissatisfaction?' and 'Instead of analyzing the survey results together, analyze each survey separately. 
Did more employees in the DETE survey or TAFE survey end their employment because they were dissatisfied in some way?'\n\nIn Task 11, I decided to look at 'How many people in each age group resigned due to some kind of dissatisfaction?'\n\n#### _Task 11_ \u003ca name = \"task11\"\u003e\u003c/a\u003e\n```python\nimport pandas as pd\nimport matplotlib.pyplot as plt\n\ncombined_updated = pd.read_csv('combined_updated.csv')\n\ndef main():\n    #Checking all the different types of values present in the 'age' column. (Noticed there are no NaN values as we\n    #cleared them earlier in the project already)\n    print(combined_updated['age'].value_counts(dropna=False))\n\n    #I plan to split the age groups into 6 groups:\n    #-\u003e \u003c21\n    #-\u003e 21 to 30\n    #-\u003e 31 to 40\n    #-\u003e 41 to 50\n    #-\u003e 51 to 60\n    #-\u003e \u003e60\n```\nIn Task 11, the task is to create a graph as a visual aid to help us answer the selected question (stated above).\n\nWe define the age groups in the commented part of the code after checking the kinds of values we have in the 'age' column.\n\n```python\n    #Cleaning the age column (I plan to split the current values into a list and take only the first element)\n    #First split those values on spaces and take the first element\n    combined_updated['age'] = combined_updated['age'].str.split(' ').str[0]\n    print(combined_updated['age'].value_counts(dropna=False))\n\n    #Then split the remaining values on '-' and take the first element\n    combined_updated['age'] = combined_updated['age'].str.split('-').str[0]\n    print(combined_updated['age'].value_counts(dropna=False))\n```\nSimilar to part of the code in Task 9, we encountered unusual inputs such as '41  45' and '26-30' that we need to clean using vectorized string methods to get the desired value (a single number that can be converted to a float).\n\n```python\n    #Making the data type of the 'age' column into floats\n    combined_updated['age'] = combined_updated['age'].astype(float)\n```\nConverting the data 
type of the 'age' column from object to float after dealing with the unusual inputs.\n\n```python\n    #Creating a function and mapping it to the 'age' column to split the respondents into different age groups and\n    #assigning the age groups to an 'age_group' column\n    combined_updated['age_group'] = combined_updated['age'].apply(age_group)\n    print(combined_updated['age_group'].value_counts(dropna=False))\n    print(combined_updated.head(5))\n```\nSimilar to Task 9 again, creating a self-made function 'age_group' and applying it to the 'age' column using '.apply()' to assign each respondent to an age group, creating a new 'age_group' column in the combined dataset in the process.\n\n```python\n    #5. Plotting the results in a bar graph\n    #The labels are listed in the same (alphabetical) order as the pivot table's index\n    xaxis = ['21 to 30', '31 to 40', '41 to 50', '51 to 60', '\u003c21', '\u003e60']\n\n    #Creating a pivot table and using the new 'dissatisfied' column given to us in the pivot table as the\n    #y-axis data\n    #Note: 'count' tallies every non-missing 'dissatisfied' entry in a group (True and False alike);\n    #'sum' would tally only the True values\n    dissatisfied_pivot_table = pd.pivot_table(combined_updated, index='age_group', values='dissatisfied', aggfunc='count')\n    print(dissatisfied_pivot_table)\n    yaxis = dissatisfied_pivot_table['dissatisfied']\n```\nSimilar to Task 10, making use of pandas' '.pivot_table()' function, this time passing the 'count' operator (instead of 'mean' in Task 10) to obtain a count per age group. Note that 'count' counts every non-missing entry in the 'dissatisfied' column, whether True or False, so it gives the number of respondents in each age group; passing 'sum' instead would count only the resignations due to dissatisfaction.\n\nThe counts per age group obtained from the pivot table are then used as the data for the y-axis of the bar graph.\n\n```python\n    plt.bar(xaxis, yaxis)\n\n    plt.title('Bar Graph of Number of Resigned Dissatisfied Employees\\n(by Age Group)')\n    plt.xlabel('Age Group')\n    plt.ylabel('Number of Resigned Dissatisfied Employees')\n\n    plt.yticks([0,20,40,60,80,100,120,140,160,180])\n\n    plt.savefig('bargraph(number_of_dissatisfied_employees_(by_age_group)).png', dpi=100)\n\n    plt.show()\n```\n
![My Image](bargraph(number_of_dissatisfied_employees_(by_age_group)).png)\n\nCode for drawing the graph (matplotlib stuff).\n\nFrom the graph, we can see that the age group '41 to 50' has the highest bar among all the age groups, with '\u003c21' and '\u003e60', the extreme ends of the age range, having the lowest. Keep in mind that because the pivot table uses 'count', the bars reflect how many respondents fall in each age group, not only those who resigned due to dissatisfaction.\n\n```python\n#Creating the age_group function\ndef age_group(val):\n    if val \u003c 21:\n        return '\u003c21'\n    elif val \u003e= 21 and val \u003c= 30:\n        return '21 to 30'\n    elif val \u003e= 31 and val \u003c= 40:\n        return '31 to 40'\n    elif val \u003e= 41 and val \u003c= 50:\n        return '41 to 50'\n    elif val \u003e= 51 and val \u003c= 60:\n        return '51 to 60'\n    elif val \u003e 60:\n        return '\u003e60'\n\nmain()\n```\nThis is what the self-made function 'age_group', defined at the end of the code, looks like. It is quite similar to the self-made functions 'career_stage' from Task 9 and 'update_val' from Task 7.\n\n\u003cbr\u003e\n\n_Analysis of Task 10 bar graph_\n\nA possible reason I can think of for why a higher percentage of respondents with longer institute service resigned due to some sort of dissatisfaction is that it gets boring doing the same job for so many years, with the same tasks, people and environment every day, and disagreements may build up. 
Meanwhile, those with shorter institute service still have much to learn, and since they worked hard to join the institute, everything may seem new and exciting to them, so they are less likely to leave at an early stage.\n\n\u003cbr\u003e\n\n_Analysis of Task 11 bar graph_\n\nI feel there might be many reasons for the trend.\n\nOne reason could be a poor dataset: fewer respondents in the extreme age groups of \u003c21 and \u003e60 did the surveys, hence the lower number of people in these age groups indicating resignation due to dissatisfaction. (I haven't really checked the datasets fully, so I may be proven wrong.)\n\nAnother reason (provided the dataset is good and spread evenly among the age groups) is that younger people (\u003c21) may find the workplace new, are learning a lot, and are unlikely to resign immediately after getting a job. Older people (\u003e60), on the other hand, may have spent many years on the job, and the fact that they stayed so long may show that it is very satisfactory for them, so they are also less likely to resign due to dissatisfaction.\n\nMeanwhile, a higher number resign due to dissatisfaction in their 30s to 50s, perhaps due to boredom from working at a job they have come to dislike, or because they can find other jobs that suit their skillset and switch jobs more easily, and hence are less tolerant of dissatisfaction as they have more freedom to choose than the younger age groups.\n\n\u003cbr\u003e\n\n\u003cbr\u003e\n\n## Thoughts after the project \u003ca name = \"thoughts\"\u003e\u003c/a\u003e\nUrghhhhhhhhhh I almost reached 1000 lines in this readme file... 
Will probably shorten this next time, wayyyyyyyy too much typing.\n\nI feel that this project definitely expanded my toolbox for data analysis through discovering new and powerful functions such as '.pivot_table()' and '.apply()'.\n\nThrough talking with other programmers, I've learnt that at my current stage of coding I would like to keep exploring different aspects of programming, such as using more complex libraries like TensorFlow, Scikit-learn and PyTorch for machine learning and neural networks, or making backend/frontend/full stack websites, and not stick to one for now.\n\nI believe that my next project will be of a different nature, or that I will move on to more learning journey repositories, such as understanding more about different Algorithms and Data Structures.\n\n\u003cbr\u003e\n\nTo be improved:\n* I might stop making this section in the future, as there will always be more things to improve the more you look at your code, and the list will be endless. One quick one is that I haven't tried to figure out how to arrange the x-axis of both graphs in ascending order for the career group and age group. But I believe this can be a quick fix with a quick google search.\n\n\u003cbr\u003e\n\nHave a gif:\n\n![Semantic description of image](https://media0.giphy.com/media/l4KibK3JwaVo0CjDO/200.webp?cid=ecf05e47r4gqnxqenqjxnqy3gwrzfmasouh2iglayhepxcfx\u0026rid=200.webp\u0026ct=g)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwindjammer6%2F9.-employee-exit-data-analysis-python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwindjammer6%2F9.-employee-exit-data-analysis-python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwindjammer6%2F9.-employee-exit-data-analysis-python/lists"}