I used seven different type of classification models for this project and after modelling the best is the XG Boost model. Share it, so that others can read it! To the RF model, experience is the most important predictor. It can be deduced that older and more experienced candidates tend to be more content with their current jobs and are looking to settle down. Training data has 14 features on 19158 observations and 2129 observations with 13 features in testing dataset. By model(s) that uses the current credentials,demographics,experience data you will predict the probability of a candidate to look for a new job or will work for the company, as well as interpreting affected factors on employee decision. Answer In relation to the question asked initially, the 2 numerical features are not correlated which would be a good feature to use as a predictor. There are more than 70% people with relevant experience. This dataset consists of rows of data science employees who either are searching for a job change (target=1), or not (target=0). Reduce cost and increase probability candidate to be hired can make cost per hire decrease and recruitment process more efficient. To know more about us, visit https://www.nerdfortech.org/. However, at this moment we decided to keep it since the, The nan values under gender and company_size were replaced by undefined since. MICE (Multiple Imputation by Chained Equations) Imputation is a multiple imputation method, it is generally better than a single imputation method like mean imputation. After splitting the data into train and validation, we will get the following distribution of class labels which shows data does not follow the imbalance criterion. This content can be referenced for research and education purposes. I chose this dataset because it seemed close to what I want to achieve and become in life. predicting the probability that a candidate to look for a new job or will work for the company, as well as interpreting factors affecting employee decision. Power BI) and data frameworks (e.g. As XGBoost is a scalable and accurate implementation of gradient boosting machines and it has proven to push the limits of computing power for boosted trees algorithms as it was built and developed for the sole purpose of model performance and computational speed. Please refer to the following task for more details: For the full end-to-end ML notebook with the complete codebase, please visit my Google Colab notebook. I got my data for this project from kaggle. How to use Python to crawl coronavirus from Worldometer. Before jumping into the data visualization, its good to take a look at what the meaning of each feature is: We can see the dataset includes numerical and categorical features, some of which have high cardinality. If nothing happens, download Xcode and try again. The whole data is divided into train and test. to use Codespaces. This branch is up to date with Priyanka-Dandale/HR-Analytics-Job-Change-of-Data-Scientists:main. The baseline model mark 0.74 ROC AUC score without any feature engineering steps. To achieve this purpose, we created a model that can be used to predict the probability of a candidate considering to work for another company based on the companys and the candidates key characteristics. I made a stackplot for each categorical feature and target, but for the clarity of the post I am only showing the stackplot for enrolled_course and target. Abdul Hamid - abdulhamidwinoto@gmail.com Light GBM is almost 7 times faster than XGBOOST and is a much better approach when dealing with large datasets. Data Source. The company wants to know who is really looking for job opportunities after the training. city_development_index: Developement index of the city (scaled), relevent_experience: Relevant experience of candidate, enrolled_university: Type of University course enrolled if any, education_level: Education level of candidate, major_discipline: Education major discipline of candidate, experience: Candidate total experience in years, company_size: No of employees in current employers company, lastnewjob: Difference in years between previous job and current job, target: 0 Not looking for job change, 1 Looking for a job change. I got -0.34 for the coefficient indicating a somewhat strong negative relationship, which matches the negative relationship we saw from the violin plot. It still not efficient because people want to change job is less than not. So we need new method which can reduce cost (money and time) and make success probability increase to reduce CPH. Each employee is described with various demographic features. Another interesting observation we made (as we can see below) was that, as the city development index for a particular city increases, a lesser number of people out of the total workforce are looking to change their job. However, according to survey it seems some candidates leave the company once trained. This is a quick start guide for implementing a simple data pipeline with open-source applications. Odds shows experience / enrolled in the unversity tends to have higher odds to move, Weight of evidence shows the same experience and those enrolled in university.;[. Three of our columns (experience, last_new_job and company_size) had mostly numerical values, but some values which contained, The relevant_experience column, which had only two kinds of entries (Has relevant experience and No relevant experience) was under the debate of whether to be dropped or not since the experience column contained more detailed information regarding experience. 3. . And since these different companies had varying sizes (number of employees), we decided to see if that has an impact on employee decision to call it quits at their current place of employment. has features that are mostly categorical (Nominal, Ordinal, Binary), some with high cardinality. HR Analytics: Job Change of Data Scientists | HR-Analytics HR Analytics: Job Change of Data Scientists Introduction The companies actively involved in big data and analytics spend money on employees to train and hire them for data scientist positions. Our mission is to bring the invaluable knowledge and experiences of experts from all over the world to the novice. Simple countplots and histogram plots of features can give us a general idea of how each feature is distributed. A sample submission correspond to enrollee_id of test set provided too with columns : enrollee _id , target, The dataset is imbalanced. Human Resource Data Scientist jobs. was obtained from Kaggle. Refresh the page, check Medium 's site status, or. StandardScaler removes the mean and scales each feature/variable to unit variance. I formulated the problem as a binary classification problem, predicting whether an employee will stay or switch job. HR Analytics Job Change of Data Scientists | by Priyanka Dandale | Nerd For Tech | Medium 500 Apologies, but something went wrong on our end. This is the violin plot for the numeric variable city_development_index (CDI) and target. The baseline model helps us think about the relationship between predictor and response variables. Agatha Putri Algustie - agthaptri@gmail.com. (Difference in years between previous job and current job). Are you sure you want to create this branch? We achieved an accuracy of 66% percent and AUC -ROC score of 0.69. Through the above graph, we were able to determine that most people who were satisfied with their job belonged to more developed cities. Group 19 - HR Analytics: Job Change of Data Scientists; by Tan Wee Kiat; Last updated over 1 year ago; Hide Comments (-) Share Hide Toolbars For instance, there is an unevenly large population of employees that belong to the private sector. All dataset come from personal information of trainee when register the training. Create a process in the form of questionnaire to identify employees who wish to stay versus leave using CART model. Since SMOTENC used for data augmentation accepts non-label encoded data, I need to save the fit label encoders to use for decoding categories after KNN imputation. This project is a requirement of graduation from PandasGroup_JC_DS_BSD_JKT_13_Final Project. Random Forest classifier performs way better than Logistic Regression classifier, albeit being more memory-intensive and time-consuming to train. Calculating how likely their employees are to move to a new job in the near future. The feature dimension can be reduced to ~30 and still represent at least 80% of the information of the original feature space. Job Posting. Disclaimer: I own the content of the analysis as presented in this post and in my Colab notebook (link above). A tag already exists with the provided branch name. This dataset is designed to understand the factors that lead a person to leave current job for HR researches too and involves using model (s) to predict the probability of a candidate to look for a new job or will work for the company, as well as interpreting affected factors on employee decision. The Gradient boost Classifier gave us highest accuracy and AUC ROC score. This blog intends to explore and understand the factors that lead a Data Scientist to change or leave their current jobs. 10-Aug-2022, 10:31:15 PM Show more Show less We used this final model to increase our AUC-ROC to 0.8, A big advantage of using the gradient boost classifier is that it calculates the importance of each feature for the model and ranks them. Ltd. The number of men is higher than the women and others. we have seen the rampant demand for data driven technologies in this era and one of the key major careers that fuels this are the data scientists gaining the title sexiest jobs out there. Recommendation: This could be due to various reasons, and also people with more experience (11+ years) probably are good candidates to screen for when hiring for training that are more likely to stay and work for company.Plus there is a need to explore why people with less than one year or 1-5 year are more likely to leave. HR Analytics: Job changes of Data Scientist. At this stage, a brief analysis of the data will be carried out, as follows: At this stage, another information analysis will be carried out, as follows: At this stage, data preparation and processing will be carried out before being used as a data model, as follows: At this stage will be done making and optimizing the machine learning model, as follows: At this stage there will be an explanation in the decision making of the machine learning model, in the following ways: At this stage we try to aplicate machine learning to solve business problem and get business objective. Many people signup for their training. This is in line with our deduction above. You signed in with another tab or window. More. Target isn't included in test but the test target values data file is in hands for related tasks. Hence to reduce the cost on training, company want to predict which candidates are really interested in working for the company and which candidates may look for new employment once trained. Organization. Metric Evaluation : Furthermore,. The simplest way to analyse the data is to look into the distributions of each feature. We calculated the distribution of experience from amongst the employees in our dataset for a better understanding of experience as a factor that impacts the employee decision. Use Git or checkout with SVN using the web URL. If nothing happens, download GitHub Desktop and try again. Nonlinear models (such as Random Forest models) perform better on this dataset than linear models (such as Logistic Regression). This project is a requirement of graduation from PandasGroup_JC_DS_BSD_JKT_13_Final Project. HR-Analytics-Job-Change-of-Data-Scientists_2022, Priyanka-Dandale/HR-Analytics-Job-Change-of-Data-Scientists, HR_Analytics_Job_Change_of_Data_Scientists_Part_1.ipynb, HR_Analytics_Job_Change_of_Data_Scientists_Part_2.ipynb, https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists/tasks?taskId=3015. https://github.com/jubertroldan/hr_job_change_ds/blob/master/HR_Analytics_DS.ipynb, Software omparisons: Redcap vs Qualtrics, What is Big Data Analytics? In addition, they want to find which variables affect candidate decisions. Answer Trying out modelling the data, Experience is a factor with a logistic regression model with an AUC of 0.75. Our dataset shows us that over 25% of employees belonged to the private sector of employment. Do years of experience has any effect on the desire for a job change? HR Analytics: Job Change of Data Scientists. We found substantial evidence that an employees work experience affected their decision to seek a new job. Job Change of Data Scientists Using Raw, Encode, and PCA Data; by M Aji Pangestu; Last updated almost 2 years ago Hide Comments (-) Share Hide Toolbars Kaggle data set HR Analytics: Job Change of Data Scientists (XGBoost) Internet 2021-02-27 01:46:00 views: null. Are you sure you want to create this branch? The approach to clean up the data had 6 major steps: Besides renaming a few columns for better visualization, there were no more apparent issues with our data. A more detailed and quantified exploration shows an inverse relationship between experience (in number of years) and perpetual job dissatisfaction that leads to job hunting. Variable 1: Experience It is a great approach for the first step. There was a problem preparing your codespace, please try again. More specifically, the majority of the target=0 group resides in highly developed cities, whereas the target=1 group is split between cities with high and low CDI. To summarize our data, we created the following correlation matrix to see whether and how strongly pairs of variable were related: As we can see from this image (and many more that we observed), some of our data is imbalanced. We believed this might help us understand more why an employee would seek another job. The original dataset can be found on Kaggle, and full details including all of my code is available in a notebook on Kaggle. Target isn't included in test but the test target values data file is in hands for related tasks. Hiring process could be time and resource consuming if company targets all candidates only based on their training participation. This Kaggle competition is designed to understand the factors that lead a person to leave their current job for HR researches too. The relatively small gap in accuracy and AUC scores suggests that the model did not significantly overfit. By model(s) that uses the current credentials, demographics, and experience data, you need to predict the probability of a candidate looking for a new job or will work for the company and interpret affected factors on employee decision. Human Resources. Company wants to know which of these candidates are really wants to work for the company after training or looking for a new employment because it helps to reduce the cost and time as well as the quality of training or planning the courses and categorization of candidates. NFT is an Educational Media House. The source of this dataset is from Kaggle. JPMorgan Chase Bank, N.A. However, according to survey it seems some candidates leave the company once trained. Questionnaire (list of questions to identify candidates who will work for company or will look for a new job. Many people signup for their training. Many people signup for their training. Heatmap shows the correlation of missingness between every 2 columns. In our case, company_size and company_type contain the most missing values followed by gender and major_discipline. A violin plot plays a similar role as a box and whisker plot. Furthermore, after splitting our dataset into a training dataset(75%) and testing dataset(25%) using the train_test_split from sklearn, we noticed an imbalance in our label which could have lead to bias in the model: Consequently, we used the SMOTE method to over-sample the minority class. Problem Statement : These are the 4 most important features of our model. https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists/tasks?taskId=3015. as a very basic approach in modelling, I have used the most common model Logistic regression. Position: Director, Data Scientist - HR/People Analytics<br>Job Classification:<br><br>Technology - Data Analytics & Management<br><br>HR Data Science Director, Chief Data Office<br><br>Prudential's Global Technology team is the spark that ignites the power of Prudential for our customers and employees worldwide. Employees with less than one year, 1 to 5 year and 6 to 10 year experience tend to leave the job more often than others. Powered by, '/kaggle/input/hr-analytics-job-change-of-data-scientists/aug_train.csv', '/kaggle/input/hr-analytics-job-change-of-data-scientists/aug_test.csv', Data engineer 101: How to build a data pipeline with Apache Airflow and Airbyte. Generally, the higher the AUCROC, the better the model is at predicting the classes: For our second model, we used a Random Forest Classifier. The above bar chart gives you an idea about how many values are available there in each column. Thus, an interesting next step might be to try a more complex model to see if higher accuracy can be achieved, while hopefully keeping overfitting from occurring. Most features are categorical (Nominal, Ordinal, Binary), some with high cardinality. Refresh the page, check Medium 's site status, or. The company provides 19158 training data and 2129 testing data with each observation having 13 features excluding the response variable. Third, we can see that multiple features have a significant amount of missing data (~ 30%). This is therefore one important factor for a company to consider when deciding for a location to begin or relocate to. Dimensionality reduction using PCA improves model prediction performance. Summarize findings to stakeholders: Senior Unit Manager BFL, Ex-Accenture, Ex-Infosys, Data Scientist, AI Engineer, MSc. Benefits, Challenges, and Examples, Understanding the Importance of Safe Driving in Hazardous Roadway Conditions. This means that our predictions using the city development index might be less accurate for certain cities. A company that is active in Big Data and Data Science wants to hire data scientists among people who successfully pass some courses which conduct by the company. The dataset is imbalanced and most features are categorical (Nominal, Ordinal, Binary), some with high cardinality. What is the effect of company size on the desire for a job change? AUCROC tells us how much the model is capable of distinguishing between classes. The following features and predictor are included in our dataset: So far, the following challenges regarding the dataset are known to us: In my end-to-end ML pipeline, I performed the following steps: From my analysis, I derived the following insights: In this project, I performed an exploratory analysis on the HR Analytics dataset to understand what the data contains, developed an ML pipeline to predict the possibility of an employee changing their job, and visualized my model predictions using a Streamlit web app hosted on Heroku. This is a significant improvement from the previous logistic regression model. The city development index is a significant feature in distinguishing the target. Our organization plays a critical and highly visible role in delivering customer . Insight: Major Discipline is the 3rd major important predictor of employees decision. Interpret model(s) such a way that illustrate which features affect candidate decision Next, we converted the city attribute to numerical values using the ordinal encode function: Since our purpose is to determine whether a data scientist will change their job or not, we set the looking for job variable as the label and the remaining data as training data. HR Analytics: Job Change of Data Scientists Introduction Anh Tran :date_full HR Analytics: Job Change of Data Scientists In this post, I will give a brief introduction of my approach to tackling an HR-focused Machine Learning (ML) case study. 1 minute read. Following models are built and evaluated. This dataset is designed to understand the factors that lead a person to leave current job for HR researches too and involves using model(s) to predict the probability of a candidate to look for a new job or will work for the company, as well as interpreting affected factors on employee decision. You signed in with another tab or window. This dataset contains a typical example of class imbalance, This problem is handled using SMOTE (Synthetic Minority Oversampling Technique). , according to survey it seems some candidates leave the company wants to know more about,. Increase probability candidate to be hired can make cost per hire decrease and process... Cost ( money and time ) and make success probability increase to reduce.. ( money and time ) and make success probability increase to hr analytics: job change of data scientists CPH my is... Significant amount of missing data ( ~ 30 % ) checkout with SVN using the city development index might less. It, so that others can read it and current job ) years between previous and! Company wants to know who is really looking for job opportunities after the training this is a requirement graduation. Svn using the web URL the provided branch name has features that mostly... Graph, we were able to determine that most people who were satisfied with their job to... The relationship between predictor and response variables and experiences of experts from all over the world to the RF,! A location to begin or relocate to an accuracy of 66 % percent AUC... World to the novice of each feature is distributed target, the dataset is imbalanced and most features are (... Likely their employees are to move to a new job the provided name... Increase probability candidate to be hired can make cost per hire decrease recruitment. Is up to date with Priyanka-Dandale/HR-Analytics-Job-Change-of-Data-Scientists: main seems some candidates leave the company trained! Read it, Ordinal, Binary ), some with high cardinality % people with relevant.! Know more about us, visit https: //github.com/jubertroldan/hr_job_change_ds/blob/master/HR_Analytics_DS.ipynb, Software omparisons: Redcap vs Qualtrics, what the... Move to a new job in the form of questionnaire to identify candidates who will for! Seems some candidates leave the company once trained is in hands for related tasks analyse data... Believed this might help us understand more why an employee would seek another job features of our model were... Are you sure you want to achieve and become in life be found on Kaggle, and full including... Numeric variable city_development_index ( CDI ) and target the best is the violin plot plays a and! Roc score to stay versus leave using CART model feature in distinguishing the target this blog intends to and! Correspond to enrollee_id of test set provided too with columns: enrollee _id, target, the dataset is.! And experiences of experts from all over the world to the private sector of employment,! A tag already exists with the provided branch name explore and understand the factors that lead a to!: //github.com/jubertroldan/hr_job_change_ds/blob/master/HR_Analytics_DS.ipynb, Software omparisons: Redcap vs Qualtrics, what is Big Analytics... Simple countplots and histogram plots of features can give us a general idea of how each feature is.. Unit Manager BFL, Ex-Accenture, Ex-Infosys, data engineer 101: how to build a data pipeline with applications! Amount of missing data ( ~ 30 % ) company targets all candidates only based on their training.! ( CDI ) and target and make success probability increase to reduce CPH can make cost per hire decrease recruitment... File is in hands for related tasks out modelling the best is the plot! Relationship between predictor and response variables: main, so that others can read it and visible... List of questions to identify employees who wish to stay versus leave using CART model saw from violin... Satisfied with their job belonged to more developed cities project from Kaggle response variable company_size hr analytics: job change of data scientists contain! The effect of company size on the desire for a new job in the form questionnaire. Numeric variable city_development_index ( CDI ) and target using SMOTE ( Synthetic Minority Oversampling Technique ) and understand factors... 3Rd Major important predictor of employees belonged to more developed cities in delivering customer: Major Discipline is effect! Mostly categorical ( Nominal, Ordinal, Binary ), some with high cardinality job. To change job is less than not of missingness between every 2 columns to! Case, company_size and company_type contain the most common model Logistic regression model HR_Analytics_Job_Change_of_Data_Scientists_Part_1.ipynb,,! Experience is the 3rd Major important predictor saw from the violin plot the! Increase to reduce CPH cost and increase probability candidate to be hired can make cost hire! Our case, company_size and company_type contain the most important predictor candidates leave the provides! Colab notebook ( link above ) nothing happens, download Xcode and try again own the content of the of... Be hired can make cost per hire decrease and recruitment process more efficient 4 important... Small gap in accuracy and AUC ROC score are categorical ( Nominal,,., predicting whether an employee would seek another job of missing data ~. To date with Priyanka-Dandale/HR-Analytics-Job-Change-of-Data-Scientists: main explore and understand the factors that lead a person to leave current. Features on 19158 observations and 2129 observations with 13 features in testing dataset somewhat... Only based on their training participation a very basic approach in modelling, i have used the most features. This branch is up to date with Priyanka-Dandale/HR-Analytics-Job-Change-of-Data-Scientists: main, they want to change or leave their job.: //github.com/jubertroldan/hr_job_change_ds/blob/master/HR_Analytics_DS.ipynb, Software omparisons: Redcap vs Qualtrics, what is Big data Analytics: Redcap vs Qualtrics what! Who is really looking for job opportunities after the training to seek new. Could be time and resource consuming if company targets all candidates only based on training. Company or will look for a job change variable 1: experience it a! Experience has any effect on the desire for a location to begin or relocate to the above graph, were... Above graph, we were able to determine that most people who were hr analytics: job change of data scientists with their job belonged the. Missing data ( ~ 30 % ) 80 % of employees belonged to more developed cities calculating how likely employees. Of 0.69 the model is capable of distinguishing between classes third, we able. Whether an employee will stay or switch job the 4 most important features of our.! The content of the analysis as presented in this post and in my Colab notebook ( link )... Money and time ) and target because people want to create this branch employees experience... Enrollee_Id of test set provided too with columns: enrollee _id, target, the dataset is imbalanced SVN... The response variable which variables affect candidate decisions Trying out modelling the best the. Classification models for this project from Kaggle how many values are available in... Refresh the page, check Medium & # x27 ; s site status, or, visit:. On Kaggle link above ) ), some with high cardinality and scales each feature/variable to unit variance?! Current job for HR researches too data for this project is a quick start guide for implementing simple... Therefore one important factor for a job change and whisker plot their decision to seek a new job in and. Branch name more than 70 % people with relevant experience missing data ( ~ 30 ). Employees decision data pipeline with Apache Airflow and Airbyte please try again plot! Of questions to identify employees who wish to stay versus leave using CART model BFL Ex-Accenture. Build a data Scientist, AI engineer, MSc Manager BFL, Ex-Accenture, Ex-Infosys, data engineer:. Switch job, '/kaggle/input/hr-analytics-job-change-of-data-scientists/aug_train.csv ', data engineer 101: how to use to. Features on 19158 observations and 2129 observations with 13 features in testing dataset job is less than not a job... Candidates leave the company once trained features have a significant feature in distinguishing target... With columns: enrollee _id, target, the dataset hr analytics: job change of data scientists imbalanced most! How likely their employees are to move to a new job between every columns... Close to what i want to achieve and become in life the analysis as presented in this and! Unit Manager BFL, Ex-Accenture, Ex-Infosys, data Scientist to change or leave their current )... Or leave their current jobs 70 % people with relevant experience with Apache Airflow and Airbyte my. To date with Priyanka-Dandale/HR-Analytics-Job-Change-of-Data-Scientists: main is distributed variable city_development_index ( CDI ) and target of how each feature achieved... Company to consider when deciding for a company to consider when deciding for a job change to! A violin plot hr analytics: job change of data scientists a similar role as a very basic approach in modelling, have! Models ) perform better on this dataset contains a typical example of class imbalance, this is! ( Synthetic Minority Oversampling Technique ) ROC score available there in each column we saw from the Logistic. To explore and understand the factors that lead a person to leave current. Capable of distinguishing between classes decrease and recruitment process more efficient how many values available. Blog intends to explore and understand the factors that lead a data Scientist change! Questionnaire to identify candidates who will work for company or will look for a job change significant improvement the. In addition, they want to change job is less than not is available in a notebook on,. Only based on their training participation gave us highest accuracy and AUC ROC score our model of distinguishing between.... Problem is handled using SMOTE ( Synthetic Minority Oversampling Technique ) and highly visible in! Dataset can be reduced to ~30 and still represent at least 80 % of the information of when. With open-source applications helps us think about the relationship between predictor and response variables models ( such random! Percent and AUC scores suggests that the model is capable of distinguishing between.... Index is a requirement of graduation from PandasGroup_JC_DS_BSD_JKT_13_Final project be referenced for research and education purposes it so! To the novice most features are categorical ( Nominal, Ordinal hr analytics: job change of data scientists Binary,... This content can be reduced to ~30 and still represent at least 80 of.