Written by our guest blogger: Lyndon Sundmark, MBA People (HR) Analytics Consultant / Data Scientist
In Part 1 of this article we:
- Indicated that People Analytics is something that organizations can do themselves
- Indicated that it can be done in the R programming language as at least one possible tool
- Provided some basic terminology for People Analytics and Data Science.
- Provided a sample suggested framework for Data Science.
- Covered the first 2 steps of that framework for a People Analytics example: defining the problem, and collecting and managing data.
Let us now move on to step 3.
Reload the data and adjust it for corrections again
MFGEmployees <- read.csv("~/R Files/MFGEmployees4.csv")
#write.csv(MFGEmployees,"~/R Files/MFGEmployees4.csv")
MFGEmployees$AbsenceRate <- MFGEmployees$AbsentHours / 2080 * 100
summary(MFGEmployees)
MFGEmployees <- subset(MFGEmployees, MFGEmployees$Age >= 18)
MFGEmployees <- subset(MFGEmployees, MFGEmployees$Age <= 65)
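As a quick check of the arithmetic, the sketch below applies the same absence-rate calculation and age filter to a tiny made-up data frame. The `demo` data frame and its values are invented for illustration; 2080 is the annual-hours figure used in the code above (it equals 52 weeks of 40 hours).

```r
# Toy data, invented for illustration only (not the article's data set)
demo <- data.frame(
  EmployeeNumber = 1:4,
  Age            = c(17, 30, 45, 70),
  AbsentHours    = c(20.8, 41.6, 104, 208)
)

# Absence rate as a percentage of a 2080-hour working year
demo$AbsenceRate <- demo$AbsentHours / 2080 * 100

# Keep only ages 18 to 65, combining both conditions in one subset() call
demo <- subset(demo, Age >= 18 & Age <= 65)
print(demo)
```

Combining the two age conditions in a single `subset()` call gives the same rows as the two consecutive calls used above.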
3. Build The Model
One of the questions asked in the ‘defining the goal’ step was whether it is possible to predict absenteeism. Absence Rate is a continuous numeric value. In the ‘Building a model’ step we have to choose which models/statistical algorithms to use. Predicting a continuous numeric value suggests a couple of models that could be brought to bear: regression trees and linear regression. There are many more, but for the purposes of this article we will look at these two.
3.1 Regression Trees
Regression trees allow both categorical and numeric values to be used as predictors. Let's choose the following data as potential predictors in this analysis:
- Gender
- Department Name
- Store Location
- Division
- Age
- Length of Service
- Business Unit
Absence Rate will be the ‘target’, or thing to be predicted.
library(rattle)
library(magrittr)
building <- TRUE
scoring <- !building
crv$seed <- 42
MYdataset <- MFGEmployees
str(MYdataset)
MYinput <- c("Gender", "DepartmentName", "StoreLocation", "Division",
             "Age", "LengthService", "BusinessUnit")
MYnumeric <- c("Age", "LengthService")
MYcategoric <- c("Gender", "DepartmentName", "StoreLocation", "Division",
                 "BusinessUnit")
MYtarget <- "AbsenceRate"
MYrisk <- NULL
MYident <- "EmployeeNumber"
MYignore <- c("Surname", "GivenName", "City", "JobTitle", "AbsentHours")
MYweights <- NULL
library(rpart, quietly=TRUE)
set.seed(crv$seed)
MYrpart <- rpart(AbsenceRate ~ .,
                 data=MYdataset[, c(MYinput, MYtarget)],
                 method="anova",
                 parms=list(split="information"),
                 control=rpart.control(minsplit=10,
                                       maxdepth=10,
                                       usesurrogate=0,
                                       maxsurrogate=0))
fancyRpartPlot(MYrpart, main="Decision Tree MFGEmployees $ AbsenceRate")
The regression tree shows that age is a big factor in determining absence rate, with gender playing a small part in one of the age ranges (over 43 and under 52), where males have a lower absence rate.
Almost none of the categorical information other than gender appears to help in prediction.
Now let's look at linear regression as another model. The restriction in linear regression is that it operates on numeric predictors. Categorical variables can be made numeric through a transformation such as dummy coding, but that is beyond the scope of this article.
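For readers curious what such a transformation looks like, here is a minimal sketch using base R's `model.matrix()`, which turns a factor into 0/1 indicator columns. The `demo` data frame is invented for illustration and is not part of the article's data. R's `lm()` applies this kind of coding automatically when a predictor is a factor.

```r
# Toy data, invented for illustration only
demo <- data.frame(
  Gender = factor(c("F", "M", "M", "F")),
  Age    = c(25, 40, 52, 33)
)

# model.matrix() expands the factor into a 0/1 indicator column;
# the first level ("F") becomes the reference level and gets no column
X <- model.matrix(~ Gender + Age, data = demo)
print(X)
```

The resulting matrix has an intercept column, a `GenderM` indicator and the unchanged `Age` column.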
3.2 Linear Regression
In linear regression, then, we will need to restrict it to numeric variables:
- Age
- Length of Service
being used to predict absence rate.
#Linear Regression Model
RegressionCurrentData <- lm(AbsenceRate ~ Age + LengthService, data=MFGEmployees)
summary(RegressionCurrentData)
The summary shows an adjusted R-squared of .68, which means approximately 68% of the variance is accounted for by age and length of service. Both variables are significant, with Pr(>|t|) of <2e-16. These results use the entirety of the existing data to predict itself. Graphically it looks like this:
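To show where the R-squared and adjusted R-squared figures in an `lm()` summary come from, the sketch below recomputes them by hand. It runs on synthetic data generated inside the block (the coefficients 0.1 and -0.05 are arbitrary assumptions), so its numbers say nothing about the article's data set.

```r
# Synthetic data, an assumption for illustration only
set.seed(42)
n <- 200
demo <- data.frame(Age = runif(n, 18, 65), LengthService = runif(n, 0, 30))
demo$AbsenceRate <- 0.1 * demo$Age - 0.05 * demo$LengthService + rnorm(n)

fit <- lm(AbsenceRate ~ Age + LengthService, data = demo)

# R-squared: share of the total variation explained by the model
rss <- sum(residuals(fit)^2)                                 # residual sum of squares
tss <- sum((demo$AbsenceRate - mean(demo$AbsenceRate))^2)    # total sum of squares
r2  <- 1 - rss / tss

# Adjusted R-squared penalises the number of predictors (p = 2 here)
p      <- 2
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(r2 = r2, adj_r2 = adj_r2))
```

Both values should match `summary(fit)$r.squared` and `summary(fit)$adj.r.squared`.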
#2D plot of Age and AbsenceRate
library(ggplot2)
ggplot() + geom_point(aes(x = Age,y = AbsenceRate),data=MFGEmployees) +geom_smooth(aes(x = Age,y = AbsenceRa
#3D Scatterplot of Age and Length of Service with Absence Rate - with Coloring and Vertical Lines
# and Regression Plane
library(scatterplot3d)
s3d <- scatterplot3d(MFGEmployees$Age, MFGEmployees$LengthService, MFGEmployees$AbsenceRate,
                     pch=16, highlight.3d=TRUE, type="h",
                     main="Absence Rate By Age And Length of Service")
fit <- lm(MFGEmployees$AbsenceRate ~ MFGEmployees$Age + MFGEmployees$LengthService)
s3d$plane3d(fit)
4. Evaluate And Critique Model
Up till now we have concentrated on producing a couple of models. The effort so far has had one weakness.
We have used all of our data for 2015 to generate the models. They can both predict, but the predictions are based on existing data, data that is already known. We don't know how well they will predict on data they haven't seen yet.
To evaluate and critique the models, we need to train each model on part of the data and hold out a portion to test on. We will divide the data into 10 parts, using 9 parts as training data and 1 part as testing data, and alternate which are the 9 and the 1, so that each of the 10 parts gets to be training data 9 times and testing data once.
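To make the fold rotation concrete, here is a base-R sketch of a single round of 10-fold cross-validation for a linear model. It uses synthetic data generated inside the block and hypothetical names (`demo`, `fold`, `rmse`); it illustrates the idea only and is not the procedure used to produce the results in this article.

```r
# Synthetic data, an assumption for illustration only
set.seed(42)
n <- 200
demo <- data.frame(Age = runif(n, 18, 65), LengthService = runif(n, 0, 30))
demo$AbsenceRate <- 0.1 * demo$Age - 0.05 * demo$LengthService + rnorm(n)

# Randomly assign every row to one of k folds of equal size
k    <- 10
fold <- sample(rep(1:k, length.out = n))

rmse <- numeric(k)
for (i in 1:k) {
  train_set <- demo[fold != i, ]   # 9 parts used for fitting
  test_set  <- demo[fold == i, ]   # 1 part held out for testing
  fit  <- lm(AbsenceRate ~ Age + LengthService, data = train_set)
  pred <- predict(fit, newdata = test_set)
  rmse[i] <- sqrt(mean((test_set$AbsenceRate - pred)^2))
}

# Average out-of-sample error across the 10 held-out parts
print(mean(rmse))
```

Each row is predicted exactly once by a model that never saw it, which is what makes the averaged error a fairer estimate than one computed on the training data.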
The R “caret” library helps us do that. We will run both a regression tree and linear regression and compare how they do against each other.
First the Linear Regression
MFGEmployees <- readRDS(file="MFGEmployees.Rda")
library(caret)
## Loading required package: lattice
set.seed(998)
inTraining <- createDataPartition(MFGEmployees$BusinessUnit, p = .75, list = FALSE)
training <- MFGEmployees[inTraining,]
testing <- MFGEmployees[-inTraining,]
fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 10)
set.seed(825)
lmFit1 <- train(AbsenceRate ~ Age + LengthService, data = training,
                method = "lm",
                trControl = fitControl)
lmFit1
The Rsquared shows a value of .688, which means that even when sampling different parts of the data in 10-fold cross-validation, the use of age and length of service seems to be pretty robust so far.
Next the decision tree. The first time with just the numeric variables.
set.seed(825)
rpartFit1 <- train(AbsenceRate ~ Age + LengthService, data = training,
                   method = "rpart",
                   trControl = fitControl,
                   maxdepth = 5)
rpartFit1
You will notice that the decision tree with 10-fold cross-validation didn't perform as well, with an Rsquared of approximately .60.
The second time with the original categorical and numeric variables used.
set.seed(825)
rpartFit2 <- train(AbsenceRate ~ Gender + DepartmentName + StoreLocation + Division + Age + LengthService,
                   data = training,
                   method = "rpart",
                   trControl = fitControl,
                   maxdepth = 5)
rpartFit2
Here, when you include all the originally used variables in 10-fold cross-validation, the Rsquared changes little and is still around .60.
So far the linear regression is performing better.
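The `testing` partition created earlier is not used in the comparison above. One way it could be used, sketched here on synthetic data rather than the article's data set, is to score both models on rows they never saw during training. All names in the block (`demo`, `train_set`, `test_set`, `rmse`) are hypothetical, and because the synthetic data is generated from a linear relationship, the outcome is not evidence about the real data.

```r
library(rpart)

# Synthetic data, an assumption for illustration only
set.seed(42)
n <- 400
demo <- data.frame(Age = runif(n, 18, 65), LengthService = runif(n, 0, 30))
demo$AbsenceRate <- 0.1 * demo$Age - 0.05 * demo$LengthService + rnorm(n)

# 75% of rows for training, the remaining 25% held out for testing
idx       <- sample(n, size = 0.75 * n)
train_set <- demo[idx, ]
test_set  <- demo[-idx, ]

lm_fit   <- lm(AbsenceRate ~ Age + LengthService, data = train_set)
tree_fit <- rpart(AbsenceRate ~ Age + LengthService, data = train_set,
                  method = "anova")

# Root mean squared error of predictions against observed values
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

results <- c(
  lm    = rmse(test_set$AbsenceRate, predict(lm_fit,   newdata = test_set)),
  rpart = rmse(test_set$AbsenceRate, predict(tree_fit, newdata = test_set))
)
print(results)
```

A lower held-out RMSE indicates the model that generalises better on that particular split.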
5. Present Results and Document
Presenting results and documenting them is something that R helps with. You may not have realized it, but R Markdown has been used to create the full layout of these two blog articles. HTML, PDF and Word formats can be produced.
R Markdown allows the reader to see exactly what you have been doing, so that an independent person can replicate your results and confirm what you have done. It shows the R code/commands, the statistical results and the graphics.
The R Markdown files for these two articles are:
• Absenteeism-Part1.Rmd and
• Absenteeism-Part2.Rmd
For presentation formats beyond this, you may have to use other tools.
6. Deploy Model
Once you have evaluated your model(s) and chosen to use them, they need to be deployed so that they can be used. At the simplest level, ‘deploy’ can mean using the ‘predict’ function in R (where applicable) in conjunction with your model.
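As an illustration of that simplest form of deployment, the sketch below fits a linear model and calls `predict()` on new rows. The data is synthetic and the `new_staff` rows are hypothetical employees invented for the example. The `interval = "prediction"` argument of `predict()` for `lm` objects adds lower and upper bounds around each estimate.

```r
# Synthetic data, an assumption for illustration only
set.seed(42)
n <- 200
demo <- data.frame(Age = runif(n, 18, 65), LengthService = runif(n, 0, 30))
demo$AbsenceRate <- 0.1 * demo$Age - 0.05 * demo$LengthService + rnorm(n)

fit <- lm(AbsenceRate ~ Age + LengthService, data = demo)

# Hypothetical new employees to score; column names must match the model's predictors
new_staff <- data.frame(Age = c(25, 50), LengthService = c(2, 20))

# Point estimates plus 95% prediction intervals (columns fit, lwr, upr)
pred <- predict(fit, newdata = new_staff, interval = "prediction")
print(pred)
```

The width of the interval is a reminder that a prediction for an individual is far less certain than an average over many employees.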
In R, you can also ‘publish’ your model as an R HTTP service so that others can use it. (That is beyond the scope of this article.)
Can it predict next year's absenteeism?
Let's predict the 2016 absenteeism from the 2015 model.
If we make the simplifying assumption that nobody quits and nobody new comes in, we can take the 2015 data and add 1 to age and 1 to years of service for an approximation of new 2016 data before we get to 2016.
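The construction of the 2016 data set is not shown here, so the following is only a sketch of how the approach just described could be coded. It uses synthetic data and hypothetical names (`demo2015`, `demo2016`, `fit2015`), and is not necessarily how `Absence2016Data` was built.

```r
# Synthetic "2015" data, an assumption for illustration only
set.seed(42)
n <- 200
demo2015 <- data.frame(Age = runif(n, 18, 64), LengthService = runif(n, 0, 30))
demo2015$AbsenceRate <- 0.1 * demo2015$Age - 0.05 * demo2015$LengthService + rnorm(n)

fit2015 <- lm(AbsenceRate ~ Age + LengthService, data = demo2015)

# Approximate next year's workforce: everyone is one year older
# and has one more year of service
demo2016 <- demo2015
demo2016$Age           <- demo2016$Age + 1
demo2016$LengthService <- demo2016$LengthService + 1

# Replace the known absence rate with the model's prediction
demo2016$AbsenceRate <- predict(fit2015, newdata = demo2016)

print(mean(demo2016$AbsenceRate))  # predicted average for "2016"
print(mean(demo2015$AbsenceRate))  # actual average for "2015"
```

With a linear model that includes an intercept, shifting every row by one year moves the predicted mean by exactly the sum of the two coefficients, so the year-over-year change is determined entirely by the fitted slopes.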
To get a single estimate for 2016 we ask for the mean of the absence rate.
mean(Absence2016Data$AbsenceRate)
mean(MFGEmployees$AbsenceRate)
## [1] 2.907265
The first figure above is the 2016 prediction, the second is the 2015 actual for comparison.
If so, how well can it predict?
As mentioned previously, about 68% of the variation is accounted for in a linear regression model using age and length of service.
Can we reduce our absenteeism?
On the surface, with this model only reducing age and increasing length of service will reduce absenteeism. Obviously, absenteeism is much more complex than just the rudimentary data we have collected. A serious look at this metric and problem would require more and different kinds of data. As mentioned before, the raw data used in this article and analysis is totally contrived to illustrate an example.
Final Comments
The purpose of these two blog articles was to:
- show that R could be used to do People Analytics
- show that People Analytics is the application of Data Science to People (HR) management and decision making.
- show by a rudimentary/simple (not necessarily rigorous) example that the ‘data science’ capability is in your hands.
- show that ‘free’ tools can be used to start the ‘People Analytics’ journey.
It's time to apply data science to People Management, and be data-driven.
Enjoy the journey!