Written by our guest blogger: Lyndon Sundmark, MBA, People (HR) Analytics Consultant / Data Scientist
Introduction
Over the last 18 months or so I have been writing LinkedIn blog articles on analytics and their potential use in HR. Most of these articles have hinted that there are many obstacles to the application of analytics to HR.
These obstacles to widespread, evidence-based use are predominantly human, not technical:
- Data generation is not the problem. HR data has never been more available and accessible to HR professionals.
- Technology is not the problem. Data science, machine learning, and analytic tools are exploding onto the scene these days. Whether it is enhancements to the R statistical package or other statistical packages, or developments in other tools such as Microsoft Azure Machine Learning or RapidMiner, there have never been more tools to bring to bear on HR data.
- Application of the tools to the data IS the problem (or rather, the lack of it). This lack of application can stem from:
- The HR profession not seeing that most of HR is highly technical. In the past, HR might have acknowledged that traditional areas such as salary and wage administration, or labor relations/collective bargaining costing, were technical, because they were seen as heavily dependent on calculation and data. But for most of the rest of the HR domain, HR professionals think it is non-technical, whether it be recruitment, training, health and safety, benefits, or engagement. To be sure, there are 'human' sides to all areas of HR, but they are all just as much technical as well. The technical side has far too long been ignored by HR professionals. Dr. John Sullivan, in the following article: http://www.tlnt.com/2013/02/26/how-google-is-using-people-analytics-to-completely-reinvent-hr/ describes this as 'reliance on relationships': 'Relationships are the antithesis of analytical decision making.'
- Not enough HR professionals seeking the informational and analytical side of the picture in their university/college HR studies in preparation for the HR field, and not enough universities/colleges offering studies in People Analytics.
- People Analytics, where it exists, being defined too narrowly as just HR metrics or the application of predictive analytic tools to the HR domain. These are part of the picture, yes, but 'data driven' can include many more statistical methods than just predictive ones, and can cover much more data than just traditional HR metrics.
- Obstacles that organizations inadvertently put in the way, such as the 'sacred' requirement to develop a business case to 'get' resources to do any of this, particularly when 'starting' can be relatively free through proofs of concept.
I am a firm believer in 'skunkworks' projects: making do with, and showing what can be done with, what you already have or what can be obtained for free. You will often need to show by simple example what can be done in order to justify more elaborate formal initiatives, and requests for additional formal resources, later.
It's in that spirit that I wanted to walk through an example of People Analytics using free tools. This won't be a full-fledged best-practices example, or even a robust one. Rather, it is intended to be a simple example to illustrate the process and some of the tools. Best practices and robustness come with learning, use, and mastery over a period of time. Most of us need to start 'simply' to start 'somewhere'.
This example will focus on one particular HR metric: absenteeism. The data will be contrived, but it will illustrate very rudimentary People Analytics using the R statistical programming language. Before we get into it, though, we should cover some basic helpful definitions and frameworks. These will help show why we are doing what we are doing.
Terminology
People Analytics
First of all, I do prefer the term 'People Analytics' to some of the other terms used interchangeably with it. Per the title, I am going to use the term 'People Analytics' the way it is implicitly defined in the article linked above:
People analytics is a data-driven approach to managing people at work.
'Data-driven' is key. It dispels any notion that data and measurement are not part of managing people. (Sorry to those of you who felt managing people was restricted only to a 'reliance on relationships'.) People Analytics is what happens when you apply Data Science and its principles to the realm of People Management (HR).
It means 'analysis of data' and then 'action', and it means both analysis and action. You are managing people by being 'data driven'. A lot of organizations that try to get people analytics started get stuck at the creation of HR metrics and the use of Business Intelligence tools. These can be, and are, part of the people analytics picture, to be sure. But slicing and dicing, graphics, and visualizations by themselves only take you so far. The addition of statistical analyses, and taking informed action on your data, are what will propel you forward.
Another reason why I like this definition is that 'data-driven' doesn't unintentionally restrict the types of analyses we can do and still be data driven. These can include exploratory analysis, predictive analysis, and experimental design. (People often think only of 'predictive' in the context of 'data driven'.)
Because we mention 'Data Science' in the context of People Analytics, it is important to define it next, to understand why the two are so closely tied together.
Data Science
I will share a few definitions.
- In their book Practical Data Science with R (https://www.manning.com/books/practical-data-science-with-r), Nina Zumel and John Mount define data science on page xix as 'managing the process that can transform hypotheses and data into actionable predictions.'
- Another definition is from the Field Guide to Data Science by Booz Allen Hamilton, page 21 (http://www.boozallen.com/insights/2015/12/data-science-field-guide-second-edition): 'the art of turning data into actions'.
- And still another definition comes from Data Science for Business by Foster Provost and Tom Fawcett (http://shop.oreilly.com/product/0636920028918.do): data science is 'a set of fundamental principles that guide the extraction of knowledge from data' (page 2), and the ultimate goal of data science is improving decision making (page 5).
All of these definitions line up as entirely consistent with the definition of People Analytics above. People Analytics means transforming HR hunches and guesses (really, hypotheses) into information/data, analyses, and management actions: actions supported by the data.
A Framework
Zumel and Mount's definition mentions 'process'. A process is always required to transform something from 'what it is' into 'what it is to become', here 'data' into 'actions'. This becomes a framework that can guide our efforts and understanding. In their book Practical Data Science with R, they define the following process for data science on page 6:
Data Science
- Define a goal
- Collect and Manage Data
- Build The Model
- Evaluate and Critique Model
- Present Results and Document
- Deploy Model
I don't want to belabor the above process in this blog article, because entire books have been and are being written on the subjects of data science, predictive analytics, data mining, and so on. But I will make some general comments about the above steps to set the stage for the illustrative, simple, rudimentary R example to follow.
1. Define a goal. As mentioned above, this means first identifying what HR management business problem you are trying to solve. Without a problem or issue, we don't have a goal.
2. Collect and Manage Data. At its simplest, you want a 'dataset' of information perceived to be relevant to the problem. The collection and management of data could be a simple extract from the corporate Human Resource Information System, or the output of an elaborate Data Warehousing/Business Intelligence tool used on HR information. For the purposes of this blog article, we will use a simple CSV file. This step also involves exploring the data, both for data quality issues and for an initial look at what the data may be telling you.
3. Build The Model. This step really means that, after you have defined the HR business problem or goal you are trying to achieve, you pick a data mining approach/tool designed to address that type of problem. With absenteeism as an HR issue, are you trying to distinguish employees with a propensity for high absenteeism from those without one? Are you trying to predict future absenteeism rates? Are you trying to define what is normal absenteeism versus what is atypical or an anomaly? The business problem/goal determines the appropriate data mining tools to consider. Not an exhaustive list, but common data mining approaches used in modelling are classification, regression, anomaly detection, time series, clustering, and association analyses, to name a few. These approaches take information/data as inputs, run it through statistical algorithms, and produce output.
4. Evaluate and Critique Model. Each data mining approach can have many different statistical algorithms to bring to bear on the data. The evaluation asks both which algorithms provide the most consistently accurate predictions on new data, and whether we have all the relevant data or need more types of data to increase the model's predictive accuracy on new data. This is necessarily a repetitive, circular activity over time as the model improves.
5. Present Results And Document. When we have gotten our model to an acceptable, useful predictive level, we document our activity and present the results. The definition of acceptable and useful is really relative to the organization, but in all cases it would mean that the results show improvement over what would have happened otherwise. The principle behind data 'science', like any science, is that with the same data, other people should be able to reproduce our findings and results.
6. Deploy Model. The whole purpose of building the model (which is built on existing data) is to:
- use the model on future data when it becomes available, to predict or prevent something before it occurs, or
- better understand our existing business problem in order to tailor more specific responses.
Both R and other solutions allow you to save the model so that it can be used later on new data, as sketched below.
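In R, one simple way to do that is with saveRDS() and readRDS(). This is just an illustrative sketch; absence.model and newEmployeeData are hypothetical names standing in for whatever model and future data we eventually have:

# A minimal sketch of model persistence in R (object names here are hypothetical)
# Save a fitted model object to disk...
saveRDS(absence.model, file = "absence_model.rds")

# ...then, in a later session, reload it and score new data
absence.model <- readRDS("absence_model.rds")
predicted <- predict(absence.model, newdata = newEmployeeData)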
Let's now turn to a rudimentary People Analytics (data-driven People Management (HR)) example in R. This blog article is 'Part 1': covering the whole process in a single article would push the limits of comfortable reading length, so Part 1 will cover the first two steps of the process above.
An R Example
1. Define the goal (or HR business problem/issue).
A hypothetical company has decided that it needs to look at absenteeism. It wants answers to the following questions:
- What is its rate of absenteeism?
- Does anyone have excessive absenteeism?
- Is it the same across the organization?
- Does it vary by gender?
- Does it vary by length of service or age? The initial guess is that age and length of service may be related to absenteeism rates.
- Can it predict next year’s absenteeism?
- If so, how well can it predict?
- Can we reduce our absenteeism?
If they can make future People Management decisions “driven” by what the data is telling them, then they will feel they have started the People Analytics journey.
2. Collect and Manage Data.
Let us suppose this is a skunkworks project. Formal, separate resources have not been identified for this initiative, so only an initial look at recent data is possible. The HRIS system is able to provide some rudimentary information covering absences for 2015 only. It was able to generate the following information as a CSV (comma-separated values) file:
- EmployeeNumber
- Surname
- GivenName
- Gender
- City
- JobTitle
- DepartmentName
- StoreLocation
- Division
- Age
- LengthService
- AbsentHours
- BusinessUnit
Let's read in the data provided:
MFGEmployees <- read.csv("~/R Files/MFGEmployees4.csv")
str(MFGEmployees)
## 'data.frame': 8336 obs. of 13 variables:
## $ EmployeeNumber: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Surname       : Factor w/ 4051 levels "Aaron","Abadie",..: 1556 1618 941 3415 946 1917 503 2152 3451 222 ...
## $ GivenName     : Factor w/ 1625 levels "Aaron","Abel",..: 1141 1453 265 688 450 514 1260 625 760 1305 ...
## $ Gender        : Factor w/ 2 levels "F","M": 1 2 2 1 2 2 2 2 2 2 ...
## $ City          : Factor w/ 243 levels "Abbotsford","Agassiz",..: 29 52 180 227 144 180 223 192 144 223 ...
## $ JobTitle      : Factor w/ 47 levels "Accounting Clerk",..: 5 5 5 5 5 5 1 5 5 1 ...
## $ DepartmentName: Factor w/ 21 levels "Accounting","Accounts Payable",..: 5 5 5 5 5 5 1 5 5 1 ...
## $ StoreLocation : Factor w/ 40 levels "Abbotsford","Aldergrove",..: 5 18 29 37 21 29 35 38 21 35 ...
## $ Division      : Factor w/ 6 levels "Executive","FinanceAndAccounting",..: 6 6 6 6 6 6 2 6 6 2 ...
## $ Age           : num 32 40.3 48.8 44.6 35.7 ...
## $ LengthService : num 6.02 5.53 4.39 3.08 3.62 ...
## $ AbsentHours   : num 36.6 30.2 83.8 70 0 ...
## $ BusinessUnit  : Factor w/ 2 levels "HeadOffice","Stores": 2 2 2 2 2 2 1 2 2 1 ...
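One portability note: the str() output above shows the text columns as factors, which was the read.csv() default when this was written. On R 4.0.0 and later, stringsAsFactors defaults to FALSE, so to reproduce the same structure on a current R you would need to ask for factors explicitly:

# On R >= 4.0, request factors explicitly to match the output above
MFGEmployees <- read.csv("~/R Files/MFGEmployees4.csv", stringsAsFactors = TRUE)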
The first thing we should do is check the quality of the data. Data will rarely be clean or perfect when we receive it. Questionable data should either be corrected (preferred) or deleted.
summary(MFGEmployees)
## EmployeeNumber   Surname          GivenName     Gender               City          JobTitle              DepartmentName
## Min.   :   1   Johnson :  106   James  :  182   F:4120   Vancouver      :1780   Cashier      :1703   Customer Service:1737
## 1st Qu.:2085   Smith   :   86   John   :  161   M:4216   Victoria       : 690   Dairy Person :1514   Dairy           :1515
## Median :4168   Jones   :   71   Robert :  136            New Westminster: 540   Meat Cutter  :1480   Meats           :1514
## Mean   :4168   Williams:   71   Mary   :  124            Burnaby        : 339   Baker        :1404   Bakery          :1449
## 3rd Qu.:6252   Brown   :   62   William:  121            Surrey         : 275   Produce Clerk:1129   Produce         :1163
## Max.   :8336   Moore   :   47   Michael:  107            Richmond       : 228   Shelf Stocker: 712   Processed Foods : 746
##                (Other) : 7893   (Other): 7505            (Other)        :4484   (Other)      : 394   (Other)         : 212
##          StoreLocation                 Division         Age           LengthService      AbsentHours        BusinessUnit
## Vancouver      :1836   Executive           :  11   Min.   : 3.505   Min.   : 0.0121   Min.   :  0.00   HeadOffice: 173
## Victoria       : 853   FinanceAndAccounting:  73   1st Qu.:35.299   1st Qu.: 3.5759   1st Qu.: 19.13   Stores    :8163
## Nanaimo        : 610   HumanResources      :  76   Median :42.115   Median : 4.6002   Median : 56.01
## New Westminster: 525   InfoTech            :  10   Mean   :42.007   Mean   : 4.7829   Mean   : 61.28
## Kelowna        : 418   Legal               :   3   3rd Qu.:48.667   3rd Qu.: 5.6239   3rd Qu.: 94.28
## Kamloops       : 360   Stores              :8163   Max.   :77.938   Max.   :43.7352   Max.   :272.53
## (Other)        :3734
The only thing that stands out initially is that Age has some questionable data: someone who is about 3 and someone who is almost 78. The range for the purposes of this example should be 18 to 65. Normally you would want to clean the data by obtaining the correct information and changing it; for expediency in this example, we will delete the problem records.
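Before deleting anything, it is worth confirming how many records are affected. Comparing the observation counts before and after the cleaning below (8336 versus 8165), it comes to 171 records; a one-line check like this, run on the uncleaned data, gives the same answer:

# Count records outside the plausible 18-65 working-age range (run before cleaning)
sum(MFGEmployees$Age < 18 | MFGEmployees$Age > 65)
## expected: 171, per the before/after record counts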
Clean the data:
MFGEmployees <- subset(MFGEmployees, MFGEmployees$Age >= 18)
MFGEmployees <- subset(MFGEmployees, MFGEmployees$Age <= 65)
Now let's summarize again with the cleaned-up data:
summary(MFGEmployees)
## EmployeeNumber   Surname          GivenName     Gender               City          JobTitle              DepartmentName
## Min.   :   1   Johnson :  103   James  :  180   F:4017   Vancouver      :1751   Cashier      :1663   Customer Service:1695
## 1st Qu.:2081   Smith   :   85   John   :  157   M:4148   Victoria       : 677   Dairy Person :1476   Meats           :1495
## Median :4166   Jones   :   70   Robert :  134            New Westminster: 530   Meat Cutter  :1461   Dairy           :1477
## Mean   :4165   Williams:   69   William:  120            Burnaby        : 332   Baker        :1375   Bakery          :1420
## 3rd Qu.:6245   Brown   :   62   Mary   :  118            Surrey         : 261   Produce Clerk:1101   Produce         :1133
## Max.   :8336   Moore   :   46   Michael:  104            Richmond       : 223   Shelf Stocker: 701   Processed Foods : 735
##                (Other) : 7730   (Other): 7352            (Other)        :4391   (Other)      : 388   (Other)         : 210
##          StoreLocation                 Division         Age          LengthService       AbsentHours        BusinessUnit
## Vancouver      :1807   Executive           :  11   Min.   :18.20   Min.   : 0.05328   Min.   :  0.00   HeadOffice: 172
## Victoria       : 837   FinanceAndAccounting:  73   1st Qu.:35.46   1st Qu.: 3.58261   1st Qu.: 20.07   Stores    :7993
## Nanaimo        : 601   HumanResources      :  75   Median :42.10   Median : 4.59800   Median : 55.86
## New Westminster: 515   InfoTech            :  10   Mean   :41.99   Mean   : 4.78887   Mean   : 60.47
## Kelowna        : 405   Legal               :   3   3rd Qu.:48.51   3rd Qu.: 5.62358   3rd Qu.: 93.38
## Kamloops       : 352   Stores              :7993   Max.   :65.00   Max.   :43.73524   Max.   :252.19
## (Other)        :3648
Transform the data:
Now we calculate the absenteeism rate by dividing absent hours by the total standard hours for the year (52 weeks * 40 hours = 2080):
MFGEmployees$AbsenceRate <- MFGEmployees$AbsentHours / 2080 * 100
str(MFGEmployees)
## 'data.frame': 8165 obs. of 14 variables:
## $ EmployeeNumber: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Surname       : Factor w/ 4051 levels "Aaron","Abadie",..: 1556 1618 941 3415 946 1917 503 2152 3451 222 ...
## $ GivenName     : Factor w/ 1625 levels "Aaron","Abel",..: 1141 1453 265 688 450 514 1260 625 760 1305 ...
## $ Gender        : Factor w/ 2 levels "F","M": 1 2 2 1 2 2 2 2 2 2 ...
## $ City          : Factor w/ 243 levels "Abbotsford","Agassiz",..: 29 52 180 227 144 180 223 192 144 223 ...
## $ JobTitle      : Factor w/ 47 levels "Accounting Clerk",..: 5 5 5 5 5 5 1 5 5 1 ...
## $ DepartmentName: Factor w/ 21 levels "Accounting","Accounts Payable",..: 5 5 5 5 5 5 1 5 5 1 ...
## $ StoreLocation : Factor w/ 40 levels "Abbotsford","Aldergrove",..: 5 18 29 37 21 29 35 38 21 35 ...
## $ Division      : Factor w/ 6 levels "Executive","FinanceAndAccounting",..: 6 6 6 6 6 6 2 6 6 2 ...
## $ Age           : num 32 40.3 48.8 44.6 35.7 ...
## $ LengthService : num 6.02 5.53 4.39 3.08 3.62 ...
## $ AbsentHours   : num 36.6 30.2 83.8 70 0 ...
## $ BusinessUnit  : Factor w/ 2 levels "HeadOffice","Stores": 2 2 2 2 2 2 1 2 2 1 ...
## $ AbsenceRate   : num 1.76 1.45 4.03 3.37 0 ...
We can now see that our metric, AbsenceRate, has been calculated and created.
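As a quick sanity check on the new column: since 2080 assumes a full year of full-time standard hours, every rate should land between 0 and 100 percent. A one-line check:

# AbsenceRate should lie between 0 and 100 (percent of standard hours)
range(MFGEmployees$AbsenceRate)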
Explore The Data
Part of collecting and managing data is ‘Exploratory’ Analysis.
Let's start with bar graphs of some of the categorical data:
counts <- table(MFGEmployees$BusinessUnit)
barplot(counts, main = "EmployeeCount By Business Units", horiz = TRUE)

counts <- table(MFGEmployees$Gender)
barplot(counts, main = "EmployeeCount By Gender", horiz = TRUE)

counts <- table(MFGEmployees$Division)
barplot(counts, main = "EmployeeCount By Division", horiz = TRUE)
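As an aside, since ggplot2 is loaded for the boxplots below, the same counts could also be drawn with geom_bar(); a minimal equivalent sketch for the Division chart:

library(ggplot2)
# Horizontal bar chart of employee counts by Division, as a ggplot2 alternative
ggplot(MFGEmployees, aes(x = Division)) +
  geom_bar() +
  coord_flip() +
  ggtitle("EmployeeCount By Division")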
Let's get some of our questions answered through this exploratory analysis.
First of all, what is our absenteeism rate?
mean(MFGEmployees$AbsenceRate)
## [1] 2.907265

library(ggplot2)
ggplot() + geom_boxplot(aes(y = AbsenceRate, x = 1), data = MFGEmployees) + coord_flip()
The average absence rate is about 2.9%.
Does anyone have excessive absenteeism?
The boxplot shows the median and the spread of the data; observations beyond roughly 1.5 times the interquartile range from the quartiles show up as dots, a common rule-of-thumb definition of outliers. So at least under that definition of outliers, some people show far more absenteeism than the vast majority of employees.
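If we want those outlying employees as data rather than just dots on a chart, base R's boxplot.stats() returns the values the boxplot rule flags; a short sketch:

# Extract the absence rates flagged as outliers by the 1.5 x IQR boxplot rule
outliers <- boxplot.stats(MFGEmployees$AbsenceRate)$out
length(outliers)                         # how many employees are flagged
head(sort(outliers, decreasing = TRUE))  # the most extreme absence rates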
Does it vary across the organization?
library(ggplot2)
library(RcmdrMisc)
## Loading required package: car
## Loading required package: sandwich

ggplot() + geom_boxplot(aes(y = AbsenceRate, x = Gender), data = MFGEmployees) + coord_flip()

AnovaModel.1 <- lm(AbsenceRate ~ Gender, data = MFGEmployees)
Anova(AnovaModel.1)
## Anova Table (Type II tests)
##
## Response: AbsenceRate
##           Sum Sq   Df F value    Pr(>F)
## Gender       496    1  97.773 < 2.2e-16 ***
## Residuals  41379 8163
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Group means by gender
with(MFGEmployees, tapply(AbsenceRate, list(Gender), mean, na.rm = TRUE))
##        F        M
## 3.157624 2.664813
It varies significantly by Gender.
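Since Gender has only two levels here, the same comparison could equivalently be run as a two-sample t-test, which some readers may find more familiar:

# Equivalent two-group comparison of mean absence rates
t.test(AbsenceRate ~ Gender, data = MFGEmployees)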
ggplot() + geom_boxplot(aes(y = AbsenceRate, x = Division), data = MFGEmployees) + coord_flip()

AnovaModel.2 <- lm(AbsenceRate ~ Division, data = MFGEmployees)
Anova(AnovaModel.2)
## Anova Table (Type II tests)
##
## Response: AbsenceRate
##           Sum Sq   Df F value   Pr(>F)
## Division      91    5  3.5617 0.003218 **
## Residuals  41783 8159
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Group means by division
with(MFGEmployees, tapply(AbsenceRate, list(Division), mean, na.rm = TRUE))
##            Executive FinanceAndAccounting       HumanResources
##             2.323580             1.921890             2.651743
##             InfoTech                Legal               Stores
##             1.925995             2.471724             2.920856
It varies significantly by Division.
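The ANOVA tells us that Division matters overall, but not which divisions differ from which. A post-hoc test such as Tukey's Honest Significant Difference (a sketch using base R's aov()) would pin that down:

# Pairwise comparisons of division means with Tukey's HSD
TukeyHSD(aov(AbsenceRate ~ Division, data = MFGEmployees))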
AnovaModel.3 <- lm(AbsenceRate ~ Division*Gender, data = MFGEmployees)
Anova(AnovaModel.3)
## Anova Table (Type II tests)
##
## Response: AbsenceRate
##                 Sum Sq   Df F value    Pr(>F)
## Division            92    5  3.6145  0.002877 **
## Gender             496    1 97.9418 < 2.2e-16 ***
## Division:Gender      5    5  0.1784  0.970783
## Residuals        41283 8153
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Group means by division and gender
with(MFGEmployees, tapply(AbsenceRate, list(Division, Gender), mean, na.rm = TRUE))
##                             F        M
## Executive            2.976419 1.779546
## FinanceAndAccounting 2.172804 1.634077
## HumanResources       3.014491 2.214311
## InfoTech             3.298112 1.773538
## Legal                3.298112 2.058530
## Stores               3.169049 2.680788
Division and Gender each remain significant, but the interaction between them is not (p = 0.97): the gender gap in absence rates looks similar across divisions. These are just a handful of the categorical summaries we could do. Does AbsenceRate vary by length of service and age? Scatterplots and correlations help answer this.
library(RcmdrMisc)
scatterplot(AbsenceRate ~ Age, reg.line = FALSE, smooth = FALSE, spread = FALSE,
            boxplots = FALSE, span = 0.5, ellipse = FALSE, levels = c(.5, .9),
            data = MFGEmployees)
cor(MFGEmployees$Age, MFGEmployees$AbsenceRate)
## [1] 0.8246129
There is a strong correlation between Age and Absence Rate.
library(RcmdrMisc)
scatterplot(AbsenceRate ~ LengthService, reg.line = FALSE, smooth = FALSE, spread = FALSE,
            boxplots = FALSE, span = 0.5, ellipse = FALSE, levels = c(.5, .9),
            data = MFGEmployees)
cor(MFGEmployees$LengthService, MFGEmployees$AbsenceRate)
## [1] -0.04669242
There is not a strong correlation between length of service and Absence Rate.
scatterplot(LengthService ~ Age, reg.line = FALSE, smooth = FALSE, spread = FALSE,
            boxplots = FALSE, span = 0.5, ellipse = FALSE, levels = c(.5, .9),
            data = MFGEmployees)
cor(MFGEmployees$Age, MFGEmployees$LengthService)
## [1] 0.05623405
There is not much correlation between age and length of service either.
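Rather than computing each pairwise correlation separately, we could also look at all three numeric variables at once with a correlation matrix:

# Pairwise correlations among the numeric variables, in one call
round(cor(MFGEmployees[, c("Age", "LengthService", "AbsenceRate")]), 3)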
This is as far as we will go in this article. We will defer the rest of the analyses to the part 2 blog article.