[Image: The Data Analysis Workshop]
Introduction
In the previous chapter, we looked at some of the main techniques that are used in data analysis. We saw how hypothesis testing can be used when analyzing data, we got a brief introduction to visualizations, and finally, we explored some concepts related to time series analysis. In this chapter, we will elaborate on some of the topics we've already looked at (such as plotting and hypothesis testing) while introducing new ones coming from probability theory and data transformations.
Nowadays, work relationships are becoming more and more trust-oriented, and conservative contracts (in which working time is strictly monitored) are being replaced with more agile ones in which employees themselves are responsible for accounting for their working time. This freedom may lead to unregulated absenteeism, which can reflect poorly on an employee's candidature even when the absent hours have genuine reasons behind them. This can significantly undermine healthy working relationships. Furthermore, unregulated absenteeism can also have a negative impact on work productivity.
In this chapter, we'll analyze absenteeism data from a Brazilian courier company, collected between July 2007 and July 2010.
Note
The original dataset can be found here: https://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work.
If you're interested, take a look at the following paper, which approaches the problem from a machine learning perspective: Martiniano, A., Ferreira, R.P., Sassi, R.J., & Affonso, C. (2012). Application of a neuro fuzzy network in prediction of absenteeism at work. In Information Systems and Technologies (CISTI), 7th Iberian Conference on (pp. 1-4). IEEE.
This dataset can also be found in our GitHub repository here: https://packt.live/3e4rorX.
Our goal is to discover hidden patterns in the data that might be useful for distinguishing genuine work absences from fraudulent ones. This chapter covers the following topics:
- Introduction to probability, conditional probability, and Bayes' theorem
- Kolmogorov-Smirnov tests for equality of probability distributions
- Box-Cox and Yeo-Johnson transformations
We will apply these techniques to our analysis as we try to identify the main drivers for absenteeism.
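To make the topics listed above a little more concrete before we start, here is a minimal sketch of two of them in plain Python: Bayes' theorem and the Box-Cox and Yeo-Johnson formulas for a fixed parameter. The function names (`bayes_posterior`, `box_cox`, `yeo_johnson`) and all of the probabilities are assumptions made up for illustration; they are not taken from the absenteeism dataset. In practice, the transformation parameter is usually estimated from the data (for example, by `scipy.stats.boxcox` and `scipy.stats.yeojohnson`) rather than fixed by hand.

```python
import math


def bayes_posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence


def box_cox(x, lam):
    """Box-Cox transform of a single strictly positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox is only defined for x > 0")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam


def yeo_johnson(x, lam):
    """Yeo-Johnson transform; unlike Box-Cox, it also accepts x <= 0."""
    if x >= 0:
        if lam == 0:
            return math.log1p(x)
        return ((x + 1) ** lam - 1) / lam
    if lam == 2:
        return -math.log1p(-x)
    return -(((1 - x) ** (2 - lam)) - 1) / (2 - lam)


# Hypothetical probabilities, invented purely for illustration.
p_a = 0.2              # P(A): an absence is "long"
p_b_given_a = 0.5      # P(B|A): reason is medical, given a long absence
p_b_given_not_a = 0.25 # P(B|not A): reason is medical, given a short absence

# Law of total probability gives the evidence term P(B).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior P(A|B): probability of a long absence given a medical reason.
posterior = bayes_posterior(p_a, p_b_given_a, p_b)
print(f"P(A|B) = {posterior:.4f}")

# Power transforms with a hand-picked lambda (normally estimated from data).
print(box_cox(4.0, 0.5))
print(yeo_johnson(-3.0, 1.5))
```

The Kolmogorov-Smirnov test is left out of this sketch because it is more convenient to run with a statistics library than to write by hand.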