Transportation Costs and Distance to Work Factors
Two other possible indicators of absenteeism are the distance between home and work (the Distance from Residence to Work column) and transportation costs (the Transportation expense column). Employees who have a longer commute, or whose commuting costs are high, might be more prone to absenteeism.
In this section, we will investigate the relationship between these variables and the absence time in hours. Since we have no reason to believe that these factors are related to medical conditions, we will not consider a possible relationship with the Reason for absence column.
First, let's start our analysis by plotting the previously mentioned columns (Distance from Residence to Work and Transportation expense) against the Absenteeism time in hours column:
# plot transportation costs and distance to work against hours
# (jointplot() creates its own figure, so no plt.figure() call is needed;
# the figs/ directory is assumed to exist)
import matplotlib.pyplot as plt
import seaborn as sns
sns.jointplot(x="Distance from Residence to Work",
              y="Absenteeism time in hours",
              data=preprocessed_data, kind="reg")
plt.savefig('figs/distance_vs_hours.png', format='png')
sns.jointplot(x="Transportation expense",
              y="Absenteeism time in hours",
              data=preprocessed_data, kind="reg")
plt.savefig('figs/costs_vs_hours.png', format='png')
Note that, here, we used the seaborn jointplot() function, which not only produces the regression plot between the two variables but also shows the marginal distribution of each variable along the axes. The output will be as follows:
Figure 2.41: Regression plot of distance from work versus absenteeism in hours
Figure 2.42: Regression plot of transportation costs versus absenteeism in hours
As we can see, the distributions of Distance from Residence to Work and Transportation expense look close to normal distributions, while the absenteeism time in hours is heavily right-skewed. This makes the comparison between the variables difficult to interpret. One solution to this problem is to transform the data into something close to a normal distribution. A handy way to perform this transformation is to use the Box-Cox or Yeo-Johnson transformations. Each is a family of functions indexed by a parameter λ, and λ is chosen so that the transformed data is as close to normal as possible.
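To put a number on "heavily right-skewed", one option is the sample skewness. The following is a minimal sketch on synthetic data (the arrays below are illustrative stand-ins, not the absenteeism dataset): a roughly symmetric sample has skewness near zero, while a long right tail gives a clearly positive value.

```python
import numpy as np
from scipy.stats import skew

# Synthetic stand-ins (assumed values, not taken from the dataset).
rng = np.random.default_rng(0)
symmetric = rng.normal(loc=30.0, scale=10.0, size=1000)   # bell-shaped
right_skewed = rng.exponential(scale=7.0, size=1000)      # long right tail

skew_symmetric = skew(symmetric)
skew_right = skew(right_skewed)

print(f"symmetric sample skewness:    {skew_symmetric:.2f}")
print(f"right-skewed sample skewness: {skew_right:.2f}")
```

The same call could be applied to a DataFrame column to check whether a transformation is worth trying.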
The Box-Cox transformation is defined as follows:
$$y^{(\lambda)} = \frac{y^{\lambda} - 1}{\lambda}$$
Figure 2.43: Expression for Box-Cox transformation if λ is not equal to 0
$$y^{(\lambda)} = \ln y$$
Figure 2.44: Expression for Box-Cox transformation if λ is equal to 0
The optimal value of the parameter λ is the one that results in the best approximation of a normal distribution. Note that the Box-Cox transformation is only defined for strictly positive data, so it fails if the data contains negative values or zeros. If this is the case, the Yeo-Johnson transformation can be used:
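$$
y^{(\lambda)} =
\begin{cases}
\dfrac{(y + 1)^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0,\ y \geq 0 \\
\ln(y + 1) & \text{if } \lambda = 0,\ y \geq 0 \\
-\dfrac{(1 - y)^{2 - \lambda} - 1}{2 - \lambda} & \text{if } \lambda \neq 2,\ y < 0 \\
-\ln(1 - y) & \text{if } \lambda = 2,\ y < 0
\end{cases}
$$
Figure 2.45: Expression for Yeo-Johnson transformation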
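To make the definitions concrete, here is a minimal NumPy sketch of the standard Box-Cox and Yeo-Johnson formulas for a fixed λ. The function names boxcox_manual and yeojohnson_manual are hypothetical helpers introduced only for illustration, and the sketch does not estimate λ.

```python
import numpy as np

def boxcox_manual(y, lam):
    """Box-Cox transform for strictly positive y and a fixed lambda."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y ** lam - 1.0) / lam

def yeojohnson_manual(y, lam):
    """Yeo-Johnson transform for any real y and a fixed lambda."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    out = np.empty_like(y)
    nonneg = y >= 0
    # Branch for y >= 0: Box-Cox applied to (y + 1).
    if lam != 0:
        out[nonneg] = ((y[nonneg] + 1.0) ** lam - 1.0) / lam
    else:
        out[nonneg] = np.log1p(y[nonneg])
    # Branch for y < 0: mirrored Box-Cox with exponent (2 - lambda).
    if lam != 2:
        out[~nonneg] = -((1.0 - y[~nonneg]) ** (2.0 - lam) - 1.0) / (2.0 - lam)
    else:
        out[~nonneg] = -np.log1p(-y[~nonneg])
    return out

sample = np.array([-3.0, 0.0, 3.0])
print(boxcox_manual(4.0, 0.5))         # (sqrt(4) - 1) / 0.5
print(yeojohnson_manual(sample, 0.5))  # handles negatives and zero
print(yeojohnson_manual(sample, 1.0))  # lambda = 1 leaves the data unchanged
```

Note that zero is handled by the Yeo-Johnson branch for non-negative values, which is why it suits data containing zeros.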
In Python, both transformations can be found in the scipy.stats module (in the boxcox() and yeojohnson() functions, respectively).
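As a quick illustration of how these two functions behave, the sketch below uses a synthetic right-skewed array with zeros (an assumed stand-in, not the real column). It shows that boxcox() rejects non-positive data, and that yeojohnson() returns both the transformed values and the fitted λ when no lmbda argument is given.

```python
import numpy as np
from scipy.stats import boxcox, yeojohnson

# Synthetic right-skewed "hours" with zeros (illustrative values only).
rng = np.random.default_rng(42)
hours_demo = np.round(rng.exponential(scale=8.0, size=500))
hours_demo[0] = 0.0  # make sure at least one zero is present

# Box-Cox requires strictly positive input, so this raises ValueError.
try:
    boxcox(hours_demo)
    boxcox_error = None
except ValueError as exc:
    boxcox_error = str(exc)

# Without lmbda, yeojohnson() returns (transformed data, fitted lambda).
transformed, fitted_lambda = yeojohnson(hours_demo)

# With lmbda given, only the transformed array is returned.
reapplied = yeojohnson(hours_demo, lmbda=fitted_lambda)

print("Box-Cox error:", boxcox_error)
print(f"Fitted lambda: {fitted_lambda:.3f}")
```

Keeping the fitted λ is useful if the same transformation has to be applied to new data later.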
Since the Absenteeism time in hours column contains zeros, we will apply the Yeo-Johnson transformation in order to reproduce the plots from Figures 2.41 and 2.42:
# run Yeo-Johnson transformation and recreate previous plots
from scipy.stats import yeojohnson
# with no lmbda argument, yeojohnson() returns a tuple
# (transformed data, fitted lambda), so hours[0] holds the transformed values
hours = yeojohnson(preprocessed_data["Absenteeism time in hours"]
                   .astype(float))
distances = preprocessed_data["Distance from Residence to Work"]
expenses = preprocessed_data["Transportation expense"]
# jointplot() creates its own figure and returns a JointGrid
grid = sns.jointplot(x=distances, y=hours[0], kind="reg")
grid.set_axis_labels("Distance from Residence to Work",
                     "Transformed absenteeism time in hours")
plt.savefig('figs/distance_vs_hours_transformed.png', format='png')
grid = sns.jointplot(x=expenses, y=hours[0], kind="reg")
grid.set_axis_labels("Transportation expense",
                     "Transformed absenteeism time in hours")
plt.savefig('figs/costs_vs_hours_transformed.png', format='png')
The output will be as follows:
Figure 2.46: Regression plot of distance from work versus transformed absenteeism in hours
Figure 2.47: Regression plot of transportation costs versus transformed absenteeism in hours
We can also produce kernel density estimation (KDE) plots (that is, plots that help us visualize the probability density functions of continuous variables) by setting the kind parameter of the jointplot() function to "kde":
# produce KDE plots (jointplot() creates its own figure)
grid = sns.jointplot(x=distances, y=hours[0], kind="kde")
grid.set_axis_labels("Distance from Residence to Work",
                     "Transformed absenteeism time in hours")
plt.savefig('figs/distance_vs_hours_transformed_kde.png',
            format='png')
grid = sns.jointplot(x=expenses, y=hours[0], kind="kde")
grid.set_axis_labels("Transportation expense",
                     "Transformed absenteeism time in hours")
plt.savefig('figs/costs_vs_hours_transformed_kde.png',
            format='png')
The KDE plot for distance from residence to work versus absent hours will be as follows:
Figure 2.48: KDE plot for distance from residence to work versus absent hours
The KDE plot for transport expense versus absent hours will be as follows:
Figure 2.49: KDE plot for transport expense versus absent hours
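As background on what such a plot estimates, the sketch below fits a two-dimensional Gaussian kernel density estimate with scipy.stats.gaussian_kde on synthetic data (assumed values, not the dataset). It only illustrates the idea; seaborn's own estimator may differ in bandwidth and implementation details.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic stand-ins for a (distance, transformed hours) cloud.
rng = np.random.default_rng(1)
x_demo = rng.normal(loc=30.0, scale=10.0, size=300)
y_demo = rng.normal(loc=2.0, scale=1.0, size=300)

# gaussian_kde expects an array of shape (n_dimensions, n_points).
kde = gaussian_kde(np.vstack([x_demo, y_demo]))

# Evaluate the estimated density near the bulk of the data and far from it.
density_center = kde([[30.0], [2.0]])[0]
density_far = kde([[80.0], [8.0]])[0]

print(f"density near the centre: {density_center:.5f}")
print(f"density far away:        {density_far:.5f}")
```

The contour lines in a KDE plot connect points where this estimated density takes the same value.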
From Figures 2.46 and 2.47, we can also see that the regression line is almost flat for the Distance from Residence to Work column (which suggests a correlation close to zero) but has a slight upward slope for the Transportation expense column. Therefore, we can expect a small positive correlation there, which we can check with the Pearson correlation coefficient:
# investigate correlation between the columns
from scipy.stats import pearsonr
distance_corr = pearsonr(hours[0], distances)
expenses_corr = pearsonr(hours[0], expenses)
print(f"Distances correlation: corr={distance_corr[0]:.3f}, "
      f"pvalue={distance_corr[1]:.3f}")
print(f"Expenses comparison: corr={expenses_corr[0]:.3f}, "
      f"pvalue={expenses_corr[1]:.3f}")
The output will be as follows:
Distances correlation: corr=-0.000, pvalue=0.999
Expenses comparison: corr=0.113, pvalue=0.002
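These results confirm our observations: there is no measurable correlation between Distance from Residence to Work and the transformed Absenteeism time in hours (corr=-0.000, pvalue=0.999), while Transportation expense shows a slight but statistically significant positive correlation (corr=0.113, pvalue=0.002).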
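When reading such numbers, it helps to remember that a small p-value does not imply a strong effect: a coefficient of 0.113 corresponds to r² ≈ 0.013, that is, about 1.3% of shared variance. The sketch below, on synthetic data (assumed values, not the dataset), computes the Pearson coefficient by hand and checks it against scipy.stats.pearsonr.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-ins with a weak linear relationship (illustrative only).
rng = np.random.default_rng(7)
expenses_demo = rng.normal(loc=220.0, scale=65.0, size=700)
hours_demo = 0.002 * expenses_demo + rng.normal(loc=0.0, scale=1.0, size=700)

# Library result: correlation coefficient and two-sided p-value.
r, p_value = pearsonr(hours_demo, expenses_demo)

# Manual Pearson r: covariance divided by the product of standard deviations.
x_centered = expenses_demo - expenses_demo.mean()
y_centered = hours_demo - hours_demo.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# r squared is the share of variance explained by a linear fit.
r_squared = r ** 2

print(f"r={r:.3f}, r_manual={r_manual:.3f}, "
      f"r_squared={r_squared:.3f}, pvalue={p_value:.3f}")
```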