The Data Analysis Workshop

Age and Education Factors

Age and education may also influence employees' absenteeism. For instance, older employees might need more frequent medical treatment, while employees with higher education degrees, who often hold positions of greater responsibility, might be less prone to absence.

First, let's investigate the correlation between age and absence hours. We will create a regression plot with the Age column on the x axis and Absenteeism time in hours on the y axis. We'll also include Pearson's correlation coefficient and its p-value, where the null hypothesis is that the correlation coefficient between the two features is equal to zero:

import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr

# compute Pearson's correlation coefficient and p-value
pearson_test = pearsonr(preprocessed_data["Age"],
                        preprocessed_data["Absenteeism time in hours"])

# create regression plot and add correlation coefficient in the title
plt.figure(figsize=(10, 6))
ax = sns.regplot(x="Age", y="Absenteeism time in hours",
                 data=preprocessed_data, scatter_kws={"alpha": 0.1})
ax.set_title(f"Correlation={pearson_test[0]:.03f} "
             f"| p-value={pearson_test[1]:.03f}")
plt.savefig('figs/correlation_age_hours.png',
            format='png', dpi=300)

The output will be as follows:

Figure 2.33: Correlation plot for absenteeism time and age

As we can observe from the resulting plot, no clear pattern emerges. Furthermore, the correlation coefficient is very small (0.066), and its p-value is above the 0.05 threshold, so we cannot reject the null hypothesis of zero correlation. This suggests that there is no meaningful linear relationship between the Age and Absenteeism time in hours features.
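To make the reading of the coefficient and its p-value more concrete, here is a minimal sketch that runs pearsonr on synthetic data rather than on the absenteeism dataset. The arrays and variable names are illustrative assumptions: one series depends linearly on x, while the other merely alternates in sign and is almost unrelated to x.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic data, used only for illustration (not the absenteeism dataset).
x = np.arange(100, dtype=float)

# A perfectly linear relationship with x.
y_linear = 2.0 * x + 1.0
r_linear, p_linear = pearsonr(x, y_linear)

# An alternating +1/-1 pattern that carries almost no linear trend in x.
y_alternating = np.where(x % 2 == 0, 1.0, -1.0)
r_alt, p_alt = pearsonr(x, y_alternating)

print(f"linear:      r={r_linear:.3f}, p-value={p_linear:.3f}")
print(f"alternating: r={r_alt:.3f}, p-value={p_alt:.3f}")
```

In the first case the coefficient is close to 1 and the p-value falls below 0.05, so the null hypothesis of zero correlation is rejected. In the second case the coefficient is close to 0 and the p-value is large, which mirrors the situation observed for Age and Absenteeism time in hours.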

We can also check whether age has some impact on the reason for absence. We'll perform this analysis in the next exercise.

Exercise 2.04: Investigating the Impact of Age on Reason for Absence

In this exercise, we'll investigate the relationship between the Age feature and the various reasons for absence. Please execute the code mentioned in the previous section and exercises before attempting this exercise. Now, follow these steps:

  1. First, create a violin plot between the Age and Disease features. This will give you your first insight into the relationship between the two columns:

    # create violin plot between the Age and Disease columns
    plt.figure(figsize=(8, 6))
    sns.violinplot(x="Disease", y="Age", data=preprocessed_data)
    plt.savefig('figs/exercise_204_age_disease.png',
                format='png', dpi=300)

    The output will be as follows:

    Figure 2.34: Violin plot for disease versus age

  2. From Step 1, you can see some differences between the two distributions of age. In fact, for samples with ICD-encoded reasons for absence (labeled Yes in the Disease column), you can observe that slightly more samples are present for older employees. To check whether this difference is statistically significant, perform hypothesis tests on the means and on the distributions of the two groups:

    from scipy.stats import ttest_ind, ks_2samp

    # get Age entries for employees with Disease == Yes and Disease == No
    disease_mask = preprocessed_data["Disease"] == "Yes"
    disease_ages = preprocessed_data["Age"][disease_mask]
    no_disease_ages = preprocessed_data["Age"][~disease_mask]

    # perform hypothesis test for equality of means
    test_res = ttest_ind(disease_ages, no_disease_ages)
    print("Test for equality of means: "
          f"statistic={test_res[0]:0.3f}, pvalue={test_res[1]:0.3f}")

    # test equality of distributions via Kolmogorov-Smirnov test
    ks_res = ks_2samp(disease_ages, no_disease_ages)
    print("KS test for equality of distributions: "
          f"statistic={ks_res[0]:0.3f}, pvalue={ks_res[1]:0.3f}")

    The output will be as follows:

    Test for equality of means: statistic=0.630, pvalue=0.529
    KS test for equality of distributions: statistic=0.057, pvalue=0.619

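    From the results of the two tests, you can conclude that there is no statistically significant difference between the age distributions of the two groups. Together with the weak correlation found earlier, this suggests that age is not a useful indicator of either the length of an absence or its type (disease-related or not).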

  3. Now investigate the relationship between age and reason for absence:

    # violin plot of reason for absence vs age
    plt.figure(figsize=(20, 8))
    sns.violinplot(x="Reason for absence", y="Age",
                   data=preprocessed_data)
    plt.savefig('figs/exercise_204_age_reason.png', format='png')

    The output will be as follows:

Figure 2.35: Violin plot for age and reason for absence

In light of the previously performed analysis, you can conclude that age shows no clear association with the employees' absenteeism in this dataset.
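The two hypothesis tests from Step 2 follow a pattern that is easy to reuse for other group comparisons. The sketch below wraps them in a small helper; the function name compare_groups, the alpha parameter, and the synthetic age arrays are assumptions made for illustration and are not part of the original exercise.

```python
import numpy as np
from scipy.stats import ks_2samp, ttest_ind


def compare_groups(group_a, group_b, alpha=0.05):
    """Run a two-sample t-test and a KS test; flag p-values below alpha."""
    t_res = ttest_ind(group_a, group_b)
    ks_res = ks_2samp(group_a, group_b)
    return {
        "t_pvalue": float(t_res.pvalue),
        "ks_pvalue": float(ks_res.pvalue),
        "means_differ": bool(t_res.pvalue < alpha),
        "distributions_differ": bool(ks_res.pvalue < alpha),
    }


# Synthetic ages, for illustration only (not the absenteeism dataset).
ages_a = np.arange(25, 55)       # ages 25..54
ages_same = ages_a.copy()        # identical sample
ages_shifted = ages_a + 20       # same shape, shifted by 20 years

result_same = compare_groups(ages_a, ages_same)
result_shifted = compare_groups(ages_a, ages_shifted)
print(result_same)
print(result_shifted)
```

For the identical samples neither test flags a difference, while for the shifted sample both do. In the exercise above, both p-values (0.529 and 0.619) are well above 0.05, which corresponds to the first case.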

Note

To access the source code for this specific section, please refer to https://packt.live/2Y7jEj6.

You can also run this example online at https://packt.live/3d7q5qD. You must execute the entire Notebook in order to get the desired result.

Now, let's analyze the impact of education level on absenteeism.

Exercise 2.05: Investigating the Impact of Education on Reason for Absence

In this exercise, you will analyze the existing relationship between the Reason for absence and Education columns. You will start by looking at the percentage of employees with a certain educational degree, and then relate those numbers to the various reasons for absence. Please execute the code mentioned in the previous section and exercises before attempting this exercise. Now, follow these steps:

  1. Before starting the analysis, check the percentage of employees in the data that hold a certain degree:

    # compute percentage of employees per education level
    education_types = ["high_school", "graduate",
                       "postgraduate", "master_phd"]
    counts = preprocessed_data["Education"].value_counts()
    percentages = preprocessed_data["Education"].value_counts(normalize=True)
    for educ_type in education_types:
        print(f"Education type: {educ_type:12s} "
              f"| Counts : {counts[educ_type]:6.0f} "
              f"| Percentage: {100*percentages[educ_type]:4.1f}")

    The output will be as follows:

    Education type: high_school  | Counts :    611 | Percentage: 82.6
    Education type: graduate     | Counts :     46 | Percentage:  6.2
    Education type: postgraduate | Counts :     79 | Percentage: 10.7
    Education type: master_phd   | Counts :      4 | Percentage:  0.5

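    You can see that most of the employees in the data have a high school degree (82.6%), which means that the data is heavily skewed toward this group. Keep this imbalance in mind when comparing groups: the master_phd level, for example, has only 4 entries.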

  2. Create a distribution plot of the number of hours of absence, based on the level of education of the employees:

    # distribution of absence hours, based on education level
    plt.figure(figsize=(8, 6))
    sns.violinplot(x="Education", y="Absenteeism time in hours",
                   data=preprocessed_data,
                   order=["high_school", "graduate",
                          "postgraduate", "master_phd"])
    plt.savefig('figs/exercise_205_education_hours.png', format='png')

    The output will be as follows:

    Figure 2.36: Violin plot for number of hours of absence for each level of education

  3. It seems most of the extreme cases of absence are among employees with lower education levels. Compute the mean and standard deviation of the absence duration for the different levels of education:

    # compute mean and standard deviation of absence hours
    education_types = ["high_school", "graduate",
                       "postgraduate", "master_phd"]
    for educ_type in education_types:
        mask = preprocessed_data["Education"] == educ_type
        hours = preprocessed_data["Absenteeism time in hours"][mask]
        mean = hours.mean()
        stddev = hours.std()
        print(f"Education type: {educ_type:12s} | Mean : {mean:.03f} "
              f"| Stddev: {stddev:.03f}")

    The output will be as follows:

    Education type: high_school  | Mean : 7.190 | Stddev: 14.259
    Education type: graduate     | Mean : 6.391 | Stddev: 6.754
    Education type: postgraduate | Mean : 5.266 | Stddev: 7.963
    Education type: master_phd   | Mean : 5.250 | Stddev: 3.202

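    You can see that the mean of the hours of absence decreases as the education level increases. The standard deviation is also markedly lower for every level above high_school, although it is not strictly decreasing (6.754 for graduate versus 7.963 for postgraduate). This suggests that highly educated employees tend to have shorter absences. Keep in mind that the master_phd group contains only 4 entries, so its statistics are not very reliable. Of course, a higher level of education is not necessarily the cause of this phenomenon; it is better viewed as an indicator that correlates with it.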

  4. Now, check the reasons for absence based on the education level:

    # plot reason for absence, based on education level
    plt.figure(figsize=(10, 16))
    sns.countplot(data=preprocessed_data, y="Reason for absence",
                  hue="Education",
                  hue_order=["high_school", "graduate",
                             "postgraduate", "master_phd"])
    plt.savefig('figs/exercise_205_education_reason.png', format='png')

    The output will be as follows:

Figure 2.37: Reasons for absence for each level of education

From the preceding plot, you can observe that most of the absences relate to employees with a high_school level of education. This is largely because most of the employees only have a high school degree (as observed in Step 1). Furthermore, from our analysis in Step 2, we saw that most of the longest absences were among employees with a high_school education level.
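The per-level mean and standard deviation from Step 3 can also be obtained in a single call with groupby and agg. The following is a minimal sketch on a small toy DataFrame; the toy values are made up for illustration, and only the column names mirror the dataset.

```python
import pandas as pd

# Toy data for illustration only; column names mirror the dataset.
toy_data = pd.DataFrame({
    "Education": ["high_school", "high_school", "high_school",
                  "graduate", "graduate"],
    "Absenteeism time in hours": [8, 2, 40, 4, 6],
})

# One row per education level, with group size, mean and standard deviation.
summary = (toy_data
           .groupby("Education")["Absenteeism time in hours"]
           .agg(["count", "mean", "std"]))
print(summary)
```

Including the count column next to the mean and standard deviation makes it easier to spot groups that are too small to support firm conclusions. The std aggregation uses the same sample standard deviation (ddof=1) as the Series.std() call in Step 3.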

One question that comes to mind is whether the probability of being absent for more than one working week (40 hours) is greater for employees with a high school degree than for employees with a higher degree. To address this question, use the definition of conditional probability:

Figure 2.38: Conditional probability for extreme absences by employees with a high school degree

Figure 2.39: Conditional probability for extreme absences by employees without a high school degree
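The formulas shown in the two figures are not reproduced in this text. A reconstruction from the definition of conditional probability, consistent with the quantities computed in the code snippet that follows, might read as below. The symbols are notation introduced here as an assumption: E is the event of an absence longer than 40 hours, H is the event that the Education value is high_school, and \bar{H} is its complement.

```latex
P(E \mid H) = \frac{P(E \cap H)}{P(H)},
\qquad
P(E \mid \bar{H}) = \frac{P(E \cap \bar{H})}{P(\bar{H})}
```

In the snippet, P(H) corresponds to prob_high_school, P(\bar{H}) to prob_graduate, P(E \cap H) to prob_extreme_high_school, and P(E \cap \bar{H}) to prob_extreme_graduate.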

The following code snippet computes the conditional probabilities:

# define threshold for extreme hours of absenteeism
# and get total number of entries
threshold = 40
total_entries = len(preprocessed_data)

# find entries with Education == high_school
high_school_mask = preprocessed_data["Education"] == "high_school"

# find entries with absenteeism time in hours more than threshold
extreme_mask = preprocessed_data["Absenteeism time in hours"] > threshold

# compute probability of having high school degree
prob_high_school = len(preprocessed_data[high_school_mask]) / total_entries

# compute probability of having more than high school degree
prob_graduate = len(preprocessed_data[~high_school_mask]) / total_entries

# compute probability of having high school and
# being absent for more than "threshold" hours
prob_extreme_high_school = len(
    preprocessed_data[high_school_mask & extreme_mask]) / total_entries

# compute probability of having more than high school and
# being absent for more than "threshold" hours
prob_extreme_graduate = len(
    preprocessed_data[~high_school_mask & extreme_mask]) / total_entries

# compute and print conditional probabilities
cond_prob_extreme_high_school = prob_extreme_high_school / prob_high_school
cond_prob_extreme_graduate = prob_extreme_graduate / prob_graduate
print("P(extreme absence | degree = high_school) = "
      f"{100*cond_prob_extreme_high_school:3.2f}")
print("P(extreme absence | degree != high_school) = "
      f"{100*cond_prob_extreme_graduate:3.2f}")

preprocessed_data.head().T

The output will be as follows:

P(extreme absence | degree = high_school) = 2.29

P(extreme absence | degree != high_school) = 0.78

The preprocessed data now looks as follows:

Figure 2.40: Analysis of data

Note

To access the source code for this specific section, please refer to https://packt.live/3fxhorg.

You can also run this example online at https://packt.live/2YDVBr0. You must execute the entire Notebook in order to get the desired result.

From the preceding computations, we can see that the probability of an absence of more than 40 hours for employees with a high school degree is 2.29%, which is approximately three times the same probability for employees with a higher degree (0.78%).
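The same conditional probabilities can be computed more compactly by averaging a Boolean mask within each group, since the mean of a Boolean Series is the fraction of True values. The sketch below uses a small toy DataFrame with made-up values; the group labels "high_school" and "other" are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data for illustration only; column names mirror the dataset.
toy_data = pd.DataFrame({
    "Education": ["high_school"] * 4 + ["graduate"] * 2,
    "Absenteeism time in hours": [8, 48, 2, 120, 4, 16],
})

threshold = 40
is_extreme = toy_data["Absenteeism time in hours"] > threshold
is_high_school = toy_data["Education"] == "high_school"

# Label each row by group, then take the mean of is_extreme per group.
# The mean within a group equals P(extreme absence | group).
group = is_high_school.map({True: "high_school", False: "other"})
cond_probs = is_extreme.groupby(group).mean()

# Cross-check against the definition P(E and H) / P(H).
by_definition = (is_extreme & is_high_school).mean() / is_high_school.mean()

print(cond_probs)
print(f"By definition: {by_definition:.2f}")
```

Both approaches give the same result for the high_school group, because dividing the joint probability by the group probability is equivalent to counting extreme absences within the group only.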