Practical Predictive Analytics

Code description

Here are some notes and explanations for the script code that you have just run:

  • lm() function: This function fits a simple linear regression. Here it predicts women's height based upon the value of their weight. In statistical parlance, you will be regressing height on weight. The line of code that accomplishes this is:
        lm_output <- lm(women$height ~ women$weight) 
  • There are two operations that you will become very familiar with when running predictive models in R:
    • The ~ operator: Also called the tilde, this is shorthand for separating what you want to predict from what you are using to predict it. This is an expression in formula syntax. What you are predicting (the dependent or target variable) is usually on the left side of the formula, and the predictors (independent variables, features) are on the right side. To improve readability, the independent variable (weight) and dependent variable (height) are specified using $ notation, which consists of the object name, $, and then the dataframe column. So women's height is referenced as women$height and women's weight is referenced as women$weight. Alternatively, you can call the attach() function and then refer to these columns by the names height and weight alone. For example, the following code would achieve the same results:
          attach(women)
          lm_output <- lm(height ~ weight)
    • The <- operator: Also called the assignment operator. It assigns the result of evaluating the expression on its right side to the object named on its left side, creating that object if it does not yet exist or replacing it if it does. You can then display or manipulate the object further. In this case, we create a new object called lm_output from the result of lm(), which fits a linear model based on the formula contained within the parentheses.
Note that the execution of this line does not produce any displayed output. You can see whether the line was executed by checking the console. If there is any problem with running the line (or any line for that matter), you will see an error message in the console.
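As a side note, lm() also accepts a data argument, which lets you write bare column names in the formula without calling attach(). The following is a minimal sketch, assuming the built-in women dataset used above; the object names lm_dollar, lm_data, and same_fit are illustrative only and are not part of the original script.

```r
# Fit the same model two ways using the built-in women dataset.
lm_dollar <- lm(women$height ~ women$weight)     # $ notation, as in the script
lm_data   <- lm(height ~ weight, data = women)   # data argument (illustrative alternative)

# The coefficient names differ ("women$weight" vs "weight"),
# so strip the names before comparing the estimates.
same_fit <- all.equal(unname(coef(lm_dollar)), unname(coef(lm_data)))
print(same_fit)
```

Both calls should produce the same intercept and slope; only the label attached to the weight coefficient changes.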
  • summary(lm_output): The following statement displays some important summary information about the object lm_output and writes the output to the R Console, as pictured previously:
        summary(lm_output) 
  • The results will appear in the Console window, as pictured in the previous figure. To keep things a little simpler for now, I will show only the first few lines of the output and point out what you should be looking at. Do not be discouraged by the amount of output produced.
  • Look at the lines marked (Intercept) and women$weight, which appear under the Coefficients line in the console.
        Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
        (Intercept)  25.723456   1.043746   24.64 2.68e-12 ***
        women$weight  0.287249   0.007588   37.85 1.09e-14 ***
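If you would rather pull these numbers out programmatically than read them off the console, the coefficient table can be extracted from the summary object. This is a sketch that assumes the model is fitted exactly as shown earlier; coef_table and estimates are hypothetical names introduced here for illustration.

```r
# Refit the model so this snippet runs on its own.
lm_output <- lm(women$height ~ women$weight)

# coef() applied to the summary returns the coefficient table as a matrix
# with the columns Estimate, Std. Error, t value and Pr(>|t|).
coef_table <- coef(summary(lm_output))

# Keep only the Estimate column: the intercept and the slope for weight.
estimates <- coef_table[, "Estimate"]
print(round(estimates, 4))
```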
  • The Estimate column contains the coefficients of the linear regression formula needed to derive height from weight. We can use these numbers along with a calculator to determine the prediction ourselves. For our example, the output tells us to perform the following steps for each observation in our dataframe in order to obtain the prediction for height. We will obviously not want to do this by hand for all of the observations (R will do that via the predict() function), but we will illustrate the calculation for one data point:
    • Take the weight value for the observation. Let's take the weight of the first woman, which is 115 lbs.
    • Then, multiply the weight by 0.2872, the number listed under Estimate for women$weight. Multiplying 115 lbs. by 0.2872 yields 33.028.
    • Then add 25.7235, which is the estimate in the (Intercept) row. That yields a prediction of 58.75 inches.
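The three steps above can also be checked against the value R computes itself. The sketch below assumes the same model on the built-in women dataset and uses fitted() to retrieve the model's prediction for the first observation; b, first_weight, manual_prediction, and r_prediction are illustrative names. Because it uses the unrounded coefficients, the result can differ slightly in the second decimal place from the hand calculation.

```r
# Refit the model so this snippet runs on its own.
lm_output <- lm(women$height ~ women$weight)

# Unrounded coefficients: b[1] is the intercept, b[2] is the slope for weight.
b <- unname(coef(lm_output))

# Weight of the first woman in the dataset (115 lbs).
first_weight <- women$weight[1]

# Manual prediction: intercept + slope * weight.
manual_prediction <- b[1] + b[2] * first_weight

# The model's own fitted value for the first observation.
r_prediction <- unname(fitted(lm_output)[1])

print(c(manual = manual_prediction, fitted = r_prediction))
```

The two values should agree, which confirms that the fitted value is nothing more than the intercept plus the slope times the weight.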
  • If you do not have a calculator handy, the calculation is easily done in calculator mode via the R Console, by typing the following: