To investigate this issue, we first need to drop the negative cases in hours_wo. Remember that it is usually a good idea to tabulate categorical variables before using them, to check for negative values and other possible coding problems that need to be resolved before you arrive at any definitive conclusions about relationships. In our case, hours_wo has negative values that should be omitted when regressing this variable. There are several ways of handling this issue, but in this case we choose to create a new variable that contains only positive values (zeros are also set to missing, for the reason given in the note below). To accomplish this we need the following:
gen hours = hours_wo
replace hours = . if hours <= 0 [Note: we are interested in those who "work" thus we eliminate 0's]
reg incmon hours
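The recode above can be checked with a few standard Stata commands. The following is only a minimal sketch, assuming the dataset containing hours_wo and incmon is already loaded and that the gen and replace lines above have been run; the counts it reports depend on your data.

```stata
* Sketch: inspect hours_wo and confirm the recode into hours.
* Assumes hours_wo exists and hours was created as shown above.

* How many cases in the original variable are negative, zero, or missing?
count if hours_wo < 0
count if hours_wo == 0
count if missing(hours_wo)

* The new variable should have a minimum above zero.
summarize hours

* Stops with an error if any non-missing value of hours is <= 0.
* (Stata treats missing as larger than any number, so missing cases pass.)
assert hours > 0
```

Because regress automatically drops observations with missing values, the cases set to missing here are simply left out of the estimation sample. These checks do not change the regression of incmon on hours itself.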
The regression of incmon on hours yields the following results:
  Source |       SS       df       MS                  Number of obs =     621
---------+------------------------------               F(  1,   619) =    0.49
   Model |  2203133.62     1  2203133.62               Prob > F      =  0.4832
Residual |  2.7704e+09   619  4475575.66               R-squared     =  0.0008
---------+------------------------------               Adj R-squared = -0.0008
   Total |  2.7726e+09   620  4471910.43               Root MSE      =  2115.6

------------------------------------------------------------------------------
  incmon |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
   hours |  -4.578381   6.525533     -0.702   0.483      -17.39325    8.236484
   _cons |   1785.673   303.2978      5.888   0.000       1190.055     2381.29
------------------------------------------------------------------------------
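It seems that the number of hours a person works does not effectively predict a person's income. How do we know that? The R-squared tells us what proportion of the variation in the dependent variable is explained by our independent variable, that is, how much "predicting power" it has. From the table above, we can see that the R-squared is only 0.0008, so hours accounts for less than 0.1% of the variation in incmon. The coefficient on hours is also not statistically significant (P>|t| = 0.483), and its 95% confidence interval (-17.39 to 8.24) includes zero.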
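The figures quoted from the table can be reproduced by hand: the R-squared is the Model sum of squares divided by the Total sum of squares (2203133.62 / 2.7726e+09, roughly 0.0008), and the t statistic is the coefficient divided by its standard error (-4.578381 / 6.525533, roughly -0.702). As a rough sketch, the same quantities can be recomputed from the results Stata stores after regress, assuming these lines are run immediately after reg incmon hours:

```stata
* Sketch: recompute the reported statistics from regress's stored results.
* Run directly after: reg incmon hours

* R-squared = model SS / total SS, where total SS = model SS + residual SS.
display e(mss) / (e(mss) + e(rss))
display e(r2)

* t statistic for hours = coefficient / standard error.
display _b[hours] / _se[hours]

* Two-sided p-value from the t distribution with e(df_r) residual
* degrees of freedom (619 in the table above).
display 2 * ttail(e(df_r), abs(_b[hours] / _se[hours]))
```

The first two display lines should agree with each other and with the R-squared in the output, and the last two should match the t and P>|t| columns for hours up to rounding.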
Without going through all the regressions, what are some other variables that may be important in predicting income? Someone living in South Africa may guess that province, urban vs. rural, and race all may play a role in predicting income. The following sections will teach us how to run regressions using these types of third variables.