TABLE OF CONTENTS
Introduction
Dummy Variables
Interactions with Dummy Variables
Linear Transformations of non-Linear relationships
Transformations using Squared Terms
Transformations using the natural Logarithm
Exercises
INTRODUCTION
In Module 6, we learned about simple bivariate regression. Now it is time to move on to the more complex, but more exciting, topic of multiple regression.
Let's quickly review what we know about simple regression analysis. In general form, the simple linear regression model has one independent variable (X) and one dependent variable (Y). In multiple regression, the dependent variable Y is assumed to be a function of a set of K independent variables: X1, X2, X3, ..., Xk. This yields a new regression equation, an extension of the one in Module 6:
Y = a + b1X1 + b2X2 + ... + bkXk
As with the simple regression equation, the interpretation of each of these coefficients is straightforward. Each "b" is a partial slope coefficient. Put differently, each "b" coefficient is the slope of the relationship between that particular independent variable X and the dependent variable Y when all other independent variables in the model are "held constant," that is, fixed at any given value. For example, the b1 coefficient refers to the slope between X1 and the dependent variable Y when all other variables in the equation, X2, X3, etc., do not change. Similarly, the value for b2 is the slope for the relationship between X2 and the dependent variable Y when all other variables, X1, X3, etc., do not change. As in Module 6, the "a" refers to the intercept, also known as the constant. This value is the value of predicted Y (yhat) when all of the independent variables, X1, X2, X3, etc., are equal to zero. Thus, multiple regression allows us to state the relationship between two main variables while controlling for other factors; these controlled relationships are also known as partial effects.
It should be obvious how useful this approach can be for quantitative social researchers, since we are often interested in social phenomena that go beyond a basic bivariate relationship. As mentioned in the previous module, we might be interested in whether the relationship between total monthly household income and total monthly household expenditures varies by rural setting. Or we might ask whether that relationship is not a matter of household income, but rather of how many household members are present in the home. Questions of this kind require multiple regression. This new approach will allow us to investigate the initial relationship while controlling for a third, a fourth, or any number of additional factors.
Therefore, in this module we will investigate in depth the relationship between income (incmon) and education (educ_c). In particular, we are hypothesizing that the amount of income earned by an individual depends on that individual's level of education. First, we will need to recode the education variable into linear form. In Module 3 we had to recode it to make sure that value 18 (preschool) was less than value 16 (college degree). If you have not already done so, recode the original educ_c variable into educ_new as follows:
#delimit ;
generate educ_new = educ_c;
label var educ_new "Recoded Education";
replace educ_new = 0 if educ_c == 17;
replace educ_new = 0 if educ_c == 18;
replace educ_new = . if educ_c == -3;
replace educ_new = . if educ_c == -4;
replace educ_new = 9 if educ_c == 11;
replace educ_new = 12 if educ_c == 12;
replace educ_new = 12 if educ_c == 13;
replace educ_new = 12 if educ_c == 14;
replace educ_new = 12 if educ_c == 15;
replace educ_new = . if educ_c == 19;
We should always check that our recoding worked. So let's type:
tab educ_new
We should see the following:
-> tabulation of educ_new
Recoded |
Education | Freq. Percent Cum.
------------+-----------------------------------
0 | 1403 27.44 27.44
1 | 663 12.97 40.41
2 | 293 5.73 46.14
3 | 315 6.16 52.30
4 | 331 6.47 58.77
5 | 374 7.31 66.09
6 | 435 8.51 74.59
7 | 252 4.93 79.52
8 | 326 6.38 85.90
9 | 234 4.58 90.48
10 | 344 6.73 97.20
12 | 108 2.11 99.32
16 | 35 0.68 100.00
------------+-----------------------------------
Total | 5113 100.00
Everything looks good so far. Now, as we have done in the past, we want to quickly check our variables with the correlate command to make sure the relationship is coded in the direction we expect.
NOTE: We need an age qualifier for our analysis. When using a wage or income variable, it is important to keep in mind that not everyone in the population is of working age, so the variable often applies only to a certain group in the population. In the United States context, for example, working age is normally taken to be ages 25 to 65. For our example here, we will use this same age range. Since we do not want to include an age qualifier in every STATA command, we will keep only those between the ages of 25 and 65. To keep our sample consistent throughout this module, we will also keep only those cases that reported an income and those with a known education level:
[NOTE: Remember NOT to save over your original data set after you're done with this module]
keep if age>=25 & age<=65
keep if (incmon>=0 & incmon~=.)
keep if educ_new~=.
Let's type:
corr incmon educ_new
| incmon educ_new
-------------+------------------
incmon | 1.0000
educ_new | 0.5984 1.0000
The correlation between educ_new and incmon is 0.5984, which is in line with our hypothesis: education level is associated with income earned. With a correlation, however, we do not know to what extent education makes a difference; we just know that it is positively associated with income. To further understand this relationship, we need to estimate the regression of income on education.
We accomplish this by typing:
reg incmon educ_new
Source | SS df MS Number of obs = 421
-------------+------------------------------ F( 1, 419) = 233.72
Model | 664868285 1 664868285 Prob > F = 0.0000
Residual | 1.1920e+09 419 2844755.4 R-squared = 0.3581
-------------+------------------------------ Adj R-squared = 0.3565
Total | 1.8568e+09 420 4421001.9 Root MSE = 1686.6
------------------------------------------------------------------------------
incmon | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ_new | 301.3989 19.71498 15.29 0.000 262.6464 340.1515
_cons | -206.0145 145.6281 -1.41 0.158 -492.2672 80.23829
------------------------------------------------------------------------------
Do you remember how to interpret these results? Let's review the basic regression equation:
Y = a + bX
In our case, this equation becomes: (predicted incmon) = -206.0145 + 301.3989(educ_new)
We can immediately interpret the slope coefficient for educ_new as the number of Rand by which incmon increases for every additional year of education (educ_new), here about 301.40 Rand. Judging from the probability of the t-value (15.29), we can tell that the coefficient is statistically significant.
The constant, as discussed before, reflects the value of the dependent variable Y when the independent variables are equal to zero. While this property is technically useful in the calculation of the regression coefficients and of predicted Y values, its actual value is not always of use. Obviously we do not want to ignore it, but we also do not need to dwell on it since it is often not very interpretable. In our current case, it literally says that when education level is zero, predicted income is -206.0145. Note, however, that the constant is not statistically significant (p = 0.158), which means that the estimated value is not significantly different from zero. If, however, we had centered our education variable around the sample's education mean, then the "zero" value would actually be the average level of education. Interpreting the constant in that case would be more useful.
Moving along, the R-squared for this regression tells us that education accounts for almost 36% of the variation around the mean of income. Another way to think about it: if we were asked to guess the income of an individual in the sample, our guess would improve by approximately 36% if we knew the individual's education level instead of knowing only the mean income of the sample.
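The centering idea mentioned in the discussion of the constant can be sketched as follows. This is only a minimal sketch, assuming the working sample created earlier is still in memory; the variable name educ_ctr is our own choice for illustration and is not used elsewhere in this module.

```stata
* Centre education on its sample mean (educ_ctr is a hypothetical name)
quietly summarize educ_new
generate educ_ctr = educ_new - r(mean)
label var educ_ctr "Education, centred on sample mean"

* Same regression as before, with the centred predictor
reg incmon educ_ctr
```

As long as the regression uses the same cases that were used to compute the mean, the slope should be unchanged. Only the constant changes: it should now be the predicted income at the average level of education.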
Let's now try graphing the regression equation:
#delimit ;
predict inchat;
graph twoway scatter incmon educ_new || line inchat educ_new,
ylabel(0(5000)15000) ytick(0(2500)15000)
xlabel(0(4)16) xtick(0(2)16);

Issues of Parsimony vs. Saturation
When thinking about introducing variables into a model, it is important to keep the notions of parsimony and saturation in mind. That is, we should always strive to include only the variables that make sense and that are efficient at capturing the desired social phenomenon. Model building is often a balancing act between parsimony and saturation. When we say that a model is "saturated," we mean that the model has too many variables; it is overspecified. A model that is overspecified or saturated can often predict each case in the sample perfectly because the model is using up all the degrees of freedom. Therefore, when selecting variables for a model, it is prudent to include only the most necessary variables, or we risk overspecifying the model. With that in mind, let's proceed.
Introducing a Third Variable
At this point, we can consider including our first control variable. It is likely that the amount of income earned by any one person depends not only on their years of education, but also on their age. By including age in our model, we acknowledge that income is also a function of age. It is important to include this factor because most people accumulate not only life experience as they age, but also work experience and skills, thus making them more likely to earn a higher wage. If you remember our earlier discussion on how to interpret coefficients, each coefficient in a regression model is a partial effect, meaning that the coefficient reflects the effect of a variable while the others are held constant. In this case it means that when we include age, our coefficient for educ_new will be the effect of education with age held constant. Holding age constant does not mean setting it to zero: we are not saying that the coefficient of educ_new is the value for a newborn (age 0). Rather, think of "controlling" as comparing people of the same age who differ in education, so that the effect is standardized across all observations. Enough theory, let's try running the multiple regression model now:
reg incmon educ_new age
Source | SS df MS Number of obs = 421
-------------+------------------------------ F( 2, 418) = 119.87
Model | 676781025 2 338390513 Prob > F = 0.0000
Residual | 1.1800e+09 418 2823061.65 R-squared = 0.3645
-------------+------------------------------ Adj R-squared = 0.3614
Total | 1.8568e+09 420 4421001.9 Root MSE = 1680.2
------------------------------------------------------------------------------
incmon | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ_new | 312.9772 20.43244 15.32 0.000 272.814 353.1403
age | 17.94223 8.734351 2.05 0.041 .7735008 35.11095
_cons | -963.2319 396.1364 -2.43 0.015 -1741.9 -184.5642
------------------------------------------------------------------------------
Compare our old equation (from above):
(predicted incmon) = -206.0145 + 301.3989(educ_new)
--> {R-squared = 0.3581}
To our new multiple regression equation:
(predicted incmon) = -963.2319 + 312.9772(educ_new) + 17.94223(age)
--> {R-squared = 0.3645}
Right away we should notice the effect that age has on our model. Notice the increase in the coefficient for education (312.98), up from 301.40. Therefore, after controlling for age, which itself adds about 18 Rand a month for each additional year, every additional year of education on average produces an additional 312.98 Rand per month in earned income. Another way of thinking about these new results is that in the initial model, the "true" effect of educ_new was being masked by the effect of age.
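The "masking" can be made concrete with a little arithmetic. When both models use the same cases (here 421 in each), ordinary least squares satisfies an exact identity: (education coefficient without age) = (education coefficient with age) + (age coefficient) x (slope from regressing age on educ_new). The sketch below backs out that last slope from the two tables above and then suggests an auxiliary regression to confirm it; we have not run the auxiliary regression here, so treat it as a suggestion and not as reported output.

```stata
* Implied slope of age on education, backed out from the two regressions above
display (301.3989 - 312.9772) / 17.94223

* The auxiliary regression itself (its educ_new coefficient should match the value above)
reg age educ_new
```

The implied slope is roughly -0.65. In this sample, then, more educated respondents tend to be somewhat younger, which is why leaving age out pulled the education coefficient downward.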
The addition of a single regressor to the bivariate model probably does not seem that difficult, but as we progress in this module, you will realize that this is merely the tip of the iceberg.
Now that you have been introduced to multiple regression, try the following two exercises:
- It is possible that income and the number of household members predict the amount of money a household spends on clothing. Run a regression to see if this hypothesis finds support in our data.
- Question 1 Answer
- Among people who reported working, to what extent does the number of hours worked in a week predict a person's monthly income? Without running any further regressions, what other variables might also help predict a person's monthly income?
- Question 2 Answer
DUMMY VARIABLES
Thus far we have focused on using continuous variables in our regressions. We can extend regression analysis to include categorical variables such as gender, general satisfaction, race, metro, rural, etc. But how do you include variables whose values are arbitrary? Can we calculate the average race of a country? How about the average metro setting? The answer is no, but let's find out how these types of variables are useful in regression analysis.
What Makes a Dummy Variable a "Dummy" variable?
No, "dummy" variables are not "stupid" variables; in fact they are quite smart and useful! A dummy variable has two properties that make it a "dummy variable." First, it is categorical and non-ordinal (i.e., categories have no rank order). Thus, the number values associated with each category serve only to identify the various groups/categories it represents, not to assign value or order to any one category. The second property, and this is what makes a dummy variable a "dummy variable," is that it is binary in the sense that it has only two values: 0 and 1. Technically, a variable like race or metro has more than two values, but when this type of variable is used in a regression, a coefficient is calculated for each category while all the other categories are equal to zero. Thus, if done correctly, even a multi-category variable can be used as a set of dummy variables because in the end, it is broken up into 0s and 1s.
Dummy variables are useful because they allow us to control for membership within a particular category or group. If we neglected to split a categorical variable into several dummy variables when using it in a regression, we would get invalid results because regression analysis assumes variables to be continuous unless told otherwise. Therefore, if you include a categorical variable like race into a regression, STATA (or any other statistical program) would recognize it as simply another variable and would not realize that those numbers have no mathematical meaning - STATA does not know if the values in a variable are arbitrary or not. Regression analysis revolves around the use of means and standard deviations, but with categorical variables, means and standard deviations have no meaning.
How NOT to use Categorical variables
Let's try the following example of what NOT to do. Let's continue with our previous example of income returns to education. This time let's include race in the regression model without considering the fact that it is a categorical variable. First, let's tabulate race to see its categories, but remember that because we are interested in incmon, we are restricted to the working-age population:
tabulation of race
19 |
:population |
group | Freq. Percent Cum.
------------+-----------------------------------
01-afric | 283 67.22 67.22
02-colou | 54 12.83 80.05
03-india | 14 3.33 83.37
04-white | 70 16.63 100.00
------------+-----------------------------------
Total | 421 100.00
Let's regress it now:
regress incmon educ_new race
Source | SS df MS Number of obs = 421
-------------+------------------------------ F( 2, 418) = 255.77
Model | 1.0218e+09 2 510920800 Prob > F = 0.0000
Residual | 834979197 418 1997557.89 R-squared = 0.5503
-------------+------------------------------ Adj R-squared = 0.5482
Total | 1.8568e+09 420 4421001.9 Root MSE = 1413.3
------------------------------------------------------------------------------
incmon | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ_new | 185.0327 18.67354 9.91 0.000 148.327 221.7384
race | 921.1761 68.90877 13.37 0.000 785.7252 1056.627
_cons | -1056.576 137.6228 -7.68 0.000 -1327.095 -786.057
------------------------------------------------------------------------------
After reviewing these results, how would you interpret the race coefficient? Would it make sense to say that for every unit increase in race, while controlling for education (educ_new), there is a 921.18 Rand increase in income? The answer is NO. This is similar to saying that the average race in South Africa is 2.3. What would 2.3 mean? Your guess is as good as mine.
The Correct Way
Let's try this same example, except this time we will do it correctly. To do this we need to call upon a few of our newly found skills. First, we need to split race into multiple dummy variables. There are two main ways to accomplish this task. Here we will cover the more familiar way (tab varname, gen(varname)), and then below you will be introduced to a new command that will make it easier: the xi command.
We covered this first command in Module 3:
tab race, gen(raceid) [Note: raceid will be automatically numbered with sequential numbers]
19 |
:population |
group | Freq. Percent Cum.
------------+-----------------------------------
01-afric | 283 67.22 67.22
02-colou | 54 12.83 80.05
03-india | 14 3.33 83.37
04-white | 70 16.63 100.00
------------+-----------------------------------
Total | 421 100.00
Then we tabulate our new raceid variables to make sure the command worked by typing:
tab1 raceid1 raceid2 raceid3 raceid4 [Note: tab1 tells STATA to tabulate each variable separately instead of crosstabulating all of them together in one big matrix]
-> tabulation of raceid1
race==01-af |
ric | Freq. Percent Cum.
------------+-----------------------------------
0 | 138 32.78 32.78
1 | 283 67.22 100.00
------------+-----------------------------------
Total | 421 100.00
-> tabulation of raceid2
race==02-co |
lou | Freq. Percent Cum.
------------+-----------------------------------
0 | 367 87.17 87.17
1 | 54 12.83 100.00
------------+-----------------------------------
Total | 421 100.00
-> tabulation of raceid3
race==03-in |
dia | Freq. Percent Cum.
------------+-----------------------------------
0 | 407 96.67 96.67
1 | 14 3.33 100.00
------------+-----------------------------------
Total | 421 100.00
-> tabulation of raceid4
race==04-wh |
ite | Freq. Percent Cum.
------------+-----------------------------------
0 | 351 83.37 83.37
1 | 70 16.63 100.00
------------+-----------------------------------
Total | 421 100.00
Great, our command worked as it should. Each new raceid variable is coded as 1 for all people who are of that particular race and 0 for everyone else. For example, Indians (raceid3) equals 1 for a total of 14 cases and it equals 0 for everyone else, a total of 407 cases.
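A compact way to confirm the same thing is to check that every case belongs to exactly one category. This is a minimal sketch that assumes the four raceid variables created above; assert stays silent when the condition holds for every observation and stops with an error otherwise.

```stata
* Each person should be coded 1 on exactly one of the four dummies
assert raceid1 + raceid2 + raceid3 + raceid4 == 1
```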
Now it's time to run the regression with our newly created dummy variables. We do this by typing:
reg incmon educ_new age raceid2 raceid3 raceid4
Source | SS df MS Number of obs = 421
-------------+------------------------------ F( 5, 415) = 106.48
Model | 1.0434e+09 5 208687855 Prob > F = 0.0000
Residual | 813381524 415 1959955.48 R-squared = 0.5619
-------------+------------------------------ Adj R-squared = 0.5567
Total | 1.8568e+09 420 4421001.9 Root MSE = 1400
------------------------------------------------------------------------------
incmon | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ_new | 185.2631 19.42997 9.53 0.000 147.0697 223.4565
age | 6.671661 7.334563 0.91 0.364 -7.745864 21.08919
raceid2 | 287.8807 208.6658 1.38 0.168 -122.2929 698.0543
raceid3 | 1722.132 389.4502 4.42 0.000 956.5913 2487.673
raceid4 | 2846.285 212.1007 13.42 0.000 2429.36 3263.211
_cons | -320.6505 334.1475 -0.96 0.338 -977.4831 336.182
------------------------------------------------------------------------------
Our new regression line can be stated as:
(predicted incmon) = -320.6505 + 185.2631(educ_new) + 6.671661(age) + 287.8807(raceid2) + 1722.132(raceid3) + 2846.285(raceid4)
By now, you should be able to interpret the basic regression equation. This new equation is simply an extension of the first regression equation discussed earlier. Let's quickly review it. This equation tells us that for every additional year of education, income increases by about 185.26 Rand while controlling for age and race. It also tells us that for every additional year of age, income increases by about 6.67 Rand while controlling for education and race, although this coefficient is not statistically significant (t=0.91). Now, the race coefficients tell us that for raceid2 (Coloured) there is an added effect of 287.88 Rand over the omitted category (raceid1, Black Africans) while controlling for education and age. Similarly, for raceid3 (Indian) there is an added effect of 1722.13 Rand over Black Africans while controlling for education and age. For Whites (raceid4), there is an added effect of 2846.29 Rand over the reference category, Black Africans, while controlling for education and age. In general, the raceid coefficients show us the effect that race has on the amount of total monthly income after controlling for education and age. As you can tell, the effects are staggering! To be precise, the race effect for Coloureds is not statistically significant (t=1.38); the other raceid coefficients, however, are large and statistically significant.
NOTE on Omitted/Reference Categories
There is one important point to keep in mind when interpreting a multiple regression that uses dummy variables. Notice that only 3 raceid dummy variables were included in the equation. Why would this be necessary? It is necessary because the four dummy variables always add up to 1 for every case, so including all of them alongside the constant would make the regressors perfectly collinear; we would essentially overspecify the model, which we do not want to do. Whenever we use dummy variables, there should always be an omitted category (also known as the reference category); in this case the omitted category is Black African (raceid1).
Being "omitted" does not mean that the equation is ignoring that group of people. Rather, we are telling STATA to explicitly show us only the coefficients for raceid2, raceid3, and raceid4; the value for the omitted category (raceid1) can be recovered from the results above. If you remember our description of what the constant is, you will realize that raceid1 can be derived from it. The constant in this case is analogous to a "reservoir" of values into which all omitted categories get lumped. Since the constant represents the value of our dependent variable Y when all other regressors are equal to zero, the "left over" cases are the ones used to calculate the constant (in this case, those not in the category raceid2, raceid3, or raceid4). And who is not in the raceid2, raceid3, or raceid4 categories? Correct, raceid1 (Black Africans).
It is important to realize that we did not drop any cases by omitting the raceid1 category; we simply "shifted" them into the constant and use them as the comparison group. If we were using another set of dummy variables, gender for example, we would have to choose the reference category for that variable as well. If we chose men as our reference category, we would get a coefficient for women, but not for men. The value for men would be found in the constant. If both gender and race were included in a regression model as dummy variables, two omitted categories would be captured and represented by the constant; in our case it would have been African Males (raceid1 + men).
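To see the omitted category at work, the sketch below (not part of the output shown in this module) first tries to include all four race dummies and then switches the reference category. We would expect STATA to drop one of the four dummies on its own in the first regression and print a note about collinearity, because the four dummies always sum to 1 and so duplicate the constant.

```stata
* All four dummies at once: one of them is redundant with the constant
reg incmon educ_new age raceid1 raceid2 raceid3 raceid4

* Same model, with Whites (raceid4) as the reference category instead
reg incmon educ_new age raceid1 raceid2 raceid3
```

The R-squared should be identical whichever category is omitted. Only the constant and the dummy coefficients change, because each is now measured against a different comparison group.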
A short cut: the "xi" option
Although the tab varname, gen(varname) command is useful in creating dummy variables, it is not strictly necessary. STATA provides us with an easier and more convenient shortcut to specify a categorical variable in a regression equation. The xi command tells STATA to treat the specified variable(s) as categorical, as if they were dummy variables. It can be used as a prefix with estimation commands such as regress, logistic, probit, etc. Let's try it.
First, we will create and label a new gender variable that is consistent with dummy variable coding (0s and 1s). Note, however, that we could also use the xi command with gender_n directly, but we choose not to.
gen gender=gender_n
label var gender "Gender of Respondent"
recode gender 3=0 2=1
label def gender 1 "Female" 0 "Male"
label val gender gender
Then we tabulate the new variable to make sure it works fine - and of course, it does.
tab gender
Gender of |
Respondent | Freq. Percent Cum.
------------+-----------------------------------
Male | 242 57.48 57.48
Female | 179 42.52 100.00
------------+-----------------------------------
Total | 421 100.00
Now we move on to using the xi command. We continue with our income returns to education example, but now we will be controlling for age, race, and gender. By doing so, we are stating not only that income depends on education, but also on age, race, and gender. This time, however, we will be declaring the White racial group (race = 4) as the reference category. We do this by prefacing the regress command with the char varname[omit] statement. This command is useful when using xi because STATA, by default, selects the first category in the specified variable as the reference category. In our model, the xi: command works by placing it at the beginning of the regression equation and then specifying the variables you want STATA to expand into its constituent categories by "tagging" them with an "i." in front of each target variable. See below:
char race[omit] 4 /* Makes category 4 of the race variable the reference category */
xi:reg incmon educ_new age i.race i.gender
Notice that an "i." is included for the variables race and gender. Also remember that we have told STATA to treat category 4 of the race variable as the reference category and since we have not specified a specific reference category for gender, STATA will omit its first category - 0, men.
. char race[omit] 4;
. xi:reg incmon educ_new age i.race i.gender;
i.race _Irace_1-4 (naturally coded; _Irace_4 omitted)
i.gender _Igender_0-1 (naturally coded; _Igender_0 omitted)
Source | SS df MS Number of obs = 421
-------------+------------------------------ F( 6, 414) = 96.58
Model | 1.0831e+09 6 180509880 Prob > F = 0.0000
Residual | 773761517 414 1868989.17 R-squared = 0.5833
-------------+------------------------------ Adj R-squared = 0.5772
Total | 1.8568e+09 420 4421001.9 Root MSE = 1367.1
------------------------------------------------------------------------------
incmon | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ_new | 183.1393 18.97933 9.65 0.000 145.8314 220.4471
age | 8.55595 7.174016 1.19 0.234 -5.546089 22.65799
_Irace_1 | -2852.43 207.1245 -13.77 0.000 -3259.577 -2445.283
_Irace_2 | -2516.836 260.9101 -9.65 0.000 -3029.71 -2003.962
_Irace_3 | -1248.377 402.788 -3.10 0.002 -2040.142 -456.6124
_Igender_1 | -624.9599 135.737 -4.60 0.000 -891.7796 -358.1402
_cons | 2735.125 412.9026 6.62 0.000 1923.478 3546.772
------------------------------------------------------------------------------
What do the results tell us? Right away we should be able to tell that our model explains about 58% of the variation around the mean of our dependent variable, which is great. Next, we should notice that all the dummy variable coefficients (for race and gender) are negative. These negative values tell us that in relation to the omitted categories (race=4 and gender=0, that is, white men) everyone within the reported categories (Africans, Coloureds, Indians, and Women) earns significantly less after controlling for the level of education and age! In other words, this model tells us that above and beyond levels of education and age, a person who reports being Black African, Coloured, Indian, or a Woman is likely to earn significantly less than White Males. Overall, the model tells us that if we know a person's level of education, their age, their race, and their gender, we are likely to guess their income about 58% better than by simply guessing the mean income in the sample.
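In STATA 11 and later, the same model can be written with factor-variable notation instead of the xi prefix. The sketch below assumes such a version is available; the ib4. prefix marks category 4 of race as the base (reference) category, so the char statement is not needed.

```stata
* Factor-variable version of the xi model (assumes STATA 11 or later)
reg incmon educ_new age ib4.race i.gender
```

The coefficients should match the xi output above, although the rows are labelled by category value or value label and not by names such as _Irace_1.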
Let's consider what our new equation looks like:
(predicted income) = 2735.125 + 183.1393(years of education) + 8.55595(years of age) - 2852.43(African=1, else=0) - 2516.836(Coloured=1, else=0) - 1248.377(Indian=1, else=0) - 624.9599(Woman=1, else=0)
The new equation allows us to calculate, for example, the predicted income for a 50 year old Indian man with 10 years of education or the income for a 25 year old Coloured woman with 16 years of education. All we need to do is plug in the number of years of education, the years of age, and either a 1 or a 0 for whether the person falls within the particular category or not. Let's try it.
(Indian man predicted income) = 2735.125 + 183.1393(10) + 8.55595(50) - 2852.43(0) - 2516.836(0) - 1248.377(1) - 624.9599(0)
= 2735.125 + 1831.393 + 427.7975 - (0) - (0) - 1248.377 - (0)
-> (predicted income) = 3745.9385 Rand for a 50 year old Indian man with 10 years of education.
For a 25 year old Coloured woman with 16 years of education, the predicted income equation is the following:
(Coloured woman predicted income) = 2735.125 + 183.1393(16) + 8.55595(25) - 2852.43(0) - 2516.836(1) - 1248.377(0) - 624.9599(1)
= 2735.125 + 2930.2288 + 213.89875 - (0) - 2516.836 - (0) - 624.9599
-> (predicted income) = 2737.45665 Rand for a 25 year old Coloured woman with 16 years of education.
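The hand calculations above can be checked directly in STATA. The first two lines below simply redo the arithmetic with display, using the coefficients typed in from the table. The lincom lines are an alternative that assumes the xi regression above is still the most recent estimation; they also report a standard error and confidence interval for each prediction.

```stata
* Arithmetic check, coefficients typed in from the output above
display 2735.125 + 183.1393*10 + 8.55595*50 - 1248.377
display 2735.125 + 183.1393*16 + 8.55595*25 - 2516.836 - 624.9599

* Same predictions from the stored estimates (run right after the xi regression)
lincom _cons + 10*educ_new + 50*age + _Irace_3
lincom _cons + 16*educ_new + 25*age + _Irace_2 + _Igender_1
```

The display lines should reproduce the two values worked out by hand. The lincom estimates may differ in the last decimals because STATA uses the unrounded coefficients.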
Does this sound right to you? Does it make sense that on average an older, less educated Indian man earns about 1,000 Rand more than a younger, more educated Coloured woman? What, if anything, do these predicted values assume? Any ideas? How about assuming that each of the non-categorical variables in our equation has a linear relationship with the dependent variable? Does it make sense that as we get older we continue to earn more money? How about with education?
In general, the relationship between income and age is not linear, but curvilinear. That is, there comes a point where age no longer provides an advantage in the workforce and instead becomes a detriment. Its effect on earnings goes from positive to negative as people reach older ages. We will learn how to model this curvilinear effect later in this module.
Note on Extrapolating Beyond the Data
Let's try calculating the following predicted income:
What is the predicted income for a 90 year old White male with 20 years of education? We can easily carry out the calculations for this question:
(predicted income) = 2735.125 + 183.1393(20) + 8.55595(90) - 2852.43(0) - 2516.836(0) - 1248.377(0) - 624.9599(0)
= 2735.125 + 3662.786 + 770.0355 - (0) - (0) - (0) - (0)
-> (predicted income) = 7167.9465 Rand for a 90 year old White male with 20 years of education.
Do you see any problems with this example? Does our education variable include people with 20 years of education? How about our sample, does it include people over the age of 65? The answer to these questions is NO. Extrapolating beyond the available data points is never a good idea because our results apply only to the range of cases used to calculate the model. It is possible that our observed relationship holds for 90 year olds with 20 years of education, but it is also possible that it does not. The point is that without those actual cases in the calculation of the model it is impossible to know. Therefore, we suggest that you never try to extrapolate, that is, predict values beyond the data points used in the model.
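Before plugging values into the equation, it is worth checking that they fall inside the range of the data. A quick sketch, assuming the working sample from this module is still in memory:

```stata
* Minimum and maximum of each continuous predictor in the working sample
summarize age educ_new

* How many cases lie at or beyond the values we wanted to predict for?
count if age >= 90 & age < .
count if educ_new >= 20 & educ_new < .
```

Given the keep commands and the recoding used earlier in this module, we would expect both counts to be zero, which is exactly why the prediction above cannot be trusted.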
Try these exercises to make sure you understand the basics of interpreting dummy variables in multiple regression analysis.
- What is the predicted amount of income earned for an African woman (raceid1 & gender1) of age 30 with a 12 year education?
- Question 3 Answer
- What is the predicted amount of income earned for a White woman (raceid4 & gender1), also of age 30 with a 12 year education?
- Question 4 Answer
INTERACTIONS WITH DUMMY VARIABLES
Thus far, we have only dealt with the additive effects of dummy variables. That is, the assumption has been that for each independent variable Xi, the amount of change in our dependent variable Y is the same regardless of the values of the other independent variables in the equation. This assumption allows us to interpret the partial coefficients as the effect of a variable while controlling for the other independent variables in the model.
The additive assumption, however, does not always hold. In such cases, the partial effect of a given independent variable cannot be interpreted as the effect of that variable while all others are held constant; instead, the effect depends on the specific values of the other independent variables in the model. That is, we hypothesize that the independent variable Xi is linearly related to the dependent variable Y, but that this linear relationship depends on a different independent variable in the model. Interactions are perhaps best visualized and understood in the case of dummy variables.
For instance, in our example below, we interact the categories of race and gender. In effect, what we are testing with an interactive model is whether the linear relationship between an independent variable Xi and the dependent variable Y depends on the values of a different independent variable in the model. More intuitively, by interacting race and gender, we are testing whether being an African female, for example, is significantly different from what being African and being female would each imply independently. In other words, when results indicate a statistically significant interaction effect, the data suggest that being an African female or a White male, or any other combination of race and gender, is qualitatively different from being in either category independently.
In general, we can illustrate what we mean by the additive effect of dummy variables in regression with the graph below. Each category of an independent dummy variable has a slope as depicted by the lines in the graph. For instance, we can imagine the predicted effect of gender on income looking like the lines below. That is, because men on average earn more than women we would expect to find predicted regression lines like the ones below. As it stands, this first graph suggests that the effect of gender is similar across all racial groups, the only apparent difference is in magnitude between males and females -- both slopes are identical for each unit change in Xi.

In this second graph, we find a hypothetical interaction effect. We can imagine this effect to be similar in form to that of the interaction between race and gender. That is, the effect of gender (the slope of the line) depends on the race of the individual. In this case, the upper-most line on the graph has a steeper slope than the line below it, so the effect of gender depends on the value of Xi, which here is the race of the individual.

Let's now investigate how the theory measures up to empirical findings. Creating an interaction term with STATA is as easy as inserting an asterisk "*" between the two variables you wish to interact. In essence, this tells STATA to multiply the two variables together. Alternatively, you can create each interaction term yourself by generating a new variable that multiplies the two desired variables together. In the example immediately below we use the easier of these two approaches; to see the second approach, click here.
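To make the "multiply the two variables together" idea concrete, here is a minimal sketch in Python rather than STATA. The three respondents and the variable names (such as race1_x_female) are hypothetical; the coding (race 1 to 4 with 4 = White as the reference, gender 1 = female) follows the example below.

```python
# Hypothetical respondents: race coded 1-4 (4 = White, the reference
# category) and gender coded 0/1 (1 = female).
people = [
    {"race": 1, "gender": 1},  # African woman
    {"race": 1, "gender": 0},  # African man
    {"race": 4, "gender": 1},  # White woman
]

for person in people:
    female = 1 if person["gender"] == 1 else 0
    for category in (1, 2, 3):  # race 4 is omitted as the reference
        race_dummy = 1 if person["race"] == category else 0
        # The interaction term is simply the product of the two dummies.
        person[f"race{category}_x_female"] = race_dummy * female

print(people[0]["race1_x_female"])  # 1: African and female
print(people[1]["race1_x_female"])  # 0: African but male
print(people[2]["race1_x_female"])  # 0: female but not African
```

Each interaction dummy equals 1 only for respondents who belong to both categories at once, which is why its coefficient captures what is specific to that combination.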
First, we choose Whites as the reference category (race==4). Then, to interact race and gender, we simply include an asterisk between i.race and i.gender. Remember that it is important to restrict our model to the working-age segment of the population. Let's try it.
char race[omit] 4
xi:reg incmon educ_new age i.race*i.gender
. xi:reg incmon educ_new age i.race*i.gender;
i.race _Irace_1-4 (naturally coded; _Irace_4 omitted)
i.gender _Igender_0-1 (naturally coded; _Igender_0 omitted)
i.race*i.gender _IracXgen_#_# (coded as above)
Source | SS df MS Number of obs = 421
-------------+------------------------------ F( 9, 411) = 72.53
Model | 1.1394e+09 9 126601385 Prob > F = 0.0000
Residual | 717408332 411 1745519.06 R-squared = 0.6136
-------------+------------------------------ Adj R-squared = 0.6052
Total | 1.8568e+09 420 4421001.9 Root MSE = 1321.2
------------------------------------------------------------------------------
incmon | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ_new | 172.8666 18.43997 9.37 0.000 136.6182 209.115
age | 6.018636 6.952892 0.87 0.387 -7.649029 19.6863
_Irace_1 | -3729.071 255.3957 -14.60 0.000 -4231.116 -3227.026
_Irace_2 | -3361.513 341.8539 -9.83 0.000 -4033.513 -2689.513
_Irace_3 | -2166.792 451.2157 -4.80 0.000 -3053.77 -1279.814
_Igender_1 | -2279.992 322.2442 -7.08 0.000 -2913.444 -1646.539
_IracXge~1_1 | 1982.464 359.7817 5.51 0.000 1275.222 2689.706
_IracXge~2_1 | 1888.976 483.6555 3.91 0.000 938.2293 2839.723
_IracXge~3_1 | 2643.612 918.8979 2.88 0.004 837.2855 4449.938
_cons | 3621.657 429.1715 8.44 0.000 2778.012 4465.302
------------------------------------------------------------------------------
As with the previous regression results, we find coefficients for the main effects of educ_new, age, _Irace_1 through _Irace_3, and _Igender_1, but now we also find the interaction effects of race and gender (_IracXge~1_1, _IracXge~2_1, and _IracXge~3_1). The first of these interactions is for African women (race==1 and gender==1), the second is for Coloured women (race==2 and gender==1), and the last is for Indian women (race==3 and gender==1). Since Whites (race==4) are the reference category, we do not get an interaction effect for White women.
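One way to read these interaction coefficients is to add each one to the main effect of gender, which gives the female-minus-male difference in predicted income within each racial group. The sketch below does this in Python with the coefficients from the output above; the dictionary and variable names are illustrative only.

```python
# Coefficients copied from the regression output above.
B_FEMALE = -2279.992  # _Igender_1: gender gap in the reference group (White)
INTERACTIONS = {
    "African": 1982.464,   # _IracXge~1_1
    "Coloured": 1888.976,  # _IracXge~2_1
    "Indian": 2643.612,    # _IracXge~3_1
    "White": 0.0,          # reference category: no interaction term
}

# Female-minus-male difference in predicted income within each group,
# holding education and age constant.
gender_gap = {group: round(B_FEMALE + extra, 3)
              for group, extra in INTERACTIONS.items()}

for group, gap in gender_gap.items():
    print(f"{group}: {gap}")
```

Under this model the predicted gap is about -2280 rand among Whites, but only about -298 among Africans and -391 among Coloureds, and the point estimate for Indians is positive (about +364). These are point estimates only and say nothing about whether each within-group gap is statistically significant, but they show what it means for the effect of gender to depend on race.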
When interpreting interaction effects, it is important to keep in mind that the main effects of the interacted variables can no longer be interpreted on their own. Each main effect now applies only to the reference category of the other variable: here, the coefficient on gender is the gender gap among Whites, and the race coefficients are the race gaps among men. We still use the main effects, together with the interaction effects, to calculate any estimated yhat value. For example, if we were interested in calculating the income for an African female aged 35 with 12 years of education, we compute the following:
predicted income = 3621.657 + 12(172.8666) + 35(6.018636) - 1(3729.071) -
1(2279.992) + 1(1982.464)
predicted income = 1880.11
- Run a regression predicting the resale value of a house using number of rooms in the house, and whether the house is located in a rural area or not. Is there an interaction effect between number of rooms and whether the house is in a rural area or not? How can you tell? How would you interpret the interaction effect?
- Question 5 Answer
- In figuring out what predicts someone's net pay, is there an interaction effect between education and gender? Compute a regression where education and gender are the independent variables explaining someone's net pay.
- Question 6 Answer
LINEAR TRANSFORMATIONS OF NON-LINEAR RELATIONSHIPS
Thus far, we have assumed linear relationships in all of our regression models. In fact, linearity is a basic assumption of regression analysis. Empirically, however, variables are often not associated in a linear fashion. Yet this reality hardly precludes regression analysis from accurately predicting and describing real-world phenomena. In this section we show you two basic approaches: by adding a quadratic term or by taking the natural logarithm of a variable, we can transform a non-linear relationship into an approximately linear one and often substantially improve the fit of the regression line.
Note: Logarithmic and quadratic transformations are not restricted to multiple regression. We have placed them in the multiple regression module because they are rather advanced topics that should only be addressed once you have a clear understanding of the material in Modules 6 and 7 that precedes this section.
Transformations using Squared Terms
A commonly used squared transformation is age squared (age2). Researchers often include both age and age2 in regression models because doing so allows the effect of a one-year increase in age to change as a person gets older. In other words, by including age squared, we are controlling for the curvilinear effect of age. That is, the effect of age is not likely to remain the same as we get older; after all, a 90-year-old employee is not likely to earn significantly more than, say, a 60-year-old, because at a certain point the effect of age levels off or declines. Therefore, by including age2, the effect of age is allowed to vary across years of age.
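* Create the squared term.
gen age2=age*age
* Linear fit: income on age alone.
regress incmon age
predict yhat1, xb
* Plot observed income and the fitted values against age. In the second
* version, c(.l) connects only the fitted values and s(Oi) hides their markers.
graph incmon yhat1 age
graph incmon yhat1 age, c(.l) s(Oi) sort
* Quadratic fit: income on age and age squared.
regress incmon age age2
predict yhat2, xb
graph incmon yhat2 age
graph incmon yhat2 age, c(.l) s(Oi) sort
* Plot income and the squared term itself against age.
graph incmon age2 age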
This graph allows us to see the effect of the squared term - age2.
char race[omit] 4
xi:reg incmon educ_new age age2 i.race*i.gender
i.race Irace_1-4 (naturally coded; Irace_4 omitted)
i.gender Igende_0-1 (naturally coded; Igende_0 omitted)
i.race*i.gender IrXg_#-# (coded as above)
Source | SS df MS Number of obs = 507
---------+------------------------------ F( 10, 496) = 68.97
Model | 1.1547e+09 10 115468006 Prob > F = 0.0000
Residual | 830417593 496 1674229.02 R-squared = 0.5817
---------+------------------------------ Adj R-squared = 0.5732
Total | 1.9851e+09 506 3923117.90 Root MSE = 1293.9
------------------------------------------------------------------------------
incmon | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
educ_new | 164.7202 16.38277 10.054 0.000 132.532 196.9083
age | 138.0641 33.88001 4.075 0.000 71.4981 204.6302
age2 | -1.519243 .435349 -3.490 0.001 -2.374598 -.6638872
Irace_1 | -3333.191 222.6827 -14.968 0.000 -3770.709 -2895.673
Irace_2 | -3126.918 287.8819 -10.862 0.000 -3692.536 -2561.3
Irace_3 | -2109.402 403.2001 -5.232 0.000 -2901.593 -1317.211
Igende_1 | -1980.257 285.4959 -6.936 0.000 -2541.187 -1419.326
IrXg_1_1 | 1616.876 320.0354 5.052 0.000 988.0841 2245.669
IrXg_2_1 | 1734.739 413.7246 4.193 0.000 921.8703 2547.608
IrXg_3_1 | 1374.314 670.6213 2.049 0.041 56.7049 2691.922
_cons | 643.0632 677.4539 0.949 0.343 -687.9699 1974.096
------------------------------------------------------------------------------
In terms of our coefficients, we find that each additional year of education increases predicted income by 164.72 rand, and that age increases income up to about age 45 and decreases it thereafter. This is because a quadratic ax^2 + bx + c turns over at x = -b/(2a), which for our age2 and age coefficients (a = -1.519243, b = 138.0641) is 138.0641/(2 x 1.519243) = 45.438.
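The turning point, and the way the effect of one more year of age changes along the curve, can be checked with a short Python sketch. The coefficients come from the output above; the function name is illustrative, and the marginal effect is simply the derivative of the quadratic with respect to age.

```python
# Coefficients copied from the regression output above.
B_AGE = 138.0641
B_AGE2 = -1.519243


def marginal_effect_of_age(age):
    """Approximate change in predicted income for one more year of age.

    This is the derivative of B_AGE*age + B_AGE2*age**2 with respect to age.
    """
    return B_AGE + 2 * B_AGE2 * age


# The effect of age changes sign where the derivative equals zero.
turning_point = -B_AGE / (2 * B_AGE2)

print(round(turning_point, 3))               # 45.438
print(round(marginal_effect_of_age(30), 2))  # 46.91
print(round(marginal_effect_of_age(60), 2))  # -44.25
```

At age 30 an extra year is associated with roughly 47 rand more in predicted income, while at age 60 an extra year is associated with roughly 44 rand less, holding the other variables constant.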
Transformations Using the Natural Logarithm
Often it is desirable to run a regression using the natural logarithm (the logarithm to the base e) of a variable instead of the variable itself. For instance, if the graph of the dependent variable against the independent variable shows that the relationship is not linear, taking the logarithm of one or both variables can sometimes produce a linear relationship. Therefore, though a linear relationship might not exist between two variables, a linear relationship might exist between the natural logarithms of the two variables. Logarithmic transformation also lessens the influence of outliers (which can sometimes drastically affect the slope of the regression line) because the natural logarithm of a variable is much less sensitive to extreme observations than is the variable itself.
Income is a variable that is often transformed using its natural log. Doing so means that the impact of each additional rand decreases as income increases. That is, after a certain point more money does not make much more of a difference. For example, earning 3 billion rand a year rather than 2 billion will probably not have much of an effect on how many "Appletisers" we drink, but earning 1000 rand per year rather than only 100 is likely to make a huge difference.
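The two comparisons in this example can be put on the log scale with a quick Python check (the variable names are illustrative):

```python
import math

# Difference in log income for the two comparisons in the text.
low_jump = math.log(1000) - math.log(100)  # 100 rand -> 1000 rand
high_jump = math.log(3e9) - math.log(2e9)  # 2 billion -> 3 billion rand

print(round(low_jump, 3))   # 2.303
print(round(high_jump, 3))  # 0.405
```

On the log scale, the 900-rand jump at the bottom is more than five times as large as the 1-billion-rand jump at the top, because the logarithm measures proportional rather than absolute change.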
EXERCISES
- How much does a family's total school expenditure change with an increase in total monthly income and household food subsidy? Do these variables significantly explain/predict changes in total school expenditure? Why or why not?
- Exercise 1 Answer
- What effect does being Indian have on household size in relation to being Coloured?
- Exercise 2 Answer
- Regress insurance expenditure on health care expenditure and total monthly income. What exactly are the F and t statistics testing in this regression?
- Exercise 3 Answer
- Households speaking which language have the highest total monthly expenditure (controlling for total monthly income)? The lowest total monthly expenditure? By what amount do households from these two language groups differ on average in their expenditure at all levels of income?
- Exercise 4 Answer
- Graph total monthly food value on total monthly income. Now logarithmically transform both total monthly food value and total monthly income and graph them. Which do you think will produce a stronger regression model? Regress total monthly food value on total monthly income and the logarithmic transformation of total monthly food value on the logarithmic transformation of total monthly income. Were you right? Why or why not?
- Exercise 5 Answer