
MULTIPLE REGRESSION ANALYSIS

 

TABLE OF CONTENTS

Introduction
Dummy Variables
Interactions with Dummy Variables
Linear Transformations of non-Linear relationships
Transformations using Squared Terms
Transformations using the natural Logarithm
Exercises

INTRODUCTION

In Module 6, we learned about simple bivariate regression. Now it is time to move on to the more complex, but more exciting, multiple regression.

Let's quickly review what we know about simple regression analysis. In general form, the simple linear regression model has one independent variable (X) and one dependent variable (Y). In multiple regression, the dependent variable Y is assumed to be a function of a set of k independent variables: X1, X2, X3, ..., Xk. This yields a new regression equation, an extension of the one in Module 6:

Y = a + b1X1 + b2X2 + ... + bkXk

As with the simple regression equation, the interpretation of each of these coefficients is straightforward. Each "b" is a partial slope coefficient. Put differently, each "b" coefficient is the slope of the relationship between that particular independent variable X and the dependent variable Y when all other independent variables in the model are "held constant." For example, the b1 coefficient is the slope between X1 and the dependent variable Y when all other variables in the equation (X2, X3, etc.) are held at fixed values. Similarly, b2 is the slope for the relationship between X2 and Y when X1, X3, etc. are held constant. As in Module 6, "a" refers to the intercept, also known as the constant. It is the value of predicted Y (yhat) when all of the independent variables, X1, X2, X3, etc., are equal to zero. Thus, multiple regression allows us to state the relationship between two main variables while controlling for other factors; these controlled relationships are also known as partial effects.

It should be obvious how useful this approach can be for quantitative social researchers, since we are often interested in social phenomena that go beyond a basic bivariate relationship. As mentioned in the previous module, we might be interested in whether the relationship between total monthly household income and total monthly household expenditures varies by rural setting, or whether expenditures are driven not by household income but by how many household members are present in the home. All of these interests require multiple regression. This new approach will allow us to investigate the initial relationship while controlling for a third, a fourth, or any number of additional factors.

Therefore, for this module we will investigate in depth the relationship between income (incmon) and education (educ_c). In particular, we are hypothesizing that the amount of income earned by an individual depends on that individual's level of education. First, we will need to recode the education variable into linear form. In Module 3 we had to recode it to make sure that value 18 (preschool) was less than value 16 (college degree). If you have not already done so, recode the original educ_c variable into educ_new as follows:

#delimit ;
generate educ_new = educ_c;
label var educ_new "Recoded Education";
/* codes 17 and 18 become 0 */
replace educ_new = 0 if educ_c == 17;
replace educ_new = 0 if educ_c == 18;
/* codes -3 and -4 are set to missing */
replace educ_new = . if educ_c == -3;
replace educ_new = . if educ_c == -4;
replace educ_new = 9 if educ_c == 11;
/* codes 12 through 15 all become 12 */
replace educ_new = 12 if educ_c == 12;
replace educ_new = 12 if educ_c == 13;
replace educ_new = 12 if educ_c == 14;
replace educ_new = 12 if educ_c == 15;
/* code 19 is set to missing */
replace educ_new = . if educ_c == 19;
/* switch back to one command per line for the commands that follow */
#delimit cr

We should always check that our recoding worked. So let's type:

tab educ_new

We should see the following:

-> tabulation of educ_new

    Recoded |
  Education |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |       1403       27.44       27.44
          1 |        663       12.97       40.41
          2 |        293        5.73       46.14
          3 |        315        6.16       52.30
          4 |        331        6.47       58.77
          5 |        374        7.31       66.09
          6 |        435        8.51       74.59
          7 |        252        4.93       79.52
          8 |        326        6.38       85.90
          9 |        234        4.58       90.48
         10 |        344        6.73       97.20
         12 |        108        2.11       99.32
         16 |         35        0.68      100.00
------------+-----------------------------------
      Total |       5113      100.00
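
A frequency table confirms which values educ_new takes, but not which original codes were mapped to them. As a stricter check, one possible sketch (assuming both educ_c and educ_new are still in memory, and run before any cases are dropped) cross-tabulates the old codes against the new ones:

```stata
* Cross-check the recode: each row (original code) should have
* observations in exactly one column (new value).
tab educ_c educ_new, missing

* Every case coded -3, -4, or 19 should now be missing;
* assert stops with an error if that is not true.
assert educ_new == . if inlist(educ_c, -3, -4, 19)
```

If the assert command returns silently, the missing-value recodes were applied to every case they were meant for.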

Everything looks good so far. Now, as we have done in the past, we want to quickly check our variables with the correlate command to make sure that, coding-wise, the relationship behaves as expected.

NOTE on the age qualifier that we need to use for our analysis: when using a wage or income variable, it is important to keep in mind that not everyone in the population is of working age, so the variable often applies only to a certain group in the population. In the United States context, for example, working age is often taken to be between 25 and 65. For our example here, we will use this same age range. Since we do not want to include an age qualifier in every STATA command, we will go ahead and keep only those between the ages of 25 and 65. To keep our sample consistent throughout this module, we will also keep only those cases that reported an income and those with a known education level:

[NOTE: Remember NOT to save over your original data set after you're done with this module]
keep if age>=25 & age<=65
keep if (incmon>=0 & incmon~=.)
keep if educ_new~=.
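
Because keep permanently drops cases from the data in memory, it can be worth confirming what is left. A minimal check, assuming the three keep commands above have just been run:

```stata
* Number of cases left after the three restrictions;
* the regressions in this module report 421 observations.
count

* Confirm that no case outside the intended sample remains.
* Each assert stops with an error if any case violates it.
assert age >= 25 & age <= 65
assert incmon >= 0 & incmon < .
assert educ_new < .
```

In STATA, missing values sort above every number, so "incmon < ." is simply another way of saying that incmon is not missing.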

Let's type:

corr incmon educ_new

             |   incmon educ_new
-------------+------------------
      incmon |   1.0000
    educ_new |   0.5984   1.0000

The correlation between educ_new and incmon is 0.5984, which is definitely in line with our hypothesis: education level is associated with income earned. With a correlation, however, we do not know to what extent education makes a difference; we just know that it is positively associated with income. To further understand this relationship, we need to estimate the regression of income on education.

We accomplish this by typing:

reg incmon educ_new

      Source |       SS       df       MS              Number of obs =     421
-------------+------------------------------           F(  1,   419) =  233.72
       Model |   664868285     1   664868285           Prob > F      =  0.0000
    Residual |  1.1920e+09   419   2844755.4           R-squared     =  0.3581
-------------+------------------------------           Adj R-squared =  0.3565
       Total |  1.8568e+09   420   4421001.9           Root MSE      =  1686.6

------------------------------------------------------------------------------
      incmon |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    educ_new |   301.3989   19.71498    15.29   0.000     262.6464    340.1515
       _cons |  -206.0145   145.6281    -1.41   0.158    -492.2672    80.23829
------------------------------------------------------------------------------

Do you remember how to interpret these results? Let's review the basic regression equation:

Y = a + bX

In our case, this equation becomes: (predicted incmon) = -206.0145 + 301.3989(educ_new)

We can immediately interpret the slope coefficient for educ_new as the number of Rand by which incmon increases for every additional year of education (educ_new). Judging from the p-value (0.000) associated with the t-value (15.29), we can tell that the coefficient is statistically significant.

The constant, as discussed before, reflects the value of the dependent variable Y when the independent variables are equal to zero. While this property is technically useful in calculating the regression coefficients and predicted Y values, its actual value is not always of use. Obviously we do not want to ignore it, but we also do not need to dwell on it, since it is often not very interpretable. In our current case, it literally says that when education level is zero, predicted income is -206.01 Rand. Note, however, that the constant is not statistically significant (p = 0.158), which means that the estimated value is not significantly different from zero. If, however, we had centered our education variable around the sample's education mean, then the "zero" value would actually be the average level of education. Interpreting the constant in that case would be more useful.

Moving along, the R-squared for this regression tells us that education accounts for almost 36% of the variation of income around its mean. Another way to think about it: if we were asked to guess the income of an individual drawn at random from the sample, our guess would improve by approximately 36% (measured in squared prediction error) if we knew the education level of the individual instead of simply guessing the mean income of the sample.
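
The centering idea mentioned above can be tried directly. The sketch below is one way to do it; educ_ctr is a hypothetical variable name introduced here for illustration and is not used elsewhere in this module.

```stata
* Compute the sample mean of education; summarize leaves it in r(mean).
summarize educ_new, meanonly

* Centered education: 0 now corresponds to the sample mean.
generate educ_ctr = educ_new - r(mean)
label var educ_ctr "Education centered at sample mean"

* The slope is the same as before; only the constant changes.
* It now equals predicted income at the average level of education.
reg incmon educ_ctr
```

In a simple regression, predicted Y at the mean of X equals the mean of Y, so the new constant should match the sample mean of incmon.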

Let's now try graphing the regression equation:

#delimit ;

predict inchat;

graph twoway scatter incmon educ_new || line inchat educ_new,
ylabel(0(5000)15000) ytick(0(2500)15000)
xlabel(0(4)16) xtick(0(2)16);

#delimit cr

Issues of Parsimony vs. Saturation

When thinking about introducing variables into a model, it is important to keep the notions of parsimony and saturation in mind. That is, we should always strive to include only the variables that make sense and that are efficient at capturing the desired social phenomenon. Model building is often a balancing act between parsimony and saturation. When we say that a model is "saturated," we mean that the model has too many variables; it is overspecified. A model that is overspecified or saturated can often predict each case in the sample perfectly because the model is using up all the degrees of freedom. Therefore, when selecting variables for a model, it is prudent to include only the most necessary variables or risk overspecifying the model. With that in mind, let's proceed.

Introducing a Third Variable

At this point, we can consider including our first control variable. It is likely that the amount of income earned by any one person depends not only on their years of education, but also on their age. By including age in our model, we acknowledge that income is also a function of age. It is important to include this factor because most people accumulate not only life experience as they age, but also work experience and skills, thus making them more likely to earn a higher wage. If you remember our earlier discussion on how to interpret coefficients, each coefficient in a regression model is a partial effect, meaning that the coefficient reflects the effect of one variable while holding the others constant. In this case it means that when we include age, our coefficient for educ_new will be the effect of education while holding age constant. We are not saying that the coefficient of educ_new is the value for a newborn (age 0); rather, think of this "controlling" as comparing people of the same age who differ in education, which standardizes the effect across all observations. Enough theory, let's try running the multiple regression model now:

reg incmon educ_new age

      Source |       SS       df       MS              Number of obs =     421
-------------+------------------------------           F(  2,   418) =  119.87
       Model |   676781025     2   338390513           Prob > F      =  0.0000
    Residual |  1.1800e+09   418  2823061.65           R-squared     =  0.3645
-------------+------------------------------           Adj R-squared =  0.3614
       Total |  1.8568e+09   420   4421001.9           Root MSE      =  1680.2

------------------------------------------------------------------------------
      incmon |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    educ_new |   312.9772   20.43244    15.32   0.000      272.814    353.1403
         age |   17.94223   8.734351     2.05   0.041     .7735008    35.11095
       _cons |  -963.2319   396.1364    -2.43   0.015      -1741.9   -184.5642
------------------------------------------------------------------------------

Compare our old equation (from above):

(predicted incmon) = -206.0145 + 301.3989(educ_new)
--> {R-squared = 0.3581}

To our new multiple regression equation:

(predicted incmon) = -963.2319 + 312.9772(educ_new) + 17.94223(age)
--> {R-squared = 0.3645}

Right away we should notice the effect that age has on our model. Notice the increase in the coefficient for education (312.98), up from 301.40. Therefore, after controlling for age, which itself is associated with an additional 18 Rand a month for each year of age, every additional year of education on average produces an additional 312.98 Rand per month in earned income. Another way of thinking about these new results is that in the initial model, the "true" effect of educ_new was being partly masked by the effect of age.

The addition of a single regressor to the bivariate model probably does not seem that difficult, but as we progress in this module, you will realize that this is merely the tip of the iceberg.

Now that you have been introduced to multiple regression, try the following two exercises:

  1. It is possible that income and the number of household members predict the amount of money a household spends on clothing. Run a regression to see if this hypothesis finds support in our data.
  2. Among people who reported working, to what extent does the number of hours worked in a week predict a person's monthly income? Without running any further regressions, what other variables might also help predict a person's monthly income?

 

DUMMY VARIABLES

Thus far we have focused on using continuous variables in our regressions. We can extend regression analysis to include categorical variables such as gender, general satisfaction, race, metro, rural, etc. But how do you include variables whose values are arbitrary? Can we calculate the average race of a country? How about the average metro setting? The answer is no, but let's find out how these types of variables are useful in regression analysis.

What Makes a Dummy Variable a "Dummy" variable?

No, "dummy" variables are not "stupid" variables; in fact they are quite smart and useful! A dummy variable has two properties that make it a "dummy variable." First, it is categorical and non-ordinal (i.e., its categories have no rank order). Thus, the numeric values associated with each category serve only to identify the various groups/categories it represents, not to assign value or order to any one category. The second property, and this is what makes a dummy variable a "dummy variable," is that it is binary in the sense that it has only two values: 0 and 1. Technically, a variable like race or metro has more than two values, but when this type of variable is used in a regression, coefficients are calculated for each category while the dummies for all the other categories are equal to zero. Thus, if done correctly, even a multi-category variable can be used as a set of dummy variables because, in the end, it is broken up into 0s and 1s.

Dummy variables are useful because they allow us to control for membership within a particular category or group. If we neglected to split a categorical variable into several dummy variables when using it in a regression, we would get invalid results because regression analysis assumes variables to be continuous unless told otherwise. Therefore, if you include a categorical variable like race into a regression, STATA (or any other statistical program) would recognize it as simply another variable and would not realize that those numbers have no mathematical meaning - STATA does not know if the values in a variable are arbitrary or not. Regression analysis revolves around the use of means and standard deviations, but with categorical variables, means and standard deviations have no meaning.

How NOT to use Categorical variables

Let's try the following example of what NOT to do. Let's continue with our previous example of income returns to education. This time let's include race in the regression model without considering the fact that it is a categorical variable. First, let's tabulate race to see its categories, but remember that because we are interested in incmon, we are restricted to the working-age population:

tabulation of race

         19 |
:population |
      group |      Freq.     Percent        Cum.
------------+-----------------------------------
   01-afric |        283       67.22       67.22
   02-colou |         54       12.83       80.05
   03-india |         14        3.33       83.37
   04-white |         70       16.63      100.00
------------+-----------------------------------
      Total |        421      100.00

Let's regress it now:

regress incmon educ_new race

      Source |       SS       df       MS              Number of obs =     421
-------------+------------------------------           F(  2,   418) =  255.77
       Model |  1.0218e+09     2   510920800           Prob > F      =  0.0000
    Residual |   834979197   418  1997557.89           R-squared     =  0.5503
-------------+------------------------------           Adj R-squared =  0.5482
       Total |  1.8568e+09   420   4421001.9           Root MSE      =  1413.3

------------------------------------------------------------------------------
      incmon |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    educ_new |   185.0327   18.67354     9.91   0.000      148.327    221.7384
        race |   921.1761   68.90877    13.37   0.000     785.7252    1056.627
       _cons |  -1056.576   137.6228    -7.68   0.000    -1327.095    -786.057
------------------------------------------------------------------------------

After reviewing these results, how would you interpret the race coefficient? Would it make sense to say that for every unit increase in race, while controlling for education (educ_new), there is a 921.18 Rand increase in income? The answer is NO. This is similar to saying that the average race in South Africa is 2.3. What would 2.3 mean? Your guess is as good as mine.

The Correct Way

Let's try this same example, except this time we will do it correctly. To do this we need to call upon a few of our newly found skills. First, we need to split race into multiple dummy variables. There are two main ways to accomplish this task. Here we will cover the more familiar way (tab varname, gen(varname)), and then below you will be introduced to a new command that makes it easier: the xi command.

We covered this first command in Module 3:

tab race, gen(raceid) [Note: raceid will be automatically numbered with sequential numbers]

         19 |
:population |
      group |      Freq.     Percent        Cum.
------------+-----------------------------------
   01-afric |        283       67.22       67.22
   02-colou |         54       12.83       80.05
   03-india |         14        3.33       83.37
   04-white |         70       16.63      100.00
------------+-----------------------------------
      Total |        421      100.00

Then we tabulate our new raceid variables to make sure the command worked by typing:

tab1 raceid1 raceid2 raceid3 raceid4 [Note: tab1 tells STATA to tabulate each variable separately instead of cross-tabulating all of them together in one big matrix]

-> tabulation of raceid1  

race==01-af |
        ric |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        138       32.78       32.78
          1 |        283       67.22      100.00
------------+-----------------------------------
      Total |        421      100.00

-> tabulation of raceid2  

race==02-co |
        lou |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        367       87.17       87.17
          1 |         54       12.83      100.00
------------+-----------------------------------
      Total |        421      100.00

-> tabulation of raceid3  

race==03-in |
        dia |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        407       96.67       96.67
          1 |         14        3.33      100.00
------------+-----------------------------------
      Total |        421      100.00

-> tabulation of raceid4  

race==04-wh |
        ite |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        351       83.37       83.37
          1 |         70       16.63      100.00
------------+-----------------------------------
      Total |        421      100.00

Great, our command worked as it should. Each new raceid variable is coded as 1 for all people who are of that particular race and 0 for everyone else. For example, Indians (raceid3) equals 1 for a total of 14 cases and it equals 0 for everyone else, a total of 407 cases.
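
Beyond reading the four tables by eye, the same check can be automated. This is a small sketch, assuming the four raceid variables were just created by the tab command above:

```stata
* Each person belongs to exactly one race category, so the four
* dummies should sum to 1 for every case.
assert raceid1 + raceid2 + raceid3 + raceid4 == 1

* The mean of a 0/1 dummy is the share of cases in that category
* (for example, 14/421, or about 0.033, for raceid3).
summarize raceid1 raceid2 raceid3 raceid4
```

This also illustrates why a mean is meaningful for a dummy variable even though it is not meaningful for the original race codes.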

Now it's time to run the regression with our newly created dummy variables. We do this by typing:

reg incmon educ_new age raceid2 raceid3 raceid4

      Source |       SS       df       MS              Number of obs =     421
-------------+------------------------------           F(  5,   415) =  106.48
       Model |  1.0434e+09     5   208687855           Prob > F      =  0.0000
    Residual |   813381524   415  1959955.48           R-squared     =  0.5619
-------------+------------------------------           Adj R-squared =  0.5567
       Total |  1.8568e+09   420   4421001.9           Root MSE      =    1400

------------------------------------------------------------------------------
      incmon |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    educ_new |   185.2631   19.42997     9.53   0.000     147.0697    223.4565
         age |   6.671661   7.334563     0.91   0.364    -7.745864    21.08919
     raceid2 |   287.8807   208.6658     1.38   0.168    -122.2929    698.0543
     raceid3 |   1722.132   389.4502     4.42   0.000     956.5913    2487.673
     raceid4 |   2846.285   212.1007    13.42   0.000      2429.36    3263.211
       _cons |  -320.6505   334.1475    -0.96   0.338    -977.4831     336.182
------------------------------------------------------------------------------

Our new regression line can be stated as:

(predicted incmon) = -320.6505 + 185.2631(educ_new) + 6.671661(age) + 287.8807(raceid2) + 1722.132(raceid3) + 2846.285(raceid4)

By now, you should be able to interpret the basic regression equation. This new equation is simply an extension of the first regression equation discussed earlier. Let's quickly review it. This equation tells us that for every additional year of education, income increases by about 185.26 Rand while controlling for age and race. It also tells us that for every additional year of age, income increases by about 6.67 Rand while controlling for education and race, although this age coefficient is not statistically significant (t=0.91).

Now, the race coefficients tell us that for raceid2 (Coloured) there is an added effect of 287.88 Rand over the omitted category (raceid1 - Black Africans) while controlling for education and age. Similarly, for raceid3 (Indian) there is an added effect of 1722.13 Rand over Black Africans while controlling for education and age. For Whites (raceid4), there is an added effect of about 2846.29 Rand over the reference category, Black Africans, while controlling for education and age.

In general, the raceid coefficients show us the effect that race has on total monthly income after controlling for education and age. As you can tell, the effects are staggering! To be precise, the race effect for Coloureds is not statistically significant (t=1.38); the other raceid coefficients, however, are large and statistically significant.
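
The individual t-tests compare each group only with the omitted category. Two follow-up commands can help round out the picture when run immediately after the regression above. This is a sketch of one possible approach, not a step the module itself requires:

```stata
* Joint F-test: are the three race coefficients all equal to zero?
test raceid2 raceid3 raceid4

* Difference between two included categories (White minus Indian),
* reported with its standard error and confidence interval.
lincom raceid4 - raceid3
```

The point estimate from lincom is simply the difference between the two coefficients (2846.285 - 1722.132); what lincom adds is the standard error of that difference.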

NOTE on Omitted/Reference Categories

There is one important point to keep in mind when interpreting a multiple regression that uses dummy variables. Notice that only 3 raceid dummy variables were included in the equation. Why would this be necessary? It is necessary because if we were to include all four dummy variables together with the constant, the four dummies would always sum to one and would therefore be perfectly collinear with the constant. We would essentially overspecify the model, which we do not want to do. Whenever we use dummy variables, there should always be an omitted category (also known as the reference category); in this case the omitted category is Black African (raceid1).

Being "omitted" does not mean that the equation is ignoring that group of people. Rather, we are telling STATA to explicitly show us only the coefficients for raceid2, raceid3, and raceid4; the value for the omitted category (raceid1) can still be recovered from the results above. If you remember our description of what the constant is, you will realize that raceid1 can be derived from it. The constant in this case is analogous to a "reservoir" of values, into which all omitted categories get lumped. Therefore, if the constant represents the value of our dependent variable Y when all other regressors are equal to zero, that means that the "left over" cases are used to calculate the constant (in this case, those not in the category raceid2, raceid3, or raceid4). And who is not in the raceid2, raceid3, or raceid4 categories? Correct, raceid1 (Black Africans).

It is important to realize that we did not drop any cases by omitting the raceid1 category; we simply "shifted" them into the constant and use them as the comparison group. If we were using another set of dummy variables, gender for example, we would have to choose the reference category for that variable as well. If we chose men as our reference category, we would get a coefficient for women, but not for men. The value for men would be found in the constant. If both gender and race were included in a regression model as dummy variables, two omitted categories would be captured and represented by the constant; in our case it would have been African males (raceid1 + men).
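
To see the "reservoir" idea in the simplest possible setting, the sketch below uses a tiny made-up data set (six invented cases, three groups) rather than the survey data. The names y, group, and g1 to g3 are hypothetical. The preserve and restore commands set the survey data aside and bring it back afterwards, but the stored regression results will be those of the toy model, so re-run your own regression before continuing.

```stata
* Set the current data aside so the toy example does not destroy it.
preserve
clear

* Toy data, invented for illustration only.
input y group
10 1
12 1
20 2
22 2
30 3
34 3
end

* Create dummies g1, g2, g3 and omit g1 as the reference category.
tab group, gen(g)
reg y g2 g3

* Constant = mean of the reference group (11 here);
* constant + coefficient = mean of that included group (21 and 32).
display _b[_cons]
display _b[_cons] + _b[g2]
display _b[_cons] + _b[g3]

* Bring the original data back.
restore
```

With dummies as the only regressors, the regression reproduces the group means exactly: the reference group's mean sits in the constant, and each coefficient is the gap between that group's mean and the reference group's mean.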

A short cut: the "xi" option

Although the tab varname, gen(varname) command is useful in creating dummy variables, it is not strictly necessary. STATA provides us with an easier and more convenient shortcut for specifying a categorical variable in a regression equation. The xi command tells STATA to treat the specified variable(s) as categorical and to expand them into dummy variables. It can be used as a prefix to many STATA estimation commands, such as regress, logistic, and probit. Let's try it.

First, we will create and label a new gender variable that is consistent with dummy variable coding (0s and 1s). Note, however, that we could also have used the xi command directly on gender_n, but we chose not to.

gen gender=gender_n
label var gender "Gender of Respondent"
recode gender 3=0 2=1
label def gender 1 "Female" 0 "Male"
label val gender gender

Then we tabulate the new variable to make sure it works fine - and of course, it does.

tab gender

  Gender of |
 Respondent |      Freq.     Percent        Cum.
------------+-----------------------------------
       Male |        242       57.48       57.48
     Female |        179       42.52      100.00
------------+-----------------------------------
      Total |        421      100.00

Now we move on to using the xi command. We continue with our income returns to education example, but now we will be controlling for age, race, and gender. By doing so, we are stating that income depends not only on education, but also on age, race, and gender. This time, however, we will declare the White racial group (race = 4) as the reference category. We do this by issuing the char varname[omit] statement before the regress command. This statement is useful when using xi because STATA, by default, selects the first category of the specified variable as the reference category. To use xi, place "xi:" at the beginning of the regression command and then "tag" each variable you want STATA to expand into its constituent categories with an "i." prefix. See below:

char race[omit] 4 /* Makes category 4 of the race variable the reference category */
xi:reg incmon educ_new age i.race i.gender

Notice that an "i." is included for the variables race and gender. Also remember that we have told STATA to treat category 4 of the race variable as the reference category and since we have not specified a specific reference category for gender, STATA will omit its first category - 0, men.

. char race[omit] 4;

. xi:reg incmon educ_new age i.race i.gender;
i.race            _Irace_1-4          (naturally coded; _Irace_4 omitted)
i.gender          _Igender_0-1        (naturally coded; _Igender_0 omitted)

      Source |       SS       df       MS              Number of obs =     421
-------------+------------------------------           F(  6,   414) =   96.58
       Model |  1.0831e+09     6   180509880           Prob > F      =  0.0000
    Residual |   773761517   414  1868989.17           R-squared     =  0.5833
-------------+------------------------------           Adj R-squared =  0.5772
       Total |  1.8568e+09   420   4421001.9           Root MSE      =  1367.1

------------------------------------------------------------------------------
      incmon |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    educ_new |   183.1393   18.97933     9.65   0.000     145.8314    220.4471
         age |    8.55595   7.174016     1.19   0.234    -5.546089    22.65799
    _Irace_1 |   -2852.43   207.1245   -13.77   0.000    -3259.577   -2445.283
    _Irace_2 |  -2516.836   260.9101    -9.65   0.000     -3029.71   -2003.962
    _Irace_3 |  -1248.377    402.788    -3.10   0.002    -2040.142   -456.6124
  _Igender_1 |  -624.9599    135.737    -4.60   0.000    -891.7796   -358.1402
       _cons |   2735.125   412.9026     6.62   0.000     1923.478    3546.772
------------------------------------------------------------------------------

What do the results tell us? Right away we should be able to tell that our model explains about 58% of the variation in our dependent variable, which is great. Next, we should notice that all the dummy variable coefficients (for race and gender) are negative. These negative values tell us that in relation to the omitted categories (race=4 and gender=0, that is, White men), everyone within the reported categories (Africans, Coloureds, Indians, and women) earns significantly less after controlling for level of education and age! In other words, this model tells us that above and beyond levels of education and age, a person who reports being Black African, Coloured, Indian, or a woman is likely to earn significantly less than White males. Overall, the model tells us that if we know a person's level of education, age, race, and gender, we are likely to guess their income about 58% better than by simply guessing the mean income in the sample.
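
For readers on a newer release, STATA 11 and later also accept factor-variable notation, which expands categorical variables without the xi prefix. The line below is a sketch under that version assumption; it should give the same coefficients as the xi regression above, with the race and gender terms labelled by category value rather than as _Irace_ and _Igender_ variables.

```stata
* Factor-variable notation (STATA 11 or later; an assumption about
* the reader's version, not something this module requires).
* ib4.race  -> race as categorical, with category 4 as the base
* i.gender  -> gender as categorical, lowest value (0) as the base
reg incmon educ_new age ib4.race i.gender
```

With this notation the reference category is chosen inside the command itself, so no separate char statement is needed.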

Let's consider what our new equation looks like:

(predicted income) = 2735.125 + 183.1393(years of education) + 8.55595(years of age) - 2852.43(African=1, else=0) - 2516.836(Coloured=1, else=0) - 1248.377(Indian=1, else=0) - 624.9599(Woman=1, else=0)

The new equation allows us to calculate, for example, the predicted income for a 50 year old Indian man with 10 years of education or the predicted income for a 25 year old Coloured woman with 16 years of education. All we need to do is plug in the number of years of education, the years of age, and either a 1 or a 0 for whether or not the person falls within each particular category. Let's try it.

(Indian man predicted income) = 2735.125 + 183.1393(10) + 8.55595(50) - 2852.43(0) - 2516.836(0) - 1248.377(1) - 624.9599(0)

= 2735.125 + 1831.393 + 427.7975 - (0) - (0) - 1248.377 - (0)

-> (predicted income) = 3745.9385 rand for a 50 year old Indian man with 10 years of education. For a 25 year old Coloured woman with 16 years of education, the predicted income equation is the following:

(Coloured woman predicted income) = 2735.125 + 183.1393(16) + 8.55595(25) - 2852.43(0) - 2516.836(1) - 1248.377(0) - 624.9599(1)

= 2735.125 + 2930.2288 + 213.89875 - (0) - 2516.836 - (0) - 624.9599

-> (predicted income) = 2737.45665 rand for a 25 year old Coloured woman with 16 years of education.
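
The two hand calculations above can be double-checked in STATA itself. The lines below are a minimal sketch that uses the display command as a calculator, with the coefficients typed in from the regression table. The last line is an alternative that assumes the xi:reg model above was the most recent one run, so that its stored coefficients are available through _b[].

```stata
* 50 year old Indian man with 10 years of education
display (2735.125 + 183.1393*10 + 8.55595*50 - 1248.377)

* 25 year old Coloured woman with 16 years of education
display (2735.125 + 183.1393*16 + 8.55595*25 - 2516.836 - 624.9599)

* the first calculation again, using the stored estimates
* (assumes the regression above was the last model run)
display (_b[_cons] + _b[educ_new]*10 + _b[age]*50 + _b[_Irace_3])
```

Using the stored coefficients avoids retyping, and possibly mistyping, the numbers from the table.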

Does this sound right to you? Does it make sense that on average an older, less educated Indian man earns about 1000 rand more than a younger, more educated Coloured woman? What, if anything, do these predicted values assume? Any ideas? How about assuming that each of the non-categorical variables in our equation has a linear relationship with the dependent variable? Does it make sense that as we get older we continue to earn more money? How about with education?

In general, the relationship between age and income is not linear, but curvilinear. That is, there comes a point where age no longer provides an advantage in the workforce but is instead a detriment: its effect on earnings goes from positive to negative as people get really old. We will learn how to control for this curvilinear effect later in this module.

Note on Extrapolating Beyond the Data

Let's try calculating the following predicted income:

What is the predicted income for a 90 year old White male with 20 years of education? We can easily carry out the calculations for this question:

(predicted income) = 2735.125 + 183.1393(20) + 8.55595(90) - 2852.43(0) - 2516.836(0) - 1248.377(0) - 624.9599(0)

= 2735.125 + 3662.786 + 770.0355 - (0) - (0) - (0) - (0)

-> (predicted income) = 7167.9465 rand for a 90 year old White male with 20 years of education.

Do you see any problems with this example? Does our education variable include people with 20 years of education? How about our sample, does it include people over the age of 65? The answer to both questions is NO. Extrapolating beyond the available data points is never a good idea because our results apply only to the range of cases used to calculate the model. It is possible that our observed relationship holds for 90 year olds with 20 years of education, but it is also possible that it does not. The point is that without those actual cases in the calculation of the model it is impossible to know. Therefore, we suggest that you never try to extrapolate, that is, predict values beyond the range of the data points used in the model.
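
Before predicting for a particular person, it can help to check whether that person falls inside the range of the data used to fit the model. The line below is a small sketch, assuming the regression above is still the most recent estimation so that e(sample) marks the cases it actually used.

```stata
* range of education and age among the cases used in the regression
summarize educ_new age if e(sample)
```

If the value you want to plug in (for example, age 90) lies outside the reported Min and Max, the prediction is an extrapolation.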

Try these exercises to make sure you understand the basics of interpreting dummy variables in multiple regression analysis.

  1. What is the predicted amount of income earned for an African woman (raceid1 & gender1) of age 30 with a 12 year education?
  2. Question 3 Answer
  3. What is the predicted amount of income earned for a White woman (raceid4 & gender1), also of age 30 with a 12 year education?
  4. Question 4 Answer

 

INTERACTIONS WITH DUMMY VARIABLES

Thus far, we have only dealt with the additive effects of dummy variables. That is, the assumption has been that for each independent variable Xi, the amount of change in our dependent variable Y is the same regardless of the values of the other independent variables in the equation. This assumption allows us to interpret the partial coefficients as the effect of a variable while controlling for the other independent variables in the model.

The additive assumption, however, does not always hold. In such cases, the partial effect of a given independent variable cannot be interpreted as the effect of that variable while all others are held constant. Instead, the effect depends on the specific values of the other independent variables in the model. That is, we hypothesize that the independent variable Xi is linearly related to the dependent variable Y, but that this linear relationship depends on a different independent variable in the model. Interactions are perhaps best visualized and understood in the case of dummy variables.

For instance, in our example below, we interact the categories of race and gender. In effect, what we are testing with an interactive model is whether or not the linear relationship between an independent variable Xi and the dependent variable Y depends on the values of a different independent variable in the model. More intuitively, by interacting race and gender, we are testing whether being an African Female, for example, is significantly different from being simply African or Female independently. In other words, when results indicate a statistically significant interaction effect, the data suggest that being an African Female or a White Male, or any other combination of race and gender, is qualitatively different from being in either category independently.

In general, we can illustrate what we mean by the additive effect of dummy variables in regression with the graph below. Each category of an independent dummy variable has a slope as depicted by the lines in the graph. For instance, we can imagine the predicted effect of gender on income looking like the lines below. That is, because men on average earn more than women we would expect to find predicted regression lines like the ones below. As it stands, this first graph suggests that the effect of gender is similar across all racial groups, the only apparent difference is in magnitude between males and females -- both slopes are identical for each unit change in Xi.

In this second graph, we find a hypothetical interaction effect. We can imagine this effect to be similar in form to that of the interaction between race and gender. That is, the effect of gender (the slope of the line) depends on the particular race of the individual. In this case, we find that the upper-most line on the graph has a steeper slope than the line below it; thus the effect of gender depends on the value of Xi, in this case the race of the individual.

Let's now investigate how the theory measures up to empirical findings. Creating an interaction term with STATA is as easy as inserting an asterisk "*" between the two variables you wish to interact. In essence, this tells STATA to multiply these two variables together. Alternatively, you can generate each interaction term yourself by creating a new variable that multiplies the two desired variables together. In the example below, we use the easier of these two approaches.
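
For comparison, the second approach (building the interaction terms by hand) might look like the sketch below. It assumes race is coded 1 to 4 with 4 = White and gender is coded 0/1 with 1 = women, as elsewhere in this module. All of the new variable names are made up for this illustration.

```stata
* dummy variables for each non-reference race category and for women
* (the "if ... < ." part leaves missing values as missing)
gen african  = (race == 1) if race < .
gen coloured = (race == 2) if race < .
gen indian   = (race == 3) if race < .
gen woman    = (gender == 1) if gender < .

* interaction terms: the product of a race dummy and the gender dummy
gen african_woman  = african * woman
gen coloured_woman = coloured * woman
gen indian_woman   = indian * woman

* main effects plus the hand-made interaction terms
regress incmon educ_new age african coloured indian woman african_woman coloured_woman indian_woman
```

Because Whites and men are left out as the reference categories, this should estimate the same model that the asterisk shortcut builds automatically.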

First, we choose to use Whites as the reference category (race==4). Then, to interact race and gender, we simply include an asterisk between i.race and i.gender. Remember that it is important to restrict our model to the working age segment of the population. Let's try it.

char race[omit] 4
xi:reg incmon educ_new age i.race*i.gender

. xi:reg incmon educ_new age i.race*i.gender;
i.race            _Irace_1-4          (naturally coded; _Irace_4 omitted)
i.gender          _Igender_0-1        (naturally coded; _Igender_0 omitted)
i.race*i.gender   _IracXgen_#_#       (coded as above)

      Source |       SS       df       MS              Number of obs =     421
-------------+------------------------------           F(  9,   411) =   72.53
       Model |  1.1394e+09     9   126601385           Prob > F      =  0.0000
    Residual |   717408332   411  1745519.06           R-squared     =  0.6136
-------------+------------------------------           Adj R-squared =  0.6052
       Total |  1.8568e+09   420   4421001.9           Root MSE      =  1321.2

------------------------------------------------------------------------------
      incmon |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    educ_new |   172.8666   18.43997     9.37   0.000     136.6182     209.115
         age |   6.018636   6.952892     0.87   0.387    -7.649029     19.6863
    _Irace_1 |  -3729.071   255.3957   -14.60   0.000    -4231.116   -3227.026
    _Irace_2 |  -3361.513   341.8539    -9.83   0.000    -4033.513   -2689.513
    _Irace_3 |  -2166.792   451.2157    -4.80   0.000     -3053.77   -1279.814
  _Igender_1 |  -2279.992   322.2442    -7.08   0.000    -2913.444   -1646.539
_IracXge~1_1 |   1982.464   359.7817     5.51   0.000     1275.222    2689.706
_IracXge~2_1 |   1888.976   483.6555     3.91   0.000     938.2293    2839.723
_IracXge~3_1 |   2643.612   918.8979     2.88   0.004     837.2855    4449.938
       _cons |   3621.657   429.1715     8.44   0.000     2778.012    4465.302
------------------------------------------------------------------------------

As with the previous regression results, we find coefficients for the main effects of educ_new, age, _Irace_1 through _Irace_3, and _Igender_1, but now we also find the interaction effects of race and gender (_IracXgen_1_1, _IracXgen_2_1, and _IracXgen_3_1, abbreviated in the output as _IracXge~1_1 and so on). The first of these interactions is for African Women (race==1 and gender==1), the second is for Coloured Women (race==2 and gender==1), and the final one is for Indian Women (race==3 and gender==1). Since Whites (race==4) are the reference category, we do not get a separate interaction term for White Women.

When interpreting interaction effects, it is important to keep in mind that the main effects of the variables that were interacted can no longer be interpreted on their own. Once the interaction is in the model, each main effect applies only to the reference category of the other variable: _Igender_1 is the gender gap among Whites (the omitted race), and _Irace_1 through _Irace_3 are the race gaps among men (the omitted gender). We still use the main effects, together with the interaction terms, to calculate any estimated yhat value. For example, if we were interested in calculating the income for an African female aged 35 with 12 years of education, we plug the coefficients from the table above into the equation:

predicted income = 3621.657 + 12(172.8666) + 35(6.018636) - 1(3729.071) - 1(2279.992) + 1(1982.464)
predicted income = 1880.11
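
One way to read the interaction coefficients in the regression table above is to add each one to the main effect of gender, which gives the estimated gap between women and men within each race group. The sketch below uses display with the coefficients typed in from the table. The lincom line is an alternative that assumes the interaction model was the last one run and that the full variable name is _IracXgen_1_1 (the output abbreviates it).

```stata
* women versus men among Whites (the reference race): main effect only
display (-2279.992)

* women versus men among Africans, Coloureds, and Indians
display (-2279.992 + 1982.464)
display (-2279.992 + 1888.976)
display (-2279.992 + 2643.612)

* the African comparison again, with a standard error and confidence interval
lincom _Igender_1 + _IracXgen_1_1
```

In these results the estimated gap is about -2280 rand among Whites, about -298 among Africans, about -391 among Coloureds, and about +364 among Indians. This variation in the gender gap across race groups is what the interaction terms capture.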

  1. Run a regression predicting the resale value of a house using number of rooms in the house, and whether the house is located in a rural area or not. Is there an interaction effect between number of rooms and whether the house is in a rural area or not? How can you tell? How would you interpret the interaction effect?
  2. Question 5 Answer
  3. In figuring out what predicts someone's net pay, is there an interaction effect between education and gender? Compute a regression where education and gender are the independent variables explaining someone's net pay.
  4. Question 6 Answer

 

LINEAR TRANSFORMATIONS OF NON-LINEAR RELATIONSHIPS

Thus far, we have assumed linear relationships for all of our regression models. In fact, a linear relationship is a basic requirement for regression analysis. Empirically, however, variables are often not associated in a linear fashion. Yet this reality hardly precludes regression analysis from accurately predicting and describing real world phenomena. In this section we will show you two basic approaches: by using a quadratic term or by taking the natural logarithm of a term, we can transform non-linear relationships into approximately linear ones and often substantially improve the fit of a regression line.

Note: Logarithmic and quadratic transformations are not restricted to multiple regression. We have placed them in the multiple regression module because they are rather advanced topics and should only be addressed after one has a clear understanding of all of the material in Modules 6 and 7 prior to this section.

Transformations using Squared Terms

An often used squared transformation is age2 (age squared). Researchers often include both age and age2 in regression models because doing so allows the effect of a one-year increase in age to change as a person gets older. In other words, by including age squared, we are controlling for the curvilinear effect of age. That is, the effect of age is not likely to remain the same as we get older; after all, a 90 year old employee is not likely to make significantly more than, say, a 60 year old, because at a certain point the effect of age levels off or declines. Therefore, by including age2, the effect of age is allowed to vary across years of age.

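* create the squared term
gen age2=age*age

* linear model: income on age only, then plot observed and fitted values
regress incmon age
predict yhat1, xb
graph incmon yhat1 age
* same plot, with the fitted values joined by a line and their markers hidden
graph incmon yhat1 age, c(.l) s(Oi) sort

* quadratic model: income on age and age squared, plotted the same way
regress incmon age age2
predict yhat2, xb
graph incmon yhat2 age
graph incmon yhat2 age, c(.l) s(Oi) sort
graph incmon age2 age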

This graph allows us to see the effect of the squared term - age2.
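
The graph commands above use the syntax from before STATA 8. In STATA 8 and later, the old commands can be run by typing graph7 in place of graph, or the same kind of picture can be drawn with twoway. The lines below are a sketch of one possible equivalent, assuming yhat1 and yhat2 were created by the predict commands above.

```stata
* observed incomes as points, linear fit as a line
twoway (scatter incmon age) (line yhat1 age, sort)

* observed incomes as points, quadratic fit as a curve
twoway (scatter incmon age) (line yhat2 age, sort)
```

The sort option matters here: without it the fitted values are connected in the order of the data rather than in order of age.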

char race[omit] 4
xi:reg incmon educ_new age age2 i.race*i.gender

i.race                Irace_1-4    (naturally coded; Irace_4 omitted)
i.gender              Igende_0-1   (naturally coded; Igende_0 omitted)
i.race*i.gender       IrXg_#-#     (coded as above)

  Source |       SS       df       MS                  Number of obs =     507
---------+------------------------------               F( 10,   496) =   68.97
   Model |  1.1547e+09    10   115468006               Prob > F      =  0.0000
Residual |   830417593   496  1674229.02               R-squared     =  0.5817
---------+------------------------------               Adj R-squared =  0.5732
   Total |  1.9851e+09   506  3923117.90               Root MSE      =  1293.9

------------------------------------------------------------------------------
  incmon |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
educ_new |   164.7202   16.38277     10.054   0.000        132.532    196.9083
     age |   138.0641   33.88001      4.075   0.000        71.4981    204.6302
    age2 |  -1.519243    .435349     -3.490   0.001      -2.374598   -.6638872
 Irace_1 |  -3333.191   222.6827    -14.968   0.000      -3770.709   -2895.673
 Irace_2 |  -3126.918   287.8819    -10.862   0.000      -3692.536     -2561.3
 Irace_3 |  -2109.402   403.2001     -5.232   0.000      -2901.593   -1317.211
Igende_1 |  -1980.257   285.4959     -6.936   0.000      -2541.187   -1419.326
IrXg_1_1 |   1616.876   320.0354      5.052   0.000       988.0841    2245.669
IrXg_2_1 |   1734.739   413.7246      4.193   0.000       921.8703    2547.608
IrXg_3_1 |   1374.314   670.6213      2.049   0.041        56.7049    2691.922
   _cons |   643.0632   677.4539      0.949   0.343      -687.9699    1974.096
------------------------------------------------------------------------------

In terms of our coefficients, we find that each additional year of education increases income by 164.72 rand, and that age increases income up to about age 45 and decreases it thereafter. This is because a quadratic ax^2 + bx + c turns over at x = -b/(2a); with a = -1.519243 (the age2 coefficient) and b = 138.0641 (the age coefficient), the turning point is 138.0641/(2 x 1.519243) = 45.438.
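
The turning point can also be computed in STATA rather than by hand. This is a minimal sketch: the first line types in the coefficients from the table, and the second assumes the model with age and age2 was the last one run, so that its stored coefficients are available through _b[].

```stata
* age at which predicted income peaks, from the reported coefficients
display (138.0641/(2*1.519243))

* the same from the stored estimates: -b/(2a) with b = _b[age], a = _b[age2]
display (-_b[age]/(2*_b[age2]))
```

Both lines should return roughly 45.4 years.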

Transformations Using the Natural Logarithm

Often it is desirable to run a regression using the natural logarithm (to the base e) of a variable instead of the variable itself. For instance, if the graph of the dependent variable on the independent variable shows that the relationship is not linear, making one or both of the variables logarithmic can sometimes produce a linear relationship. Therefore, though a linear relationship might not exist between two variables, a linear relationship might exist between the natural logarithms of the two variables. Logarithmic transformation also lessens the influence of outliers (which can sometimes drastically affect the slope of the regression line) because the natural logarithm of a variable is much less sensitive to extreme observations than is the variable itself.

Income is a variable that is often transformed using its natural log. Doing so makes it so that the impact of each additional rand decreases as income increases. That is, after a certain point more money does not make that much more of a difference. For example, earning 2 billion rand a year versus earning 3 billion rand will probably not have much of an effect on how many "Appletisers" we drink, but earning only 100 rand per year versus 1000 rand is likely to make a huge difference.
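
As a sketch of how this looks in STATA, the lines below create the natural log of income and use it as the dependent variable. The name lnincmon is made up for this illustration, and the choice of educ_new and age as predictors is only an example.

```stata
* natural log of income; ln() returns missing for zero or negative values,
* so those cases drop out of the regression
gen lnincmon = ln(incmon)

* regress log income, instead of income, on education and age
regress lnincmon educ_new age
```

With a logged dependent variable, each coefficient is read approximately as a proportional change in income for a one-unit change in the predictor, rather than as a change in rand.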

EXERCISES

  1. How much does a family's total school expenditure change with an increase in total monthly income and household food subsidy? Do these variables significantly explain/predict changes in total school expenditure? Why or why not?
  2. Exercise 1 Answer
  3. What effect does being Indian have on household size in relation to being Coloured?
  4. Exercise 2 Answer
  5. Regress insurance expenditure on health care expenditure and total monthly income. What exactly are the F and t statistics testing in this regression?
  6. Exercise 3 Answer
  7. Households speaking which language have the highest total monthly expenditure (controlling for total monthly income)? The lowest total monthly expenditure? By what amount do households from these two language groups differ on average in their expenditure at all levels of income?
  8. Exercise 4 Answer
  9. Graph total monthly food value on total monthly income. Now logarithmically transform both total monthly food value and total monthly income and graph them. Which do you think will produce a stronger regression model? Regress total monthly food value on total monthly income and the logarithmic transformation of total monthly food value on the logarithmic transformation of total monthly income. Were you right? Why or why not?
  10. Exercise 5 Answer

 
