
SIMPLE REGRESSION ANALYSIS

 

TABLE OF CONTENTS

Introduction
Correlation of Variables
Outliers
Simple Regression
Understanding Regression Output Tables
Graphing the Regression Equation
Putting It All Together
Exercises

INTRODUCTION

In Module 5, we learned methods using STATA that allowed us to determine whether two variables were statistically related or independent of one another. While this is indeed important, it is often necessary to take our analysis a few steps further to determine the actual relationship between variables.

In this Simple Regression Module, we will cover the first two methods commonly used to determine the relationship between two variables. The first is correlation analysis, which simply measures the strength or degree of association between two continuous variables. The second is simple regression analysis, which allows us to determine how one variable changes in relation to the change in another variable.

In regression analysis, we are often interested in causal relationships. For example, we could be interested in whether political participation depends on individual income, or in whether product advertisement on the Internet leads to higher sales. In general, we are interested in whether variable X has an effect on variable Y. As such, it is often useful to think of variable X as the "independent" or "explanatory" variable and to think of variable Y as the "dependent" variable or as the "effect".

In this module, we will concentrate on the relationship between total monthly household expenditures (totmexp) and total monthly household income (totminc). Take a minute to think about these two variables. Which one do you think is the independent variable? How about the dependent variable? Remember that the independent variable is the variable that is likely to "cause" or help "explain" the dependent variable. In this case, we are predicting that total monthly household expenditures (totmexp) depend on total monthly household income (totminc). It is generally clear that how much you spend as a household depends on how much you earn as a household; it is not as clear that how much you earn as a household depends on how much you spend. Thus, totminc is our independent variable and totmexp is our dependent variable.

For now, let's concentrate on the first method we mentioned, correlation analysis. Then we will proceed to simple regression.

 

CORRELATION OF VARIABLES

Suppose someone made the statement, "Households that earn 6,000 Rand, spend more than households that earn 3,000." This sounds like a reasonable statement to make. However, being the researchers that we are, we want to confirm our intuition with empirical facts. Since we are dealing with two continuous variables and we presume a linear relationship, the appropriate measure of association is a Pearson correlation, which in STATA we perform with the correlate command (or corr for short).

We briefly introduced correlation analysis earlier in Module 4, and now we will review it further because it is closely related to the slope coefficient (b) in simple regression. As you should remember, the Pearson correlation measures the degree to which variables are related or, in other words, the degree to which they co-vary. When using correlation in our analysis, we must assume that the relationship between our two variables is linear. If we suspect otherwise, we should make the proper adjustments to the variable that does not meet the assumption (we will cover this in more detail in Module 7). Overall, the correlate command in STATA is a good first step for investigating whether your intuition about a relationship is even roughly correct.

Using STATA, the first thing that we want to do is to limit the observations to one per household, since the two variables that we are interested in (totmexp and totminc) are household-level variables but are contained within individual-level cases. As we have done in the previous modules, we need to be careful to use only one observation per household; otherwise each household would be counted once for every member, and larger households would have a greater influence on the results than smaller households. As before, we first sort the data by household id (hhid). The qualifier hhid~=hhid[_n-1] then keeps an observation only when its hhid differs from that of the observation before it, that is, the first observation of each household. We use the correlate command with this household qualifier:

sort hhid
corr totminc totmexp if hhid~=hhid[_n-1]

STATA produces the following results:

             |  totminc  totmexp
-------------+------------------
     totminc |   1.0000
     totmexp |   0.5284   1.0000

You should notice that STATA uses only 1,028 cases, which here are households.

STATA should produce the above table, but what does it mean? A correlation value can range from -1 to +1, with 0 indicating that there is no linear association and ±1 indicating a perfect linear association. Technically speaking, a low correlation value (near 0) does not necessarily mean that there is no association whatsoever, but rather that there is no linear association.

A correlation value of 0.5284, as in our results above, is relatively large and positive. This means that the linear association between our two variables is relatively strong and that the two variables move in the same direction: as the values of totminc increase, so do the totmexp values. Put more plainly, households with higher total monthly income tend to have higher total monthly expenditures.
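If we also want to know how many households enter the calculation and whether the correlation is statistically distinguishable from zero, one option is the pwcorr command. The lines below are only a sketch: they assume the same data set and the same household qualifier used above.

```stata
* Sort first so the household qualifier picks one observation per household
sort hhid
* obs prints the number of cases used; sig prints the p-value of the correlation
pwcorr totminc totmexp if hhid~=hhid[_n-1], obs sig
```

With only two variables, pwcorr reports the same correlation as correlate; the extra lines of output are the significance level and the number of observations.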

Let's get back to our intuitive statement, "Households that earn 6,000 Rand, spend more than households that earn 3,000." According to our STATA correlation estimate, this is likely to be true. This initial approach, however, is too simplistic. STATA can do much more. We can go further and figure out exactly how much total monthly household income influences total monthly household expenditures. To do this, we will call upon the regress command (or reg for short).

Before we move on, however, try the following questions:

  1. What does it mean when two variables have a correlation of 0.5000?
  2. What is the correlation between total monthly household income and total monthly household savings?

 

OUTLIERS

Before we continue on to simple regression analysis, it is a good idea to spend a few minutes reviewing the issue of outliers again. We must be extremely mindful of possible outliers and their adverse effects during any attempt to measure the relationship between two continuous variables. This is particularly true when using methods that rely on the mean of a given variable, as is the case in both correlation and regression analysis. As we saw in an earlier module, means are extremely sensitive to outliers, whether they lie at the high or the low end of the distribution. Therefore, we will spend some time investigating how our two variables totmexp and totminc are distributed. The quickest way to do that is to graph these variables in one scatterplot. Let's try it:

sort hhid
scatter totmexp totminc if hhid~=hhid[_n-1]

 

Or, we can get a bit more sophisticated and try a few new options:

* #delimit works only inside a do-file, not at the interactive prompt
#delimit ;
sort hhid;

scatter totmexp totminc if hhid~=hhid[_n-1],
ylabel(0(5000)25000) ytick(0(2500)25000)
xlabel(0(25000)125000) xtick(0(12500)125000);
#delimit cr

Both scatterplots display the same information; however, the second one gives us a better description. The additional options control only the axes: ylabel() and xlabel() set which values are labeled, and ytick() and xtick() set where tick marks are drawn, each written as first(step)last. For example, ylabel(0(5000)25000) labels the y-axis from 0 to 25,000 in steps of 5,000. With this more detailed display of the y- and x-axes, we can quickly see that most data points are clustered together.

Now we can consider removing the three most obvious outliers. It is very likely that each of these outliers is pulling the mean well away from the median. Therefore, for our purposes we will remove these three cases and recalculate the graphs and the correlation between totmexp and totminc. To do this, type:
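To put numbers on what the scatterplot shows, we could also ask STATA for detailed summary statistics of both variables. This is a small sketch that assumes the same data set and household qualifier as above.

```stata
* Sort first so the household qualifier picks one observation per household
sort hhid
* detail adds percentiles (including the median), skewness,
* and the smallest and largest values of each variable
summarize totmexp totminc if hhid~=hhid[_n-1], detail
```

A mean that sits well above the median (the 50th percentile), together with a few very large values in the "Largest" column, points to the same extreme cases that stand out in the graph.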

drop if totminc>30000 & totminc<125000
(10 observations deleted)

drop if totmexp>12500 & totmexp<21000
(2 observations deleted)

Note that a total of 12 observations were deleted because each outlier is a household, and each household contains several household members (observations).

NOTE that after we are done with this exercise, you must reload the original data set to recover these dropped cases. Unless you want to keep these changes permanently, you should NOT save the data over the original data file.
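An alternative to reloading the file is to wrap a temporary change in preserve and restore, which take a snapshot of the data in memory and bring it back afterwards. The following is only a sketch of that idea, meant to be run on the original, undropped data; it is not the approach used in the rest of this module.

```stata
* Take a snapshot of the data currently in memory
preserve
* Temporarily remove the outlying households
drop if totminc>30000 & totminc<125000
drop if totmexp>12500 & totmexp<21000
* Work with the reduced data
sort hhid
corr totminc totmexp if hhid~=hhid[_n-1]
* Bring back the snapshot, including the dropped cases
restore
```

Inside a do-file, STATA restores the preserved data automatically when the do-file ends, even if restore is never reached.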

Now we can proceed with the calculations. To do so we type:

sort hhid
cor totminc totmexp if hhid~=hhid[_n-1]

(obs=1021)
             |  totminc  totmexp
-------------+------------------
     totminc |   1.0000
     totmexp |   0.7949   1.0000

Then we type:

#delimit ;
sort hhid;

scatter totmexp totminc if hhid~=hhid[_n-1],
ylabel(0(2500)12500) ytick(0(500)12500)
xlabel(0(5000)25000) xtick(0(2500)25000);
#delimit cr

The new results are outstanding! After dropping those outliers, we get a correlation between totmexp and totminc of 0.7949, which indicates a much stronger linear association than the 0.5284 we found before. The new scatterplot also shows a more even distribution.

Formally Testing for Outliers

Some fields in social research embrace an active approach to the handling of outliers, whereas others take a more hands-off approach. Neither approach is superior to the other; after all, both are efforts to minimize the effects of extreme values. The active approach controls for the ill effects by eliminating cases from the models, whereas the hands-off approach often chooses to use more robust estimation procedures, which can handle extreme values in the data.

For our purposes, we will only eliminate the three most obvious outliers, for two reasons: 1) an in-depth study of how to formally handle outliers is beyond the scope of this course, and 2) we advocate the use of more robust procedures to handle possible outliers, but those procedures are also beyond the scope of this course. Therefore, we will take the middle ground and only eliminate the most obvious outliers for our regression models.

 

SIMPLE REGRESSION

Now that we have reviewed the issue of outliers, we can proceed with our study of regression using STATA.

Simple OLS (Ordinary Least Squares) regression is a procedure that determines the best-fitting straight line between two variables. In essence, the OLS regression line is the line that minimizes the sum of squared errors, where each error is the vertical distance between an observed value of the dependent variable and the value the line predicts for it. It is beyond the scope of this website to teach you the finer points and intricacies of regression analysis; however, we will provide useful examples to give you a feel for what it is in general. Our main purpose here will be to show you how to use STATA to calculate the regression line between two variables and how to interpret the results. If you are not clear on what exactly regression is or would like to have a deeper understanding of it, we suggest that you take a course in statistics as it relates to your field of interest.

In general, the simplest relationship between an independent and dependent variable can be expressed in the linear formula,

Y = a + bX

where Y is the dependent variable and X is the independent variable. The coefficient "b" is referred to as the slope and tells us by how much the value of Y changes for a one-unit change in X. The coefficient "a" tells us the value of Y when the independent variable X is zero. On an X-by-Y graph, the coefficient "a" is where the regression line intersects the y-axis, which is why it is also called the intercept.

In the case of totminc and totmexp, the equation can be written as follows:

totmexp = a + b(totminc)

This equation suggests that there is a linear relationship between our two variables. If we were to find a positive b coefficient, our equation would suggest that as total monthly household income increases by one unit, there will be a corresponding increase of b units in total monthly household expenditure. Conversely, if we find a negative b coefficient, our equation will suggest that as total monthly household income increases by one unit, there will be a corresponding decrease in total monthly household expenditure.

Remember that we can type help followed by the name of any command to learn more about that command in STATA. If we type help regress, we will get a full description of the regression command, its options and its syntax. For our purposes we need to type:

* Sort first, in preparation for our household qualifier
sort hhid
reg totmexp totminc if hhid~=hhid[_n-1]

STATA gives us the results table below:

  Source |       SS       df       MS                  Number of obs =    1024
---------+------------------------------               F(  1,  1022) =  395.98
   Model |   879345471     1   879345471               Prob > F      =  0.0000
Residual |  2.2696e+09  1022  2220705.91               R-squared     =  0.2793
---------+------------------------------               Adj R-squared =  0.2785
   Total |  3.1489e+09  1023  3078110.37               Root MSE      =  1490.2

------------------------------------------------------------------------------
 totmexp |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
 totminc |   .2006579   .0100837     19.899   0.000       .1808707    .2204451
   _cons |   1203.453   50.36378     23.895   0.000       1104.625    1302.282
------------------------------------------------------------------------------

 

UNDERSTANDING REGRESSION OUTPUT TABLES

What do all these numbers mean? The output shown here is actually three tables in one. There is a small table in the upper left, a list of information in the upper right, and a larger table across the bottom. The smaller table in the upper left-hand corner is called the analysis of variance (ANOVA) table. We are not particularly interested in this portion of the results here.

For the purposes of understanding the basic relationship between totmexp and totminc, we will focus on three pieces of information provided by the output above: the slope, the constant, and the R-squared. First, let's remember the basic linear regression equation:

Y = a + bX or in our case: totmexp = a + b(totminc)

If we plug the results into their appropriate spot in the equation, we get:

(predicted totmexp)i = 1203.453 + .2006579(totminc)i

In plain words, this equation is telling us that for every one-Rand increase in total monthly household income (totminc), total monthly household expenditure (totmexp) will increase by about 0.20 Rand. This is not much of an increase; however, it is statistically significant, as indicated by the P>|t| value of 0.000 associated with this coefficient. In addition, the constant (_cons) tells us that when our independent variable totminc equals zero, the predicted totmexp is 1203.45 Rand. The other important piece of information is the R-squared, which equals 0.2793. In essence, this value tells us that by knowing the value of our independent variable (totminc) we can guess the value of the dependent variable (totmexp) approximately 28% better than by simply guessing the mean of totmexp. Or, as it is more commonly put, we can account for about 28% of the variation around the mean of totmexp with the knowledge of totminc.
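The three pieces of information discussed above can also be pulled out of STATA's stored results rather than read off the table. The sketch below assumes the regression from this section; it also checks the link to correlation analysis, since in a simple regression the R-squared equals the square of the Pearson correlation (0.5284 squared is about 0.279).

```stata
sort hhid
* Re-run the regression quietly so that its results are the active estimates
quietly regress totmexp totminc if hhid~=hhid[_n-1]
display "slope (b)    = " _b[totminc]
display "constant (a) = " _b[_cons]
display "R-squared    = " e(r2)
* Correlation over exactly the cases the regression used
quietly corr totminc totmexp if e(sample)
display "r squared    = " r(rho)^2
```

The last two displayed values should agree, which is one way to see why a stronger correlation goes hand in hand with a higher R-squared.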

The Case of Simple Regression

Now we can use this formula to make actual predicted estimates of totmexp for any given value of totminc.

So, looking at the regression results table above, we arrive at:

(predicted totmexp)i = 1203.453 + .2006579(totminc)i.

What does this equation really tell us? What if we were interested in estimating how much a household that makes 5,000 Rand a month spends per month? Using the equation above, we plug in 5000 Rand for (totminc)i and solve for the resulting (predicted totmexp)i: 1203.453 + .2006579 x 5000 = 2206.74. In this case, we predict that a household that earns a total monthly income of 5000 Rand would spend 2206.74 Rand every month in total household expenditures. It is important to realize that a regression equation will never fit the observed values perfectly. Therefore, the estimated value of monthly expenditure that our calculation predicts (2206.74) is just that, a prediction. That is why we place the word predicted in front of the dependent variable totmexp.
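Rather than doing this arithmetic by hand, we can let STATA do it from the stored coefficients. This is a sketch that assumes the regression above is the one being used; the value 5000 is just our example income.

```stata
sort hhid
quietly regress totmexp totminc if hhid~=hhid[_n-1]
* Predicted expenditure at a monthly income of 5000 Rand
display _b[_cons] + _b[totminc]*5000
* The same prediction, with a standard error and a confidence interval
lincom _cons + 5000*totminc
```

The lincom version is handy because it reminds us that the prediction itself is an estimate with uncertainty around it.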

A useful step after any regression is to create a variable in STATA that equals the predicted value of your dependent variable given your independent variable(s). We use the predict command to estimate each predicted totmexp. The predict command works from the most recent regression results, so it must be run after the regress command and before any other estimation command. Thus, we would type the following:

sort hhid
reg totmexp totminc if hhid~=hhid[_n-1]
predict mexphat if hhid~=hhid[_n-1]

Note that we named our new variable mexphat, which includes the suffix "hat" as part of the new name. This is a common practice because the hat sign, ^, is often used in regression equations to indicate estimated values. Let's see what this new estimated variable looks like. Type:

sort hhid
list hhid totmexp mexphat if hhid~=hhid[_n-1]

Here is a partial view of what the resulting table should look like:

      +------------------------------+
      |   hhid    totmexp    mexphat |
      |------------------------------|
   1. |   1006   594.7083   1411.335 |
   2. |   1008    441.925   1418.157 |
   3. |   1012   1219.382   1509.925 |
   4. |   2001   968.0333   1298.766 |
  12. |   2008    1432.45   1233.552 |
      |------------------------------|
  16. |   2012      818.1   1369.116 |
  18. |   2014   1669.233    1434.21 |
  23. |   2025   845.3334    1363.98 |
  29. |   3001   1185.923   1392.406 |
  33. |   3015   387.7333   1203.453 |
      |------------------------------|
  35. |   3017   876.8083   1764.091 |
  37. |   3030   1852.733   1374.012 |
  41. |   4007     5514.5   1396.085 |
  48. |   4018   803.0833   1413.743 |
  50. |   4022      804.5   1430.197 |
      |------------------------------|

You can readily see that none of our predictions matches the observed value exactly. Nevertheless, the regression results tell us that by knowing a household's value for totminc we can guess that household's value for totmexp about 28% better than by simply guessing 1672.938 Rand, the sample mean of totmexp.
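The gap between each observed and predicted value is called the residual. For household 4007 in the list above, for example, it is 5514.5 - 1396.085 = 4118.415 Rand. As a sketch, assuming the regression and the sort from above are still in place, predict can store these gaps in a new variable; the name mexpres is our own choice.

```stata
* mexpres (hypothetical name) = observed totmexp minus predicted mexphat
predict mexpres if hhid~=hhid[_n-1], residuals
* Summarize how far off the predictions are
summarize mexpres, detail
```

A few very large residuals in this summary are another hint that some households lie far from the regression line.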

BUT, is this the best we can do? Let's find out.

  1. How much would you expect a household to increase its total monthly food expenditure for every additional household member?
  2. Income also affects households' total monthly food expenditure. Try a simple regression of mxtfood on total monthly income. Write an equation for this relationship and interpret the result.

 

GRAPHING REGRESSION EQUATIONS

Having obtained a predicted value of the dependent variable totmexp, we can plot this relation with the scatterplot graphing command. In this instance, the command would be:

#delimit ;

graph twoway scatter totmexp totminc if hhid~=hhid[_n-1] ||
line mexphat totminc if hhid~=hhid[_n-1],
ylabel(0(2500)12500) ytick(0(500)12500)
xlabel(0(5000)25000) xtick(0(2500)25000);
#delimit cr

As you can see, this graph is very similar to the scatterplots above. The difference now is that we have a regression line.

Do you see a problem here? Remember our conversation about outliers? Let's put all of our newly acquired knowledge to use.

 

PUTTING IT ALL TOGETHER

As we decided above, we should drop only the most obvious outliers. First, let's reload our data to make sure we have all the original cases. We can do this by typing:

clear
use saldru12.dta
[make sure that you have the correct directory when calling the data]

Then we can begin dropping the outliers by:

drop if totminc>30000 & totminc<125000
drop if totmexp>12500 & totmexp<21000

Then let's calculate the new regression line. What difference does the removal of the outliers make? Let's find out.

sort hhid
reg totmexp totminc if hhid~=hhid[_n-1]
predict mexphat if hhid~=hhid[_n-1]

  Source |       SS       df       MS                  Number of obs =    1021
---------+------------------------------               F(  1,  1019) = 1749.29
   Model |  1.6547e+09     1  1.6547e+09               Prob > F      =  0.0000
Residual |   963872417  1019  945900.311               R-squared     =  0.6319
---------+------------------------------               Adj R-squared =  0.6315
   Total |  2.6185e+09  1020  2567179.67               Root MSE      =  972.57

------------------------------------------------------------------------------
 totmexp |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
 totminc |   .5375762   .0128531     41.824   0.000       .5123545    .5627979
   _cons |    610.929   37.80863     16.158   0.000       536.7373    685.1206
------------------------------------------------------------------------------

WOW, what a difference the removal of the outliers made! The R-squared jumped from a mere 28% to 63%. Our new slope (about 0.54 versus 0.20) is also more representative of most households now that it is not being unduly influenced by the large outliers. The constant is also much lower: 610.929 versus 1203.453.
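To see the two sets of results side by side, one option is to store each regression and then tabulate them together. The sketch below assumes STATA 8 or later and the saldru12.dta file used above; the names full and trimmed are our own labels for the two models.

```stata
* Start again from the original data
use saldru12.dta, clear
sort hhid
* Regression with all households
quietly regress totmexp totminc if hhid~=hhid[_n-1]
estimates store full
* Remove the outlying households and fit the regression again
drop if totminc>30000 & totminc<125000
drop if totmexp>12500 & totmexp<21000
quietly regress totmexp totminc if hhid~=hhid[_n-1]
estimates store trimmed
* Coefficients, standard errors, number of cases, and R-squared for both models
estimates table full trimmed, se stats(N r2)
```

Keeping both models in one table makes it easier to see how much three households changed the slope, the constant, and the R-squared.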

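Let's now plot the new graph. This time we use lfit, which tells STATA to fit and draw the regression line of totmexp on totminc itself; it is the same line we would get by plotting our newly constructed mexphat variable:

#delimit ;

scatter totmexp totminc if hhid~=hhid[_n-1] ||
lfit totmexp totminc if hhid~=hhid[_n-1],
ylabel(0(2500)12500) ytick(0(500)12500)
xlabel(0(5000)25000) xtick(0(2500)25000);
#delimit cr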

Spend a few minutes studying these new results. What did we learn about the relationship between total monthly household income and total monthly household expenditure? How did it change from before? Is this the best we can do?

In fact, this is not the best we can do! Can you think of any other variables that could play a significant role in this bivariate relationship? How about the number of household members? How about whether the household is in a rural versus a metropolitan setting? Each of those variables is likely to alter our initial simple regression relationship between totminc and totmexp, because household expenditure is likely to depend greatly on how many household members are present and, similarly, on whether the household is a rural one or not.

This type of analysis, however, requires more than simple regression between two variables; it requires what is known as multiple regression. We cover multiple regression in the next module. But first, try your new knowledge on the exercise questions below. Make sure you understand simple regression before you move on to the more complex multiple regression.

 

EXERCISES

So now that we have learned quite a bit about regression analysis, it's time to put our knowledge to the test!
  1. What is the correlation between where a person lives, their income, and their associated total monthly household food expenditure?
  2. Generally speaking (with a normal labor supply curve) people are willing to work more when they get paid more after taxes. Investigate this relationship by regressing "hours worked last week" on "household net wage." Have you estimated a labor supply function?
  3. Do those who are sick more days spend more on health care? Or are the people who spend more on health care sick fewer days?
  4. Using the variables "sale_val" for the sale value of a house and "rooms_to" for the total number of rooms in a house, investigate how the value of a house changes as the house has more rooms.
  5. It is a reasonable expectation that the bigger a residence is, the more it costs to live in it. Is this true in the Saldru dataset? If so, how much more will an additional room cost? Interpret the data with a regression and then show a graph.
  6. Now, let's consider whether the number of people in a family will determine the size of the house that it lives in. Is this true in the Saldru dataset? If so, an additional person is likely to make a family acquire how many more rooms (on average)? Please show the graph for this relationship.
  7. Eating out is considered to be a luxury. Is it reasonable to assume that as income goes up, money spent on eating out would go up too? See if this is the case in the Saldru dataset. If so, for every Rand a family earns, how much is likely to be spent on eating out? Be wary of outliers and please graph your results.
  8. Another thing that income affects is the amount of money that a household spends on its car(s). Regress the number of vehicles per household on total monthly household income to see if this is true. If so, if a family earns 6000 Rand, how much is it likely to spend on its car(s)? Be wary of outliers and graph your results.

 
