TABLE OF CONTENTS
Introduction
Correlation of Variables
Outliers
Simple Regression
Understanding Regression Output Tables
Graphing the Regression Equation
Putting It All Together
Exercises
INTRODUCTION
In Module 5, you learned methods in STATA that allowed you to determine whether two variables were statistically related or independent of one another. While this is indeed important, it is often necessary to take your analysis a few steps further to determine the actual relationship between variables.
In this Simple Regression Module, we will cover the first two methods commonly used to determine the relationship between two variables. The first is correlation analysis, which simply measures the strength or degree of association between two continuous variables. The second is simple regression analysis, which allows us to determine how one variable changes in relation to a change in another variable. In regression analysis, we are often interested in causal relationships. For example, we could be interested in whether political participation depends on individual income, or in whether product advertisement on the Internet leads to higher sales. In general, we are interested in whether variable X has an effect on variable Y. As such, it is often useful to think of variable X as the "independent" or "explanatory" variable and to think of variable Y as the "dependent" variable or as the "effect".
In this module, we will concentrate on the relationship between total monthly household expenditures (totmexp) and total monthly household income (totminc). Think about these two variables. Which one do you think is the independent variable? How about the dependent variable? Remember that the independent variable is the variable that is likely to "cause" or help "explain" the dependent variable. In this case, we are predicting that total monthly household expenditures (totmexp) depends on total monthly household income (totminc). It is clear that how much you spend as a household depends on how much you earn as a household; it is not as clear that how much you earn as a household depends on how much you spend as a household. Thus, totminc is our independent variable and totmexp is our dependent variable.
For now, let's concentrate on the first method we mentioned, correlation analysis. Then we will proceed to simple regression.
CORRELATION OF VARIABLES
Suppose someone made the statement, "Households that earn 6,000 Rand, spend more than households that earn 3,000." This sounds like a reasonable statement to make. However, being the researchers that we are, we want to confirm our intuition with empirical facts. Since we are dealing with two continuous variables and we are expecting a linear relationship, the appropriate measure of association is a Pearson correlation, which STATA computes with the correlate command (or corr for short).
We briefly introduced correlation analysis earlier in Module 4 and now we will review it further because it is closely related to the slope coefficient (b) in simple regression. As you should remember, the Pearson correlation measures the degree to which variables are related or, in other words, the degree to which they co-vary. When using correlation in our analysis, we must make the assumption that the relationship between our two variables is linear. If we suspect otherwise, we should make the proper adjustments to the variable that does not meet the assumption (we will cover this in more detail in Module 7). Overall, the initial use of the correlate command in STATA is a good way to start investigating whether your intuition about a relationship is remotely correct.
Using STATA, the first thing that we want to do is to limit the observations to one per household, since the two variables that we are interested in (totmexp and totminc) are household-level variables but are contained within individual-level cases. As we have done in the previous modules, we need to be careful to use only one observation per household; otherwise we will have the problem of multiple counting. This would lead to larger households having a greater influence on the results than smaller households. As before, we first sort the data by household id (hhid) and then use the correlate command with the household qualifier:
sort hhid
corr totminc totmexp if hhid~=hhid[_n-1]
STATA produces the following results:
         |  totminc  totmexp
---------+------------------
 totminc |   1.0000
 totmexp |   0.5284   1.0000
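You should notice that STATA uses only 1,024 cases, households in this case (the same number of observations that the regression on these two variables reports later in this module).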
STATA should produce the above table, but what does it mean? A correlation value can range from -1 to +1, with 0 indicating that there is no linear association and ±1 being a perfect linear association. Technically, if the correlation value is low (near 0), it does not necessarily mean that there is no association whatsoever, but rather that there is no linear association.
In our example, you will notice that the coefficient for totmexp and totminc (0.5284) is relatively large and positive. This means that the linear association between our two variables is relatively strong. Thus, as the values of totminc increase, so do the totmexp values. More clearly, households with higher total monthly income tend to have higher total monthly expenditures.
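The household qualifier can be easier to follow if it is stored in a variable first. The lines below are a minimal sketch, assuming the data set is already loaded; hhfirst is a hypothetical helper variable that is not part of the original data, and pwcorr is used here only as a convenient way to see the number of usable cases and a significance level next to the coefficient.

```stata
* hhfirst is a hypothetical helper variable, not part of the original data.
sort hhid
* Equals 1 on the first row of each household and 0 on every other row.
generate byte hhfirst = hhid ~= hhid[_n-1]
* Number of rows flagged, i.e. one per household.
count if hhfirst == 1
* Same Pearson correlation, with the number of cases used and a p-value.
pwcorr totminc totmexp if hhfirst == 1, obs sig
```

The count can be larger than the number of cases used in the correlation, because households with a missing value on either variable are left out of the calculation.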
Let's get back to the original statement, "Households that earn 6,000 Rand, spend more than households that earn 3,000." Our initial study of the matter suggests that this is likely to be true, according to our STATA correlation calculations. This initial approach, however, is too simplistic. STATA can do much more. We can go further and figure out exactly how much total monthly household income influences total monthly household expenditures. To do this, we will call upon the regress command (or reg for short). Before we move on, however, try the following questions:
- What does it mean when two variables render a correlation of 0.5000?
- Question 1 Answer
- What is the correlation between total monthly household income and total monthly household savings?
- Question 2 Answer
OUTLIERS
Before we continue on to simple regression analysis, it is a good idea to spend a few minutes reviewing the issue of outliers again. We must be extremely mindful of possible outliers and their adverse effects during any attempt to measure the relationship between two continuous variables. This is particularly true when using methods that rely on the means of variables, as is the case in both correlation and regression analysis. If we remember from an earlier module, means are extremely sensitive to outliers, whether positively or negatively skewed. Therefore, we will spend some time investigating how our two variables totmexp and totminc are distributed. The quickest method to accomplish this is to graph these variables in one scatterplot. Let's try it:
sort hhid
graph totmexp totminc if hhid~=hhid[_n-1]
Or, we can get a bit more sophisticated and try a few new options:
sort hhid
graph totmexp totminc if hhid~=hhid[_n-1], twoway oneway box ylabel(0 5000 to 25000) xlabel(0 25000 to 125000) xtick(0 12500 to 125000)
Both scatterplots display the same data points; however, the second one gives us more information about how each variable is distributed. (Note that the blue line and blue points were superimposed and are not created with the syntax above.) The options twoway oneway box told STATA to plot the variables together in a "twoway" graph, to give us a display of their respective "oneway" distributions (i.e., equivalent to a one variable histogram) and to plot their respective "boxplot" above the oneway distribution display. The remaining options (ylabel(0 5000 to 25000) xlabel(0 25000 to 125000) xtick(0 12500 to 125000)) tell STATA to make various formatting modifications to the x-axis and y-axis. From the additional information provided by this new graph, we can quickly see that most data points are clustered together between the two boxplots.
Right away, we can consider removing the three very obvious outliers, which we have marked in blue in the second scatterplot. It is very likely that each of these outliers is significantly moving the mean away from the median. Therefore, for our purposes we will remove these three cases and recalculate the graphs and the correlation between totmexp and totminc. To do this, type:
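The graph commands in this module use the older STATA graph syntax. For readers on Stata 8 or later, where the graph command was redesigned, the following is a rough sketch of two ways to get a similar picture. It assumes the data are already loaded; the second option draws only the scatterplot and does not reproduce the marginal oneway and box displays.

```stata
* A sketch for Stata 8 or later; not part of the original module.
sort hhid
* Option 1: run the original command through the legacy graph7 command.
graph7 totmexp totminc if hhid~=hhid[_n-1], twoway oneway box ylabel(0 5000 to 25000) xlabel(0 25000 to 125000) xtick(0 12500 to 125000)
* Option 2: a plain scatterplot in the newer syntax (no marginal plots).
twoway scatter totmexp totminc if hhid~=hhid[_n-1], ylabel(0(5000)25000) xlabel(0(25000)125000) xtick(0(12500)125000)
```

If the box plots are needed in the newer syntax, they would have to be drawn separately, for example with graph box.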
drop if totminc>30000 & totminc<125000
(10 observations deleted)
drop if totmexp>12500 & totmexp<21000
(2 observations deleted)
Note that a total of 12 observations were deleted because each outlier is a household which contains various household members (observations). Also note that after we are done with this exercise, you must reload the original data set to recover these dropped cases, and that unless you want to permanently keep these changes you should NOT save the data over the original datafile.
Now we can proceed with the calculations. To do so we type:
sort hhid
cor totminc totmexp if hhid~=hhid[_n-1]
(obs=1021)
         |  totminc  totmexp
---------+------------------
 totminc |   1.0000
 totmexp |   0.7949   1.0000
Then we type:
sort hhid
graph totmexp totminc if hhid~=hhid[_n-1], twoway oneway box ylabel(0 2500 to 12500) xlabel(0 5000 to 25000)
The new results are outstanding! After dropping those outliers, we get a correlation between totmexp and totminc of 0.7949 which clearly indicates a stronger linear association. The new scatterplot also demonstrates a more even distribution, although technically we still have outliers.
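One way to try out a set of drops without losing the original cases is to wrap the exercise in preserve and restore. This is only a sketch of that alternative, assuming you start from the freshly loaded, unmodified data set; the cut-off values are the same ones used above.

```stata
* preserve keeps a temporary copy of the data currently in memory.
preserve
drop if totminc>30000 & totminc<125000
drop if totmexp>12500 & totmexp<21000
sort hhid
corr totminc totmexp if hhid~=hhid[_n-1]
* restore brings back the preserved data, including the dropped cases.
restore
```

With this pattern there is no need to reload the data file afterwards, and there is less risk of accidentally saving the reduced data over the original.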
Formally Testing for Outliers
Some fields in social research embrace an active approach to the handling of outliers, whereas others take a more hands-off approach. Neither approach is superior to the other; after all, both are efforts to minimize the effects of extreme values. On one hand, the aggressive approach controls for the ill effects by eliminating cases from the models. On the other hand, the hands-off approach often uses more robust estimation procedures which can handle extreme values in the data.
For our purposes, we will only eliminate the three most obvious outliers for two reasons: 1) an in-depth study of how to formally handle outliers is beyond the scope of this course, and 2) we advocate the use of more robust procedures to handle possible outliers, but those procedures are also beyond the scope of this course. Therefore, we will take the middle ground and only eliminate the most obvious outliers for our regression models.
SIMPLE REGRESSION
Now that we have reviewed the issue of outliers, we can proceed with our study of regression using STATA.
Simple OLS (Ordinary Least Squares) regression is a procedure that determines the best-fitting regression line between two variables. In essence, the OLS regression line is the line that minimizes the sum of squared errors, that is, the squared vertical distances between the observed values of the dependent variable and the line. It is beyond the scope of this website to teach you the finer points and intricacies of regression analysis; however, we will provide useful examples to give you a feel for what it is in general. Our main purpose here will be to show you how to use STATA to calculate the regression line between two variables and how to interpret the results. If you are not clear on what exactly regression is or would like to have a deeper understanding of it, we suggest that you take a course in statistics as it relates to your field of interest.
In general, the simplest relationship between an independent and dependent variable can be expressed in the linear formula,
Y = a + bX
where Y is the dependent variable and X is the independent variable. The coefficient "b" is referred to as the slope and tells us how much the value of Y changes for a 1 unit change in X. The coefficient "a" tells us the value of Y when the independent variable X is zero. On an X-by-Y graph, the coefficient "a" is where the regression line intersects the y-axis.
In the case of totminc and totmexp, the equation can be written as follows:
totmexp = a + b(totminc)
This equation states that if you increase total monthly household income by one unit there will be a corresponding change (b) in total monthly household expenditure.
Remember that we can type help followed by the name of a command to learn more about that command in STATA. If we type help regress we will get a full description of the regression command, its options and its syntax. For our purposes we need to type the following (remember that we sort first in preparation for our household qualifier):
sort hhid
reg totmexp totminc if hhid~=hhid[_n-1]
STATA gives us the results table below:
  Source |       SS       df       MS                  Number of obs =    1024
---------+------------------------------               F(  1,  1022) =  395.98
   Model |   879345471     1   879345471               Prob > F      =  0.0000
Residual |  2.2696e+09  1022  2220705.91               R-squared     =  0.2793
---------+------------------------------               Adj R-squared =  0.2785
   Total |  3.1489e+09  1023  3078110.37               Root MSE      =  1490.2

------------------------------------------------------------------------------
 totmexp |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
 totminc |   .2006579   .0100837     19.899   0.000       .1808707    .2204451
   _cons |   1203.453   50.36378     23.895   0.000       1104.625    1302.282
------------------------------------------------------------------------------
UNDERSTANDING REGRESSION OUTPUT TABLES
What do all these numbers mean? The output shown here is actually three tables in one. There is a small table in the upper left, a list of information in the upper right, and a larger table across the bottom. The smaller table in the upper left hand corner is called the analysis of variance (ANOVA) table. Although we are not particularly interested in this portion of the results, you can learn more about it by clicking here.
For the purposes of understanding the basic relationship between totmexp and totminc, we will focus on three pieces of information provided by the output above: the slope coefficient, the constant and the R-squared. First, let's remember the basic linear regression equation:
Y = a + bX or in our case: totmexp = a + b(totminc)
If we plug the results into their appropriate spot in the equation, we get:
(predicted totmexp)i = 1203.453 + .2006579(totminc)i
In words, this equation tells us that for every one unit increase in total monthly household income (totminc), total monthly household expenditure (totmexp) is predicted to increase by about 0.201 Rand. This is not much of an increase, but it is statistically significant, as indicated by the 0.000 probability associated with this coefficient. In addition, the constant (_cons) tells us that when our independent variable totminc equals zero, predicted totmexp is 1203.45 Rand. The other important piece of information is the R-squared, which equals 0.2793. In essence, this value tells us that by knowing the value of our independent variable (totminc) we can guess the value of the dependent variable (totmexp) approximately 28% better than by simply guessing the mean. Or, as it is more commonly put, we can account for about 28% of the variation around the mean of totmexp with the knowledge of totminc. If you are interested in knowing what all the other output means, click here.
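In a simple regression with one independent variable, the R-squared is the square of the Pearson correlation between the two variables. As a quick check, the sketch below squares the correlation reported earlier (0.5284) with the display command; the second line assumes that the regress command above has just been run, so that STATA still holds its stored results.

```stata
* Square of the correlation from the corr output, to compare with 0.2793.
display 0.5284^2
* After regress, the R-squared is also available as a stored result.
display e(r2)
```

The first line gives about 0.2792; the small difference from 0.2793 comes from the correlation having been rounded to four decimal places.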
The Case of Simple Regression
Now we can use this formula to make actual estimates of totmexp for any given value of totminc.
So, looking at the regression results table above, we arrive at (predicted totmexp)i = 1203.453 + .2006579(totminc)i. What does this equation really tell us? What if we were interested in estimating how much a household that makes 5,000 Rand a month spends per month? Using the equation above, we plug in 5000 for (totminc)i and solve for the resulting (predicted totmexp)i. In this case, one predicts that a household that earns a total monthly income of 5000 Rand would spend 2206.74 Rand every month in total household expenditures. It is important to realize that a regression equation will never fit the observed data perfectly. Therefore the estimated value of monthly expenditure that our calculation predicts (2206.74) is just that, a prediction. That is why we place the word predicted in front of the dependent variable totmexp.
We can create a variable in STATA that equals the predicted value of totmexp for each household using this equation. We use the command predict to estimate each predicted totmexp. STATA must first calculate the regression for the variable whose estimated values you want to predict. Thus, we would type the following:
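The arithmetic for the 5,000 Rand household can be checked with STATA's display command. This is a small sketch: the first line types in the coefficients from the output table by hand, and the second assumes that the regression has just been run, so that the stored coefficients _b[_cons] and _b[totminc] are available.

```stata
* Predicted expenditure at an income of 5000, using the printed coefficients.
display 1203.453 + .2006579*5000
* The same calculation using the coefficients stored by the last regress.
display _b[_cons] + _b[totminc]*5000
```

The first line returns 2206.7425, which rounds to the 2206.74 Rand quoted above.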
sort hhid
reg totmexp totminc if hhid~=hhid[_n-1]
predict mexphat if hhid~=hhid[_n-1]
Note that we included "hat" as part of the name of the new predicted variable (mexphat). This is a common practice because the hat sign, ^, in regression equations often means estimated. Let's see what this new estimated variable looks like. Type:
sort hhid
list hhid totmexp mexphat if hhid~=hhid[_n-1]
Here is a partial view of what the resulting table should look like:
        hhid    totmexp    mexphat
  1.    1006   594.7083   1167.858
  2.    1008    441.925   1186.135
  3.    1012   1219.382   1431.987
  4.    2001   968.0333   866.2776
 12.    2008    1432.45   691.5654
 16.    2012      818.1   1054.752
 18.    2014   1669.233   1229.142
 23.    2025   845.3334    1040.99
 29.    3001   1185.923   1117.147
 33.    3015   387.7333    610.929
 35.    3017   876.8083   2112.917
 37.    3030   1852.733   1067.869
 41.    4007     5514.5   1127.002
 48.    4018   803.0833   1174.309
 50.    4022      804.5    1218.39
 54.    5011   1182.377   1272.148
 55.    6005   5488.969   2859.685
You can readily see that none of our predictions exactly matched the observed values. Nevertheless, the power of regression has given us a better estimate for each household than if we had just guessed 1672.938 Rand for each household, the sample mean for totmexp.
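To see how far the predictions are from the observed values, the residuals (observed minus predicted) can be saved and summarized. The following is a sketch under the assumption that the original data are loaded; ehat is a hypothetical variable name chosen for this illustration.

```stata
* ehat is a hypothetical name for the residuals (observed minus predicted).
sort hhid
quietly regress totmexp totminc if hhid~=hhid[_n-1]
predict ehat if hhid~=hhid[_n-1], residuals
* Compare the spread of totmexp around its mean with the spread of the residuals.
summarize totmexp ehat if hhid~=hhid[_n-1]
```

A standard deviation for ehat that is smaller than the one for totmexp indicates that the regression line is, on average, closer to the observed values than the sample mean is.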
BUT, is this the best we can do? Let's find out.
- How much would you expect a household to increase its total monthly food expenditure for every additional household member?
- Question 3 Answer
- Income also affects households' total monthly food expenditure. Try a simple regression of mxtfood on total monthly income. Write an equation for this relationship and interpret the result.
- Question 4 Answer
GRAPHING REGRESSION EQUATIONS
Having obtained a predicted value of the dependent variable totmexp, we can plot this relation with the graph command. The general syntax is the following:
graph "observed y-variable" "predicted y-variable" "independent variable", connect(.s) symbol(Oi) ylabel xlabel
connect(.s): tells STATA not to connect the points of the first y-variable, designated with the ".", but to connect the second y-variable (the predicted values) with a smooth line, designated with the "s".
symbol(Oi): tells STATA to plot the first y-variable (the observed values) with large circles, designated with the "O", and to make the points of the second y-variable (the predicted values) invisible, designated with the "i", so that only the connecting line appears.
So, in this instance, the command would be:
graph totmexp mexphat totminc if hhid~=hhid[_n-1], connect(.s) symbol(Oi) ylabel xlabel(0 25000 to 125000) xtick(0 12500 to 125000)
As you can see, the above graph is very similar to the scatterplots above. The difference now is that we have a regression line.
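The connect() and symbol() options belong to the older graph syntax. For readers on Stata 8 or later, a comparable picture can be sketched by overlaying a scatterplot and a line plot. This assumes that mexphat has already been created with predict as shown earlier, and it is an approximation of the original graph rather than an exact reproduction.

```stata
* A sketch for Stata 8 or later; assumes mexphat already exists.
sort hhid
* Observed values as points, predicted values as a line sorted by income.
twoway (scatter totmexp totminc) (line mexphat totminc, sort) if hhid~=hhid[_n-1], xlabel(0(25000)125000) xtick(0(12500)125000)
```

The sort option on the line plot orders the points by totminc before connecting them, which keeps the fitted line from zigzagging across the graph.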
Do you see a problem here? Remember our conversation about outliers? Let's put all of our newly acquired knowledge to use.
PUTTING IT ALL TOGETHER
If we look over our notes from above, we should only drop the most obvious outliers. First, let's reload our data to make sure we have all the original cases. We can do this by typing:
clear
use saldru12.dta
[Make sure that you are in the correct directory when calling the data.]
Then we can begin dropping the outliers by:
drop if totminc>30000 & totminc<125000
(10 observations deleted)
drop if totmexp>12500 & totmexp<21000
(2 observations deleted)
Then let's calculate the new regression line. What difference does the removal of the outliers make? Let's find out.
sort hhid
reg totmexp totminc if hhid~=hhid[_n-1]
  Source |       SS       df       MS                  Number of obs =    1021
---------+------------------------------               F(  1,  1019) = 1749.29
   Model |  1.6547e+09     1  1.6547e+09               Prob > F      =  0.0000
Residual |   963872417  1019  945900.311               R-squared     =  0.6319
---------+------------------------------               Adj R-squared =  0.6315
   Total |  2.6185e+09  1020  2567179.67               Root MSE      =  972.57

------------------------------------------------------------------------------
 totmexp |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
 totminc |   .5375762   .0128531     41.824   0.000       .5123545    .5627979
   _cons |    610.929   37.80863     16.158   0.000       536.7373    685.1206
------------------------------------------------------------------------------
WOW, what a difference the removal of the outliers made. The R-squared jumped from about 28% to about 63%! Our new slope (.5375762 versus .2006579) is also more accurate now that it is not being unduly biased by the large outliers. The constant is also much lower: 610.929 versus 1203.453.
Let's now plot the new graph using a newly constructed mexphat:
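To get a feel for how much the two fitted lines differ, we can compare their predictions for the same household. The sketch below plugs an income of 5000 Rand into each equation, using the coefficients printed in the two output tables.

```stata
* Prediction at an income of 5000 from the regression with the outliers.
display 1203.453 + .2006579*5000
* Prediction at an income of 5000 from the regression without the outliers.
display 610.929 + .5375762*5000
```

The predicted expenditure rises from about 2206.74 Rand to about 3298.81 Rand once the outliers are removed.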
predict mexphat if hhid~=hhid[_n-1]
sort hhid
graph totmexp mexphat totminc if hhid~=hhid[_n-1], connect(.s) symbol(Oi) ylabel xlabel(0 5000 to 22500)
Spend a few minutes studying these new results. What did we learn about the relationship between total monthly household income and total monthly household expenditure? How did it change from before? Is this the best we can do?
In fact, this is not the best we can do! Can you think of any other extraneous variables that could play a significant role in this bivariate relationship? How about the number of household members? How about whether the household is in a rural versus a metropolitan setting? Each of these variables is likely to alter our initial simple regression because the relationship between totminc and totmexp is likely to depend greatly on how many household members are present and similarly on whether the household is a rural one or not.
This type of analysis, however, requires more than a simple regression between two variables; it requires what is known as multiple regression. We cover multiple regression in the next module. But first, try your new knowledge on the questions below. Make sure you understand simple regression before you move on to the more complex multiple regression.
EXERCISES
So now that we have learned quite a bit about regression analysis, it's time to put our knowledge to the test!
- What is the correlation between where a person lives, their income, and their associated total monthly household food expenditure?
- Exercise 1 Answer
- Generally speaking (with a normal labor supply curve) people are willing to work more when they get paid more after taxes. Investigate this relationship by regressing "hours worked last week" on "household net wage." Have you estimated a labor supply function?
- Exercise 2 Answer
- Do those who are sick more days spend more on health care? Or, are the people who spend more on health care sick fewer days?
- Exercise 3 Answer
- Using the variables "sale_val" for the sale value of a house and "rooms_to" for the total number of rooms in a house, investigate how the value of a house changes as the house has more rooms.
- Exercise 4 Answer
- It is a reasonable expectation that the bigger a residence is, the more that it costs to live in it. Is this true in the Saldru dataset? If so, how much more will an additional room cost? Interpret the data with a regression and then show a graph.
- Exercise 5 Answer
- Now, let's consider whether the number of people in a family will determine the size of the house that it lives in. Is this true in the Saldru dataset? If so, how many more rooms (on average) is an additional person likely to make a family acquire? Please show the graph for this relationship.
- Exercise 6 Answer
- Eating out is considered to be a luxury. Is it reasonable to assume that as income goes up, money spent on eating out would go up too? See if this is the case in the Saldru dataset. If so, for every rand a family earns, how much is likely to be spent on eating out? Be wary of outliers and please graph your results.
- Exercise 7 Answer
- Another thing that income affects is the amount of money that a household spends on its car(s). Regress the number of vehicles per household on total monthly household income to see if this is true. If so, if a family earns 6000 rand, how much is it likely to spend on its car(s)? Be wary of outliers and graph your results.
- Exercise 8 Answer