TABLE OF CONTENTS
Introduction
Column, Row and Cell
Summarize
Egen
Example: Variable Total Individual Income per Household vs. Total Monthly Income
Chi-Squared: Testing for Independence
Example: Variable New Government
Exercises
INTRODUCTION
Up to this point we have restricted our analysis to one variable at a time. This is certainly useful, although restricting our analysis to a single variable can be misleading. For example, we have learned that to produce the average monthly pay using STATA, we type:
means incmon
The above command produces a table of the Arithmetic, Geometric, and Harmonic means for the entire sample. From this table we see that on average individuals earn 1585.80 rand per month. When thinking about income, one can think of several factors that may affect what an individual earns.
Is it likely that an individual with a college degree will earn a higher wage than an individual with no formal education?
Hence, while reporting the mean income for the entire sample is useful, examining how income varies by a second variable can be even more helpful in discovering trends in the data.
In this module, we will examine the relationship between two variables (bivariate analysis) using crosstabs. A crosstab is a technique for analyzing the relationship between two variables that have been organized in a bivariate table. Using such a table, we can examine the presence and strength of the relationship between two variables.
What is bivariate analysis? Bivariate Analysis is the examination of two variables at the same time, hence the name bivariate. It is used frequently by social scientists and mathematicians to compare how two variables correspond with one another. While sophisticated equations can be written to model how one variable changes with respect to another (regression, the subject of Module 6), we are only concerned here with any two variables, whether they are mathematically related or not.
When would we use bivariate analysis? Although it can be used any time we have two variables that we want to examine at the same time, bivariate analysis is a good tool to use when we have a hunch that two variables "go together." If this is the case then we can compare them numerically.
For example, if we were interested in poverty across the country, it would be informative to know how much a household spends on food every month. To find this out, first use the lookfor command to find the variables that might be of interest.
lookfor food
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
subs_f int %9.0g 5d:value of food subsidy
food_ben int %9.0g 4e:value of food subsidy
mxtfood float %9.0g total monthly food exp.
mxoth1 float %9.0g other nonfood 1
mxoth2 float %9.0g other nonfood 2
foodrec float %9.0g total month food received
tmxrec float %9.0g tot month food received value
flgfrec1 byte %9.0g flag food received wage
flgfrec2 byte %9.0g flag food received remittances
flgfrec3 byte %9.0g flag food received f.o.s
pctfexp float %9.0g per capita tot food month exp.
hhtfexp float %9.0g household tot food month exp.
tmxpur float %9.0g tot month food purchase value
tmxpro float %9.0g tot month food produced value
tmxcon float %9.0g tot month food value
foodwage float %9.0g household food subsidy
foodcw1 int %9.0g household cas 1 food subsidy
foodcw2 int %9.0g household cas 2 food subsidy
stxfood int %9.0g food eaten out.
Here we can see that the variable for total monthly food expenditure is mxtfood.
Now, we could always use the list command to generate a table with each individual's observation number, household ID, and the amount his or her family spends on food every month.
(Please note that some observations are not listed here in order to save space.)
sort hhid
list hhid mxtfood
+-------------------+
| hhid mxtfood |
|-------------------|
1. | 1006 332.9 |
2. | 1008 293.2 |
3. | 1012 153.99 |
4. | 2001 479.8 |
5. | 2001 479.8 |
|-------------------|
6. | 2001 479.8 |
7. | 2001 479.8 |
8. | 2001 479.8 |
9. | 2001 479.8 |
10. | 2001 479.8 |
|-------------------|
11. | 2001 479.8 |
12. | 2008 1095.2 |
13. | 2008 1095.2 |
14. | 2008 1095.2 |
15. | 2008 1095.2 |
|-------------------|
We can scan the list to guess a family's well-being based upon monthly food expenditure, but the amount a family spends on food actually depends on many other things. For instance, we would expect that larger families would spend more on food every month than smaller families because they have more people to feed. To check this, we use the lookfor command once again to find a variable related to household size. In STATA we type:
lookfor household size
The command above returns a list of several variables related to households.
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
hhid float %9.0g household identification no
q7b byte %9.0g 7b:total household per unit
mxhous float %9.0g household exp.
hhsizep byte %9.0g hh size all
hhsizem byte %9.0g hh size memebers
homewage float %9.0g household housing subsidy
travwage float %9.0g household travel subsidy
hhtfexp float %9.0g household tot food month exp.
hhnwage float %9.0g household net wage
hhgwage float %9.0g household gross wage
foodwage float %9.0g household food subsidy
hhc1wage int %9.0g household cas 1 wage
foodcw1 int %9.0g household cas 1 food subsidy
bencw1 int %9.0g household cas 1 subsidy
hhc2wage int %9.0g household cas 2 wage
foodcw2 int %9.0g household cas 2 food subsidy
bencw2 byte %9.0g household cas 2 subsidy
otherinc float %9.0g household other income
profit31 float %9.0g household profit value
Clearly, hhsizem is the variable for household size (i.e. number of members). Try listing household size with the previous chart.
list hhid mxtfood hhsizem
+-----------------------------+
| hhid mxtfood hhsizem |
|-----------------------------|
1. | 1006 332.9 1 |
2. | 1008 293.2 1 |
3. | 1012 153.99 1 |
4. | 2001 479.8 8 |
5. | 2001 479.8 8 |
|-----------------------------|
6. | 2001 479.8 8 |
7. | 2001 479.8 8 |
8. | 2001 479.8 8 |
9. | 2001 479.8 8 |
10. | 2001 479.8 8 |
|-----------------------------|
11. | 2001 479.8 8 |
12. | 2008 1095.2 4 |
13. | 2008 1095.2 4 |
14. | 2008 1095.2 4 |
15. | 2008 1095.2 4 |
|-----------------------------|
Now we have some more information that helps us get a better idea why particular individuals might live in families that spend more money on food. Here, it makes sense that household 2014 would spend more than household 1006 on food because four more people live there. But to conclude that one family is better off than another because it spends more on food every month would also be incorrect. Food can be considered a "normal good," so the amount that a family spends on food also depends on its monthly income. To find the variable for income, use the lookfor command again:
lookfor income
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
farmrent int %9.0g crop rental income
liverent int %9.0g grazing rental income
rentinc float %9.0g rental income
agincome float %9.0g value of ag. income
otherinc float %9.0g household other income
totminc float %9.0g total monthy income
The last variable listed here is totminc, which stands for total monthly income.
list hhid mxtfood totminc
+------------------------------+
| hhid mxtfood totminc |
|------------------------------|
1. | 1006 332.9 1036 |
2. | 1008 293.2 1070 |
3. | 1012 153.99 1527.333 |
4. | 2001 479.8 475 |
5. | 2001 479.8 475 |
|------------------------------|
6. | 2001 479.8 475 |
7. | 2001 479.8 475 |
8. | 2001 479.8 475 |
9. | 2001 479.8 475 |
10. | 2001 479.8 475 |
|------------------------------|
11. | 2001 479.8 475 |
12. | 2008 1095.2 150 |
13. | 2008 1095.2 150 |
14. | 2008 1095.2 150 |
15. | 2008 1095.2 150 |
|------------------------------|
Here, it makes sense that household 2012 spends more money than 2025 on food because they have more money to begin with. Although we can't be sure that a relationship exists between the variables, we can get a general feel for the relationship between the variables by doing bivariate analysis. Later we will expand these techniques to include three (trivariate) and four (quadivariate) variables.
How do we do bivariate analysis in STATA? While the list command is very simple, it is not the most informative when we are trying to look at more than one variable at a time. In the above example, we saw that household 2014 spends more on food every month than household 1006 and that more people live in 2014. But to draw any conclusions, we have to search up and down the list trying to find numbers that fit our hypotheses. This process is tedious and time consuming. We would never want to do it with all of the households in the sample.
Thank goodness there is another option: cross-tabulations. Suppose we want to know whether people who live in rural areas move residences more or less often than people who live in urban areas. We should start by finding the variables we need using the lookfor command in STATA (for example: lookfor migrate and lookfor urban). We'll find two key variables: migrate and metro. We can tabulate each variable individually, but that tells us little about how the two variables are related.
tab migrate
13 |
:migrated |
during past |
5 yrs | Freq. Percent Cum.
------------+-----------------------------------
-1 | 15 0.29 0.29
1 | 347 6.75 7.04
2 | 4,782 92.96 100.00
------------+-----------------------------------
Total | 5,144 100.00
tab metro
metro - |
urban - |
rural | Freq. Percent Cum.
------------+-----------------------------------
Rural | 3,131 59.25 59.25
Urban | 922 17.45 76.70
Metro | 1,231 23.30 100.00
------------+-----------------------------------
Total | 5,284 100.00
That's a good start, but we still haven't tabulated the two variables together. A bivariate table (or crosstab) is simply a table that displays the distribution of one variable "across" the categories of a second variable. To create a bivariate table in STATA, we use the tabulate command, and instead of specifying a single variable we specify two.
The command is very simple: tab var1 var2.
The first variable is treated as the row variable and the second is the column variable.
tab migrate metro
13 |
:migrated |
during | metro - urban - rural
past 5 yrs | Rural Urban Metro | Total
-----------+---------------------------------+----------
-1 | 14 0 1 | 15
1 | 123 103 121 | 347
2 | 2,965 794 1,023 | 4,782
-----------+---------------------------------+----------
Total | 3,102 897 1,145 | 5,144
By looking at the survey, we learn that 1 equals "yes" and 2 is "no" for the question of whether the respondent has moved in the last 5 years. Based on these results, the largest group of people who have moved in the last five years now lives in a rural area (123 of 347), although this is not a majority of movers. The number of people who now live in a metro area and have moved in the last five years is only slightly lower (121).
Let's try another example to answer the question at the beginning of the module.
Is it likely that an individual with a college degree will earn a higher wage than an individual with no post-secondary education?
First, we need to divide the sample into those who have a college degree and those who do not. We can accomplish this by creating a dummy variable as discussed in Module 3.
gen college=.
replace college = 1 if educ_c == 16
replace college = 0 if educ_c ~= 16
tab college if (incmon>0 & incmon~=.), sum(incmon)
| Summary of monthly gross pay
college | Mean Std. Dev. Freq.
------------+------------------------------------
0 | 1532.1139 1925.7624 588
1 | 6334.6471 3335.703 17
------------+------------------------------------
Total | 1667.0612 2128.3454 605
From the table, the average monthly income for college graduates (about 6,335 rand) is more than four times that of non-college grads (about 1,532 rand).
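One caution about the dummy variable created above: in STATA a missing value satisfies educ_c ~= 16, so the second replace also assigns college = 0 to anyone whose education is unknown. The following is a minimal sketch of a variant that leaves such cases missing. It assumes educ_c may contain missing values, uses a small made-up data set (the values after input are illustrative, not from the survey), and introduces a hypothetical variable name, college2.

```stata
* Toy data: illustrative values only, not taken from the survey
clear
input educ_c
16
10
.
end

* Approach used above: the case with missing education is coded 0
gen college = .
replace college = 1 if educ_c == 16
replace college = 0 if educ_c ~= 16

* Variant (hypothetical name college2): assign 1/0 only when educ_c is observed
gen college2 = (educ_c == 16) if !missing(educ_c)

list educ_c college college2
```

In the listing, the third observation has college equal to 0 but college2 missing. Whether this matters for the survey data depends on how many cases have no recorded education.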
Next, we can use a simple crosstab to figure out how many people own a house, which is another sign of income within the population. For this question, we need to know (by using the lookfor command) that the variable for ownership is ownship_ and that the variable for type of dwelling is house_c. Now we can simply crosstab the two to get the number of people who own a dwelling:
tab house_c ownship_
1 |
:code:type | 5 :ownership
of house | -1 1 2 | Total
-----------+---------------------------------+----------
-1 | 0 5 0 | 5
01-shack | 0 349 158 | 507
02-house | 9 1,974 753 | 2,736
03-tradi | 0 729 67 | 796
04-maiso | 0 22 33 | 55
05-flat | 0 93 67 | 160
06-hoste | 0 0 64 | 64
07-outbu | 0 12 49 | 61
08-combi | 0 857 39 | 896
09-other | 0 0 4 | 4
-----------+---------------------------------+----------
Total | 9 4,041 1,234 | 5,284
Note that in these results, we get negative values for each variable. What do these negative values mean? It is hard to tell because they are not coded in the survey. It is probably safe to assume, however, that the negative values signal some kind of invalid answer, so for our purposes we will not include them. To leave them out, retype the command with the "if" qualifier, omitting the negative observations, and remember that we also need to exclude any missing observations.
tab house_c ownship_ if (house_c>=0 & house_c~=.) & (ownship_>=0 & ownship_~=.)
1 |
:code:type | 5 :ownership
of house | 1 2 | Total
-----------+----------------------+----------
01-shack | 349 158 | 507
02-house | 1,974 753 | 2,727
03-tradi | 729 67 | 796
04-maiso | 22 33 | 55
05-flat | 93 67 | 160
06-hoste | 0 64 | 64
07-outbu | 12 49 | 61
08-combi | 857 39 | 896
09-other | 0 4 | 4
-----------+----------------------+----------
Total | 4,036 1,234 | 5,270
These results show that 1,974 individuals in this sample live in a household that owns its house.
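If the negative codes are to be excluded from many tables, one alternative to repeating the if qualifier each time is to recode them as missing once. The sketch below assumes that -1 is the only invalid code, which the survey documentation does not confirm, and it uses made-up toy values rather than the survey data.

```stata
* Toy data: illustrative values only, not taken from the survey
clear
input house_c ownship_
1 1
2 -1
-1 1
2 2
end

* Treat -1 as a missing value in both variables
mvdecode house_c ownship_, mv(-1)

* tabulate leaves out observations with missing values by default
tab house_c ownship_
```

Because this changes the data in memory, it is safest to do it on a copy of the data set or before saving under a new name.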
The tab ... , missing Command
Sometimes, we may want to include missing values in our calculations. While the tabulate command is used to produce one- and two-way tables of frequency counts, the missing option can be included to request that missing values be treated like other values in calculations of counts, percentages, and other statistics. By default, STATA generates tables without the missing values unless we specify otherwise. The basic syntax is:
tab varname1 varname2, missing
Say we wanted to examine the relationship between relationship to head and metropolitan status, and wanted to include the missing values in our tabulation. We would enter the following command:
tab rel_head metro, missing
3 |
:relations |
hip to | metro - urban - rural
head | Rural Urban Metro | Total
-----------+---------------------------------+----------
01-resid | 449 180 258 | 887
02-absen | 83 12 3 | 98
03-wife | 309 122 171 | 602
04-son o | 1,470 361 489 | 2,320
05-fathe | 36 6 7 | 49
06-grand | 512 111 123 | 746
07-grand | 1 1 1 | 3
08-mothe | 1 4 1 | 6
09-son- | 35 14 4 | 53
10-broth | 14 4 5 | 23
11-aunt | 4 0 2 | 6
12-siste | 75 26 23 | 124
13-niece | 77 19 17 | 113
14-cousi | 6 12 4 | 22
16-house | 8 3 20 | 31
17-lodge | 1 2 0 | 3
18-other | 18 15 15 | 48
19-other | 3 5 2 | 10
. | 29 25 86 | 140
-----------+---------------------------------+----------
Total | 3,131 922 1,231 | 5,284
Notice that the last row of the table, marked by a period ".", contains the data for the missing values. Without the missing option included, this row will not appear.
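To see the effect of the missing option in isolation, here is a small sketch on made-up data. The variable name rel is hypothetical and the values are illustrative only.

```stata
* Toy data: one observation has a missing value for rel
clear
input rel metro
1 1
1 2
. 2
2 1
end

* Default: the observation with missing rel is left out
tab rel metro
display "observations tabulated: " r(N)

* With the missing option: it appears in its own "." row
tab rel metro, missing
display "observations tabulated: " r(N)
```

Comparing the two totals is a quick way to find out how many cases a table silently drops.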
Now that you've learned these new methods, try answering the following questions:
- 1. How many people get all of the water that they need daily, but still have to fetch it from a kilometer or more away?
- Question 1 Answer
- 2. How many people live in a house connected to electricity but still use wood as their main energy source for heating?
- Question 2 Answer
- 3. In what province is English most widely spoken?
- Question 3 Answer
COLUMN, ROW AND CELL
Column. Thus far, the tabulate command has helped us learn a good deal about our data set. Suppose, however, we wanted to know more about where the various South African languages are spoken throughout the country. More specifically, of all those living in metropolitan areas, what percentage speak English? Using our STATA command knowledge up to now, we could enter:
tab lang_cod metro
That yields the following table:
21 |
:language | metro - urban - rural
code | Rural Urban Metro | Total
-----------+---------------------------------+----------
01-engli | 17 136 239 | 392
02-afrik | 34 231 317 | 582
03-xhosa | 735 84 156 | 975
04-zulu | 913 183 230 | 1,326
05-tswan | 365 83 46 | 494
06-north | 489 40 47 | 576
07-south | 133 95 68 | 296
08-venda | 77 0 8 | 85
09-shang | 187 0 6 | 193
10-swazi | 100 25 4 | 129
11-ndebe | 51 18 10 | 79
12-other | 1 2 14 | 17
-----------+---------------------------------+----------
Total | 3,102 897 1,145 | 5,144
So now we know that 239 people who live in metropolitan areas speak English. But is this a relatively large or small number compared to the other languages spoken in metropolitan areas? We could carefully check through all the values and compare and contrast, but a more useful and efficient way to find this answer would be to run the same table, but instead of frequencies use percentages. To do this in STATA, enter:
tab lang_cod metro, column
+-------------------+
| Key |
|-------------------|
| frequency |
| column percentage |
+-------------------+
21 |
:language | metro - urban - rural
code | Rural Urban Metro | Total
-----------+---------------------------------+----------
01-engli | 17 136 239 | 392
| 0.55 15.16 20.87 | 7.62
-----------+---------------------------------+----------
02-afrik | 34 231 317 | 582
| 1.10 25.75 27.69 | 11.31
-----------+---------------------------------+----------
03-xhosa | 735 84 156 | 975
| 23.69 9.36 13.62 | 18.95
-----------+---------------------------------+----------
04-zulu | 913 183 230 | 1,326
| 29.43 20.40 20.09 | 25.78
-----------+---------------------------------+----------
05-tswan | 365 83 46 | 494
| 11.77 9.25 4.02 | 9.60
-----------+---------------------------------+----------
06-north | 489 40 47 | 576
| 15.76 4.46 4.10 | 11.20
-----------+---------------------------------+----------
07-south | 133 95 68 | 296
| 4.29 10.59 5.94 | 5.75
-----------+---------------------------------+----------
08-venda | 77 0 8 | 85
| 2.48 0.00 0.70 | 1.65
-----------+---------------------------------+----------
09-shang | 187 0 6 | 193
| 6.03 0.00 0.52 | 3.75
-----------+---------------------------------+----------
10-swazi | 100 25 4 | 129
| 3.22 2.79 0.35 | 2.51
-----------+---------------------------------+----------
11-ndebe | 51 18 10 | 79
| 1.64 2.01 0.87 | 1.54
-----------+---------------------------------+----------
12-other | 1 2 14 | 17
| 0.03 0.22 1.22 | 0.33
-----------+---------------------------------+----------
Total | 3,102 897 1,145 | 5,144
| 100.00 100.00 100.00 | 100.00
Here, STATA has calculated the percentages based on the total of each column. We see that 20.87% of all the people living in metropolitan areas speak English. The table also tells us other useful information. For example, from the Total column we also see that Zulu is the most commonly spoken language (25.78%) in our sample. Overall, this table answered our question correctly (of all the people living in metropolitan areas, what percentage speaks English?) because it put the number of people living in metropolitan areas in the denominator.
Row. Ok, now let's answer a slightly different question: Of all the English speakers, what percentage lives in metropolitan areas? From the table above, we could find that answer, but we would have to use our calculator! Luckily for us, STATA can do the work for us - if we know how to ask it! Here's how:
tab lang_cod metro, row
+----------------+
| Key |
|----------------|
| frequency |
| row percentage |
+----------------+
21 |
:language | metro - urban - rural
code | Rural Urban Metro | Total
-----------+---------------------------------+----------
01-engli | 17 136 239 | 392
| 4.34 34.69 60.97 | 100.00
-----------+---------------------------------+----------
02-afrik | 34 231 317 | 582
| 5.84 39.69 54.47 | 100.00
-----------+---------------------------------+----------
03-xhosa | 735 84 156 | 975
| 75.38 8.62 16.00 | 100.00
-----------+---------------------------------+----------
04-zulu | 913 183 230 | 1,326
| 68.85 13.80 17.35 | 100.00
-----------+---------------------------------+----------
05-tswan | 365 83 46 | 494
| 73.89 16.80 9.31 | 100.00
-----------+---------------------------------+----------
06-north | 489 40 47 | 576
| 84.90 6.94 8.16 | 100.00
-----------+---------------------------------+----------
07-south | 133 95 68 | 296
| 44.93 32.09 22.97 | 100.00
-----------+---------------------------------+----------
08-venda | 77 0 8 | 85
| 90.59 0.00 9.41 | 100.00
-----------+---------------------------------+----------
09-shang | 187 0 6 | 193
| 96.89 0.00 3.11 | 100.00
-----------+---------------------------------+----------
10-swazi | 100 25 4 | 129
| 77.52 19.38 3.10 | 100.00
-----------+---------------------------------+----------
11-ndebe | 51 18 10 | 79
| 64.56 22.78 12.66 | 100.00
-----------+---------------------------------+----------
12-other | 1 2 14 | 17
| 5.88 11.76 82.35 | 100.00
-----------+---------------------------------+----------
Total | 3,102 897 1,145 | 5,144
| 60.30 17.44 22.26 | 100.00
From these new results, we learn that 60.97% of English speakers live in metropolitan areas. From the Total Row at the bottom of the table, we see that most of the sample lives in rural areas (60.30%).
Cell. Another question: Excluding missing cases, what percentage of the sample live in metro areas and speak English? To find the answer to this question, type the following:
tab lang_cod metro, cell
+-----------------+
| Key |
|-----------------|
| frequency |
| cell percentage |
+-----------------+
21 |
:language | metro - urban - rural
code | Rural Urban Metro | Total
-----------+---------------------------------+----------
01-engli | 17 136 239 | 392
| 0.33 2.64 4.65 | 7.62
-----------+---------------------------------+----------
02-afrik | 34 231 317 | 582
| 0.66 4.49 6.16 | 11.31
-----------+---------------------------------+----------
03-xhosa | 735 84 156 | 975
| 14.29 1.63 3.03 | 18.95
-----------+---------------------------------+----------
04-zulu | 913 183 230 | 1,326
| 17.75 3.56 4.47 | 25.78
-----------+---------------------------------+----------
05-tswan | 365 83 46 | 494
| 7.10 1.61 0.89 | 9.60
-----------+---------------------------------+----------
06-north | 489 40 47 | 576
| 9.51 0.78 0.91 | 11.20
-----------+---------------------------------+----------
07-south | 133 95 68 | 296
| 2.59 1.85 1.32 | 5.75
-----------+---------------------------------+----------
08-venda | 77 0 8 | 85
| 1.50 0.00 0.16 | 1.65
-----------+---------------------------------+----------
09-shang | 187 0 6 | 193
| 3.64 0.00 0.12 | 3.75
-----------+---------------------------------+----------
10-swazi | 100 25 4 | 129
| 1.94 0.49 0.08 | 2.51
-----------+---------------------------------+----------
11-ndebe | 51 18 10 | 79
| 0.99 0.35 0.19 | 1.54
-----------+---------------------------------+----------
12-other | 1 2 14 | 17
| 0.02 0.04 0.27 | 0.33
-----------+---------------------------------+----------
Total | 3,102 897 1,145 | 5,144
| 60.30 17.44 22.26 | 100.00
This time, STATA gives us a table showing both the frequencies and the percentages by cell. From this we learn that 4.65% of the entire valid sample (not including missing cases) both lives in a metro area and speaks English.
Each of these options can be very helpful; which one to use depends on the question we want to answer. Remember to consider what denominator best gets at the desired answer.
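The three options differ only in the denominator. As a quick check, the percentages for English speakers in metro areas can be reproduced by hand from the frequencies in the tables above. This is a sketch using STATA as a calculator; the local macro names are arbitrary.

```stata
* Frequencies copied from the lang_cod by metro table above
* cell  = English speakers living in metro areas
* col   = everyone living in metro areas (column total)
* row   = all English speakers (row total)
* total = all valid cases (grand total)
local cell  = 239
local col   = 1145
local row   = 392
local total = 5144

display "column percentage: " %5.2f 100*`cell'/`col'
display "row percentage:    " %5.2f 100*`cell'/`row'
display "cell percentage:   " %5.2f 100*`cell'/`total'
```

These reproduce the 20.87, 60.97, and 4.65 shown in the column, row, and cell tables respectively.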
- 4. What percentage of people who live in traditional dwellings have moved in the last five years?
- Question 4 Answer
- 5. What percentage of people who have moved in the last five years live in traditional dwellings?
- Question 5 Answer
SUMMARIZE
Summarize Option within a Cross-tabulation Analysis. In this section, we will discuss a command that enables us to create a tri-variate analysis, as opposed to a two-way cross tabulation, which only gives frequencies or percentages for two variables.
For example, if we want to see the distribution of racial groups in rural, urban and metro areas, as we did in previous sections, we can find this by typing:
tab race metro
19 |
:populatio | metro - urban - rural
n group | Rural Urban Metro | Total
-----------+---------------------------------+----------
01-afric | 3,054 539 589 | 4,182
02-colou | 19 175 213 | 407
03-india | 0 62 57 | 119
04-white | 29 121 286 | 436
-----------+---------------------------------+----------
Total | 3,102 897 1,145 | 5,144
Now, let's go a step further. What is the average monthly income per worker in each racial group living in different areas? What would be the best table to create?
By combining commands that we have learned from past modules, we can already answer this question. One way would be:
sort metro
by metro: tab race, sum(incmon)
-> metro = Rural
19 |
:population | Summary of monthly gross pay
group | Mean Std. Dev. Freq.
------------+------------------------------------
01-afric | 703.95565 811.9145 248
02-colou | 601.77778 995.71755 9
04-white | 4396.3333 960.4636 3
------------+------------------------------------
Total | 743.02308 907.28152 260
_______________________________________________________________________________
-> metro = Urban
19 |
:population | Summary of monthly gross pay
group | Mean Std. Dev. Freq.
------------+------------------------------------
01-afric | 1296.383 1257.7867 47
02-colou | 832.34286 610.05231 35
03-india | 2743.3125 2170.4145 16
04-white | 3902.1333 3015.1395 30
------------+------------------------------------
Total | 1961.0859 2187.8447 128
_______________________________________________________________________________
-> metro = Metro
19 |
:population | Summary of monthly gross pay
group | Mean Std. Dev. Freq.
------------+------------------------------------
01-afric | 1002.0698 913.72426 43
02-colou | 1610.9697 1137.7115 33
03-india | 1684.75 933.05569 4
04-white | 4642.6964 3288.4132 56
------------+------------------------------------
Total | 2668.9779 2791.9117 136
These results suggest that the average monthly income of an African worker living in a metro area is about 1,002.07 Rand and about 1,296.38 Rand in urban areas. This table also provides the standard deviations associated with each of the means and the raw frequencies too. While these results are informative, they are not efficient. It would be better to create a table that shows all the necessary statistics in one table. To do so, we can type:
tab race metro, sum(incmon)
Means, Standard Deviations and Frequencies of monthly gross pay
19 |
:populatio | metro - urban - rural
n group | Rural Urban Metro | Total
-----------+---------------------------------+----------
01-afric | 703.95565 1296.383 1002.0698 | 824.26036
| 811.9145 1257.7867 913.72426 | 921.37097
| 248 47 43 | 338
-----------+---------------------------------+----------
02-colou | 601.77778 832.34286 1610.9697 | 1139.0909
| 995.71755 610.05231 1137.7115 | 995.02279
| 9 35 33 | 77
-----------+---------------------------------+----------
03-india | . 2743.3125 1684.75 | 2531.6
| . 2170.4145 933.05569 | 2011.2583
| 0 16 4 | 20
-----------+---------------------------------+----------
04-white | 4396.3333 3902.1333 4642.6964 | 4384.764
| 960.4636 3015.1395 3288.4132 | 3145.9774
| 3 30 56 | 89
-----------+---------------------------------+----------
Total | 743.02308 1961.0859 2668.9779 | 1540.4313
| 907.28152 2187.8447 2791.9117 | 2067.0334
| 260 128 136 | 524
Now that's much better. The results can now be easily compared. Note that we simply specified the summarize option to tell STATA to summarize incmon within the table. These results tell us that on average, Coloured workers in rural areas receive the lowest wage of about 602 Rand and White workers in metro areas receive the highest wage of about 4,643 Rand. If we are interested solely in the mean values and not the standard deviations or frequencies (counts), we can also specify the mean option, which will create a simpler table. Try it, type:
tab race metro, sum(incmon) mean
Means of monthly gross pay
19 |
:populatio | metro - urban - rural
n group | Rural Urban Metro | Total
-----------+---------------------------------+----------
01-afric | 703.95565 1296.383 1002.0698 | 824.26036
02-colou | 601.77778 832.34286 1610.9697 | 1139.0909
03-india | . 2743.3125 1684.75 | 2531.6
04-white | 4396.3333 3902.1333 4642.6964 | 4384.764
-----------+---------------------------------+----------
Total | 743.02308 1961.0859 2668.9779 | 1540.4313
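Note that each entry in the Total column is the mean over all workers in that row, that is, a frequency-weighted average of the three area means, not a simple average of them. The sketch below checks this for African workers using the means and frequencies reported in the earlier table; the local macro names are arbitrary.

```stata
* Means and frequencies for African workers, copied from the table above
* (rural, urban, metro)
local total_pay = 703.95565*248 + 1296.383*47 + 1002.0698*43
local n_workers = 248 + 47 + 43

display "number of workers: " `n_workers'
display "overall mean pay:  " %9.2f `total_pay'/`n_workers'
```

The result, about 824.26 Rand across 338 workers, matches the Total column. It is pulled toward the rural mean because most African workers in the sample live in rural areas.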
To get more familiar with these new options, try the following exercise:
- 6. How does the actual housing rent in each Province differ by population group?
- Question 6 Answer
EGEN
When using STATA to analyze this data set or other data sets, there will be many times when you will want to create a variable that combines data in the original data set in more expansive ways. For example, if we were interested in how income is dispersed geographically, it would be useful to know how much household income a region generates. The egen command in STATA is extremely handy in assisting with this type of research. In this module, we will cover only a few of the more important uses of this command. It would be to your advantage to become acquainted with more of the available options for egen in the help section of STATA.
The egen command allows variables to be created using functional commands that combine different variables in many different and important ways. This command is also an excellent complement for the "[_n]" command as you will soon learn.
Suppose we are interested in learning where the elderly are concentrated. Our working hypothesis is that households can better support their elders if they live in metro areas. Let's say that we are also interested in studying the relationship between the average age of a household and the race of that household. We can investigate both of these issues by using egen and [_n] (underscore-n).
The first issue is about the distribution of age by geographic location (metro). First, we'll sort the data by hhid then we will use the egen command to calculate the oldest age within each household:
sort hhid
egen maxage = max(age), by(hhid)
Let's examine more closely what this command tells STATA to do. It tells STATA to create a value for each individual in the data set that is equal to the age of the oldest person in that individual's household. To verify this, type the command:
list hhid maxage
+-----------------+
| hhid maxage |
|-----------------|
1. | 1006 50 |
2. | 1008 67 |
3. | 1012 61 |
4. | 2001 40 |
5. | 2001 40 |
|-----------------|
6. | 2001 40 |
7. | 2001 40 |
8. | 2001 40 |
9. | 2001 40 |
10. | 2001 40 |
|-----------------|
11. | 2001 40 |
12. | 2008 52 |
13. | 2008 52 |
14. | 2008 52 |
15. | 2008 52 |
|-----------------|
To further understand the previous command, it's instructive to think about and answer the following question: What would happen if the qualifier "by(hhid)" was not included in the statement? Think about it. STATA would find the oldest age in the entire file and attach that value to each individual case in the sample! Clearly, that was not our intention.
Returning to our investigation, we now have a variable that is the age of the oldest person in each household. It would be possible to run a simple cross tab now, but those results would be misleading: every household appears once for each of its members, so larger households would be counted several times. The qualifier if hhid~=hhid[_n-1] corrects for this by keeping only the first observation of each household:
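The difference between the two versions is easy to see on a small made-up data set. In this sketch the ages are illustrative only, and maxall is a hypothetical variable name used for the version without by().

```stata
* Toy data: two households, illustrative ages only
clear
input hhid age
1 50
1 20
2 67
2 30
2 5
end

sort hhid

* Oldest age within each household
egen maxage = max(age), by(hhid)

* Without by(): oldest age in the whole file, repeated for everyone
egen maxall = max(age)

list hhid age maxage maxall
```

Every member of household 1 gets maxage = 50 and every member of household 2 gets maxage = 67, while maxall is 67 for all five people.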
sort hhid
tab maxage metro if hhid~=hhid[_n-1]
| metro - urban - rural
maxage | Rural Urban Metro | Total
-----------+---------------------------------+----------
1 | 3 0 0 | 3
3 | 0 0 7 | 7
7 | 0 0 4 | 4
12 | 6 0 0 | 6
17 | 8 2 4 | 14
18 | 0 0 3 | 3
19 | 4 7 0 | 11
20 | 5 0 3 | 8
************************************************
80 | 3 0 1 | 4
81 | 41 0 0 | 41
82 | 0 4 9 | 13
83 | 4 0 0 | 4
84 | 18 0 2 | 20
86 | 0 7 2 | 9
87 | 0 2 0 | 2
89 | 0 11 0 | 11
90 | 15 0 0 | 15
92 | 6 0 0 | 6
93 | 0 8 0 | 8
-----------+---------------------------------+----------
Total | 3,048 893 1,198 | 5,139
(Note: Trimmed Results)
A few households, curiously, report a very young person as their oldest member. Setting those aside, in the rows shown most households with truly elderly members reside in rural areas.
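As an aside, there is another way to count each household once. The sketch below assumes that your version of STATA supports the tag() function of egen and that metro takes the same value for every member of a household; onehh is a hypothetical variable name.

```stata
* tag() marks exactly one observation in each household with 1, all others with 0
egen onehh = tag(hhid)

* Same cross tab as above, restricted to the tagged observation
tab maxage metro if onehh
```

Unlike the hhid~=hhid[_n-1] condition, this approach does not rely on the data being sorted by hhid at the moment the tab command is run.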
Let's now examine the second part of our question - what is the distribution of the average household age by race. Again, the egen command is necessary. First type:
sort hhid
egen avgage = mean(age), by(hhid)
Again, let's review what this command tells STATA to do. The egen command tells STATA to create a variable called avgage that gives each person in the data a value equal to the average age of that individual's household. As before, it is instructive to ponder what would happen if by(hhid) were omitted from the command. In that case, the command would tell STATA to create a variable called avgage that gave each person a value equal to the average age of the entire data set.
Continuing with our example, we again do not want a cross tab that counts more than one observation per household when examining the distribution of average household age by race. Therefore, it is imperative to once again use the hhid~=hhid[_n-1] condition. Type:
sort hhid
tab avgage race if hhid~=hhid[_n-1]
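Because this initial command yields too large a table, it is prudent to use some age qualifiers. Let's start by examining the distribution of avgage for values greater than 35. The extra condition avgage~=. is needed because STATA treats missing values as larger than any number: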
sort hhid
tab avgage race if hhid~=hhid[_n-1] & (avgage>35 & avgage~=.)
| 19 :population group
avgage | 01-afric 02-colou 03-india 04-white | Total
-----------+--------------------------------------------+----------
35.16667 | 12 0 0 0 | 12
35.25 | 4 0 0 0 | 4
35.33333 | 6 0 0 6 | 12
35.4 | 5 0 0 0 | 5
35.5 | 8 0 0 4 | 12
35.6 | 5 0 0 0 | 5
35.66667 | 3 0 0 3 | 6
36 | 8 0 0 0 | 8
*******************************************************************
66 | 1 0 2 3 | 6
67 | 6 0 0 4 | 10
68 | 3 0 0 2 | 5
69 | 2 0 0 0 | 2
69.5 | 0 0 0 2 | 2
70.5 | 0 0 0 2 | 2
72 | 4 0 0 2 | 6
73 | 1 0 0 0 | 1
75.5 | 0 2 0 0 | 2
76 | 0 0 0 2 | 2
80 | 0 0 0 1 | 1
81.5 | 0 0 0 2 | 2
-----------+--------------------------------------------+----------
Total | 325 26 10 104 | 465
(NOTE: Trimmed Results)
In the rows shown, we learn that households with the highest average ages are disproportionately white. The egen command offers other very useful ways to examine data. Make a point of reviewing the other egen options in the online STATA help files.
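A table with one row per distinct average age is hard to read. One possible remedy, sketched here under the assumption that your version of STATA includes the cut() function of egen, is to band avgage into ten-year groups first. The variable name agegrp and the cut points are our own illustrative choices.

```stata
* Band avgage into 10-year groups; values outside 35-120 are set to missing
egen agegrp = cut(avgage), at(35,45,55,65,75,120)

* One observation per household, with row percentages for easier reading
sort hhid
tab agegrp race if hhid~=hhid[_n-1], row
```

Note that the first band starts at exactly 35, so it differs slightly from the avgage>35 qualifier used above.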
Yet another use of the egen command is to create three variable cross-tabulations. Suppose we are interested in examining the level of satisfaction of each person by race and gender. Using the egen command, we can type:
egen racesex=group(race gender_n)
Let's make sure we understand what we just told STATA to do. The egen varname = group(varlist) option tells STATA to create a variable that takes on the values 1, 2, ... for the groups formed by the variables specified within the group( ) option, race and gender_n in our case. More specifically, it tells STATA to create one variable with values that correspond to each combination of race and gender: African women, African men, Coloured women, Coloured men, and so on. Let's look at the cross tab of race and gender_n first:
tab race gender_n
19 |
:populatio | 4 :gender code
n group | F M | Total
-----------+----------------------+----------
01-afric | 2157 2025 | 4182
02-colou | 216 191 | 407
03-india | 61 58 | 119
04-white | 221 215 | 436
-----------+----------------------+----------
Total | 2655 2489 | 5144
This simple crosstab tells us the frequency count of each combination of race and gender. Now, let's look at our newly created variable, racesex:
tab racesex
group(race |
gender_n) | Freq. Percent Cum.
------------+-----------------------------------
1 | 2,157 41.93 41.93
2 | 2,025 39.37 81.30
3 | 216 4.20 85.50
4 | 191 3.71 89.21
5 | 61 1.19 90.40
6 | 58 1.13 91.52
7 | 221 4.30 95.82
8 | 215 4.18 100.00
------------+-----------------------------------
Total | 5,144 100.00
The commands that follow are the labeling commands that will make our newly created variable much easier to read.
#delimit ;
label var racesex "Race by Gender";
label define racesex
1 "African Women"
2 "African Men"
3 "Coloured Women"
4 "Coloured Men"
5 "Indian Women"
6 "Indian Men"
7 "White Women"
8 "White Men";
label values racesex racesex;
#delimit cr
tab racesex
racesex | Freq. Percent Cum.
--------------+-----------------------------------
African Women | 2157 41.93 41.93
African Men | 2025 39.37 81.30
Coloured Wome | 216 4.20 85.50
Coloured Men | 191 3.71 89.21
Indian Women | 61 1.19 90.40
Indian Men | 58 1.13 91.52
White Women | 221 4.30 95.82
White Men | 215 4.18 100.00
--------------+-----------------------------------
Total | 5144 100.00
Look at the response labeled "1" for the variable racesex. We can see by examining the earlier cross tab of race by gender_n that this is the correct count of female Africans in the data. When labeling the responses of the new variable, remember that STATA orders the responses by taking the combinations in the first row of the cross tab from left to right, then the second row from left to right, and so on, until all possible combinations are complete.
Now we can cross-tabulate our new variable racesex with the level of satisfaction, satisfie. Because the level of satisfaction is a household variable, we must once again insert the hhid~=hhid[_n-1] condition into our command so that each household is counted only once. What we are examining, then, is how satisfaction differs by the race and gender of the one person retained for each household.
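Rather than working out the ordering by hand, one can ask STATA to show it. The following is a sketch that assumes the group() function in your version of egen accepts the label option; racesex2 is a hypothetical variable name.

```stata
* Build the same grouping, but let STATA label each group
* from the value labels of race and gender_n
egen racesex2 = group(race gender_n), label
tab racesex2

* Cross-check the numbering of the original variable against race
tab racesex race
```

In the second table, each value of racesex should fall entirely within one race, which confirms the order used when typing the labels.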
tab racesex satisfie if hhid~=hhid[_n-1]
group(race | 1 :level of satisfaction
gender_n) | -1 01-v sat 02-satis 03-neith | Total
----------------+--------------------------------------------+----------
African Women | 4 18 61 31 | 353
African Men | 2 19 89 29 | 417
Coloured Women | 1 2 18 7 | 40
Coloured Men | 0 3 13 1 | 36
Indian Women | 0 1 5 2 | 12
Indian Men | 0 2 7 0 | 13
White Women | 0 17 37 6 | 65
White Men | 0 19 37 13 | 78
----------------+--------------------------------------------+----------
Total | 7 81 267 89 | 1014
| 1 :level of
group(race | satisfaction
gender_n) | 04-dissa 05-v dis | Total
----------------+----------------------+----------
African Women | 149 90 | 353
African Men | 168 110 | 417
Coloured Women | 4 8 | 40
Coloured Men | 9 10 | 36
Indian Women | 4 0 | 12
Indian Men | 3 1 | 13
White Women | 5 0 | 65
White Men | 6 3 | 78
----------------+----------------------+----------
Total | 348 222 | 1014
Although this cross tab has many cells, two patterns stand out: there is little difference between the sexes, but African households report a very different level of satisfaction with the current government than white households do.
An Extended Example: Comparing the Total Individual Income per Household to the Variable Total Monthly Income
In the SALDRU data set, the total monthly income of a household was computed using many different sources of income. In certain situations, did the computed total monthly income use more than just the sum of the individual incomes of every member of the household? Let's find out using the egen command. To compute a variable that is the sum of the individual incomes (incmon) of all members of a household, use the following commands:
sort hhid
egen ttlinc=sum(incmon), by(hhid)
To test whether our assumption about the way total monthly income is computed is correct, all one has to do now is list hhid, ttlinc, and totminc together and see whether ttlinc and totminc are usually the same for each household.
list hhid ttlinc totminc
hhid ttlinc totminc
1. 1006 1100 1036
2. 1008 0 1070
3. 1012 1620 1527.333
4. 2001 450 475
5. 2001 450 475
6. 2001 450 475
7. 2001 450 475
8. 2001 450 475
9. 2001 450 475
10. 2001 450 475
11. 2001 450 475
12. 2008 0 150
13. 2008 0 150
14. 2008 0 150
15. 2008 0 150
16. 2012 784 825.6
17. 2012 784 825.6
18. 2014 0 1150
19. 2014 0 1150
20. 2014 0 1150
21. 2014 0 1150
22. 2014 0 1150
23. 2025 800 800
24. 2025 800 800
In many cases, the sum of a household's individual incomes differs from its total monthly income. This first look should prompt further questions about how individual and household incomes were computed.
One could also use the egen command with the sum adaptation to compute other very useful variables. For example, one could compute the total number of children in a household for each household, or total number of sicknesses by province.
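Scanning a long listing by eye is error-prone, so it may help to summarize the comparison of ttlinc and totminc. The sketch below uses a hypothetical variable name, incgap, and an arbitrary tolerance of 1 rand; both are our own assumptions.

```stata
sort hhid
* Difference between the survey's household total and our sum of individual incomes
gen incgap = totminc - ttlinc

* How many households differ by more than 1 rand? (one observation per household)
count if hhid~=hhid[_n-1] & abs(incgap) > 1 & incgap < .

* Distribution of the gap across households
summarize incgap if hhid~=hhid[_n-1], detail
```

Keep in mind that, as far as we are aware, the sum function of egen treats missing values of incmon as zero, which may be one reason some households show a ttlinc of 0.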
There are many useful ways to use egen, and hopefully these practice questions will help you become fluent with it.
- 7. Using egen, how many people who are African sons or daughters have migrated in the last five years?
- Question 7 Answer
- 8. What is the total household income for KwaZulu-Natal?
- Question 8 Answer
Chi-Squared: Testing for Independence
By now, we have examined tables of variables. Perhaps you have noticed that in a few examples, as one variable increased or decreased, the other variable in the cross tab decreased or increased. While the naked eye is good at noticing these relationships, it is unclear how reliable they are until we examine them statistically. Before any further analysis of the variables, it is a good idea to test whether the variables in the crosstab are independent. Two variables are independent if knowing the value of X tells us nothing about the distribution of Y, that is, if Y's movements are completely random with respect to X. As we shall see in the next module, this is a good test to run now. The test for independence detects any kind of relationship between the two variables, whereas in the next module we will be working only with linear relationships.
Let's try this simple example with the variables "safety inside the home" and "victim of a crime".
tabulate safety_h crime_q, row col chi2
4 :safety |
inside | 6a:victim of crime
home | -1 1 2 | Total
-----------+---------------------------------+----------
01-more | 0 99 730 | 829
| 0.00 11.94 88.06 | 100.00
| 0.00 15.64 15.74 | 15.69
-----------+---------------------------------+----------
02-less | 1 98 1455 | 1554
| 0.06 6.31 93.63 | 100.00
| 7.69 15.48 31.37 | 29.41
-----------+---------------------------------+----------
03-the s | 12 436 2453 | 2901
| 0.41 15.03 84.56 | 100.00
| 92.31 68.88 52.89 | 54.90
-----------+---------------------------------+----------
Total | 13 633 4638 | 5284
| 0.25 11.98 87.77 | 100.00
| 100.00 100.00 100.00 | 100.00
Pearson chi2(4) = 81.2115 Pr = 0.000
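Pr = 0.000 means that the p-value is smaller than 0.0005: if the two variables were independent, a chi-squared statistic this large would almost never occur. We therefore reject independence and conclude that the two variables are related. Logically, this makes sense. Victims of crime are likely to have different opinions about their safety than people who have not been that unfortunate.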
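To look behind the test statistic, the following sketch may be useful. The table has 3 rows and 3 columns, so the test has (3-1) x (3-1) = 4 degrees of freedom, which matches the chi2(4) in the output. The expected option and the chi2tail() function are standard in the versions of STATA we know of, but check your own help files.

```stata
* Show the counts expected under independence next to the observed counts
tabulate safety_h crime_q, chi2 expected

* Recompute the p-value from the reported statistic and its 4 degrees of freedom
display chi2tail(4, 81.2115)

* Repeat the test without the non-response code (-1), whose cells are nearly empty
tabulate safety_h crime_q if crime_q > 0, chi2
```

The last command is worth running because the chi-squared approximation is usually considered less reliable when expected counts are very small, as they are in the -1 column.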
An Extended Example: The New Government Variable
Now that we have learned to perform basic and more advanced cross-tabulations, we might find it beneficial to apply some of this new knowledge to one of the more interesting variables in the data set: the new government variable, new_govt. The 1993 survey asked individuals if they felt that the future government of South Africa would make their lives better, the same, or worse, with these responses being coded as 1, 2, and 3, respectively. Negative values represent non-responses. Before tabulating the new_govt variable with any of the other variables in our set, we might want to tabulate it on its own in order to get a better feel for it.
tab new_govt
8 :effect |
of new |
government | Freq. Percent Cum.
------------+-----------------------------------
-4 | 389 7.36 7.36
-3 | 12 0.23 7.59
-1 | 25 0.47 8.06
01-bette | 3005 56.87 64.93
02-same | 823 15.58 80.51
03-worse | 1030 19.49 100.00
------------+-----------------------------------
Total | 5284 100.00
We can immediately see that nearly 57% of all individuals surveyed felt that the future government would improve their lives. We might, however, want to eliminate the influence of non-respondents and find what percentage of the individuals who answered the question felt that a new government would benefit them.
tab new_govt if new_govt>0
8 :effect |
of new |
government | Freq. Percent Cum.
------------+-----------------------------------
01-bette | 3005 61.86 61.86
02-same | 823 16.94 78.80
03-worse | 1030 21.20 100.00
------------+-----------------------------------
Total | 4858 100.00
Now we can see that of those individuals who actually answered this question, over 60% of them felt that the new government would make their lives better. For an even quicker summary of new_govt, we might want to view the responses in a pie chart. As discussed earlier, we accomplish this by using the tab command with the gen option:
tab new_govt, gen(opinion)
Each of the possible responses for new_govt is subsequently transformed into a dummy variable (opinion1, opinion2, etc.) which can readily be graphed. (Note that the variable name "opinion" is arbitrary; any name that adheres to the guidelines for variable names in STATA can be chosen.)
graph pie opinion4 opinion5 opinion6
Notice that opinion1, opinion2, and opinion3 were not included in the pie chart because they represent non-responses.
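If your version of STATA supports the over() option of graph pie, a similar chart can probably be drawn without creating the dummy variables first. This is a sketch rather than a replacement for the method above; plabel() is included only to print the percentage on each slice.

```stata
* One slice per valid response; non-responses are excluded by the if condition
graph pie if new_govt > 0, over(new_govt) plabel(_all percent)
```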
Although this simple univariate tabulation is certainly educational, we can learn more by analyzing this variable against another, i.e., performing a bivariate analysis. Say, for instance, that we wanted to know how feeling about the future government varied with employment status. In order to find this out, we would cross-tabulate new_govt with unempl_q (coded as 1 for "currently employed" and 2 for "not currently employed"). Again we might want to eliminate non-responses to the new_govt question to obtain more readily interpretable results.
tab new_govt unempl_q if new_govt>0
8 :effect | 3 :currently employed
of new | ?
government | 1 2 | Total
-----------+----------------------+----------
01-bette | 543 1083 | 1626
02-same | 206 254 | 460
03-worse | 286 321 | 607
-----------+----------------------+----------
Total | 1035 1658 | 2693
We can see that within the "same" and "worse" categories for feelings about a new government, roughly similar numbers of respondents are working and not working. The most striking cells are for individuals feeling that a new government will make their lives better: twice as many unemployed as employed people feel that a new government would benefit them. Part of this gap simply reflects the larger number of non-employed respondents (1,658 versus 1,035), but the pattern holds in relative terms too: about 65% of those not employed (1083 of 1658) chose "better", against about 52% of the employed (543 of 1035). On a general level (not taking the unique qualities of South Africa into account), this intuitively makes sense. An individual who is not "profiting" under the current regime might optimistically guess that his lot will improve with the election of a new government.
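To compare shares rather than raw counts, the column option discussed earlier can be added to the same command. In this sketch, nofreq simply suppresses the raw frequencies so that only the percentages are shown.

```stata
* Percentage of each employment group giving each response
tab new_govt unempl_q if new_govt>0, col nofreq
```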
In the previous example, the raw numbers were few and more-or-less easy to interpret. When that is not the case, however, use of the row, column, and cell options discussed earlier can prove invaluable. For instance, a cross-tabulation of newprov on new_govt (eliminating non-responses) generates a table crammed with raw numbers. Using the row option with this same cross-tabulation yields a more decipherable result.
tab newprov new_govt if new_govt>0, row
new | 8 :effect of new government
province | 01-bette 02-same 03-worse | Total
-----------+---------------------------------+----------
W.CAPE | 95 59 101 | 255
| 37.25 23.14 39.61 | 100.00
-----------+---------------------------------+----------
N.CAPE | 12 0 14 | 26
| 46.15 0.00 53.85 | 100.00
-----------+---------------------------------+----------
E.CAPE | 622 69 143 | 834
| 74.58 8.27 17.15 | 100.00
-----------+---------------------------------+----------
NATAL | 598 272 307 | 1177
| 50.81 23.11 26.08 | 100.00
-----------+---------------------------------+----------
O.F.S. | 219 20 20 | 259
| 84.56 7.72 7.72 | 100.00
-----------+---------------------------------+----------
E.TVL | 285 97 69 | 451
| 63.19 21.51 15.30 | 100.00
-----------+---------------------------------+----------
N.TVL | 454 119 45 | 618
| 73.46 19.26 7.28 | 100.00
-----------+---------------------------------+----------
N.W. | 374 64 57 | 495
| 75.56 12.93 11.52 | 100.00
-----------+---------------------------------+----------
P.W.V. | 346 123 274 | 743
| 46.57 16.55 36.88 | 100.00
-----------+---------------------------------+----------
Total | 3005 823 1030 | 4858
| 61.86 16.94 21.20 | 100.00
Without the benefit of row percentages, we would see only that 598 people in Natal felt that the future government would better their lives, for example. The row percentage puts this number in the context of the total number of respondents in Natal, thus telling us that these 598 people comprise about 51% of the respondents residing in Natal. Alternatively, the column option puts these 598 people in the context of the total number of individuals who feel that the future government will better their lives.
tab newprov new_govt if new_govt>0, column
new | 8 :effect of new government
province | 01-bette 02-same 03-worse | Total
-----------+---------------------------------+----------
NATAL | 598 272 307 | 1177
| 19.90 33.05 29.81 | 24.23
-----------+---------------------------------+----------
Total | 3005 823 1030 | 4858
| 100.00 100.00 100.00 | 100.00
The same tabulation using column instead of row (partially shown) indicates that these 598 people comprise slightly less than 20% of all individuals who feel that the new government will benefit their lives. Meanwhile, about 24% of all respondents reside in Natal. The final option, cell, shows percentages for the intersection of the different values that the variables can assume.
tab newprov new_govt if new_govt>0, cell
new | 8 :effect of new government
province | 01-bette 02-same 03-worse | Total
-----------+---------------------------------+----------
NATAL | 598 272 307 | 1177
| 12.31 5.60 6.32 | 24.23
-----------+---------------------------------+----------
Total | 3005 823 1030 | 4858
| 61.86 16.94 21.20 | 100.00
Thus we see a table (partially shown) revealing that the 598 people living in Natal and believing that the new government will benefit them account for slightly over 12% of all individuals who answered the question.
Thus far, we have only been tabulating the new_govt variable with other categorical variables. What if we wanted to find out how people's opinions about the future government varied with a quantitative variable like spending on cars, for example? As was mentioned earlier, we would use the tab command with the sum option.
tab new_govt if new_govt>0, sum(mxcar)
8 :effect |
of new | Summary of vehicle exp.
government | Mean Std. Dev. Freq.
------------+------------------------------------
01-bette | 31.787022 114.78386 3005
02-same | 73.912515 173.88925 823
03-worse | 131.25631 326.34718 1030
------------+------------------------------------
Total | 60.013174 193.47283 4858
Eliminating non-responses, we thus find that, on average, individuals in this survey who believed that the future government would improve their lives spent about 100 rand less per unit of time on their vehicles than did respondents who felt that the new government would worsen their lives (31.79 versus 131.26). This probably gives some indication of the relative wealth of those giving different responses to the question concerning the advent of a new government. This analysis could also be done by province:
sort newprov
by newprov: tab new_govt if new_govt>0, sum(mxcar)
-> newprov= NATAL
8 :effect |
of new | Summary of vehicle exp.
government | Mean Std. Dev. Freq.
------------+------------------------------------
01-bette | 42.274247 161.93195 598
02-same | 65.441176 135.49345 272
03-worse | 88.990228 230.62849 307
------------+------------------------------------
Total | 59.813084 178.23673 1177
Although the above command produces a table like the preceding one for each province, only the table for Natal is reproduced here. We see that the difference in average spending on cars according to feelings about the future government is less pronounced in Natal (about 47 rand between "better" and "worse") than in the country as a whole.
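When only the means are of interest, a single two-way table may be easier to scan than nine separate ones. The sketch below assumes that your version of tab accepts the means option alongside sum(), which restricts each cell to the mean of mxcar.

```stata
* Mean vehicle expenditure for every province-by-opinion combination
tab newprov new_govt if new_govt>0, sum(mxcar) means
```

Each row of the resulting table is one province, so the Natal row should reproduce the three means shown above.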
Finally, although our cross-tabulations might signal that a viable relationship between two variables exists, the results from the survey might not be statistically significant (note that this section is optional for those unfamiliar with hypothesis testing). Say for instance that we run a bivariate analysis on two variables: new_govt and rural (coded as 0 for urban and 1 for rural).
tab new_govt rural, column
8 :effect |
of new | rural home indicator
government | 0 1 | Total
-----------+----------------------+----------
-4 | 270 119 | 389
| 12.54 3.80 | 7.36
-----------+----------------------+----------
-3 | 7 5 | 12
| 0.33 0.16 | 0.23
-----------+----------------------+----------
-1 | 14 11 | 25
| 0.65 0.35 | 0.47
-----------+----------------------+----------
01-bette | 908 2097 | 3005
| 42.17 66.98 | 56.87
-----------+----------------------+----------
02-same | 309 514 | 823
| 14.35 16.42 | 15.58
-----------+----------------------+----------
03-worse | 645 385 | 1030
| 29.96 12.30 | 19.49
-----------+----------------------+----------
Total | 2153 3131 | 5284
| 100.00 100.00 | 100.00
(Note that non-responses were not omitted here so as not to affect inference.) We can see in our table that about 67% of rural respondents felt that the future government would benefit them, while only about 12% felt that a new government would make them worse off. In urban areas the picture is more mixed: about 42% of respondents felt that a new government would benefit them, and fully 30% believed that a new regime would change their lives for the worse. (Note that if the percentage of individuals giving each response were the same in both rural and urban areas, we could make a strong case that no relationship exists between new_govt and rural, i.e. that they are independent.)
The question then arises: does the difference between rural and urban attitudes that appears in our sample reflect a real difference in the population, or could it have arisen by chance, in which case our results would not be statistically significant? We would set up the null (H0) and alternative (H1) hypotheses like this:
H0: No relationship exists between new_govt and rural in the population (they are independent).
H1: A relationship exists between new_govt and rural in the population.
STATA can test this for us while simultaneously performing our cross-tabulation requests. This test is known as the chi-squared test, and it determines whether or not a relationship exists between two categorical variables.
tab new_govt rural, column chi2
Pearson chi2(5) = 481.9535 Pr = 0.000
The chi2 option places the above line under the cross-tabulation results table. The most important element of the line is "Pr", the p-value. Pr = 0.000 tells us that, if the null hypothesis ("no relationship exists between new_govt and rural") were true, there would be less than a 0.05% chance of seeing a chi-squared statistic this large. We can therefore reject the null hypothesis and conclude that attitudes about a new government did in fact differ according to rural and urban location.
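The numbers in that line can also be retrieved after the command has run, which is handy when the p-value is displayed as 0.000. The following sketch relies on the stored results r(chi2), r(p), r(r), and r(c) that, to our knowledge, tab leaves behind after a two-way table; type return list to confirm what your version stores.

```stata
* Run the test without printing the table
quietly tab new_govt rural, chi2

* Test statistic, degrees of freedom (rows-1)*(columns-1), and p-value
display "chi2 = " r(chi2)
display "df   = " (r(r) - 1) * (r(c) - 1)
display "p    = " r(p)
```

With six response categories and two locations, the degrees of freedom should come out as 5, matching chi2(5) in the output above.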
EXERCISES
- See if there is an obvious trend in the average monthly income per worker by gender, race and attitude toward the new government. (Do not include invalid answers in the table)
- Exercise 1 Answer
- Which variable looks more closely related to total monthly income, graphically and statistically: metro or number of stillborns?
- Exercise 2 Answer
- Create a table and a pie graph showing occupation categories for the males in the survey.
- Exercise 3 Answer
- Compare the number of hours worked per day and racial status. First, generate a table without missing values, then generate one with missing values.
- Exercise 4 Answer
- What is the total number of African households in which the average number of births is greater than four?
- Exercise 5 Answer