TABLE OF CONTENTS
Introduction
Understanding Distributions of Continuous Variables
Means of Continuous Variables
Using the Summarize Command
Medians and Modes of Continuous Variables (Using Tabulate Command)
Example: Recoding the Education Variable
Measures of Dispersion - Variance and Standard Deviation
Handling Outliers
Understanding the Distributions of Categorical Variables
Example: Exploring the Distribution of Satisfaction Variable
Combining Tabulate and Summarize
Exercises
INTRODUCTION
In Module 3, we learned about the different variable types that exist in the SALDRU data set, the commands that enable us to see frequency distribution tables, and some basic graphing commands. Based on that, we are going to start learning how to do some basic statistical analysis, using measures of central tendency and variability.
There are many interesting questions we might want to investigate using the data on household monthly income (totminc). For example:
- What is the average monthly household income in South Africa?
- Is the distribution of income in South Africa somewhat equal or are incomes much higher for the rich than for the poor? How much higher?
- How does the average income in South Africa vary by racial group and how does income inequality vary within racial groups in South Africa?
- How does income in South Africa vary by province? Do we live in a province that has above or below average income levels?
We will start this module considering continuous variables. Then we will learn some new commands that make analyzing data easier. Lastly, we will go through some of the key methods that will enable us to analyze categorical variables effectively.
UNDERSTANDING DISTRIBUTIONS OF CONTINUOUS VARIABLES
For now, let's focus on monthly individual income (incmon). Let's start by seeing what the distribution of income looks like. In the graph below, we find that slightly over 25 percent of the observations are in the first bar. (With 50 bins spanning 0 to 16,400 Rand, each bar is 328 Rand wide, so the first bar covers incomes below roughly 330 Rand.) Also note that only a very small fraction of income earners earn monthly incomes above 10,000 Rand. It turns out that only one person reports an income of 16,400 Rand. The next highest monthly income is 15,000 Rand, and no one in SALDRU12 reports a monthly income between these two figures. Similarly, no one reports an income between 12,000 and 15,000 Rand, which is why the graph has no bars between 12,000 and 15,000. It would be tempting to simply ignore the few incomes of 15,000 Rand or more, but for now, we don't want to arbitrarily delete observations. Later we will experiment with different ways of handling "outliers", but for now, the data are what they are.
For the graph, type:
# delimit ;
histogram incmon, bin(50) percent
title("Monthly Gross Pay")
xtitle("Earnings in Rand")
note("Source: 1994 data from South Africa Labour Development Research Unit")
ylabel(0(5)30, angle(horizontal)) ytick(0(2.5)30)
xlabel(0(5000)15000) xtick(0(2500)17500);

MEANS OF CONTINUOUS VARIABLES
Let's consider the first question posed at the top of this module. What is the average monthly household income in South Africa? The average, or mean, income is defined as the sum of all incomes divided by the number of incomes. To compute this in STATA, type:
means incmon
Variable | Type Obs Mean [95% Conf. Interval]
---------+----------------------------------------------------------
incmon | Arithmetic 636 1585.805 1421.772 1749.838
| Geometric 605 874.7511 794.5193 963.0849
| Harmonic 605 389.0338 323.7276 487.3472
---------+----------------------------------------------------------
As we see, STATA reports three different types of means: the arithmetic, geometric, and harmonic mean. We will only be concerned with the arithmetic mean in our modules. In SALDRU12, the mean individual monthly income was 1585.805 Rand. If we listed other variables after incmon, STATA would report the means of these too. For example, we could have typed means incmon age and we would have also learned the average age of a respondent in SALDRU12.
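The definition of the arithmetic mean (the sum of all values divided by the number of values) can be checked by hand. The following is a minimal sketch on five made-up incomes, not SALDRU12 values; the variable toyinc and the scalar toy_mean are names invented for this illustration. The preserve and restore lines are there so that whatever data set is currently in memory is put back afterwards.

```stata
* Illustration only: five made-up incomes, not SALDRU12 data
preserve
clear
input toyinc
200
400
600
800
3000
end
* summarize stores the sum in r(sum) and the number of observations in r(N)
quietly summarize toyinc
scalar toy_mean = r(sum) / r(N)
display "sum / N          = " toy_mean
display "mean (summarize) = " r(mean)
restore
```

Both display lines should show the same number (1000 for these five values), since the arithmetic mean is exactly the sum divided by the count.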
The means command can also be used with the qualifiers introduced in Module 2. For example, to learn the average income of the respondents who were classified as Indian, we would type:
means incmon if race==3
Variable | Type Obs Mean [95% Conf. Interval]
---------+----------------------------------------------------------
incmon | Arithmetic 20 2531.6 1590.302 3472.898
| Geometric 20 1842.663 1226.459 2768.464
| Harmonic 20 1253.584 836.8192 2497.352
---------+----------------------------------------------------------
We see that the average monthly income of Indians is 2531.6 Rand. (If we don't know the codes for race, we could look them up in the survey by clicking on the "SALDRU Survey" button in the menu on our left.)
Try the following quick exercises:
- 1. What is the average monthly pay of respondents over the age of 40?
- 2. What is the average age of Black Africans in the sample?
- 3. Conditional on being older than 25, what is the average age of Whites in the sample?
USING THE SUMMARIZE COMMAND
There are other ways to compute the mean of a variable in STATA. One that is worth learning now is the command summarize. In STATA, type:
summarize incmon
Variable | Obs Mean Std. Dev. Min Max
---------+-----------------------------------------------------
incmon | 636 1585.805 2106.602 0 16400
Again we see that the average monthly income is 1585.805 Rand. There are two nice features of summarize as opposed to means. First, summarize tells us the range of the variable. For example, when we typed summarize incmon, we learned that incmon ranged between 0 and 16,400. This information was not given when we typed means. Second, summarize combines easily with the by prefix (by varname:), which repeats a command for each group. If we sort the observations by race, province, or any other distinguishing characteristic, we can compute the mean for each value of that characteristic. First, we have to sort the data by the variable we intend to use after by. To sort by race, type:
sort race
Next, to compute the mean of monthly income by race, type:
by race: summarize incmon
-> race = 01-afric
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incmon | 338 824.2604 921.371 0 5960
_______________________________________________________________________________
-> race = 02-colou
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incmon | 77 1139.091 995.0228 0 6500
_______________________________________________________________________________
-> race = 03-india
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incmon | 20 2531.6 2011.258 280 7800
_______________________________________________________________________________
-> race = 04-white
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incmon | 89 4384.764 3145.977 200 16400
_______________________________________________________________________________
-> race = .
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incmon | 112 1798.089 2280.498 0 11500
We see that the average monthly incomes of Africans, Coloureds, Indians, and Whites are about 824, 1139, 2532, and 4385 Rand, respectively. (Note that STATA also computed the average monthly income for those missing the race code, denoted ".")
Sometimes, it will be helpful to compute means (or other statistics) by groups that we construct. For example, suppose we wanted to use the summarize (or sum for short) command to compute means by four age groups: 20 and under, 21-40, 41-60, and over 60. We need to construct these age groups, sort by the constructed variable, and then compute the means. Let's do this one together:
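The separate sort and by steps above can also be combined. The sketch below uses made-up data (grp and toyinc are hypothetical names; with SALDRU12 loaded, race and incmon would take their place). It shows bysort, which sorts and then runs the command group by group, and tabstat, which collects the group statistics in one compact table.

```stata
* Illustration only: made-up incomes for two groups
preserve
clear
input grp toyinc
1 100
2 500
1 300
2 700
end
* bysort sorts by grp first, then runs summarize within each group
bysort grp: summarize toyinc
* tabstat reports the chosen statistics for every group in a single table
tabstat toyinc, by(grp) statistics(n mean sd min max)
* keep one group mean in a scalar in case it is needed later
quietly summarize toyinc if grp == 2
scalar mean_grp2 = r(mean)
restore
```

Whether the stacked output of by or the single table of tabstat is easier to read is a matter of taste; both report the same group means.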
generate agegroup = .
replace agegroup = 1 if age<= 20
replace agegroup = 2 if age > 20 & age <= 40
replace agegroup = 3 if age > 40 & age <= 60
replace agegroup = 4 if age > 60 & age~=.
label var agegroup "age group"
sort agegroup
by agegroup: summarize incmon
-> agegroup = 1
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incmon | 23 492.5217 497.8474 0 2184
_______________________________________________________________________________
-> agegroup = 2
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incmon | 332 1634.916 2101.675 0 15000
_______________________________________________________________________________
-> agegroup = 3
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incmon | 151 1439.887 1772.352 0 9000
_______________________________________________________________________________
-> agegroup = 4
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incmon | 12 1262.917 1844.193 0 6500
_______________________________________________________________________________
-> agegroup = .
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incmon | 118 1880.288 2613.092 0 16400
We should see that in SALDRU12, the average monthly income for people aged 20 and under, 21-40, 41-60, and over 60 is about 493, 1635, 1440, and 1263 Rand, respectively.
MEDIANS AND MODES OF CONTINUOUS VARIABLES
Up to now, the only measure of central tendency that we have examined is the mean. There are other measures, and two that we wish to examine now are the median and the mode of a distribution. The median of a distribution is the value for which half the observations are greater and half are less. If observations are symmetrically distributed, the median and the mean will be the same. If the distribution of a variable is quite skewed, however, the median and the mean can be quite different. Medians are often used instead of means when one wants a measure that is robust to large outliers. That is, we want a measure that is not very sensitive to extraordinary values in the distribution.
Let's consider the income variable again, incmon. To compare the mean and the median of monthly income, type:
summarize incmon, detail
monthly gross pay
-------------------------------------------------------------
Percentiles Smallest
1% 0 0
5% 8 0
10% 120 0 Obs 636
25% 320 0 Sum of Wgt. 636
50% 879.5 Mean 1585.805
Largest Std. Dev. 2106.602
75% 1950 12000
90% 3939 15000 Variance 4437771
95% 6000 15000 Skewness 2.976268
99% 10000 16400 Kurtosis 14.93181
The detail option tells STATA to give more information. Note that the output specifies the mean, about 1586 Rand, as before, but it now tells us more about different parts of the distribution. In particular, we can now see the income value for which half (50 percent) of the observations are higher and half are lower -- in other words the median. Note that the median income is 879.5 Rand.
The median income is a lot less than the mean income. What is going on here? It is informative to review the graph of individual income above. Remember the largest income of 16,400 Rand? That value, together with the other very high incomes in the long right tail, inflates the mean of the distribution, whereas the median treats each of them as just one more value above the "half way" mark. For many variables with skewed distributions, the median is a very useful statistic.
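How differently the mean and the median react to one extreme value can be seen on a tiny made-up example. This is only a sketch: toyinc and the four scalars are names invented here, and the numbers are not SALDRU12 values. The largest income is multiplied by ten, and the mean and median are compared before and after.

```stata
* Illustration only: five made-up incomes, not SALDRU12 data
preserve
clear
input toyinc
200
400
600
800
3000
end
quietly summarize toyinc, detail
scalar mean_before = r(mean)
scalar med_before  = r(p50)
* make the largest income ten times larger
replace toyinc = 30000 if toyinc == 3000
quietly summarize toyinc, detail
scalar mean_after = r(mean)
scalar med_after  = r(p50)
display "mean:   " mean_before " -> " mean_after
display "median: " med_before " -> " med_after
restore
```

The mean jumps from 1000 to 6400 while the median stays at 600, which is the sense in which the median is robust to outliers.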
The mode of a distribution is the value that appears most often in the distribution. The mode is a seldom used measure, but we should be aware of it. Let's consider the (strangely coded) education variable -- educ_c. The mode of this variable represents the schooling level that the most respondents claimed. There is no simple way to compute the mode in STATA. An awkward way to compute the mode of a distribution is to use the tabulate command. Type:
tabulate educ_c
6 |
:education |
code | Freq. Percent Cum.
------------+-----------------------------------
-4 | 26 0.51 0.51
-3 | 2 0.04 0.54
00-none | 1,300 25.27 25.82
01-sub a | 663 12.89 38.71
02-std 2 | 293 5.70 44.40
03-std 3 | 315 6.12 50.52
04-std 4 | 331 6.43 56.96
05-std 5 | 374 7.27 64.23
06-std 6 | 435 8.46 72.69
07-std 7 | 252 4.90 77.59
08-std 8 | 326 6.34 83.92
09-std 9 | 224 4.35 88.28
10-std 1 | 344 6.69 94.97
11-std 7 | 10 0.19 95.16
12-std 1 | 29 0.56 95.72
13-std 1 | 6 0.12 95.84
14-std 1 | 54 1.05 96.89
15-std 1 | 19 0.37 97.26
16-compl | 35 0.68 97.94
17-crech | 68 1.32 99.26
18-pre-p | 35 0.68 99.94
19-other | 3 0.06 100.00
------------+-----------------------------------
Total | 5,144 100.00
By examining the entire frequency distribution, we should note that the mode of educ_c is zero. A large number of respondents (1,300) reported zero for their level of education, far more than for any other value. That's not very surprising since infants and toddlers are included in the sample. What is more surprising is the modal level of education for adults 21 and older. Type:
tabulate educ_c if age >= 21 & age~=.
6 |
:education |
code | Freq. Percent Cum.
------------+-----------------------------------
-4 | 17 0.68 0.68
-3 | 2 0.08 0.76
00-none | 479 19.24 20.00
01-sub a | 143 5.74 25.74
02-std 2 | 125 5.02 30.76
03-std 3 | 151 6.06 36.83
04-std 4 | 170 6.83 43.65
05-std 5 | 218 8.76 52.41
06-std 6 | 269 10.80 63.21
07-std 7 | 133 5.34 68.55
08-std 8 | 233 9.36 77.91
09-std 9 | 128 5.14 83.05
10-std 1 | 277 11.12 94.18
11-std 7 | 10 0.40 94.58
12-std 1 | 28 1.12 95.70
13-std 1 | 6 0.24 95.94
14-std 1 | 50 2.01 97.95
15-std 1 | 13 0.52 98.47
16-compl | 35 1.41 99.88
19-other | 3 0.12 100.00
------------+-----------------------------------
Total | 2,490 100.00
As we see in the table above, the modal value is still zero.
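Two shortcuts may make the mode easier to spot than scanning a long table. The sketch below uses made-up education codes (toyeduc, toymode, and the_mode are hypothetical names) and assumes a Stata release in which tabulate accepts the sort option and egen offers the mode() function.

```stata
* Illustration only: made-up education codes
preserve
clear
input toyeduc
0
0
0
1
2
2
end
* sort lists the most frequent value first, so the mode is the top row
tabulate toyeduc, sort
* egen's mode() function stores the modal value in every observation
egen toymode = mode(toyeduc)
scalar the_mode = toymode[1]
display "mode = " the_mode
restore
```

With the SALDRU12 data loaded, the same idea would be tabulate educ_c if age >= 21 & age~=., sort. Note that mode() returns a missing value when two values tie for most frequent unless a tie-breaking option is given.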
EXAMPLE: RECODING EDUCATION VARIABLE
Suppose we want to know the average level of education for individuals aged 21 or over in South Africa. This entails computing the mean of an education variable. When we compute means for continuous variables, we need to be wary of how the variables are coded. In this extended example, we will consider the education variable, educ_c.
The simplest (but incorrect) way to compute the mean of educ_c is:
means educ_c
Variable | Type Obs Mean [95% Conf. Interval]
---------+----------------------------------------------------------
educ_c | Arithmetic 5144 4.217729 4.102152 4.333307
| Geometric 3816 4.278545 4.166788 4.393299
| Harmonic 3816 2.940384 2.854638 3.03144
---------+----------------------------------------------------------
That simplistic approach gives us 4.22 as the mean level of education, but what does that mean?
There are at least a couple of ways to see how this variable is coded. Probably the safest way is to look at the original survey. In this case, look in SALDRU Survey Part 2 by clicking on the menu on the left. Question 6 asks about educational attainment. The codes for question 6 are given on page 4 of the survey. There we will note that a code of 0 indicates zero years of schooling, a code of 3 indicates Standard 3, and so on. The code usually increases by one for each year of schooling up to a value of 10. Codes greater than 10 do not necessarily indicate more years of schooling. For example, a code of 12 is "standard 10 + teacher training" while a code of 13 is "standard 10 + nursing". To make matters even trickier, a code of 16 represents a completed university degree while codes of 17 and 18 represent daycare and preschool. If that's not bad enough, for educ_c, missing values are given negative numbers in many cases. Unfortunately, the survey codebook does not mention how missing values are coded for educ_c.
To examine all the possible values that educ_c (or any other variable for that matter) takes on, we use the tabulate command with the missing option, which also counts missing values:
tab educ_c, missing
6 |
:education |
code | Freq. Percent Cum.
------------+-----------------------------------
-4 | 26 0.49 0.49
-3 | 2 0.04 0.53
00-none | 1,300 24.60 25.13
01-sub a | 663 12.55 37.68
02-std 2 | 293 5.55 43.22
03-std 3 | 315 5.96 49.19
04-std 4 | 331 6.26 55.45
05-std 5 | 374 7.08 62.53
06-std 6 | 435 8.23 70.76
07-std 7 | 252 4.77 75.53
08-std 8 | 326 6.17 81.70
09-std 9 | 224 4.24 85.94
10-std 1 | 344 6.51 92.45
11-std 7 | 10 0.19 92.64
12-std 1 | 29 0.55 93.19
13-std 1 | 6 0.11 93.30
14-std 1 | 54 1.02 94.32
15-std 1 | 19 0.36 94.68
16-compl | 35 0.66 95.34
17-crech | 68 1.29 96.63
18-pre-p | 35 0.66 97.29
19-other | 3 0.06 97.35
. | 140 2.65 100.00
------------+-----------------------------------
Total | 5,284 100.00
To make things even more complicated, STATA only prints out the first 8 characters of the value label. By examining the survey itself, however, we find out that codes 10 and 12-15 actually refer to standard 10, not standard 1. How then, should we compute the average education level for households in the survey? There is no single answer. Below is one answer, worked out in STATA.
We begin by creating a new education variable that we call educ_new and we set it equal to the original education variable - educ_c - which means we will need to replace a few values!
generate educ_new = educ_c
label var educ_new "Recoded education"
We elect (rather arbitrarily) to combine adults who have no reported education with those whose last reported level was either daycare or preschool.
replace educ_new = 0 if educ_c == 17
replace educ_new = 0 if educ_c == 18
Next, we set the negative values to "missing".
replace educ_new = . if educ_c== -3
replace educ_new = . if educ_c== -4
We still are not done. We next combine code 11 "Standard 7,8,or 9 plus diploma" with those who have completed standard 9 and assign all of these individuals a code of "9".
replace educ_new = 9 if educ_c== 11
But what about all those adults who have completed standard 10 and also have training toward a teaching, nursing, or technical qualification, or have taken some university courses? We somewhat arbitrarily assign these people a code of 12 -- more training than just completing standard 10, but only two more years.
replace educ_new = 12 if educ_c == 12
replace educ_new = 12 if educ_c == 13
replace educ_new = 12 if educ_c == 14
replace educ_new = 12 if educ_c == 15
The last code that we need to deal with is "other", which is code 19. We simply set these observations to "missing" since we don't know how to interpret "other."
replace educ_new = . if educ_c == 19
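The sequence of replace statements above could also be written with Stata's recode command. The sketch below applies the same rules to a handful of made-up codes; toyeduc and toyeduc_new are hypothetical names, and with the SALDRU12 data loaded the same rules could be applied to educ_c instead.

```stata
* Illustration only: a few made-up education codes
preserve
clear
input toyeduc
-4
0
5
11
13
16
17
18
19
end
* same rules as the replace statements: 17 and 18 become 0;
* -4, -3, and 19 become missing; 11 becomes 9; 12 through 15 become 12;
* all other codes are left unchanged
recode toyeduc (17 18 = 0) (-4 -3 19 = .) (11 = 9) (12/15 = 12), generate(toyeduc_new)
list toyeduc toyeduc_new
quietly count if toyeduc_new == 0
scalar n_zero = r(N)
quietly count if missing(toyeduc_new)
scalar n_missing = r(N)
restore
```

Keeping all the rules on one line makes it easier to see the whole coding decision at a glance, and generate() leaves the original variable untouched.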
The last step is to review what codes are left and then compute the mean education level for individuals 21 and older. To review where we stand, type:
tabulate educ_new
Recoded |
education | Freq. Percent Cum.
------------+-----------------------------------
0 | 1,403 27.44 27.44
1 | 663 12.97 40.41
2 | 293 5.73 46.14
3 | 315 6.16 52.30
4 | 331 6.47 58.77
5 | 374 7.31 66.09
6 | 435 8.51 74.59
7 | 252 4.93 79.52
8 | 326 6.38 85.90
9 | 234 4.58 90.48
10 | 344 6.73 97.20
12 | 108 2.11 99.32
16 | 35 0.68 100.00
------------+-----------------------------------
Total | 5,113 100.00
We will see that we now have a variable whose values are much more meaningful than those of the original educ_c. The distribution of education levels is graphed below.
histogram educ_new, frac

In the above graph, called a histogram, we can see that about 27 percent of the respondents report having zero years of education. The mean education level is found by simply typing:
means educ_new
Variable | Type Obs Mean [95% Conf. Interval]
-------------+----------------------------------------------------------
educ_new | Arithmetic 5113 3.867006 3.765846 3.968166
| Geometric 3710 4.093948 3.989197 4.20145
| Harmonic 3710 2.869546 2.786527 2.957663
------------------------------------------------------------------------
We see that respondents have, on average, an education level of 3.87, somewhere close to having completed Standard 4. The correction for the strange coding of educ_c resulted in a lower mean level of education and this is what we probably expected since the original variable coded day care and preschool as more than a completed university degree.
An interesting question is the following:
What is the mean level of education in South Africa for individuals 21 years old and older?
means educ_new if age >= 21 & age~=.
Variable | Type Obs Mean [95% Conf. Interval]
-------------+----------------------------------------------------------
educ_new | Arithmetic 2468 5.170583 5.017494 5.323673
| Geometric 1989 5.370819 5.215013 5.53128
| Harmonic 1989 4.074185 3.909677 4.253145
------------------------------------------------------------------------
We should have found that the average education level of adults was 5.17. What do we think this statistic will be 20 years from now when those who are now infants reach the age of 21?
As an exercise, try the following question:
- 4. Compute the mean education level of adults (21 and older) by gender and of non-adults (under 21) by gender. How about mean education level for children between the ages of 6 and 10 by gender? What do these results say about the future?
MEASURES OF DISPERSION -- VARIANCE AND STANDARD DEVIATION
So far, we have learned something about the average level of monthly income among individuals in South Africa. Higher incomes are generally a good thing, but from a policy perspective, we will often care about how widely dispersed incomes are. As an extreme, it could be the case that everyone in the country has exactly the same level of income (and that level of income would coincidentally be the "mean"). That situation could be considered a perfectly equal distribution of income. Of course, in the data, individuals and households have different levels of income.
In an effort to clarify this notion of variance consider two theoretical distributions with the same mean, as shown below. Both of these distributions have a mean income of about 4500 Rand. (This is about the same as the mean income of white households in SALDRU12). Examine these two distributions.


Even though both distributions have the same average level of income, we can see that income inequality is greater in the top graph. If we only computed the mean of each of these (made-up) income variables, we might conclude that these distributions were essentially the same. The top distribution, however, is more dispersed. The variance is a useful measure of the dispersion of a variable. The square root of the variance is termed the standard deviation, so a bigger variance always means a larger standard deviation. In the above two distributions, the standard deviation of Income Distribution #1 is 1000 while the standard deviation of Income Distribution #2 is 500. Since the standard deviation of the first distribution is twice as large as that of the second distribution, its variance is four times as large. One useful way to think about the standard deviation of a distribution is the following. Suppose that the distribution of a variable is bell-shaped (more formally termed a normal distribution). If we picked randomly from the normal distribution, about two thirds of the time (68 percent) we would pick a value that was within one standard deviation of the mean of the distribution, and about 95 percent of the time a value within two standard deviations. For the case of Income Distribution #1 above, this means that if we randomly picked an income from this distribution, about two out of three times we would pick an income within 1000 Rand of the mean income of 4500 Rand.
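The shares of a normal distribution that lie within a given number of standard deviations of the mean can be checked with Stata's normal() function, which returns the cumulative standard normal probability. The lines below are a small sketch that needs no data set; the values 1000 and 500 are the standard deviations of the two made-up distributions.

```stata
* share of a normal distribution within 1 and within 2 standard deviations
display "within 1 sd: " normal(1) - normal(-1)
display "within 2 sd: " normal(2) - normal(-2)
* the variance is the square of the standard deviation
display "variance ratio: " 1000^2 / 500^2
```

The first two lines show roughly 0.68 and 0.95, and the last shows that doubling the standard deviation multiplies the variance by 4.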
In STATA, there are a number of ways to determine the standard deviation of a variable, but the simplest is probably the summarize command. What is the standard deviation of individual monthly income by race in South Africa? To find out, type:
sort race
by race: summarize incmon
-> race = 01-afric
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incmon | 338 824.2604 921.371 0 5960
_______________________________________________________________________________
-> race = 02-colou
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incmon | 77 1139.091 995.0228 0 6500
_______________________________________________________________________________
-> race = 03-india
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incmon | 20 2531.6 2011.258 280 7800
_______________________________________________________________________________
-> race = 04-white
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incmon | 89 4384.764 3145.977 200 16400
_______________________________________________________________________________
-> race = .
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incmon | 112 1798.089 2280.498 0 11500
We discover that the standard deviation differs considerably across these groups. Put another way, income inequality appears to vary by group.
As another exercise, let's investigate the average household size as well as the standard deviation of household size depending on whether a household lives in a rural, urban, or metropolitan area. Because household size is a household-level variable that is repeated for every member of the household, larger households would be counted more often than smaller ones. To eliminate this bias, we create a new variable that keeps household size for only one observation per household.
sort hhid
gen hhsizem2=hhsizem if hhid[_n]~=hhid[_n-1]
sort metro
by metro: sum hhsizem2
-> metro = Rural
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
hhsizem2 | 551 5.059891 3.127887 1 16
_______________________________________________________________________________
-> metro = Urban
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
hhsizem2 | 207 4.193237 2.623892 1 16
_______________________________________________________________________________
-> metro = Metro
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
hhsizem2 | 304 3.894737 2.705959 1 17
We note that there is a pretty big difference in average household size depending on whether a household lives in a metropolitan/urban area or a rural area. The average household size is about 4 persons for the former and about 5 for the latter. Examining the median household size instead of the mean gives about the same picture. (Try it, just add the "detail" option to the above "summarize" command.)
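Why one observation per household is kept can be seen on a small made-up example. The sketch below uses hypothetical names (toyhhid, toyhhsize, onehh) and egen's tag() function, which marks one observation in each group, as an alternative to comparing hhid[_n] with hhid[_n-1].

```stata
* Illustration only: three made-up households with 3, 2, and 1 members
preserve
clear
input toyhhid toyhhsize
1 3
1 3
1 3
2 2
2 2
3 1
end
* averaging over persons counts each household once per member
quietly summarize toyhhsize
scalar person_mean = r(mean)
* tag() marks exactly one observation per household with a 1
egen byte onehh = tag(toyhhid)
quietly summarize toyhhsize if onehh
scalar hh_mean = r(mean)
display "mean over persons:    " person_mean
display "mean over households: " hh_mean
restore
```

The person-level mean (about 2.33) is larger than the household-level mean (2) because the three-person household is counted three times.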
HANDLING OUTLIERS
The mean can be very sensitive to data points that are very, very different from all the other observations in the distribution. We refer to these very different data points as "outliers." There are several large literatures in statistics that deal with outliers. These include statistical measures that are not especially sensitive to outliers (robust statistics), as well as literatures on how to identify outliers and what to do once they are found. Reviewing this literature is beyond the scope of this module, but we want to be alert to the presence of outliers and aware of how they can impact some of our results. There are, in general, two ways we will deal with outliers. One is to use measures that are not sensitive to them, such as the median instead of the mean, and the other is to delete them from the data set (usually by setting them equal to a missing value).
Sometimes it will be obvious when an outlier is simply miscoded data and hence should be set to missing. If, for example, age was reported as 230, we would know that was a miscode. Most of the SALDRU data set has been "cleaned" and we are not aware of many miscodes. (If you think you find some, email us.) This doesn't mean that outliers won't matter. Consider our initial example of income, incmon. Examine what happens to mean income if we drop just the top 6 of the 636 observations. Since we will want to compare the original income measure with our truncated measure, we want to create a new income measure without writing over the old one:
generate incmon2 = incmon
tabulate incmon
replace incmon2 = . if incmon > 10000
means incmon incmon2
Variable | Type Obs Mean [95% Conf. Interval]
-------------+----------------------------------------------------------
incmon | Arithmetic 636 1585.805 1421.772 1749.838
| Geometric 605 874.7511 794.5193 963.0849
| Harmonic 605 389.0338 323.7276 487.3472
-------------+----------------------------------------------------------
incmon2 | Arithmetic 630 1473.13 1335.636 1610.625
| Geometric 599 851.2621 774.3874 935.7682
| Harmonic 599 385.2893 320.6342 482.6056
------------------------------------------------------------------------
And to see if the outliers change our view of how average income varies by race:
sort race
by race: summarize incmon incmon2
-> race = 01-afric
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incmon | 338 824.2604 921.371 0 5960
incmon2 | 338 824.2604 921.371 0 5960
_______________________________________________________________________________
-> race = 02-colou
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incmon | 77 1139.091 995.0228 0 6500
incmon2 | 77 1139.091 995.0228 0 6500
_______________________________________________________________________________
-> race = 03-india
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incmon | 20 2531.6 2011.258 280 7800
incmon2 | 20 2531.6 2011.258 280 7800
_______________________________________________________________________________
-> race = 04-white
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incmon | 89 4384.764 3145.977 200 16400
incmon2 | 84 3824.333 2134.668 200 10000
_______________________________________________________________________________
-> race = .
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incmon | 112 1798.089 2280.498 0 11500
incmon2 | 111 1710.685 2093.929 0 9600
The results of this exercise suggest the following. Not counting the top six incomes leads to about a 7 percent decline in average income while it results in about a 25 percent decline in the standard deviation of income. Of the six incomes we deleted, five belonged to White respondents (the sixth belonged to a respondent with a missing race code). Deleting just these five observations results in about a 13 percent decline in the average income of Whites and about a 32 percent decline in the standard deviation of their income. These are big differences. Median income (using the "detail" option with summarize above) barely differs at all between incmon and incmon2.
There are other variables whose means are sensitive to a few very high values. For example, dropping just the top 7 of over 5200 observations on expenditures on transportation decreases the mean of these expenditures by almost 10 percent. While tabulate is one command to find outliers, summarize is also useful. Comparing the mean to the median is a helpful way to gauge the presence of outliers. There are a number of more sophisticated tools in STATA for examining outliers. These include graphical methods and the command "lv". Graphical methods are introduced in Module 5, while the "lv" command requires a better understanding of statistics than we assume in this web site.
If we are comfortable with the notion of logarithms, we can also look at the log of incmon. Here is a graph of monthly income with the x-axis calibrated in logs.

An advantage of calibrating the x-axis in logs is that high income outliers don't have as large an impact on the graph's orientation. When we work on regression analysis later on, this will become a very useful method.
UNDERSTANDING THE DISTRIBUTIONS OF CATEGORICAL VARIABLES
Many variables in SALDRU are categorical variables. Examples that we will work with in this subsection include occupation and race. In Module 3 we already saw how to examine the frequency distribution of categorical variables. In this section, we ask whether there are statistics analogous to the mean and standard deviation (which we use to describe distributions of continuous variables) that we can use to describe distributions of categorical variables.
We begin with a cautionary note. Put simply, taking the mean of a categorical variable yields nonsense. Consider the race variable - STATA will let us compute the mean of it. Try the following:
means race
Variable | Type Obs Mean [95% Conf. Interval]
-------------+----------------------------------------------------------
race | Arithmetic 5144 1.379666 1.355363 1.403968
| Geometric 5144 1.218668 1.204244 1.233265
| Harmonic 5144 1.134497 1.125687 1.143446
------------------------------------------------------------------------
We find the mean of the race variable is 1.38. What does this mean? As near as we can tell, it means nothing. It certainly does not mean that the average respondent falls somewhere between an African and a Coloured respondent. The coding of the race variable was arbitrary. The SALDRU dataset could just as easily have coded Whites as a 1, Indians as a 2, Africans as a 3, and Coloureds as a 4. If the mean of a categorical variable is nonsensical, are there other measures that do convey information? There are two - the mode and the range of the distribution.
The mode of the distribution is the value that appears most often in the sample. There is no STATA command that gives the mode without also giving us a bunch of other information. Probably the best way to compute the mode of a distribution is to use the tabulate command. The mode of the race variable tells us which racial category has the most respondents, while the mode of the occupation variable tells us which occupation was claimed by the most respondents. To find the mode of the race variable (race) and occupation variable (k_occ_c), type:
tab race
19 |
:population |
group | Freq. Percent Cum.
------------+-----------------------------------
01-afric | 4,182 81.30 81.30
02-colou | 407 7.91 89.21
03-india | 119 2.31 91.52
04-white | 436 8.48 100.00
------------+-----------------------------------
Total | 5,144 100.00
tab k_occ_c
3a:code:occ |
upation | Freq. Percent Cum.
------------+-----------------------------------
-4 | 1 0.13 0.13
01-profe | 100 13.19 13.32
02-manag | 42 5.54 18.87
03-cleri | 79 10.42 29.29
04-trans | 50 6.60 35.88
05-servi | 122 16.09 51.98
06-farmi | 34 4.49 56.46
07-artis | 48 6.33 62.80
08-produ | 22 2.90 65.70
09-opera | 74 9.76 75.46
10-labou | 185 24.41 99.87
11-other | 1 0.13 100.00
------------+-----------------------------------
Total | 758 100.00
We find that the mode of race is "African" while the mode of the occupation variable is "labourer."
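When a variable has many categories, scanning the frequency column for the largest value gets tedious. As a small convenience, and only as a sketch, tabulate accepts a sort option that lists the categories from most to least frequent, so the mode appears in the first row:

```stata
* The sort option orders the rows by descending frequency,
* so the modal category is the first row of each table
tabulate race, sort
tabulate k_occ_c, sort
```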
Just as the mean of the distribution of a categorical variable does not make any sense, neither does the standard deviation. Still, it may be useful to know the range of the values of a categorical variable. That is, what values are spanned by the variable codes? To compute the range of a variable, use the summarize command:
summarize k_occ_c
Variable | Obs Mean Std. Dev. Min Max
---------+-----------------------------------------------------
k_occ_c | 758 5.897098 3.267868 -4 11
The above result tells us that the occupation variable varies from -4 to 11. We could also learn this, and more, using the tabulate command.
EXAMPLE: EXPLORING THE DISTRIBUTION OF THE SATISFACTION VARIABLE
There are many variables that we will want to explore in the SALDRU data set. One that we will work with often in our examples and exercises is the variable satisfie. This variable measures a household's satisfaction with its perceived quality of life. To read about this variable, go to page 51 in the household part of the SALDRU survey. We will use this variable to "test drive" some of the new STATA commands we have learned.
Unlike variables such as age, income, or number of children, the units of measurement for a variable measuring satisfaction are not obvious. In this sense, satisfie is not a typical continuous variable. In the language of statistics, it is referred to as an ordinal variable (as opposed to nominal, which has no inherent order). The first issue we need to explore is how this variable is coded. The clearest way is to check the survey. Using STATA, we can use the tabulate command to find out how it is coded. Remember, however, that this is a household-level variable. Thus, we need to handle it appropriately by typing:
sort hhid
tabulate satisfie if hhid~=hhid[_n-1]
1 :level of |
satisfactio |
n | Freq. Percent Cum.
------------+-----------------------------------
-1 | 9 0.85 0.85
01-v sat | 85 8.00 8.85
02-satis | 280 26.37 35.22
03-neith | 96 9.04 44.26
04-dissa | 359 33.80 78.06
05-v dis | 233 21.94 100.00
------------+-----------------------------------
Total | 1,062 100.00
From this, we see that a code of 1 indicates the highest level of satisfaction and a code of 5 indicates the lowest level. Values of "-1" are not listed in the survey, but appear in the data for 9 households. Before analyzing the data, we recode these to "missing" by typing:
replace satisfie = . if satisfie == -1
Now we are ready to investigate how households perceive their level of satisfaction. We can compute the overall mean level of satisfaction and the mean by metro by typing:
sort hhid
means satisfie if hhid~=hhid[_n-1]
Variable | Type Obs Mean [95% Conf. Interval]
-------------+----------------------------------------------------------
satisfie | Arithmetic 1053 3.356125 3.277525 3.434726
| Geometric 1053 3.044162 2.957692 3.133161
| Harmonic 1053 2.677005 2.585324 2.775428
------------------------------------------------------------------------
sum satisfie if hhid~=hhid[_n-1] & metro==1
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
satisfie | 546 3.472527 1.21918 1 5
sum satisfie if hhid~=hhid[_n-1] & metro==2
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
satisfie | 206 3.325243 1.360073 1 5
sum satisfie if hhid~=hhid[_n-1] & metro==3
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
satisfie | 301 3.166113 1.378035 1 5
We see that the average level of satisfaction is 3.36, which lies between "neither satisfied nor dissatisfied" (3) and "dissatisfied" (4). The overall median, which must be an integer, equals 4. It is unclear what to make of these results. Some social scientists do not believe it is appropriate to take the mean of an ordinal variable, while others do see value in such an effort. Those who consider the mean of an ordinal variable useless argue that means for these types of variables are based on values that have no real meaning. In particular, the claim is that the intervals between categories are not necessarily equal to one another. Those who find value in the mean of an ordinal variable respond that, in general, a mean of 3.36 indicates that households on the whole lean towards the "dissatisfied" portion of the distribution.
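The household-level summaries above can also be written more compactly. The following is a minimal sketch rather than the approach used in this module: it assumes the -1 codes of satisfie have already been recoded to missing, and hhtag is a hypothetical variable name.

```stata
* hhtag is a hypothetical name; tag() marks exactly one observation per household
egen hhtag = tag(hhid)

* Summarize satisfaction separately for each value of metro, one row per household
bysort metro: summarize satisfie if hhtag == 1

* The detail option adds percentiles; r(p50) holds the median
summarize satisfie if hhtag == 1, detail
display "median satisfaction = " r(p50)
```

Note that bysort leaves the data sorted by metro, so we would need to type sort hhid again before reusing the hhid~=hhid[_n-1] qualifier.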
COMBINING TABULATE AND SUMMARIZE
Earlier, in Module 2, we introduced two important options related to the tabulate command: missing and nolabel. Now that we have learned how to work with measures of central tendency in STATA, we can introduce a third tabulate option. This third option is both the most complicated and the most useful. Using the summarize option with tabulate allows us to examine how the average of one variable differs by the categories of a second variable. Working through an example is the best way to understand the usefulness of this new option.
For example, say we were interested in determining the mean monthly pay for all of the racial groups in the SALDRU data set. Up to this point we would have been forced to use qualifiers to restrict the group that STATA works from. Thus to answer our question, we would have entered the following:
summarize incmon if race == 1
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
incmon | 338 824.2604 921.371 0 5960
The above syntax provides us with the average monthly pay of Africans. To obtain the rest of the averages, we would need to repeat the above syntax for each of the other racial groups. Instead of having to take these steps, however, STATA provides us with a quicker way to obtain exactly the same information.
Let's start by just typing:
tab race
This command produces a table we have seen before, the frequency distribution of the variable race.
19 |
:population |
group | Freq. Percent Cum.
------------+-----------------------------------
01-afric | 4,182 81.30 81.30
02-colou | 407 7.91 89.21
03-india | 119 2.31 91.52
04-white | 436 8.48 100.00
------------+-----------------------------------
Total | 5,144 100.00
While the above table provides us with useful information, to answer our question we do not need the actual number of Africans, Coloureds, Indians, and Whites in the data set. Instead, we would like the average monthly pay for each of these groups. This is where the summarize option (which can be abbreviated as sum) comes in. Using the sum option, we are able to obtain the mean values of one variable by the categories of another. So, working from our example, to obtain the average monthly pay for each racial group we would enter the following in STATA:
tab race, sum(incmon)
19 |
:population | Summary of monthly gross pay
group | Mean Std. Dev. Freq.
------------+------------------------------------
01-afric | 824.26036 921.37097 338
02-colou | 1139.0909 995.02279 77
03-india | 2531.6 2011.2583 20
04-white | 4384.764 3145.9774 89
------------+------------------------------------
Total | 1540.4313 2067.0334 524
Unlike the previous table, which gave us the frequency distribution of the variable race, the table above includes the information normally produced by the summarize command, except that we have this information for each of the racial groups. As we can see, the combination of the tabulate command and the sum option is a very powerful tool when attempting to obtain summary information.
OK, now that we have been through one example, let's see if we understand how and when to use the sum option. Give the following question a try:
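The sum option reports only the mean, standard deviation, and frequency. If we also want the median for each group, which is useful given how sensitive mean income is to outliers, the tabstat command is one alternative. This is offered only as a sketch; tabstat is not otherwise covered in this module.

```stata
* Mean, standard deviation, median, and number of observations of incmon by race
tabstat incmon, by(race) statistics(mean sd median n)
```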
- 5. How would we compute the average age by province?
- Question 5 Answer
EXERCISES
Now it is our turn to explore the measures of central tendency and variability using the commands from Module 4. Using STATA and the SALDRU12 data set, answer the following questions.
- What is the average age of heads of households in South Africa?
- Exercise 1 Answer
- Do people with income levels at or above the median level report a higher level of satisfaction than those who report an income below the median level, and if so, by how much?
- Exercise 2 Answer
- Do households that have to fetch and carry their own water have, on average, more children than households that do not have to fetch and carry their water?
- Exercise 3 Answer
- Is income more widely distributed in rural areas or in urban areas?
- Exercise 4 Answer
- What language is spoken by more respondents than any other? That is, what is the modal language?
- Exercise 5 Answer
- Is the percentage of respondents who self-identify as being a "professional" as their occupation higher for people who have recently moved or for people who have not moved recently?
- Exercise 6 Answer