TABLE OF CONTENTS
Introduction
Variable Types
Count Command
Frequency Distribution Tables
Using [_n]
Creating and Using STATA DO Files
Frequency Distribution Graphs
Exercises
INTRODUCTION
Now that we have acquired some basic STATA skills, we are ready to begin analyzing the data. The immediate problem is -- Where do we begin? As we have seen, there are an enormous number of variables and observations at our disposal although none of this information in its raw form is especially useful. How can we summarize this massive amount of information simply and quickly to make it more accessible?
From Module 2, we now know how to open a data file and examine the file's contents in STATA. It turns out that we have more information than we can usefully process. To see this, load the data set (SALDRU12) and simply type:
list incmon
A large number of households scroll by. After seeing these numbers fly by, do you now know more about the distribution of income in South Africa? Probably not. We need a handy way to summarize lots of quantitative raw data, and STATA is very helpful here. For example, while we could, in principle, simply count the thousands of observations as they scroll by, STATA commands essentially do this in a much more efficient and sophisticated fashion. We will start by learning how to answer the following questions:
- 1. How many Africans are in the data?
- 2. What percentage of the sample is made up of Coloured respondents?
- 3. How many White men in the sample reside in rural areas?
- 4. Are there more men or women over the age of 60?
- 5. What percentage of the SALDRU12 sample is made up of sons and daughters?
VARIABLE TYPES
In SALDRU, there are several types of variables. If we have loaded SALDRU12.dta into STATA, we will see the list of variables in the "Variables" window. Economists, sociologists, and psychologists use different language to describe the multiple types of variables. Here, we will divide variables into two types: continuous variables and categorical variables. Each of these types can describe either an individual or an entire household, so we must also understand the difference between an individual-level variable and a household-level variable. The statistical and graphical tools used to understand the distributions of the various types of variables are quite different, so it is important to understand the differences between them.
Continuous Variables: Continuous variables have an infinite number of possible values that fall between any two observed values. For example, consider age. In our data, age is recorded in years. But it could have been recorded in months, days, minutes, or even seconds. A continuous variable is ordinal in the sense that its values have an inherent order. In the age example, an age of 16 years is one year older than the age of 15 years, thus the unit of measurement in between these two values is itself meaningful. (This may seem like common sense, but when we consider categorical variables, this will no longer be true.) Examples of continuous variables in the SALDRU data set include age (age), income (incmon), and the expenditure measures (totmexp and mxtrent).
We are actually not being terribly careful with our definitions. Consider, for example, a variable that counts members of the household (hhsizem). Household size might be 4 or 5, but it will never be 4.34. Nonetheless, if we were told that the average number of members of a household was 4.34, this would be comprehensible. We would know that, on average, there are more than 4 members and fewer than 5. We are going to treat variables like household size and number of births as continuous variables. (Some disciplines refer to these as discrete variables.) Taking the average gives an answer that is readily interpreted. Taking the average of a categorical variable, on the other hand, yields nonsense.
Categorical Variables: Categorical variables, also known as nominal variables, are made up of separate and distinct categories which do not have an inherent order. To code these variables, each category is typically assigned a value, but this assignment is arbitrary. Take, for example, the race variable, race. Each racial group is assigned an arbitrary value. In the data set, if a person is African, the race variable for that person is set to 1. If the person is Coloured, the value is set to 2, Indian is 3, and White is 4. For the gender variable gender_n, males are coded as 3 and females as 2. Other examples of what we will consider categorical variables include the household id number (hhid), the relationship to the head of household (rel_head), and province (province).
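To see why the distinction matters, here is a minimal sketch using a few made-up observations (the values are hypothetical and are not taken from SALDRU12). The summarize command, which reports a mean among other statistics, is not introduced in this module and is used here only for illustration.

```stata
* Hypothetical toy data: household size and race code for four people
clear
input hhsizem race
4 1
5 1
4 4
6 2
end

* The mean of household size (4.75) is easy to interpret
summarize hhsizem

* The mean of the race codes (2) is just an artifact of the arbitrary coding
summarize race
```

The second mean happens to equal the code for Coloured, even though only one of the four made-up respondents is coded that way, which is exactly why averages of categorical codes should not be interpreted.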
A special type of categorical variable is a dummy variable. A dummy variable is a variable that typically takes on a value of one if the observation meets specified criteria and a value of zero otherwise. There are not many dummy variables in SALDRU, except for a few like RURAL. We will often want to create dummy variables ourselves. For example, if we wanted to create a dummy variable for whether a respondent was White, we could use the following STATA commands.
gen white = 0
replace white = 1 if race == 4
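One caveat: the two commands above assign 0 to any respondent whose race is missing. If we would rather leave those cases missing, one possible variation is sketched below. The toy data are made up, and the one-line form is our own suggestion rather than the approach used elsewhere in this module.

```stata
* Hypothetical toy data; race uses the codes described above, "." is missing
clear
input race
1
2
4
.
4
end

* (race == 4) evaluates to 1 or 0; the if qualifier leaves white
* missing wherever race itself is missing
generate white = (race == 4) if !missing(race)

* Show the result, including the missing category
tabulate white, missing
```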
In general, it is important to know the types of variables you are using because some of the tools used to analyze variables differ depending on whether the variable is continuous or categorical. Another point to keep in mind is whether the variable you are using is an individual-level or household-level variable.
Individual-level Variables: Individual-level variables are made up of values that are unique to each person in the household. An example is the variable for age (age). To see an example of an individual-level variable, type the following:
sort hhid
list hhid age
As we can see, each person in the same household has an age value that is unique to them. Other examples in the SALDRU data set would include the following variables: incmon (monthly income), educ_c (educational attainment level), and gender_n.
Household-level Variables: Household-level variables have the same value for every person in the household. An example would be the variable for total monthly income (totminc). To see an example of a household-level variable, type the following commands:
sort hhid
list hhid totminc
As we can see, each person in the same household has the same value of total monthly income. Other examples in the SALDRU data set include the following variables: metro and mxtfood (total monthly food expenditure).
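Rather than checking by eye, we could ask STATA to verify that a variable never changes within a household. The sketch below uses made-up data and the by prefix with assert, neither of which is introduced in this module, so treat it as an optional illustration.

```stata
* Hypothetical toy data: three households, five people
clear
input hhid age totminc
1006 34 1036
2001 41 475
2001 12 475
2008 29 150
2008 3 150
end

sort hhid

* Passes silently when totminc is identical for everyone in a household
by hhid: assert totminc == totminc[1]

* age differs within households, so the same check fails;
* capture stores the error code in _rc instead of stopping
capture by hhid: assert age == age[1]
display _rc
```

A nonzero value of _rc tells us that age is an individual-level variable in this toy data set.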
COUNT COMMAND
COUNT counts the number of observations that satisfy specified conditions. If no conditions are specified, count displays the number of observations in the data set. For example, to count the number of observations in the SALDRU data set, we would type:
count
The results should show that there are 5284 observations in this data set. Now let's try an example using a qualifier. For instance, suppose we want to count the number of females in the data set.
count if gender_n==2
The results should show that there are 2655 females in the data set.
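Qualifiers can also be combined. The following sketch, run on a handful of made-up observations, shows the & (and) and | (or) operators. The variable names follow SALDRU12, but the values are hypothetical.

```stata
* Hypothetical toy data using the race and gender codes described earlier
clear
input race gender_n age
1 2 65
1 3 30
4 3 72
2 2 61
4 2 45
end

* Females (gender_n == 2) who are older than 60: reports 2
count if gender_n == 2 & age > 60

* Respondents coded African (1) or Coloured (2): reports 3
count if race == 1 | race == 2
```

One thing to keep in mind with real data: STATA treats a missing numeric value as larger than any number, so a condition like age > 60 also picks up observations with missing age. Adding & !missing(age) to the qualifier guards against this.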
FREQUENCY DISTRIBUTION TABLES
A frequency distribution table is simply a listing of all observed values for a given variable and the number of observations that fall under each of these values. To create a frequency distribution table in STATA, we use the command tabulate.
For example, to create a frequency distribution table for the categorical variable race, you type:
tab race
The above command produces the following distribution table in the STATA Results window:
         19 |
:population |
      group |      Freq.     Percent        Cum.
------------+-----------------------------------
   01-afric |      4,182       81.30       81.30
   02-colou |        407        7.91       89.21
   03-india |        119        2.31       91.52
   04-white |        436        8.48      100.00
------------+-----------------------------------
      Total |      5,144      100.00
As we can see from the table, there are four distinct categories or values found within the race variable: 01-afric, 02-colou, 03-india, and 04-white. STATA gives us three specific numbers related to each observed value. The first column, with the header "Freq.", is the number of observations that fall within each category. Thus, we can now answer the question, "How many Africans are in the data?" There are 4,182 Africans in the SALDRU12 data.
The second column with the header "Percent" represents the percentage of the sample that falls within each observed category. Thus, we can now answer the question "What percentage of the sample is made up of Coloured respondents?" Coloured respondents make up 7.91% of the SALDRU12 sample.
The third column with the header "Cum." represents the cumulative percentage of the corresponding observed values. For example, 91.52 percent of the SALDRU12 sample is made up of observations with values 01-afric, 02-colou, and 03-india. In other words, approximately 92% of the sample is made up of non-whites.
Ok, now it is your turn to answer a few questions:
- 1. What percentage of the sample is 70 years old and younger?
- 2. How many resident heads are there in the SALDRU12 sample?
There are three options that are used with the command tabulate that are worth noting. The first is the nolabel option. When we use the nolabel option with the tabulate command, the value labels that sometimes appear in place of the actual recorded numeric value will not be displayed. Instead, the original numeric value will be displayed in the table. This option can best be understood using an example. Let's use the gender_n variable for our example. Start by displaying a frequency distribution table for gender_n without any options:
tab gender_n
Without the nolabel option the tabulate command produces the following table when used with the variable gender_n:
  4 :gender |
       code |      Freq.     Percent        Cum.
------------+-----------------------------------
          F |      2,655       51.61       51.61
          M |      2,489       48.39      100.00
------------+-----------------------------------
      Total |      5,144      100.00
From the table above, we see that there are two observed gender_n values displayed in the table, "F" (indicating female) and "M" (indicating male). These value labels are used to help the user identify what each numeric value represents. Thus, instead of displaying an arbitrary number, text has been substituted in the numeric value's place. To display the actual numeric values, use the nolabel option with the tabulate command:
tab gender_n, nolabel
With the nolabel option, the tabulate command produces the following table when used with the variable gender_n:
  4 :gender |
       code |      Freq.     Percent        Cum.
------------+-----------------------------------
          2 |      2,655       51.61       51.61
          3 |      2,489       48.39      100.00
------------+-----------------------------------
      Total |      5,144      100.00
As we can see, the two previous tables are identical except that the value labels "M" and "F" have been replaced by the actual numeric values. At times it is useful to see the actual numeric values of a variable instead of these value labels. For example, if we were referencing the values for gender_n, we would need to use the actual numeric values - not the value labels. To emphasize this point, let's try and generate a new variable using the variable gender_n to recode.
Go ahead and give this exercise a try:
- 3. How would you create a new variable that is equal to 1 if the respondent is a woman and equal to 0 if the respondent is a man (use the gender_n variable to identify the gender of the respondents)?
Did you have trouble with question 3? If you did, most likely you were using the wrong values. While at times the value label is what is displayed in the frequency distribution table, it is not the actual value that is stored in the data set. You must reference the original numeric value for the replace command to work correctly. This is why the nolabel option is so helpful: when it is used with the tabulate command, the original numeric value is displayed.
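The difference between a stored value and its label can be seen in a small self-contained sketch. The data are made up, and the label name gendlbl is hypothetical; we do not know what the value label is called in SALDRU12. The label list command, which prints the mapping from values to labels, is an addition not covered in this module.

```stata
* Hypothetical toy data with the gender codes used in this module
clear
input gender_n
2
3
2
end

* Attach value labels (gendlbl is a made-up label name)
label define gendlbl 2 "F" 3 "M"
label values gender_n gendlbl

* The first table shows F and M; the second shows 2 and 3
tabulate gender_n
tabulate gender_n, nolabel

* Print the value-to-label mapping directly
label list gendlbl

* Qualifiers must use the stored numeric value, not the label
count if gender_n == 2
```

Typing count if gender_n == "F" instead would stop with a type mismatch error, because "F" is only a label and gender_n is stored as a number.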
A second option used with the tabulate command is missing. The missing option displays all system missing values for the specified variable. Thus, to display all system missing values for the variable race we would type:
tab race, missing
The above syntax displays the following table in the STATA Results Window:
         19 |
:population |
      group |      Freq.     Percent        Cum.
------------+-----------------------------------
   01-afric |      4,182       79.14       79.14
   02-colou |        407        7.70       86.85
   03-india |        119        2.25       89.10
   04-white |        436        8.25       97.35
          . |        140        2.65      100.00
------------+-----------------------------------
      Total |      5,284      100.00
From the table above, we see that a new category has been added to the race variable distribution. This new category, represented by a dot, reflects the number of system missing observations. It is important to note that the raw percentages and cumulative percentages are different from those presented in the table created without the missing option. This is due to the increase in the number of total observations recognized within the distribution.
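The change in percentages is simple arithmetic: the frequency for each category stays the same, but the denominator grows from 5,144 to 5,284 once the 140 missing cases are included. As a quick check, we can have STATA redo the calculation for the African category with the display command (used here as a calculator; it is not otherwise covered in this module).

```stata
* Share of Africans among respondents with a recorded race: 81.30
display %5.2f 100 * 4182 / 5144

* Share of Africans among all observations, missing included: 79.14
display %5.2f 100 * 4182 / 5284

* The two totals differ by exactly the number of missing cases: 140
display 5284 - 5144
```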
Now that we have introduced you to some of the basics, let's see how well you can use these new commands.
- 4. How many Coloured respondents are there in the data?
- 5. Are there more men or women over the age of 60?
- 6. How many White men in the sample reside in rural areas?
- 7. How many respondents in the sample have a missing value for the variable identifying the relationship to the head of household?
USING [_n]
The count example above counted all of the women in the data set. However, suppose we just want to count the number of households in the data set. To accomplish this task, the [_n] option is very helpful. In STATA, _n is the number of the current observation, so hhid[_n-1] refers to the value of hhid in the observation directly above the current one. Using the [_n] option allows us to treat an individual-level variable as a household-level variable. For this example, we would want to type the following:
sort hhid
count if hhid~=hhid[_n-1]
1062
The 1062 that STATA gives us is the number of households in the data set. For this qualifier to work, we must first sort the data. The hhid~=hhid[_n-1] qualifier is telling STATA to go to every hhid and only count it if that hhid is not equal to the one before it. Here is a visual display of what STATA is doing:
hhid Count
1006 1
1008 1
1012 1
2001 1
2001 .
2001 .
2001 .
2001 .
2001 .
2001 .
2001 .
2008 1
2008 .
2008 .
2008 .
2012 1
2012 .
2014 1
2014 .
2014 .
2014 .
2014 .
Total: 22 observations 7 households
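The same logic can be tried on a tiny made-up data set. The sketch below assumes nothing beyond the hhid variable. It writes != instead of ~= (STATA accepts both), and the by-group version at the end is an alternative that this module does not use.

```stata
* Hypothetical toy data: four households, seven people
clear
input hhid
1006
1008
2001
2001
2001
2008
2008
end

sort hhid

* first is 1 when hhid differs from the observation above, 0 otherwise.
* For the very first observation, hhid[_n-1] is missing, so it also gets 1.
generate first = hhid != hhid[_n-1]
list hhid first

* Counting the flagged rows gives the number of households: 4
count if first

* Alternative: within each hhid group, _n restarts at 1
by hhid: generate first2 = (_n == 1)
count if first2
```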
Although this is only the tip of the iceberg of what the [_n] option can do, and we will be using it more in future modules, let's try one more example. Suppose we wanted to create a new type of household variable that will only be applied to the first person in the household. Let's look at the household variable for total monthly income (totminc). If we were to sort by hhid and then list hhid and totminc, we would see a piece of the following:
+-------------------+
| hhid totminc |
|-------------------|
1. | 1006 1036 |
2. | 1008 1070 |
3. | 1012 1527.333 |
4. | 2001 475 |
5. | 2001 475 |
|-------------------|
6. | 2001 475 |
7. | 2001 475 |
8. | 2001 475 |
9. | 2001 475 |
10. | 2001 475 |
|-------------------|
11. | 2001 475 |
12. | 2008 150 |
13. | 2008 150 |
14. | 2008 150 |
15. | 2008 150 |
|-------------------|
16. | 2012 825.6 |
17. | 2012 825.6 |
18. | 2014 1150 |
19. | 2014 1150 |
20. | 2014 1150 |
|-------------------|
As we can see, STATA has produced a list that shows everyone with the same household identification number (hhid) as having the same total monthly income. However, we want to produce a list in which STATA only shows total monthly income for the first person in the household. To do this, we must create a new variable that is basically equal to the value of the total monthly income.
sort hhid
gen totminc2=totminc if hhid~=hhid[_n-1]
We have just created the new variable totminc2. The command above generates the new variable totminc2 and assigns its values based on the original variable (totminc), but it only records a value if the hhid of the observation above is different, that is, for the first person listed in each household. Everyone else receives a missing value. To better understand this concept, we should do the following:
sort hhid
list hhid totminc totminc2
+------------------------------+
| hhid totminc totminc2 |
|------------------------------|
1. | 1006 1036 1036 |
2. | 1008 1070 1070 |
3. | 1012 1527.333 1527.333 |
4. | 2001 475 475 |
5. | 2001 475 . |
|------------------------------|
6. | 2001 475 . |
7. | 2001 475 . |
8. | 2001 475 . |
9. | 2001 475 . |
10. | 2001 475 . |
|------------------------------|
11. | 2001 475 . |
12. | 2008 150 150 |
13. | 2008 150 . |
14. | 2008 150 . |
15. | 2008 150 . |
|------------------------------|
16. | 2012 825.6 825.6 |
17. | 2012 825.6 . |
18. | 2014 1150 1150 |
19. | 2014 1150 . |
20. | 2014 1150 . |
|------------------------------|
With the new variable totminc2, only the first person in the household receives the value of the household-level variable totminc. Why do this? This will become very helpful when we look at the means of household-level variables. Because totminc is repeated for every member, a large household would otherwise be counted many more times than a small one. Using totminc2, we can avoid this problem of number bias.
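A small numerical sketch makes the number bias concrete. The two households below are made up: one person living alone with a household income of 1000, and a household of four with an income of 400. The summarize command is used only to report the means.

```stata
* Hypothetical toy data: household 1 has one member, household 2 has four
clear
input hhid totminc
1 1000
2 400
2 400
2 400
2 400
end

sort hhid
generate totminc2 = totminc if hhid != hhid[_n-1]

* Mean over persons: (1000 + 4*400)/5 = 520, pulled toward the big household
summarize totminc

* Mean over households: (1000 + 400)/2 = 700
summarize totminc2
```

The second figure is the average household income; the first weights each household by its size.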
CREATING AND USING STATA "DO" FILES
Up to this point, we have been entering all STATA commands using the command window. In the process of recoding the more complex variables or in the process of creating more sophisticated graphs, you will find it cumbersome to enter long lines of syntax (commands) line by line. We will now learn how to be much more efficient by using STATA .do files. As the name implies, ".do" files are files that help you "do" commands with STATA.
Normally, do files are created using a simple text editor, like Notepad, or any other word processing program; however, STATA itself has a Do-File Editor. In general, you can use any text editor as long as you save the document as plain text with a .do extension. After correctly typing your commands into a do file, you save it and then run it by telling STATA where it is. STATA will then execute the commands contained within the do file.
Like many other things in STATA, there are several ways to find and run your do files. One way is to use FILE on the toolbar, click on DO..., and then find the do file in the directory where it was saved. A second method is to use an explorer window to find the do file and then double-click it to have STATA execute it. A third, and the trickiest, is to locate it from the command window. This method is a bit harder because it requires basic knowledge of DOS/UNIX commands to navigate the various directories and subdirectories.
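For the command-window route, the basic steps are to check the current directory with pwd, change it with cd if needed, and then run the file with do. The sketch below is self-contained: it first writes a one-line do file into the current directory so that there is something to run. The file name hello.do is made up, and the file commands are used here only to create the example file.

```stata
* Show the directory STATA is currently working in
pwd

* Create a tiny do file in that directory (hello.do is a hypothetical name)
file open fh using "hello.do", write replace
file write fh "display 2 + 2" _n
file close fh

* Run it; STATA echoes the command and displays 4
do hello.do
```

If the do file lives elsewhere, we would first type cd followed by the directory in quotes, and then the do command.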
Let's find out what a do file looks like. Due to the immediate availability of the STATA Do-File editor, we will use it to type in and save our STATA syntax (commands). We can open the Do-File editor either by clicking on the icon (it looks like a right hand holding a pen over a white pad of paper) or by using the keyboard shortcut: press the ctrl + 9 keys together. Once it is open, you can either type in the syntax below or highlight it, copy it, and paste it into the Do-File editor. Either way, after entering the commands you will want to save the file and note the directory in which you saved it. (Note: if you type it manually, you do not need to include the comments; if you copy and paste it, the comments will not interfere with the commands.) After entering the syntax and saving the file, find the do file as outlined above and watch STATA do its magic. Let's try it: either type, or copy and paste, the syntax below into the Do-File editor:
***********************************************************
***********************************************************
set mem 5M /*Sets the memory to 5M*/
set mat 800 /*Sets the number of variables allowed in any given model estimation*/
set more 1 /*Allows the output scroll by without requiring user assistance*/
#delimit ; /*Tells STATA that every command line below ends with an ";"*/
log using module3.log, replace; /*Tells STATA to log output and to replace the old one*/
use saldru12.dta; /*Tells STATA to load and use the named data file*/
gen white = 0;
replace white = 1 if race == 4;
count; /*Counts how many cases exist in the current data file*/
count if gender_n==2;
tab race;
tab gender_n;
tab race, missing;
sort hhid;
count if hhid~=hhid[_n-1];
gen totminc2=totminc if hhid~=hhid[_n-1]; /*Generates new variable called totminc2*/
label var totminc2 "Household Level Total Monthly Income"; /*Labels new var totminc2*/
log close; /*Closes any opened log file*/
***********************************************************
***********************************************************
Now, let's review what the syntax above is telling STATA.
The first three lines set the environment for STATA to work in. In particular, the set more 1 command is important to include. It is similar to the set more off command, however when set more off is typed in it is permanent for that session of STATA until you type in set more on. If instead you use set more 1, STATA will temporarily set more off for that do file and then reinstates it after it is done.
The fourth line is also very important: it allows us to break up long command lines into multiple lines. Unless otherwise told, STATA will automatically assume that a command line ends with a carriage return (i.e., the enter key). The command #delimit ; tells STATA to execute whatever is before the next semicolon as one command. This option can be reset with #delimit cr, which then tells STATA to execute everything before the next carriage return as one command (the default setting). The #delimit ; command can be quite useful, especially when we start writing longer lines of code.
You should be well acquainted with the remaining commands. It is important to realize, though, that virtually every command you enter in the command window can also be entered in a do file.
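As a minimal sketch of switching the delimiter on and off, the do file below uses the auto data set that ships with STATA (loaded with sysuse) rather than SALDRU12, and the summarize command, which is used only as an example of a command worth splitting. Because #delimit only has an effect inside do files, these lines should be saved and run as a do file rather than typed into the command window.

```stata
* Load STATA's built-in example data so the sketch is self-contained
sysuse auto, clear

* From here on, a command ends at the semicolon, not at the line break
#delimit ;
summarize price mpg
    if foreign == 0;

* Switch back to the default: a command ends at the carriage return
#delimit cr
summarize price mpg if foreign == 1
```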
Including Comments with your Syntax:
Note the use of comments. Every good programmer will include more than enough comments to make their syntax completely understandable to anyone else interested in the coding, recoding, or creating of new variables. In general, we recommend and embrace an active and prolific use of comments. In the example above, our comments are meant to document the purpose of specific command lines; in real-life do files, however, it is likely that we would only extensively comment the newly created variables and the rationale behind them. As you begin to develop your own do files and new variables, we encourage you to comment your new creations.
Using LOG Files to Create DO Files:
While it can be a bit cumbersome to edit a log file, it is definitely a viable alternative to creating a do file from scratch. As discussed earlier, every command entered in the command window will be noted in your log file. In the log, however, STATA prefaces each command line with a "." (dot). Thus, after entering STATA commands interactively and saving your log file, you can edit your log file by removing the unnecessary dots and any STATA output, and then save the edited log file as a new do file with a .do extension. After that, you are ready to rerun all your saved commands.
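A related shortcut, which is our suggestion and not something this module relies on, is the cmdlog command. It records only the commands that are typed, without the leading dots or any output, so the resulting file needs much less editing. The sketch below uses STATA's built-in auto data set and a made-up file name.

```stata
* Start recording typed commands (module3_commands.txt is a hypothetical name)
cmdlog using module3_commands.txt, replace

* Any commands entered from here on are written to the file as typed
sysuse auto, clear
count
tabulate foreign

* Stop recording; the text file can now be saved with a .do extension
cmdlog close
```

Note that cmdlog is meant to capture commands typed interactively, so this sketch is best tried line by line in the command window.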
FREQUENCY DISTRIBUTION GRAPHS
Now that we know how to use do files, we can begin to draw good-looking graphs.
While the tabulate command gives us one way to understand the frequency distribution of a variable, graphing is another way. In principle, each can convey the same information. Often, though, graphs are more readily interpreted. If we are trying to convey information to a colleague with limited (or no) quantitative training, a graph sometimes will be more effective than a table. Also, graphs are, even for those with lots of statistical sophistication, a wonderful way to get a quick feel for the information in the data set.
STATA is a very powerful graphing tool and in this module, we introduce the basics. We will learn how to create and interpret three kinds of graphs -- histograms, bar graphs, and pie graphs. Let's go!
First things first, in general we can think of the graph command as having the following form:
[graph] [graph type] [plot type] [if exp] [in range] [, graph type_options],
where graph type can be twoway, matrix, bar, dot, box, pie, or other;
plot types is mainly for twoway graph types and can be scatter, line, bar, dot, among others.
Type: help graph_twoway to see all the possibilities. For now, let's learn about the basic graphs - histograms, bar charts, and pie charts.
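To make the general form more concrete, here is a brief sketch of three graph commands that follow it. It uses STATA's built-in auto data set (loaded with sysuse) so that it runs without SALDRU12; the variables price, mpg, and foreign belong to that data set, not to ours.

```stata
* Load STATA's built-in example data
sysuse auto, clear

* Graph type twoway, plot type scatter, with an if qualifier
graph twoway scatter price mpg if foreign == 0

* Graph type bar: mean price for each category of foreign
graph bar (mean) price, over(foreign)

* Graph type pie: one slice per category of foreign
graph pie, over(foreign)
```

Each command replaces the graph drawn by the previous one, so it is easiest to run them one at a time.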
HISTOGRAMS
Histograms are a graphical tool that tells us the fraction of observations, for any given variable, that falls within different ranges. Histograms are used for continuous variables. As a running example, we will consider the variable mxtfood, total monthly food expenditure for the household. To draw a histogram, we can leave almost everything up to STATA. First, however, since this is a household-level variable, we need to recode it into a new variable, hmxtfood, that keeps only one value per household and so controls for household size (i.e., number bias). We can do this using the [_n] technique from above by simply typing:
sort hhid
gen hmxtfood=mxtfood if hhid~=hhid[_n-1]
label var hmxtfood "Total Monthly Household Food Expenditures"
Then we tell STATA to draw a histogram using our new hmxtfood variable. We do this by typing:
graph twoway histogram hmxtfood, fraction
We will see STATA draw the following graph.

NOTE: In the past (prior to STATA 8), this initial graphing command would require additional specifications to be useful; the newer version of STATA makes graphing a bit easier and more robust. As mentioned above, this module will keep instruction fairly simple by focusing on the basic graphing commands. For a more thorough and more complex graphing tutorial, visit our Graphing Module.
This first graph is a good start, but this command allows for many more specifications. Instead of using "fraction," we could have specified density, frequency, or percent. Each would produce a similar looking histogram, but each would have a different y-axis. For now, let's continue with fraction; later, when we show you how to combine graphs, we'll show you what those other options produce.
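As a rough preview of how the y-axis options differ, the sketch below draws the same histogram three ways and places the results side by side. It uses STATA's built-in auto data set and its price variable rather than hmxtfood, and the graph names (h_frac, h_pct, h_freq) are made up for this illustration.

```stata
* Load STATA's built-in example data
sysuse auto, clear

* Same bars, different y-axis: share of observations (0 to 1)
histogram price, fraction name(h_frac, replace)

* Share of observations expressed in percent (0 to 100)
histogram price, percent name(h_pct, replace)

* Raw number of observations in each bin
histogram price, frequency name(h_freq, replace)

* Put the three stored graphs into one figure for comparison
graph combine h_frac h_pct h_freq
```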
To improve this first histogram, let's type:
#delimit ;
histogram hmxtfood, frac
title("Total Monthly Household Food Expenditures")
xtitle("Expenditure in Rand")
note("Source: 1994 data from South Africa Labour Development Research Unit")
ylabel(0(.05).30, angle(horizontal))
ytick(0(.025).25)
xlabel(0(1000)5000)
xtick(0(250)5000);
Now we should see the following histogram:

This is much better!
Let's go over the syntax that created this good-looking histogram. First, given the length of the STATA command, we needed to use a delimiter that tells STATA that a carriage return (STATA's default) does not end the command, but a semicolon does. This allows us to enter commands on multiple lines, as we did above. Next, note that we did not need to include graph twoway to tell STATA what we wanted; in this case it is only necessary to type histogram. Similarly, we did not need to completely spell out fraction; frac works just as well. Next, we told STATA to title the graph "Total Monthly Household Food Expenditures", and then we changed the default x-axis title, which is based on the variable label, to "Expenditure in Rand". We also included a note at the bottom of the graph that informs the reader where the data came from; in our case we are using 1994 SALDRU data. The remaining lines of syntax tell STATA how to reformat the x- and y-axes. We told it to relabel the y-axis from 0 to .30 in .05 increments. Then we asked for tick-marks to be placed in between the labeled units. The angle(horizontal) option told STATA to change the default y-axis labels to be read horizontally. Similarly, we asked STATA to relabel the x-axis in increments of 1000 from 0 to 5000 and to include tick-marks in increments of 250.
We can see that we have come a long way. This graph tells us, among other things, that total monthly food expenditure is concentrated below 1000 Rand and that values around and below 500 Rand per month are very frequent. We could be more specific if we counted the number of bins (bars) and estimated the fraction of the observations in each bin. For example, it appears that about 45 percent of the observations are below 500 Rand per month.
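Instead of estimating such a fraction from the bars, we could compute it exactly with count. The sketch below uses a few made-up expenditure values, so the 60 percent it reports says nothing about SALDRU12. The r(N) result and the local macro are standard STATA features that have not been introduced in this module.

```stata
* Hypothetical toy data: six households, one with missing expenditure
clear
input hmxtfood
120
300
450
800
2500
.
end

* Number of households spending less than 500 Rand (missing is not below 500)
quietly count if hmxtfood < 500
local below = r(N)

* Number of households with a recorded value
quietly count if !missing(hmxtfood)

* Percentage below 500 Rand: 3 out of 5, so 60.0
display %5.1f 100 * `below' / r(N)
```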
While these first two graphs are very informative, we could for example look at these graphs by racial categories. We can accomplish this with the "by" option. To graph total monthly food expenditure by race, we type:
sort race;
histogram hmxtfood, frac by(race);

We now see how the distribution of food expenditure varies by racial group. STATA placed each of the racial categories in one graph to make comparison easy. Surely, however, we can make this graph look better - to do so, we can type:
#delimit ;
histogram hmxtfood, frac by(race)
title("Total Monthly Household Food Expenditures")
xtitle("Expenditure in Rand")
note("Source: 1994 data from South Africa Labour Development Research Unit")
ylabel(0(.05).30, angle(horizontal))
ytick(0(.025).25)
xlabel(0(1000)5000)
xtick(0(250)5000);
We should get the following graph:

The graph above leaves MUCH to be desired. There is definitely some room for improvement. First thing to realize is that the by(group) option is treated as a "repeating" option. Meaning that STATA plotted each of the four racial categories individually and then merged them into a single graph, which is what we see. As a result, each of the racial categories has its own title, x-axis, and note attached to it, but that is not what we want. Instead we should type the following set of commands:
#delimit ;
label define race 1 "Black African" 2 "Coloured" 3 "Indian" 4 "White African", modify;
label val race race;
histogram hmxtfood, frac
by(race, title("Total Monthly Household Food Expenditures")
note("Source: 1994 data from South Africa Labour Development Research Unit"))
xtitle("Expenditure in Rand")
ylabel(0(.05).35, angle(horizontal))
ytick(0(.025).35)
xlabel(0(1000)5000)
xtick(0(250)5000);

Now that's much better.
Notice how we had to include both our title, "Total Monthly Household Food Expenditures", and our note, "Source: 1994 data from South Africa Labour Development Research Unit", inside the by(group) option. The remaining instructions are similar to the ones for the second graph above. We did, however, need to redefine the value labels for the race variable. This was necessary to make each of the subtitles more descriptive. We will discuss labels in more detail in the coming modules.
For now, study the following graph. Type:
#delimit ;
histogram hmxtfood, frac
by(race, title("Total Monthly Household Food Expenditures")
subtitle("(in Rand)")
note("Source: 1994 data from South Africa Labour Development Research Unit")
caption("South Africa Distance Learning Project") row(1))
xtitle("")
ylabel(0(.05).35, angle(horizontal)) ytick(0(.025).35)
xlabel(0(1000)5000, angle(vertical)) xtick(0(250)5000);

The final graph in this food expenditure example demonstrates how STATA is a truly versatile graphing tool. Remember, this is merely the tip of the graphing iceberg. For more details, see our Graphing Module and/or consult the STATA GRAPHING manual and the online graphing help (type help graph).
BAR & PIE GRAPHS
Like histograms, bar and pie graphs are graphical tools (visual aids, if you will) that inform the reader of the distribution of a particular variable. Unlike histograms, however, bar and pie graphs are used for categorical variables only. As an example, we will consider the variable metro, which identifies the location of a given household. As in the previous example, this is a household-level variable, so again we must first create a new variable that controls for the number bias; we will name it hhmetro. We will also go ahead and label the variable and its values. We do this by typing:
sort hhid
gen hhmetro=metro if hhid~=hhid[_n-1]
label var hhmetro "Metropolitan Status of Household"
label def hhmetro 1 "Rural" 2 "Urban" 3 "Metro"
label val hhmetro hhmetro
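As a cross-check on the [_n] approach, the sketch below flags one observation per household with the tag() function of egen and tabulates the original metro variable for just those observations. The variable name hhtag is our own illustrative choice and is not part of SALDRU12. Because metro is a household-level variable, the two tabulations should report the same count for each category. Type:

```stata
* hhtag is a hypothetical helper: 1 for one observation per household, 0 otherwise
egen hhtag = tag(hhid)

* distribution of metro across households, using the tagged observations only
tab metro if hhtag == 1

* distribution of the variable created with hhid[_n-1]; the counts should agree
tab hhmetro
```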
As with any other analysis, we must know what the variable we're working with "looks" like. In our case, we can do this by simply tabulating hhmetro. Type:
tab hhmetro, missing
      Metro |
  Status of |
  Household |      Freq.     Percent        Cum.
------------+-----------------------------------
      Rural |        551       51.88       51.88
      Urban |        207       19.49       71.37
      Metro |        304       28.63      100.00
------------+-----------------------------------
      Total |      1,062      100.00
This table gives us a sense of what to expect and a frame of reference against which to compare our graphs. It also tells us that there are no missing cases in our variable. For our purposes, however, we also need to know the actual numeric value of each category, so we instruct STATA to tabulate the same variable without showing the value labels. Type:
tab hhmetro, missing nolabel
A frequency distribution table is displayed. This time, instead of the value labels "Rural", "Urban", and "Metro", the numeric values are shown. From the table we see that there are three numeric values in the distribution of the hhmetro variable: 1, 2, and 3. In the approach used here, constructing a bar or pie graph for a categorical variable requires a dummy variable for each category of that variable, so now that we have identified these values we must create a dummy variable for each. We do this using the generate and replace commands. Type:
gen metro1 = 0 if hhmetro~=.
replace metro1 = 1 if hhmetro == 1
gen metro2 = 0 if hhmetro~=.
replace metro2 = 1 if hhmetro == 2
gen metro3 = 0 if hhmetro~=.
replace metro3 = 1 if hhmetro == 3
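One way to double-check these variables is to rebuild them from a logical expression, which STATA evaluates to 1 when true and 0 when false. The sketch below does this under the hypothetical names chk1, chk2, and chk3, which are chosen only for this check. The assert command stops with an error if any observation disagrees and prints nothing if they all agree. Type:

```stata
* rebuild each dummy in one line; the if qualifier leaves it missing where hhmetro is missing
gen chk1 = (hhmetro == 1) if hhmetro ~= .
gen chk2 = (hhmetro == 2) if hhmetro ~= .
gen chk3 = (hhmetro == 3) if hhmetro ~= .

* compare with the dummies made by generate and replace
assert chk1 == metro1
assert chk2 == metro2
assert chk3 == metro3

* remove the temporary check variables
drop chk1 chk2 chk3
```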
We created three new dummy variables (metro1, metro2, and metro3), one for each value of hhmetro. From here we are ready to graph the distribution. To create a bar graph for the categorical variable hhmetro type:
#delimit ;
graph bar metro1 metro2 metro3;
The command above creates the following bar graph:

The graph contains three distinct columns, one for each of the hhmetro values. The height of each column is the mean of the corresponding dummy variable, that is, the proportion of households that fall into that category. From the graph we see that the number of rural households is much larger than the number of urban or metro households. This initial bar graph is informative, of course, but if we left it as is we would be wasting STATA's great graphing capabilities. Let's try to make it better. Type:
graph bar metro1 metro2 metro3,
title("LOCATION OF SAMPLED HOUSEHOLDS")
note("Source: 1994 data from South Africa Labour Development Research Unit")
caption("South Africa Distance Learning Project")
ylabel(0(.10).50, angle(horizontal)) ytick(0(.05).50) ymtick(0(.025).50)
blabel(bar, position(inside) format(%9.2f) color(white))
legend(label(1 "Rural") label(2 "Urban") label(3 "Metro"))
bargap(25);
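As a quick numeric check on the bar heights, the sketch below prints the mean of each dummy variable, which is the statistic graph bar plots by default. If the dummy variables were built correctly, the three means should equal the Percent column of the hhmetro tabulation divided by 100. It is written for the default line delimiter, so no semicolon is needed. Type:

```stata
* mean of a 0/1 variable = share of households in that category
tabstat metro1 metro2 metro3, statistics(mean) format(%9.4f)
```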

To create a pie graph for the categorical variable hhmetro type:
graph pie, over(hhmetro)
title("LOCATION OF SAMPLED HOUSEHOLDS")
note("Source: 1994 data from South Africa Labour Development Research Unit")
caption("South Africa Distance Learning Project")
pie(1, color(green))
pie(2, color(orange))
pie(3, explode color(red))
plabel(_all percent, color(white))
legend(label(1 "Rural") label(2 "Urban") label(3 "Metro") row(1) order(3 2 1));
The command above creates the following pie graph:

The command that created this pie graph should be fairly intuitive, but let's take a closer look to make sure. First off, the actual pie command can take three different forms:
- it can take the form used above, where we told STATA to graph pie using the variable hhmetro
type: graph pie, over(hhmetro);
- we could have used the same variables we used for the bar graph (metro1, metro2, metro3)
type: graph pie metro1 metro2 metro3; or
- we can tell STATA to create a pie graph of hhmetro by another variable, say *****
type: graph pie hhmetro, over(*****).
The title, note, and caption specifications should be clear - just remember to enclose your text in quotes for each of these. The pie specification allows you to format each of the pie slices if you like. In the example above, our variable hhmetro has three categories, thus we have three pie slices to specify if we choose to. In our example, we told STATA to color slice number 1 green and to color the second slice orange. For the third slice, we told STATA to not only color it red, but also to "explode" it from the other 2 slices, which emphasizes the Metro slice.
We also specified the plabel (pie label) option, which is analogous to the blabel (bar label) option we used in the bar graph above. With it we told STATA to label _all of the pie slices with their corresponding percent and to present that number in white. Finally, using the legend option, we modified the default label of each category in our variable. We also instructed STATA to present the labels in a single row and in the order we specified. The default order is based on the values of the categories. In our case, category 1 is Rural, the second is Urban, and the last one is Metro. The default ordering is (1 2 3), which would be fine, but in this example we decided to switch the order of the legend labels to present Metro, then Urban, and finally Rural. NOTE: the suboptions for the pie slices, pie labels, and the legend must be specified within parentheses (also known as round brackets), as shown in the examples above.
Overall, however, we see that a majority of the SALDRU12 households are located in rural settings. The distribution displayed in the pie graph is identical to the distribution displayed in the bar graph, although the presentation is a bit different. In this case, the pie graph appears to be better at showing and calling attention to a particular category - metro in this example.
Now that we have learned some of the basics of graphing categorical variables, let's turn to some of the more helpful advanced commands. At times, it can be quite tedious to create dummy variables when the given categorical variable has a large number of categories. Luckily, STATA allows us to create dummy variables much more quickly and easily. Instead of using the replace command to create a dummy variable for each category of a given variable, we can use the tabulate command together with its generate option.
For example, if we were to graph the categorical variable province (which has 14 categories), it would take quite a bit of time to create the necessary dummy variables one at a time. By using tabulate and generate together, however, these dummy variables can be created in one step.
Once again, however, this is a household-level variable, and we must treat it accordingly. Since each individual in a given household has the same value as everyone else in that household, we need to keep only one value per household; otherwise our results will reflect individuals and not households. We do this by using [_n]:
sort hhid
gen hhprov=province if hhid~=hhid[_n-1]
tab hhprov, gen(temp)
All the necessary dummy variables will be constructed. The 14 new dummy variables will be named temp1, temp2, temp3, and so on. The name "temp" (which is placed in parentheses, or round brackets, after the generate option) was arbitrarily assigned and could be replaced by any valid combination of letters and numbers. These new dummy variables all have values of 0 and 1. Thus, temp3 is set to equal 1 for all observations in which the hhprov value is equal to 3, and temp10 is set to equal 1 for all observations in which the hhprov value is equal to 10. After using this quick method of creating dummy variables, we simply type the following to graph the hhprov variable:
graph pie temp1-temp14

As you see, the command above creates a very basic and not very good-looking pie graph for the categorical variable hhprov. Let's make it look better! Type:
graph pie, over(hhprov)
title("PERCENT OF HOUSEHOLDS BY PROVINCE")
note("Source: 1994 data from South Africa Labour Development Research Unit")
caption("South Africa Distance Learning Project")
plabel(_all percent, size(*0.75) format(%9.0f) color(white))
legend(label(1 "CAPE") label(2 "NATAL") label(3 "TRANSVAAL")
label(4 "ORANGE FREE") label(5 "KWA ZULU") label(6 "KANGWANE")
label(7 "QWA-QWA") label(8 "GAZANKUL") label(9 "LEBOWA")
label(10 "KWANDEBE") label(11 "TRANSKEI") label(12 "BOPHUTHA")
label(13 "VENDA") label(14 "CISKEI") col(3));

Now this is a lot better. The specification of the pie graph should be fairly intuitive to you by now. If not, please review the notes above. There are a couple of new options that we specified, however, that you should be aware of. First, unlike the first hhprov graph, which used the dummy variables (temp1 through temp14), we specified this second pie graph using the over(hhprov) option. This instructs STATA to use the values in hhprov as the slices of the pie. Therefore, it is not necessary to create dummy variables to create a good-looking pie graph!
Also note that we specified size(*0.75) format(%9.0f), which tells STATA to scale the labels to 75% of the default size and to format the values with no decimal places. Finally, notice that we told STATA to relabel the legend and to display the names in 3 columns.
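The legend relabeling can also be avoided by attaching value labels to hhprov before graphing, because graph pie with over() takes its legend text from the value labels when they exist. The sketch below assumes that the province codes run from 1 to 14 in the same order as the legend labels used above; the label name provlbl is our own choice. Type:

```stata
#delimit ;
* provlbl is a hypothetical label name; codes 1-14 are assumed to match the legend order above ;
label define provlbl 1 "CAPE" 2 "NATAL" 3 "TRANSVAAL" 4 "ORANGE FREE"
5 "KWA ZULU" 6 "KANGWANE" 7 "QWA-QWA" 8 "GAZANKUL" 9 "LEBOWA"
10 "KWANDEBE" 11 "TRANSKEI" 12 "BOPHUTHA" 13 "VENDA" 14 "CISKEI";
label values hhprov provlbl;
graph pie, over(hhprov)
title("PERCENT OF HOUSEHOLDS BY PROVINCE")
plabel(_all percent, size(*0.75) format(%9.0f) color(white))
legend(col(3));
#delimit cr
```

Once the value labels are attached, they also appear in any tabulation of hhprov, so the table and the graph stay consistent.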
The by option is another helpful feature when creating graphs. The by option can be used to create separate pie and bar graphs for each category of an additional variable. For example, if we want a separate pie graph of the variable province for men and women, we would once again need to create a set of dummy variables, but this time for the original variable province, since in this case we are interested in each individual and not just the households they reside in. We begin by creating the new dummy variables:
tab province, gen(tmp)
Then we type:
sort gender_n
graph pie tmp1-tmp14, by(gender_n)

The commands above produce two separate pie graphs, one representing the distribution of women across the various provinces and another representing the distribution of men. Once again, however, the basic command produces a graph that needs some enhancing. Let's try the following:
graph pie tmp1-tmp14,
by(gender_n,
title("PERCENT OF INDIVIDUALS BY PROVINCE AND GENDER")
note("Source: 1994 data from South Africa Labour Development Research Unit")
caption("South Africa Distance Learning Project"))
legend(label(1 "CAPE") label(2 "NATAL") label(3 "TRANSVAAL")
label(4 "ORANGE FREE") label(5 "KWA ZULU") label(6 "KANGWANE")
label(7 "QWA-QWA") label(8 "GAZANKUL") label(9 "LEBOWA")
label(10 "KWANDEBE") label(11 "TRANSKEI") label(12 "BOPHUTHA")
label(13 "VENDA") label(14 "CISKEI") col(3))
plabel(_all percent, size(*0.75) color(white));

The syntax that we used for this last graph should be fairly easy to understand. You should note, however, that the title, note, and caption are all specified within the by(gender_n) option.
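The over() option can be combined with by() as well, so the tmp dummy variables are not strictly required for a graph of this kind. The following is a minimal sketch, assuming province and gender_n are coded as in the commands above; the legend will show whatever values or value labels province currently carries. Type:

```stata
#delimit ;
* one pie per category of gender_n, with slices taken directly from province ;
graph pie, over(province)
by(gender_n,
title("PERCENT OF INDIVIDUALS BY PROVINCE AND GENDER"))
plabel(_all percent, size(*0.75) color(white));
#delimit cr
```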
EXERCISES
Now it is your turn to explore the distributions of variables using the commands from this module.
Using STATA and the SALDRU12 data set, answer the following questions.
- Is the variable gender categorical or continuous? Considering this, how would you graph this variable in STATA?
- Exercise 1 Answer
- What percentage of the sample is made up of sons and daughters?
- Exercise 2 Answer
- Which of the racial groups makes up the largest proportion of urban residents?
- Exercise 3 Answer
- How many households are headed by females?
- Exercise 4 Answer