TABLE OF CONTENTS
Introduction
Launching STATA
Getting Ready
Loading Data into STATA
Exploring the Data
Managing the Data
Exercises
INTRODUCTION
In this module, you will be introduced to the statistical program STATA. Throughout this second module we are going to focus on some issues relating to the family. By the time you are finished with this module, you will be able to answer questions like:
- How many households are in the SALDRU survey?
- What are some variables in the data set that are related to food?
- How old is the oldest person in the data set?
- How many 50 year old women are in the data set?
LAUNCHING STATA
The first thing we need to do is start the statistical software program, STATA. So let's go! Launch STATA now if you have not already done so.
We will need to go back and forth between the window that STATA uses and the window used by your web browser (with which you are now viewing this text.)
When STATA is running, there are a number of "windows" within STATA. The "STATA Command" window is where we will enter all STATA commands. Every effort will be made to display the STATA commands used in these modules in black Courier font. The "Review" window lists all commands run by STATA; we can repeat any command listed there by clicking on it instead of re-typing it. The "STATA Results" window is where all the output from our commands appears. The "Variables" window lists the variables that are in your data set. When we first open STATA, all these windows will be blank except for the "STATA Results" window.
GETTING READY
We'll need to do a couple things before actually loading the data set. First, you will want to get in the habit of opening what is called a log file before you start your work. This file will record all the input that you type as well as all the output produced by STATA. It is a useful file to have for a number of reasons. It will let us re-create our work if we later decide we want to redo something. It also allows others to replicate our work. The log file can also be used to cut results from and paste them into another file for editing purposes. For all these reasons, always open a log file. We can easily delete it later if we decide not to keep it. To open a log file, in the STATA command window, type:
log using "C:\MellonCourse\filename.log", replace
If the log file does not already exist, you will see a STATA message warning you that the file you created is new. That's fine. The replace option in the command line above tells STATA to write over any existing log file with the same name. If we wanted to add this log file on to an existing file without erasing the contents of the existing file, we would use:
log using "C:\MellonCourse\filename.log", append
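A complete logging session might look like the minimal sketch below. The file name example.log is hypothetical and is written to the current working directory rather than to the course folder. The log close command, which ends the recording, is not covered in the text above.

```stata
* Sketch of a logging session; "example.log" is a hypothetical file name
* written to the current working directory.
* Close any log that is already open (capture ignores the error if none is)
capture log close
log using "example.log", replace
display "This line is recorded in the log"
* Stop recording and close the file
log close
```

Closing the log before exiting STATA ensures that everything typed during the session has been written to the file.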
AN IMPORTANT HINT: For every STATA command that we will use, there is on-line help within STATA. For example, above we used replace and append as options to the log command. There are other options, and these can be investigated by simply typing:
help log
This help command works for almost every command in STATA. If there are no STATA manuals handy, the help command is invaluable. The help command will tell us how to use a command, what that command does, its options, and even some examples. Use it often!
The second step we will need to take before opening the data set is to tell STATA how much memory the data set will require. In this module, we will use the smallest of the SALDRU data sets. This data set is:
C:\MellonCourse\SALDRU12
Other data sets have slightly different names. If you are using a folder other than "MellonCourse" (which would be a subdirectory you would create on your 'C:\' drive), that name too would be different. This data set requires 4 megabytes of memory. To tell STATA how much memory to set aside for our data, type:
set mem 4m
If you were using a larger version of the SALDRU data set, or any other larger data set, you would substitute the "4m" in the set mem command above using these memory guidelines.
LOADING THE DATA INTO STATA
Now we are ready to open the data. There are several ways to do this, but the easiest way while in STATA is to click on the "File" menu at the top left and then click on "Open." Then navigate to the folder to which you downloaded the data. We have called that folder MellonCourse. Finally, click on the data set. It will be named SALDRUxx, where "xx" represents the version of the SALDRU data set you have chosen to use. After selecting the data file, click on "Open" and the file will open in STATA. Alternatively, if you know the full path and file name of the data file, you could enter this directly in the "STATA Command" window using:
use "C:\MellonCourse\SALDRU12", clear
The clear option at the end of the above command will remove from memory any data currently there. Using this option does not pose a problem when you are getting started, but you should be aware that if you load a new data set using the clear option, you will lose all the changes you might have made, unless you save the changes before using the command. Now we are ready to begin exploring the data!
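The effect of the clear option can be seen on a small made-up data set. The sketch below does not use the SALDRU data; the file name toy_example.dta and the variables id and scratch are hypothetical, and the file is saved to the current working directory.

```stata
* Build a tiny hypothetical data set and save it to disk
clear
set obs 3
generate id = _n
save "toy_example.dta", replace
* Make a change that exists only in memory (it is never saved)
generate scratch = 1
* Reloading with the clear option discards the unsaved variable
use "toy_example.dta", clear
describe
```

After the final command, describe lists only id, because scratch was created after the file was saved.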
EXPLORING THE DATA
Now that the data is loaded into STATA you will notice that the "Variables" window now has two new columns of information. The first column is the list of variables in the data we just loaded and the second column displays the attached variable labels. A variable label is a brief description of the variable's content. For example, the second variable in this window is hhid and the label for this variable reads "household identification no." To learn more about what this variable really is, we would need to go back to the SALDRU Survey (using the menu on the left.) We'll do this later. In this case, hhid is a unique number given to each household in the survey that allows us to identify the household without compromising the household's true identity. Take a few minutes to scan the list of variables using the scroll bar on the "Variables" window. There are several ways within STATA to further explore the contents of your data set. One example of this is the command ds. In the STATA command window type:
ds
In the "STATA Results" window, we now see a list of variable names. These are the same variable names that appear in the "Variables" window. The ds command is a handy way to see a large number of variable names at once, although the variable labels associated with these names are not shown. It turns out that not all the variable names will fit in the window. At the bottom of the window, it may say "--more--." Get used to this. By simply tapping the space bar on your keyboard, you can scroll through the windows of information. If you don't want the output to stop after every full screen, just type:
set more off
Here are some other useful commands for exploring the data set:
describe will tell you how many observations are in the data set, how much memory the data set is using, what the variables are, how much memory each variable is using, and how many variables are in the data set. There are other details that don't really concern us at this point.
codebook will provide very detailed information about every variable in your data set. It will tell you, for each variable, the number of missing observations, the largest and smallest values, the number of unique values, and some information regarding the mean and standard deviation of the variable. We will discuss means and standard deviations later in Module 4.
list will tell you more information than you ever want to know unless you use it with some of the qualifiers described immediately below. This command simply prints out everything in the data set.
lookfor is like a dictionary. You can specify what you are looking for and this command will list all variable names or labels that contain the list of letters (or string) that you give it. Some examples are listed after the introduction of the qualifiers and operators.
As you use these commands, you will often want to use qualifiers and operators. By using these options, you can restrict the specified STATA command to a specific sub-population.
| Qualifiers: | | Comparison Operators: | | Logical Operators: | |
|---|---|---|---|---|---|
| if | qualify when a command is executed | == | equal to | \| | or |
| in | specify which observations to examine | != | not equal to | & | and |
| | | ~= | not equal to | | |
| | | > | greater than | | |
| | | < | less than | | |
| | | >= | greater than or equal to | | |
| | | <= | less than or equal to | | |
Qualifiers and operators add more detail to these data exploration commands. For instance, try some of the following examples:
describe
This command alone lists all the variables and their corresponding labels in the data set.
Now if we type the following:
describe rel_head educ_c race
A list of the variables rel_head, educ_c, and race, along with their labels, is displayed in the STATA Results window.
Now what if we wanted to explore information specific to a particular group or person only? This is where the qualifier and operators are used. Using qualifiers and operators allows us to apply STATA commands to specific observations in the data. To make things a bit more clear, here are some examples:
list in 200
Allows us to examine the contents of the 200th observation.
list if race == 3
Allows us to examine the contents of all observations where race equals 3; the number 3 in this case refers to Indians, thus only information for Indian respondents will be displayed.
list race if age > 100
This command will print all the observations for the race variable where age of the individual is greater than 100. When you try this command, you will note that many of the observations for race are recorded as a missing value, "." If you wanted to list the race of all individuals older than 100 years old and did not want to list those for whom race was missing, you could type:
list race if age > 100 & race ~= .
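The way qualifiers, operators, and missing values interact can be tried on a small made-up data set. The sketch below uses five hypothetical observations rather than the SALDRU data. It relies on the fact that STATA treats a missing value (.) as larger than any number, so a condition such as age > 100 is also true when age itself is missing.

```stata
* Five hypothetical observations; "." marks a missing value
clear
input age race
25 1
50 3
50 .
104 2
. 4
end
* Both conditions must hold: 50 years old AND race equal to 3
list if age == 50 & race == 3
* Missing values sort above every number, so this also lists
* the observation whose age is missing
list if age > 100
* Exclude missing ages explicitly
list if age > 100 & age < .
```

The second list shows two observations (age 104 and the missing age), while the third shows only the person aged 104.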
codebook hhsizem
This provides you with the codebook details for the variable hhsizem. The results from this command tell us that the size of households varies from 1 to 17 in South Africa, and that average household size is 6.42.
lookfor income
Gives us all the variables that have "income" in their name or their label. Entering this command, we can see that in addition to total monthly income, there are five separate categories of income in the data set.
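To see that lookfor searches labels as well as names, consider the sketch below. The three variables and their labels are hypothetical and are not taken from the SALDRU data.

```stata
* One hypothetical observation with three labelled variables
clear
set obs 1
generate wage = 100
generate rent = 20
generate hhsize = 4
label variable wage "monthly income from wages"
label variable rent "income from renting out land"
label variable hhsize "number of household members"
* Finds wage and rent through their labels, not their names
lookfor income
```

Neither wage nor rent has "income" in its name, yet both are listed because the word appears in their labels; hhsize is not listed.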
Let's see if you have gotten the hang of this. Try these quick exercises:
- 1. What is the hhid value for the 1000th observation?
- Question 1 Answer
- 2. What are the ages of the respondents with an hhid equal to 42011?
- Question 2 Answer
- 3. How many 50 year old Indian respondents are there in the data? (Be careful: this one is a bit tougher than the others.)
- Question 3 Answer
MANAGING THE DATA
In this section, we'll learn some data management commands. In many circumstances, we will want to amend the original SALDRU data set. We might want to add new variables that we create from the existing variables, we might want to drop variables that we will never use to free up memory, we might want to recode missing values, and/or we might want to create our own variable labels. While we will not deal with some of the trickier data management issues (such as merging data sets) in this section, you will learn enough to get started. We'll start by creating a new variable. To create a new variable in STATA you use the command generate. When using generate (or gen for short) you must specify two values, a name for the new variable, and what the new variable is equal to. Let's try an example. Suppose we want to create a new variable called "temp" and we would like this variable to be equal to 1 for every observation in the data set.
To create this new variable you need to type:
generate temp = 1
Go ahead and enter this command into STATA. You will notice that our new variable temp has been added to the end of the variable list in the STATA Variables window. How do we know for sure what this new variable is equal to? Did the command work correctly?
See if you can figure this out:
- 4. How can we check to make sure our new variable is equal to 1?
- Question 4 Answer
- 5. How would you create a new variable called "temp2" that is equal to 50 for all of the observations in the data set?
- Question 5 Answer
Now that we have created our new variable, it would be useful to create a label for it to help us identify what the variable is. To create a label for a variable you use the STATA command label variable. To label the new variable temp type:
label variable temp "This is a temporary variable equal to 1"
This command will add the label "This is a temporary variable equal to 1" to the new variable temp. Typing the following STATA command you will see that the new label has been added:
describe temp
Now you may be asking yourself why you would want to create a variable like temp that has the same value for each observation in the data set. Variables like temp can actually be quite useful at times, although most of the variables you create will not have the same value for all observations. Let's try another example. Suppose we want to create a variable for race that has only two values: white and non-white. The variable race in the SALDRU data set has four values: White, Indian, African, and Coloured. We will want to use the commands generate and replace. We'll call the new variable race2. In the STATA Command window, type:
generate race2 = .
This command will create a new variable called race2, with all values of this variable set to "missing". If the individual is white, we want to code race2 to equal 0. If the individual is non-white, we want to code race2 to equal 1. We do this by using the replace command. The replace command allows us to change the values of an existing variable. By typing the following commands, one at a time, we recode the new variable race2 according to our desired scheme:
replace race2 = 0 if race == 4
replace race2 = 1 if race == 1 | race == 2 | race == 3
In the above commands, we needed to know how the original variable race was coded. All original coding can be found in the SALDRU survey. Next we will want to put a label on this variable so we know what it is. This requires the label variable command.
Type:
label variable race2 "0 if White, 1 if Non-White"
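The whole recode can be checked on a small made-up data set. The sketch below uses five hypothetical observations, one for each race code used in the commands above plus one missing value. The tabulate command, which is not covered in this module, is used here only to cross-check the old variable against the new one.

```stata
* Five hypothetical observations: codes 1 to 4 and one missing value
clear
input race
1
2
3
4
.
end
generate race2 = .
replace race2 = 0 if race == 4
replace race2 = 1 if race == 1 | race == 2 | race == 3
label variable race2 "0 if White, 1 if Non-White"
* Cross-tabulate old against new; the missing option shows that
* an observation with missing race keeps a missing race2
tabulate race race2, missing
```

Starting from missing and then filling in values with replace means that any observation not covered by one of the conditions stays missing instead of being silently assigned to a group.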
You will now notice that our new variable race2 is listed at the bottom of the Variables window with its new label. All right, now it's your turn. Try to answer the following questions:
- 6. How would you create a new variable called "head" that is equal to 1 if the individual is the resident head, and equal to 0 if the individual is not?
- Question 6 Answer
- 7. How would you label this new variable with the following label - "Resident head indicator"?
- Question 7 Answer
We can also generate variables using the mathematical operators. For example, if we wanted to simply add crop rental income and grazing rental income to create a new variable for agricultural rental income (called ag_inc) we would type:
generate ag_inc = farmrent + liverent
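One thing to watch for when adding variables is that a missing value in either component makes the sum missing. The sketch below illustrates this with three hypothetical households; the variable names match the command above, but the values are made up.

```stata
* Three hypothetical households; the third has missing liverent
clear
input farmrent liverent
100 50
0 25
200 .
end
generate ag_inc = farmrent + liverent
* ag_inc is missing for the third household because liverent is missing
list
```

If a missing component should instead count as zero, that decision has to be made explicitly, for example by recoding the missing values with replace before adding.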
Other useful data management commands are:
drop will remove from memory the variables that are listed after the command. For example, if we no longer needed the variable race2, we could type drop race2. If we want to drop lots of variables, it is usually easier to use the keep command instead.
keep will retain only the listed variables and drop all the others. Be careful when using this command since it eliminates from memory everything that is not listed.
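The difference between drop and keep can be sketched on a made-up data set. The variable names below echo those used in this module, but the values are hypothetical.

```stata
* Two hypothetical observations with four variables
clear
set obs 2
generate hhid = _n
generate age = 30
generate race = 1
generate race2 = 1
* Remove a single variable
drop race2
* Retain only the listed variables; race is removed as a side effect
keep hhid age
describe
```

Both commands change only the copy of the data held in memory. The file on disk is untouched unless the data are saved over it.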
EXERCISES
Using STATA and the data set named SALDRU12, answer the following questions. After you think you have the answer, you can click on the "Answer" link to see if you have the correct answer. Remember, use the help command if you need to. We'll start with the questions at the top of this module and then move on to some more.
- How many households are in the SALDRU12 data file?
- Exercise 1 Answer
- What are some variables in the data set that are related to food?
- Exercise 2 Answer
- How old is the oldest person in the data set?
- Exercise 3 Answer
- How many 50 year old women are in the data set?
- Exercise 4 Answer
- How would you create a new variable that is equal to 1 for Africans and 0 for non-Africans?
- Exercise 5 Answer
- How would you drop every variable from the data set except the new variable you just created?
- Exercise 6 Answer