INTRODUCTION
This module presents advanced graphing commands and assumes proficiency with STATA. If you have not completed the earlier modules, we highly recommend that you familiarize yourself with them before continuing.
First, however, you will need to download and unzip a slightly modified version of the SALDRU data set, which we will use for this module only. NOTE: this file is large and may take a long time to download, depending on your internet connection. To download the file, use the SALDRU Data link.
Scatter Plots
Here, we are going to graph the share of monthly household expenditure spent on food against total monthly household expenditure. These are household-level variables, so we only want one observation per household because the values will be the same for all members of a single household. To do this, we will use only those cases whose pcode equals one.
We need to generate the variable foodshare, which will be the share of household monthly food expenditures in total monthly household expenditures. To do this, we divide household monthly food expenditure (hhtfexp) by total monthly expenditure (totmexp). Also, remember that we will need to change the command delimiter from carriage return (default) to a semi-colon - ";":
#delimit ;
gen foodshare=hhtfexp/totmexp if pcode==1;
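Before plotting, it may be worth confirming that the new variable behaves as intended. The following is only a sketch using the variables already introduced above; whether every share actually falls between 0 and 1 depends on the data and is not guaranteed.

```stata
#delimit ;
/* foodshare should be missing for everyone except the pcode==1 member */
count if pcode!=1 & !missing(foodshare);
/* inspect the distribution of the share */
summarize foodshare, detail;
/* shares above 1 would suggest a problem in the expenditure variables */
count if foodshare>1 & !missing(foodshare);
```

The first count should be zero, because the if qualifier on gen leaves foodshare missing for all other household members.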
Now, we can produce a scatter plot of the food share against total household expenditure by entering the command:
graph twoway scatter foodshare totmexp;

Because this is such a common command, we can simply type the command scatter to get the same results:
scatter foodshare totmexp;

As we can see, this graph is difficult to analyze because the observations are concentrated on the far left. Stata automatically scales the graph to include all observations; however, we can restrict the sample to exclude the extremes.
scatter foodshare totmexp if totmexp<=7500;
Now that we’ve focused in on the concentration of observations, we can start to adjust the overall appearance of our graph. If we know that we have limited the observations to those at or below 7500, then we can adjust our x-axis to only go to 7500 and also determine our labels by using the axis label options xlab and xtic.
First we add the labels, which will adjust our axis:
scatter foodshare totmexp if totmexp<=7500,xlab(0(1000)7500);

This labels the x-axis every 1000 units, starting at 0. The last label is 7000, because the next step (8000) would exceed the upper limit of 7500.
If we want to add ticks in between our labels we can type:
scatter foodshare totmexp if totmexp<=7500,xlab(0(1000)7500) xtic(500(1000)7500);

This places a tick every 1000 units, starting at 500 and ending at 7500.
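The arguments given to xlab and xtic are ordinary Stata number lists of the form first(step)last. As a rough illustration, the numlist command can expand them so that we can see exactly which values will be labeled or ticked; the macro names labels and ticks are only illustrative.

```stata
#delimit ;
/* expand the label list: it stops at 7000, since 8000 would exceed 7500 */
numlist "0(1000)7500";
local labels `r(numlist)';
display "`labels'";
/* expand the tick list: 500, 1500, and so on up to 7500 */
numlist "500(1000)7500";
local ticks `r(numlist)';
display "`ticks'";
```

Because local macros disappear when a do-file finishes, these lines need to be run together.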
We can do the same for our y-axis, adding more labels and getting rid of the grid lines if wanted:
scatter foodshare totmexp if totmexp<=7500,xlab(0(1000)7500) xtic(500(1000)7500)
ylab(0(.1)1, nogrid);

Next we need to add a title to our graph and change the axis titles. We do that with the following syntax:
scatter foodshare totmexp if totmexp<=7500,title("Share of Total Monthly Expenditure Spent on Food")
xtitle("Total monthly expenditure")
ytitle("Share spent on food")
xlab(0(1000)7500) xtic(500(1000)7500)
ylab(0(.1)1, nogrid);

If we want to adjust the position or change the size of any of the labels or titles, then we can add more option commands.
To control the size of the number labels on the axis, we add labs(...) within the parentheses following xlab or ylab. To control the title sizes, we add size(...) within the parentheses following xtitle, ytitle, or title. We can also add margin(...) to determine the distance of the title from the graph edge.
scatter foodshare totmexp if totmexp<=7500,
title("Share of Total Monthly Expenditure Spent on Food", size(large)
margin(medium))
xtitle("Total Monthly Expenditure", size(medium) margin(small))
ytitle("Share Spent on Food", size(medium) margin(small))
xlab(0(1000)7500, labs(small)) xtic(500(1000)7500)
ylab(0(.1)1, labs(small) nogrid);

Now we can move on to the symbols or markers representing the observations within the graph. Our options for the shape of the markers are:
| SHAPE | SOLID, large | SOLID, small | HOLLOW, large | HOLLOW, small |
|---|---|---|---|---|
| circle | O | o | Oh | oh |
| diamond | D | d | Dh | dh |
| triangle | T | t | Th | th |
| square | S | s | Sh | sh |
| plus | + | smplus | | |
| x | X | x | | |
| point | p (very small dot) | | | |
| none | i (invisible symbol) | | | |
The option command used to determine the markers is simply an m(...) as seen here:
scatter foodshare totmexp if totmexp<=7500,
title("Share of Total Monthly Expenditure Spent on Food", size(large)
margin(medium))
xtitle("Total Monthly Expenditure", size(medium) margin(small))
ytitle("Share Spent on Food", size(medium) margin(small))
xlab(0(1000)7500, labs(small)) xtic(500(1000)7500)
ylab(0(.1)1, labs(small) nogrid) m(oh);

Another way to control the size of the markers is with msize(...):
scatter foodshare totmexp if totmexp<=7500,
title("Share of Total Monthly Expenditure Spent on Food", size(large)
margin(medium))
xtitle("Total Monthly Expenditure", size(medium) margin(small))
ytitle("Share Spent on Food", size(medium) margin(small))
xlab(0(1000)7500, labs(small)) xtic(500(1000)7500)
ylab(0(.1)1, labs(small) nogrid)
m(oh) msize(vlarge);

For printing purposes, however, you may want to adjust the color so that it is only in black, white and shades of gray. To do this you may pick from a variety of "schemes". A popular option for printing is scheme(s1mono) shown here:
scatter foodshare totmexp if totmexp<=7500,
xlab(0(1000)7500, labs(small)) xtic(500(1000)7500)
ylab(0(.1)1, labs(small) nogrid)
xtitle("Total Monthly Expenditure", size(medium) margin(small))
ytitle("Share Spent on Food", size(medium) margin(small))
title("Share of Total Monthly Expenditure Spent on Food", size(large)
margin(medium))
m(oh) scheme(s1mono);
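If every graph in a session is meant for black and white printing, it may be more convenient to change the default scheme once rather than adding scheme(s1mono) to each command. A minimal sketch, assuming the s1mono scheme is installed with your copy of Stata:

```stata
#delimit ;
/* list the schemes available in this installation */
graph query, schemes;
/* use s1mono for all graphs drawn during the rest of the session */
set scheme s1mono;
```

A scheme() option on an individual graph command still overrides this default.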

Line Graphs
Next we are going to see how a line graph is constructed. To do this, we are going to compare the mean of the logarithm of monthly income (logy) by race and gender against years of education. Before we can start graphing, we need to create the variables needed.
First, we use the egen command to create the mean of logy by years of completed education, race and gender.
sort edyears race male;
egen meanlogy=mean(logy), by(edyears race male);
We know that meanlogy takes the same value for everyone with the same combination of edyears, race and male, but Stata will plot a point for every observation, one on top of another. This can cause graphs to take a long time to copy or print. To eliminate this problem, we can tell Stata to plot only one point per group. To do this, we use our sorted data to create a variable that numbers the observations sequentially within each combination of edyears, race and male.
by edyears race male: gen number=_n;
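As an alternative to the sequential counter, egen also offers a tag() function that flags one observation per group. The sketch below uses a hypothetical variable name, firstobs, and compares the two approaches. Note that tag() skips groups in which edyears, race or male is missing unless its missing option is specified, so the raw counts can differ by those groups.

```stata
#delimit ;
/* firstobs is 1 for exactly one observation per group and 0 otherwise */
egen firstobs=tag(edyears race male);
/* compare the number of flagged observations under each approach */
count if firstobs==1;
count if number==1;
/* restricted to complete cases, the two counts should agree */
count if number==1 & !missing(edyears, race, male);
```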
Now, for graphing purposes, we generate a separate meanlogy variable for each combination of race and gender. Because everyone in a group has the same value of meanlogy at each value of edyears, we tell Stata to use only the first person in each group (number==1). That way, when Stata plots the graphs, there will be only one observation per group. The race codes below assume that race is coded 1 = African, 2 = Coloured, 3 = Indian and 4 = White; confirm the coding with tab race before running them.
gen logyam=meanlogy if race==1 & male==1 & number==1;
gen logyaf=meanlogy if race==1 & male==0 & number==1;
gen logycm=meanlogy if race==2 & male==1 & number==1;
gen logycf=meanlogy if race==2 & male==0 & number==1;
gen logyim=meanlogy if race==3 & male==1 & number==1;
gen logyif=meanlogy if race==3 & male==0 & number==1;
gen logywm=meanlogy if race==4 & male==1 & number==1;
gen logywf=meanlogy if race==4 & male==0 & number==1;
We use these variables for our graphs along with the same commands and options discussed earlier to make a scatter plot. For example, to compare the mean log of income for African males versus African females we enter the same command as we did for the scatter plot:
scatter logyam logyaf edyears;

Now, to turn this into a line graph, we must tell Stata how we want the markers connected.
For direct lines to be drawn, we indicate an l for each set of markers we want connected.
scatter logyam logyaf edyears, c(l l);

Now we can go back and add some of those options we used in the scatter plot example. Let’s start with the range, axis labels, and ticks.
How do we know what we want the range to be? We can start by finding the maximum and minimum values of our variables. To do this, we simply type:
sum meanlogy edyears;
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
meanlogy | 8068 6.677306 .9475648 4.493598 9.21034
edyears | 8031 6.196613 4.715922 0 15
Because we ultimately want to compare the mean log of income for all the groups, we summarize meanlogy here rather than entering each group variable separately.
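The minimum and maximum are also stored by summarize in r(min) and r(max), so they can be displayed directly without reading them off the table. A small sketch:

```stata
#delimit ;
/* r(min) and r(max) hold the results of the most recent summarize */
quietly summarize meanlogy;
display "meanlogy runs from " r(min) " to " r(max);
quietly summarize edyears;
display "edyears runs from " r(min) " to " r(max);
```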
As we can see, we want our x-axis to run from 0 to 15 and our y-axis to run from approximately 4.5 to 9.5. Let’s label each integer on both axes, and tick each .5 mark on the y-axis.
scatter logyam logyaf edyears, c(l l)
xlab(0(1)15) ylab(5(1)9) ytic(4.5(1)9.5);

Now we can add marker shape, titles, size and position:
scatter logyam logyaf edyears, c(l l)
title("Income by Years of Education, South Africa 1993")
xtitle("Years of schooling completed", margin(vsmall))
ytitle("Mean log income", margin(vsmall))
xlab(0(1)15) ylab(5(1)9, nogrid) ytic(4.5(1)9.5) m(T O);

We’re starting to make some progress, but there is a lot more "cleaning-up" to do before this graph is ready to be printed or used in a paper.
First of all, the variables in our legend do not have good names. Let’s give new labels to all of the variables we created earlier.
lab var logyam "African males";
lab var logyaf "African females";
lab var logycm "Coloured males";
lab var logycf "Coloured females";
lab var logyim "Indian males";
lab var logyif "Indian females";
lab var logywm "White males";
lab var logywf "White females";
Now run the command again:
scatter logyam logyaf edyears,
title("Income by Years of Education, South Africa 1993")
xtitle("Years of schooling completed", margin(vsmall))
ytitle("Mean log income", margin(vsmall))
xlab(0(1)15) ylab(5(1)9, nogrid) ytic(4.5(1)9.5)
c(l l) m(T O);

The next issue is the legend itself. As you can see, the dimensions of the graph change automatically to adjust to the legend and the size of the titles. We have control over the size of the legend and where it is placed.
First, let’s address the position of the legend. The placement is based upon the face of a clock: 12 is the top of the graph, 3 is the right-hand side, 6 is the bottom and 9 is the left-hand side.
So, if we wanted our legend on the right-hand side of the graph, we would indicate the position with the legend command:
scatter logyam logyaf edyears,
title("Income by Years of Education, South Africa 1993")
xtitle("Years of schooling completed", margin(vsmall))
ytitle("Mean log income", margin(vsmall))
xlab(0(1)15) ylab(5(1)9, nogrid) ytic(4.5(1)9.5)
c(l l) m(T O) leg(pos(3));

Again, we see how the graph automatically adjusts to the legend’s new position. This position makes the graph too narrow and doesn’t leave adequate room for our other titles and labels. To make a better fit, we can have our variables listed in a single column:
scatter logyam logyaf edyears,
title("Income by Years of Education, South Africa 1993")
xtitle("Years of schooling completed", margin(vsmall))
ytitle("Mean log income", margin(vsmall))
xlab(0(1)15) ylab(5(1)9, nogrid) ytic(4.5(1)9.5)
c(l l) m(T O) leg(col(1) pos(3));

Maybe it would be even better to have the entire contents of the legend stacked in one column:
scatter logyam logyaf edyears,
title("Income by Years of Education, South Africa 1993")
xtitle("Years of schooling completed", margin(vsmall))
ytitle("Mean log income", margin(vsmall))
xlab(0(1)15) ylab(5(1)9, nogrid) ytic(4.5(1)9.5)
c(l l) m(T O) leg(col(1) stack pos(3));

In this particular case, the best option might be to have the legend inside of the plot area. To do this, we use the ring-position option and indicate 0, then use the clock-position option to place the legend in the plot area.
scatter logyam logyaf edyears,
title("Income by Years of Education, South Africa 1993")
xtitle("Years of schooling completed", margin(vsmall))
ytitle("Mean log income", margin(vsmall))
xlab(0(1)15) ylab(5(1)9, nogrid) ytic(4.5(1)9.5)
c(l l) m(T O) leg(col(1) ring(0) pos(11));

Now we can attempt to add in the other variables:
scatter logyam logyaf logycm logycf logyim logyif logywm logywf edyears,
title("Income by Years of Education, South Africa 1993")
xtitle("Years of schooling completed", margin(vsmall))
ytitle("Mean log income", margin(vsmall))
xlab(0(1)15) ylab(5(1)9, nogrid) ytic(4.5(1)9.5)
c(l l l l l l l l) leg(col(1) ring(0) pos(11));

This legend is entirely too big to fit in the graphing area at that position. To find a better fit, we can adjust the text size, space between rows, the position and the number of columns.
scatter logyam logyaf logycm logycf logyim logyif logywm logywf edyears,
title("Income by Years of Education, South Africa 1993")
xtitle("Years of schooling completed", margin(vsmall))
ytitle("Mean log income", margin(vsmall))
xlab(0(1)15) ylab(5(1)9, nogrid) ytic(4.5(1)9.5)
c(l l l l l l l l) leg(col(2) ring(0) pos(5) rowg(.25) size(small));

For black and white printing purposes, we must change the scheme of the graph as we did with the scatter plot:
scatter logyam logyaf logycm logycf logyim logyif logywm logywf edyears,
title("Income by Years of Education, South Africa 1993")
xtitle("Years of schooling completed", margin(vsmall))
ytitle("Mean log income", margin(vsmall))
xlab(0(1)15) ylab(5(1)9, nogrid) ytic(4.5(1)9.5)
c(l l l l l l l l) leg(col(2) ring(0) pos(5) rowg(.25) size(small))
scheme(s1mono);

Here, Stata helped us by automatically changing the line patterns and the marker shapes and shades to differentiate among the variables. However, having eight different line types and eight different markers does not always make for an easily readable graph.
One option is to have the same markers across the board for males and another for females, while only changing the line type for each race group.
The line pattern styles are set with the clp(...) option, in the same way that we set the marker shape with m(...). The options are:
| Shape | Commands | Formula Options |
|---|---|---|
| solid line | solid | "l" |
| dashed line | dash | "-" |
| dotted line | dot | |
| dash & dot | dash_dot | |
| short dash | shortdash | "." |
| short dash & dot | shortdash_dot | |
| long dash | longdash | "_" |
| long dash & dot | longdash_dot | |
| invisible line | blank | |
| small blank space | | "#" |
Example of a formula: clp("_.._" "l" "-." etc.)
Now, we can specify our own marker shapes and line styles. We might also want to determine the marker color using mcolor(...) to avoid those shades that are too light.
Here’s an example of a graph using some of the options described above:
scatter logyam logyaf logycm logycf logyim logyif logywm logywf edyears,
title("Income by Years of Education, South Africa 1993")
xtitle("Years of schooling completed", margin(vsmall))
ytitle("Mean log income", margin(vsmall))
xlab(0(1)15) ylab(5(1)9, nogrid) ytic(4.5(1)9.5)
m(t Oh t Oh t Oh t Oh) mcolor(black black black black black black black black)
c(l l l l l l l l) clp("-#.#" "-#.#" "l" "l" "_#" "_#" "-" "-")
leg(col(2) ring(0) pos(5) rowg(.25) size(small))
scheme(s1mono);
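Because the same titles and axis options are repeated in every command of this section, one way to cut down on retyping is to store them in a local macro. The sketch below uses a hypothetical macro name, common. Local macros only exist while a do-file is running, so the definition and the graph commands must be run together.

```stata
#delimit ;
/* common is an illustrative macro name holding the shared options */
local common
    title("Income by Years of Education, South Africa 1993")
    xtitle("Years of schooling completed", margin(vsmall))
    ytitle("Mean log income", margin(vsmall))
    xlab(0(1)15) ylab(5(1)9, nogrid) ytic(4.5(1)9.5);
/* reuse the shared options and vary only what changes between graphs */
scatter logyam logyaf edyears, `common' c(l l) m(T O);
scatter logyam logyaf edyears, `common' c(l l) m(T O)
    leg(col(1) ring(0) pos(11));
```

Editing a title or an axis range then only requires a change in one place.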
