dtabxl - version 1.1
Descriptive statistics tables are ubiquitous in academic research. To make creating them easy, dtabxl tabulates univariate statistics in Excel. The Excel output can be easily sent to coauthors to communicate findings or copied to Word to produce elegant tables worthy of even the most prestigious academic journals.
Installing and updating
To install the current version of dtabxl, use the following code to search for zach.prof in Stata:
search zach.prof
Next click the link for dtabxl. Then click the link to install dtabxl.
After installation, you can type help dtabxl in Stata to view a comprehensive help file. You can also read the remainder of this webpage for examples that illustrate all of dtabxl's features.
If you have an old version of dtabxl on your computer and want to update to the current version, run the following code in Stata to uninstall the old version:
net uninstall dtabxl
Then run the search command from above and follow the links to install the current version.
You can alternatively use the following net install commands to install any current or non-current version of dtabxl:
net install dtabxl, from("https://raw.githubusercontent.com/zachprof/dtabxl/1.1") // installs version 1.1 (current)
net install dtabxl, from("https://raw.githubusercontent.com/zachprof/dtabxl/1.0") // installs version 1.0
Getting started
To demonstrate dtabxl's functionality, I am going to use the auto.dta dataset. This dataset comes with Stata and allows any Stata user to replicate my examples exactly as they are presented below. You can also access the code in my examples by downloading dtabxl examples.do.
To load auto.dta, type the following two commands in Stata:
clear
sysuse auto
Once auto.dta is loaded, I recommend using the cd command to change your working directory to an easy to find location. The location you specify is where dtabxl will save descriptive statistics tables by default. As an example, I might type the following in Stata's command window to change my working directory to my desktop:
cd "C:\Users\zacha\Desktop"
To make a descriptive statistics table, dtabxl requires (1) the variables you want in your table in the order you want them tabulated followed by (2) the word "using" and the name of the file you want your table saved in. For example, you might type the following in Stata:
dtabxl headroom length price weight mpg using "Auto"
This will produce a descriptive statistics table examining price, weight, gas mileage, and two measures of a car's size (headroom and length) in a file named Auto.xlsx saved in the location specified earlier using the cd command. The descriptive statistics table can be opened by navigating in your computer to the location specified using the cd command or by clicking a link that appears in Stata after running dtabxl (the link will say "click here to open Excel output"). If you open the Excel output, the descriptive statistics table should look like this:
This table is formatted in a very generic way. A standard set of univariate statistics is tabulated for each variable (observation numbers, means, standard deviations, and quartiles). In addition, all statistics are rounded to two decimal places (except for observation numbers).
Before moving on, you can check the accuracy of these results by typing the underlying command directly in Stata as follows:
tabstat headroom length price weight mpg, s(n mean sd q)
tabstat is the command dtabxl uses to calculate univariate descriptive statistics and typing the code above produces the outputs necessary to verify the accuracy of observation numbers, means, standard deviations, and quartiles.
To demonstrate a few of dtabxl's options, try running the following code and opening the new file it creates named Stats.xlsx:
dtabxl price weight mpg using "Stats", stats(p75 n p25 mean sd median) sheetname(D1) tablename(odd order)
dtabxl price weight mpg using "Stats", stats(n mean sd cv p25 median p75) sheetname(D2) tablename(cv)
dtabxl price weight mpg using "Stats", stats(mean) sheetname(D3) tablename(mean only)
dtabxl price weight mpg using "Stats", stats(n mean sd p25 p50 p75) sheetname(D4) tablename(p50)
This will result in four sheets, D1, D2, D3, and D4, with table names indicating what is different in each sheet. I've also limited the variables to price, weight, and mpg to keep each line of code roughly to one line on your computer screen if you're using a standard laptop or desktop to view this webpage.
If you open Stats.xlsx, you'll see statistics tabulated in a funny order in sheet D1. From left to right, each variable's 75th percentile, observations, 25th percentile, mean, standard deviation, and median is tabulated. Even though you likely won't want statistics tabulated in this order in a research paper, this example shows that dtabxl will tabulate statistics in any order you want and that statistics will be tabulated in the order in which they are entered using the stats option.
Moving to sheet D2, you'll see statistics tabulated in a sensible order with each variable's coefficient of variation tabulated after its standard deviation. This is done by inserting the letters "cv" between "sd" and "p25" in the second line of code.
cv is one of many additional statistics you can add to your descriptive statistics tables with dtabxl. You can see a full list of statistics accepted by dtabxl's stats option by typing help dtabxl##statname in Stata. To include these statistics in a table, include the statistic's statname in the stats option (e.g., "cv" for coefficient of variation).
You can also remove statistics from a table by removing the statistic's statname from the stats option. This is shown in sheet D3 where I've removed "n", "sd", "cv", "p25", "median", and "p75" from the stats option to create a table that only includes means.
Finally, in sheet D4, you'll see that "p50" is displayed at the top of the median's column rather than "median". This is the only statistic dtabxl currently accepts two names for but, as illustrated, the name displayed in the table will be the statname actually entered using the stats option.
Depending on the statistics you ultimately put in your table, you might find that two decimal places aren't informative. For example, if descriptive statistics are as small as 0.01 or 0.00, you may want to round to three or four decimal places. You can do this with the roundto option. However, to avoid creating more new spreadsheets, let's also use the replace option to overwrite the spreadsheet created in my first example. To do this, try the following command:
dtabxl headroom length price weight mpg using "Auto", roundto(4) replace
This overwrites the Descriptives sheet in Auto.xlsx with a new table that rounds to four decimal places rather than two. You can also round to different decimal places by replacing four with other integers from zero to 26.
This concludes my getting started tutorial. You can continue reading for more examples that demonstrate all of dtabxl's other options. You can also type help dtabxl in Stata to view a comprehensive help file containing all the information you need to put dtabxl to use tabulating beautiful descriptive statistics tables in your next paper.
Technical Note: If you want dtabxl to save tables to a location other than your working directory, you can do so by specifying the path and file name together. For example, the following code saves the initial table created in my first example to my downloads folder:
dtabxl headroom length price weight mpg using "C:\Users\zacha\Downloads\Auto"
Applying sample restrictions
You can use if and in as you normally would in any other Stata command to restrict the sample used in calculating descriptive statistics. For example, you could type the following in Stata to examine statistics for non-US car manufacturers:
dtabxl headroom length price weight mpg using "Auto" if foreign == 1, replace
If you open Auto.xlsx, you'll now see the sample restricted to 22 cars in auto.dta manufactured by non-US companies, such as Audi and Subaru.
Finally, in some datasets, you might find variables set to zero for a large number of observations. In these cases, it may be useful to evaluate nonzero values only. This can be done with the nozeros option. To demonstrate, let's first set some observations in auto.dta to zero with the following command:
replace headroom = 0 in 1/10
This sets the variable headroom equal to zero for the first ten observations in the dataset. Since these observations are all related to US car manufacturers, we'll have to remove the foreign == 1 restriction. You can now use the nozeros option as follows to produce a table that excludes the ten observations where headroom is set to zero when calculating descriptive statistics for headroom:
dtabxl headroom length price weight mpg using "Auto", nozeros replace
I've now shown you all the tools available for restricting the sample in dtabxl. Remember, you can always type help dtabxl in Stata if you need a refresher or would like to review dtabxl's other options on your own. If you continue reading, the next section covers options available for bifurcating the sample and the final section covers how to add extra rows and columns to your descriptive statistics tables, which can be helpful for copying tables from Excel to Word.
Bifurcating the sample
The bifurcate option allows you to bifurcate a sample into two groups based on a 1/0 indicator variable and tabulate side-by-side descriptive statistics for each group. In addition, bioptions (options that are only allowed if the bifurcate option is specified) allow you to test for mean and median differences across groups and change the appearance of the table. Finally, when testing mean or median differences, sigoptions allow you to control how statistical significance is reported.
All options introduced in prior sections work as previously discussed with the bifurcate option (e.g., the stats option can still be used as before to change the statistics tabulated and the order in which they are tabulated). Therefore, I only discuss bifurcate, bioptions, and sigoptions in this section. In addition, all sigoptions work the same for testmean and testmedian. Therefore, I only use testmean in many examples.
Before doing examples, let's reload auto.dta since you may have changed it in the last example. You can do this by typing the following two commands in Stata:
clear
sysuse auto
Now let's start with a side-by-side statistics table using dtabxl's default settings. We can create this table using a 1/0 indicator in auto.dta named foreign. This variable equals one for cars manufactured by non-US companies and zero for cars manufactured by US companies. You can use this variable with the bifurcate option as follows:
dtabxl price weight mpg using "Auto", bifurcate(foreign) replace
This code replaces the Descriptives sheet in Auto.xlsx with a new table comparing default statistics separately for the subsample where foreign = 1 (on the left side) vs. foreign = 0 (on the right side). However, if the focus were US cars, it may be better to have foreign = 0 (US cars) on the left. Given there are five statistics on each side, you may also want an extra column between US and non-US to make the table easier to look at. These changes can be achieved with switch and extrabicols as follows:
dtabxl price weight mpg using "Auto", bifurcate(foreign) switch extrabicols(1) replace
If you open Auto.xlsx after running this code, the new table should look like this:
Column H is the extra column produced by extrabicols. You can add more extra columns between the statistics by replacing the one in parenthesis with any other integer from one and 10 (e.g., extrabicols(2) would start statistics for foreign = 1 in column J, leaving columns H and I blank).
If you compare the two groups, you'll see differences in sample means and medians suggesting foreign cars cost more, weigh less, and are more fuel efficient. A natural question is whether these differences are statistically significant. To find out, try the testmean and testmedian bioptions as follows:
dtabxl price weight mpg using "Auto", bifurcate(foreign) switch extrabicols(1) testmean testmedian replace
You can verify that statistical significance is correctly reported in the resultant table using the following code:
ttest price, by(foreign)
median price, by(foreign) exact
ttest weight, by(foreign)
median weight, by(foreign) exact
ttest mpg, by(foreign)
median mpg, by(foreign) exact
ttest is the command dtabxl uses to test means and median is the command dtabxl uses to test medians. Statistical significance for means is based on the two-tailed p-values reported under "Ha: diff != 0" in Stata after running ttest. Statistical significance for medians is based on the two-tailed p-values reported to the right of "Fisher's exact =" after running median. If you open Auto.xlsx and cross-reference against the ttest and median commands, you'll see that means and medians are significantly different at the 0.05 level or better for weight and gas mileage, but not price.
The default for statistical significance reporting is to create additional columns to the right of the table that show changes from right to left in means and medians for each variable of interest. In addition, the default is to use one star to indicate p-value < 0.05 in the difference columns. However, you may want to remove difference columns and report statistical significance directly in the table or indicate statistical significance at a level other than 0.05. You can achieve these alternative formats with sigoptions. To test a few out, try the following code:
dtabxl price weight mpg using "Auto", bifurcate(foreign) testmean sig(0.01) bold replace
dtabxl price weight mpg using "Auto", bifurcate(foreign) testmean italic sright replace
dtabxl price weight mpg using "Auto", bifurcate(foreign) testmean bold italic nostars sright sleft replace
dtabxl price weight mpg using "Auto", bifurcate(foreign) testmean bold nostars replace
If you run each line of code and open Auto.xlsx before running the next line, you'll see first that bold results in bold text with one star. sig is also used to change the significance level to 0.01. Next, italic results in italics with one star. sright is also used to remove the difference column and report significance in the right side of the table. Third, bold, italic, and nostars together result in bold italics with no stars. sright and sleft are also used to report significance on both sides of the table, again with no difference column. Last, bold and nostars together result in bold text with no stars. Since, sright and sleft are removed, significance is again reported in a difference column to the right of the table. Finally, it is worth noting that bold, italic, nostars, sright, and sleft can be used in any combination, and sig accepts any number between zero and one.
The final sigoption is 3stars, which allows you to use three stars to indicate statistical significance at different levels (e.g., *** [**, *] for p-value < 0.01 [0.05, 0.10]). This sigoption cannot be used with bold, italic, nostars, or sig, and it requires you to specify three levels of significance in any order. The largest level will automatically be assigned one star, the smallest level three stars, and the other level two stars, no matter what order you enter them in. To demonstrate, try typing the following command in Stata:
dtabxl price weight mpg using "Auto", bifurcate(foreign) testmean 3stars(0.01 0.1 0.05) replace
At this point, we've covered almost all of dtabxl's options. As always, remember to type help dtabxl in Stata any time you need a refresher. The last section covers how to add extra rows and columns to your descriptive statistics tables, which can be helpful for copying tables from Excel to Word.
Adding extra rows and columns
The last two options available in dtabxl are extrarows and extracols. These options aren't useful if you're tabulating descriptive statistics in Excel to send to coauthors. However, they can really come in handy if you're trying to copy statistics from Excel to Word. To illustrate, consider the following table prepared in Word:
To get the spacing right, I added one extra row and column between every statistic. Therefore, you would need to manually add extra rows and columns in Excel, or paste each statistic one by one to avoid significant formatting issues in Word. You can alternatively use the extrarows and extracols options as follows:
dtabxl price weight mpg using "Auto", stats(n mean median) extrarows(1) extracols(1) replace
After running this code, you would want to open Auto.xlsx and copy the entire table of statistics as shown here:
Finally, to populate the table in Word, you could open the word document and use the "Keep Text Only" paste option as shown here:
The formatting looks a little funny in the image above because I haven't actually clicked the "Keep Text Only" button. However, after clicking "Keep Text Only", the table should go right back to the original spacing shown at the top of this section. To try for yourself, download a blank table here: descriptive statistics table.docx.
If you ever decide to use more than one extra row or column in a project, keep in mind the extrarows and extracols options will accept any integer from one to ten.
In addition, if you are using boldfaced or italicized text with testmean and/or testmedian, you will need to use the "Overwrite Cells" rather than the "Keep Text Only" paste option for boldfaced and/or italicized text to transfer to the Word document. In all other cases, "Keep Text Only" should work.
If you've read every section on this page, give yourself a pat on the back. You now know how to use all of dtabxl's options, including all bioptions and sigoptions. If you ever forget anything you learned here or need to quickly reference something, you can always type help dtabxl in Stata.
Happy coding!
Acknowledgements: I owe a special thanks to Jesse Chan, Rachel Flam, and Ben Osswald for carefully reviewing and providing helpful feedback on version 1.0 of this code, the complementary help file, and the documentation on this page. I also thank all three for suggesting the addition of separate columns for tabulating and reporting statistical significance for mean and median differences. I also thank Flam for suggesting the addition of clickable links to Excel output files. This addition has also been incorporated in ctabxl.
Code access: The files underlying every version of this code as well as information about what changed from one version to the next are readily accessible on GitHub.