icodes - version 1.1
Industry classifications are commonly used in empirical research. To make coding them easy, icodes takes one variable containing 4 digit SIC codes (numeric or string) and automatically outputs multiple numeric variables containing 1, 2, 3, and 4 digit SIC codes and every Fama-French industry code (5, 10, 12, 17, 30, 38, 48, and 49).
Installing and updating
To install the current version of icodes, use the following code to search for zach.prof in Stata:
search zach.prof
Next click the link for icodes. Then click the link to install icodes.
After installation, you can type help icodes in Stata to view a comprehensive help file. You can also read the remainder of this webpage for examples that illustrate all of icodes's features.
If you have an old version of icodes on your computer and want to update to the current version, run the following code in Stata to uninstall the old version:
net uninstall icodes
Then run the search command from above and follow the links to install the current version.
You can alternatively use the following net install commands to install any current or non-current version of icodes:
net install icodes, from("https://raw.githubusercontent.com/zachprof/icodes/1.1") // installs version 1.1 (current)
net install icodes, from("https://raw.githubusercontent.com/zachprof/icodes/1.0") // installs version 1.0
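To confirm which copy of icodes ended up on your machine after installing or updating, a quick check along these lines may help. Both commands are built into Stata; what the header lines of the ado-file actually say about the version is not something this sketch assumes.

```stata
* Show where icodes.ado is installed and print any version lines in its header
which icodes

* Describe the installed package, including where it was installed from
ado describe icodes
```

If which reports that the command is not found, the installation did not complete and the steps above need to be repeated.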
Getting started
To demonstrate icodes's functionality, I am going to use a dataset I created of random SIC codes. This dataset will be accessible in Stata after installing icodes. The code in my examples can be accessed by downloading icodes examples.do.
To load the dataset of random SIC codes, type the following two commands in Stata:
clear
sysuse randomSICcodes
After loading the dataset, take a moment to look at the variables. There should be four variables, each with 250 observations. Each variable contains almost the same information, but with slight variations (described below) that you might encounter in your research.
Variable descriptions:
sic_str = baseline SIC code formatted as a string with leading zeros (e.g., 100 is recorded as "0100" rather than "100")
sic_str_err = sic_str with two errors: the 8th observation contains a letter, and the 10th observation is beyond the range of possible SIC codes
sic_str_no0 = sic_str with leading zeros removed
sic_num = sic_str as a numeric variable
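One way to look over these four variables before converting anything is sketched below. It uses only built-in Stata commands and the variable names listed above.

```stata
clear
sysuse randomSICcodes

* Storage types: the three sic_str* variables are strings, sic_num is numeric
describe

* Observations 8 and 10 carry the two deliberate errors in sic_str_err
list sic_str sic_str_err sic_str_no0 sic_num in 8/10
```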
Converting any of these variables to commonly used industry classifications is as easy as typing "icodes" followed by the name of the variable you want to use. For example, try the following code:
icodes sic_str
After running this code, you'll find 12 new variables in your dataset containing 1, 2, 3, and 4 digit SIC codes and every Fama-French code (5, 10, 12, 17, 30, 38, 48, and 49). All the new variables are formatted as numeric variables so they can easily be used in empirical analyses.
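The sketch below is one way to inspect the 12 new variables. The three count commands rest on an assumption that is not stated explicitly on this page: that the 1, 2, and 3 digit codes are simply the leading digits of the 4 digit code (which is consistent with SIC 0100 producing a one digit code of 0 and a two digit code of 1).

```stata
clear
sysuse randomSICcodes
icodes sic_str

* List the 12 variables icodes adds under its default names
describe sic1 sic2 sic3 sic4 ff5 ff10 ff12 ff17 ff30 ff38 ff48 ff49

* Assumption: shorter SIC codes are the leading digits of the 4 digit code
* Each count should be zero if that assumption holds
count if sic1 != floor(sic4/1000)
count if sic2 != floor(sic4/100)
count if sic3 != floor(sic4/10)

* Distribution of observations across the Fama-French 12 industries
tabulate ff12, missing
```

The block reloads the data, so it leaves the dataset in the same state as the example above and the tutorial can continue from here.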
In case a dataset does not have SIC codes perfectly formatted as strings with leading zeros, icodes is designed to automatically handle input data that is numeric, stored as strings with or without leading zeros, or contaminated with erroneous letters or out-of-range numbers.
To demonstrate, use the following code to delete the industry indicators created in the first example:
drop sic1-ff49
Now use the following code to create industry classification variables based on sic_str_err:
icodes sic_str_err
After running this code, you'll see two warnings. The first indicates observations were set to missing due to non-numeric values. This warning is a result of the 8th observation, which equals 672v rather than 6722. The second warning indicates observations were set to missing due to values below 0 or above 9999. This warning is a result of the 10th observation, which equals 10500 rather than 7363.
As suggested by the warning messages, if you open the data, you'll see that icodes automatically used all valid SIC codes to construct industry classifications as normal, assigning missing values to the 8th and 10th observations.
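If you want to see which observations triggered the warnings, a check like the following can be run independently of icodes. It is only a sketch: sic_check and sic_bad are helper names made up for this illustration, and the logic is not a description of how icodes detects errors internally.

```stata
clear
sysuse randomSICcodes

* real() returns missing when a string is not a valid number (e.g., "672v")
gen double sic_check = real(sic_str_err)

* Flag non-numeric values and values outside 0 to 9999
* (inrange() returns 0 when its first argument is missing)
gen byte sic_bad = !inrange(sic_check, 0, 9999)
list sic_str_err sic_check if sic_bad

* After icodes runs, the flagged observations should be the ones set to missing
icodes sic_str_err
list sic_str_err sic4 ff12 if sic_bad
```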
sic_str and sic_str_err both contain SIC codes with leading zeros but, as mentioned above, icodes also works when leading zeros have been dropped. To demonstrate, run the following code:
drop sic1-ff49
icodes sic_str_no0
If you open the data after running this code, you'll see icodes created the same correct industry classifications it created in the first example. An easy way to test this is to look at the 245th observation, which has an SIC code of 100. As you will see, icodes successfully sets the one digit SIC code to zero rather than one, even though 100 has no leading zero. In addition, the two digit SIC code is one rather than 10, and so on.
Finally, you can verify icodes also automatically knows how to deal with numeric input data by running the following code:
drop sic1-ff49
icodes sic_num
Again, after running this code, you can open the data and inspect the output for correctness.
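Rather than inspecting the output by eye, you could compare the three clean input formats directly. The sketch below relies on the suffix option (described under Options) to keep the three sets of output side by side; the suffixes _s, _z, and _n are arbitrary choices for this illustration.

```stata
clear
sysuse randomSICcodes

* Build one set of classifications per input format
icodes sic_str, suffix(_s)
icodes sic_str_no0, suffix(_z)
icodes sic_num, suffix(_n)

* Each count should be zero if the three inputs produce identical output
foreach v in sic1 sic2 sic3 sic4 ff5 ff10 ff12 ff17 ff30 ff38 ff48 ff49 {
    count if `v'_s != `v'_z | `v'_s != `v'_n
}
```

Stata treats two missing values as equal in this comparison, so observations that are missing in all three sets do not count as differences.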
This concludes my getting started tutorial. You can continue reading for more examples that demonstrate icodes's options. I also comment on a few technical issues, including how icodes compares to Judson Caskey's ffind. You can also type help icodes in Stata for a comprehensive help file that explains everything you need to know to put icodes to work in your next research project.
Options
icodes has the following four options:
suffix: adds a suffix to variable names
short: uses shorter value labels for Fama-French codes (e.g., "Hlth" rather than "Healthcare, Medical Equipment, and Drugs")
nolabel: uses no value labels for Fama-French codes so the underlying number is displayed (e.g., 10 rather than "Healthcare, Medical Equipment, and Drugs")
nomissing: puts SIC codes not mapped to a particular Fama-French industry code in Fama-French's "other" category rather than setting them to missing
The suffix option is useful if you already have a variable in your dataset with one of the variable names icodes uses by default (sic1, sic2, sic3, sic4, ff5, ff10, ff12, ff17, ff30, ff38, ff48, or ff49). It is also useful if you want to create industry classifications based on two or more alternative SIC codes (e.g., historical vs. current).
Continuing from the prior example, to demonstrate suffix, run the following code:
icodes sic_str, suffix(_icodes_str)
icodes sic_str_err, suffix(_icodes_err)
After you run this code, there will be (1) a set of new industry classification variables based on sic_str with the suffix "_icodes_str" appended to their names and (2) a set of new industry classification variables based on sic_str_err with the suffix "_icodes_err" appended to their names.
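The historical vs. current use case mentioned above might look something like the sketch below. The variable names sic_hist and sic_curr and the three rows of data are made up purely for illustration. The block is wrapped in preserve and restore so the tutorial dataset is back in memory afterwards.

```stata
* Hypothetical example: sic_hist and sic_curr are made-up names and values
preserve
clear
input str4 sic_hist str4 sic_curr
"1311" "4911"
"2834" "2834"
"6020" "6020"
end

* One set of classifications per SIC variable, distinguished by suffix
icodes sic_hist, suffix(_hist)
icodes sic_curr, suffix(_curr)

list sic_hist sic_curr ff12_hist ff12_curr

* Number of observations whose Fama-French 12 industry differs
count if ff12_hist != ff12_curr
restore
```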
The short and nolabel options address an entirely different issue. Specifically, the names given to Fama-French industries (e.g., "Healthcare, Medical Equipment, and Drugs") can be so long that, when you open a dataset, it's only possible to view two or three variables at a time.
The short option addresses this issue by assigning abbreviated value labels to Fama-French industries. To see this, try the following code:
icodes sic_str, short suffix(_short)
If you open the data after running this code, you'll see shorter names are used as labels for Fama-French industries, making it much easier to see variables in the dataset.
The nolabel option addresses the same issue by assigning no value labels to Fama-French industries. To see this, try the following code:
icodes sic_str, nolabel suffix(_no)
If you open the data after running this code, you'll see no names are used as labels for Fama-French industries, again making it much easier to see variables in the dataset.
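To compare the three labeling choices without opening the data browser, something like the following should work. It reloads the data and recreates the variables from the examples above, so it can be run on its own.

```stata
clear
sysuse randomSICcodes

* Default (long labels), short labels, and no labels
icodes sic_str
icodes sic_str, short suffix(_short)
icodes sic_str, nolabel suffix(_no)

* The same industry variable under each labeling choice
tabulate ff12
tabulate ff12_short
tabulate ff12_no

* describe shows which variables have a value label attached
describe ff12 ff12_short ff12_no
```

If you only need the underlying numbers occasionally, Stata's own tabulate ff12, nolabel displays them without dropping the labels.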
The last option, nomissing, deals with an issue in how different Fama-French codes are mapped to SIC codes. To see this issue, it will help to go to Kenneth French's website and download the files that map SIC codes to Fama-French codes. What you'll find is that Fama-French codes are mapped to SIC codes using one of the following two methods:
Method 1: Assign specific 4 digit SIC codes to every Fama-French industry except "other" (Fama-French 5, 10, 12, and 38 codes use this method).
Method 2: Assign specific 4 digit SIC codes to every Fama-French industry including "other" (Fama-French 17, 30, 48, and 49 codes use this method).
For Fama-French codes based on method 1, icodes automatically puts unassigned SIC codes in the "other" industry because the "other" industry is deliberately defined as a catch-all group.
For Fama-French codes based on method 2, icodes sets unassigned SIC codes to missing by default, which is what happens to ff17_no in the 26th observation if you take a look at the data from the last example. The nomissing option overrides this default and puts all unassigned SIC codes in the "other" industry.
To see this in action, run the following code:
icodes sic_str, nolabel nomissing suffix(_nomiss)
If you still have the data open, you'll see ff17_nomiss is set to 17 (i.e., the "other" category) rather than missing.
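To get a sense of how many observations nomissing affects, the default and nomissing output can be compared side by side. This is a sketch that reuses the suffixes from the examples above and reloads the data so it can be run on its own.

```stata
clear
sysuse randomSICcodes

* Default behavior vs. nomissing, both without value labels
icodes sic_str, nolabel suffix(_no)
icodes sic_str, nolabel nomissing suffix(_nomiss)

* Observations that are missing by default but assigned under nomissing
foreach n of numlist 17 30 48 49 {
    count if missing(ff`n'_no) & !missing(ff`n'_nomiss)
}

* The 26th observation discussed above
list sic_str ff17_no ff17_nomiss in 26
```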
At this point, we've covered all of icodes's options. As always, remember to type help icodes in Stata any time you need a refresher. The next section compares icodes to Judson Caskey's ffind. The final section explains how icodes deals with a few minor inconsistencies in the mapping files on Kenneth French's website.
Comparison with Judson Caskey's code
Judson Caskey (an accounting professor at UCLA) developed a Stata command ffind that calculates Fama-French industry codes. His code is widely used and can be downloaded from his website (https://sites.google.com/site/judsoncaskey/). However, I recommend using icodes due to the following advantages over ffind:
ffind only accepts numeric SIC codes as inputs. icodes accepts numeric and string SIC codes, and automatically identifies and deals with data errors, allowing you to more easily create Fama-French industry codes using a wider range of input data.
ffind only outputs one Fama-French code at a time. icodes automatically outputs every Fama-French code and 1, 2, 3, and 4 digit SIC codes, allowing you to more easily evaluate and use alternative industry classifications.
ffind only uses long value labels for Fama-French industries. icodes allows you to choose long, short, or no value labels, making it easier to work with large datasets.
ffind puts all unmapped SIC codes in the "other" industry for Fama-French 17, 30, 48, and 49 codes. icodes allows you to choose how these unmapped SIC codes are dealt with using the nomissing option, as explained in the last section.
In summary, icodes is easier to use, requires fewer lines of code, and gives you more control in how Fama-French industry codes are calculated.
However, if you're considering transitioning from ffind to icodes, it is important to verify that both commands result in exactly the same industry classifications. The following code allows you to do so:
* Create dataset containing every number from 0 to 9999
clear
set obs 10000
gen sic = _n - 1
* Use icodes to create numeric sic and Fama-French codes
icodes sic, nomissing
* Use Judson Caskey's ffind to create Fama-French codes
foreach ffn of numlist 5 10 12 17 30 38 48 49 {
    local jcname ff`ffn'jc
    ffind sic4, newvar(`jcname') type(`ffn')
}
* Count number of observations where icodes and ffind result in different outcomes
* NOTE: Should be zero for every Fama-French code
foreach ffn of numlist 5 10 12 17 30 38 48 49 {
    count if ff`ffn' != ff`ffn'jc
}
Technical note on overlapping SICs
If you go to Kenneth French's website and download the files that map SIC codes to Fama-French codes, you'll find overlapping SIC code cohorts in the Fama-French 5 and 10 mapping files.
The overlapping SIC code cohorts in the Fama-French 5 mapping file are as follows:
Fama-French industry number 2 is mapped to SIC codes 3830-3839 and Fama-French industry number 3 is mapped to SIC codes 3810-3839.
Fama-French industry number 2 is mapped to SIC codes 3580-3629 and Fama-French industry number 3 is mapped to SIC code 3622.
To remove these overlaps, I made the following modifications to the Fama-French 5 mapping file in creating ff5sics.dta (the data file icodes uses to map SIC codes to Fama-French 5 codes):
I removed 3830-3839 from Fama-French industry number 2 because SIC codes in the 3800s are more high-tech than manufacturing (see descriptions of industries in major SIC group 38 on osha.gov).
I removed 3622 from industry number 2 because this SIC code is specifically identified as belonging to industry number 3.
The overlapping SIC code cohorts in the Fama-French 10 mapping file are as follows (the overlapping SIC codes and methods used to remove overlaps are the same as above, but the Fama-French industry numbers are different):
Fama-French industry number 3 is mapped to SIC codes 3830-3839 and Fama-French industry number 5 is mapped to SIC codes 3810-3839.
Fama-French industry number 3 is mapped to SIC codes 3580-3629 and Fama-French industry number 5 is mapped to SIC code 3622.
To remove these overlaps, I made the following modifications to the Fama-French 10 mapping file in creating ff10sics.dta (the data file icodes uses to map SIC codes to Fama-French 10 codes):
I removed 3830-3839 from Fama-French industry number 3 because SIC codes in the 3800s are more high-tech than manufacturing (see descriptions of industries in major SIC group 38 on osha.gov).
I removed 3622 from industry number 3 because this SIC code is specifically identified as belonging to industry number 5.
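The effect of these modifications can be checked directly. The sketch below builds the same 0 to 9999 dataset used in the ffind comparison and looks at the SIC codes involved in the overlaps. Based on the modifications described above, they should fall in industry 3 for Fama-French 5 and industry 5 for Fama-French 10.

```stata
* Create dataset containing every number from 0 to 9999
clear
set obs 10000
gen sic = _n - 1

* nolabel so the underlying industry numbers are displayed
icodes sic, nolabel

* SIC codes 3830-3839: expected in ff5 industry 3 and ff10 industry 5
tabulate ff5 ff10 if inrange(sic, 3830, 3839)

* SIC code 3622: expected in ff5 industry 3 and ff10 industry 5
list sic ff5 ff10 if sic == 3622
```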
Acknowledgements: I owe a special thanks to Mayer Liang and Jessica Nylen for carefully reviewing and providing helpful feedback on version 1.0 of this code, the complementary help file, and the documentation on this page. I also thank Jessica Nylen for suggesting the nolabel option.
Code access: The files underlying every version of this code as well as information about what changed from one version to the next are readily accessible on GitHub.