Iecodebook
iecodebook is the final command in iefieldkit, the Stata package created by DIME Analytics. After data collection is complete, iecodebook allows the research team to automate the repetitive steps involved in cleaning data before further analysis. As the name suggests, the iecodebook command is structured around Excel-based codebooks, which allow researchers to perform and document data cleaning tasks in Excel itself, instead of in do-files.
Read First
- Stata coding practices.
- iefieldkit.
- The sub-commands of iecodebook allow the research team to rapidly clean, harmonize, and document datasets using codebooks.
- Codebooks allow researchers to document the cleaned data in a format that is both human- and machine-readable.
- To install iecodebook, type ssc install iecodebook in Stata.
- To install all the commands in the iefieldkit package, type ssc install iefieldkit in Stata.
- For instructions and available options, type help iecodebook.
Overview
As its name suggests, the iecodebook command creates Excel-based codebooks. The research team can fill these codebooks with data cleaning instructions for Stata. In this way, iecodebook creates a metadata record which is easier to write than a long sequence of data cleaning commands in a do-file. These Excel codebooks are also easier to read and understand, even for someone with no knowledge of Stata. There are four subcommands in iecodebook to support its functions:
- iecodebook apply: Reads an Excel codebook where the user renames, recodes, and/or labels a large number of variables, and applies these changes to the current dataset.
- iecodebook append: Allows two or more datasets to have the same variable names, labels, and value labels. That is, it harmonizes two or more datasets, and appends them.
- iecodebook export: Creates an Excel codebook that describes the current dataset. It can also produce an exportable version of the dataset which only contains the variables used in a particular do-file.
- iecodebook template: Creates an Excel template that describes the current or targeted dataset(s), and prepares the codebook for the other iecodebook subcommands.
Apply
The most common data cleaning tasks include renaming variables, applying variable and value labels, and recoding values. The iecodebook apply subcommand allows the research team to perform all of these tasks without writing separate lines of code for each task in Stata. The following steps describe how iecodebook apply works.
- Create template: iecodebook first converts the dataset into a template in Excel using iecodebook template. In this template, each row corresponds to a single variable, and the columns describe different aspects of that variable, including name, label, type, and so on.
- Complete template: After this, you can simply fill out the template, which creates the codebook. The codebook lists all the data cleaning tasks that you wish to perform on the dataset.
- Apply changes: The iecodebook apply subcommand then reads these instructions, and executes them all with just one line of Stata code. The resulting output is a cleaned dataset, along with an easy-to-read record of the cleaning instructions you applied.
Syntax
The following line of code creates an apply template with the relevant dataset. The template is named filename.xlsx in this case.
iecodebook template using "filename.xlsx"
The following line of code applies the changes to the dataset. It reads the completed codebook from the same file, that is, filename.xlsx.
iecodebook apply using "filename.xlsx" , [drop] [missingvalues(# "label" [# "label" ...])]
Implementation
The following steps use an example to explain how iecodebook apply works in practice.
Step 1: Load the dataset.
First, load the dataset which you wish to clean. In this case, the dataset is named "auto.dta".
sysuse auto.dta , clear
Step 2: Create template.
Next, run the following code to create the template codebook, which is named "cleaning.xlsx" in this case.
iecodebook template using "cleaning.xlsx"
This produces the template codebook in Figure 1, which shows the current state of the data.
Step 3: Complete template
Next, fill in the following columns in the template to specify the relevant cleaning tasks:
- name: Fill in the name column to specify what the rename command will do to the variables in the dataset. For example, to change the name of the make variable to model, enter model in the name column for that variable.
- label: Fill in the label column to specify what the label variable command will do to the variables in the dataset.
- choices: Enter a label name in the choices column to apply a particular value label for a variable. Also create the corresponding value label in the choices sheet. Every template includes a demo yesno label as a guide.
- recode:current: Use the usual syntax (rule) [(rule) ...] in the recode:current column to recode data values.
Note: The data types are given for reference only; the iecodebook command cannot change them. Figure 3 shows an example of what you might write to make some adjustments to the foreign variable.
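For comparison, the codebook columns described above stand in for ordinary Stata cleaning commands. The following is a minimal sketch of the do-file equivalent of a few codebook rows. The specific edits are assumptions made for illustration and are not taken from the figures: the label text, the reversed coding of foreign, and the value label name domestic_lbl are all hypothetical.

```stata
* Hypothetical manual equivalent of a few codebook rows (illustration only)
sysuse auto.dta, clear

* name column: rename make to model
rename make model

* label column: attach a variable label (label text is an assumption)
label variable model "Make and model of car"

* recode:current column: swap the 0/1 coding of foreign
recode foreign (0 = 1) (1 = 0)

* choices column and choices sheet: define a value label and attach it
label define domestic_lbl 0 "Foreign" 1 "Domestic"
label values foreign domestic_lbl
```

With only a handful of variables the do-file is short, but each additional variable adds several more lines, whereas the codebook adds one row.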
Step 4: Apply cleaning commands
To apply the changes, you would then run the following command:
// Apply cleaning commands to open dataset
iecodebook apply using "cleaning.xlsx"
Note that the correct command is created by replacing template with apply, using the same codebook file. By default, all variables with no adjustments are left as-is. This behavior can be changed: the drop option drops from the dataset every variable that has no final variable name in the name column. Alternatively, the user can place a single period (.) in the name column to drop variables one by one. The missingvalues() option propagates global missing-value codes to all value labels. Note also that you will have to manually recreate all value label lists in the choices sheet, but the value labels from your original dataset are available for copy-paste from the choices_current sheet.
Append
A common downstream task in data collection is to combine two or more sequential rounds of surveys or, similarly, to combine similar survey instruments conducted in different settings. This is always harder than it first sounds. Inevitably, updates and/or localization have been made to at least one of the datasets, such that a simple append command will not produce the desired data structure. Most often, these changes cause the following to fall out of sync across datasets:
- Variable names
- Variable labels (including translation)
- Value labels
- Data types
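To make the problem concrete, here is a minimal sketch of a naive append when a single variable name differs between rounds. It uses Stata's built-in auto dataset; the tempfile names round1 and round2 and the renamed variable cost are assumptions made for illustration.

```stata
* Illustration only: a plain append when one variable was renamed between rounds
sysuse auto.dta, clear
tempfile round1 round2
save `round1'

* Pretend the second round stored price under a different name
rename price cost
save `round2'

* Naive append: price and cost end up as two separate, half-empty variables
use `round1', clear
append using `round2'
count if missing(price)    // 74: every observation from the second round
count if missing(cost)     // 74: every observation from the first round
```

Because append matches variables strictly by name, the same quantity ends up split across two columns, each missing for half of the combined data.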
The iecodebook append subcommand offers a rapid workflow for documenting and resolving these differences across multiple datasets. The general syntax of the templating command is:
iecodebook template "/path/to/survey1.dta" "/path/to/survey2.dta" [...] using "/path/to/codebook.xlsx" , surveys(Survey1Name Survey2Name ...) [match] [generate(varname)]
As in iecodebook apply, the correct executing command is formed by replacing template with append. The general syntax of the iecodebook append command is therefore:
iecodebook append "/path/to/survey1.dta" "/path/to/survey2.dta" ... using "/path/to/codebook.xlsx" , surveys(Survey1Name Survey2Name ...) [keepall] [generate(varname)] [missingvalues(# "label" [# "label" ...])]
The surveys() option is required in both steps, and must match between them. Specify the names of the surveys as a list of single words; the command will look for these names in the codebook headers. The command will also create a survey variable in the resulting dataset, labelled with these names. To change the name of that variable, use the generate() option in both commands. To demonstrate the usage, we will create two datasets that have similar data but different structures, then combine them using a codebook. Run the following:
// Create demonstration datasets
sysuse auto.dta , clear
save data1.dta , replace
rename (price mpg) (cost car_mpg)
recode foreign (0=1 "Domestic")(1=0 "Foreign") , gen(origin)
drop foreign
save data2.dta , replace
Harmonize
// Create harmonization codebook template
iecodebook template ///
  "data1.dta" "data2.dta" ///
  using "codebook.xlsx" ///
  , surveys(First Second)
This should produce the following harmonization codebook template:
To resolve the differences, the completed codebook would be modified to look as follows. Note the key functionality of harmonization -- variables from different datasets are placed by the user into the same row, and iecodebook append understands this to mean that they should have the same final instructions applied to them so that they append properly (except, of course, recode; which is why there is one recode: column for each survey as well as choices_ sheets for reference). Specifying the match option does this as best as possible by automatically aligning variables that have the same name in the template.
There are two important differences from the apply syntax. First, the drop option is the default: if no harmonization is specified for a variable, that is, if there is no value in the first four columns (name, label, type, choices), the variable is dropped. (The keepall option may be specified to override this behavior, but the user should check the results carefully.) Again, note that you will have to manually recreate the value label lists in the choices sheet, but the value labels from your original datasets are available for copy-paste from the respective choices_ sheets.
To execute the command, run:
// Harmonize and append the datasets
iecodebook append ///
  "data1.dta" "data2.dta" ///
  using "codebook.xlsx" ///
  , surveys(First Second)
The combined dataset will yield the following crosstabs, and a codebook titled codebook_appended.xlsx will be created in the same location as the append codebook, documenting the final state of the dataset for quick reference.
. ta survey foreign
      Data |        Foreign
    Source |  Domestic    Foreign |     Total
-----------+----------------------+----------
     First |        52         22 |        74
    Second |        52         22 |        74
-----------+----------------------+----------
     Total |       104         44 |       148
Export
The iecodebook export command provides a simple utility for documenting the current state of a dataset, and for preparing a trimmed "release" version of a dataset. The syntax is:
iecodebook export [if] [in] using "/path/to/codebook.xlsx" , [replace] [trim("/path/to/dofile1.do" ["/path/to/dofile2.do"] ...)]
The base command will simply produce a record of the dataset's contents at the specified location. If the trim() option is specified, iecodebook export will read the contents of the specified do-files; drop any variables that do not match the contents; restrict the dataset according to if and in as specified; and save the results as a .dta file with the same name and in the same location as the codebook. (Note that this is a new functionality and is imperfectly implemented: trim() will not, for example, correctly parse macros. Therefore, please check that your results run and reproduce correctly after using this option.)
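As a minimal usage sketch following the syntax above, the two calls below document the built-in auto dataset and then prepare a trimmed version. This assumes iefieldkit is installed. The file names auto_codebook.xlsx and auto_release.xlsx are hypothetical, and analysis.do is a hypothetical do-file that must already exist for the second call to run.

```stata
* Assumes iefieldkit is installed (ssc install iefieldkit)
sysuse auto.dta, clear

* Document the current state of the dataset (hypothetical file name)
iecodebook export using "auto_codebook.xlsx", replace

* Trimmed release version: keeps only variables referenced in the listed do-file
* ("analysis.do" is a hypothetical path and must exist)
iecodebook export using "auto_release.xlsx", replace trim("analysis.do")
```

Given the caveat about trim(), it is worth re-running the listed do-files against the trimmed dataset to confirm that they still reproduce the original results.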
Additional Resources
- DIME Analytics' guidelines on iecodebook