Difference between revisions of "Iecodebook"

Jump to: navigation, search
 
(166 intermediate revisions by 3 users not shown)
Line 1: Line 1:
== iecodebook==
<code>iecodebook</code> is the final command in the Stata package created by [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics], <code>[[iefieldkit]]</code>. After [[Primary Data Collection|data collection]] is complete, <code>iecodebook</code> allows the [[Impact Evaluation Team|research team]] to automatically perform the repetitive steps involved in [[Data Cleaning|cleaning data]] before further [[Data Analysis|analysis]]. As the name suggests, the <code>iecodebook</code> command is structured around Excel-based '''codebooks''', which allow researchers to perform and [[Data Documentation|document]] data cleaning tasks in Excel itself, instead of using '''do-files'''.
== Read First ==
* Please refer to [[Stata Coding Practices|Stata coding practices]] for coding best practices in Stata.
* <code>iecodebook</code> is part of the package <code>[[iefieldkit]]</code>, which has been developed by [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics].
* The <code>iecodebook</code> and its subcommands allow the [[Impact Evaluation Team|research team]] to rapidly [[Iecodebook#Apply|clean]], [[Iecodebook#Harmonize|harmonize]], and [[Iecodebook#Export|document]] datasets using '''codebooks'''.
* Codebooks allow researchers to document the cleaned data in a format that is both human and machine-readable.
* To install <code>iecodebook</code> and all the commands in the <code>[[iefieldkit]]</code> package, type <syntaxhighlight lang="Stata" inline>ssc install iefieldkit</syntaxhighlight> in Stata.
* For instructions and available options, type <syntaxhighlight lang="Stata" inline>help iecodebook</syntaxhighlight>.


Once data collection is complete, it must be cleaned before it can be analyzed. The iecodebook commands are designed to automate repetitive data cleaning tasks in two typical situations: iecodebook apply, where a large number of variables need to have arbitrary rename, recode, or label commands applied to them; and iecodebook append, when two or more datasets need to be harmonized to have the same variable names, labels, and value labels ("choices") in order to be appended together. iecodebook also provides an export subcommand so that a human-readable record of the variables and their labels in a dataset can be instantly created at any time; and a template subcommand which prepares the codebooks for the other syntax.
== Overview ==
As its name suggests, the <code>iecodebook</code> command creates Excel-based '''codebooks'''. The [[Impact Evaluation Team|research team]] can fill these codebooks with [[Data Cleaning|data cleaning]] instructions for Stata. In this way, <code>iecodebook</code> creates a '''metadata record''' which is easier to write than a long sequence of '''data cleaning''' commands in a '''do-file'''. These codebooks in Excel are also easier to read and understand, even if someone does not have knowledge of Stata. There are four subcommands in <code>iecodebook</code> to support its functions:
* <code>[[Iecodebook#Apply|iecodebook apply</code>]] reads an Excel codebook where the user renames, recodes, and/or labels a large number of variables, and applies these changes to the current dataset.


As its name suggests, the iecodebook command is structured around Excel-based "codebooks". The purpose of these codebooks is to document data cleaning after data are exported from the primary data collection software in a format that is both human and machine-readable. By completing these codebooks with data cleaning instructions for Stata, a metadata record is created which is easier to write than a long sequence of data cleaning commands in a dofile; and easier to read later. This functionality is implemented via four subcommands:
* <code>[[Iecodebook#Append and Harmonize|iecodebook append]]</code> allows two or more datasets to have the same variable names, labels, and value labels. That is, it '''harmonizes''' two or more datasets, and then appends them. As you will see, this is different from simply using the <code>append</code> command in Stata.


* <code>[[Iecodebook#Export|iecodebook export]]</code> creates an Excel codebook that describes the current dataset. It can also produce an exportable version of the dataset which only contains the variables used in a particular '''do-file'''.


* iecodebook template, which  creates an Excel template that describes the current or targeted dataset(s), with empty columns for you to specify the changes or harmonization for the other iecodebook commands.
* <code>[[Iecodebook#Template|iecodebook template]]</code> creates an Excel template that describes the current or targeted dataset(s), and prepares the codebook for the other <code>iecodebook</code> subcommands.
* iecodebook apply, which reads an Excel codebook that specifies renames, recodes, variable labels, and value labels, and applies them to the current dataset.
* iecodebook append, which  reads an Excel codebook that specifies how variables should be harmonized across two or more datasets - rename, recode, variable labels, and value labels - applies the harmonization, and appends the datasets.
* iecodebook export, which creates an Excel codebook that describes the current dataset, and optionally produces an export version of the dataset with only variables used in specified dofiles.


== Apply ==
The most common [[Data Cleaning|data cleaning]] tasks include renaming variables, applying variable and value labels, and recoding values. The <code>iecodebook apply</code> subcommand allows the [[Impact Evaluation Team|research team]] to perform all of these tasks without writing separate lines of code for each task in Stata. The following steps list how the <code>iecodebook apply</code> works.
# '''Create template:''' <code>iecodebook</code> first converts the dataset into a template in Excel using <code>iecodebook template</code>. In this template, each column describes different aspects of a single variable, including '''''name''''', '''''label''''', '''''type''''', and so on.
# '''Complete template:''' After this, you can simply fill out the template, which creates the '''codebook'''. The codebook lists all the '''data cleaning''' tasks that you wish to perform on the dataset.
# '''Apply changes:''' The <code>iecodebook apply</code> subcommand then reads these commands, and executes them all with just one line of Stata code. The resulting output is a cleaned dataset, along with an easy-to-read record of the cleaning commands you applied.


===Apply cleaning commands to the open dataset===
=== Syntax ===
The following line of code creates an <code>apply</code> template with the relevant dataset. The template is named '''filename.xlsx''' in this case.
<syntaxhighlight lang="Stata">iecodebook template using "filename.xlsx"</syntaxhighlight>


The most common data cleaning tasks are renaming variables, applying variable and value labels, and recoding values. The iecodebook apply subcommand provides a workflow to execute an arbitrary number of these commands without writing out an arbitrary number of lines of Stata code. Instead, the dataset is first translated into a codebook with each line describing the contents of a single variable. Then, the user fills out a template specifying the cleaning commands they wish to execute. The iecodebook apply subcommand reads these commands and executes them all with a single line of Stata code. The resulting output is a cleaned dataset and a highly-readable record of the cleaning commands applied to it.
The following line of code applies the changes to the dataset. It saves the codebook with the same name, that is, '''filename.xlsx'''.
<syntaxhighlight lang="Stata">iecodebook apply using "filename.xlsx" ///
    , [drop] [missingvalues(# "label" [# "label" ...])]</syntaxhighlight>


First, create an apply template with the dataset open:
=== Implementation ===
The following steps use an example to explain how <code>iecodebook apply</code> works in practice.
==== Step 1: Load the dataset ====
First, load the dataset that you want to clean. In this case, the dataset is named '''"auto.dta"'''.
<syntaxhighlight lang="Stata" line>sysuse auto.dta , clear</syntaxhighlight>


    iecodebook template using "/path/to/codebook.xlsx"
==== Step 2: Create template ==== 
Next, run the following code to create the '''template codebook''', which is named '''"cleaning.xlsx"''' in this case.
<syntaxhighlight lang="Stata" line start="2">iecodebook template using "cleaning.xlsx"</syntaxhighlight>
This produces the template codebook in '''Figure 1''', which shows the state of the data before you make changes.


Next, fill out the template with the specific instructions desired, save it, then apply it to the open data by writing:
[[File:Template-apply.png|800px|thumb|center|''' Figure 1: Data cleaning codebook template''']]


    iecodebook apply using "/path/to/codebook.xlsx"
==== Step 3: Complete template ====
    , [drop] [missingvalues(# "label" [# "label" ...])]
Next, fill-up the following columns in the template to specify the relevant [[Data Cleaning|cleaning]] tasks:
* '''''name''''': Fill the '''''name''''' column in the template to specify what the <code>rename</code> command will do to the variables in the dataset. You can use this to rename a variable. For example, in '''Figure 3''', we rename the '''foreign''' variable to '''domestic''' depending on which of these names is assigned '''"0"''' and '''"1"''' in the '''''choices''''' sheet (shown in '''Figure 6''').


For example, running:
* '''''label''''': Fill the '''''label''''' columns in the template to specify what the <code>label</code> command will do to the variables in the dataset. You can use this to give more information about a variable.


    // Load data
* '''''choices''''': Enter a label name in the '''''choices''''' column to apply a particular value label for a variable. Also create the corresponding value label in the '''''choices''''' sheet. Every template already includes a demo <code>yesno</code> label as a guide. For example, in '''Figure 3''', we have applied the <code>yesno</code> value label to the variable with the name '''domestic''', to indicate if a car is domestically made or not.  We have also applied the '''origin''' value label to the variable with the name '''foreign'''. We will also create the '''origin''' value label in the '''''choices''''' sheet, for instance by assigning values of '''"0"''' to '''domestic''', and '''"1"''' to '''foreign'''.
    sysuse auto.dta , clear


    // Create cleaning template
* '''''recode:current''''': Use the usual syntax <code>(''rule'') [(''rule'') ...]</code> in the '''''recode:current''''' column to <code>recode</code> data values. For example, in '''Figure 3''', we use <code>(0=1)(1=0)</code> to indicate that the label for the value of '''"0"''' has now been assigned to '''"1"''' , and the label for the value of '''"1"''' has now been assigned to '''"0"'''. Using the example above, this means that now '''foreign''' has a value of '''"0"''', and '''domestic''' has a value of '''"1"'''.
    iecodebook template using "codebook.xlsx"


Produces the following template codebook reflecting the current state of the data:
'''Note:''' The data types are given for reference only; you cannot use <code>iecodebook</code> to change them. '''Figure 3''' shows how you can make the above changes to the '''foreign''' variable.


[[File:Apply-template.png]]
[[File:Template-complete.png|800px||thumb|center|'''Figure 2: <code>iecodebook apply</code> template filled out with changes to be applied''']]


To apply changes to the data, complete the "name" and "label" columns to prepare rename and label variable commands for the current dataset, respectively. To apply value labels, enter a label name in the "choices" column and create the corresponding value label in the choices sheet (every template includes a demo yesno label as a guide). To recode data values, use the usual syntax (rule) [(rule) ...] in the "recode:current" column. You cannot change data types with this command; these are provided for reference only. For example, you might write the following to make some adjustments to the foreign variable:
==== Step 4: Apply cleaning commands ====
Finally, apply the changes using the following command, and save the codebook with the same name as before - '''"cleaning.xlsx"'''.
<syntaxhighlight lang="Stata" line start="3">iecodebook apply using "cleaning.xlsx"</syntaxhighlight>


[[File:Apply-complete.png]]
'''Note:''' Keep the following points in mind when using <code>iecodebook apply</code>.
* '''Default:''' By default, all variables where you do not make changes will be the same as before.


To apply the changes, you would then run the following command:
* '''Dropping variables:''' You can also use <code>iecodebook apply</code> to drop variables from the dataset, using the <code>drop</code> option, or using single periods (<code>.</code>).
** <code>drop</code>: You can simply use the <code>drop</code> option to drop variables from the dataset that have no final variable name under the '''''name''''' column.
** Single period (<code>.</code>): Alternatively, you can place a single period (<code>.</code>) under the '''''name''''' column to drop variables one by one.


    // Apply cleaning commands to open dataset
* <code>missingvalues()</code>: The <code>missingvalues()</code> option allows you to add '''global''' missing-value codes to all value labels. For example, if you use extended missing values consistently you can have <code>.a</code> or <code>.n</code> have the same meaning in all responses.
    iecodebook apply using "codebook.xlsx"


Note that the correct command is created by replacing template with apply. By default, all variables with no adjustments will be left as-is. However, this is not required: the drop option orders all variables that have no final variable name in the name column to be dropped from the dataset. The missingvalues() option allows global missing-value codes to be propagated to all value labels. Note also that you will have to manually recreate all value label lists in the choices sheet, but that the data labels from your original dataset is available for copy-paste from the choices_current sheet.
* '''Value labels:''' You will have to manually recreate all value label lists in the '''''choices''''' sheet. However, you can copy-paste data labels from your original dataset to the '''''choices_current''''' sheet.


===Append and harmonize multiple datasets===
== Append and Harmonize ==
A common task in [[Primary Data Collection|data collection]] is combining two or more sequential rounds of [[Field Surveys|surveys]], or combining similar [[Questionnaire Design|survey instruments]] that were used in different contexts. This is often a tricky process, and at least one of the datasets might not be correctly updated. In such a case, simply using the '''append''' command in Stata will not provide you with the desired structure for your dataset. In such cases, it is possible that certain aspects may not be synchronized or '''harmonized''' across the two datasets, such as:
* '''Variable names'''
* '''Variable labels (including translations)'''
* '''Value labels'''
* '''Data types'''
In these cases, <code>iecodebook append</code> offers a quick option to [[Data Documentation|document]] and resolve these differences across multiple datasets. The following steps list how <code>iecodebook append</code> works.
# '''Create template:''' Just as in the case of <code>iecodebook apply</code>, use <code>iecodebook template</code> to first convert the dataset into a '''harmonization template''' in Excel. In this template, each column describes different aspects of a variable, including '''''name''''', '''''label''''', '''''type''''', and so on. The only difference is that in this case, <code>iecodebook</code> also creates a new variable called '''survey''' by default. The value label for this variable will contain the name of the surveys that you specify under the <code>surveys()</code> option for <code>iecodebook template</code>.
# '''Complete template:''' After this, you can simply fill out the template in Excel. In this template, you can specify the rules to <code>append</code> the datasets, and resolve differences across the two datasets. For instance, you can place certain variables in the same row, and <code>iecodebook append</code> will understand this to mean that we want those variables to have the same values under the '''''name''''', '''''label''''' and '''''choices''''' columns so that they can <code>append</code> properly.
# '''Append datasets:''' The <code>iecodebook append</code> subcommand then reads the rules that you specify in the template, and uses them to finally <code>append</code> the datasets. The resulting output is a '''harmonized''' dataset with all the differences across the two datasets now resolved.
=== Syntax ===
The general syntax of <code>iecodebook template</code> in this case is as follows:
<syntaxhighlight lang="Stata" line start="4">iecodebook template "filename.dta" "filename.dta" [...] ///
    using "filename.xlsx" ///
    , surveys(Survey1Name Survey2Name [...]) [match]</syntaxhighlight>


A common downstream task in data collection is to combine two or more sequential rounds of surveys; or, similarly, to combine similar survey instruments conducted in different settings. This is always harder than it first sounds. Inevitably, updates and/or localization have been made to at least one of the datsets, such that a simple append command will not produce the desired data structure. Most often, these changes cause desynchronisation of:
The following is the syntax to run <code>iecodebook append</code> subcommand based on the rules you specify in the '''harmonization template''':
<syntaxhighlight lang="Stata" line start="8">iecodebook append "filename.dta" "filename.dta" [...] ///
    using "filename.xlsx" ///
    , surveys(Survey1Name Survey2Name [...]) ///
    [generate(varname)] [keepall] [report] [replace] ///
    [missingvalues(# "label" [# "label" ...])]</syntaxhighlight>


* Variable names
The following points explain the options that are used with the <code>iecodebook template</code> and <code>iecodebook append</code> subcommands:
* Variable labels (including translation)
* <code>match</code>: The <code>match</code> option automatically aligns variables from other datasets in the same row if they share a name with a variable in the first dataset. It is optional for the <code>template</code> subcommand only.
* Value labels
* Categorical options
* Data types


The iecodebook append subcommand offers a rapid workflow for documenting and resolving these differences across multiple datasets. The general syntax of the templating command is:
* <code>surveys()</code>: The <code>surveys()</code> option must be used with both subcommands, and the names you specify under this option must be the same for both subcommands. Specify the names of the surveys as a list of single words - <code>(''Survey1Name'' ''Survey2Name'' [...])</code>. <code>iecodebook</code> will look for these names in the headers of the codebook. The command will also create a '''survey''' variable in the resulting dataset by default. The value label for this variable will contain the name of the surveys that you specify under the <code>surveys()</code> option.


  iecodebook template
* <code>generate()</code>: To change the name of the '''survey''' variable, use the <code>generate()</code> option in both subcommands.
    "/path/to/survey1.dta" "/path/to/survey2.dta" [...]
    using "/path/to/codebook.xlsx"
    , surveys(Survey1Name Survey2Name ...)


As in iecodebook apply, the correct executing command is formed by replacing template with append. The general syntax of the iecodebook append command is therefore:
* <code>report</code>: The <code>report</code> option exports a codebook with the results of the resulting dataset for quick reference.


  iecodebook append
* <code>replace</code>: The <code>replace</code> option allows you to overwrite the existing file which contains the codebook.
    "/path/to/survey1.dta" "/path/to/survey2.dta" ...
    using "/path/to/codebook.xlsx"
    , surveys(Survey1Name Survey2Name ...) [nodrop]
      [missingvalues(# "label" [# "label" ...])]


The surveys() option is required in both steps, and must match between them. As a list of single words, the users should specifiy the names of the surveys (which the command will look for in the codebook headers). The command will also create a survey variable in the resulting dataset, labelled with these names -- so your current dataset cannot have a variabel called survey. To demonstrate the usage, we will create two datasets that have similar data but with different structures, then combine them using a codebook. Run the following:
=== Implementation ===
The following steps show how <code>iecodebook append</code> works in practice.


    // Create demonstration datasets
==== Step 1: Create two different datasets ====
    sysuse auto.dta , clear
First, create two datasets that have similar data but different structures.
    save data1.dta , replace
Run the following code for this.
    rename (price mpg)(cost car_mpg)
<syntaxhighlight lang="Stata" line> sysuse auto.dta , clear
        recode foreign (0=1 "Domestic")(1=0 "Foreign") , gen(origin)
save data1.dta , replace
        drop foreign
rename price cost
    save data2.dta , replace
rename mpg car_mpg
    recode foreign (0=1 "Domestic")(1=0 "Foreign") , gen(origin)  
    drop foreign  
save data2.dta , replace</syntaxhighlight>
'''Note:''' We have used the same dataset, '''"auto.dta"''', to create two different datsets, '''"data1.dta"''' and '''"data2.dta"'''


    // Create harmonization codebook template
==== Step 2: Combine the datasets ====
    iecodebook template         ///
Next, combine the datasets to create a '''harmonization codebook template''' in Excel using <code>iecodebook template</code>. Name this file '''"harmonization.xlsx"''''. This is shown in '''Figure 3'''. Run the following code.
      "data1.dta" "data2.dta"   ///
<syntaxhighlight lang="Stata" line start="8"> iecodebook template "data1.dta" "data2.dta" ///
      using "codebook.xlsx"     ///
    using "harmonization.xlsx" ///
      , surveys(First Second)
    , surveys(First Second)</syntaxhighlight>
'''Note:''' The names of the surveys are '''First''' and '''Second'''. These names appear in the column names in '''Figure 3''' below as follows -  '''''name:First''''', '''''name:Second''''', '''''label:First''''', '''''label:Second''''', and so on.


This should produce the following harmonization codebook template:
[[File:Append-temp.png|1000px|thumb|center|'''Figure 3: <code>iecodebook append</code> harmonization codebook template''']]


[[File:Append-template.png]]
If you use the <code>match</code> option, the '''harmonization template''' will appear as shown in '''Figure 4''' below. Note the following about the <code>match</code> option:
* In this case, the variables are ordered according to how they appear in the first dataset, that is, '''"data1.dta"'''. However, the order of these variables in the underlying dataset does not get affected, it is only for the purpose of the codebook.  


To resolve the differences, the completed codebook would be modified to look as follows. Note the key functionality of harmonization -- variables from different datasets are placed by the user into the same row, and iecodebook append understands this to mean that they should have the same final instructions applied to them so that they append properly (except, of course, recode; which is why there is one recode: column for each survey as well as choices_ sheets for reference).
* <code>iecodebook</code> will never reorder variables beyond the functionality of the built-in <code>append</code> subcommand.


[[File:Append-complete.png]]


There are two important differences from the apply syntax. First, the drop option is the default: that is, if there is no name harmonization specified, that is, if there is no value in the first four columns (name, label, type, choices), variables are dropped. (The nodrop option may be specified to override this behavior, but the user should check the results carefully.) Again, note that you will have to manually recreate the value label lists in the choices sheet, but that the data labels from your original datasets are available for copy-paste from respective choices_ sheets.
[[File:App-match.png|1000px|thumb|center|'''Figure 4: <code>iecodebook append</code> harmonization codebook template using the match option''']]


[[File:Append-choices.png]]
==== Step 3: Modify the codebook ====
Next, to resolve the differences between the two datasets - '''"data1.dta"''' and '''"data2.dta"''' , modify the completed codebook as shown in '''Figure 5'''. In order to '''harmonize''' the datasets, place variables from different datasets into the same row. This means that for <code>iecodebook append</code> to <code>append</code> properly, make sure that the variables have the same final variable names, labels, and value labels applied to them in the harmonization template. In our example, it means that for the variables that you want to <code>append</code>, ensure that each of them have the same values under '''''name:First''''' and '''''name:Second''''' columns, and so on for the '''''label:''''' and '''''choices:''''' columns.  


To execute the command, run:
[[File:Harmonize-comp.png|1000px|thumb|center|'''Figure 5: <code>iecodebook append</code> codebook with harmonization instructions''']]


    // Harmonize and append the datasets
    iecodebook append          ///
      "data1.dta" "data2.dta"  ///
      using "codebook.xlsx"    ///
      , surveys(First Second)


The combined dataset will yield the following crosstabs, and a codebook titled codebook\_appended.xlsx will be created in the same location as the append codebook documenting the final state of the dataset for quick reference.
'''Note:''' Keep the following points in mind when modifying the codebook.
* '''''recode:''''' '''columns:''' You will need to handle the <code>recode</code> commands dataset-by-dataset, instead of just re-arranging the variables. Which is why there is one '''''recode:''''' column for each survey (which acts as a source of data), as well as '''''choices_''''' sheets for reference.
* '''Value labels:''' As with  <code>iecodebook apply</code>, you will have to manually recreate the value label lists in the '''''choices''''' sheet.  However, you can copy-paste data labels from your original datasets to the '''''choices_First''''' and '''''choices_Second''''' sheets respectively. This is shown in '''Figure 6''' below.
 
 
[[File:Choices-harmonize.png|700px|thumb|center|'''Figure 6: ''choices'' sheet in a codebook]]
 
==== Step 4: Append the datasets ====
Finally, <code>append</code> the '''harmonized''' datasets using the following code. Save the codebook with the same name, that is, '''"harmonization.xlsx" '''.
<syntaxhighlight lang="Stata" line start="11"> iecodebook append "data1.dta" "data2.dta" ///
    using "harmonization.xlsx" ///
    , clear surveys(First Second)</syntaxhighlight>
 
'''Note:''' There are two important differences between the syntax for <code>iecodebook append</code> and <code>iecodebook apply</code>:
# '''Default is to drop:''' By default, the <code>iecodebook append</code> command will only keep those variables to which you explicitly give final names in the '''''name''''' column. This is to encourage you to manually review each variable.
# <code>keepall</code>: You may use the <code>keepall</code> option to retain all variables from all datasets, except the variables you have tagged for deletion by entering a single period (<code>.</code>) in the '''''name''''' column. This allows you to override the default drop option. However, you should still check the final dataset carefully because if you <code>append</code> variables without explicitly reviewing them, it may again cause problems with synchronization.
 
==== Output ====
The combined dataset will yield the following crosstabs. If you specify the <code>report</code> option, <code>iecodebook</code> will also create another codebook titled '''"codebook report.xlsx"''', in the same location as the '''harmonization codebook'''. This report [[Data Documentation|documents]] the final state of the dataset for quick reference.


     . ta survey foreign
     . ta survey foreign
Line 121: Line 176:
         Total |      104        44 |      148
         Total |      104        44 |      148


===Export a codebook for an existing dataset===
== Export ==
Finally, <code>iecodebook export</code> provides a simple way to [[Data Documentation|document]] the current state of a dataset, as well as perform simple automated data quality control tasks.
=== Syntax ===
The syntax for <code>iecodebook export</code> is as follows:
<syntaxhighlight lang="Stata">iecodebook export ///
    using "filename.xlsx" ///
    , [replace] [noexcel] [plaintext(compact|detailed)] ///
      [save] [saveas(filename.dta)] ///
      [verify] [signature] [reset] ///
      [trim("filename.do" ["filename.do"] [...])]
  </syntaxhighlight>
 
The basic command will simply produce a record of the dataset’s contents in a specified location in <code>.xlsx</code> format, similar to the features above. Requesting a <code>plaintext</code> version will give either a compact or detailed codebook using the <code>codebook</code> command. The <code>noexcel</code> option will turn off the <code>.xlsx</code> output.
 
=== Data Management and Quality Assurance ===
In addition to the basic codebook creation, <code>iecodebook export</code> offers several tools for automatic data management and quality assurance. First, it allows the user to save a copy of the data in the same location of the codebook using the <code>save</code> option. By default, that <code>.dta</code> file will have the same name as the codebook. Specifying <code>saves()</code> allows the user to change the name of the <code>.dta</code> file.
 
Second, it allows the user to verify the contents of data in several ways. If <code>iecodebook export</code> has already been used to place a codebook in the destination location, the <code>verify</code> option will ensure that the structure of the data being saved exactly matches the data structure the codebook details -- that no variables have been added or removed, that no labels or value labels have changed, and that no data types have changed. This is useful when data is being updated but the user wants to ensure that all downstream code will continue to run.
 
The user can also specify the <code>signature</code> option to create a corresponding <code>datasignature</code> in the destination folder. Specified initially with <code>reset</code>, the command will place a <code>datasignature</code> alongside the saved data. it will be named the same as the saved data file, with <code>-sig.txt</code> replacing <code>.dta</code>.
 
In future runs, the option will check whether a <code>datasignature</code> file with the same name is already present. If it is, and it is different, the command will return an error. This is useful when data should not be changing at all. It will also break if no <code>datasignature</code> is present. As in initializing the signature, <code>reset</code> will override the error and allow an updated signature to be written.
 
Finally, the <code>trim()</code> option allows the user to automatically reduce the variables in the dataset to exactly match those needed for a specific set of do-files. This is useful when creating reproducibility packages, and is also useful for removing PII data (which should not be called in analysis do-files, for instance). If the <code>trim()</code> option is specified, <code>iecodebook export</code> will work as described below:
** '''Read do-file:'''  <code>iecodebook export</code> will first read the contents of the specified '''do-files'''.
** '''Drop irrelevant variables:''' It will then drop any variables that do not match the contents of the '''do-files'''.
** '''Save results.''' Finally, it will save the results in the same location as the codebook, in the form of a '''.dta file''' with the same name.
 
Note that <code>trim()</code> will not work with all do-files. For example, <code>trim()</code> will not correctly '''parse''' code that relies on '''macros''' to select variables; and it may not work correctly with abbreviations or wildcards, which are strongly discouraged. Therefore, you should always check that your results [[Reproducible Research|reproduce]] correctly after using this option.
 
Consider the following example to understand how the <code>iecodebook export</code> subcommand and the <code>trim()</code> option will work. Consider a '''do-file''' titled '''"analysis.do"''', which only contains one line of code:
<syntaxhighlight lang="Stata">sum foreign mpg trunk</syntaxhighlight>
 
Along with this, we also have the dataset we used previously, '''"auto.dta"''' dataset. Now we run the following code:
<syntaxhighlight lang="Stata">iecodebook export using "codebook-trim.xlsx" , trim("analysis.do")</syntaxhighlight>
 
Then this code saves a codebook called '''"codebook-trim.xlsx"''', and a dataset called '''"codebook-trim.dta"''' in the same location. Since only the variables '''foreign''', '''mpg''', and '''trunk''' are mentioned in the '''do-file''' titled '''"analysis.do"''', both - '''"codebook-trim.xlsx"''' and '''"codebook-trim.dta"''' - will only contain these 3 variables.


The iecodebook export command provides a simple utility for documenting the current state of a dataset, and for preparing a trimmed "release" version of a dataset. The syntax is:
== Related Pages ==
[[Special:WhatLinksHere/Iecodebook|Click here for pages that link to this topic.]]<br>
This page is part of the topic <code>[[iefieldkit]]</code>.


  iecodebook export [if] [in] using "/path/to/codebook.xlsx"
== Additional Resources ==
    , [trim("/path/to/dofile1.do" ["/path/to/dofile2.do"] ...)]
* DIME Analytics (World Bank), [https://github.com/worldbank/iefieldkit The <code>iefieldkit</code> GitHub page]
* DIME Analytics (World Bank) [https://github.com/worldbank/DIME-Resources/blob/master/onboarding-6-iecodebook.pdf Data Cleaning with <code>iecodebook</code>]


The base command will simply produce a record of the dataset's contents at the specified location. If the trim() option is specified, iecodebook export will read the contents of the specified dofiles; drop any variables that do not match the contents; restrict the dataset according to if and in as specified; and save the results in the same location as the codebook as a .dta file with the same name. (Note that this is a new functionality and is imperfectly implemented: trim() will not, for example, correctly parse variable lists or wildcards; and it will match variables which are sub-strings of other variables. Therefore, please check that your results run and reproduce correctly after using this option.)
[[Category: Stata ]]
==Additional Resources==
*DIME Analytics' guidelines on [https://github.com/worldbank/DIME-Resources/blob/master/onboarding-6-iecodebook.pdf iecodebook]

Latest revision as of 23:33, 7 February 2022

iecodebook is the final command in the Stata package created by DIME Analytics, iefieldkit. After data collection is complete, iecodebook allows the research team to automatically perform the repetitive steps involved in cleaning data before further analysis. As the name suggests, the iecodebook command is structured around Excel-based codebooks, which allow researchers to perform and document data cleaning tasks in Excel itself, instead of using do-files.

Read First

  • Please refer to Stata coding practices for coding best practices in Stata.
  • iecodebook is part of the package iefieldkit, which has been developed by DIME Analytics.
  • The iecodebook and its subcommands allow the research team to rapidly clean, harmonize, and document datasets using codebooks.
  • Codebooks allow researchers to document the cleaned data in a format that is both human and machine-readable.
  • To install iecodebook and all the commands in the iefieldkit package, type ssc install iefieldkit in Stata.
  • For instructions and available options, type help iecodebook.

Overview

As its name suggests, the iecodebook command creates Excel-based codebooks. The research team can fill these codebooks with data cleaning instructions for Stata. In this way, iecodebook creates a metadata record which is easier to write than a long sequence of data cleaning commands in a do-file. These codebooks in Excel are also easier to read and understand, even if someone does not have knowledge of Stata. There are four subcommands in iecodebook to support its functions:

  • iecodebook apply reads an Excel codebook where the user renames, recodes, and/or labels a large number of variables, and applies these changes to the current dataset.
  • iecodebook append allows two or more datasets to have the same variable names, labels, and value labels. That is, it harmonizes two or more datasets, and then appends them. As you will see, this is different from simply using the append command in Stata.
  • iecodebook export creates an Excel codebook that describes the current dataset. It can also produce an exportable version of the dataset which only contains the variables used in a particular do-file.
  • iecodebook template creates an Excel template that describes the current or targeted dataset(s), and prepares the codebook for the other iecodebook subcommands.

Apply

The most common data cleaning tasks include renaming variables, applying variable and value labels, and recoding values. The iecodebook apply subcommand allows the research team to perform all of these tasks without writing separate lines of code for each task in Stata. The following steps list how the iecodebook apply works.

  1. Create template: iecodebook first converts the dataset into a template in Excel using iecodebook template. In this template, each column describes different aspects of a single variable, including name, label, type, and so on.
  2. Complete template: After this, you can simply fill out the template, which creates the codebook. The codebook lists all the data cleaning tasks that you wish to perform on the dataset.
  3. Apply changes: The iecodebook apply subcommand then reads these commands, and executes them all with just one line of Stata code. The resulting output is a cleaned dataset, along with an easy-to-read record of the cleaning commands you applied.

Syntax

The following line of code creates an apply template with the relevant dataset. The template is named filename.xlsx in this case.

iecodebook template using "filename.xlsx"

The following line of code applies the changes to the dataset. It saves the codebook with the same name, that is, filename.xlsx.

iecodebook apply using "filename.xlsx" ///
    , [drop] [missingvalues(# "label" [# "label" ...])]

Implementation

The following steps use an example to explain how iecodebook apply works in practice.

Step 1: Load the dataset

First, load the dataset that you want to clean. In this case, the dataset is named "auto.dta".

sysuse auto.dta , clear

Step 2: Create template

Next, run the following code to create the template codebook, which is named "cleaning.xlsx" in this case.

iecodebook template using "cleaning.xlsx"

This produces the template codebook in Figure 1, which shows the state of the data before you make changes.

Figure 1: Data cleaning codebook template

Step 3: Complete template

Next, fill-up the following columns in the template to specify the relevant cleaning tasks:

  • name: Fill the name column in the template to specify what the rename command will do to the variables in the dataset. You can use this to rename a variable. For example, in Figure 3, we rename the foreign variable to domestic depending on which of these names is assigned "0" and "1" in the choices sheet (shown in Figure 6).
  • label: Fill the label columns in the template to specify what the label command will do to the variables in the dataset. You can use this to give more information about a variable.
  • choices: Enter a label name in the choices column to apply a particular value label for a variable. Also create the corresponding value label in the choices sheet. Every template already includes a demo yesno label as a guide. For example, in Figure 3, we have applied the yesno value label to the variable with the name domestic, to indicate if a car is domestically made or not. We have also applied the origin value label to the variable with the name foreign. We will also create the origin value label in the choices sheet, for instance by assigning values of "0" to domestic, and "1" to foreign.
  • recode:current: Use the usual syntax (rule) [(rule) ...] in the recode:current column to recode data values. For example, in Figure 3, we use (0=1)(1=0) to indicate that the label for the value of "0" has now been assigned to "1" , and the label for the value of "1" has now been assigned to "0". Using the example above, this means that now foreign has a value of "0", and domestic has a value of "1".

Note: The data types are given for reference only; you cannot use iecodebook to change them. Figure 3 shows how you can make the above changes to the foreign variable.

Figure 2: iecodebook apply template filled out with changes to be applied

Step 4: Apply cleaning commands

Finally, apply the changes using the following command, and save the codebook with the same name as before - "cleaning.xlsx".

iecodebook apply using "cleaning.xlsx"

Note: Keep the following points in mind when using iecodebook apply.

  • Default: By default, all variables where you do not make changes will be the same as before.
  • Dropping variables: You can also use iecodebook apply to drop variables from the dataset, using the drop option, or using single periods (.).
    • drop: You can simply use the drop option to drop variables from the dataset that have no final variable name under the name column.
    • Single period (.): Alternatively, you can place a single period (.) under the name column to drop variables one by one.
  • missingvalues(): The missingvalues() option allows you to add global missing-value codes to all value labels. For example, if you use extended missing values consistently you can have .a or .n have the same meaning in all responses.
  • Value labels: You will have to manually recreate all value label lists in the choices sheet. However, you can copy-paste data labels from your original dataset to the choices_current sheet.

Append and Harmonize

A common task in data collection is combining two or more sequential rounds of surveys, or combining similar survey instruments that were used in different contexts. This is often a tricky process, and at least one of the datasets might not be correctly updated. In such a case, simply using the append command in Stata will not provide you with the desired structure for your dataset. In such cases, it is possible that certain aspects may not be synchronized or harmonized across the two datasets, such as:

  • Variable names
  • Variable labels (including translations)
  • Value labels
  • Data types

In these cases, iecodebook append offers a quick option to document and resolve these differences across multiple datasets. The following steps list how iecodebook append works.

  1. Create template: Just as in the case of iecodebook apply, use iecodebook template to first convert the dataset into a harmonization template in Excel. In this template, each column describes different aspects of a variable, including name, label, type, and so on. The only difference is that in this case, iecodebook also creates a new variable called survey by default. The value label for this variable will contain the name of the surveys that you specify under the surveys() option for iecodebook template.
  2. Complete template: After this, you can simply fill out the template in Excel. In this template, you can specify the rules to append the datasets, and resolve differences across the two datasets. For instance, you can place certain variables in the same row, and iecodebook append will understand this to mean that we want those variables to have the same values under the name, label and choices columns so that they can append properly.
  3. Append datasets: The iecodebook append subcommand then reads the rules that you specify in the template, and uses them to finally append the datasets. The resulting output is a harmonized dataset with all the differences across the two datasets now resolved.

Syntax

The general syntax of iecodebook template in this case is as follows:

iecodebook template "filename.dta" "filename.dta" [...] ///
    using "filename.xlsx" ///
    , surveys(Survey1Name Survey2Name [...]) [match]

The following is the syntax to run iecodebook append subcommand based on the rules you specify in the harmonization template:

iecodebook append "filename.dta" "filename.dta" [...] ///
    using "filename.xlsx" ///
    , surveys(Survey1Name Survey2Name [...]) ///
    [generate(varname)] [keepall] [report] [replace] ///
    [missingvalues(# "label" [# "label" ...])]

The following points explain the options that are used with the iecodebook template and iecodebook append subcommands:

  • match: The match option automatically aligns variables from other datasets in the same row if they share a name with a variable in the first dataset. It is optional for the template subcommand only.
  • surveys(): The surveys() option must be used with both subcommands, and the names you specify under this option must be the same for both subcommands. Specify the names of the surveys as a list of single words - (Survey1Name Survey2Name [...]). iecodebook will look for these names in the headers of the codebook. The command will also create a survey variable in the resulting dataset by default. The value label for this variable will contain the name of the surveys that you specify under the surveys() option.
  • generate(): To change the name of the survey variable, use the generate() option in both subcommands.
  • report: The report option exports a codebook with the results of the resulting dataset for quick reference.
  • replace: The replace option allows you to overwrite the existing file which contains the codebook.

Implementation

The following steps show how iecodebook append works in practice.

Step 1: Create two different datasets

First, create two datasets that have similar data but different structures. Run the following code for this.

 sysuse auto.dta , clear
 save data1.dta , replace
 rename price cost
 rename mpg car_mpg
    recode foreign (0=1 "Domestic")(1=0 "Foreign") , gen(origin) 
    drop foreign 
 save data2.dta , replace

Note: We have used the same dataset, "auto.dta", to create two different datsets, "data1.dta" and "data2.dta"

Step 2: Combine the datasets

Next, combine the datasets to create a harmonization codebook template in Excel using iecodebook template. Name this file "harmonization.xlsx"'. This is shown in Figure 3. Run the following code.

 iecodebook template "data1.dta" "data2.dta" ///
    using "harmonization.xlsx" ///
    , surveys(First Second)

Note: The names of the surveys are First and Second. These names appear in the column names in Figure 3 below as follows - name:First, name:Second, label:First, label:Second, and so on.

Figure 3: iecodebook append harmonization codebook template

If you use the match option, the harmonization template will appear as shown in Figure 4 below. Note the following about the match option:

  • In this case, the variables are ordered according to how they appear in the first dataset, that is, "data1.dta". However, the order of these variables in the underlying dataset does not get affected, it is only for the purpose of the codebook.
  • iecodebook will never reorder variables beyond the functionality of the built-in append subcommand.


Figure 4: iecodebook append harmonization codebook template using the match option

Step 3: Modify the codebook

Next, to resolve the differences between the two datasets - "data1.dta" and "data2.dta" , modify the completed codebook as shown in Figure 5. In order to harmonize the datasets, place variables from different datasets into the same row. This means that for iecodebook append to append properly, make sure that the variables have the same final variable names, labels, and value labels applied to them in the harmonization template. In our example, it means that for the variables that you want to append, ensure that each of them have the same values under name:First and name:Second columns, and so on for the label: and choices: columns.

Figure 5: iecodebook append codebook with harmonization instructions


Note: Keep the following points in mind when modifying the codebook.

  • recode: columns: You will need to handle the recode commands dataset-by-dataset, instead of just re-arranging the variables. Which is why there is one recode: column for each survey (which acts as a source of data), as well as choices_ sheets for reference.
  • Value labels: As with iecodebook apply, you will have to manually recreate the value label lists in the choices sheet. However, you can copy-paste data labels from your original datasets to the choices_First and choices_Second sheets respectively. This is shown in Figure 6 below.


Figure 6: choices sheet in a codebook

Step 4: Append the datasets

Finally, append the harmonized datasets using the following code. Save the codebook with the same name, that is, "harmonization.xlsx" .

 iecodebook append "data1.dta" "data2.dta" ///
    using "harmonization.xlsx" ///
    , clear surveys(First Second)

Note: There are two important differences between the syntax for iecodebook append and iecodebook apply:

  1. Default is to drop: By default, the iecodebook append command will only keep those variables to which you explicitly give final names in the name column. This is to encourage you to manually review each variable.
  2. keepall: You may use the keepall option to retain all variables from all datasets, except the variables you have tagged for deletion by entering a single period (.) in the name column. This allows you to override the default drop option. However, you should still check the final dataset carefully because if you append variables without explicitly reviewing them, it may again cause problems with synchronization.

Output

The combined dataset will yield the following crosstabs. If you specify the report option, iecodebook will also create another codebook titled "codebook report.xlsx", in the same location as the harmonization codebook. This report documents the final state of the dataset for quick reference.

   . ta survey foreign
         Data |        Foreign
       Source |  Domestic    Foreign |     Total
   -----------+----------------------+----------
        First |        52         22 |        74
       Second |        52         22 |        74
   -----------+----------------------+----------
        Total |       104         44 |       148

Export

Finally, iecodebook export provides a simple way to document the current state of a dataset, as well as perform simple automated data quality control tasks.

Syntax

The syntax for iecodebook export is as follows:

iecodebook export ///
    using "filename.xlsx" ///
    , [replace] [noexcel] [plaintext(compact|detailed)] ///
      [save] [saveas(filename.dta)] ///
      [verify] [signature] [reset] ///
      [trim("filename.do" ["filename.do"] [...])]

The basic command will simply produce a record of the dataset’s contents in a specified location in .xlsx format, similar to the features above. Requesting a plaintext version will give either a compact or detailed codebook using the codebook command. The noexcel option will turn off the .xlsx output.

Data Management and Quality Assurance

In addition to the basic codebook creation, iecodebook export offers several tools for automatic data management and quality assurance. First, it allows the user to save a copy of the data in the same location of the codebook using the save option. By default, that .dta file will have the same name as the codebook. Specifying saves() allows the user to change the name of the .dta file.

Second, it allows the user to verify the contents of data in several ways. If iecodebook export has already been used to place a codebook in the destination location, the verify option will ensure that the structure of the data being saved exactly matches the data structure the codebook details -- that no variables have been added or removed, that no labels or value labels have changed, and that no data types have changed. This is useful when data is being updated but the user wants to ensure that all downstream code will continue to run.

The user can also specify the signature option to create a corresponding datasignature in the destination folder. Specified initially with reset, the command will place a datasignature alongside the saved data. it will be named the same as the saved data file, with -sig.txt replacing .dta.

In future runs, the option will check whether a datasignature file with the same name is already present. If it is, and it is different, the command will return an error. This is useful when data should not be changing at all. It will also break if no datasignature is present. As in initializing the signature, reset will override the error and allow an updated signature to be written.

Finally, the trim() option allows the user to automatically reduce the variables in the dataset to exactly match those needed for a specific set of do-files. This is useful when creating reproducibility packages, and is also useful for removing PII data (which should not be called in analysis do-files, for instance). If the trim() option is specified, iecodebook export will work as described below:

    • Read do-file: iecodebook export will first read the contents of the specified do-files.
    • Drop irrelevant variables: It will then drop any variables that do not match the contents of the do-files.
    • Save results. Finally, it will save the results in the same location as the codebook, in the form of a .dta file with the same name.

Note that trim() will not work with all do-files. For example, trim() will not correctly parse code that relies on macros to select variables; and it may not work correctly with abbreviations or wildcards, which are strongly discouraged. Therefore, you should always check that your results reproduce correctly after using this option.

Consider the following example to understand how the iecodebook export subcommand and the trim() option will work. Consider a do-file titled "analysis.do", which only contains one line of code:

sum foreign mpg trunk

Along with this, we also have the dataset we used previously, "auto.dta" dataset. Now we run the following code:

iecodebook export using "codebook-trim.xlsx" , trim("analysis.do")

Then this code saves a codebook called "codebook-trim.xlsx", and a dataset called "codebook-trim.dta" in the same location. Since only the variables foreign, mpg, and trunk are mentioned in the do-file titled "analysis.do", both - "codebook-trim.xlsx" and "codebook-trim.dta" - will only contain these 3 variables.

Related Pages

Click here for pages that link to this topic.
This page is part of the topic iefieldkit.

Additional Resources