Difference between revisions of "Iecodebook"

Jump to: navigation, search
Line 83: Line 83:
* '''Data types.'''
* '''Data types.'''
In these cases, '''<code>iecodebook append</code>''' offers a quick option to [[Data Documentation|document]] and resolve these differences across multiple datasets. The following steps list how the '''<code>iecodebook append</code>''' works.
In these cases, '''<code>iecodebook append</code>''' offers a quick option to [[Data Documentation|document]] and resolve these differences across multiple datasets. The following steps list how the '''<code>iecodebook append</code>''' works.
# '''Create template:''' Just as in the case of '''<code>iecodebook apply</code>''', use the '''template''' subcommand to first convert the dataset into a '''codebook template''' in Excel using '''<code>iecodebook template</code>'''. In this template, each column describes different aspects of a variable, including '''''name''''', '''''label''''', '''''type''''', and so on. The only difference is that in this case, '''<code>iecodebook</code>''' also creates a new variable called '''survey''' by default. The value label for this variable will contain the name of the surveys that you specify under the '''surveys ( )''' option for '''<code>iecodebook template</code>'''.
# '''Create template:''' Just as in the case of '''<code>iecodebook apply</code>''', use '''<code>iecodebook template</code>''' to first convert the dataset into a template in Excel. In this template, each column describes different aspects of a variable, including '''''name''''', '''''label''''', '''''type''''', and so on. The only difference is that in this case, '''<code>iecodebook</code>''' also creates a new variable called '''survey''' by default. The value label for this variable will contain the name of the surveys that you specify under the '''surveys ( )''' option for '''<code>iecodebook template</code>'''.
# '''Complete template:''' After this, you can simply fill out the '''codebook template''' in Excel. This template lists all the rules that you want to use to '''append''' the datasets, and resolve differences across the two datasets.  
# '''Complete template:''' After this, you can simply fill out the template in Excel. In this template, you can specify the rules to '''append''' the datasets, and resolve differences across the two datasets. For instance, you can place certain variables in the same row, and '''<code>iecodebook append</code>''' will understand this to mean that we want those variables to have the same values under the '''''name''''', '''''label''' and '''''choices''''' columns so that they can '''append''' properly.
# '''Append datasets:''' The '''<code>iecodebook append</code>''' subcommand then reads the rules that you specify in the '''codebook template''', and uses them to finally '''append''' the datasets. The resulting output is a '''harmonized''' dataset with all the differences across the two datasets now resolved.
# '''Append datasets:''' The '''<code>iecodebook append</code>''' subcommand then reads the rules that you specify in the '''codebook template''', and uses them to finally '''append''' the datasets. The resulting output is a '''harmonized''' dataset with all the differences across the two datasets now resolved.
=== Syntax ===
=== Syntax ===
The general syntax of '''<code>iecodebook template</code>''' in this case is as follows:
The general syntax of '''<code>iecodebook template</code>''' in this case is as follows:
  iecodebook template
  iecodebook template "''filename.dta''" "''filename.dta''" [...]
    "/path/to/survey1.dta" "/path/to/survey2.dta" [...]
     using "''filename.xlsx''"
     using "/path/to/codebook.xlsx"
     , surveys(''Survey1Name'' ''Survey2Name'' [...])
     , surveys(Survey1Name Survey2Name ...)
     [generate(varname)] [match]  
     [match] [generate(varname)]


As in iecodebook apply, the correct executing command is formed by replacing template with append. The general syntax of the iecodebook append command is therefore:
The following is the syntax to run '''<code>iecodebook append</code>''' subcommand based on the rules you specify in the '''codebook''':
  iecodebook append "''filename.dta''" "''filename.dta''" [...]
    using "''filename.xlsx''"
    , surveys(''Survey1Name'' ''Survey2Name'' [...])
    [generate(varname)] [keepall] [report] [replace]
    [missingvalues(''#'' "''label''" [''#'' "''label''" ...])]


  iecodebook append
The following points explain the options that are used with the '''<code>iecodebook template</code>''' and '''<code>iecodebook append</code>''' subcommands:
    "/path/to/survey1.dta" "/path/to/survey2.dta" ...
* '''match:''' The '''match''' option automatically aligns variables from other datasets in the same row if they share a name with a variable in the first dataset. It is optional for the '''template''' subcommand only.
    using "/path/to/codebook.xlsx"
    , surveys(Survey1Name Survey2Name ...)
      [keepall] [generate(varname)]
      [missingvalues(# "label" [# "label" ...])]


The surveys() option is required in both steps, and must match between them. As a list of single words, the users should specifiy the names of the surveys (which the command will look for in the codebook headers). The command will also create a survey variable in the resulting dataset, labelled with these names -- -- to change the name of that variable, use the generate() option in both commands. To demonstrate the usage, we will create two datasets that have similar data but with different structures, then combine them using a codebook. Run the following:
* '''surveys( ):''' The '''surveys( )''' option must be used with both subcommands, and the names you specify under this option must be the same for both subcommands. Specify the names of the surveys as a list of single words. '''<code>iecodebook</code>''' will look for these names in the '''codebook''' headers. The command will also create a '''survey''' variable in the resulting dataset by default. The value label for this variable will contain the name of the surveys that you specify under the '''surveys ( )''' option.
 
* '''generate( ):''' To change the name of the '''survey''' variable, use the '''generate( )''' option in both subcommands.
 
* '''report:''' The '''report''' option exports a '''codebook''' with the results of the resulting dataset for quick reference.
 
* '''replace:''' The replace option allows you to overwrite the existing file which contains the '''codebook'''.
=== Implementation ===
To demonstrate the usage, we will create two datasets that have similar data but with different structures, then combine them using a codebook. Run the following:


     // Create demonstration datasets
     // Create demonstration datasets

Revision as of 22:16, 11 May 2020

iecompdup is the final command in the Stata package created by DIME Analytics, iefieldkit. After data collection is complete, iecodebook allows the research team to automatically perform the repetitive steps involved in cleaning data before further analysis. As the name suggests, the iecodebook command is structured around Excel-based codebooks, which allow researchers to perform and document data cleaning tasks in Excel itself, instead of using do-files.

Read First

  • Stata coding practices.
  • iefieldkit.
  • The sub-commands of iecodebook allow the research team to rapidly clean, harmonize, and document datasets using codebooks.
  • Codebooks allow researchers to document the cleaned data in a format that is both human and machine-readable.
  • To install iecodebook, type ssc install iecodebook in Stata.
  • To install all the commands in the iefieldkit package, type ssc install iefieldkit in Stata.
  • For instructions and available options, type help iecodebook.

Overview

As its name suggests, the iecodebook command creates Excel-based codebooks. The research team can fill these codebooks with data cleaning instructions for Stata. In this way, iecodebook creates a metadata record which is easier to write than a long sequence of data cleaning commands in a do-file. These codebooks in Excel are also easier to read and understand, even if someone does not have knowledge of Stata. There are four subcommands in iecodebook to support its functions:

  • iecodebook apply: Reads an Excel codebook where the user renames, recodes, and/or labels a large number of variables, and applies these changes to the current dataset.
  • iecodebook append: Allows two or more datasets to have the same variable names, labels, and value labels. That is, it harmonizes two or more datasets, and appends them.
  • iecodebook export: Creates an Excel codebook that describes the current dataset. It can also produce an exportable version of the dataset which only contains the variables used in a particular do-file.
  • iecodebook template: Creates an Excel template that describes the current or targeted dataset(s), and prepares the codebook for the other iecodebook subcommands.

Apply

The most common data cleaning tasks include renaming variables, applying variable and value labels, and recoding values. The iecodebook apply subcommand allows the research team to perform all of these tasks without writing separate lines of code for each task in Stata. The following steps list how the iecodebook apply works.

  1. Create template: iecodebook first converts the dataset into a template in Excel using iecodebook template. In this template, each column describes different aspects of a single variable, including name, label, type, and so on.
  2. Complete template: After this, you can simply fill out the template, which creates the codebook. The codebook lists all the data cleaning tasks that you wish to perform on the dataset.
  3. Apply changes: The iecodebook apply subcommand then reads these commands, and executes them all with just one line of Stata code. The resulting output is a cleaned dataset, along with an easy-to-read record of the cleaning commands you applied.

Syntax

The following line of code creates an apply template with the relevant dataset. The template is named filename.xlsx in this case.

iecodebook template using "filename.xlsx"

The following line of code applies the changes to the dataset. It saves the codebook with the same name, that is, filename.xlsx.

iecodebook apply using "filename.xlsx"
   , [drop] [missingvalues(# "label" [# "label" ...])]

Implementation

The following steps use an example to explain how iecodebook apply works in practice.

Step 1: Load the dataset.

First, load the dataset which you wish to clean. In this case, the dataset is named "auto.dta".

sysuse auto.dta , clear 

Step 2: Create template.

Next, run the following code to create the template codebook, which is named "cleaning.xlsx" in this case.

iecodebook template using "cleaning.xlsx"

This produces the template codebook in Figure 1, which shows the current state of the data.

File:Apply-template.png
Figure 1: Data cleaning codebook template

Step 3: Complete template

Next, fill-up the following columns in the template to specify the relevant cleaning tasks:

  • name: Fill the name column in the template to specify what the rename command will do to the variables in the dataset. You can use this to rename a variable. For example, in Figure 3, we rename the foreign variable to domestic depending on which of these names is assigned "0" and "1" in the choices sheet.
  • label: Fill the label columns in the template to specify what the label command will do to the variables in the dataset. You can use this to give more information about a variable.
  • choices: Enter a label name in the choices column to apply a particular value label for a variable. Also create the corresponding value label in the choices sheet. Every template already includes a demo yesno label as a guide. For example, in Figure 3, we have applied the yesno value label to the variable with the name domestic, to indicate if a car is domestically made or not. We have also applied the origin value label to the variable with the name foreign. We will also create the origin value label in the choices sheet, for instance by assigning values of "0" to domestic, and "1" to foreign.
  • recode:current: Use the usual syntax (rule) [(rule) ...] in the recode:current column to recode data values. For example, in Figure 3, (0=1)(1=0) indicates that the label for the value of "0" has now been assigned to "1" , and the label for the value of "1" has now been assigned to "0". Using the example above, this means that now foreign has a value of "0", and domestic has a value of "1".

Note: The data types are given for reference only; you cannot use iecodebook to change them. Figure 3 shows how you can make the above changes to the foreign variable.

File:Apply-complete.png
Figure 2: iecodebook apply template filled out with changes to be applied

Step 4: Apply cleaning commands

Finally, apply the changes using the following command, and save the codebook with the same name as before - "cleaning.xlsx".

iecodebook apply using "cleaning.xlsx"

Note: Keep the following points in mind when using iecodebook apply.

  • Default: By default, all variables where you do not make changes will be the same as before.
  • Dropping variables: You can also use iecodebook apply to drop variables from the dataset, using the drop option, or using single periods (" . ").
    • drop: You can simply use the drop option to drop variables from the dataset that have no final variable name under the name column.
    • Single period (" . "): Alternatively, you can place a single period (" . ") under the name column to drop variables one by one.
  • missingvalues( ): The missingvalues( ) option allows you to use global missing-value codes for all value labels. For example, this is often " . " or "-99".
  • Value labels: You will have to manually recreate all value label lists in the choices sheet. However, you can copy-paste data labels from your original dataset to the choices_current sheet.

Append and Harmonize

A common task in data collection is combining two or more sequential rounds of surveys, or combining similar survey instruments that were used in different contexts. This is often a tricky process, and at least one of the datasets might not be correctly updated. In such a case, simply using the append command in Stata will not provide you with the desired structure for your dataset. In such cases, it is possible that certain aspects may not be synchronized or harmonized across the two datasets, such as:

  • Variable names.
  • Variable labels (including translations).
  • Value labels.
  • Data types.

In these cases, iecodebook append offers a quick option to document and resolve these differences across multiple datasets. The following steps list how the iecodebook append works.

  1. Create template: Just as in the case of iecodebook apply, use iecodebook template to first convert the dataset into a template in Excel. In this template, each column describes different aspects of a variable, including name, label, type, and so on. The only difference is that in this case, iecodebook also creates a new variable called survey by default. The value label for this variable will contain the name of the surveys that you specify under the surveys ( ) option for iecodebook template.
  2. Complete template: After this, you can simply fill out the template in Excel. In this template, you can specify the rules to append the datasets, and resolve differences across the two datasets. For instance, you can place certain variables in the same row, and iecodebook append will understand this to mean that we want those variables to have the same values under the name, label and choices columns so that they can append properly.
  3. Append datasets: The iecodebook append subcommand then reads the rules that you specify in the codebook template, and uses them to finally append the datasets. The resulting output is a harmonized dataset with all the differences across the two datasets now resolved.

Syntax

The general syntax of iecodebook template in this case is as follows:

iecodebook template "filename.dta" "filename.dta" [...]
   using "filename.xlsx"
   , surveys(Survey1Name Survey2Name [...])
   [generate(varname)] [match] 

The following is the syntax to run iecodebook append subcommand based on the rules you specify in the codebook:

 iecodebook append "filename.dta" "filename.dta" [...]
   using "filename.xlsx"
   , surveys(Survey1Name Survey2Name [...])
   [generate(varname)] [keepall] [report] [replace]
   [missingvalues(# "label" [# "label" ...])]

The following points explain the options that are used with the iecodebook template and iecodebook append subcommands:

  • match: The match option automatically aligns variables from other datasets in the same row if they share a name with a variable in the first dataset. It is optional for the template subcommand only.
  • surveys( ): The surveys( ) option must be used with both subcommands, and the names you specify under this option must be the same for both subcommands. Specify the names of the surveys as a list of single words. iecodebook will look for these names in the codebook headers. The command will also create a survey variable in the resulting dataset by default. The value label for this variable will contain the name of the surveys that you specify under the surveys ( ) option.
  • generate( ): To change the name of the survey variable, use the generate( ) option in both subcommands.
  • report: The report option exports a codebook with the results of the resulting dataset for quick reference.
  • replace: The replace option allows you to overwrite the existing file which contains the codebook.

Implementation

To demonstrate the usage, we will create two datasets that have similar data but with different structures, then combine them using a codebook. Run the following:

   // Create demonstration datasets
   sysuse auto.dta , clear
   save data1.dta , replace
   rename (price mpg)(cost car_mpg)
       recode foreign (0=1 "Domestic")(1=0 "Foreign") , gen(origin)
       drop foreign
   save data2.dta , replace

Harmonize

   // Create harmonization codebook template
   iecodebook template         ///
     "data1.dta" "data2.dta"   ///
     using "codebook.xlsx"     ///
     , surveys(First Second)

This should produce the following harmonization codebook template:

File:Append-template.png

To resolve the differences, the completed codebook would be modified to look as follows. Note the key functionality of harmonization -- variables from different datasets are placed by the user into the same row, and iecodebook append understands this to mean that they should have the same final instructions applied to them so that they append properly (except, of course, recode; which is why there is one recode: column for each survey as well as choices_ sheets for reference). Specifying the match option does this as best as possible by automatically aligning variables that have the same name in the template.

File:Append-complete.png

There are two important differences from the apply syntax. First, the drop option is the default: that is, if there is no name harmonization specified, that is, if there is no value in the first four columns (name, label, type, choices), variables are dropped. (The keepall option may be specified to override this behavior, but the user should check the results carefully.) Again, note that you will have to manually recreate the value label lists in the choices sheet, but that the data labels from your original datasets are available for copy-paste from respective choices_ sheets.

File:Append-choices.png

To execute the command, run:

   // Harmonize and append the datasets
   iecodebook append           ///
     "data1.dta" "data2.dta"   ///
     using "codebook.xlsx"     ///
     , surveys(First Second)

The combined dataset will yield the following crosstabs, and a codebook titled codebook\_appended.xlsx will be created in the same location as the append codebook documenting the final state of the dataset for quick reference.

   . ta survey foreign
         Data |        Foreign
       Source |  Domestic    Foreign |     Total
   -----------+----------------------+----------
        First |        52         22 |        74
       Second |        52         22 |        74
   -----------+----------------------+----------
        Total |       104         44 |       148

Export

The iecodebook export command provides a simple utility for documenting the current state of a dataset, and for preparing a trimmed "release" version of a dataset. The syntax is:

 iecodebook export [if] [in] using "/path/to/codebook.xlsx"
   , [replace] [trim("/path/to/dofile1.do" ["/path/to/dofile2.do"] ...)]

The base command will simply produce a record of the dataset's contents at the specified location. If the trim() option is specified, iecodebook export will read the contents of the specified dofiles; drop any variables that do not match the contents; restrict the dataset according to if and in as specified; and save the results in the same location as the codebook as a .dta file with the same name. (Note that this is a new functionality and is imperfectly implemented: trim() will not, for example, correctly parse macros. Therefore, please check that your results run and reproduce correctly after using this option.)

Additional Resources