Difference between revisions of "Iecodebook"

Jump to: navigation, search
Line 1: Line 1:
== iecodebook==
== iecodebook==


Once data collection is complete, it must be cleaned before it can be analyzed. The iecodebook commands are designed to automate repetitive data cleaning tasks in two typical situations: iecodebook apply, where a large number of variables need to have arbitrary rename, recode, or label commands applied to them; and iecodebook append, when two or more datasets need to be harmonized to have the same variable names, labels, and value labels ("choices") in order to be appended together. iecodebook also provides an export subcommand so that a human-readable record of the variables and their labels in a dataset can be instantly created at any time; and a template subcommand which prepares the codebooks for the other syntaxes.
Once data collection is complete, it must be cleaned before it can be analyzed. The iecodebook commands are designed to automate repetitive data cleaning tasks in two typical situations: iecodebook apply, where a large number of variables need to have arbitrary rename, recode, or label commands applied to them; and iecodebook append, when two or more datasets need to be harmonized to have the same variable names, labels, and value labels ("choices") in order to be appended together. iecodebook also provides an export subcommand so that a human-readable record of the variables and their labels in a dataset can be instantly created at any time; and a template subcommand which prepares the codebooks for the other syntax.


As its name suggests, the iecodebook command is structured around Excel-based "codebooks". The purpose of these codebooks is to document data cleaning after data are exported from the primary data collection software in a format that is both human and machine-readable. By completing these codebooks with data cleaning instructions for Stata, a metadata record is created which is easier to write than a long sequence of data cleaning commands in a dofile; and easier to read later. This functionality is implemented via four subcommands:
As its name suggests, the iecodebook command is structured around Excel-based "codebooks". The purpose of these codebooks is to document data cleaning after data are exported from the primary data collection software in a format that is both human and machine-readable. By completing these codebooks with data cleaning instructions for Stata, a metadata record is created which is easier to write than a long sequence of data cleaning commands in a dofile; and easier to read later. This functionality is implemented via four subcommands:




* iecodebook template, which  creates an Excel template that describes the current or targeted dataset(s), with empty columns for you to specify the changes or harmonizations for the other iecodebook commands.
* iecodebook template, which  creates an Excel template that describes the current or targeted dataset(s), with empty columns for you to specify the changes or harmonization for the other iecodebook commands.
* iecodebook apply, which reads an Excel codebook that specifies renames, recodes, variable labels, and value labels, and applies them to the current dataset.
* iecodebook apply, which reads an Excel codebook that specifies renames, recodes, variable labels, and value labels, and applies them to the current dataset.
* iecodebook append, which  reads an Excel codebook that specifies how variables should be harmonized across two or more datasets - rename, recode, variable labels, and value labels - applies the harmonization, and appends the datasets.
* iecodebook append, which  reads an Excel codebook that specifies how variables should be harmonized across two or more datasets - rename, recode, variable labels, and value labels - applies the harmonization, and appends the datasets.
Line 50: Line 50:
===Append and harmonize multiple datasets===
===Append and harmonize multiple datasets===


A common downstream task in data collection is to combine two or more sequential rounds of surveys; or, similarly, to combine similar survey instruments conducted in different settings. This is always harder than it first sounds. Inevitably, updates and/or localizations have been made to at least one of the datsets, such that a simple append command will not produce the desired data structure. Most often, these changes cause desynchronisation of:
A common downstream task in data collection is to combine two or more sequential rounds of surveys; or, similarly, to combine similar survey instruments conducted in different settings. This is always harder than it first sounds. Inevitably, updates and/or localization have been made to at least one of the datsets, such that a simple append command will not produce the desired data structure. Most often, these changes cause desynchronisation of:


* Variable names
* Variable names

Revision as of 23:23, 29 January 2019

iecodebook

Once data collection is complete, it must be cleaned before it can be analyzed. The iecodebook commands are designed to automate repetitive data cleaning tasks in two typical situations: iecodebook apply, where a large number of variables need to have arbitrary rename, recode, or label commands applied to them; and iecodebook append, when two or more datasets need to be harmonized to have the same variable names, labels, and value labels ("choices") in order to be appended together. iecodebook also provides an export subcommand so that a human-readable record of the variables and their labels in a dataset can be instantly created at any time; and a template subcommand which prepares the codebooks for the other syntax.

As its name suggests, the iecodebook command is structured around Excel-based "codebooks". The purpose of these codebooks is to document data cleaning after data are exported from the primary data collection software in a format that is both human and machine-readable. By completing these codebooks with data cleaning instructions for Stata, a metadata record is created which is easier to write than a long sequence of data cleaning commands in a dofile; and easier to read later. This functionality is implemented via four subcommands:


  • iecodebook template, which creates an Excel template that describes the current or targeted dataset(s), with empty columns for you to specify the changes or harmonization for the other iecodebook commands.
  • iecodebook apply, which reads an Excel codebook that specifies renames, recodes, variable labels, and value labels, and applies them to the current dataset.
  • iecodebook append, which reads an Excel codebook that specifies how variables should be harmonized across two or more datasets - rename, recode, variable labels, and value labels - applies the harmonization, and appends the datasets.
  • iecodebook export, which creates an Excel codebook that describes the current dataset, and optionally produces an export version of the dataset with only variables used in specified dofiles.


Apply cleaning commands to the open dataset

The most common data cleaning tasks are renaming variables, applying variable and value labels, and recoding values. The iecodebook apply subcommand provides a workflow to execute an arbitrary number of these commands without writing out an arbitrary number of lines of Stata code. Instead, the dataset is first translated into a codebook with each line describing the contents of a single variable. Then, the user fills out a template specifying the cleaning commands they wish to execute. The iecodebook apply subcommand reads these commands and executes them all with a single line of Stata code. The resulting output is a cleaned dataset and a highly-readable record of the cleaning commands applied to it.

First, create an apply template with the dataset open:

   iecodebook template using "/path/to/codebook.xlsx"

Next, fill out the template with the specific instructions desired, save it, then apply it to the open data by writing:

   iecodebook apply using "/path/to/codebook.xlsx"
   , [drop] [missingvalues(# "label" [# "label" ...])]

For example, running:

   // Load data
   sysuse auto.dta , clear
   // Create cleaning template
   iecodebook template using "codebook.xlsx"

Produces the following template codebook reflecting the current state of the data:

File:Apply-template.png

To apply changes to the data, complete the "name" and "label" columns to prepare rename and label variable commands for the current dataset, respectively. To apply value labels, enter a label name in the "choices" column and create the corresponding value label in the choices sheet (every template includes a demo yesno label as a guide). To recode data values, use the usual syntax (rule) [(rule) ...] in the "recode:current" column. You cannot change data types with this command; these are provided for reference only. For example, you might write the following to make some adjustments to the foreign variable:

File:Apply-complete.png

To apply the changes, you would then run the following command:

   // Apply cleaning commands to open dataset
   iecodebook apply using "codebook.xlsx"

Note that the correct command is created by replacing template with apply. By default, all variables with no adjustments will be left as-is. However, this is not required: the drop option orders all variables that have no final variable name in the name column to be dropped from the dataset. The missingvalues() option allows global missing-value codes to be propagated to all value labels. Note also that you will have to manually recreate all value label lists in the choices sheet, but that the data labels from your original dataset is available for copy-paste from the choices_current sheet.

Append and harmonize multiple datasets

A common downstream task in data collection is to combine two or more sequential rounds of surveys; or, similarly, to combine similar survey instruments conducted in different settings. This is always harder than it first sounds. Inevitably, updates and/or localization have been made to at least one of the datsets, such that a simple append command will not produce the desired data structure. Most often, these changes cause desynchronisation of:

  • Variable names
  • Variable labels (including translation)
  • Value labels
  • Categorical options
  • Data types

The iecodebook append subcommand offers a rapid workflow for documenting and resolving these differences across multiple datasets. The general syntax of the templating command is:

 iecodebook template
   "/path/to/survey1.dta" "/path/to/survey2.dta" [...]
   using "/path/to/codebook.xlsx"
   , surveys(Survey1Name Survey2Name ...)

As in iecodebook apply, the correct executing command is formed by replacing template with append. The general syntax of the iecodebook append command is therefore:

 iecodebook append
   "/path/to/survey1.dta" "/path/to/survey2.dta" ...
   using "/path/to/codebook.xlsx"
   , surveys(Survey1Name Survey2Name ...) [nodrop]
     [missingvalues(# "label" [# "label" ...])]

The surveys() option is required in both steps, and must match between them. As a list of single words, the users should specifiy the names of the surveys (which the command will look for in the codebook headers). The command will also create a survey variable in the resulting dataset, labelled with these names -- so your current dataset cannot have a variabel called survey. To demonstrate the usage, we will create two datasets that have similar data but with different structures, then combine them using a codebook. Run the following:

   // Create demonstration datasets
   sysuse auto.dta , clear
   save data1.dta , replace
   rename (price mpg)(cost car_mpg)
       recode foreign (0=1 "Domestic")(1=0 "Foreign") , gen(origin)
       drop foreign
   save data2.dta , replace
   // Create harmonization codebook template
   iecodebook template         ///
     "data1.dta" "data2.dta"   ///
     using "codebook.xlsx"     ///
     , surveys(First Second)

This should produce the following harmonization codebook template:

File:Append-template.png

To resolve the differences, the completed codebook would be modified to look as follows. Note the key functionality of harmonization -- variables from different datasets are placed by the user into the same row, and iecodebook append understands this to mean that they should have the same final instructions applied to them so that they append properly (except, of course, recode; which is why there is one recode: column for each survey as well as choices_ sheets for reference).

File:Append-complete.png

There are two important differences from the apply syntax. First, the drop option is the default: that is, if there is no name harmonization specified, variables are dropped. (The nodrop option may be specified to override this behavior, but the user should check the results carefully.) Again, note that you will have to manually recreate the value label lists in the choices sheet, but that the data labels from your original datasets are available for copy-paste from respective choices_ sheets.

File:Append-choices.png

To execute the command, run:

   // Harmonize and append the datasets
   iecodebook append           ///
     "data1.dta" "data2.dta"   ///
     using "codebook.xlsx"     ///
     , surveys(First Second)

The combined dataset will yield the following crosstabs, and a codebook titled codebook\_appended.xlsx will be created in the same location as the append codebook documenting the final state of the dataset for quick reference.

   . ta survey foreign
         Data |        Foreign
       Source |  Domestic    Foreign |     Total
   -----------+----------------------+----------
        First |        52         22 |        74
       Second |        52         22 |        74
   -----------+----------------------+----------
        Total |       104         44 |       148

Export a codebook for an existing dataset

The iecodebook export command provides a simple utility for documenting the current state of a dataset, and for preparing a trimmed "release" version of a dataset. The syntax is:

 iecodebook export [if] [in] using "/path/to/codebook.xlsx"
   , [trim("/path/to/dofile1.do" ["/path/to/dofile2.do"] ...)]

The base command will simply produce a record of the dataset's contents at the specified location. If the trim() option is specified, iecodebook export will read the contents of the specified dofiles; drop any variables that do not match the contents; restrict the dataset according to if and in as specified; and save the results in the same location as the codebook as a .dta file with the same name. (Note that this is a new functionality and is imperfectly implemented: trim() will not, for example, correctly parse variable lists or wildcards; and it will match variables which are sub-strings of other variables. Therefore, please check that your results run and reproduce correctly after using this option.)