Iefieldkit
Primary data collection and cleaning involve highly repetitive but extremely important processes that contribute to high-quality, reproducible research. DIME Analytics has developed iefieldkit as a Stata package to standardize and simplify best practices in primary data collection. iefieldkit consists of commands that automate: error-checking for electronic Open Data Kit (ODK)-based survey modules; duplicate checking and resolution; data cleaning and survey harmonization; and codebook creation.
Read First
- DIME Analytics Bootcamp on Reproducible Research.
- Stata coding practices.
- iefieldkit currently consists of four commands: ietestform, ieduplicates, iecompdup, and iecodebook.
- Each of these commands can be used independently in a wide range of contexts.
- The iefieldkit open-source code is available on GitHub for public contribution and comments.
- To install the package, type ssc install iefieldkit in the Stata command box.
Objective
One of the most important developments in economics over the past two decades has been the rise of empirical research, through primary as well as secondary data collection. The authors of iefieldkit have developed the package to support data collection by researchers directly in a wide range of fields like agriculture, health, energy and environment, transport, financial and private sector development, gender, governance, and fragility, conflict and violence (FCV). iefieldkit therefore supports general best practices in primary data collection from start to finish:
- Before data collection: ietestform
- During data collection: ieduplicates and iecompdup
- After data collection: iecodebook
The four commands in this package make inputs and outputs significantly more human-readable by working with spreadsheets instead of Stata do-files. In doing so, they allow field personnel who do not specialize in code tools to understand and review the tasks involved in primary data collection. iefieldkit thus recognizes the vital role played by field personnel in supporting data management and data cleaning even if they are not proficient in Stata.
Before Data Collection
In Open Data Kit (ODK)-based electronic survey kits, including SurveyCTO, survey forms (or questionnaires) are typically built in Excel using a specialized structured syntax. Before the research team starts field data collection, they can use ietestform to test ODK-based electronic survey forms for common errors, and to check SurveyCTO-based forms against best practices.
Most ODK servers, including SurveyCTO's servers, have a built-in test feature that tests the ODK syntax of a form when it is uploaded by the research team. ietestform complements these built-in tests to ensure that the collected data is in a format that is easily readable in Stata, and warns users about practices that we have learned are prone to data quality errors.
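As a rough illustration of this step, the sketch below runs ietestform on a form definition and saves the resulting report as a CSV file. Both file paths are hypothetical placeholders, and the syntax shown is an assumption based on the command's help file; confirm the available options with help ietestform for the installed version.

```stata
* Minimal sketch: test an ODK/SurveyCTO form definition before fieldwork.
* Both file paths are hypothetical; replace them with your own.
ietestform using "questionnaire/household_survey.xlsx", ///
    reportsave("questionnaire/household_survey_report.csv")
```

Because the report is a spreadsheet-readable file, team members who do not work in Stata can review it as well.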
During Data Collection
While data collection is ongoing, ieduplicates and iecompdup allow researchers to test for and resolve duplicate entries in the dataset. The commands combine four key tasks to deal with duplicate ID values:
- Identifying duplicate entries.
- Comparing observations with the same ID value.
- Tracking and documenting any changes made to the identifying variable.
- Applying the necessary corrections to the data.
Together these commands ensure that the collected data will be a correct record of the sample, and can be merged with the master database. Both these commands were previously part of the ietoolkit package, but have now been moved to iefieldkit.
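The workflow above can be sketched as follows, using Stata's built-in auto dataset so that the example does not depend on any project files. The variable id, the deliberately introduced duplicate, and the report file name are all assumptions made for illustration; the options follow the help files as understood here, so check help ieduplicates and help iecompdup before relying on them.

```stata
* Minimal sketch using Stata's built-in auto dataset.
sysuse auto, clear

* Hypothetical ID variable, with one duplicate introduced on purpose
generate id = _n
replace id = 1 in 2

* Compare the two observations that share id == 1 and list
* the variables on which they differ
iecompdup id, id(1)

* Write the duplicates report to Excel (hypothetical file name).
* uniquevars() must uniquely identify every observation; make does so here.
* force drops observations with unresolved duplicate IDs, so that the
* data left in memory is uniquely identified by id.
ieduplicates id using "duplicates_report.xlsx", uniquevars(make) force
```

In a real project, the decisions for each duplicate (keep, drop, or assign a corrected ID) are recorded in the Excel report, and running ieduplicates again applies them. This keeps a documented trail of every change made to the identifying variable.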
After Data Collection
After data collection is complete, iecodebook allows the research team to automate the repetitive tasks involved in cleaning data before it can be analyzed. As the name suggests, the iecodebook command is structured around Excel-based codebooks, which allow researchers to perform and document data cleaning tasks in Excel itself, instead of in do-files. Codebooks therefore allow researchers to document the cleaned data in a format that is both human-readable and machine-readable. iecodebook implements this through four subcommands:
- iecodebook apply applies rename, recode, and/or label commands to a large number of variables in the dataset.
- iecodebook append harmonizes two or more datasets, and appends them. That is, it allows two or more datasets to have the same variable names, labels, and value labels.
- iecodebook export creates an Excel codebook that describes the current dataset. It can also produce an exportable version of the dataset which only contains the variables used in a particular do-file.
- iecodebook template creates an Excel template that describes the current or targeted dataset(s), and prepares the codebook for the other subcommands in iecodebook.
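A minimal sketch of how the subcommands might fit together is shown below, again using the built-in auto dataset. All file names and survey names are hypothetical, and the options shown (replace, clear, surveys()) are assumptions based on the help file; consult help iecodebook for the authoritative syntax.

```stata
* Minimal sketch using Stata's built-in auto dataset.
sysuse auto, clear

* 1. Create an Excel template that describes the current dataset
iecodebook template using "auto_codebook.xlsx", replace

* 2. Fill in new names, labels, and recodes in the Excel file by hand,
*    then apply them to the data in memory
iecodebook apply using "auto_codebook.xlsx"

* 3. Export a codebook documenting the cleaned dataset
iecodebook export using "auto_codebook_final.xlsx", replace

* Harmonizing and appending two survey rounds follows the same pattern.
* The files and survey names below are hypothetical, and the harmonization
* codebook must be prepared first (for example with iecodebook template):
* iecodebook append "baseline.dta" "endline.dta" ///
*     using "harmonization_codebook.xlsx", clear surveys(baseline endline)
```

Keeping the cleaning instructions in the spreadsheet rather than in a do-file is what lets non-Stata users review the renames, recodes, and labels directly.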
Related Pages
This page is part of the topic Stata coding practices.
Additional Resources
- DIME Analytics (World Bank), The iefieldkit GitHub page