
Primary data collection and cleaning involve highly repetitive but extremely important processes that contribute to high-quality, reproducible research. DIME Analytics has developed iefieldkit, a Stata package, to standardize and simplify best practices in primary data collection. iefieldkit consists of commands that automate: error-checking for electronic Open Data Kit (ODK)-based survey modules; duplicate checking and resolution; data cleaning and survey harmonization; and codebook creation.

Read First


One of the most important developments in economics over the past two decades has been the rise of empirical research, through primary as well as secondary data collection. The authors of iefieldkit developed the package to directly support researchers collecting data in a wide range of fields, such as agriculture, health, energy and environment, transport, financial and private sector development, gender, governance, and fragility, conflict and violence (FCV). iefieldkit therefore supports general best practices in primary data collection from start to finish: ietestform before data collection, ieduplicates and iecompdup during data collection, and iecodebook after data collection.

The four commands in this package make inputs and outputs significantly more human-readable by working with spreadsheets instead of Stata do-files. This allows field personnel who do not specialize in code tools to understand and review the tasks involved in primary data collection. iefieldkit thus recognizes the vital role that field personnel play in data management and data cleaning, even if they are not proficient in Stata.

Before Data Collection

In Open Data Kit (ODK)-based electronic survey kits, including SurveyCTO, survey forms (or questionnaires) are typically built in Excel using a specialized structured syntax. Before field data collection starts, the research team can use ietestform to test ODK-based survey forms for common errors, and to check SurveyCTO-based forms against best practices. The SurveyCTO server already has a built-in feature that tests the ODK syntax of a form when the research team uploads it. ietestform complements these built-in tests to help ensure that the collected data is of high quality and in a format that is easily readable in Stata.
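As an illustration, a form test might be run from Stata as sketched below. This assumes iefieldkit is installed from SSC and that the form definition is saved as an Excel file. The file paths are hypothetical placeholders, and the syntax follows the command's help file as we understand it, so confirm it with `help ietestform` before relying on it.

```stata
* Install the package once (requires internet access to SSC)
ssc install iefieldkit, replace

* Hypothetical paths: replace with your own form definition and report file.
* The command reads the Excel form definition and writes its findings
* to the CSV report named in reportsave().
ietestform using "questionnaire/baseline_form.xlsx", ///
    reportsave("questionnaire/baseline_form_report.csv")
```

The report is a spreadsheet-readable file, so team members who do not work in Stata can review the flagged issues directly.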

During Data Collection

While data collection is ongoing, ieduplicates and iecompdup allow researchers to test for and resolve duplicate entries in the dataset. Together, the commands cover four key tasks in dealing with duplicate ID values:

  • Identifying duplicate entries.
  • Comparing observations with the same ID value.
  • Tracking and documenting any changes made to the identifying variable.
  • Applying the necessary corrections to the data.

Together, these commands help ensure that the collected data is a correct record of the sample and can be merged with the master database. Both commands were previously part of the ietoolkit package but have since moved to iefieldkit.
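The four tasks listed above could look roughly like this in a do-file. This is a sketch under stated assumptions, not a definitive workflow. The variable names (hhid, key), the file paths, and the ID value 55424 are all hypothetical. The `using` syntax and the `force` option reflect recent versions of the package as we understand them, so check `help ieduplicates` and `help iecompdup` for the installed version.

```stata
* Hypothetical names: hhid is the ID that should be unique; key is a
* variable that is unique for every submission (as in SurveyCTO exports)
use "data/raw_survey.dta", clear

* Identify duplicates in hhid, list them in an Excel report, and apply
* any corrections already entered in that report. With force, observations
* whose duplicates are still unresolved are dropped from the data in memory.
ieduplicates hhid using "documentation/duplicates_report.xlsx", ///
    uniquevars(key) force

* Compare two observations that share one ID value (55424 is made up).
* Reload the raw data first so that the duplicates are still present.
use "data/raw_survey.dta", clear
iecompdup hhid, id(55424)
```

Because the corrections are recorded in the Excel report rather than in code, the report also documents every change made to the identifying variable.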

After Data Collection

After data collection, the iecodebook commands provide a workflow for rapidly cleaning, harmonizing, and documenting datasets. iecodebook takes its input from an Excel sheet, which gives a better-structured and easier-to-follow overview, especially for non-technical users, than the same operations written directly in a do-file.
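A minimal sketch of that spreadsheet-driven workflow follows. It assumes the template, apply, and export subcommands as described in the package help file, and all file paths are hypothetical. The steps are meant to be run in stages rather than as one uninterrupted script, because the Excel file has to be filled in between creating the template and applying it.

```stata
use "data/raw_survey.dta", clear

* Step 1: create an Excel template listing the variables in memory
iecodebook template using "documentation/cleaning_codebook.xlsx"

* Step 2 (outside Stata): fill in new names, labels, and recodes
* in the Excel template

* Step 3: apply the completed codebook to the data in memory
iecodebook apply using "documentation/cleaning_codebook.xlsx"

* Step 4: export a codebook that documents the cleaned dataset
iecodebook export using "documentation/final_codebook.xlsx"
```

Keeping the cleaning instructions in the spreadsheet means a reviewer can check renames and recodes without reading any Stata code.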

Additional Resources

  • Visit the iefieldkit GitHub page here